⇒ regexp(3x) — AIX PS/2 1.2.1


REGEXP(3x,L)                AIX Technical Reference                REGEXP(3x,L)



-------------------------------------------------------------------------------
regexp: compile, step, advance



PURPOSE

Compiles and matches regular-expression patterns.

LIBRARY

None

SYNTAX

#define INIT             declarations
#define GETC( )          getc_code
#define PEEKC( )         peekc_code
#define UNGETC(c)        ungetc_code
#define RETURN(pointer)  return_code
#define ERROR(val)       error_code

#include <regexp.h>




char *compile (instring, expbuf, endbuf, eof)
char *instring, *expbuf, *endbuf;
int eof;

int step (string, expbuf)
char *string, *expbuf;

int advance (string, expbuf)
char *string, *expbuf;



DESCRIPTION

The regexp.h header file defines several general purpose subroutines that
perform regular-expression pattern matching.  Programs that perform
regular-expression pattern matching such as ed, sed, grep, bs, and expr use
this source file.  In this way, only this file needs to be changed in order to
maintain regular expression compatibility between programs.

The NLregexp.h functions compile, step and advance operate on file code
strings.  The following macros must be defined by the programmer prior to
including NLregexp.h.

INIT
   This macro is used for dependent declarations and initializations.  It is
   placed right after the declaration and opening "{" (left brace) of the
   compile subroutine.  The definition of INIT must end with a ";" (semicolon).
   INIT is frequently used to set a register variable to point to the
   beginning of the regular expression so that this register variable can be
   used in the declarations for GETC, PEEKC, and UNGETC.  Otherwise, you can
   use INIT to declare external variables that GETC, PEEKC, and UNGETC need.

     #define INIT             register char *sp = instring; \
                              int sp_len; \
                              mbchar_t sp_peekc;

GETC( )
   This macro returns the value of the next character (as an mbchar_t) in the
   regular expression pattern.  Successive calls to the GETC macro should
   return successive characters of the pattern.

     #define GETC()           (PEEKC(),sp+=sp_len,sp_peekc)

PEEKC( )
   This macro returns the next character (as an mbchar_t) in the regular
   expression.  Successive calls to the PEEKC macro should return the same
   character, which should also be the next character returned by the GETC
   macro.  The special value ERR should be returned if there is an error in the
   character.

     #define PEEKC()  ((-1 == (sp_len = mbstomb (&sp_peekc, sp, MB_LEN_MAX))) \
                              ? (sp_peekc = ERR) \
                              : sp_peekc)

UNGETC(c)
   This macro causes the parameter c to be returned by the next call to the
   GETC and PEEKC macros.  No more than one character of pushback is ever
   needed and this character is guaranteed to be the last character read by
   the GETC macro.  The return value of the UNGETC macro is always ignored.

     #define UNGETC(c)        (sp-=sp_len)

RETURN(pointer)
   This macro is used on normal exit of the compile subroutine.  The pointer
   parameter points to the first character immediately following the compiled
   regular expression.  This is useful to programs that have memory allocation
   to manage.

     #define RETURN(p)        return

ERROR(val)
   This macro is used on abnormal exit from the compile subroutine.  It should
   never contain a return statement.  The val parameter is an error number.

     #define ERROR(c)         regerr (c)

   The error values and their meanings are:

Error
Name             Value   Meaning

BIG_RANGE        11      Range endpoint too large.

BAD_NUM          16      Bad number.

BAD_BACK         25      "\" digit out of range.

BAD_DELIM        36      Illegal or missing delimiter.

NO_SAVED         41      No remembered search string.

BAD_LEFTP        42      "\(\)" imbalance.

BAD_RIGHTP       43      Too many "\(".

EX_COMMA         44      More than two numbers given in \{ \}.

NO_CLOSE         45      "}" expected after "\".

MAX_MIN          46      First number exceeds second in \{ \}.

BAD_BRAK         49      "[ ]" imbalance.

TOO_BIG          50      Regular expression overflow.

STACK_EMPTY      51      Backtrack stack empty.

STACK_FULL       52      Backtrack stack full.

BAD_CHAR         60      Strange multibyte character.
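
A handler called from the ERROR macro usually turns the code into a message.
The following is a minimal sketch of such a lookup.  The name regerr_message
and the table layout are assumptions made for illustration and are not part
of regexp.h; the codes and message texts are copied from the table above.

```c
#include <stddef.h>

/* Hypothetical helper, not part of regexp.h: one row per error code. */
struct regerr_entry {
    int         code;
    const char *text;
};

static const struct regerr_entry regerr_table[] = {
    { 11, "Range endpoint too large." },
    { 16, "Bad number." },
    { 25, "\"\\\" digit out of range." },
    { 36, "Illegal or missing delimiter." },
    { 41, "No remembered search string." },
    { 42, "\"\\(\\)\" imbalance." },
    { 43, "Too many \"\\(\"." },
    { 44, "More than two numbers given in \\{ \\}." },
    { 45, "\"}\" expected after \"\\\"." },
    { 46, "First number exceeds second in \\{ \\}." },
    { 49, "\"[ ]\" imbalance." },
    { 50, "Regular expression overflow." },
    { 51, "Backtrack stack empty." },
    { 52, "Backtrack stack full." },
    { 60, "Strange multibyte character." }
};

/* Returns the message for a compile error code, or a generic
 * message when the code is not in the table. */
const char *regerr_message(int code)
{
    size_t i;

    for (i = 0; i < sizeof regerr_table / sizeof regerr_table[0]; i++)
        if (regerr_table[i].code == code)
            return regerr_table[i].text;
    return "Unknown regular-expression error.";
}
```

A regerr routine could print regerr_message(c) to standard error and then
exit or jump back to the caller's recovery point.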


The compile subroutine compiles the regular expression for later use.  The
instring parameter is never used explicitly by the compile subroutine, but you
can use it in your macros.  For instance, you may want to pass the string
containing the pattern as the instring parameter to compile and use the INIT
macro to set a pointer to the beginning of this string.  (The following example
uses this technique.)  If your macros do not use instring, then call compile
with a value of ((char *) 0) for this parameter.
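
The way instring reaches the macros can be tried without compile itself.
The sketch below uses the single-byte macro forms and a stand-in routine,
read_count, that reads a decimal count the way a "\{m,n\}" parser might.
The routine name and its next parameter are hypothetical; only the macro
contract (GETC consumes, PEEKC looks ahead, UNGETC pushes back one
character) comes from the text above.

```c
/* Single-byte forms of the character-access macros.  INIT declares the
 * local pointer sp that the other three macros rely on. */
#define INIT       const char *sp = instring;
#define GETC()     (*sp++)
#define PEEKC()    (*sp)
#define UNGETC(c)  (--sp)

/* Hypothetical stand-in for part of compile: reads a run of decimal
 * digits from instring through the macros and returns its value.
 * *next receives the first character that follows the digits. */
int read_count(const char *instring, int *next)
{
    INIT
    int c, n = 0;

    while ((c = GETC()) >= '0' && c <= '9')
        n = n * 10 + (c - '0');
    UNGETC(c);          /* the non-digit was read one step too far */
    *next = PEEKC();    /* the character the next GETC would return */
    return n;
}
```

Called with the string "12,34", read_count returns 12 and stores ',' in
*next, which shows that one character of pushback is enough.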

The expbuf parameter points to a character array where the compiled regular
expression is to be placed.  The endbuf parameter points to the location that
immediately follows the character array where the compiled regular expression
is to be placed.  If the compiled expression cannot fit in (endbuf-expbuf)
bytes, the call ERROR(50) is made.
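
The bounds rule can be pictured with a small sketch.  The helpers emit_byte
and fill_expbuf are hypothetical and only imitate the overflow test; the
actual compile subroutine may check the limit at different points.

```c
#define TOO_BIG 50      /* "Regular expression overflow" in the error table */

/* Hypothetical helper: stores one byte at *epp and advances it, unless
 * the buffer [expbuf, endbuf) is already full.  Returns 0 or TOO_BIG. */
int emit_byte(char **epp, char *endbuf, char byte)
{
    if (*epp >= endbuf)
        return TOO_BIG;
    *(*epp)++ = byte;
    return 0;
}

/* Hypothetical helper: copies the bytes of src into [expbuf, endbuf).
 * No terminator is stored.  Returns 0, or TOO_BIG if src needs more
 * than (endbuf - expbuf) bytes. */
int fill_expbuf(char *expbuf, char *endbuf, const char *src)
{
    char *ep = expbuf;
    int rc;

    for (; *src != '\0'; src++)
        if ((rc = emit_byte(&ep, endbuf, *src)) != 0)
            return rc;
    return 0;
}
```

With char expbuf[ESIZE], passing &expbuf[ESIZE] as endbuf names the first
byte past the array, so exactly ESIZE bytes are available.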

The eof parameter is the character that marks the end of the regular
expression.  For example, in ed this character is usually "/" (slash).

The regexp.h header file defines other subroutines that perform actual
regular-expression pattern matching.  One of these is the step subroutine.

The string parameter of step is a pointer to a null-terminated string of
characters to be checked for a match.

The expbuf parameter points to the compiled regular expression, which was
obtained by a call to the compile subroutine.

The step subroutine returns the value 1 if the given string matches the
pattern, and 0 if it does not match.  If it matches, then step also sets two
global character pointers:  loc1, which points to the first character that
matches the pattern, and loc2, which points to the character immediately
following the last character that matches the pattern.  Thus, if the regular
expression matches the entire string, then loc1 points to the first character
of string and loc2 points to the null character at the end of string.
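
Because loc1 and loc2 delimit the half-open range [loc1, loc2), the matched
text can be copied out with pointer arithmetic.  The sketch below is an
illustration only: literal_step is a hypothetical stand-in for step that
matches a plain substring, and loc1 and loc2 are declared locally rather
than taken from regexp.h.

```c
#include <stddef.h>
#include <string.h>

static char *loc1, *loc2;   /* local stand-ins for the globals step sets */

/* Hypothetical stand-in for step: finds the first occurrence of a
 * literal substring and sets loc1 and loc2 the way step does. */
int literal_step(char *string, const char *literal)
{
    char *hit = strstr(string, literal);

    if (hit == NULL)
        return 0;
    loc1 = hit;
    loc2 = hit + strlen(literal);
    return 1;
}

/* Copies the matched text [loc1, loc2) into out, truncating to fit,
 * and returns the number of characters copied. */
size_t copy_match(char *out, size_t outsize)
{
    size_t len = (size_t)(loc2 - loc1);

    if (outsize == 0)
        return 0;
    if (len >= outsize)
        len = outsize - 1;
    memcpy(out, loc1, len);
    out[len] = '\0';
    return len;
}
```

The offset of the match within the line is loc1 - string, and its length
is loc2 - loc1.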

The step subroutine uses the global variable circf, which is set by compile if
the regular expression begins with a "^" (circumflex).  If this variable is
set, then step only tries to match the regular expression to the beginning of
the string.  If you compile more than one regular expression before executing
the first one, then save the value of circf for each compiled expression and
set circf to that saved value before each call to step.
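
One way to follow this advice is to keep the saved flag next to each
compiled buffer.  The sketch below assumes a toy compiler that understands
only a leading "^" followed by literal text; struct pattern, toy_compile,
toy_step, and match_saved are hypothetical names, and circf is declared
locally rather than taken from regexp.h.

```c
#include <string.h>

#define TOO_BIG 50          /* "Regular expression overflow" */

static int circf;           /* local stand-in for the global set by compile */

struct pattern {
    char expbuf[64];        /* toy "compiled" form: the literal text */
    int  circf;             /* value of circf saved right after compiling */
};

/* Toy compile: a leading '^' sets circf; the rest is kept as a literal. */
int toy_compile(const char *src, struct pattern *p)
{
    circf = 0;
    if (*src == '^') {
        circf = 1;
        src++;
    }
    if (strlen(src) >= sizeof p->expbuf)
        return TOO_BIG;
    strcpy(p->expbuf, src);
    p->circf = circf;       /* save now; the next compile overwrites circf */
    return 0;
}

/* Toy step: honors circf the way the text describes. */
int toy_step(const char *string, const char *expbuf)
{
    if (circf)
        return strncmp(string, expbuf, strlen(expbuf)) == 0;
    return strstr(string, expbuf) != NULL;
}

/* Restores the saved circf before matching with this pattern. */
int match_saved(const char *string, const struct pattern *p)
{
    circf = p->circf;
    return toy_step(string, p->expbuf);
}
```

After compiling "^foo" and then "bar", the global circf is 0, so matching
the first pattern without restoring its saved value would wrongly accept
"a foo".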

The step subroutine calls a subroutine named advance with the same parameters
that it was passed.  The step function increments through the string parameter
and calls advance until advance returns a 1, indicating a match, or until the
end of string is reached.  To constrain the match to the beginning of the string
in all cases, call the advance subroutine directly instead of calling step.
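
The relationship between the two subroutines can be sketched as a loop.
The code below is a simplified model, not the regexp.h source: toy_advance
understands only literal characters and "." (any character), expbuf holds
the pattern text itself rather than compiled code, and loc1, loc2, and
circf are local stand-ins for the globals.

```c
static char *loc1, *loc2;   /* local stand-ins for the match pointers */
static int   circf;         /* nonzero: match only at the start of string */

/* Toy advance: succeeds only if the pattern matches starting exactly at
 * string.  On success loc2 points just past the matched text. */
int toy_advance(char *string, const char *expbuf)
{
    char *lp = string;

    for (; *expbuf != '\0'; expbuf++, lp++) {
        if (*lp == '\0')
            return 0;                       /* string ended too soon */
        if (*expbuf != '.' && *expbuf != *lp)
            return 0;                       /* literal mismatch */
    }
    loc2 = lp;
    return 1;
}

/* Toy step: tries toy_advance at each position in turn, including the
 * terminating null, unless circf restricts it to the first position. */
int toy_step(char *string, const char *expbuf)
{
    char *p = string;

    if (circf) {
        loc1 = p;
        return toy_advance(p, expbuf);
    }
    do {
        if (toy_advance(p, expbuf)) {
            loc1 = p;
            return 1;
        }
    } while (*p++ != '\0');
    return 0;
}
```

For the string "xxabcxx" and the pattern "a.c", toy_step succeeds with
loc1 at offset 2 and loc2 at offset 5, while a direct call to toy_advance
on the same string fails because the match does not begin at the first
character.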

When advance encounters an "*" (asterisk) or a "\{ \}" sequence in the regular
expression, it advances its pointer to the string to be matched as far as
possible and recursively calls itself trying to match the rest of the string to
the rest of the regular expression.  As long as there is no match, advance
backs up along the string until it finds a match or reaches the point in the
string that initially matched the "*" or "\{ \}".  It is sometimes desirable to
stop this backing-up before the initial point in the string is reached.  If the
global character pointer locs equals the current point in the string at any time
during the backing-up process, advance breaks out of the loop that backs up and
returns 0.  This is used by ed and sed for global substitutions on the whole
line so that expressions like "s/y*//g" do not loop forever.

EXAMPLE

The following is an example of the regular expression macros and calls from the
grep command.

  #define INIT          register char *sp=instring;
  #define GETC()        (*sp++)
  #define PEEKC()       (*sp)
  #define UNGETC(c)     (--sp)
  #define RETURN(c)     return;
  #define ERROR(c)      regerr()

  #include <regexp.h>
  ...
  compile (patstr, expbuf, &expbuf[ESIZE], '\0');
  ...
  if (step (linebuf, expbuf))
     succeed ( );
  ...

RELATED INFORMATION

In this book:  "NCcollate, NCcoluniq, NCeqvmap, _NCxcol, _NLxcol" and "regcmp,
regex."

The ed, grep, and sed commands in AIX Operating System Commands Reference.

"Introduction to International Character Support" in Managing the AIX Operating
System.

AIX Guide to Multibyte Character Set (MBCS) Support.


Processed November 7, 1990       REGEXP(3x,L)                                 5