regexp: compile, step, advance
Purpose
Compiles and matches regular-expression patterns.
Library
None
Syntax
#define INIT declarations
#define GETC( ) getc_code
#define PEEKC( ) peekc_code
#define UNGETC(c) ungetc_code
#define RETURN(pointer) return_code
#define ERROR(val) error_code
#include <regexp.h>
char *compile (instring, expbuf, endbuf, eof)
char *instring, *expbuf, *endbuf;
char eof;
int step (string, expbuf)
char *string, *expbuf;
int advance (string, expbuf)
char *string, *expbuf;
Description
The regexp.h header file defines several general purpose
subroutines that perform regular-expression pattern
matching. Programs that perform regular-expression
pattern matching such as ed, sed, grep, bs, and expr use
this source file. In this way, only this file needs to
be changed in order to maintain regular expression com-
patibility between programs.
The regexp.h header file handles extended characters and
may require access to the current collating sequence.
You can disable the extended functionality of regexp.h by
defining the preprocessor variable RTPC_NO_NLS. This is
useful for tasks such as building programs to run on
prior releases of AIX. See "Overview of International
Character Support" in Managing the AIX Operating System
for more information.
The interface to this header file is complex. Programs
that include this file must define the following six
macros before the #include <regexp.h> statement. These
macros are used by the compile subroutine.
INIT
This macro is used for dependent declarations and
initializations. It is placed right after the decla-
ration and opening "{" (left brace) of the compile
subroutine. The definition of INIT must end with a
";" (semicolon). INIT is frequently used to set a
register variable to point to the beginning of the
regular expression so that this register variable can
be used in the declarations for GETC, PEEKC, and
UNGETC. Otherwise, you can use INIT to declare
external variables that GETC, PEEKC, and UNGETC need.
GETC( )
This macro returns the value of the next character in
the regular expression pattern. Successive calls to
the GETC macro should return successive characters of
the pattern.
PEEKC( )
This macro returns the next character in the regular
expression. Successive calls to the PEEKC macro
should return the same character, which should also be
the next character returned by the GETC macro.
UNGETC(c)
This macro causes the parameter c to be returned by
the next call to the GETC and PEEKC macros. No more
than one character of pushback is ever needed and this
character is guaranteed to be the last character read
by the GETC macro. The return value of the UNGETC
macro is always ignored.
RETURN(pointer)
This macro is used on normal exit of the compile sub-
routine. The pointer parameter points to the first
character immediately following the compiled regular
expression. This is useful to programs that have
memory allocation to manage.
ERROR(val)
This macro is used on abnormal exit from the compile
subroutine. It should never contain a return state-
ment. The val parameter is an error number. The
error values and their meanings are:
Error Meaning
11 Range endpoint too large.
16 Bad number.
25 "\digit" out of range.
36 Illegal or missing delimiter.
41 No remembered search string.
42 "\( \)" imbalance.
43 Too many "\(".
44 More than two numbers given in \{ \}.
45 "}" expected after "\".
46 First number exceeds second in \{ \}.
49 "[ ]" imbalance.
50 Regular expression overflow.
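The error numbers in the table above are usually turned into
messages by whatever code the ERROR macro invokes. The following
is a minimal sketch of such a lookup. The name regerr_message is
hypothetical and is not defined by regexp.h; the message text is
copied from the table.

```c
/* Hypothetical helper (not part of regexp.h): maps the val passed
   to ERROR(val) to the message listed in the table above. */
const char *regerr_message(int val)
{
    switch (val) {
    case 11: return "Range endpoint too large.";
    case 16: return "Bad number.";
    case 25: return "\"\\digit\" out of range.";
    case 36: return "Illegal or missing delimiter.";
    case 41: return "No remembered search string.";
    case 42: return "\"\\( \\)\" imbalance.";
    case 43: return "Too many \"\\(\".";
    case 44: return "More than two numbers given in \\{ \\}.";
    case 45: return "\"}\" expected after \"\\\".";
    case 46: return "First number exceeds second in \\{ \\}.";
    case 49: return "\"[ ]\" imbalance.";
    case 50: return "Regular expression overflow.";
    default: return "Unknown regular expression error.";
    }
}
```

A program might define ERROR(val) to call a function of its own
that prints regerr_message(val) and then runs the program's error
recovery code.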
The compile subroutine compiles the regular expression
for later use. The instring parameter is never used
explicitly by the compile subroutine, but you can use it
in your macros. For instance, you may want to pass the
string containing the pattern as the instring parameter
to compile and use the INIT macro to set a pointer to the
beginning of this string. (The following example uses
this technique.) If your macros do not use instring,
then call compile with a value of ((char *) 0) for this
parameter.
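The technique described above, where INIT sets a pointer to the
start of instring and the character macros work through that
pointer, can be sketched without regexp.h. In the sketch below
the function pattern_length is a hypothetical stand-in for the
way compile consumes its input; it is not part of the interface.
In a real program the macro definitions would precede the
#include <regexp.h> statement.

```c
/* Illustrative macro definitions in the style this interface
   expects. INIT ends with a semicolon, as required. */
#define INIT      const char *sp = instring;
#define GETC()    (*sp++)
#define PEEKC()   (*sp)
#define UNGETC(c) (--sp)

/* Hypothetical stand-in for compile's input loop: counts the
   pattern characters that precede the eof delimiter. Returns -1
   if the string ends before the delimiter is seen. */
int pattern_length(const char *instring, char eof)
{
    INIT
    int n = 0;
    int c;

    for (;;) {
        if (PEEKC() == '\0')   /* look ahead without consuming */
            return -1;
        c = GETC();            /* consume the next character */
        if (c == eof) {
            UNGETC(c);         /* push the delimiter back */
            break;
        }
        n++;
    }
    return n;
}
```

Because PEEKC does not move the pointer, the character it reports
is the same one the next GETC call returns, and UNGETC only ever
pushes back the character that GETC read last.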
The expbuf parameter points to a character array where
the compiled regular expression is to be placed. The
endbuf parameter points to the location that immediately
follows the character array where the compiled regular
expression is to be placed. If the compiled expression
cannot fit in (endbuf-expbuf) bytes, the call ERROR(50)
is made.
The eof parameter is the character that marks the end of
the regular expression. For example, in ed this character
is usually "/" (slash).
The regexp.h header file defines other subroutines that
perform actual regular-expression pattern matching. One
of these is the step subroutine.
The string parameter of step is a pointer to a null-
terminated string of characters to be checked for a
match.
The expbuf parameter points to the compiled regular
expression, which was obtained by a call to the compile
subroutine.
The step subroutine returns the value 1 if the given
string matches the pattern, and 0 if it does not match.
If it matches, then step also sets two global character
pointers: loc1, which points to the first character that
matches the pattern, and loc2, which points to the char-
acter immediately following the last character that
matches the pattern. Thus, if the regular expression
matches the entire string, then loc1 points to the first
character of string and loc2 points to the null character
at the end of string.
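Since loc1 and loc2 delimit the matched text, a caller can copy
the match out of the string after step returns 1. The following
is a minimal sketch; copy_match is a hypothetical helper, not
something regexp.h provides, and it takes the two pointers as
ordinary arguments so that it does not depend on the header.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies the characters from start up to,
   but not including, end into out, truncating to fit, and
   null-terminates the result. Returns the number of characters
   copied. Intended to be called as
   copy_match(loc1, loc2, buf, sizeof buf). */
size_t copy_match(const char *start, const char *end,
                  char *out, size_t outsize)
{
    size_t len;

    if (outsize == 0)
        return 0;
    len = (size_t)(end - start);
    if (len >= outsize)
        len = outsize - 1;     /* leave room for the terminator */
    memcpy(out, start, len);
    out[len] = '\0';
    return len;
}
```

The length of the match is simply loc2 - loc1, which is why loc2
is defined to point one character past the end of the match.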
The step subroutine uses the global variable circf, which
is set by compile if the regular expression begins with a
"^" (circumflex). If this variable is set, then step
only tries to match the regular expression to the
beginning of the string. If you compile more than one
regular expression before executing the first one, then
save the value of circf for each compiled expression and
set circf to that saved value before each call to step.
The step subroutine calls a subroutine named advance with
the same parameters that it was passed. The step
subroutine increments through the string parameter and
calls advance until advance returns a 1, indicating a
match, or until the end of string is reached. To constrain
the match to the beginning of the string in all cases,
call the advance subroutine directly instead of calling
step.
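The relationship between step and advance described above can be
modeled in a few lines. This is a simplified sketch, not the
code in regexp.h: toy_advance is a hypothetical stand-in that
matches a literal string instead of a compiled expression, and
the anchored argument plays the role of circf.

```c
/* Hypothetical stand-in for advance: succeeds only if lit
   matches at the very start of string. */
static int toy_advance(const char *string, const char *lit)
{
    while (*lit != '\0') {
        if (*string++ != *lit++)
            return 0;
    }
    return 1;
}

/* Model of step: tries toy_advance at each successive position
   of string until it succeeds or the end of string is reached.
   When anchored is nonzero, only the first position is tried. */
int toy_step(const char *string, const char *lit, int anchored)
{
    if (anchored)
        return toy_advance(string, lit);
    do {
        if (toy_advance(string, lit))
            return 1;
    } while (*string++ != '\0');
    return 0;
}
```

In this model, calling toy_advance directly gives the same
result as calling toy_step with anchored set, which mirrors the
advice to call advance instead of step when the match must start
at the beginning of the string.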
When advance encounters an "*" (asterisk) or a "\{ \}"
sequence in the regular expression, it advances its
pointer to the string to be matched as far as possible
and recursively calls itself trying to match the rest of
the string to the rest of the regular expression. As
long as there is no match, advance backs up along the
string until it finds a match or reaches the point in the
string that initially matched the "*" or "\{ \}". It is
sometimes desirable to stop this backing-up before the
initial point in the string is reached. If the global
character pointer locs equals the current position in the
string at any time during the backing-up process, advance
breaks out of the loop that backs up and returns 0. This
is used by ed and sed for global substitutions on the
whole line so that expressions like "s/y*//g" do not loop
forever.
Example
The following is an example of the regular expression
macros and calls from the grep command.
#define INIT register char *sp=instring;
#define GETC() (*sp++)
#define PEEKC() (*sp)
#define UNGETC(c) (--sp)
#define RETURN(c) return;
#define ERROR(c) regerr()
#include <regexp.h>
. . .
compile (patstr, expbuf, &expbuf[ESIZE], '\0');
. . .
if (step (linebuf, expbuf))
succeed ( );
. . .
Related Information
In this book: "NCcollate, NCcoluniq, NCeqvmap, _NCxcol,
_NLxcol" and "regcmp, regex."
The ed, grep, and sed commands in AIX Operating System
Commands Reference.
"Overview of International Character Support" in Managing
the AIX Operating System.