REGEXP(3x,L) AIX Technical Reference REGEXP(3x,L)
-------------------------------------------------------------------------------
regexp: compile, step, advance
PURPOSE
Compiles and matches regular-expression patterns.
LIBRARY
None
SYNTAX
#define INIT declarations
#define GETC( ) getc_code
#define PEEKC( ) peekc_code
#define UNGETC(c) ungetc_code
#define RETURN(pointer) return_code
#define ERROR(val) error_code
#include <regexp.h>
char *compile (instring, expbuf, endbuf, eof)
char *instring, *expbuf, *endbuf;
int eof;
int step (string, expbuf)
char *string, *expbuf;
int advance (string, expbuf)
char *string, *expbuf;
DESCRIPTION
The regexp.h header file defines several general-purpose subroutines that
perform regular-expression pattern matching. Programs that perform
regular-expression pattern matching, such as ed, sed, grep, bs, and expr, use
this source file. Because they share it, only this one file needs to be changed
to keep regular-expression behavior compatible among programs.
The NLregexp.h functions compile, step, and advance operate on file code
strings. The programmer must define the following macros before including
NLregexp.h.
INIT
This macro is used for dependent declarations and initializations. It is
placed right after the declaration and opening "{" (left brace) of the
compile subroutine. The definition of INIT must end with a ";" (semicolon).
INIT is frequently used to set a register variable to point to the beginning of
Processed November 7, 1990 REGEXP(3x,L) 1
the regular expression so that this register variable can be used in the
declarations for GETC, PEEKC, and UNGETC. Otherwise, you can use INIT to
declare external variables that GETC, PEEKC, and UNGETC need.
#define INIT register char *sp = instring; \
int sp_len; \
mbchar_t sp_peekc;
GETC( )
This macro returns the value of the next character (as an mbchar_t) in the
regular expression pattern. Successive calls to the GETC macro should
return successive characters of the pattern.
#define GETC() (PEEKC(),sp+=sp_len,sp_peekc)
PEEKC( )
This macro returns the next character (as an mbchar_t) in the regular
expression. Successive calls to the PEEKC macro should return the same
character, which should also be the next character returned by the GETC
macro. The special value ERR should be returned if there is an error in the
character.
#define PEEKC() ( (-1==(sp_len=mbstomb (&sp_peekc,sp,MB_LEN_MAX) ) ) \
? sp_peekc=ERR\
: sp_peekc)
UNGETC(c)
This macro causes the parameter c to be returned by the next call to the
GETC and PEEKC macros. No more than one character of pushback is ever
needed, and this character is guaranteed to be the last character read by
the GETC macro. The return value of the UNGETC macro is always ignored.
#define UNGETC(c) (sp-=sp_len)
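As a rough illustration of how GETC, PEEKC, and UNGETC cooperate, the following
sketch defines single-byte versions of the macros and drives them from a small
stand-in routine. The routine count_atoms and its notion of an "atom" are
hypothetical; they are not part of regexp.h and only mimic the way compile
reads one character ahead and pushes it back.

```c
#include <assert.h>

/* Single-byte macro definitions, in the style of the EXAMPLE section. */
#define INIT      const char *sp = instring;
#define GETC()    (*sp++)
#define PEEKC()   (*sp)
#define UNGETC(c) (--sp)

/*
 * Hypothetical stand-in for compile: counts the atoms in a pattern that
 * ends at the delimiter eof (or at the terminating null character).
 * An atom is one character, or a backslash plus the character after it;
 * a "*" that follows an atom is treated as part of that atom.
 */
int count_atoms(const char *instring, int eof)
{
    INIT
    int c, atoms = 0;

    assert(instring != 0);
    while (PEEKC() != eof && PEEKC() != '\0') {
        c = GETC();                 /* first character of the atom */
        if (c == '\\' && PEEKC() != '\0')
            c = GETC();             /* escaped character joins the atom */
        atoms++;
        c = GETC();                 /* read one character ahead */
        if (c != '*')
            UNGETC(c);              /* not a closure: push it back */
    }
    return atoms;
}
```

Note that UNGETC is only ever applied to the character most recently returned
by GETC, which is why a single level of pushback is enough.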
RETURN(pointer)
This macro is used on normal exit of the compile subroutine. The pointer
parameter points to the first character immediately following the compiled
regular expression. This is useful to programs that have memory allocation
to manage.
#define RETURN(p) return
ERROR(val)
This macro is used on abnormal exit from the compile subroutine. It should
never contain a return statement. The val parameter is an error number.
#define ERROR(c) regerr (c)
The error values and their meanings are:
Error Name   Value  Meaning
BIG_RANGE    11     Range endpoint too large.
BAD_NUM      16     Bad number.
BAD_BACK     25     "\" digit out of range.
BAD_DELIM    36     Illegal or missing delimiter.
NO_SAVED     41     No remembered search string.
BAD_LEFTP    42     "\(\)" imbalance.
BAD_RIGHTP   43     Too many "\(".
EX_COMMA     44     More than two numbers given in \{ \}.
NO_CLOSE     45     "}" expected after "\".
MAX_MIN      46     First number exceeds second in \{ \}.
BAD_BRAK     49     "[ ]" imbalance.
TOO_BIG      50     Regular expression overflow.
STACK_EMPTY  51     Backtrack stack empty.
STACK_FULL   52     Backtrack stack full.
BAD_CHAR     60     Strange multibyte character.
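A program that defines ERROR usually needs to turn the error number into a
message. The following is a minimal sketch of such a lookup, built only from
the names, values, and meanings in the error table. The structure regerr_entry
and the functions regerr_meaning and regerr_value are hypothetical helpers, not
part of regexp.h.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One row of the error table. */
struct regerr_entry {
    int         val;
    const char *name;
    const char *meaning;
};

static const struct regerr_entry regerr_table[] = {
    { 11, "BIG_RANGE",   "Range endpoint too large." },
    { 16, "BAD_NUM",     "Bad number." },
    { 25, "BAD_BACK",    "\"\\\" digit out of range." },
    { 36, "BAD_DELIM",   "Illegal or missing delimiter." },
    { 41, "NO_SAVED",    "No remembered search string." },
    { 42, "BAD_LEFTP",   "\"\\(\\)\" imbalance." },
    { 43, "BAD_RIGHTP",  "Too many \"\\(\"." },
    { 44, "EX_COMMA",    "More than two numbers given in \\{ \\}." },
    { 45, "NO_CLOSE",    "\"}\" expected after \"\\\"." },
    { 46, "MAX_MIN",     "First number exceeds second in \\{ \\}." },
    { 49, "BAD_BRAK",    "\"[ ]\" imbalance." },
    { 50, "TOO_BIG",     "Regular expression overflow." },
    { 51, "STACK_EMPTY", "Backtrack stack empty." },
    { 52, "STACK_FULL",  "Backtrack stack full." },
    { 60, "BAD_CHAR",    "Strange multibyte character." }
};

#define REGERR_COUNT (sizeof regerr_table / sizeof regerr_table[0])

/* Return the meaning for an error value, or a fallback for unknown values. */
const char *regerr_meaning(int val)
{
    size_t i;

    for (i = 0; i < REGERR_COUNT; i++)
        if (regerr_table[i].val == val)
            return regerr_table[i].meaning;
    return "Unknown regular-expression error.";
}

/* Return the value for an error name, or -1 if the name is not listed. */
int regerr_value(const char *name)
{
    size_t i;

    assert(name != NULL);
    for (i = 0; i < REGERR_COUNT; i++)
        if (strcmp(regerr_table[i].name, name) == 0)
            return regerr_table[i].val;
    return -1;
}
```

A regerr routine of the kind named in the ERROR definition could print
regerr_meaning(val) before leaving the compile step by whatever means the
program uses.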
The compile subroutine compiles the regular expression for later use. The
instring parameter is never used explicitly by the compile subroutine, but you
can use it in your macros. For instance, you may want to pass the string
containing the pattern as the instring parameter to compile and use the INIT
macro to set a pointer to the beginning of this string. (The following example
uses this technique.) If your macros do not use instring, then call compile
with a value of ((char *) 0) for this parameter.
The expbuf parameter points to a character array where the compiled regular
expression is to be placed. The endbuf parameter points to the location that
immediately follows the character array where the compiled regular expression
is to be placed. If the compiled expression cannot fit in (endbuf-expbuf)
bytes, the call ERROR(50) is made.
The eof parameter is the character that marks the end of the regular
expression. For example, in ed this character is usually "/" (slash).
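The roles of expbuf, endbuf, and eof can be sketched with a toy routine that
takes the same kind of parameters. This is not the compile subroutine: the
function toy_compile, its err parameter, and its "compiled" format (the literal
pattern characters followed by a null byte) are assumptions made for
illustration. Only the overflow rule and the value 50 (TOO_BIG) come from the
text above.

```c
#include <assert.h>

#define TOO_BIG 50      /* "Regular expression overflow." */

/*
 * Toy stand-in for compile. Copies pattern characters from instring into
 * expbuf until the delimiter eof (or the end of instring), then appends a
 * null byte as an end marker. endbuf points just past the last usable byte
 * of expbuf. On success, returns the location that follows the compiled
 * form (the value a RETURN macro would receive) and sets *err to 0.
 * On overflow, returns a null pointer and sets *err to TOO_BIG.
 */
char *toy_compile(const char *instring, char *expbuf, char *endbuf,
                  int eof, int *err)
{
    const char *sp = instring;
    char *ep = expbuf;

    assert(instring != 0 && err != 0 && expbuf <= endbuf);
    *err = 0;
    while (*sp != eof && *sp != '\0') {
        if (endbuf - ep < 2) {      /* need room for this byte and marker */
            *err = TOO_BIG;
            return 0;
        }
        *ep++ = *sp++;
    }
    if (endbuf - ep < 1) {          /* no room for the end marker */
        *err = TOO_BIG;
        return 0;
    }
    *ep++ = '\0';
    return ep;
}
```

A caller with a buffer declared as char expbuf[ESIZE] would pass
&expbuf[ESIZE] as endbuf, as the call in the EXAMPLE section does.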
The regexp.h header file defines other subroutines that perform actual
regular-expression pattern matching. One of these is the step subroutine.
The string parameter of step is a pointer to a null-terminated string of
characters to be checked for a match.
The expbuf parameter points to the compiled regular expression, which was
obtained by a call to the compile subroutine.
The step subroutine returns the value 1 if the given string matches the
pattern, and 0 if it does not match. If it matches, then step also sets two
global character pointers: loc1, which points to the first character that
matches the pattern, and loc2, which points to the character immediately
following the last character that matches the pattern. Thus, if the regular
expression matches the entire string, then loc1 points to the first character
of string and loc2 points to the null character at the end of string.
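Because loc1 and loc2 delimit the matched text as a half-open range, extracting
the match is a bounded copy. The following sketch assumes the caller passes the
values of loc1 and loc2 after step has returned 1. The function copy_match and
its parameter names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the text in the range [start, end) into out, which holds outsize
 * bytes, truncating if necessary and always null-terminating when
 * outsize is nonzero. Returns the number of characters copied.
 */
size_t copy_match(const char *start, const char *end,
                  char *out, size_t outsize)
{
    size_t len;

    assert(start != NULL && end != NULL && start <= end);
    if (outsize == 0)
        return 0;
    len = (size_t)(end - start);
    if (len > outsize - 1)
        len = outsize - 1;          /* leave room for the null character */
    memcpy(out, start, len);
    out[len] = '\0';
    return len;
}
```

In a program that includes regexp.h, the call would look like
copy_match(loc1, loc2, buf, sizeof buf).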
The step subroutine uses the global variable circf, which is set by compile if
the regular expression begins with a "^" (circumflex). If this variable is
set, then step only tries to match the regular expression to the beginning of
the string. If you compile more than one regular expression before executing
the first one, then save the value of circf for each compiled expression and
set circf to that saved value before each call to step.
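One way to follow this advice is to keep the saved circf value next to each
compiled expression. The sketch below assumes a buffer size ESIZE of 256 and
declares its own circf variable as a stand-in for the global that the header
provides. The structure saved_re and both helper functions are hypothetical.

```c
#include <assert.h>

#define ESIZE 256               /* assumed size of a compiled expression */

int circf;                      /* stand-in for the global set by compile */

/* A compiled expression together with the circf value it was built with. */
struct saved_re {
    char expbuf[ESIZE];
    int  circf;
};

/* Call right after compiling into re->expbuf. */
void remember_circf(struct saved_re *re)
{
    assert(re != 0);
    re->circf = circf;
}

/* Call right before passing re->expbuf to step. */
void restore_circf(const struct saved_re *re)
{
    assert(re != 0);
    circf = re->circf;
}
```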
The step subroutine calls a subroutine named advance with the same parameters
that it was passed. The step subroutine increments through the string parameter
and calls advance until advance returns a 1, indicating a match, or until the
end of string is reached. To constrain the match to the beginning of string
in all cases, call the advance subroutine directly instead of calling step.
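The relationship between step and advance can be sketched with a toy matcher
in which the "compiled expression" is just a literal, null-terminated string.
Everything prefixed with toy_ is a hypothetical stand-in; the real subroutines
interpret the compiled form produced by compile.

```c
#include <assert.h>

int toy_circf;                      /* stand-in for circf */
const char *toy_loc1, *toy_loc2;    /* stand-ins for loc1 and loc2 */

/* Anchored match: succeeds only if the text at lp begins with ep. */
int toy_advance(const char *lp, const char *ep)
{
    while (*ep != '\0') {
        if (*lp++ != *ep++)
            return 0;
    }
    toy_loc2 = lp;                  /* just past the last matched character */
    return 1;
}

/* Unanchored match: tries toy_advance at each position in string. */
int toy_step(const char *string, const char *expbuf)
{
    const char *p = string;

    assert(string != 0 && expbuf != 0);
    if (toy_circf) {                /* pattern began with "^" */
        toy_loc1 = p;
        return toy_advance(p, expbuf);
    }
    do {
        if (toy_advance(p, expbuf)) {
            toy_loc1 = p;
            return 1;
        }
    } while (*p++ != '\0');         /* stop after trying the end of string */
    return 0;
}
```

With the string "xxabc" and the literal "abc", toy_step succeeds at offset 2,
while toy_advance called directly on the same string fails because the match
does not start at the beginning.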
When advance encounters an "*" (asterisk) or a "\{ \}" sequence in the regular
expression, it advances its pointer to the string to be matched as far as
possible and recursively calls itself trying to match the rest of the string to
the rest of the regular expression. As long as there is no match, advance
backs up along the string until it finds a match or reaches the point in the
string that initially matched the "*" or "\{ \}". It is sometimes desirable to
stop this backing-up before the initial point in the string is reached. If the
global character pointer locs equals the current point in the string at any
time during the backing-up process, advance breaks out of the loop that backs
up and returns 0. The ed and sed commands use this for global substitutions on
the whole line so that expressions like "s/y*//g" do not loop forever.
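The backing-up loop and the locs cutoff can be sketched for the simplest case:
a single character followed by "*", then a literal remainder. The function
toy_advance_star and the variables toy_locs and toy_end are hypothetical
stand-ins, and the literal remainder replaces the recursive call that the
advance subroutine makes on the rest of the compiled expression.

```c
#include <assert.h>
#include <string.h>

const char *toy_locs;   /* stand-in for locs; a null pointer disables it */
const char *toy_end;    /* stand-in for loc2 */

/*
 * Match zero or more occurrences of c followed by the literal rest,
 * anchored at lp. Returns 1 on a match and 0 otherwise.
 */
int toy_advance_star(const char *lp, int c, const char *rest)
{
    const char *curlp = lp;         /* where the "*" started matching */
    size_t n;

    assert(lp != 0 && rest != 0);
    n = strlen(rest);
    while (*lp != '\0' && *lp == c) /* advance as far as possible */
        lp++;
    for (;;) {
        if (lp == toy_locs)         /* cutoff: stop backing up, no match */
            break;
        if (strncmp(lp, rest, n) == 0) {
            toy_end = lp + n;
            return 1;
        }
        if (lp == curlp)            /* backed up to the starting point */
            break;
        lp--;
    }
    return 0;
}
```

For the string "yyx" and the pattern "y*", the first call matches "yy". If the
caller then sets the cutoff to the end of that match and tries again from
there, the empty match at the same point is rejected, which is the behavior
that keeps a global substitution from looping.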
EXAMPLE
The following is an example of the regular expression macros and calls from the
grep command.
#define INIT register char *sp=instring;
#define GETC() (*sp++)
#define PEEKC() (*sp)
#define UNGETC(c) (--sp)
#define RETURN(c) return;
#define ERROR(c) regerr()
#include <regexp.h>
...
compile (patstr, expbuf, &expbuf[ESIZE], '\0');
...
if (step (linebuf, expbuf))
     succeed ( );
...
RELATED INFORMATION
In this book: "NCcollate, NCcoluniq, NCeqvmap, _NCxcol, _NLxcol" and "regcmp,
regex."
The ed, grep, and sed commands in AIX Operating System Commands Reference.
"Introduction to International Character Support" in Managing the AIX Operating
System.
AIX Guide to Multibyte Character Set (MBCS) Support.