regexp(5)
_________________________________________________________________
regexp Miscellany
regular expression compile and match routines
_________________________________________________________________
SYNTAX
#define INIT <declarations>
#define GETC() <getc code>
#define PEEKC() <peekc code>
#define UNGETC(c) <ungetc code>
#define RETURN(pointer) <return code>
#define ERROR(val) <error code>
#include <regexp.h>
char *compile (instring, expbuf, endbuf, eof)
char *instring, *expbuf, *endbuf;
int eof;
int step (string, expbuf)
char *string, *expbuf;
extern char *loc1, *loc2, *locs;
extern int circf, sed, nbra;
DESCRIPTION
This entry describes general-purpose regular expression matching
routines in the form of ed(1), defined in /usr/include/regexp.h.
Programs that perform regular expression matching use this source
file, including ed(1), sed(1), grep(1), bs(1), and expr(1). Only
this file need be changed to maintain regular expression
compatibility.
The interface to this file is unpleasantly complex. Programs
that include this file must have the following five macros
declared before the #include <regexp.h> statement. These macros
are used by the compile routine.
GETC() Return the value of the next character in the regular
expression pattern. Successive calls to GETC() should
return successive characters of the regular expression.
PEEKC() Return the next character in the regular expression.
Successive calls to PEEKC() should return the same
character (which should also be the next character
returned by GETC()).
DG/UX 4.00 Page 1
Licensed material--property of copyright holder(s)
regexp(5)
UNGETC(c) Make the argument c return from the next call to GETC()
or PEEKC(). No more that one character of pushback is
ever needed. This character is guaranteed to be the
last character read by GETC(). The value of the macro
UNGETC(c) is always ignored.
RETURN(pointer)
This macro is used on normal exit of the compile
routine. The value of the argument pointer is a
pointer to the character after the last character of
the compiled regular expression. This is useful to
programs that have to manage memory allocation.
ERROR(val)
This is the abnormal return from the compile routine.
The argument val is an error number (see table below
for meanings). This call should never return.
ERROR
MEANING
11 Range endpoint too large
16 Bad number
25 \digit out of range
36 Illegal or missing delimiter
41 No remembered search string
42 \( \) imbalance
43 Too many \(
44 More than 2 numbers given in \{ \}
45 } expected after \
46 First number exceeds second in \{ \}
49 [ ] imbalance
50 Regular expression overflow
The syntax of the compile routine is as follows:
compile(instring, expbuf, endbuf, eof)
The first parameter instring is never used explicitly by the
compile routine but is useful for programs that pass down
different pointers to input characters. It is sometimes used in
the INIT declaration (see below). Programs that call functions
to input characters or have characters in an external array can
pass down a value of ((char *) 0) for this parameter.
The next parameter expbuf is a character pointer. It points to
the place where the compiled regular expression will be placed.
The parameter endbuf is one more than the highest address where
the compiled regular expression may be placed. If the compiled
expression cannot fit in (endbuf-expbuf) bytes, a call to
ERROR(50) is made.
DG/UX 4.00 Page 2
Licensed material--property of copyright holder(s)
regexp(5)
The parameter eof is the character which marks the end of the
regular expression. For example, in ed(1), this character is
usually /.
Each program that includes this file must have a #define
statement for INIT. This definition will be placed right after
the declaration for the function compile and the opening curly
brace ({). It is used for dependent declarations and
initializations. Most often, it is used to set a register
variable to point to the beginning of the regular expression This
register variable can then be used in the declarations for
GETC(), PEEKC() and UNGETC(). Otherwise, you can use it to
declare external variables that might be used by GETC(), PEEKC()
and UNGETC(). See the example below of the declarations taken
from grep(1).
Other functions in this file perform actual regular expression
matching, one of which is the function step. The call to step is
as follows:
step(string, expbuf)
The first parameter to step is a pointer to a string of
characters to be checked for a match. This string should be null
terminated.
The second parameter expbuf is the compiled regular expression
obtained by a call of the function compile.
The function step returns non-zero if the given string matches
the regular expression, and zero if the expressions do not match.
If there is a match, two external character pointers are set as a
side effect to the call to step. The variable set in step is
loc1. This is a pointer to the first character that matched the
regular expression. The variable loc2, which is set by the
function advance, points to the character after the last
character that matches the regular expression. Thus if the
regular expression matches the entire line, loc1 will point to
the first character of string and loc2 will point to the null at
the end of string.
Step uses the external variable circf, which is set by compile if
the regular expression begins with ^. If this is set, step will
try to match the regular expression to the beginning of the
string only. If more than one regular expression is to be
compiled before the first is executed, save the value of circf
for each compiled expression and set circf to that saved value
before each call to step.
The function advance is called from step with the same arguments
as step. Step steps through the string argument and calls
DG/UX 4.00 Page 3
Licensed material--property of copyright holder(s)
regexp(5)
advance until advance returns non-zero, indicating a match, or
until the end of string is reached. If one wants to constrain
string to the beginning of the line in all cases, step need not
be called; simply call advance.
When advance encounters a * or \{ \} sequence in the regular
expression, it advances its pointer to the string to be matched
as far as possible. It recursively calls itself, trying to match
the rest of the string to the rest of the regular expression. As
long as there is no match, advance backs up along the string
until it finds a match, or until it reaches the point in the
string that initially matched the * or \{ \}. You may want to
stop this backing up before the initial point in the string is
reached. If the external character pointer locs is equal to the
point in the string at some time during the backing up process,
advance breaks out of the loop that backs up and returns zero.
This is used by ed(1) and sed(1) for substitutions done globally
(not just the first occurrence, but the whole line). Therefore,
expressions like s/y*//g do not loop forever.
The additional external variables sed and nbra are used for
special purposes.
EXAMPLES
The following shows how the regular expression macros and calls
look from grep(1):
#define INIT register char *sp = instring;
#define GETC() (*sp++)
#define PEEKC() (*sp)
#define UNGETC(c) (--sp)
#define RETURN(c) return;
#define ERROR(c) regerr()
#include <regexp.h>
...
(void) compile(*argv, expbuf, &expbuf[ESIZE], '\0');
...
if (step(linebuf, expbuf))
succeed();
FILES
/usr/include/regexp.h
SEE ALSO
DG/UX 4.00 Page 4
Licensed material--property of copyright holder(s)
regexp(5)
bs(1), ed(1), expr(1), grep(1), sed(1) in the User's Reference
for the DG/UX System
DG/UX 4.00 Page 5
Licensed material--property of copyright holder(s)