regexp(5) MISC. FILE FORMATS regexp(5)
NAME
regexp: compile, step, advance - regular expression compile
and match routines
SYNOPSIS
#define INIT declarations
#define GETC(void) getc code
#define PEEKC(void) peekc code
#define UNGETC(c) ungetc code
#define RETURN(ptr) return code
#define ERROR(val) error code
#include <regexp.h>
char *compile(char *instring, char *expbuf, char *endbuf,
int eof);
int step(char *string, char *expbuf);
int advance(char *string, char *expbuf);
extern char *loc1, *loc2, *locs;
DESCRIPTION
These functions are general purpose regular expression
matching routines to be used in programs that perform regu-
lar expression matching. These functions are defined by the
<regexp.h> header file. The functions step and advance do
pattern matching given a character string and a compiled
regular expression as input. The function compile takes as
input a regular expression as defined below and produces a
compiled expression that can be used with step or advance.
A regular expression specifies a set of character strings.
A member of this set of strings is said to be matched by the
regular expression. Some characters have special meaning
when used in a regular expression; other characters stand
for themselves. The regular expressions available for use
with the regexp functions are constructed as follows:
Expression Meaning
c the character c where c is not a special charac-
ter.
\c the character c where c is any character, except
a digit in the range 1-9.
^ the beginning of the line being compared.
$ the end of the line being compared.
. any character in the input.
[s] any character in the set s, where s is a
sequence of characters and/or a range of charac-
ters, e.g., [c-c].
[^s] any character not in the set s, where s is
defined as above.
r* zero or more successive occurrences of the regu-
lar expression r. The longest leftmost match is
chosen.
rx the occurrence of regular expression r followed
by the occurrence of regular expression x.
(Concatenation)
r\{m,n\} any number of m through n successive occurrences
of the regular expression r. The regular
expression r\{m\} matches exactly m occurrences;
r\{m,\} matches at least m occurrences.
\(r\) the regular expression r. When \n (where n is a
number greater than zero) appears in a constructed
regular expression, it stands for the regular
expression x where x is the nth regular expression
enclosed in \( and \) that appeared earlier in the
constructed regular expression. For example,
\(r\)x\(y\)z\2 is the concatenation of regular
expressions rxyzy.

Characters that have special meaning except when they appear
within square brackets ([]) or are preceded by \ are: ., *,
[, \. Other special characters, such as $, have special
meaning in more restricted contexts.

The character ^ at the beginning of an expression permits a
successful match only immediately after a newline, and the
character $ at the end of an expression requires a trailing
newline.

Two characters have special meaning only when used within
square brackets. The character - denotes a range, [c-c],
unless it is just after the open bracket or before the
closing bracket, [-c] or [c-], in which case it has no
special meaning. When used within brackets, the character ^
has the meaning "complement of" if it immediately follows
the open bracket (example: [^c]); elsewhere between brackets
(example: [c^]) it stands for the ordinary character ^.

The special meaning of the \ operator can be escaped only by
preceding it with another \, e.g. \\.

Programs must have the following five macros declared before
the #include <regexp.h> statement. These macros are used by
the compile routine. The macros GETC, PEEKC, and UNGETC
operate on the regular expression given as input to compile.
GETC This macro returns the value of the next
character (byte) in the regular expression
pattern. Successive calls to GETC should
return successive characters of the regular
expression.
PEEKC This macro returns the next character (byte)
in the regular expression. Immediately suc-
cessive calls to PEEKC should return the same
character, which should also be the next
character returned by GETC.
UNGETC This macro causes the argument c to be
returned by the next call to GETC and PEEKC.
No more than one character of pushback is
ever needed and this character is guaranteed
to be the last character read by GETC. The
return value of the macro UNGETC(c) is always
ignored.
RETURN(ptr) This macro is used on normal exit of the com-
pile routine. The value of the argument ptr
is a pointer to the character after the last
character of the compiled regular expression.
This is useful to programs which have memory
allocation to manage.
ERROR(val) This macro is the abnormal return from the
compile routine. The argument val is an
error number [see ERRORS below for meanings].
This call should never return.

The syntax of the compile routine is as follows:
compile(instring, expbuf, endbuf, eof)
The first parameter, instring, is never used explicitly by
the compile routine but is useful for programs that pass
down different pointers to input characters. It is sometimes
used in the INIT declaration (see below). Programs which
call functions to input characters or have characters in an
external array can pass down a value of (char *)0 for this
parameter.

The next parameter, expbuf, is a character pointer. It
points to the place where the compiled regular expression
will be placed.

The parameter endbuf is one more than the highest address
where the compiled regular expression may be placed. If the
compiled expression cannot fit in (endbuf-expbuf) bytes, a
call to ERROR(50) is made.

The parameter eof is the character which marks the end of
the regular expression. This character is usually a /.

Each program that includes the <regexp.h> header file must
have a #define statement for INIT. It is used for dependent
declarations and initializations. Most often it is used to
set a register variable to point to the beginning of the
regular expression so that this register variable can be
used in the declarations for GETC, PEEKC, and UNGETC.
Otherwise it can be used to declare external variables that
might be used by GETC, PEEKC and UNGETC. [See EXAMPLE
below.]

The first parameter to the step and advance functions is a
pointer to a string of characters to be checked for a match.
This string should be null terminated. The second parameter,
expbuf, is the compiled regular expression which was
obtained by a call to the function compile.

The function step returns non-zero if some substring of
string matches the regular expression in expbuf and zero if
there is no match. If there is a match, two external
character pointers are set as a side effect of the call to
step. The variable loc1 points to the first character that
matched the regular expression; the variable loc2 points to
the character after the last character that matches the
regular expression. Thus if the regular expression matches
the entire input string, loc1 will point to the first
character of string and loc2 will point to the null at the
end of string.

The function advance returns non-zero if the initial
substring of string matches the regular expression in
expbuf. If there is a match, an external character pointer,
loc2, is set as a side effect. The variable loc2 points to
the next character in string after the last character that
matched.

When advance encounters a * or \{ \} sequence in the regular
expression, it will advance its pointer to the string to be
matched as far as possible and will recursively call itself
trying to match the rest of the string to the rest of the
regular expression. As long as there is no match, advance
will back up along the string until it finds a match or
reaches the point in the string that initially matched the *
or \{ \}. It is sometimes desirable to stop this backing up
before the initial point in the string is reached. If the
external character pointer locs is equal to the current
point in the string at some time during the backing up
process, advance will break out of the loop that backs up
and will return zero.

The external variables circf, sed, and nbra are reserved.
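
The loc1 and loc2 convention described for step can be tried
out on systems where <regexp.h> is not available. The sketch
below is a stand-in, not the regexp interface itself: it
uses the POSIX regcomp() and regexec() functions from
<regex.h> in basic regular expression mode, whose syntax
closely resembles the expressions listed above but is not
guaranteed to be identical. The helper name find_span and
its parameters are hypothetical.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of <regexp.h>.  Built on POSIX
 * regcomp()/regexec() as a stand-in for compile() and step().
 * On a match, *start and *end play the roles of loc1 and loc2:
 * *start is the first matched character, *end is the character
 * after the last matched character.
 * Returns 1 on a match, 0 otherwise, -1 if the pattern does not
 * compile.
 */
int find_span(const char *pattern, const char *string,
              const char **start, const char **end)
{
    regex_t re;
    regmatch_t m[1];
    int rc;

    /* cflags of 0 selects POSIX basic regular expressions. */
    if (regcomp(&re, pattern, 0) != 0)
        return -1;

    rc = regexec(&re, string, 1, m, 0);
    regfree(&re);
    if (rc != 0)        /* any non-zero result is treated as no match */
        return 0;

    *start = string + m[0].rm_so;
    *end = string + m[0].rm_eo;
    return 1;
}
```

With this helper, matching ^a.*z$ against the string abcz
leaves start at the first character and end at the
terminating null, which mirrors the whole-string case
described for loc1 and loc2.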
RETURN VALUE
The function compile uses the macro RETURN on success and
the macro ERROR on failure (see above). The functions step
and advance return non-zero on a successful match and zero
if there is no match.
ERRORS
11 range endpoint too large.
16 bad number.
25 \ digit out of range.
36 illegal or missing delimiter.
41 no remembered search string.
42 \( \) imbalance.
43 too many \(.
44 more than 2 numbers given in \{ \}.
45 } expected after \.
46 first number exceeds second in \{ \}.
49 [ ] imbalance.
50 regular expression overflow.
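
An application's ERROR macro typically needs to turn the
error number into something readable. The following is a
minimal sketch of a lookup over the numbers listed above.
The function name regexp_errmsg and the fallback text are
hypothetical; they are not supplied by <regexp.h>.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not part of <regexp.h>: maps an error
 * number passed to ERROR(val) to the text from the ERRORS list.
 */
const char *regexp_errmsg(int val)
{
    switch (val) {
    case 11: return "range endpoint too large";
    case 16: return "bad number";
    case 25: return "\\ digit out of range";
    case 36: return "illegal or missing delimiter";
    case 41: return "no remembered search string";
    case 42: return "\\( \\) imbalance";
    case 43: return "too many \\(";
    case 44: return "more than 2 numbers given in \\{ \\}";
    case 45: return "} expected after \\";
    case 46: return "first number exceeds second in \\{ \\}";
    case 49: return "[ ] imbalance";
    case 50: return "regular expression overflow";
    default: return "unknown regexp error";  /* assumed fallback */
    }
}
```

An application-defined error routine could print this text
and then exit or longjmp, since the ERROR call is expected
never to return.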
EXAMPLE
The following is an example of how the regular expression
macros and calls might be defined by an application program:
#define INIT register char *sp = instring;
#define GETC() (*sp++)
#define PEEKC() (*sp)
#define UNGETC(c) (--sp)
#define RETURN(c) return (c);
/* regerr() is supplied by the application and must not return */
#define ERROR(c) regerr(c)
#include <regexp.h>
. . .
(void) compile(*argv, expbuf, &expbuf[ESIZE], '\0');
. . .
if (step(linebuf, expbuf))
succeed;
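
The input macros can also be exercised without <regexp.h>.
The following sketch defines them over a string pointer, in
the function-like form given in the SYNOPSIS, and reads a
pattern up to the eof delimiter roughly the way a
compile-like reader would consume it. The function
read_pattern and its parameters are hypothetical, and the
rule that a backslash protects the following character from
being taken as the delimiter is an assumption of this
sketch. UNGETC is defined for completeness but is not used
here.

```c
#include <stddef.h>

#define INIT        const char *sp = instring;
#define GETC()      (*sp++)
#define PEEKC()     (*sp)
#define UNGETC(c)   (--sp)

/*
 * Hypothetical helper, not part of <regexp.h>.  Copies the pattern
 * text that precedes the delimiter eof into out, reading the input
 * only through GETC() and PEEKC().
 * Returns the number of characters copied, or -1 if the delimiter
 * is missing or out is too small.
 */
int read_pattern(const char *instring, int eof, char *out, size_t outsize)
{
    INIT
    size_t n = 0;
    int c;

    while ((c = GETC()) != eof) {
        if (c == '\0')
            return -1;              /* delimiter never found */
        if (n + 2 >= outsize)
            return -1;              /* keep room for an escape pair and the null */
        out[n++] = (char) c;
        /* Assumed rule: a backslash keeps the next character,
           even when that character equals eof. */
        if (c == '\\' && PEEKC() != '\0')
            out[n++] = (char) GETC();
    }
    out[n] = '\0';
    return (int) n;
}
```

Calling read_pattern with eof set to '/' stops at the first
unescaped slash; calling it with eof set to '\0', as in the
compile call above, consumes the whole string.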