regcmp, regex
Purpose
Compiles and matches regular-expression patterns.
Library
Programmers Workbench Library (libPW.a)
Syntax
char *regcmp (str [, str, . . . ], (char *) 0)
char *str, *str, . . . ;

char *regex (pat, subject [, ret, . . . ])
char *pat, *subject, *ret, . . . ;
extern char *__loc1;
Description
The regcmp subroutine compiles a regular expression (or
pattern) and returns a pointer to the compiled form. The
str parameters specify the pattern to be compiled. If
more than one str parameter is given, then regcmp treats
them as if they were concatenated together. It returns a
NULL pointer if it encounters an incorrect parameter.
You can use the regcmp command to compile regular
expressions into your C program, frequently eliminating
the need to call the regcmp subroutine at run time.
The regex subroutine compares a compiled pattern to the
subject string. The additional ret parameters receive the
values matched by subexpressions marked with "$"n. Upon
successful completion, the regex subroutine returns a
pointer to the next unmatched character. If the regex
subroutine fails, it returns a NULL pointer. A global
character pointer, __loc1, points to where the match began.
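The return value and __loc1 together delimit the matched text. Because libPW is not installed on many current systems, the following is only a sketch of the same idea using the POSIX regcomp and regexec subroutines instead of regcmp and regex. The find_span helper is hypothetical and is not part of libPW.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of libPW): search subject for pattern.
 * On success it returns a pointer to the next unmatched character, like
 * the regex return value, and stores where the match began in *start,
 * like __loc1.  It returns NULL if the pattern does not compile or if
 * nothing matches.
 */
const char *find_span(const char *pattern, const char *subject,
                      const char **start)
{
    regex_t re;
    regmatch_t m[1];
    const char *next = NULL;

    assert(pattern != NULL && subject != NULL && start != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return NULL;                    /* incorrect pattern */
    if (regexec(&re, subject, 1, m, 0) == 0) {
        *start = subject + m[0].rm_so;  /* where the match began */
        next = subject + m[0].rm_eo;    /* next unmatched character */
    }
    regfree(&re);                       /* release the compiled form */
    return next;
}
```

The characters from *start up to, but not including, the returned pointer are the matched text.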
The regcmp and regex subroutines are borrowed from the ed
command; however, the syntax and semantics have been
changed slightly. You can use the following symbols with
the regcmp and regex subroutines:
"[ ] * . ^"
These symbols have the same meaning as they do in the
ed command.
"-"
For regex, the minus within brackets means "through"
according to the current collating sequence. For
example, "[a-z]" can be equivalent to "[abcd . . . xyz]"
or "[aBbCc . . . xYyZz]" or even "[aáàâbc . . . xyz]".
You can use the "-" by itself if the "-" is the last or
first character. For example, the character class
expression "[]-]" matches the "]" (right bracket) and
"-" (minus) characters.
The regcmp subroutine does not use the current
collating sequence, and the minus character in brackets
controls only a direct ASCII sequence. For example,
"[a-z]" always means "[abc . . . xyz]" and "[A-Z]"
always means "[ABC . . . XYZ]". If you need to
control the specific characters in a range using
regcmp, you must list them explicitly rather than
using the minus in the character class expression.
"$"
Matches the end of the string. Use "\n" to match a
new-line character.
"+"
A regular expression followed by "+" means one or more
times. For example, "[0-9]+" is equivalent to
"[0-9][0-9]*".
"{"m"}" "{"m,"}" "{"m,u"}"
Integer values enclosed in "{" "}" indicate the number
of times to apply the preceding regular expression. m
is the minimum number and u is the maximum number. u
must be less than 256. If you specify only m, it
indicates the exact number of times to apply the
regular expression. "{"m,"}" is equivalent to
"{"m,infinity"}" and matches m or more occurrences
of the expression. The "+" (plus) and "*"
(asterisk) operations are equivalent to "{1,}" and
"{0,}", respectively.
"(" . . . ")$"n
This stores the value matched by the enclosed regular
expression in the (n+1)th ret parameter. At most ten
enclosed regular expressions are allowed. regex makes
the assignments unconditionally.
"(" . . . ")"
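Parentheses group subexpressions. An operator, such
as "*", "+", or "{" "}" works on a single character or
on a regular expression enclosed in parentheses. For
example, "(a*(cb+)*)$0" applies "*" to the group
"(cb+)" and stores the text matched by the outer group
in the first ret parameter.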
All of the symbols defined above are special. You must
precede them with a "\" (backslash) if you want to match
the special symbol itself. For example, "\$" matches a
dollar sign.
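The repetition equivalences and the backslash escape described above can be checked experimentally. The following sketch assumes POSIX extended regular expressions (regcomp with REG_EXTENDED), which share the "+", "*", "{" "}", and "\" notation with regcmp but are a different dialect in other respects (for example, they have no "$"n suffix). The match_len and same_match helpers are hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: return the length of the leftmost match of the
 * POSIX extended pattern in subject, or -1 if there is no match or the
 * pattern does not compile.
 */
int match_len(const char *pattern, const char *subject)
{
    regex_t re;
    regmatch_t m[1];
    int len = -1;

    assert(pattern != NULL && subject != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    if (regexec(&re, subject, 1, m, 0) == 0)
        len = (int)(m[0].rm_eo - m[0].rm_so);
    regfree(&re);
    return len;
}

/* Return 1 if both patterns match the same number of characters. */
int same_match(const char *p1, const char *p2, const char *subject)
{
    return match_len(p1, subject) == match_len(p2, subject);
}
```

For example, same_match("[0-9]+", "[0-9]{1,}", "ab1234cd") compares the two forms of "one or more", and match_len("\\$", "cost $5") looks for a literal dollar sign (the backslash is doubled only because it appears in a C string literal).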
Note: regcmp uses the malloc subroutine to allocate the
space for the compiled pattern (the vector). Always free
vectors that are no longer required. If you do not, you
may run out of memory if regcmp is called repeatedly.
Use the following as a replacement for malloc to reuse
the same vector, thus saving time and space:
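/* . . . Your Program . . . */
char *malloc(n)   /* replacement: every call reuses one static buffer */
unsigned n;
{
    static int rebuf[256];   /* int elements keep the buffer word-aligned */

    /* Fail if the request does not fit; never pass this buffer to free. */
    return ((n <= sizeof(rebuf)) ? (char *)rebuf : NULL);
}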
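Another way to limit memory use is to compile a pattern once, reuse the compiled form for every subject, and release it exactly once. The following sketch shows that pattern with the POSIX regcomp, regexec, and regfree subroutines rather than regcmp and free, on the assumption that libPW is not available. The count_matching helper is hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: count how many of the n subject strings contain
 * a match for pattern.  The pattern is compiled once, reused for every
 * subject, and freed once.  Returns -1 if the pattern does not compile.
 */
int count_matching(const char *pattern, const char *const subjects[],
                   size_t n)
{
    regex_t re;
    size_t i;
    int count = 0;

    assert(pattern != NULL && subjects != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    for (i = 0; i < n; i++)
        if (regexec(&re, subjects[i], 0, NULL, 0) == 0)
            count++;
    regfree(&re);   /* one release for the single compilation */
    return count;
}
```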
Examples
1. To perform a simple match:
char *cursor, *newcursor, *ptr;
. . .
newcursor = regex((ptr = regcmp("^\n", (char *)0)), cursor);
free(ptr);
This matches a leading new-line character in the
subject string pointed to by "cursor".
2. To extract a substring that matches a pattern:
char ret0[9];
char *newcursor, *name;
. . .
name = regcmp("([A-Za-z][A-Za-z0-9]{0,7})$0", (char *)0);
newcursor = regex(name, "123Testing321", ret0);
This matches the eight-character identifier
"Testing3" and returns the address of the character
after the last matched character (which is stored in
"newcursor"). The string "Testing3" is copied into
the character array "ret0".
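For systems without libPW, the same extraction can be approximated with the POSIX regular-expression subroutines. This is a sketch under that assumption, not the regcmp interface: POSIX has no "$"n suffix, so the first parenthesized subexpression is read from the match array instead. The extract_first helper is hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <string.h>

/*
 * Hypothetical helper: copy the text matched by the first parenthesized
 * subexpression of pattern into ret (at most retsize - 1 characters plus
 * a terminating null).  Returns a pointer to the character after the
 * whole match, or NULL if the pattern is invalid, nothing matches, or
 * ret is too small.
 */
const char *extract_first(const char *pattern, const char *subject,
                          char *ret, size_t retsize)
{
    regex_t re;
    regmatch_t m[2];    /* m[0]: whole match, m[1]: first subexpression */
    const char *next = NULL;

    assert(pattern != NULL && subject != NULL);
    assert(ret != NULL && retsize > 0);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return NULL;
    if (regexec(&re, subject, 2, m, 0) == 0 && m[1].rm_so != -1) {
        size_t len = (size_t)(m[1].rm_eo - m[1].rm_so);
        if (len < retsize) {
            memcpy(ret, subject + m[1].rm_so, len);
            ret[len] = '\0';
            next = subject + m[0].rm_eo;
        }
    }
    regfree(&re);
    return next;
}
```

As in the example above, a nine-character array is large enough for an eight-character identifier and its terminating null character.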
Related Information
In this book: "malloc, free, realloc, calloc,"
"NCcollate, NCcoluniq, NCeqvmap, _NCxcol, _NLxcol," and
"regexp: compile, step, advance."
The ed and regcmp commands in AIX Operating System
Commands Reference.
"Overview of International Character Support" in Managing
the AIX Operating System.