lex(1) SDK R4.11 lex(1)
NAME
lex - generate programs for simple lexical tasks
SYNOPSIS
lex [ -tvn ] [ file ] ...
DESCRIPTION
Lex generates programs to do simple lexical analysis of text using
regular expressions. Lex reads its input files, or the standard
input if no files are named, to get a list of regular expressions the
generated program will look for, and C text to execute when each
expression is matched.
An output file lex.yy.c is produced that contains C code for the
generated program, which is named yylex. It must be linked using the
-ll switch, to get the lex library routines.
The input to lex is of the form:
declarations
%%
rules
%%
programs
Any of the sections may be empty. If the "programs" section is
empty, the "%%" that precedes it may be omitted. Thus the shortest
legal lex input is
%%
Rules
Each rule is of the form:
<expression> <action>
An <expression> defines a regular expression that yylex will try to
match. The <action> is the C code that yylex will execute when that
<expression> is matched.
yylex writes any input characters that match no expression to the
standard output.
The notation for lex regular expressions is described below. In the
description, X and Y stand for lex regular expressions, and x and y
stand for characters.
x An ordinary single character matches itself. Exceptions are
these meta-characters: "\[]^-?.*+|()$/{}%<>.
\x Matches x, except for these special escape sequences beginning
with a backslash:
\n matches newline
\t matches tab
\b matches backspace
\\ matches backslash
"xy" A string of characters in double quotes matches the string of
characters. Any special meaning those characters (except for
backslash) might otherwise have is ignored. The string "\x"
matches whatever \x would match. For example,
"." matches a period
"\n" matches newline
"[hello]\t"
matches the 7-character string "[hello]" followed by a
tab
. A period matches any character except newline.
[xy] A string of elements inside square brackets matches any
single character that any of the elements matches. Elements can
be any of the following:
single characters, which match themselves (except for "]"
anywhere and "-" immediately after the initial "[").
\x regular expressions, which match what they usually do.
triplets of characters x-y; these match any character from x
to y, inclusive. For example, [adm-p\n] matches any one of
these characters: a, d, m, n, o, p, newline.
A caret, ^, as the first character inside the square brackets
has special meaning: if S is a string of characters, then
[^S] matches any character except for newline and any
character that [S] would match.
XY matches anything that X would match concatenated with anything
that Y would match. For example, [ab][cd] matches "ac", "bc",
"ad", and "bd".
X* matches 0 or more successive strings each matched by X. For
example, c* matches the empty string, "c", "cc", and so forth.
X+ matches 1 or more successive strings each matched by X. For
example, c+ matches "c", "cc", and so forth.
X{j,k} where j and k are integers in the range [0,255], matches j to
k (inclusive) successive strings each matched by X. For
example, c{3,5} matches "ccc", "cccc", and "ccccc".
X{j} is equivalent to X{j,j}; it matches exactly j successive
strings each matched by X.
X{j,} matches j or more successive strings matched by X.
(X) matches whatever X matches.
X? matches the empty string and whatever X matches; it is
equivalent to X{0,1}. For example, (ab)? matches "ab" and
"".
X|Y matches anything that either X or Y would match. For example,
"bob"|(ab?c) matches "bob", "ac", and "abc".
^X A caret, ^, at the beginning of a regular expression restricts
it to only match strings at the beginning of a line. A caret
not at the beginning of a regular expression does not have
this effect. For example, ^Bob matches "Bob" when it occurs
at the beginning of a line, but nowhere else.
X$ A dollar sign, $, at the end of a regular expression restricts
it to only match strings at the end of a line. A dollar sign
not at the end of a regular expression does not have this
effect. For example, bye$ matches "bye" when it occurs at the
end of a line, but nowhere else.
X/Y restricts X to match only strings that are followed by
something Y matches. For example, (bob)/(white) matches "bob"
in the context "bobwhite" but not in the context "bobolink".
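Several of the operators above (concatenation, bracket expressions, *, +,
{j,k}, ? and |) are written the same way in POSIX extended regular
expressions, so their behavior can be tried out from C without running lex.
The sketch below uses the POSIX regcomp() and regexec() routines, not lex
itself; the helper name whole_match is hypothetical. Lex-only notation such
as double-quoted strings, X/Y, and {name} substitutions is not understood by
regcomp() and should not be passed to it.

```c
#include <regex.h>
#include <stdio.h>

/* Return 1 if the whole of text matches the POSIX extended regular
 * expression pattern, 0 if it does not, and -1 if the pattern is
 * invalid or too long for the local buffer. */
int whole_match(const char *pattern, const char *text)
{
    char anchored[256];
    regex_t re;
    int n, rc;

    /* Wrap the pattern so that it must cover the entire string. */
    n = snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (n < 0 || (size_t)n >= sizeof anchored)
        return -1;
    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

For instance, whole_match("c{3,5}", "cccc") reports a match, while
whole_match("c{3,5}", "cc") does not.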
Blanks or tabs can only appear within a regular expression if each
is:
· escaped with a backslash;
· inside double quotes; or
· within square brackets.
The <action> may be a single line of C code terminated with a
semicolon, or a sequence of C statements within curly braces { and }.
Lex provides the following for use in actions:
yytext Character pointer to the text matched by the regular
expression.
yyleng Length of text in yytext.
| Writing "|;" as the action for one rule gives that rule the
same action as the next rule. "|" may not be used inside curly
braces "{}".
ECHO Equivalent to
printf("%s", yytext)
REJECT Causes yylex to reject this match and continue looking to see
if other regular expressions will match it instead.
unput(c)
Routine that pushes a character back onto the input.
yyless(n)
Causes all but the first n characters of yytext to be pushed
back onto the input.
yymore()
Causes the next matched input string to be appended to the
end of yytext, rather than overwriting it.
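As an informal illustration of yyless(n), the following C sketch models a
scanner that has just matched some text and then gives part of it back. It
is a toy model under stated assumptions, not the code lex generates: the
struct toy_scanner and the functions toy_match and toy_yyless are
hypothetical names, the input is an in-memory string, and yyleng is held as
a size_t here for convenience.

```c
#include <string.h>

/* Toy scanner state. The names yytext and yyleng mirror the lex names,
 * but this struct is a hypothetical model, not lex's own data layout. */
struct toy_scanner {
    char yytext[64];      /* text of the current match */
    size_t yyleng;        /* length of yytext */
    const char *rest;     /* input not yet consumed */
};

/* Consume the next n characters of input as the current match.
 * Returns 0 on success, -1 if n does not fit or exceeds the input. */
int toy_match(struct toy_scanner *s, size_t n)
{
    if (n >= sizeof s->yytext || n > strlen(s->rest))
        return -1;
    memcpy(s->yytext, s->rest, n);
    s->yytext[n] = '\0';
    s->yyleng = n;
    s->rest += n;
    return 0;
}

/* Model of yyless(n): keep the first n characters of the match and
 * hand the remaining characters back to the input. */
int toy_yyless(struct toy_scanner *s, size_t n)
{
    if (n > s->yyleng)
        return -1;
    s->rest -= s->yyleng - n;   /* un-consume the tail of the match */
    s->yytext[n] = '\0';
    s->yyleng = n;
    return 0;
}
```

With the input "foobar baz", matching six characters and then calling
toy_yyless(&s, 3) leaves "foo" in yytext and "bar baz" waiting on the input.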
You can redefine several routines and macros to change how yylex
behaves. If you do this, you have to make sure that you remove the
default definitions from the resulting output from lex.
input()
By default, a macro that is called to read a character from
stdin. It returns 0 at end-of-file.
unput(c)
By default, a macro that is called to push the character c
back onto the input. The lex library allows 100 characters
worth of pushback.
If you redefine input() or unput(c), you must ensure that the
two of them are consistent with each other.
output(c)
By default, a macro that is called to write a character c to
stdout.
yyin File pointer for input; macro defined as stdin.
yyout File pointer for output; macro defined as stdout.
yywrap()
This routine is called when input() returns 0. If yywrap()
returns 1, yylex finishes wrapping up and returns. If
yywrap() returns 0, however, yylex continues to read input and
match expressions. The default yywrap() always returns 1.
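The requirement that replacement input() and unput(c) routines agree with
each other can be sketched in C as follows. This is a minimal example that
reads from a string instead of stdin; the names str_set_source, str_input,
and str_unput are hypothetical, and the pushback limit of 100 simply mirrors
the figure quoted above. Both routines share one pushback stack, which is
what keeps them consistent.

```c
#include <stddef.h>

#define PUSHBACK_MAX 100             /* mirrors the 100-character limit above */

static const char *src = "";         /* hypothetical input source: a C string */
static char pushback[PUSHBACK_MAX];  /* stack of pushed-back characters */
static size_t npushed = 0;           /* number of characters on the stack */

/* Point the reader at a new string and discard any pending pushback. */
void str_set_source(const char *s)
{
    src = s;
    npushed = 0;
}

/* Read one character: pushed-back characters first, then the string.
 * Returns 0 at end of input, as the default input() is described to do. */
int str_input(void)
{
    if (npushed > 0)
        return (unsigned char)pushback[--npushed];
    if (*src == '\0')
        return 0;
    return (unsigned char)*src++;
}

/* Push c back so that the next str_input() call returns it.
 * Returns 0 on success, -1 if the pushback stack is full. */
int str_unput(int c)
{
    if (npushed == PUSHBACK_MAX)
        return -1;
    pushback[npushed++] = (char)c;
    return 0;
}
```

How such routines are substituted for the defaults depends on the lex
implementation; with macro-based defaults it is usually a matter of
undefining input and unput in the declarations section and defining them in
terms of the replacements.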
Declarations
The declarations section may contain:
· C code to be placed at the head of lex.yy.c. Any lines
between lines containing only "%{" and "%}" are copied into
lex.yy.c.
· Lex substitution string definitions. Each such definition is
a line of the form:
name definition
The name must start in the first column and begin with a
letter, and it must be separated from the definition by one
or more blanks or tabs. The definition can be anything.
Such names may be used in expressions in the rules section by
surrounding them with curly braces, {}. For example,
DIGIT [0-9]
%%
{DIGIT}+ printf("integer");
The "{DIGIT}" is replaced by its definition "[0-9]".
· Start condition definitions. Each definition line is of the
form:
%Start cond1 cond2 ...
where the "%Start" begins in the first column. Each word
following it is declared to be the name of a start condition.
Expressions in the rules section may then be preceded by the
names of start conditions in angle brackets, <>; this
restricts them to be matched only when yylex is in the listed
start conditions. Several start conditions may be listed,
separated by commas; for example, "<cond1,cond2>".
The start condition yylex is in may be changed by an action
that executes a "BEGIN name;" statement, where "name" is the
name of a start condition. yylex is initially in start
condition 0, or INITIAL; "BEGIN 0;" or "BEGIN INITIAL;" will
reset it.
NOTE: Any expression not preceded by a start condition may be matched
at any time. For example,
%Start one two
%%
^one { ECHO; BEGIN one; }
^two { ECHO; BEGIN two; }
^zip { ECHO; BEGIN 0; }
<one>target { printf("one"); }
<two>target { printf("two"); }
Different rules for "target" will be executed depending on what
start condition is active.
· Table size limits for the finite state machine implemented by
yylex.
%p n Max number of positions is n (default 20000)
%n n Max number of states is n (default 4000)
%e n Max number of parse tree nodes is n (default 8000)
%a n Max number of transitions is n (default 16000)
%k n Max number of packed char classes is n (default 20000)
%o n Max number of output slots is n (default 24000)
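The effect of start conditions described earlier in this section can be
pictured as a single state variable that decides which rules are eligible.
The C sketch below is a hand-written, word-at-a-time model of a scanner with
two start conditions, "one" and "two", and a rule for "target" under each.
It is not what lex generates; the names cond, current, and feed are
hypothetical, and real lex matching works on characters rather than whole
words.

```c
#include <string.h>

/* Hypothetical names for the model's start conditions. */
enum cond { COND_INITIAL = 0, COND_ONE, COND_TWO };

static enum cond current = COND_INITIAL;

/* Process one word and return the text the scanner would write for it.
 * "one" and "two" are echoed and switch the start condition, which is
 * what an action of the form { ECHO; BEGIN name; } does. */
const char *feed(const char *word)
{
    if (strcmp(word, "one") == 0) {
        current = COND_ONE;
        return "one";
    }
    if (strcmp(word, "two") == 0) {
        current = COND_TWO;
        return "two";
    }
    if (strcmp(word, "target") == 0) {
        if (current == COND_ONE)
            return "one";           /* rule restricted to condition one */
        if (current == COND_TWO)
            return "two";           /* rule restricted to condition two */
    }
    return word;                    /* unmatched input is copied through */
}
```

Feeding the words "one", "target", "two", "target" in that order yields
"one" for the first "target" and "two" for the second.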
Programs
The programs section may contain anything you like. It is copied to
the end of lex.yy.c.
Any line in any of the three sections that begins with a space is
copied directly into lex.yy.c.
To use yylex, you must provide a program to call it and link them
with the "-ll" option. To use yylex with a yacc(1) parser, end the
action for each lex rule with
return(token);
where "token" is the appropriate token. Access to yacc's token names
may be ensured by including the yylex code in the yacc input file with
#include "lex.yy.c"
or generating the "y.tab.h" file with yacc's "-d" option and
including it with
#include "y.tab.h"
in the declarations section of the lex input.
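For the standalone case, where a program calls yylex directly rather than
through a yacc parser, the calling loop can be sketched as below. The sketch
assumes the usual lex convention that yylex returns 0 once input is
exhausted and a nonzero token code otherwise. So that it can be compiled
without any lex output, it uses a stand-in named fake_yylex; that function,
the lexer_fn type, count_tokens, and the token codes are all hypothetical.

```c
/* Signature shared by yylex and by the stand-in below. */
typedef int (*lexer_fn)(void);

/* Call the lexer until it returns 0 and report how many tokens it
 * returned before that. */
int count_tokens(lexer_fn lexer)
{
    int n = 0;

    while (lexer() != 0)
        n++;
    return n;
}

/* Stand-in for yylex: returns three made-up token codes, then 0 on
 * every later call. */
int fake_yylex(void)
{
    static const int tokens[] = { 257, 258, 257, 0 };
    static int next = 0;
    int t = tokens[next];

    if (t != 0)
        next++;
    return t;
}
```

In a real program the call would be count_tokens(yylex), with lex.yy.c
compiled in and the lex library linked as described above.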
Options
-t Output which normally goes to lex.yy.c is sent to stdout.
-v A one-line summary of the finite state machine implemented by
yylex is printed.
-n Cancels -v option.
International Features
lex can process characters from supplementary code sets as well as
ASCII characters.
Characters from supplementary code sets can be specified in comments
which exist in definitions, rules, and user subroutines.
Characters from supplementary code sets can be specified in strings
which exist in actions in rules and in user subroutines.
Character strings from supplementary code sets can be defined as
tokens.
EXAMPLE
D [0-9]
%%
if printf("IF statement\n");
[a-z]+ printf("tag, value %s\n",yytext);
0{D}+ printf("octal number %s\n",yytext);
{D}+ printf("decimal number %s\n",yytext);
"++" printf("unary op\n");
"+" printf("binary op\n");
"/*" { loop:
while (input() != '*'); /* skip to the next '*' */
switch (input())
{
case '/': break; /* comment is closed */
case '*': unput('*'); /* may start the closing pair; falls through */
default: goto loop;
}
}
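The comment-skipping action in this example keeps calling input() until it
sees a '*', so it relies on every comment being closed. The same scan,
written as a plain C function over an in-memory string and with an explicit
end-of-input check, might look like the sketch below. The function name
skip_comment is hypothetical, and the string argument stands in for the
characters that input() would deliver.

```c
/* p points just past an opening comment delimiter. Return a pointer to
 * the first character after the closing delimiter, or to the
 * terminating NUL if the comment is never closed. */
const char *skip_comment(const char *p)
{
    while (*p != '\0') {
        if (p[0] == '*' && p[1] == '/')
            return p + 2;       /* step over the closing pair */
        p++;
    }
    return p;                   /* unterminated: stop at end of input */
}
```

An action built on input() can apply the same idea by treating a return
value of 0 from input() as the end of the comment.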
NOTE
Remember, if you redefine any of the lex-furnished macros, you must
remove the default definitions from the output produced by lex.
SEE ALSO
yacc(1), malloc(3X).
Licensed material--property of copyright holder(s)