lex(1)
_________________________________________________________________
lex Command
generate programs for simple lexical tasks
_________________________________________________________________
SYNTAX
lex [ -tvn ] [ file ] ...
DESCRIPTION
Lex generates programs to do simple lexical analysis of text
using regular expressions. Lex reads its input files, or the
standard input if no files are named, to get a list of regular
expressions the generated program will look for, and C text to
execute when each expression is matched.
An output file lex.yy.c is produced that contains C code for the
generated program, which is named yylex. It must be linked using
the "-ll" switch, to get the lex library routines.
The input to lex is of the form:
declarations
%%
rules
%%
programs
Any of the sections may be empty. If the "programs" section is
empty, the "%%" that precedes it may be omitted. Thus the
shortest legal lex input is
%%
Each rule is of the form:
<expression> <action>
An <expression> defines a regular expression that yylex will try
to match. The <action> is the C code that yylex will execute
when that <expression> is matched.
yylex writes any input characters that match no expression to the
standard output.
The notation for lex regular expressions is described below. In
the description, X and Y stand for lex regular expressions, and x
and y stand for characters.
DG/UX 4.00 Page 1
Licensed material--property of copyright holder(s)
x An ordinary single character matches itself. Exceptions are
these meta-characters: "\[]^-?.*+|()$/{}%<>.
\x Matches x, except for these special escape sequences
beginning with a backslash:
\n matches newline
\t matches tab
\b matches backspace
\\ matches backslash
"xy" A string of characters in double quotes matches the string
of characters. Any special meaning those characters (except
for backslash) might otherwise have is ignored. The string
"\x" matches whatever \x would match. For example,
"." matches a period
"\n" matches newline
"[hello]\t"
matches the 8-character string consisting of "[hello]" followed
by a tab
. A period matches any character except newline.
[xy] A string of elements inside square brackets matches any single
character that any of the elements matches. Elements can be any
of the following:
* single characters, which match themselves (except for "]"
anywhere and "-" immediately after the initial "[");
* \x regular expressions, which match what they usually do; or
* triplets of characters x-y, which match any character from
x to y, inclusive.
For example, [adm-p\n] matches any one of these characters:
a, d, m, n, o, p, newline.
A caret, ^, as the first character inside the square
brackets has special meaning: if S is a string of
characters, then [^S] matches any character except for
newline and any character that [S] would match.
XY matches anything that X would match concatenated with
anything that Y would match. For example,
[ab][cd]
matches "ac", "bc", "ad", and "bd".
X* matches 0 or more successive strings each matched by X. For
example,
c* matches the empty string, "c", "cc", and so forth.
X+ matches 1 or more successive strings each matched by X. For
example,
c+ matches "c", "cc", and so forth.
X{j,k}
where j and k are integers in the range [0,255], matches j
to k (inclusive) successive strings each matched by X. For
example,
c{3,5}
matches "ccc", "cccc", and "ccccc".
X{j} is equivalent to X{j,j}; it matches exactly j successive
strings each matched by X.
X{j,}
matches j or more successive strings matched by X.
(X) matches whatever X matches.
X? matches the empty string and whatever X matches; it is
equivalent to X{0,1}. For example,
(ab)?
matches "ab" and "".
X|Y matches anything that either X or Y would match. For
example,
"bob"|(ab?c)
matches "bob", "ac", and "abc".
^X A caret, ^, at the beginning of a regular expression
restricts it to only match strings at the beginning of a
line. A caret not at the beginning of a regular expression
does not have this effect. For example,
^Bob matches "Bob" when it occurs at the beginning of a line, but
nowhere else.
X$ A dollar sign, $, at the end of a regular expression
restricts it to only match strings at the end of a line. A
dollar sign not at the end of a regular expression does not
have this effect. For example,
bye$
matches "bye" when it occurs at the end of a line, but
nowhere else.
X/Y restricts X to match only strings that are followed by
something Y matches. For example,
(bob)/(white)
matches "bob" in the context "bobwhite" but not in the
context "bobolink".
Blanks or tabs can only appear within a regular expression if
each is:
* escaped with a backslash;
* inside double quotes; or
* within square brackets.
The <action> may be a single line of C code terminated with a
semicolon, or a sequence of C statements within curly braces {
and }. Lex provides the following for use in actions:
yytext
Character pointer to the text matched by the regular
expression.
yyleng
Length of text in yytext.
| "|;" as the action for one rule is equivalent to the action
for the next rule. "|" may not be used inside curly braces
"{}".
ECHO Equivalent to
printf("%s", yytext)
REJECT
Causes yylex to reject this match and continue looking to
see if other regular expressions will match it instead.
unput(c)
Routine that pushes a character back onto the input.
yyless(n)
Causes all but the first n characters of yytext to be pushed
back onto the input.
yymore()
Causes the next input string to be matched to be catenated
onto the end of yytext, rather than overwriting it.
You can redefine several routines and macros to change how yylex
behaves:
input()
By default, a macro that is called to read a character from
stdin. It returns 0 at end-of-file.
unput(c)
By default, a macro that is called to push the character c
back onto the input. The lex library allows 100 characters
worth of pushback.
If you redefine input() or unput(c), you must ensure that
the two of them are consistent with each other.
output(c)
By default, a macro that is called to write a character c to
stdout.
yyin File pointer for input; macro defined as stdin.
yyout
File pointer for output; macro defined as stdout.
yywrap()
This routine is called when input() returns 0. If yywrap()
returns 1, yylex finishes wrapping up and returns. If
yywrap() returns 0, however, yylex continues to read input
and match expressions. The default yywrap() always returns
1.
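As a rough illustration of keeping input() and unput(c) consistent
with each other, the following C sketch reads from a list of
in-memory strings instead of stdin. The names set_sources, my_input,
my_unput, and my_yywrap are hypothetical and are not part of lex;
PUSHBACK_MAX simply mirrors the 100-character pushback figure quoted
above. It is a sketch under those assumptions, not the lex library's
implementation.

```c
#include <stddef.h>

#define PUSHBACK_MAX 100            /* mirrors the pushback limit quoted above */

static const char **sources;        /* hypothetical: NULL-terminated list of input strings */
static const char *cur = "";        /* unread part of the current string */
static char pushback[PUSHBACK_MAX]; /* stack of pushed-back characters */
static int npushed = 0;

/* Start reading from the first string in list and discard any pushback. */
void set_sources(const char **list)
{
    sources = list;
    cur = "";
    npushed = 0;
    if (sources != NULL && *sources != NULL)
        cur = *sources++;
}

/* Return the next character, or 0 at end of the current string.
   Pushed-back characters are returned first, most recent first. */
int my_input(void)
{
    if (npushed > 0)
        return (unsigned char)pushback[--npushed];
    if (*cur == '\0')
        return 0;
    return (unsigned char)*cur++;
}

/* Push c back so that the next my_input() call returns it.
   Characters beyond PUSHBACK_MAX are silently dropped in this sketch. */
void my_unput(int c)
{
    if (npushed < PUSHBACK_MAX)
        pushback[npushed++] = (char)c;
}

/* Follows the yywrap() convention: return 0 after switching to the
   next string so scanning continues, or 1 when no input is left. */
int my_yywrap(void)
{
    if (sources == NULL || *sources == NULL)
        return 1;
    cur = *sources++;
    return 0;
}
```

One plausible way to use this, assuming input and unput are macros as
described above, is to #undef them in the declarations section,
redefine them as calls to my_input() and my_unput(c), and have
yywrap() return my_yywrap(). The exact steps may differ between lex
implementations.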
The declarations section may contain:
* C code to be placed at the head of lex.yy.c. Any lines
between lines containing only "%{" and "%}" are copied into
lex.yy.c.
* Lex substitution string definitions. Each such definition
is a line of the form:
name definition
The name must start in the first column and begin with a
letter, and it must be separated from the definition by one
or more blanks or tabs. The definition can be anything.
Such names may be used in expressions in the rules section
by surrounding them with curly braces, {}. For example,
DIGIT [0-9]
%%
{DIGIT}+ printf("integer");
The "{DIGIT}" is replaced by its definition "[0-9]".
* Start condition definitions. Each definition line is of the
form:
%Start cond1 cond2 ...
where the "%Start" begins in the first column. Each word
following it is declared to be the name of a start
condition.
Expressions in the rules section may then be preceded by the
names of start conditions in angle brackets, <>; this
restricts them to be matched only when yylex is in the
listed start conditions. Several start conditions may be
listed, separated by commas; for example, "<cond1,cond2>".
The start condition yylex is in may be changed by an action
that executes a "BEGIN name;" statement, where "name" is the
name of a start condition. yylex is initially in start
condition 0; "BEGIN 0;" will reset it.
NOTE:
Any expression not preceded by a start condition may be
matched at any time. For example,
%Start one two zip
%%
^one { ECHO; BEGIN one; }
^two { ECHO; BEGIN two; }
^zip { ECHO; BEGIN zip; }
<one>target { printf("one"); }
<two>target { printf("two"); }
Different rules for "target" will be executed depending on
what start condition is active.
* Table size limits for the finite state machine implemented
by yylex.
%p n    Maximum number of positions is n (default 2000)
%n n    Maximum number of states is n (default 500)
%t n    Maximum number of parse tree nodes is n (default 1000)
%a n    Maximum number of transitions is n (default 3000)
The programs section may contain anything you like. It is copied
to the end of lex.yy.c.
Any line in any of the three sections that begins with a space is
copied directly into lex.yy.c.
To use yylex, you must provide a program to call it and link them
with the "-ll" option. To use yylex with a yacc(1) parser, end
the action for each lex rule with
return(token);
where "token" is the appropriate token. Access to yacc's token
names may be ensured either by including the yylex code in the
yacc input with
#include "lex.yy.c"
or by generating the "y.tab.h" file with yacc's "-d" option and
including it with
#include "y.tab.h"
in the declarations section of the lex input.
OPTIONS
-t Output which normally goes to lex.yy.c is sent to stdout.
-v A one-line summary of the finite state machine implemented
by yylex is printed.
-n Cancels -v option.
EXAMPLE
D       [0-9]
%%
if      printf("IF statement\n");
[a-z]+  printf("tag, value %s\n",yytext);
0{D}+   printf("octal number %s\n",yytext);
{D}+    printf("decimal number %s\n",yytext);
"++"    printf("unary op\n");
"+"     printf("binary op\n");
"/*"    {   int c;
        loop:
            while ((c = input()) != '*')
                if (c == 0) return(0);  /* end of file inside comment */
            switch (input())
                {
                case '/': break;
                case '*': unput('*');   /* fall through and rescan */
                default: goto loop;
                }
        }
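The loop in the comment rule can be tried outside lex. The following
C sketch applies the same scan to a string rather than to the input
stream. skip_comment is a hypothetical helper, not a lex routine, and
returning NULL for an unterminated comment is an assumption of this
sketch.

```c
#include <stddef.h>

/* s points just past the opening slash-star of a comment.
   Returns a pointer just past the closing star-slash, or NULL if
   the string ends before the comment is closed. */
const char *skip_comment(const char *s)
{
    for (;;) {
        /* Scan forward to the next '*', giving up at end of string. */
        while (*s != '*') {
            if (*s == '\0')
                return NULL;
            s++;
        }
        s++;                /* consume the '*' */
        if (*s == '/')
            return s + 1;   /* consume the '/': the comment is closed */
        /* Otherwise leave *s unread so that a second '*' is examined
           again, which is the job unput('*') does in the lex rule. */
    }
}
```

For instance, given the text " note **/rest", which is what follows
the opening of a comment, the function returns a pointer to "rest".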
SEE ALSO
yacc(1).
malloc(3X) in the Programmer's Reference for the DG/UX System.