lex(1) DG/UX 4.30 lex(1)
NAME
lex - generate programs for simple lexical tasks
SYNOPSIS
lex [ -tvn ] [ file ] ...
DESCRIPTION
Lex generates programs to do simple lexical analysis of text
using regular expressions. Lex reads its input files, or
the standard input if no files are named, to get a list of
regular expressions the generated program will look for, and
C text to execute when each expression is matched.
An output file lex.yy.c is produced that contains C code for
the generated program, which is named yylex. It must be
linked using the "-ll" switch, to get the lex library
routines.
The input to lex is of the form:
declarations
%%
rules
%%
programs
Any of the sections may be empty. If the "programs" section
is empty, the "%%" that precedes it may be omitted. Thus
the shortest legal lex input is
%%
Each rule is of the form:
<expression> <action>
An <expression> defines a regular expression that yylex will
try to match. The <action> is the C code that yylex will
execute when that <expression> is matched.
yylex writes any input characters that match no expression
to the standard output.
The notation for lex regular expressions is described below.
In the description, X and Y stand for lex regular
expressions, and x and y stand for characters.
x An ordinary single character matches itself.
Exceptions are these meta-characters: "\[]^-
?.*+|()$/{}%<>.
\x Matches x, except for these special escape sequences
beginning with a backslash:
\n matches newline
\t matches tab
\b matches backspace
\\ matches backslash
"xy" A string of characters in double quotes matches the
string of characters. Any special meaning those
characters (except for backslash) might otherwise have
is ignored. The string "\x" matches whatever \x would
match. For example,
"." matches a period
"\n" matches newline
"[hello]\t"
matches the 8-character string made up of "[hello]"
followed by a tab
. A period matches any character except newline.
[xy] A string of elements inside square brackets matches any
single character that any of the elements matches. Elements can be
any of the following:
single characters, which match themselves (except for
"]" anywhere and "-" immediately after the initial
"[").
\x regular expressions, which match what they usually
do.
triplets of characters x-y; these match any character
from x to y, inclusive.
For example, [adm-p\n] matches any one of these
characters: a, d, m, n, o, p, newline.
A caret, ^, as the first character inside the square
brackets has special meaning: if S is a string of
characters, then [^S] matches any character except for
newline and any character that [S] would match.
XY matches anything that X would match concatenated with
anything that Y would match. For example,
[ab][cd]
matches "ac", "bc", "ad", and "bd".
X* matches 0 or more successive strings each matched by X.
For example,
c* matches the empty string, "c", "cc", and so forth.
X+ matches 1 or more successive strings each matched by X.
For example,
c+ matches "c", "cc", and so forth.
X{j,k}
where j and k are integers in the range [0,255],
matches j to k (inclusive) successive strings each
matched by X. For example,
c{3,5}
matches "ccc", "cccc", and "ccccc".
X{j} is equivalent to X{j,j}; it matches exactly j
successive strings each matched by X.
X{j,}
matches j or more successive strings matched by X.
(X) matches whatever X matches.
X? matches the empty string and whatever X matches; it is
equivalent to X{0,1}. For example,
(ab)?
matches "ab" and "".
X|Y matches anything that either X or Y would match. For
example,
"bob"|(ab?c)
matches "bob", "ac", and "abc".
^X A caret, ^, at the beginning of a regular expression
restricts it to only match strings at the beginning of
a line. A caret not at the beginning of a regular
expression does not have this effect. For example,
^Bob matches "Bob" when it occurs at the beginning of a
line, but nowhere else.
X$ A dollar sign, $, at the end of a regular expression
restricts it to only match strings at the end of a
line. A dollar sign not at the end of a regular
expression does not have this effect. For example,
bye$
matches "bye" when it occurs at the end of a line, but
nowhere else.
X/Y restricts X to match only strings that are followed by
something Y matches. For example,
(bob)/(white)
matches "bob" in the context "bobwhite" but not in the
context "bobolink".
Blanks or tabs can only appear within a regular expression
if each is:
* escaped with a backslash;
* inside double quotes; or
* within square brackets.
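Several of the operators above also exist in POSIX extended regular
expressions, so a few of the examples can be checked from C with
regcomp(3) and regexec(3). The following is only a sketch under stated
assumptions: it needs a system that provides <regex.h>, the helper name
matches_whole is hypothetical, and lex-only notation (double-quoted
strings, X/Y, start conditions) has no counterpart here, so "bob" is
written without quotes.

```c
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper, not part of lex: returns 1 if the POSIX extended
 * regular expression `pattern` matches all of `text`, 0 if it does not,
 * and -1 if the pattern is too long or fails to compile. */
int matches_whole(const char *pattern, const char *text)
{
    char anchored[256];
    regex_t re;
    int n, rc;

    /* Wrap the pattern so that it must cover the entire string. */
    n = snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (n < 0 || (size_t)n >= sizeof anchored)
        return -1;

    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

For instance, matches_whole("c{3,5}", "cccc") reports a match, while
matches_whole("c{3,5}", "cc") does not.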
The <action> may be a single line of C code terminated with
a semicolon, or a sequence of C statements within curly
braces { and }. Lex provides the following for use in
actions:
yytext
Character pointer to the text matched by the regular
expression.
yyleng
Length of text in yytext.
| Using "|;" as the action for a rule gives that rule the
same action as the next rule. "|" may not be used inside
curly braces "{}".
ECHO Equivalent to
printf("%s", yytext)
REJECT
Causes yylex to reject this match and continue looking
to see if other regular expressions will match it
instead.
unput(c)
Routine that pushes a character back onto the input.
yyless(n)
Causes all but the first n characters of yytext to be
pushed back onto the input.
yymore()
Causes the next matched input string to be appended to
the end of yytext, rather than overwriting it.
You can redefine several routines and macros to change how
yylex behaves:
input()
By default, a macro that is called to read a character
from stdin. It returns 0 at end-of-file.
unput(c)
By default, a macro that is called to push the
character c back onto the input. The lex library
allows 100 characters' worth of pushback.
If you redefine input() or unput(c), you must ensure
that the two of them are consistent with each other.
output(c)
By default, a macro that is called to write a character
c to stdout.
yyin File pointer for input; macro defined as stdin.
yyout
File pointer for output; macro defined as stdout.
yywrap()
This routine is called when input() returns 0. If
yywrap() returns 1, yylex finishes wrapping up and
returns. If yywrap() returns 0, however, yylex
continues to read input and match expressions. The
default yywrap() always returns 1.
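As a sketch of what "consistent" means for a redefined input() and
unput(c): a character pushed back must be the next one read, with the
most recently pushed character returned first. The C fragment below
reads from a string in memory rather than stdin. All names in it
(src_set, src_input, src_unput, PUSHBACK_MAX) are hypothetical and are
not part of the lex library; how such functions would be wired in place
of the default macros is left out and may differ between lex versions.

```c
#include <stddef.h>

#define PUSHBACK_MAX 100             /* mirrors the 100-character pushback limit */

static const char *src_text = "";    /* text being scanned */
static size_t src_pos = 0;           /* next unread position in src_text */
static char pushback[PUSHBACK_MAX];  /* stack of pushed-back characters */
static int pushback_len = 0;

/* Select a new text to read and discard any pending pushback. */
void src_set(const char *text)
{
    src_text = text;
    src_pos = 0;
    pushback_len = 0;
}

/* Return the next character, or 0 at end of input, which is the
 * end-of-file convention documented for input(). */
int src_input(void)
{
    if (pushback_len > 0)
        return (unsigned char)pushback[--pushback_len];
    if (src_text[src_pos] == '\0')
        return 0;
    return (unsigned char)src_text[src_pos++];
}

/* Push c back so that the next src_input() returns it.
 * Returns 0 on success, -1 if the pushback stack is full. */
int src_unput(int c)
{
    if (pushback_len >= PUSHBACK_MAX)
        return -1;
    pushback[pushback_len++] = (char)c;
    return 0;
}
```

Because both functions share one pushback stack, a character given to
src_unput() is always seen by src_input() before any unread text.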
The declarations section may contain:
* C code to be placed at the head of lex.yy.c. Any lines
between lines containing only "%{" and "%}" are copied
into lex.yy.c.
* Lex substitution string definitions. Each such
definition is a line of the form:
name definition
The name must start in the first column and begin with
a letter, and it must be separated from the definition
by one or more blanks or tabs. The definition can be
anything.
Such names may be used in expressions in the rules
section by surrounding them with curly braces, {}. For
example,
DIGIT [0-9]
%%
{DIGIT}+ printf("integer");
The "{DIGIT}" is replaced by its definition "[0-9]".
* Start condition definitions. Each definition line is
of the form:
%Start cond1 cond2 ...
where the "%Start" begins in the first column. Each
word following it is declared to be the name of a start
condition.
Expressions in the rules section may then be preceded
by the names of start conditions in angle brackets, <>;
this restricts them to be matched only when yylex is in
the listed start conditions. Several start conditions
may be listed, separated by commas; for example,
"<cond1,cond2>".
The start condition yylex is in may be changed by an
action that executes a "BEGIN name;" statement, where
"name" is the name of a start condition. yylex is
initially in start condition 0; "BEGIN 0;" will reset
it.
NOTE:
Any expression not preceded by a start condition may be
matched at any time. For example,
%Start one two
%%
^one { ECHO; BEGIN one; }
^two { ECHO; BEGIN two; }
^zip { ECHO; BEGIN 0; }
<one>target { printf("one"); }
<two>target { printf("two"); }
Different rules for "target" will be executed depending
on what start condition is active.
* Table size limits for the finite state machine
implemented by yylex.
%p n   Maximum number of positions is n (default 2000)
%n n   Maximum number of states is n (default 500)
%t n   Maximum number of parse tree nodes is n (default 1000)
%a n   Maximum number of transitions is n (default 3000)
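One way to picture the start conditions described in this section is as
a single state variable that BEGIN assigns and that rules prefixed with
<cond> test. The C sketch below hand-simulates the one/two "target"
example a word at a time. It is not the code lex generates, every name
in it is hypothetical, and the zip rule is assumed to return yylex to
the initial condition 0.

```c
#include <string.h>

enum start_cond { COND_0 = 0, COND_ONE, COND_TWO };

static enum start_cond current = COND_0;   /* yylex starts in condition 0 */

/* Feed one word to the simulated scanner and return the text that the
 * matching action would write. at_line_start is nonzero when the word
 * begins a line, which is what the ^ in the rules requires. */
const char *feed_word(const char *word, int at_line_start)
{
    if (at_line_start && strcmp(word, "one") == 0) {
        current = COND_ONE;                /* ECHO; BEGIN one; */
        return word;
    }
    if (at_line_start && strcmp(word, "two") == 0) {
        current = COND_TWO;                /* ECHO; BEGIN two; */
        return word;
    }
    if (at_line_start && strcmp(word, "zip") == 0) {
        current = COND_0;                  /* ECHO; assumed reset to 0 */
        return word;
    }
    if (strcmp(word, "target") == 0) {
        if (current == COND_ONE)
            return "one";                  /* <one>target */
        if (current == COND_TWO)
            return "two";                  /* <two>target */
    }
    return word;   /* text matching no rule is copied to the output */
}
```

Feeding "one" at the start of a line and then "target" yields "one" for
the second word; after "two" at the start of a line, the same "target"
yields "two".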
The programs section may contain anything you like. It is
copied to the end of lex.yy.c.
Any line in any of the three sections that begins with a
space is copied directly into lex.yy.c.
To use yylex, you must provide a program to call it and link
them with the "-ll" option. To use yylex with a yacc(1)
parser, end the action for each lex rule with
return(token);
where "token" is the appropriate token. Access to yacc's
token names may be ensured by including the yylex code in
the yacc input with
#include "lex.yy.c"
or generating the "y.tab.h" file with yacc's "-d" option and
including it with
#include "y.tab.h"
in the declarations section of the lex input.
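When yylex is used without yacc, the calling program is typically a
loop that calls yylex until it returns 0 at the end of the input. The
sketch below shows such a loop. So that it compiles without running
lex, it includes a hand-written stand-in for yylex that replays a fixed
token list; the stand-in, the token codes, and the name count_tokens
are all hypothetical. With a real lex.yy.c, the stand-in would be
removed and the program linked with "-ll".

```c
#include <stdio.h>

/* Hypothetical token codes, of the kind yacc would place in y.tab.h. */
enum { TOK_NAME = 257, TOK_NUMBER = 258 };

/* Stand-in for the generated yylex: replays a fixed token list and
 * then keeps returning 0, as yylex does at the end of its input. */
static const int fake_tokens[] = { TOK_NAME, TOK_NUMBER, TOK_NUMBER, 0 };
static int fake_next = 0;

int yylex(void)
{
    if (fake_tokens[fake_next] == 0)
        return 0;
    return fake_tokens[fake_next++];
}

/* Driver loop: call yylex until it reports end of input, printing each
 * token code, and return how many tokens were seen. */
int count_tokens(void)
{
    int tok;
    int n = 0;

    while ((tok = yylex()) != 0) {
        printf("token %d\n", tok);
        n++;
    }
    return n;
}
```

A main() that simply calls count_tokens() would serve as the program
that drives the scanner.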
OPTIONS
-t Output which normally goes to lex.yy.c is sent to
stdout.
-v A one-line summary of the finite state machine
implemented by yylex is printed.
-n Cancels the -v option.
EXAMPLE
D [0-9]
%%
if printf("IF statement\n");
[a-z]+ printf("tag, value %s\n",yytext);
0{D}+ printf("octal number %s\n",yytext);
{D}+ printf("decimal number %s\n",yytext);
"++" printf("unary op\n");
"+" printf("binary op\n");
"/*" { loop:
while (input() != '*');
switch (input())
{
case '/': break;
case '*': unput('*');
default: goto loop;
}
}
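The comment-skipping action in the example can be traced in plain C.
The sketch below follows the same logic but walks a string with a
pointer instead of calling input() and unput(c). The function name
skip_comment is hypothetical, and the end-of-text checks are an
addition: the action as written in the example keeps reading if the
comment is never closed.

```c
/* Hypothetical helper, not part of lex. p points just past the opening
 * slash-star of a comment. Returns a pointer just past the closing
 * star-slash, or a pointer to the terminating NUL if the comment is
 * never closed. */
const char *skip_comment(const char *p)
{
    for (;;) {
        /* Same job as: while (input() != '*'); */
        while (*p != '\0' && *p != '*')
            p++;
        if (*p == '\0')
            return p;             /* unterminated comment */
        p++;                      /* consume the '*' */

        if (*p == '/')
            return p + 1;         /* case '/': end of the comment */

        /* A second '*' is left unread, like unput('*'), so the next
         * pass can pair it with a following '/'. Any other character
         * is consumed before looking again. */
        if (*p != '*' && *p != '\0')
            p++;
    }
}
```

Given the text " note **/rest", which ends a comment with two stars
before the slash, the function returns a pointer to "rest".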
SEE ALSO
yacc(1).
malloc(3X) in the Programmer's Reference for the DG/UX
System.
Licensed material--property of copyright holder(s)