lex(CP)                       6 January 1993                       lex(CP)

Name

   lex - a lexical-analyzer generator

Syntax

   lex [ -tvn ] [ specifications-files ] ...

Description

   The lex command generates C code which implements a lexical
   analyzer--a routine which reads input text and separates it into
   tokens.

   The lex input specifications (that is, all the input files
   concatenated together) consist of three sections: declarations; a
   rules section consisting of regular expressions (patterns) which
   define the token classes and usually some C code to be executed when
   tokens are found; and subroutines. The first and third sections are
   optional. The sections are delimited by the sequence %%. The rules
   section must start with this delimiter.

   lex generates a file of C code called lex.yy.c. This file must be
   compiled by the C compiler and linked with a main routine. The
   program should be linked with the lex library, using the -ll option
   to cc or ld. This library supplies a main routine.

   The lexical analyzer routine produced is called yylex. This routine
   reads its input and, when a token is recognized, executes the code
   associated with the token class. The default action is to write the
   token to the standard output. The string matched by the regular
   expression defining the token class is placed in yytext, a character
   array. The variable yyleng gives the length of the matched string.
   The value of yytext may be copied into an external array to make it
   available to other routines.

   The regular expressions understood by lex contain many of the usual
   operators and special characters. The following table summarizes
   these:

   string          the literal string
   *               zero or more occurrences of the preceding pattern
   +               one or more occurrences of the preceding pattern
   ?               zero or one occurrence of the preceding pattern
   .               any single character
   |               alternation
   ( )             used for grouping
   ^               beginning of an input line
   $               end of an input line
   pattern{n,m}    n to m occurrences of pattern
   pattern{n}      n occurrences of pattern
   [string]        any character in string
   [^string]       any character not in string
   [char1-char2]   any character in the range char1-char2

   Special characters can be escaped or quoted if they are to be used
   as ordinary characters. The standard C escape sequences are
   understood. Regular expressions may be concatenated. The character
   ``/'' in an expression indicates that the expression that follows it
   must also be matched in order for the token to be matched; only the
   part of the input matched by the expression up to the slash is
   placed in yytext.

   The declarations section of a lex input file may contain variable
   declarations, #include statements, and abbreviations for regular
   expressions. The subroutines section contains user-defined functions
   used by the lexical analyzer.

   Any line beginning with a blank is assumed to contain only C text
   and is copied to the file lex.yy.c; if it is in the declarations
   section, it is copied into the external definition area of the
   lex.yy.c file. Variable declarations and #include statements should
   be placed in a section delimited by %{ and %}. Abbreviations consist
   of a symbol on the left of the line and its replacement text to the
   right. When abbreviations are used, they are surrounded by curly
   braces, {}.

   Three I/O routines are defined: input() reads a character; unput(c)
   returns a character to the input stream; output(c) outputs a
   character. These routines may be redefined by the user.

   Other built-in routines include the following: REJECT, on the right
   side of the rule, causes the match to be rejected and the next
   suitable match executed; the function yymore() accumulates
   additional characters into yytext; the function yyless(p) pushes
   back the portion of the string matched beginning at position p.

   The variable names generated by lex all begin with the prefix yy or YY.
   Users should avoid defining variables starting with these prefixes.

   The lexical analyzer's implementation involves a finite state
   machine; this state machine can be configured in the declarations
   section. This is done with a declaration of the following form,
   where x is a key letter and n is an integer:

   %x n

   The following parameters may be set in this way:

   Key letter   Meaning                              Default
   _________________________________________________________
   p            number of positions                  2500
   n            number of states                     500
   e            number of parse tree nodes           1000
   a            number of transitions                2000
   k            number of packed character classes   1000
   o            size of output array                 3000

   The use of one or more of the above automatically causes a summary
   of statistics to be printed. See the -v and -n options, below.

Options

   The options must appear before any files.

   -t   Causes the generated code to be written to the standard output
        rather than to lex.yy.c.

   -v   Provides a one-line summary of statistics. This is flagged
        automatically if any finite state machine parameters are set.

   -n   Suppresses the summary of statistics even if -v is turned on.

   Multiple files on the command line are concatenated and treated as a
   single file. If no files are given, standard input is used.

Example

   The following is an example of a lex specification. It shows the use
   of each of the three sections in the input.

   %{
   #include "global.h"
   int count;
   void skipcommnts();
   %}
   D       [0-9]
   %%
   if      { printf("IF statement\n"); count++; }
   [a-z]+  printf("tag, value %s\n", yytext);
   0{D}+   printf("octal number %s\n", yytext);
   {D}+    printf("decimal number %s\n", yytext);
   "++"    printf("unary op\n");
   "+"     printf("binary op\n");
   "/*"    skipcommnts();
   %%
   void skipcommnts()
   {
       int c;

       for (;;) {
           /* scan forward to the next '*' */
           while (input() != '*')
               ;
           /* a '/' right after the '*' closes the comment */
           if ((c = input()) == '/')
               return;
           /* otherwise push the character back; it may itself be a '*' */
           unput(c);
       }
   }

See also

   yacc(CP)

Standards conformance

   lex is conformant with:

   AT&T SVID Issue 2;
   and X/Open Portability Guide, Issue 3, 1989.
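
Notes

   The way input() and unput(c) cooperate in the comment-skipping
   routine of the Example section can be tried outside of lex. The
   following is only a sketch under stated assumptions: struct source,
   src_input(), src_unput() and skip_comment() are hypothetical names
   that stand in for the lex input stream and are not part of lex or
   its library. The sketch also assumes that running out of text is
   reported by a return value of 0, which this manual page does not
   specify for input().

```c
#include <stddef.h>

/* Hypothetical stand-in for the lex input stream: a string plus a
   small pushback stack. None of these names come from lex itself. */
struct source {
    const char *text;   /* characters not yet read */
    char pushed[8];     /* characters handed back by src_unput() */
    size_t npushed;     /* number of characters in pushed[] */
};

/* Plays the role of input(): return the next character,
   or 0 when the text is exhausted (an assumption of this sketch). */
static int src_input(struct source *s)
{
    if (s->npushed > 0)
        return (unsigned char)s->pushed[--s->npushed];
    if (*s->text == '\0')
        return 0;
    return (unsigned char)*s->text++;
}

/* Plays the role of unput(c): make c the next character that
   src_input() returns. Silently ignored if the stack is full. */
static void src_unput(struct source *s, int c)
{
    if (s->npushed < sizeof s->pushed)
        s->pushed[s->npushed++] = (char)c;
}

/* Consume text up to and including the closing star-slash pair.
   Returns 1 if the comment was closed, 0 if the text ran out first. */
static int skip_comment(struct source *s)
{
    int c;

    for (;;) {
        /* scan forward to the next '*' */
        while ((c = src_input(s)) != '*')
            if (c == 0)
                return 0;
        c = src_input(s);
        if (c == '/')
            return 1;
        if (c == 0)
            return 0;
        /* not the end of the comment: hand the character back,
           since it may itself be the '*' that starts the closer */
        src_unput(s, c);
    }
}
```

   A caller would initialize a struct source with the text that follows
   the opening delimiter of a comment and call skip_comment() on it;
   afterwards the text member points just past the closing delimiter.
   Pushing back the character that was actually read, rather than a
   fixed one, is what lets a run of several '*' characters before the
   closing '/' be handled correctly.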