⇒ lex(CP) — OpenDesktop Software Development System 3.0.0


 lex(CP)                        6 January 1993                        lex(CP)


 Name

    lex - a lexical-analyzer generator

 Syntax

    lex [ -tvn ] [ specifications-files ] ...

 Description

    The lex command generates C code which implements a lexical analyzer--a
    routine which reads input text and separates it into tokens.  The lex
    input specifications (that is, all the input files concatenated together)
    consist of three sections: declarations; a rules section consisting of
    regular expressions (patterns) which define the token classes and usually
    some C code to be executed when tokens are found; and subroutines. The
    first and third sections are optional.  The sections are delimited by the
    sequence %%. The rules section must start with this delimiter.

    lex generates a file of C code called lex.yy.c. This file must be com-
    piled by the C compiler and linked with a main routine. The program
    should be linked with the lex library, using the -ll option to cc or ld.
    This library supplies a main routine.  The lexical analyzer routine pro-
    duced is called yylex. This routine reads its input and, when a token is
    recognized, executes the code associated with the token class.  The
    default action is to write the token to the standard output.  The string
    matched by the regular expression defining the token class is placed in
    yytext, a character array. The variable yyleng gives the length of the
    matched string.  The value of yytext may be copied into an external array
    to make it available to other routines.
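
    Because yytext is overwritten by the next match, a routine that needs
    a token later has to copy it first.  The following is a minimal sketch
    of such a copy and is not part of lex; save_token is a hypothetical
    helper name, and allocating the copy with malloc is an assumption of
    this sketch.

```c
#include <stdlib.h>
#include <string.h>

/*
 * save_token is a hypothetical helper, not part of lex.  It copies the
 * first len characters of text into newly allocated storage and ends
 * the copy with a null character.  It returns NULL on failure.
 */
char *save_token(const char *text, int len)
{
        char *copy;

        if (text == NULL || len < 0)
                return NULL;
        copy = malloc((size_t)len + 1);
        if (copy == NULL)
                return NULL;
        memcpy(copy, text, (size_t)len);
        copy[len] = '\0';
        return copy;
}
```

    Within the action of a rule this might be called as
    save_token(yytext, yyleng); the caller is then responsible for
    freeing the copy.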

    The regular expressions understood by lex contain many of the usual
    operators and special characters.  The following table summarizes these:

                      string       the literal string
                         *         zero or more occurrences of
                                   the preceding pattern
                         +         one or more occurrences of
                                   the preceding pattern
                         ?         zero or one occurrences of
                                   the preceding pattern
                         .         any single character
                         |         alternation
                        ( )        used for grouping
                         ^         beginning of an input line
                         $         end of an input line
                   pattern{n,m}    n to m occurrences of pat-
                                   tern
                    pattern{n}     n occurrences of pattern
                     [string]      any character in string
                     [^string]     any character not in string
                   [char1-char2]   any character in the range
                                   char1-char2

    Special characters can be escaped or quoted if they are to be used as
    ordinary characters. The standard C escape sequences are understood. Reg-
    ular expressions may be concatenated.  The character ``/'' in an expres-
    sion indicates that the expression that follows must be matched in order
    for the token to be matched; only the part of the expression up to the
    slash is placed in yytext.

    The declarations section of a lex input file may contain variable
    declarations, #include statements, and abbreviations for regular expres-
    sions. The subroutines section contains user-defined functions used by
    the lexical analyzer.

    Any line beginning with a blank is assumed to contain only C text and is
    copied to the file lex.yy.c; if it is in the declarations section, it is
    copied into the external definition area of the lex.yy.c file.  Variable
    declarations and #include statements should be placed in a section delim-
    ited by %{ and %}.  Abbreviations consist of a symbol on the left of the
    line and its replacement text to the right.  When abbreviations are used
    they are surrounded by curly braces, {}.

    Three I/O routines are defined: input() reads a character; unput(c)
    returns a character to the input stream; output(c) outputs a character.
    These routines may be redefined by the user.
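
    The way input() and unput(c) work together can be modelled with a
    small pushback stack.  The following is a sketch only and is not the
    lex library's implementation; the names sk_open, sk_input and
    sk_unput, the string used as the input source, and the limit of 100
    characters are all assumptions made for illustration, as is the
    convention of returning 0 at the end of the input.

```c
#define SK_PUSHBACK_MAX 100     /* assumed limit for this sketch */

static const char *sk_src = "";          /* text still to be read */
static char sk_pushed[SK_PUSHBACK_MAX];  /* characters given back by sk_unput */
static int  sk_npushed = 0;              /* number of characters in sk_pushed */

/* Start reading from the given null-terminated string. */
void sk_open(const char *text)
{
        sk_src = text;
        sk_npushed = 0;
}

/* Read one character; returns 0 at the end of the input. */
int sk_input(void)
{
        if (sk_npushed > 0)             /* pushed-back characters come first */
                return (unsigned char)sk_pushed[--sk_npushed];
        if (*sk_src == '\0')
                return 0;
        return (unsigned char)*sk_src++;
}

/* Give c back to the input so that the next sk_input() reads it again. */
void sk_unput(int c)
{
        if (sk_npushed < SK_PUSHBACK_MAX)
                sk_pushed[sk_npushed++] = (char)c;
}
```

    Characters given back are read again in last-in, first-out order,
    which is what allows a routine to look one character ahead and then
    return it.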

    Other built-in routines include the following:  REJECT, on the right side
    of the rule, causes the match to be rejected and the next suitable match
    executed; the function yymore() accumulates additional characters into
    yytext; the function yyless(p) pushes back the portion of the string
    matched beginning at position p.
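
    The effect of yyless(p) can be pictured as shortening the token and
    returning the remainder to the input.  A minimal sketch under that
    reading follows; sk_less, sk_pushed and the limit of 100 characters
    are hypothetical names and values chosen for illustration, and a real
    analyzer does this work inside the generated code.

```c
#define SK_PUSHBACK_MAX 100     /* assumed limit for this sketch */

static char sk_pushed[SK_PUSHBACK_MAX];  /* pushed-back characters (a stack) */
static int  sk_npushed = 0;              /* number of characters in sk_pushed */

/*
 * Keep the first p characters of the token in text (of length len) and
 * push the remainder back for rescanning.  Returns the new length.
 */
int sk_less(char *text, int len, int p)
{
        int i;

        if (p < 0 || p > len || len - p > SK_PUSHBACK_MAX - sk_npushed)
                return len;             /* out of range: token is unchanged */
        for (i = len - 1; i >= p; i--)  /* the last character is pushed first */
                sk_pushed[sk_npushed++] = text[i];
        text[p] = '\0';
        return p;
}
```

    Pushing the characters back from right to left means that they are
    read again in their original order.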

    The variable names generated by lex all begin with the prefix yy or YY.
    Users should avoid defining variables starting with these prefixes.

    The lexical analyzer's implementation involves a finite state machine;
    this state machine can be configured in the declarations section. This is
    done with a declaration of the following form, where x is a key letter
    and n is an integer:

       %x n

    The following parameters may be set in this way:

            Key letter   Meaning                              Default
            _________________________________________________________
                p        number of positions                   2500
                n        number of states                       500
                e        number of parse tree nodes            1000
                a        number of transitions                 2000
                k        number of packed character classes    1000
                o        size of output array                  3000

    The use of one or more of the above automatically causes a summary of
    statistics to be printed. See -v and -n options, below.

 Options

    The options must appear before any files.

    -t  Writes the generated code to the standard output rather than to
        lex.yy.c.

    -v  Provides a one-line summary of statistics. This option is turned on
        automatically if any finite state machine parameters are set.

    -n  Suppresses the summary of statistics even if -v is turned on.

    Multiple files on the command line are concatenated and treated as a sin-
    gle file.  If no files are given, standard input is used.

 Example

    The following is an example of a lex specification.  It shows the use of
    each of the three sections in the input.

       %{
       #include "global.h"
       int count;
       %}
       D        [0-9]
       %%
       if       {
                       printf("IF statement\n");
                       count++;
                }
       [a-z]+    printf("tag, value %s\n",yytext);
       0{D}+      printf("octal number %s\n",yytext);
       {D}+       printf("decimal number %s\n",yytext);
       "++"       printf("unary op\n");
       "+"        printf("binary op\n");
       "/*"       skipcommnts();
       %%
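        skipcommnts()
        {
               int c;

               for (;;)
               {
                       /* skip to the next '*'; input() returns 0 at end of file */
                       while ((c = input()) != '*')
                               if (c == 0)
                                       return;
                       /* the comment ends only if '/' follows the '*' */
                       if ((c = input()) == '/')
                               return;
                       unput(c);       /* not the end; rescan this character */
               }
        }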


 See also

    yacc(CP)

 Standards conformance

    lex is conformant with:
    AT&T SVID Issue 2;
    and X/Open Portability Guide, Issue 3, 1989.

