                                                                   lex(1)



        _________________________________________________________________
        lex                                                       Command
        generate programs for simple lexical tasks
        _________________________________________________________________


        SYNTAX

        lex [ -tvn ] [ file ] ...


        DESCRIPTION

        Lex generates programs to do simple lexical analysis of text
        using regular expressions.  Lex reads its input files, or the
        standard input if no files are named, to get a list of regular
        expressions the generated program will look for, and C text to
        execute when each expression is matched.

        An output file lex.yy.c is produced that contains C code for the
        generated program, which is named yylex.  It must be linked using
        the "-ll" switch, to get the lex library routines.

        The input to lex is of the form:

             declarations
             %%
             rules
             %%
             programs

        Any of the sections may be empty.  If the "programs" section is
        empty, the "%%" that precedes it may be omitted.  Thus the
        shortest legal lex input is

             %%

        Each rule is of the form:

             <expression> <action>

        An <expression> defines a regular expression that yylex will try
        to match.  The <action> is the C code that yylex will execute
        when that <expression> is matched.

        yylex writes any input characters that match no expression to the
        standard output.

        The notation for lex regular expressions is described below.  In
        the description, X and Y stand for lex regular expressions, and x
        and y stand for characters.



        DG/UX 4.00                                                 Page 1
               Licensed material--property of copyright holder(s)

        x    An ordinary single character matches itself.  Exceptions are
             these meta-characters:  "\[]^-?.*+|()$/{}%<>.

        \x   Matches x, except for these special escape sequences
             beginning with a backslash:

            \n   matches newline

            \t   matches tab

            \b   matches backspace

            \\   matches backslash

        "xy" A string of characters in double quotes matches the string
             of characters.  Any special meaning those characters (except
             for backslash) might otherwise have is ignored.  The string
             "\x" matches whatever \x would match.  For example,

        "."  matches a period

        "\n" matches newline

        "[hello]\t"
             matches the 8-character string made up of "[hello]" followed
             by a tab

        .    A period matches any character except newline.

        [xy] A string of elements inside square brackets matches any
             character any of the elements match.  Elements can be any of
             the following:

             single characters, which match themselves (except for "]"
             anywhere and "-" immediately after the initial "[").

             \x escape sequences, which match what they usually do.

             triplets of characters x-y;  these match any character from
             x to y, inclusive.

             For example, [adm-p\n] matches any one of these characters:
             a, d, m, n, o, p, newline.

             A caret, ^, as the first character inside the square
             brackets has special meaning:  if S is a string of
             characters, then [^S] matches any character except for
             newline and any character that [S] would match.

        XY   matches anything that X would match concatenated with
             anything that Y would match.  For example,

        [ab][cd]
             matches "ac", "bc", "ad", and "bd".

        X*   matches 0 or more successive strings each matched by X.  For
             example,


        c*   matches the empty string, "c", "cc", and so forth.

        X+   matches 1 or more successive strings each matched by X.  For
             example,


        c+   matches "c", "cc", and so forth.

        X{j,k}
             where j and k are integers in the range [0,255], matches j
             to k (inclusive) successive strings each matched by X.  For
             example,


        c{3,5}
             matches "ccc", "cccc", and "ccccc".

        X{j} is equivalent to X{j,j};  it matches exactly j successive
             strings each matched by X.

        X{j,}
             matches j or more successive strings matched by X.

        (X)  matches whatever X matches.

        X?   matches the empty string and whatever X matches; it is
             equivalent to X{0,1}.  For example,


        (ab)?
             matches "ab" and "".

        X|Y  matches anything that either X or Y would match.  For
             example,


        "bob"|(ab?c)
             matches "bob", "ac", and "abc".

        ^X   A caret, ^, at the beginning of a regular expression
             restricts it to only match strings at the beginning of a
             line.  A caret not at the beginning of a regular expression
             does not have this effect.  For example,

        ^Bob matches "Bob" when it occurs at the beginning of a line, but
             nowhere else.

        X$   A dollar sign, $, at the end of a regular expression
             restricts it to only match strings at the end of a line.  A
             dollar sign not at the end of a regular expression does not
             have this effect.  For example,


        bye$
             matches "bye" when it occurs at the end of a line, but
             nowhere else.

        X/Y  restricts X to match only strings that are followed by
             something Y matches.  For example,


        (bob)/(white)
             matches "bob" in the context "bobwhite" but not in the
             context "bobolink".
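
        The X/Y form can be pictured with a small hand-written C sketch.
        This is not code that lex generates.  The function name
        match_with_trailing_context and its arguments are hypothetical,
        and the sketch assumes that X and Y are plain literal strings.
        It shows only that Y must be present after X, and that the
        characters matched by Y are not consumed.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of lex: models the rule X/Y for the
 * special case where X and Y are literal strings.  Returns the number
 * of characters consumed by the match (the length of X), or 0 if the
 * rule does not apply at s.  The characters matched by Y are checked
 * but never consumed. */
size_t match_with_trailing_context(const char *s, const char *x, const char *y)
{
    size_t xlen = strlen(x);
    size_t ylen = strlen(y);

    if (strncmp(s, x, xlen) != 0)
        return 0;               /* X does not match here */
    if (strncmp(s + xlen, y, ylen) != 0)
        return 0;               /* X matches, but Y does not follow */
    return xlen;                /* only X is consumed */
}
```

        With the strings from the example above, the call with
        "bobwhite" consumes the three characters of "bob", and the call
        with "bobolink" consumes nothing.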

        Blanks or tabs can only appear within a regular expression if
        each is:

        *    escaped with a backslash;

        *    inside double quotes;  or

        *    within square brackets.

        The <action> may be a single line of C code terminated with a
        semicolon, or a sequence of C statements within curly braces {
        and }.  Lex provides the following for use in actions:

        yytext
             Character pointer to the text matched by the regular
             expression.

        yyleng
             Length of text in yytext.

        |    Using "|;" as the action for a rule makes that rule share
             the action of the next rule.  "|" may not be used inside
             curly braces "{}".

        ECHO Equivalent to

             printf("%s", yytext)

        REJECT
             Causes yylex to reject this match and continue looking to
             see if other regular expressions will match it instead.

        unput(c)
             Routine that pushes a character back onto the input.

        yyless(n)
             Causes all but the first n characters of yytext to be pushed
             back onto the input.

        yymore()
             Causes the next input string to be matched to be catenated
             onto the end of yytext, rather than overwriting it.

        You can redefine several routines and macros to change how yylex
        behaves:

        input()
             By default, a macro that is called to read a character from
             stdin.  It returns 0 at end-of-file.

        unput(c)
             By default, a macro that is called to push the character c
             back onto the input.  The lex library allows 100 characters
             worth of pushback.

             If you redefine input() or unput(c), you must ensure that
             the two of them are consistent with each other.

        output(c)
             By default, a macro that is called to write a character c to
             stdout.

        yyin File pointer for input;  macro defined as stdin.

        yyout
             File pointer for output;  macro defined as stdout.

        yywrap()
             This routine is called when input() returns 0.  If yywrap()
             returns 1, yylex finishes wrapping up and returns.  If
             yywrap() returns 0, however, yylex continues to read input
             and match expressions.  The default yywrap() always returns
             1.
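
        As an illustration of keeping input() and unput(c) consistent,
        here is a minimal sketch that assumes the scanner should read
        from an in-memory string rather than stdin.  The names
        set_source, str_input, and str_unput are hypothetical.  How they
        would be wired to the input and unput macros is not shown, since
        that depends on the lex implementation.

```c
#include <stddef.h>

#define PUSHBACK_MAX 100        /* mirrors the limit stated above */

static const char *src = "";    /* hypothetical input source: a C string */
static char pushback[PUSHBACK_MAX];
static size_t npushed = 0;

/* Hypothetical: select the string that str_input() reads from. */
void set_source(const char *s)
{
    src = s;
    npushed = 0;
}

/* Stand-in for input(): pushed-back characters come first, then the
 * string; returns 0 at end of input. */
int str_input(void)
{
    if (npushed > 0)
        return (unsigned char)pushback[--npushed];
    if (*src == '\0')
        return 0;
    return (unsigned char)*src++;
}

/* Stand-in for unput(c): returns 0 on success, -1 if the pushback
 * buffer is full. */
int str_unput(int c)
{
    if (npushed >= PUSHBACK_MAX)
        return -1;
    pushback[npushed++] = (char)c;
    return 0;
}
```

        Both routines share one pushback buffer, so a character handed
        to str_unput is the next one that str_input returns.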

        The declarations section may contain:

        *    C code to be placed at the head of lex.yy.c.  Any lines
             between lines containing only "%{" and "%}" are copied into
             lex.yy.c.

        *    Lex substitution string definitions.  Each such definition
             is a line of the form:

                  name  definition

             The name must start in the first column and begin with a
             letter, and it must be separated from the definition by one
             or more blanks or tabs.  The definition can be anything.

             Such names may be used in expressions in the rules section
             by surrounding them with curly braces, {}.  For example,

                  DIGIT     [0-9]
                  %%
                  {DIGIT}+  printf("integer");

             The "{DIGIT}" is replaced by its definition "[0-9]".

        *    Start condition definitions.  Each definition line is of the
             form:

                  %Start cond1 cond2 ...

             where the "%Start" begins in the first column.  Each word
             following it is declared to be the name of a start
             condition.

             Expressions in the rules section may then be preceded by the
             names of start conditions in angle brackets, <>;  this
             restricts them to be matched only when yylex is in the
             listed start conditions.  Several start conditions may be
             listed, separated by commas;  for example, "<cond1,cond2>".

             The start condition yylex is in may be changed by an action
             that executes a "BEGIN name;" statement, where "name" is the
             name of a start condition.  yylex is initially in start
             condition 0;  "BEGIN 0;" will reset it.

        NOTE:
             Any expression not preceded by a start condition may be
             matched at any time.  For example,

                  %Start    one two
                  %%
                  ^one      { ECHO; BEGIN one; }
                  ^two      { ECHO; BEGIN two; }
                  ^zip      { ECHO; BEGIN 0; }
                  <one>target    { printf("one"); }
                  <two>target    { printf("two"); }

             Different rules for "target" will be executed depending on
             what start condition is active.

        *    Table size limits for the finite state machine implemented
             by yylex.

                  %p n    Maximum number of positions is n (default 2000)
                  %n n    Maximum number of states is n (default 500)
                  %t n    Maximum number of parse tree nodes is n (default 1000)
                  %a n    Maximum number of transitions is n (default 3000)
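
        The start conditions described earlier in this list can be
        pictured as a single state variable that selects which
        conditioned rule applies.  The following is a hand-written model
        of the "one"/"two" example, not code that lex generates.  The
        names start_cond, begin_cond, and target_action are hypothetical.

```c
#include <stddef.h>

/* Hypothetical model of the start-condition example; lex builds its
 * own tables and does not produce code like this. */
enum start_cond { SC_ZERO = 0, SC_ONE, SC_TWO };

static enum start_cond current = SC_ZERO;   /* yylex starts in condition 0 */

/* Models "BEGIN name;". */
void begin_cond(enum start_cond c)
{
    current = c;
}

/* Models the two "<cond>target" rules: returns the text the active
 * rule would print, or NULL when neither conditioned rule is active. */
const char *target_action(void)
{
    switch (current) {
    case SC_ONE: return "one";
    case SC_TWO: return "two";
    default:     return NULL;
    }
}
```

        In condition 0 neither "target" rule is active, which the model
        reports by returning NULL.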

        The programs section may contain anything you like.  It is copied
        to the end of lex.yy.c.

        Any line in any of the three sections that begins with a space is
        copied directly into lex.yy.c.

        To use yylex, you must provide a program to call it and link them
        with the "-ll" option.  To use yylex with a yacc(1) parser, end
        the action for each lex rule with

             return(token);

        where "token" is the appropriate token.  Access to yacc's token
        names may be ensured either by including the yylex code in the
        yacc source file with

             #include "lex.yy.c"

        or by generating the "y.tab.h" file with yacc's "-d" option and
        including it with

             #include "y.tab.h"

        in the declarations section of the lex input.
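
        A calling program can be as small as a loop that invokes the
        scanner until it returns 0, which is how yylex reports end of
        input.  The sketch below takes the scanner as a function pointer
        so that it compiles without lex.  The names drain_scanner and
        demo_scan are hypothetical, and demo_scan is only a stand-in for
        a generated yylex.  Its token values are arbitrary.

```c
/* Call a yylex-style scanner until it reports end of input (0) and
 * return how many tokens it delivered. */
int drain_scanner(int (*scan)(void))
{
    int ntokens = 0;

    while (scan() != 0)
        ntokens++;
    return ntokens;
}

/* Hypothetical stand-in for a generated yylex: three tokens, then 0.
 * The token values are arbitrary placeholders. */
static const int demo_tokens[] = { 257, 258, 257, 0 };
static int demo_pos = 0;

int demo_scan(void)
{
    int t = demo_tokens[demo_pos];

    if (t != 0)
        demo_pos++;             /* stay on the final 0 once reached */
    return t;
}
```

        With the lex output linked in, the same loop would be called as
        drain_scanner(yylex).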


        OPTIONS

        -t   Output which normally goes to lex.yy.c is sent to stdout.

        -v   A one-line summary of the finite state machine implemented
             by yylex is printed.

        -n   Cancels -v option.


        EXAMPLE

                D       [0-9]
                %%
                if      printf("IF statement\n");
                [a-z]+  printf("tag, value %s\n",yytext);
                0{D}+   printf("octal number %s\n",yytext);
                {D}+    printf("decimal number %s\n",yytext);
                "++"    printf("unary op\n");
                "+"     printf("binary op\n");
                "/*"    {       loop:
                                while (input() != '*');
                                switch (input())
                                        {
                                        case '/': break;
                                        case '*': unput('*');  /* fall through */
                                        default: goto loop;
                                        }
                                }


        SEE ALSO

        yacc(1).
        malloc(3X) in the Programmer's Reference for the DG/UX System.