lex(1) — DG/UX R4.11



lex(1)                            SDK R4.11                           lex(1)


NAME
       lex - generate programs for simple lexical tasks

SYNOPSIS
       lex [ -tvn ] [ file ] ...

DESCRIPTION
       Lex generates programs to do simple lexical analysis of text using
       regular expressions.  Lex reads its input files, or the standard
       input if no files are named, to get a list of regular expressions the
       generated program will look for, and C text to execute when each
       expression is matched.

       An output file lex.yy.c is produced that contains C code for the
       generated program, which is named yylex.  It must be linked using the
       -ll switch, to get the lex library routines.

       The input to lex is of the form:

            declarations
            %%
            rules
            %%
            programs

       Any of the sections may be empty.  If the "programs" section is
       empty, the "%%" that precedes it may be omitted.  Thus the shortest
       legal lex input is

            %%
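
       As an illustration of the three-section layout, the following sketch
       counts the lines in its input.  The counter name nlines and the main
       routine are assumptions made for this example; they are not supplied
       by lex.

```lex
%{
#include <stdio.h>
int nlines = 0;         /* hypothetical counter, not provided by lex */
%}
%%
\n      { nlines++; }
.       ;
%%
int main()
{
    yylex();            /* returns once the input is exhausted */
    printf("%d\n", nlines);
    return 0;
}
```

       The declarations section holds the C declarations, the rules section
       holds two expression/action pairs, and the programs section supplies
       the caller of yylex.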

   Rules
       Each rule is of the form:

            <expression> <action>

       An <expression> defines a regular expression that yylex will try to
       match.  The <action> is the C code that yylex will execute when that
       <expression> is matched.

       yylex writes any input characters that match no expression to the
       standard output.
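
       A minimal sketch of this default behavior: the single rule below has
       an empty action, so runs of digits are discarded, while every other
       character matches no expression and is copied to the standard output.

```lex
%%
[0-9]+  ;
```

       Given the input line "abc123def", a scanner built from this input
       would write "abcdef".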

       The notation for lex regular expressions is described below.  In the
       description, X and Y stand for lex regular expressions, and x and y
       stand for characters.

       x      An ordinary single character matches itself.  Exceptions are
              these meta-characters:  "\[]^-?.*+|()$/{}%<>.

       \x     Matches x, except for these special escape sequences beginning
              with a backslash:
              \n     matches newline
              \t     matches tab
              \b     matches backspace
              \\     matches backslash

       "xy"   A string of characters in double quotes matches the string of
              characters.  Any special meaning those characters (except for
              backslash) might otherwise have is ignored.  The string "\x"
              matches whatever \x would match.  For example,

              "."    matches a period

              "\n"   matches newline

              "[hello]\t"
                     matches the 7-character string "[hello]" followed by a
                     tab

       .      A period matches any character except newline.

       [xy]   A string of elements inside square brackets matches any
              character any of the elements match.  Elements can be any of
              the following:

              single characters, which match themselves (except for "]"
              anywhere and "-" immediately after the initial "[").

              \x regular expressions, which match what they usually do.

              triplets of characters x-y;  these match any character from x
              to y, inclusive.  For example, [adm-p\n] matches any one of
              these characters:  a, d, m, n, o, p, newline.

              A caret, ^, as the first character inside the square brackets
              has special meaning:  if S is a string of characters, then
              [^S] matches any character except for newline and any
              character that [S] would match.

       XY     matches anything that X would match concatenated with anything
              that Y would match.  For example, [ab][cd] matches "ac", "bc",
              "ad", and "bd".

       X*     matches 0 or more successive strings each matched by X.  For
              example, c* matches the empty string, "c", "cc", and so forth.

       X+     matches 1 or more successive strings each matched by X.  For
              example, c+ matches "c", "cc", and so forth.

       X{j,k} where j and k are integers in the range [0,255], matches j to
              k (inclusive) successive strings each matched by X.  For
              example, c{3,5} matches "ccc", "cccc", and "ccccc".

       X{j}   is equivalent to X{j,j};  it matches exactly j successive
              strings each matched by X.

       X{j,}  matches j or more successive strings matched by X.

       (X)    matches whatever X matches.

       X?     matches either the empty string or whatever X matches; it is
              equivalent to X{0,1}.  For example, (ab)? matches "ab" and
              "".

       X|Y    matches anything that either X or Y would match.  For example,
              "bob"|(ab?c) matches "bob", "ac", and "abc".

       ^X     A caret, ^, at the beginning of a regular expression restricts
              it to only match strings at the beginning of a line.  A caret
              not at the beginning of a regular expression does not have
              this effect.  For example, ^Bob matches "Bob" when it occurs
              at the beginning of a line, but nowhere else.

       X$     A dollar sign, $, at the end of a regular expression restricts
              it to only match strings at the end of a line.  A dollar sign
              not at the end of a regular expression does not have this
              effect.  For example, bye$ matches "bye" when it occurs at the
              end of a line, but nowhere else.

       X/Y    restricts X to match only strings that are followed by
              something Y matches.  For example, (bob)/(white) matches "bob"
              in the context "bobwhite" but not in the context "bobolink".
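
       The following rules are a sketch that combines several of the
       operators above.  The bracketed labels printed by the actions are
       arbitrary and chosen only for this example.

```lex
%%
^#.*                { /* "#" at the start of a line, then the rest of that line */
                      printf("[comment]"); }
[0-9]+"."[0-9]*     { /* one or more digits, a literal period, optional digits */
                      printf("[real]"); }
[0-9]+              { /* one or more digits */
                      printf("[integer]"); }
(yes|no)$           { /* either word, but only at the end of a line */
                      printf("[answer]"); }
```

       When more than one expression can match at the same point, lex
       prefers the longest match, so "12.5" is handled by the second rule
       rather than the third.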

       Blanks or tabs can only appear within a regular expression if each
       is:

         ·    escaped with a backslash;

         ·    inside double quotes;  or

         ·    within square brackets.
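
       For instance, each of the three rules sketched below matches a
       phrase containing one blank, using one of the three forms in turn.
       The phrases and the printed labels are arbitrary.

```lex
%%
"hello world"   printf("quoted\n");
good\ bye       printf("escaped\n");
so[ ]long       printf("bracketed\n");
```

       An unprotected blank would instead end the expression and begin the
       action.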

       The <action> may be a single line of C code terminated with a
       semicolon, or a sequence of C statements within curly braces { and }.
       Lex provides the following for use in actions:

       yytext Character pointer to the text matched by the regular
              expression.

       yyleng Length of text in yytext.

       |      "|;" as the action for one rule is equivalent to the action
              for the next rule.  "|" may not be used inside curly braces
              "{}".

       ECHO   Equivalent to

              printf("%s", yytext)

       REJECT Causes yylex to reject this match and continue looking to see
              if other regular expressions will match it instead.

       unput(c)
              Routine that pushes a character back onto the input.

       yyless(n)
              Causes all but the first n characters of yytext to be pushed
              back onto the input.

       yymore()
              Causes the next input string to be matched to be catenated
              onto the end of yytext, rather than overwriting it.
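
       A short sketch of actions that use yytext, yyleng, ECHO, and yyless.
       The message texts are arbitrary, and the digit rule is contrived only
       to show text being pushed back and rescanned.

```lex
%%
[a-z]+      { printf("word of %d characters: ", yyleng);
              ECHO;
              printf("\n"); }
[0-9]+      { /* keep the first digit; the remaining digits are rescanned */
              yyless(1);
              printf("digit %s\n", yytext); }
[ \t\n]     ;
```

       After yyless(1), yytext holds only the first matched character, and
       the characters pushed back are matched again by the same rule.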

       You can redefine several routines and macros to change how yylex
       behaves.  If you do this, you have to make sure that you remove the
       default definitions from the resulting output from lex.

       input()
              By default, a macro that is called to read a character from
              stdin.  It returns 0 at end-of-file.

       unput(c)
              By default, a macro that is called to push the character c
              back onto the input.  The lex library allows 100 characters
              worth of pushback.

              If you redefine input() or unput(c), you must ensure that the
              two of them are consistent with each other.

       output(c)
              By default, a macro that is called to write a character c to
              stdout.

       yyin   File pointer for input;  macro defined as stdin.

       yyout  File pointer for output;  macro defined as stdout.

       yywrap()
              This routine is called when input() returns 0.  If yywrap()
              returns 1, yylex finishes wrapping up and returns.  If
              yywrap() returns 0, however, yylex continues to read input and
              match expressions.  The default yywrap() always returns 1.
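
       The sketch below supplies its own yywrap() so that yylex continues
       through several files named on the command line.  It assumes the
       default input() reads from stdin as described above, and it switches
       files by reopening stdin with freopen(3S) rather than by changing
       yyin.  The names next_file and nlines are hypothetical.

```lex
%{
#include <stdio.h>
static char **next_file;    /* hypothetical: file names not yet scanned */
static int nlines = 0;      /* hypothetical: total lines seen */
%}
%%
\n      { nlines++; }
.       ;
%%
/* Called when input() returns 0: return 0 to go on scanning, 1 to stop. */
int yywrap()
{
    if (*next_file == NULL)
        return 1;           /* no files left: let yylex return */
    if (freopen(*next_file++, "r", stdin) == NULL)
        return 1;           /* stop at the first file that cannot be opened */
    return 0;               /* more input is now available on stdin */
}

int main(int argc, char **argv)
{
    next_file = argv + 1;
    if (*next_file != NULL && yywrap())
        return 1;           /* the first named file could not be opened */
    yylex();
    printf("%d\n", nlines);
    return 0;
}
```

       Because this yywrap() and main are defined in the programs section,
       the versions in the lex library are not pulled in at link time.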

   Declarations
       The declarations section may contain:

       ·      C code to be placed at the head of lex.yy.c.  Any lines
              between lines containing only "%{" and "%}" are copied into
              lex.yy.c.

       ·      Lex substitution string definitions.  Each such definition is
              a line of the form:

                   name  definition

              The name must start in the first column and begin with a
              letter, and it must be separated from the definition by one
              or more blanks or tabs.  The definition can be anything.

              Such names may be used in expressions in the rules section by
              surrounding them with curly braces, {}.  For example,

                   DIGIT      [0-9]
                   %%
                   {DIGIT}+   printf("integer");

              The "{DIGIT}" is replaced by its definition "[0-9]".

       ·      Start condition definitions.  Each definition line is of the
              form:

                   %Start cond1 cond2 ...

              where the "%Start" begins in the first column.  Each word
              following it is declared to be the name of a start condition.

              Expressions in the rules section may then be preceded by the
              names of start conditions in angle brackets, <>;  this
              restricts them to be matched only when yylex is in the listed
              start conditions.  Several start conditions may be listed,
              separated by commas;  for example, "<cond1,cond2>".

              The start condition yylex is in may be changed by an action
              that executes a "BEGIN name;" statement, where "name" is the
              name of a start condition.  yylex is initially in start
              condition 0, or INITIAL;  "BEGIN 0;" or "BEGIN INITIAL;" will
              reset it.

       NOTE: Any expression not preceded by a start condition may be matched
             at any time.  For example,

                  %Start     one two zip
                  %%
                  ^one               { ECHO; BEGIN one; }
                  ^two               { ECHO; BEGIN two; }
                  ^zip               { ECHO; BEGIN zip; }
                  <one>target        { printf("one"); }
                  <two>target        { printf("two"); }

             Different rules for "target" will be executed depending on what
             start condition is active.

       ·     Table size limits for the finite state machine implemented by
             yylex.

                  %p n    Max number of positions is n (default 20000)
                  %n n    Max number of states is n (default 4000)
                  %e n    Max number of parse tree nodes is n (default 8000)
                  %a n    Max number of transitions is n (default 16000)
                  %k n    Max number of packed char classes is n (default 20000)
                  %o n    Max number of output slots is n (default 24000)
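
       As a further sketch of the start conditions described above, the
       following input removes C-style comments and copies everything else
       to the standard output.  The condition name COMMENT is an arbitrary
       choice for this example.

```lex
%Start COMMENT
%%
<COMMENT>"*/"   { BEGIN 0; }
<COMMENT>.      ;
<COMMENT>\n     ;
"/*"            { BEGIN COMMENT; }
```

       The rule for "/*" carries no start condition, so it can also match
       inside a comment; there it simply re-enters COMMENT, which is
       harmless.  Outside a comment only that rule is active, so all other
       text is written out unchanged.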

   Programs
       The programs section may contain anything you like.  It is copied to
       the end of lex.yy.c.

       Any line in any of the three sections that begins with a space is
       copied directly into lex.yy.c.

       To use yylex, you must provide a program to call it and link them
       with the "-ll" option.  To use yylex with a yacc(1) parser, end the
       action for each lex rule with

            return(token);

       where "token" is the appropriate token.  Access to yacc's token names
       may be ensured by including the yylex code in the yacc generator with

            #include "lex.yy.c"

       or generating the "y.tab.h" file with yacc's "-d" option and
       including it with

            #include "y.tab.h"

       in the declarations section of the lex input.
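
       A sketch of lex input written for use with a yacc parser.  It assumes
       that the grammar declares tokens named NUMBER and NAME and uses the
       default integer value type for yylval; both token names are
       hypothetical.

```lex
%{
#include <stdlib.h>
#include "y.tab.h"      /* from yacc -d; assumed to define NUMBER and NAME */
extern int yylval;      /* assumes the grammar keeps the default int type */
%}
%%
[0-9]+          { yylval = atoi(yytext); return(NUMBER); }
[a-z]+          { return(NAME); }
[ \t]+          ;
\n              { return('\n'); }
.               { return(yytext[0]); }
```

       The last two rules pass newline and any other single character to the
       parser as literal tokens, so that no input falls through to the
       default copy to the standard output.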

   Options
       -t     Output which normally goes to lex.yy.c is sent to stdout.

       -v     A one-line summary of the finite state machine implemented by
              yylex is printed.

       -n     Cancels -v option.

   International Features
       lex can process characters from supplementary code sets as well as
       ASCII characters.

       Characters from supplementary code sets can be specified in comments
       which exist in definitions, rules, and user subroutines.

       Characters from supplementary code sets can be specified in strings
       which exist in actions in rules and in user subroutines.

       Character strings from supplementary code sets can be defined as
       tokens.

EXAMPLE
       D       [0-9]
       %%
       if      printf("IF statement\n");
       [a-z]+ printf("tag, value %s\n",yytext);
       0{D}+   printf("octal number %s\n",yytext);
       {D}+    printf("decimal number %s\n",yytext);
       "++"    printf("unary op\n");
       "+"     printf("binary op\n");
       "/*" {       loop:
                       while (input() != '*');
                       switch (input())
                               {
                               case '/': break;
                               case '*': unput('*');   /* fall through */
                               default: goto loop;
                               }
                       }

NOTE
       Remember, if you redefine any of the lex-furnished macros, you must
       remove the default definitions from the output produced by lex.

SEE ALSO
       yacc(1), malloc(3X).


Licensed material--property of copyright holder(s)