     lex(1)                     DG/UX 4.30                      lex(1)



     NAME
          lex - generate programs for simple lexical tasks

     SYNOPSIS
          lex [ -tvn ] [ file ] ...

     DESCRIPTION
          Lex generates programs to do simple lexical analysis of text
          using regular expressions.  Lex reads its input files, or
          the standard input if no files are named, to get a list of
          regular expressions the generated program will look for, and
          C text to execute when each expression is matched.

          An output file lex.yy.c is produced that contains C code for
          the generated analyzer, a function named yylex.  The compiled
          code must be linked using the "-ll" switch to get the lex
          library routines.

          The input to lex is of the form:

               declarations
               %%
               rules
               %%
               programs

          Any of the sections may be empty.  If the "programs" section
          is empty, the "%%" that precedes it may be omitted.  Thus
          the shortest legal lex input is

               %%

          Each rule is of the form:

               <expression> <action>

          An <expression> defines a regular expression that yylex will
          try to match.  The <action> is the C code that yylex will
          execute when that <expression> is matched.

          yylex writes any input characters that match no expression
          to the standard output.

          The notation for lex regular expressions is described below.
          In the description, X and Y stand for lex regular
          expressions, and x and y stand for characters.

          x    An ordinary single character matches itself.
               Exceptions are these meta-characters:  "\[]^-
               ?.*+|()$/{}%<>.

          \x   Matches x, except for these special escape sequences
               beginning with a backslash:

              \n   matches newline

              \t   matches tab

              \b   matches backspace

              \\   matches backslash

          "xy" A string of characters in double quotes matches the
               string of characters.  Any special meaning those
               characters (except for backslash) might otherwise have
               is ignored.  The string "\x" matches whatever \x would
               match.  For example,

          "."  matches a period

          "\n" matches newline

          "[hello]\t"
               matches the 7-character string "[hello]" followed by a
               tab (8 characters in all)

          .    A period matches any character except newline.

          [xy] A string of elements inside square brackets matches any
               single character that one of the elements matches.
               Elements can be any of the following:

               single characters, which match themselves (except for
               "]" anywhere and "-" immediately after the initial
               "[").

               \x regular expressions, which match what they usually
               do.

               triplets of characters x-y;  these match any character
               from x to y, inclusive.

               For example, [adm-p\n] matches any one of these
               characters:  a, d, m, n, o, p, newline.

               A caret, ^, as the first character inside the square
               brackets has special meaning:  if S is a string of
               characters, then [^S] matches any character except for
               newline and any character that [S] would match.
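
          The bracket rules above can be sketched in C.  This is only
          an illustration of the description, not the code lex
          generates; the names next_elem and class_matches are
          hypothetical, and the function takes the text between "["
          and "]" as a C string.

```c
#include <assert.h>

/* Decode one element, handling the \n, \t, \b and \\ escapes.
   Advances *p past the characters consumed. */
static int next_elem(const char **p)
{
    int c = (unsigned char)*(*p)++;
    if (c == '\\' && **p != '\0') {
        c = (unsigned char)*(*p)++;
        switch (c) {
        case 'n': return '\n';
        case 't': return '\t';
        case 'b': return '\b';
        default:  return c;          /* \x matches x itself */
        }
    }
    return c;
}

/* Return 1 if ch is matched by the class whose body (the text between
   the square brackets) is given in body, 0 otherwise. */
static int class_matches(const char *body, int ch)
{
    int negate = 0, found = 0;
    const char *p = body;

    assert(body != 0);
    if (*p == '^') {                 /* leading caret negates the class */
        negate = 1;
        p++;
    }
    while (*p != '\0') {
        int lo = next_elem(&p);
        int hi = lo;
        if (p[0] == '-' && p[1] != '\0') {   /* triplet x-y */
            p++;
            hi = next_elem(&p);
        }
        if (ch >= lo && ch <= hi)
            found = 1;
    }
    if (negate)                      /* per the text, [^S] excludes newline */
        return ch != '\n' && !found;
    return found;
}
```

          With the body of [adm-p\n] written as the C string
          "adm-p\\n", class_matches accepts a, d, m, n, o, p and
          newline, and rejects everything else.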

          XY   matches anything that X would match concatenated with
               anything that Y would match.  For example,

          [ab][cd]
               matches "ac", "bc", "ad", and "bd".

          X*   matches 0 or more successive strings each matched by X.
               For example,


          c*   matches the empty string, "c", "cc", and so forth.

          X+   matches 1 or more successive strings each matched by X.
               For example,


          c+   matches "c", "cc", and so forth.

          X{j,k}
               where j and k are integers in the range [0,255],
               matches j to k (inclusive) successive strings each
               matched by X.  For example,


          c{3,5}
               matches "ccc", "cccc", and "ccccc".

          X{j} is equivalent to X{j,j};  it matches exactly j
               successive strings each matched by X.

          X{j,}
               matches j or more successive strings matched by X.

          (X)  matches whatever X matches.

          X?   matches the empty string and whatever X matches; it is
               equivalent to X{0,1}.  For example,


          (ab)?
               matches "ab" and "".

          X|Y  matches anything that either X or Y would match.  For
               example,


          "bob"|(ab?c)
               matches "bob", "ac", and "abc".

          ^X   A caret, ^, at the beginning of a regular expression
               restricts it to only match strings at the beginning of
               a line.  A caret not at the beginning of a regular
               expression does not have this effect.  For example,

          ^Bob matches "Bob" when it occurs at the beginning of a
               line, but nowhere else.

          X$   A dollar sign, $, at the end of a regular expression
               restricts it to only match strings at the end of a
               line.  A dollar sign not at the end of a regular
               expression does not have this effect.  For example,


          bye$
               matches "bye" when it occurs at the end of a line, but
               nowhere else.

          X/Y  restricts X to match only strings that are followed by
               something Y matches.  For example,


          (bob)/(white)
               matches "bob" in the context "bobwhite" but not in the
               context "bobolink".

          Blanks or tabs can only appear within a regular expression
          if each is:

          *    escaped with a backslash;

          *    inside double quotes;  or

          *    within square brackets.

          The <action> may be a single line of C code terminated with
          a semicolon, or a sequence of C statements within curly
          braces { and }.  Lex provides the following for use in
          actions:

          yytext
               Character pointer to the text matched by the regular
               expression.

          yyleng
               Length of text in yytext.

          |    "|;" as the action for one rule is equivalent to the
               action for the next rule.  "|" may not be used inside
               curly braces "{}".

          ECHO Equivalent to

               printf("%s", yytext)

          REJECT
               Causes yylex to reject this match and continue looking
               to see if other regular expressions will match it
               instead.

          unput(c)
               Routine that pushes a character back onto the input.

          yyless(n)
               Causes all but the first n characters of yytext to be
               pushed back onto the input.

          yymore()
               Causes the next string matched to be appended to the
               end of yytext, rather than overwriting it.

          You can redefine several routines and macros to change how
          yylex behaves:

          input()
               By default, a macro that is called to read a character
               from stdin.  It returns 0 at end-of-file.

          unput(c)
               By default, a macro that is called to push the
               character c back onto the input.  The lex library
               allows 100 characters worth of pushback.

               If you redefine input() or unput(c), you must ensure
               that the two of them are consistent with each other.
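
          As an illustration of what "consistent" means here, the
          following is a minimal sketch of a matched pair that reads
          from an in-memory string instead of stdin.  The names
          my_input, my_unput and set_source are hypothetical stand-ins,
          not the lex library's macros; only the 100-character
          pushback limit is taken from the text above.

```c
#include <assert.h>
#include <stddef.h>

#define PUSHBACK_MAX 100             /* pushback depth quoted in the text */

static const char *src = "";         /* remaining unread input */
static char pushback[PUSHBACK_MAX];  /* stack of pushed-back characters */
static size_t npushed = 0;

/* Point the reader at a new string and discard any pushback. */
static void set_source(const char *s)
{
    src = s;
    npushed = 0;
}

/* Return the next character, or 0 at end of input.  Pushed-back
   characters are returned first, most recent first. */
static int my_input(void)
{
    if (npushed > 0)
        return (unsigned char)pushback[--npushed];
    if (*src == '\0')
        return 0;
    return (unsigned char)*src++;
}

/* Push c back so that the next my_input() call returns it. */
static void my_unput(int c)
{
    assert(npushed < PUSHBACK_MAX);
    pushback[npushed++] = (char)c;
}
```

          The two routines share one pushback stack, so a character
          handed to my_unput is always the next one my_input returns.
          A replacement pair that kept separate state would break
          yyless(n) and REJECT, which depend on pushback.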

          output(c)
               By default, a macro that is called to write a character
               c to stdout.

          yyin File pointer for input;  macro defined as stdin.

          yyout
               File pointer for output;  macro defined as stdout.

          yywrap()
               This routine is called when input() returns 0.  If
               yywrap() returns 1, yylex finishes wrapping up and
               returns.  If yywrap() returns 0, however, yylex
               continues to read input and match expressions.  The
               default yywrap() always returns 1.
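
          The yywrap() protocol can be sketched as follows.  This is a
          simulation under stated assumptions, not the library code:
          the input sources are in-memory strings, and my_input,
          my_yywrap and read_all are hypothetical names standing in for
          input(), yywrap() and the read loop inside yylex.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical input sources, consumed one after another. */
static const char *sources[] = { "ab", "", "c" };
static size_t cur = 0;               /* index of the source being read */
static size_t pos = 0;               /* read offset within that source */

/* Stand-in for input(): next character, or 0 at end of the source. */
static int my_input(void)
{
    char c = sources[cur][pos];
    if (c == '\0')
        return 0;
    pos++;
    return (unsigned char)c;
}

/* Stand-in for yywrap(): 0 means more input was set up, 1 means done. */
static int my_yywrap(void)
{
    if (cur + 1 >= sizeof sources / sizeof sources[0])
        return 1;
    cur++;
    pos = 0;
    return 0;
}

/* Mimics the loop described above: on end of input, ask my_yywrap()
   whether to stop or to keep reading.  Copies all input into out. */
static size_t read_all(char *out, size_t cap)
{
    size_t n = 0;
    for (;;) {
        int c = my_input();
        if (c == 0) {
            if (my_yywrap())
                break;
            continue;
        }
        assert(n + 1 < cap);
        out[n++] = (char)c;
    }
    out[n] = '\0';
    return n;
}
```

          A yywrap() replacement that returns 0 must first arrange for
          input() to have something new to read; otherwise yylex would
          call it again immediately and loop forever.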

          The declarations section may contain:

          *    C code to be placed at the head of lex.yy.c.  Any lines
               between lines containing only "%{" and "%}" are copied
               into lex.yy.c.

          *    Lex substitution string definitions.  Each such
               definition is a line of the form:

                    name  definition

               The name must start in the first column and begin with
               a letter, and it must be separated from the definition
               by one or more blanks or tabs.  The definition can be
               anything.

               Such names may be used in expressions in the rules
               section by surrounding them with curly braces, {}.  For
               example,

                    DIGIT     [0-9]
                    %%
                    {DIGIT}+  printf("integer");

               The "{DIGIT}" is replaced by its definition "[0-9]".

          *    Start condition definitions.  Each definition line is
               of the form:

                    %Start cond1 cond2 ...

               where the "%Start" begins in the first column.  Each
               word following it is declared to be the name of a start
               condition.

               Expressions in the rules section may then be preceded
               by the names of start conditions in angle brackets, <>;
               this restricts them to be matched only when yylex is in
               the listed start conditions.  Several start conditions
               may be listed, separated by commas;  for example,
               "<cond1,cond2>".

               The start condition yylex is in may be changed by an
               action that executes a "BEGIN name;" statement, where
               "name" is the name of a start condition.  yylex is
               initially in start condition 0;  "BEGIN 0;" will reset
               it.

          NOTE:
               Any expression not preceded by a start condition may be
               matched at any time.  For example,

                    %Start    one two
                    %%
                    ^one      { ECHO; BEGIN one; }
                    ^two      { ECHO; BEGIN two; }
                    ^zip      { ECHO; BEGIN 0; }
                    <one>target    { printf("one"); }
                    <two>target    { printf("two"); }

               Different rules for "target" will be executed depending
               on what start condition is active.

          *    Table size limits for the finite state machine
               implemented by yylex.

                    %p n    Maximum number of positions is n (default 2000)
                    %n n    Maximum number of states is n (default 500)
                    %t n    Maximum number of parse tree nodes is n (default 1000)
                    %a n    Maximum number of transitions is n (default 3000)

          The programs section may contain anything you like.  It is
          copied to the end of lex.yy.c.

          Any line in any of the three sections that begins with a
          space is copied directly into lex.yy.c.

          To use yylex, you must provide a program to call it and link
          them with the "-ll" option.  To use yylex with a yacc(1)
          parser, end the action for each lex rule with

               return(token);

          where "token" is the appropriate token.  Access to yacc's
          token names may be ensured by including the yylex code in
          the yacc input file with

               #include "lex.yy.c"

          or generating the "y.tab.h" file with yacc's "-d" option and
          including it with

               #include "y.tab.h"

          in the declarations section of the lex input.

     OPTIONS
          -t   Output which normally goes to lex.yy.c is sent to
               stdout.

          -v   A one-line summary of the finite state machine
               implemented by yylex is printed.

          -n   Cancels -v option.

     EXAMPLE
             D  [0-9]
             %%
             if printf("IF statement\n");
             [a-z]+   printf("tag, value %s\n",yytext);
             0{D}+ printf("octal number %s\n",yytext);
             {D}+  printf("decimal number %s\n",yytext);
             "++"  printf("unary op\n");
             "+"   printf("binary op\n");
             "/*"  {  loop:
                   while (input() != '*');
                   switch (input())
                      {
                      case '/': break;
                      case '*': unput('*');   /* fall through */
                      default: goto loop;
                      }
                   }
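
          The comment-skipping action above never checks for end of
          file, so an unterminated comment would keep calling input()
          after it starts returning 0.  The following is a sketch of
          the same loop with that check added.  It is written as
          stand-alone C for illustration: my_input, my_unput and
          skip_comment are hypothetical names, and input is taken from
          a string rather than from yylex's input stream.

```c
#include <assert.h>

/* Hypothetical stand-ins for input() and unput(), reading a string. */
static const char *src = "";
static int pushed = -1;              /* one pushed-back character, or -1 */

static int my_input(void)
{
    int c;
    if (pushed >= 0) {
        c = pushed;
        pushed = -1;
        return c;
    }
    return *src ? (unsigned char)*src++ : 0;   /* 0 at end of input */
}

static void my_unput(int c)
{
    assert(pushed < 0);              /* this sketch holds one character */
    pushed = c;
}

/* Consume input up to and including the closing star-slash pair.
   Returns 1 if the comment was closed, 0 if input ended first. */
static int skip_comment(void)
{
    int c;
    for (;;) {
        while ((c = my_input()) != '*')
            if (c == 0)
                return 0;            /* unterminated comment */
        c = my_input();
        if (c == '/')
            return 1;
        if (c == 0)
            return 0;
        if (c == '*')
            my_unput('*');           /* may begin the closing pair */
    }
}
```

          In a lex action the same structure applies with input() and
          unput() in place of the stand-ins; what to do on an
          unterminated comment (report an error, return a token) is
          left to the caller.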

     SEE ALSO
          yacc(1).
          malloc(3X) in the Programmer's Reference for the DG/UX
          System.

     Licensed material--property of copyright holder(s)