lex(1) — AIX/RT 2.2.1

lex

PURPOSE

     Generates a C Language  program that matches patterns for
     simple lexical analysis of an input stream.

SYNOPSIS
     lex [ -tvn ] [ file ] ...


DESCRIPTION

     The lex command reads file or standard input, generates a
     C  Language  program,  and  writes it  to  a  file  named
     lex.yy.c.  This  file, lex.yy.c,  is a compilable  C Lan-
     guage program.

     The lex command uses rules  and actions contained in file
     to generate  a program,  lex.yy.c, which can  be compiled
     with the  cc command.  It  can then receive  input, break
     the input into the logical pieces defined by the rules in
     file, and run program  fragments contained in the actions
     in file.  For  a more detailed discussion of  lex and its
     operation, see AIX Operating System Programming Tools and
     Interfaces.

     The  generated program  is a  C Language  function called
     yylex.  lex stores  yylex in a file  named lex.yy.c.  You
     can use yylex alone  to recognize simple, one-word input,
     or  you can  use it  with  other C  Language programs  to
     perform  more difficult  input  analysis functions.   For
     example, you can use lex  to generate a program that sim-
     plifies an  input stream  before sending  it to  a parser
     program generated by the yacc command.

     The  function yylex  analyzes  the input  stream using  a
     program structure called a  "finite state machine."  This
     structure allows the  program to exist in  only one state
     (or condition)  at a time.   There is a finite  number of
     states  allowed.  The  rules  in file  determine how  the
     program moves from one state to another.
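
     The idea of a finite state machine can be illustrated with
     a small hand-written C sketch.  This is not the code that
     lex generates; the state names and the function
     count_numbers are hypothetical and chosen only for
     illustration.  The machine is always in exactly one of two
     states and moves between them according to whether the
     current character is a digit.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical two-state machine: the scanner is always in
 * exactly one of these states. */
enum state { OUTSIDE, IN_NUMBER };

/* Counts runs of digits in s, moving from state to state one
 * character at a time. */
int count_numbers(const char *s)
{
    enum state st = OUTSIDE;
    int count = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        int digit = isdigit((unsigned char)*s);

        switch (st) {
        case OUTSIDE:
            if (digit) {            /* OUTSIDE -> IN_NUMBER */
                st = IN_NUMBER;
                count++;
            }
            break;
        case IN_NUMBER:
            if (!digit)             /* IN_NUMBER -> OUTSIDE */
                st = OUTSIDE;
            break;
        }
    }
    return count;
}
```

     Calling count_numbers("12 abc 345x6") returns 3.  In a
     lex-generated program the states and transitions are
     derived from the rules in file rather than written by hand.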

     If you do  not specify a file, lex  reads standard input.
     It treats multiple files as a single file.

     Note:  Since  lex uses  fixed names for  intermediate and
     output files, you can have only one lex-generated program
     in a given directory.

       Input File Format (file)

     The input  file can contain three  sections: definitions,
     rules, and user subroutines.   Each section must be sepa-
     rated  from the  others  by a  line  containing only  the
     delimiter, %%.  The format is:

         definitions
         %%
         rules
         %%
         user subroutines

     The purpose and format of  each are described in the fol-
     lowing sections.

     DEFINITIONS
     :  If you  want to use variables in your  rules, you must
     define them in  this section.  The variables  make up the
     left  column, and  their  definitions make  up the  right
     column.  For example, if you want to define D as a numer-
     ical digit, you would write:

            D   [0-9]

     You can  use a defined  variable in the rules  section by
     enclosing the variable name in braces ("{D}").

     In the definitions  section, you can set  table sizes for
     the resulting  finite state  machine.  The  default sizes
     are large enough for small programs.  You may want to set
     larger sizes for more complex programs.

     %p  n   Number of positions is n (default 2000)
     %n  n   Number of states is n (default 500)
     %t  n   Number of parse tree nodes is n (default 1000)
     %a  n   Number of transitions is n (default 3000)

     If  extended  characters  appear  in  regular  expression
     strings, you may need to reset the output array size with
     the %o  parameter (possibly to  array sizes in  the range
     10,000 to  20,000).  This reset reflects  the much larger
     number  of characters  relative  to the  number of  ASCII
     characters.

     RULES
     :  Once  you have defined  your terms, you can  write the
     rules section.  It contains strings and expressions to be
     matched in the input to yylex, and C commands to execute when
     a match is  made.  This section is required,  and it must
     be preceded by the delimiter  %%, whether or not you have
     a definitions  section.  The lex command  does not recog-
     nize your rules without this delimiter.

     In this section, the left  column contains the pattern to
     be  recognized in  an  input file  to  yylex.  The  right
     column contains the C program fragment executed when that
     pattern  is recognized.   Patterns  can include  extended
     characters with one exception:   these characters may not
     appear  in range  specifications  within character  class
     expressions surrounded  by square brackets.   The columns
     are  separated by  a tab.   For example,  if you  want to
     search files for the keyword "KEY", you might write:

            (KEY)   printf("found KEY");

     If you  include this rule  in file, the  lexical analyzer
     yylex  matches  the pattern  "KEY"  and  runs the  printf
     command.

     Each pattern may have a corresponding action, a C command
     to execute  when the pattern is  matched.  Each statement
     must  end with  a semicolon.   If you  use more  than one
     statement in an  action, you must enclose all  of them in
     braces.  A  second delimiter,  %%, must follow  the rules
     section if you have a user subroutine section.

     When  yylex matches  a  string in  the  input stream,  it
     copies the  matched string to an external character array,
     yytext,  before it  executes  any commands  in the  rules
     section.
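
     The order of events described above (copy the match first,
     then run the action) can be sketched in hand-written C.
     This is only an illustration, not lex output:  my_yytext
     stands in for yytext, and TEXT_MAX, action, and
     scan_for_key are hypothetical names.  Unlike this sketch,
     which silently skips unmatched input, a real lex scanner
     applies its own handling to characters that match no rule.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TEXT_MAX 200            /* hypothetical buffer size */

char my_yytext[TEXT_MAX];       /* stands in for yytext */

/* Hypothetical action: it sees the match only through my_yytext. */
static void action(void)
{
    printf("found %s\n", my_yytext);
}

/* Scans s for the keyword "KEY" and returns the number of matches. */
int scan_for_key(const char *s)
{
    static const char key[] = "KEY";
    const size_t klen = sizeof key - 1;
    int matches = 0;

    assert(s != NULL);
    while (*s != '\0') {
        if (strncmp(s, key, klen) == 0) {
            /* Copy the matched string first ... */
            memcpy(my_yytext, s, klen);
            my_yytext[klen] = '\0';
            /* ... and only then run the action. */
            action();
            matches++;
            s += klen;
        } else {
            s++;                /* skip unmatched input */
        }
    }
    return matches;
}
```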

     You can use the following operators to form patterns that
     you want to match:

     x      Matches  the  character  written.  x  matches  the
            literal character x.
     [ ]    Matches any  one character  in the  enclosed range
            ([.-.]) or the enclosed list ([...]).  [abcx-z]
            matches a, b, c, x, y, or z.
     " "    Matches the  enclosed character or string  even if
            it is  an operator.   "$" prevents lex  from inter-
            preting the character $ as an operator.
     \      Acts the  same as " ".   \$ also prevents  lex from
            interpreting the character $ as an operator.
     *      Matches zero or more  occurrences of the character
            immediately  preceding it.    x*  matches zero  or
            more repeated occurrences of x.
     +      Matches one  or more occurrences of  the character
            immediately preceding it.
     ?      Matches zero or one occurrence of the character
            immediately preceding it.
     ^      Matches the  character only at the  beginning of a
            line.  ^"x"  matches an  x at  the beginning  of a
            line.
     [^]    Matches any character except those following the ^.
            [^x] matches any character but x.
     .      Matches  any character  except the  new-line char-
            acter.
     $      Matches the end of a line.
     |      Matches either of two characters.  x|y matches
            either x or y.
     /      Matches  one character  only  when  followed by  a
            second character.   It reads only the  first char-
            acter into yytext.  x/y matches  x when it is fol-
            lowed by y, and reads x into yytext.
     ( )    Matches the  pattern in the parentheses.   This is
            used  for grouping.   It reads  the whole  pattern
            into yytext.   A group in parentheses  can be used
            in  place of  any  single character  in any  other
            pattern.  "(xyz123)" matches  the pattern "xyz123"
            and reads the whole string into yytext.
     {}     Matches  the character  as you  defined it  in the
            definitions  section.   If  you defined  D  to  be
            a numerical digit,  "{D}" matches any one numerical
            digit.
     {m,n}  Matches  m  to  n occurrences  of  the  character.
            x{2,4} matches 2, 3, or 4 occurrences of x.
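
     Several of these operators ([ ], *, +, ?, |, ( ), and
     {m,n}) have the same meaning in POSIX extended regular
     expressions, so the C library routines regcomp and regexec
     offer one way to experiment with them outside lex.  The
     sketch below is an analogy under that assumption, not part
     of lex:  the helper matches is hypothetical, and lex-only
     operators such as " ", /, and {D} are not covered.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: returns 1 if the whole of text matches
 * pattern, 0 if it does not, and -1 if the pattern is too long
 * or fails to compile. */
int matches(const char *pattern, const char *text)
{
    char anchored[256];
    regex_t re;
    int n, ok;

    assert(pattern != NULL && text != NULL);

    /* Anchor the pattern so that it must cover the whole text. */
    n = snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (n < 0 || (size_t)n >= sizeof anchored)
        return -1;

    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    ok = regexec(&re, text, 0, NULL, 0) == 0;
    regfree(&re);
    return ok;
}
```

     For example, matches("x{2,4}", "xxx") returns 1, while
     matches("x{2,4}", "x") and matches("[abcx-z]", "d") both
     return 0.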

     If a  line begins with a  blank, lex copies it  to the
     output file,  lex.yy.c.  If the  line is in  the defini-
     tions section of file, lex  copies it to the declarations
     section  of  lex.yy.c.   If  the line  is  in  the  rules
     section, lex  copies it  to the  program code  section of
     lex.yy.c.

     USER SUBROUTINES
     :   The  lex library  has  three  subroutines defined  as
     macros, which you can use in the rules.

     input( )       Reads a character from yyin.
     unput( )       Puts a  character back  on the  input after
                    it has been read.
     output( )      Writes an output character to yyout.

     You can override  these three macros by  writing your own
     code for these routines  in the user subroutines section.
     But if you write your own, you must undefine these macros
     in the definitions section as follows:

       %{
       #undef input
       #undef unput
       #undef output
       %}
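
     As an illustration of what replacement routines might look
     like, the following C sketch reads from a string in memory
     instead of yyin.  The calling conventions shown (input
     returning 0 at end of input, unput and output taking the
     character as an argument) are assumptions that should be
     checked against the generated lex.yy.c; set_source and
     PUSHBACK_MAX are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define PUSHBACK_MAX 16                 /* hypothetical limit */

static const char *src = "";            /* text still to be read */
static char pushback[PUSHBACK_MAX];     /* characters given to unput */
static int  npushed = 0;

/* Hypothetical helper: selects the string that input() reads. */
void set_source(const char *s)
{
    assert(s != NULL);
    src = s;
    npushed = 0;
}

/* Returns the next character, or 0 at end of input (assumed
 * convention).  Pushed-back characters are returned first. */
int input(void)
{
    if (npushed > 0)
        return (unsigned char)pushback[--npushed];
    if (*src == '\0')
        return 0;
    return (unsigned char)*src++;
}

/* Puts c back so that the next input() call returns it again. */
void unput(int c)
{
    assert(npushed < PUSHBACK_MAX);
    pushback[npushed++] = (char)c;
}

/* Writes one character to standard output. */
void output(int c)
{
    putchar(c);
}
```

     In a real lex source file these definitions would go in the
     user subroutines section, after the second %% delimiter.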

     There is no  main( ) in lex.yy.c because  the lex library
     contains the main( ) that calls yylex.  Therefore, if you
     do not  include main( ) in the  user subroutines section,
     when you compile lex.yy.c, you must enter cc lex.yy.c -ll,
     where -ll links in the lex library.

     External  names  generated  by  lex all  begin  with  the
     prefix yy, as in yyin, yyout, yylex, and yytext.

FLAGS

     -n Suppresses the statistics summary.   When you set your
        own table sizes for the finite state machine (described
        under DEFINITIONS), lex automatically produces this
        summary if you do not select this flag.
     -t Writes  lex.yy.c to  standard output  instead of  to a
        file.
     -v Provides a  one-line summary of the  generated finite-
        state-machine statistics.

FILES

     /usr/lib/libl.a    Run-time library.

RELATED INFORMATION

     The following command:  "yacc."

     The description  of lex in AIX  Operating System Program-
     ming Tools and Interfaces.

     "Overview of International Character Support" in Managing
     the AIX Operating System.

Typewritten Software • bear@typewritten.org • Edmonds, WA 98026