perfex(1) — IRIX 6.5.3f



PERFEX(1)                                                            PERFEX(1)



NAME
     perfex - a command line interface to R10000 counters

SYNOPSIS
     perfex [-a | -e event0 [-e event1]] [-mp | -s]
     [-x] [-y] [-t] [-o <file>] [-c <file>] [-l <nn>] command


DESCRIPTION
     The given command is executed; after it is complete, perfex prints the
     values of various hardware performance counters.  The counts returned are
     aggregated over all processes which are descendants of the target
     command, as long as their parent process controls the child through wait
     (see wait(2)).

     The integers event0 and event1 index this table:
          0 = Cycles
          1 = Issued instructions
          2 = Issued loads
          3 = Issued stores
          4 = Issued store conditionals
          5 = Failed store conditionals
          6 = Decoded branches
          7 = Quadwords written back from scache
          8 = Correctable scache data array ECC errors
          9 = Primary instruction cache misses
          10 = Secondary instruction cache misses
          11 = Instruction misprediction from scache way prediction table
          12 = External interventions
          13 = External invalidations
          14 = Virtual coherency conditions
          15 = Graduated instructions
          16 = Cycles
          17 = Graduated instructions
          18 = Graduated loads
          19 = Graduated stores
          20 = Graduated store conditionals
          21 = Graduated floating point instructions
          22 = Quadwords written back from primary data cache
          23 = TLB misses
          24 = Mispredicted branches
          25 = Primary data cache misses
          26 = Secondary data cache misses
          27 = Data misprediction from scache way prediction table
          28 = External intervention hits in scache
          29 = External invalidation hits in scache
          30 = Store/prefetch exclusive to clean block in scache
          31 = Store/prefetch exclusive to shared block in scache

BASIC OPTIONS
     -e event
          Specify an event to be counted.

          Zero, one, or two event specifiers may be given; the default is to
          count cycles.  Events may also be specified by setting one or both
          of the environment variables T5EVENT0 and T5EVENT1.  Command line
          event specifiers, if present, override these.  The order in which
          events are specified is not important.  The counts, together with
          an event description, are written to stderr unless redirected with
          the -o option.  Two events which must be counted on the same
          hardware counter (see r10kcounters(5)) will cause a conflicting
          counters error.
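
          The conflict rule can be illustrated with a small Python sketch.
          It assumes, consistent with the event table above and with the 16
          events per counter mentioned under -a, that events 0-15 are counted
          on one hardware counter and events 16-31 on the other;
          r10kcounters(5) remains the authoritative reference.  The function
          names are hypothetical and are not part of perfex.

```python
def counter_for(event):
    """Return the hardware counter (0 or 1) assumed to count this event."""
    if not 0 <= event <= 31:
        raise ValueError("event must be in the range 0..31")
    # Assumption: events 0-15 use counter 0, events 16-31 use counter 1.
    return event // 16


def events_conflict(event0, event1):
    """True if both events would need the same hardware counter."""
    return counter_for(event0) == counter_for(event1)


# -e 26 -e 10 uses one event from each assumed counter.
print(events_conflict(26, 10))
# -e 25 -e 26 would place both events on the same assumed counter.
print(events_conflict(25, 26))
```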

     -a   Multiplex over all events, projecting totals.  Ignore event
          specifiers.

          The option -a produces counts for all events by multiplexing over 16
          events per counter. The OS does the switching round robin at clock
          interrupt boundaries. The resulting counts are normalized by
          multiplying by 16 to give an estimate of the values they would have
          had for exclusive counting. Due to the equal-time nature of the
          multiplexing, it is true with high probability that any events
          present in large enough numbers to contribute significantly to the
          execution time will be fairly represented. Events concentrated in a
          few short regions (say, icache misses) may not be projected very
          accurately.
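
          The normalization step amounts to a single multiplication.  The
          sketch below is illustrative only; the raw count is a made-up value
          and the function name is hypothetical.

```python
EVENTS_PER_COUNTER = 16  # from the text: 16 events share each counter


def project_count(multiplexed_count):
    """Estimate the exclusive-counting total from a multiplexed count."""
    return multiplexed_count * EVENTS_PER_COUNTER


# Hypothetical raw count seen for one event during its share of the run.
raw = 1250000
print(project_count(raw))
```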

     -mp  Report per-thread counts for mp programs as well as (default)
          totals.

          By default perfex aggregates the counts of all the child threads and
          reports this number for each selected event. The -mp option causes
          the counters for each thread to be collected at thread exit time and
          printed out, followed by the counts aggregated across all threads.
          The per-thread counts are labeled by pid.

     -o <file>
          Redirect perfex output to the specified file.

     -s   Start (stop) counting on receipt of SIGUSR1 (SIGUSR2) by the perfex
          process.

          This option causes perfex to wait until it (i.e. the perfex process)
          receives a SIGUSR1, before it starts counting (for the child
          process, the target). It will stop counting if it receives a
          SIGUSR2. Repeated cycles of this will aggregate counts. If no
          SIGUSR2 is received, the counting will continue until the child
          exits (a normal case).  Note that counting for descendants of the
          child will not be affected.  Thus counting for mp programs cannot be
          controlled with this option.

     -x   Count at exception level (as well as the default user level).

          Exception level includes time spent on behalf of the user during,
          e.g., TLB refill exceptions.  Other counting modes (kernel,
          supervisor) are available through the OS ioctl interface (see
          r10kcounters(5)).

     To collect instruction and data scache miss counts for a program normally
     executed by
        % bar < bar.in > bar.out
     use
        % perfex -e 26 -e 10 bar < bar.in > bar.out


COST ESTIMATE OPTIONS
     -y   Report statistics and ranges of estimated times per event.

          Without the -y option, perfex reports the counts recorded by the
          R10000 event counters for the events requested. As these are simply
          raw counts, it is difficult to know by inspection which events are
          responsible for significant portions of the job's run time. The -y
          option associates an approximate time cost with some of the event
          counts.

          The reported times are approximate.  Due to the superscalar nature
          of the R10000, and its ability to hide latency, one cannot state a
          precise cost for a single occurrence of many of the events. Cache
          misses, for example, can be overlapped with other operations, so
          there is a wide range of times possible for any cache miss.

          To account for the fact that the cost of many events cannot be known
          precisely, perfex -y reports a range of time costs for each event.
          "Maximum," "minimum," and "typical" time costs are reported. Each is
          obtained by consulting an internal table which holds the "maximum,"
          "minimum," and "typical" costs for each event, and multiplying this
          cost by the count for the event. Event costs are usually measured in
          terms of machine cycles, and so the cost of an event generally
          depends on the clock speed of the processor, which is also reported
          in the output.

          The "maximum" value contained in the table corresponds to the worst
          case cost of a single occurrence of the event. Sometimes this can be
          a very pessimistic estimate. For example, the maximum cost for
          graduated floating point instructions assumes that all such
          instructions are double precision reciprocal square roots, since
          that is the most costly R10000 floating point instruction.

          Due to the latency-hiding capabilities of the R10000, the "minimum"
          cost of virtually any event could be zero since most events can be
          overlapped with other operations. To avoid simply reporting minimum
          costs of 0, which would be of no practical use, the "minimum" time
          reported by perfex -y corresponds to the best case cost of a single
          occurrence of the event. The "best case" cost is obtained by running
          the maximum number of simultaneous occurrences of that event and
          averaging the cost. For example, two floating point instructions can
          complete per cycle, so the best case cost is 0.5 cycles per floating
          point instruction.
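
          The table lookup and multiplication described above can be sketched
          in Python as follows.  Only the 0.5-cycle minimum for floating point
          instructions comes from this page; the typical and maximum costs,
          the event count, and the 195 MHz clock are placeholder assumptions,
          not values from the perfex cost table.

```python
def cost_range_seconds(count, min_clks, typ_clks, max_clks, clock_mhz):
    """Return (minimum, typical, maximum) estimated time in seconds."""
    seconds_per_cycle = 1.0 / (clock_mhz * 1e6)
    return tuple(count * clks * seconds_per_cycle
                 for clks in (min_clks, typ_clks, max_clks))


# Hypothetical: 10 million graduated floating point instructions.
# min_clks=0.5 is from the text; typ_clks and max_clks are placeholders.
lo, typ, hi = cost_range_seconds(10000000, 0.5, 1.0, 50.0, 195.0)
print(lo, typ, hi)
```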

          The "typical" cost falls somewhere between "minimum" and "maximum"
          and is meant to correspond to the cost one would expect to see in
          average programs. For example, to measure the "typical" cost of a
          cache miss, stride-1 accesses to an array too big to fit in cache
          were timed and the number of cache misses generated was counted. The
          same number of stride-1 accesses to an in-cache array were then
          timed. The difference in times corresponds to the cost of the cache
          misses, and this was used to calculate the average cost of a cache
          miss. This "typical" cost is lower than the worst case in which each
          cache miss cannot be overlapped, and it is higher than the best case
          in which several independent, and hence, overlapping, cache misses
          are generated.  (Note that on Origin systems, this methodology
          yields the time for L2 cache misses to local memory only.)
          Naturally, these "typical" costs are somewhat arbitrary.  If they do
          not seem right for the application being measured with perfex, they
          can be replaced by user-supplied values. See the -c option below.
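
          The measurement procedure described above reduces to a difference of
          two timings divided by a miss count.  In the sketch below the
          timings and the miss count are invented for illustration, and the
          function name is hypothetical.

```python
def typical_miss_cost(t_out_of_cache, t_in_cache, misses):
    """Average time per cache miss from two timed stride-1 sweeps."""
    if misses <= 0:
        raise ValueError("misses must be positive")
    # The in-cache sweep performs the same accesses without the misses,
    # so the time difference is attributed to the misses alone.
    return (t_out_of_cache - t_in_cache) / misses


# Hypothetical measurements, in seconds, and a hypothetical miss count.
cost = typical_miss_cost(2.40, 0.90, 5000000)
print(cost * 1e9, "nsec per miss")
```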

          perfex -y prints the event counts and associated cost estimates
          sorted from most costly to least costly. While resembling a
          profiling output, this is not a true profile. The event costs
          reported are only estimates. Furthermore, since events do overlap
          with each other, the sum of the estimated times will usually exceed
          the program's run time.  This output should only be used to identify
          which events are responsible for significant portions of the
          program's run time, and to get a rough idea of what those costs
          might be.

          With this in mind, the built-in cost table does not make an attempt
          to provide detailed costs for all events. Some events provide
          summary or redundant information. These events are assigned
          "minimum" and "typical" costs of 0 so that they sort to the bottom
          of the output.  The "maximum" costs are set to 1 cycle so that one
          can get an indication of the time corresponding to these events.
          "Issued instructions" and "graduated instructions" are examples of
          such events.  In addition to these summary or redundant events,
          detailed cost information has not been provided for a few other
          events such as "external interventions" and "external invalidations"
          since it is difficult to assign costs to these asynchronous events.
          The built-in cost values may be overridden by user-supplied values
          using the -c option below.

          In addition to the event counts and cost estimates, perfex -y also
          reports a number of statistics derived from the typical costs. The
          meaning of many of the statistics is self-evident, for example,
          graduated instructions/cycle. Below are listed those statistics
          whose definitions require more explanation:

     Data mispredict/Data scache hits

          This is the ratio of the counts for "Data misprediction from scache
          way prediction table" and "Secondary data cache misses."


     Instruction mispredict/Instruction scache hits

          This is the ratio of the counts for "Instruction misprediction from
          scache way prediction table" and "Secondary instruction cache
          misses."


     L1 Cache Line Reuse

          This is the number of times, on average, that a primary data cache
          line is used after it has been moved into the cache. It is
          calculated as "graduated loads" plus "graduated stores" minus
          "primary data cache misses," all divided by "primary data cache
          misses."


     L2 Cache Line Reuse

          This is the number of times, on average, that a secondary data cache
          line is used after it has been moved into the cache. It is
          calculated as "primary data cache misses" minus "secondary data
          cache misses," all divided by "secondary data cache misses."

     L1 Data Cache Hit Rate

          This is the fraction of data accesses which are satisfied from a
          cache line already resident in the primary data cache. It is
          calculated as 1.0 - ("primary data cache misses" divided by the sum
          of "graduated loads" and "graduated stores").

     L2 Data Cache Hit Rate

          This is the fraction of data accesses which are satisfied from a
          cache line already resident in the secondary data cache. It is
          calculated as 1.0 - ("secondary data cache misses" divided by
          "primary data cache misses").

     Time accessing memory/Total time

          This is the sum of the typical costs of "graduated loads,"
          "graduated stores," "primary data cache misses," "secondary data
          cache misses," and "TLB misses," divided by the total program run
          time. The total program run time is calculated by multiplying
          "cycles" by the time per cycle (inverse of the processor's clock
          speed).

     L1--L2 bandwidth used (MB/s, average per process)

          This is the amount of data moved between the primary and secondary
          data caches, divided by the total program run time. The amount of
          data moved is calculated as the sum of the number of "primary data
          cache misses" multiplied by the primary cache line size and the
          number of "quadwords written back from primary data cache"
          multiplied by the size of a quadword (16 bytes).  For multiprocess
          programs, the resulting figure is a per-process average since the
          counts measured by perfex are aggregates of the counts for all the
          threads. One needs to multiply by the number of threads to get the
          total program bandwidth.

     Memory bandwidth used (MB/s, average per process)

          This is the amount of data moved between the secondary data cache
          and main memory, divided by the total program run time. The amount
          of data moved is calculated as the sum of the number of "secondary
          data cache misses" multiplied by the secondary cache line size and
          the number of "quadwords written back from secondary data cache"
          multiplied by the size of a quadword (16 bytes).  For multiprocess
          programs, the resulting figure is a per-process average since the
          counts measured by perfex are aggregates of the counts for all the
          threads. One needs to multiply by the number of threads to get the
          total program bandwidth.

     MFLOPS (average per process)

          This is the ratio of the "graduated floating point instructions" and
          the total program run time. Note that while a multiply-add carries
          out two floating point operations, it only counts as one
          instruction, so this statistic may underestimate the number of
          floating point operations per second. For multiprocess programs, the
          resulting figure is a per-process average since the counts measured
          by perfex are aggregates of the counts for all the threads. One
          needs to multiply by the number of threads to get the total program
          rate.

     A statistic is only printed if counts for the events which define it have
     been gathered.
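
     The formulas for several of the statistics above can be written out
     directly.  The Python sketch below is illustrative only: every count is
     hypothetical, the 32-byte line size and the 195 MHz clock are assumed
     parameters, and 1 MB is taken to mean 10^6 bytes.  The function names are
     not part of perfex.

```python
def l1_line_reuse(grad_loads, grad_stores, l1d_misses):
    return (grad_loads + grad_stores - l1d_misses) / l1d_misses


def l2_line_reuse(l1d_misses, l2d_misses):
    return (l1d_misses - l2d_misses) / l2d_misses


def l1_hit_rate(grad_loads, grad_stores, l1d_misses):
    return 1.0 - l1d_misses / (grad_loads + grad_stores)


def l2_hit_rate(l1d_misses, l2d_misses):
    return 1.0 - l2d_misses / l1d_misses


def bandwidth_mb_per_s(misses, line_bytes, quadwords_written_back, run_time_s):
    # A quadword is 16 bytes; 1 MB is assumed here to be 10**6 bytes.
    bytes_moved = misses * line_bytes + quadwords_written_back * 16
    return bytes_moved / run_time_s / 1e6


def mflops(grad_fp_instructions, run_time_s):
    return grad_fp_instructions / run_time_s / 1e6


# Hypothetical counts for a single-process run on an assumed 195 MHz CPU.
cycles = 1950000000
run_time_s = cycles / (195.0 * 1e6)  # "cycles" times the time per cycle
loads, stores = 400000000, 100000000
l1_misses, l2_misses = 25000000, 5000000

print(l1_line_reuse(loads, stores, l1_misses))
print(l2_line_reuse(l1_misses, l2_misses))
print(l1_hit_rate(loads, stores, l1_misses))
print(l2_hit_rate(l1_misses, l2_misses))
# Assumed 32-byte primary line and 10 million quadwords written back.
print(bandwidth_mb_per_s(l1_misses, 32, 10000000, run_time_s))
print(mflops(500000000, run_time_s))
```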


     -c <file>
          Load a cost table from <file> (requires -y).

          This option allows one to override the internal event costs used by
          the -y option. <file> contains the list of event costs which are to
          be overridden. This <file> needs to be in the same format as the
          output produced by the -t option. Costs may be specified in units of
          "clks" (machine cycles) or nsec (nanoseconds). One may override all
          or only a subset of the default costs.

          One may also use the file /etc/perfex.costs to override event costs.
          If this file exists, any costs listed in it will override those
          built into perfex. Costs supplied via the -c option will override
          those provided by the /etc/perfex.costs file.
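
          The precedence among the three cost sources can be modeled as
          successive dictionary updates.  This Python sketch only mirrors the
          ordering stated above; the event names are taken from the event
          table, but the cost values are invented and the real file format is
          whatever the -t option prints.

```python
def effective_costs(builtin, etc_file=None, c_option=None):
    """Merge cost tables; later sources override earlier ones."""
    costs = dict(builtin)
    # Order matters: /etc/perfex.costs first, then the -c file on top.
    for override in (etc_file, c_option):
        if override:
            costs.update(override)
    return costs


# Hypothetical typical costs in clks.
builtin = {"TLB misses": 50.0, "Mispredicted branches": 5.0}
etc = {"TLB misses": 60.0}
cli = {"TLB misses": 70.0}

print(effective_costs(builtin, etc, cli))
```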


     -t   Print the cost table used for perfex -y cost estimates to stdout.

          These internal costs may be overridden by specifying different
          values in the file /etc/perfex.costs, or by using the -c <file>
          option. Both <file> and /etc/perfex.costs must use the format
          produced by the -t option. It is recommended that one capture this
          output to a file and edit it to create a suitable file for
          /etc/perfex.costs or the -c option. One does not have to specify
          costs for every event, however.  Lines corresponding to events whose
          values one does not wish to override may simply be deleted from the
          file.

FILES
     /etc/perfex.costs


DEPENDENCIES
     perfex only works on an R10000 system. For the -mp option, only binaries
     linked -shared are currently supported.  This is due to a dependency on
     libperfex.so.  The options -s and -mp are currently mutually exclusive.


LIMITATIONS
     The signal control interface (-s) can control only the immediate target
     process, not any of its descendants.  This makes it unusable with
     multiprocess targets in their parallel regions.


SEE ALSO
     r10kcounters(5), libperfex(3), time(1), timex(1)