PERFEX(1) PERFEX(1)
NAME
perfex - a command line interface to R10000 counters
SYNOPSIS
perfex [-a | -e event0 [-e event1]] [-mp | -s]
[-x] [-y] [-t] [-o <file>] [-c <file>] [-l <nn>] command
DESCRIPTION
The given command is executed; after it is complete, perfex prints the
values of various hardware performance counters. The counts returned are
aggregated over all processes which are descendants of the target
command, as long as their parent process controls the child through wait
(see wait(2)).
The integers event0 and event1 index this table:
0 = Cycles
1 = Issued instructions
2 = Issued loads
3 = Issued stores
4 = Issued store conditionals
5 = Failed store conditionals
6 = Decoded branches
7 = Quadwords written back from scache
8 = Correctable scache data array ECC errors
9 = Primary instruction cache misses
10 = Secondary instruction cache misses
11 = Instruction misprediction from scache way prediction table
12 = External interventions
13 = External invalidations
14 = Virtual coherency conditions
15 = Graduated instructions
16 = Cycles
17 = Graduated instructions
18 = Graduated loads
19 = Graduated stores
20 = Graduated store conditionals
21 = Graduated floating point instructions
22 = Quadwords written back from primary data cache
23 = TLB misses
24 = Mispredicted branches
25 = Primary data cache misses
26 = Secondary data cache misses
27 = Data misprediction from scache way prediction table
28 = External intervention hits in scache
29 = External invalidation hits in scache
30 = Store/prefetch exclusive to clean block in scache
31 = Store/prefetch exclusive to shared block in scache
BASIC OPTIONS
-e event
Specify an event to be counted.
Zero, one, or two event specifiers may be given; by default, cycles
are counted. Events may also be specified by setting one or
both of the environment variables T5EVENT0 and T5EVENT1. Command
line event specifiers if present will override these. The order of
events specified is not important. The counts, together with an
event description are written to stderr, unless redirected with the
-o option. Two events which must be counted on the same hardware
counter (see r10kcounters(5)) will cause a conflicting counters
error.
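The conflict rule can be sketched as follows. This is an illustration only: it assumes that events 0-15 are counted on one hardware counter and events 16-31 on the other, which is consistent with the 16-events-per-counter multiplexing described for -a but should be checked against r10kcounters(5). The function names are hypothetical and are not part of perfex.

```python
def counter_for(event):
    """Return the hardware counter (0 or 1) assumed to count `event`."""
    if not 0 <= event <= 31:
        raise ValueError(f"event must be in 0..31, got {event}")
    # Assumption: events 0-15 belong to counter 0, events 16-31 to counter 1.
    return event // 16


def conflicts(event0, event1):
    """True if both events would need the same hardware counter."""
    return counter_for(event0) == counter_for(event1)
```

Under this assumption, events 26 and 10 can be counted together, while events 25 and 26 would produce a conflicting counters error.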
-a Multiplex over all events, projecting totals. Ignore event
specifiers.
The option -a produces counts for all events by multiplexing over 16
events per counter. The OS does the switching round robin at clock
interrupt boundaries. The resulting counts are normalized by
multiplying by 16 to give an estimate of the values they would have
had for exclusive counting. Due to the equal-time nature of the
multiplexing, it is true with high probability that any events
present in large enough numbers to contribute significantly to the
execution time will be fairly represented. Events concentrated in a
few short regions (say, icache misses) may not be projected very
accurately.
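The projection can be illustrated with a toy model. The sketch below assumes a simplified schedule in which each event is observed during exactly one clock tick out of every 16, and the observed count is multiplied by 16. The function name and the sample data are hypothetical; the real switching is done by the OS.

```python
def multiplexed_estimate(per_tick_counts, slot, n_events=16):
    """Estimate a run total from the ticks in which `slot` was being counted.

    per_tick_counts[i] is the true number of occurrences during clock tick i.
    The event is observed only on ticks where i % n_events == slot
    (round-robin switching), and the observed sum is multiplied by n_events.
    """
    observed = sum(
        count for tick, count in enumerate(per_tick_counts)
        if tick % n_events == slot
    )
    return observed * n_events


uniform = [100] * 160      # event spread evenly over the run
bursty = [0] * 160
bursty[3] = 16000          # same total, concentrated in one tick

print(multiplexed_estimate(uniform, slot=5))  # matches the true total of 16000
print(multiplexed_estimate(bursty, slot=5))   # misses the burst entirely
print(multiplexed_estimate(bursty, slot=3))   # overestimates by a factor of 16
```

The evenly spread event is projected exactly, while the concentrated one is either missed or overestimated depending on which tick happened to be sampled.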
-mp Report per-thread counts for mp programs as well as (default)
totals.
By default perfex aggregates the counts of all the child threads and
reports this number for each selected event. The -mp option causes
the counters for each thread to be collected at thread exit time and
printed out, followed by the counts aggregated across all threads.
The per-thread counts are labeled by pid.
-o <file>
Redirect perfex output to the specified file.
-s Start (stop) counting when the perfex process receives SIGUSR1
(SIGUSR2).
This option causes perfex to wait until it (i.e. the perfex process)
receives a SIGUSR1, before it starts counting (for the child
process, the target). It will stop counting if it receives a
SIGUSR2. Repeated cycles of this will aggregate counts. If no
SIGUSR2 is received, the counting will continue until the child
exits (a normal case). Note that counting for descendants of the
child will not be affected. Thus counting for mp programs cannot be
controlled with this option.
-x Count at exception level (as well as the default user level).
Exception level includes time spent on behalf of the user during,
e.g., TLB refill exceptions. Other counting modes (kernel,
supervisor) are available through the OS ioctl interface (see
r10kcounters(5)).
To collect instruction and data scache miss counts for a program normally
executed by
% bar < bar.in > bar.out
run
% perfex -e 26 -e 10 bar < bar.in > bar.out
COST ESTIMATE OPTIONS
-y Report statistics and ranges of estimated times per event.
Without the -y option, perfex reports the counts recorded by the
R10000 event counters for the events requested. As these are simply
raw counts, it is difficult to know by inspection which events are
responsible for significant portions of the job's run time. The -y
option associates an approximate time cost with some of the event
counts.
The reported times are approximate. Due to the superscalar nature
of the R10000, and its ability to hide latency, one cannot state a
precise cost for a single occurrence of many of the events. Cache
misses, for example, can be overlapped with other operations, so
there is a wide range of times possible for any cache miss.
To account for the fact that the cost of many events cannot be known
precisely, perfex -y reports a range of time costs for each event.
"Maximum," "minimum," and "typical" time costs are reported. Each is
obtained by consulting an internal table which holds the "maximum,"
"minimum," and "typical" costs for each event, and multiplying this
cost by the count for the event. Event costs are usually measured in
terms of machine cycles, and so the cost of an event generally
depends on the clock speed of the processor, which is also reported
in the output.
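The calculation described above can be sketched as follows. The per-event costs and the clock speed used in the example are made-up values, not entries from the built-in table, and the function names are hypothetical.

```python
def estimated_seconds(count, cost_cycles, clock_mhz):
    """Time attributed to an event: count times per-event cost, in seconds."""
    cycles = count * cost_cycles
    return cycles / (clock_mhz * 1e6)


def cost_range(count, costs_cycles, clock_mhz):
    """Return (minimum, typical, maximum) time estimates in seconds.

    `costs_cycles` is a (minimum, typical, maximum) tuple of per-event
    costs in machine cycles.
    """
    return tuple(estimated_seconds(count, c, clock_mhz) for c in costs_cycles)


# Hypothetical example: 1,000,000 events costing 1, 2, or 4 cycles each
# on a 200 MHz processor.
low, typical, high = cost_range(1_000_000, (1.0, 2.0, 4.0), clock_mhz=200)
print(low, typical, high)
```

Because the costs are expressed in cycles, the same counts yield shorter estimated times on a faster clock.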
The "maximum" value contained in the table corresponds to the worst
case cost of a single occurrence of the event. Sometimes this can be
a very pessimistic estimate. For example, the maximum cost for
graduated floating point instructions assumes that all such
instructions are double precision reciprocal square roots, since
that is the most costly R10000 floating point instruction.
Due to the latency-hiding capabilities of the R10000, the "minimum"
cost of virtually any event could be zero since most events can be
overlapped with other operations. To avoid simply reporting minimum
costs of 0, which would be of no practical use, the "minimum" time
reported by perfex -y corresponds to the best case cost of a single
occurrence of the event. The "best case" cost is obtained by running
the maximum number of simultaneous occurrences of that event and
averaging the cost. For example, two floating point instructions can
complete per cycle, so the best case cost is 0.5 cycles per floating
point instruction.
The "typical" cost falls somewhere between "minimum" and "maximum"
and is meant to correspond to the cost one would expect to see in
average programs. For example, to measure the "typical" cost of a
cache miss, stride-1 accesses to an array too big to fit in cache
were timed and the number of cache misses generated was counted. The
same number of stride-1 accesses to an in-cache array were then
timed. The difference in times corresponds to the cost of the cache
misses, and this was used to calculate the average cost of a cache
miss. This "typical" cost is lower than the worst case in which each
cache miss cannot be overlapped, and it is higher than the best case
in which several independent, and hence, overlapping, cache misses
are generated. (Note that on Origin systems, this methodology
yields the time for L2 cache misses to local memory only.)
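The measurement described above reduces to a simple difference of timings. The sketch below is a hypothetical illustration of that arithmetic; the function name and the numbers in the example are assumptions, not the values used to build the perfex cost table.

```python
def typical_miss_cost_cycles(t_out_of_cache, t_in_cache, misses, clock_mhz):
    """Average cost of one cache miss, in machine cycles.

    t_out_of_cache: seconds for stride-1 accesses to an array too big for cache
    t_in_cache:     seconds for the same number of accesses to an in-cache array
    misses:         cache misses counted during the out-of-cache run
    """
    if misses <= 0:
        raise ValueError("misses must be positive")
    extra_seconds = t_out_of_cache - t_in_cache
    return extra_seconds * clock_mhz * 1e6 / misses


# Hypothetical timings: 1.5 s versus 0.5 s, 2,000,000 misses, 200 MHz clock.
print(typical_miss_cost_cycles(1.5, 0.5, 2_000_000, clock_mhz=200))
```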
Naturally, these "typical" costs are somewhat arbitrary. If they do
not seem right for the application being measured with perfex, they
can be replaced by user-supplied values. See the -c option below.
perfex -y prints the event counts and associated cost estimates
sorted from most costly to least costly. While resembling a
profiling output, this is not a true profile. The event costs
reported are only estimates. Furthermore, since events do overlap
with each other, the sum of the estimated times will usually exceed
the program's run time. This output should only be used to identify
which events are responsible for significant portions of the
program's run time, and to get a rough idea of what those costs
might be.
With this in mind, the built-in cost table does not make an attempt
to provide detailed costs for all events. Some events provide
summary or redundant information. These events are assigned
"minimum" and "typical" costs of 0 so that they sort to the bottom
of the output. The "maximum" costs are set to 1 cycle so that one
can get an indication of the time corresponding to these events.
"Issued instructions" and "graduated instructions" are examples of
such events. In addition to these summary or redundant events,
detailed cost information has not been provided for a few other
events such as "external interventions" and "external invalidations"
since it is difficult to assign costs to these asynchronous events.
The built-in cost values may be overridden by user-supplied values
using the -c option below.
In addition to the event counts and cost estimates, perfex -y also
reports a number of statistics derived from the typical costs. The
meaning of many of the statistics is self-evident, for example,
graduated instructions/cycle. Below are listed those statistics
whose definitions require more explanation:
Data mispredict/Data scache hits
This is the ratio of the counts for "Data misprediction from scache
way prediction table" and "Secondary data cache misses."
Instruction mispredict/Instruction scache hits
This is the ratio of the counts for "Instruction misprediction from
scache way prediction table" and "Secondary instruction cache
misses."
L1 Cache Line Reuse
This is the number of times, on average, that a primary data cache
line is used after it has been moved into the cache. It is
calculated as "graduated loads" plus "graduated stores" minus
"primary data cache misses," all divided by "primary data cache
misses."
L2 Cache Line Reuse
This is the number of times, on average, that a secondary data cache
line is used after it has been moved into the cache. It is
calculated as "primary data cache misses" minus "secondary data
cache misses," all divided by "secondary data cache misses."
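The two reuse statistics can be written out directly from the definitions above. This is a minimal sketch; the function names are hypothetical and the counts in the example are invented.

```python
def l1_line_reuse(grad_loads, grad_stores, l1_misses):
    """(graduated loads + graduated stores - L1 misses) / L1 misses."""
    return (grad_loads + grad_stores - l1_misses) / l1_misses


def l2_line_reuse(l1_misses, l2_misses):
    """(L1 data cache misses - L2 data cache misses) / L2 data cache misses."""
    return (l1_misses - l2_misses) / l2_misses


# Invented counts: 700 loads, 300 stores, 100 L1 misses, 20 L2 misses.
print(l1_line_reuse(700, 300, 100))
print(l2_line_reuse(100, 20))
```

With these counts, each primary cache line is reused 9 times on average and each secondary cache line 4 times.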
L1 Data Cache Hit Rate
This is the fraction of data accesses which are satisfied from a
cache line already resident in the primary data cache. It is
calculated as 1.0 - ("primary data cache misses" divided by the sum
of "graduated loads" and "graduated stores").
L2 Data Cache Hit Rate
This is the fraction of data accesses which are satisfied from a
cache line already resident in the secondary data cache. It is
calculated as 1.0 - ("secondary data cache misses" divided by
"primary data cache misses").
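The hit rates follow the same pattern. The sketch below transcribes the two definitions above; the function names are hypothetical and the counts are invented for illustration.

```python
def l1_hit_rate(grad_loads, grad_stores, l1_misses):
    """1.0 - L1 misses / (graduated loads + graduated stores)."""
    return 1.0 - l1_misses / (grad_loads + grad_stores)


def l2_hit_rate(l1_misses, l2_misses):
    """1.0 - L2 data cache misses / L1 data cache misses."""
    return 1.0 - l2_misses / l1_misses


# Invented counts: 700 loads, 300 stores, 100 L1 misses, 20 L2 misses.
print(l1_hit_rate(700, 300, 100))
print(l2_hit_rate(100, 20))
```

Note that the L2 rate is relative to accesses that missed in the primary cache, not to all data accesses issued by the program.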
Time accessing memory/Total time
This is the sum of the typical costs of "graduated loads,"
"graduated stores," "primary data cache misses," "secondary data
cache misses," and "TLB misses," divided by the total program run
time. The total program run time is calculated by multiplying
"cycles" by the time per cycle (inverse of the processor's clock
speed).
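As a sketch of this ratio, the function below sums the typical-cost times of the five memory events and divides by the run time derived from the cycle count. The dictionary keys, the function name, and the sample values are hypothetical.

```python
MEMORY_EVENTS = (
    "graduated loads",
    "graduated stores",
    "primary data cache misses",
    "secondary data cache misses",
    "TLB misses",
)


def memory_time_fraction(typical_seconds, cycles, clock_mhz):
    """Fraction of run time attributed to memory access.

    typical_seconds maps an event name to its typical-cost time in seconds.
    Total run time is cycles multiplied by the time per cycle.
    """
    total_seconds = cycles / (clock_mhz * 1e6)
    memory_seconds = sum(typical_seconds[name] for name in MEMORY_EVENTS)
    return memory_seconds / total_seconds


# Invented values: 400,000,000 cycles at 200 MHz is a 2 second run.
sample = {
    "graduated loads": 0.2,
    "graduated stores": 0.1,
    "primary data cache misses": 0.3,
    "secondary data cache misses": 0.3,
    "TLB misses": 0.1,
}
print(memory_time_fraction(sample, cycles=400_000_000, clock_mhz=200))
```

Since the typical costs are estimates and events overlap, the result is a rough indicator rather than a measured fraction.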
L1--L2 bandwidth used (MB/s, average per process)
This is the amount of data moved between the primary and secondary
data caches, divided by the total program run time. The amount of
data moved is calculated as the sum of the number of "primary data
cache misses" multiplied by the primary cache line size and the
number of "quadwords written back from primary data cache"
multiplied by the size of a quadword (16 bytes). For multiprocess
programs, the resulting figure is a per-process average since the
counts measured by perfex are aggregates of the counts for all the
threads. One needs to multiply by the number of threads to get the
total program bandwidth.
Memory bandwidth used (MB/s, average per process)
This is the amount of data moved between the secondary data cache
and main memory, divided by the total program run time. The amount
of data moved is calculated as the sum of the number of "secondary
data cache misses" multiplied by the secondary cache line size and
the number of "quadwords written back from secondary data cache"
multiplied by the size of a quadword (16 bytes). For multiprocess
programs, the resulting figure is a per-process average since the
counts measured by perfex are aggregates of the counts for all the
threads. One needs to multiply by the number of threads to get the
total program bandwidth.
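Both bandwidth figures use the same arithmetic with different counts and line sizes. The sketch below assumes 1 MB = 10^6 bytes, which this page does not state, and takes the cache line size as a parameter because it depends on the system. The function names and the example values (including the 32-byte line size) are hypothetical.

```python
QUADWORD_BYTES = 16  # size of a quadword, as stated in the text


def bandwidth_mb_per_s(misses, line_bytes, quadwords_written_back, run_seconds):
    """Per-process bandwidth in MB/s (assuming 1 MB = 1e6 bytes).

    Data moved = misses * cache line size + quadwords written back * 16.
    """
    bytes_moved = misses * line_bytes + quadwords_written_back * QUADWORD_BYTES
    return bytes_moved / run_seconds / 1e6


def total_bandwidth(per_process_mb_per_s, n_threads):
    """Scale the per-process average to the whole program."""
    return per_process_mb_per_s * n_threads


# Invented example: 1,000,000 misses with a 32-byte line, 500,000 quadwords
# written back, 2 second run, 4 threads.
per_process = bandwidth_mb_per_s(1_000_000, 32, 500_000, 2.0)
print(per_process, total_bandwidth(per_process, 4))
```

For the L1--L2 figure, pass the primary data cache misses, the primary line size, and the quadwords written back from the primary data cache; for the memory figure, pass the corresponding secondary cache values.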
MFLOPS (average per process)
This is the ratio of the "graduated floating point instructions" and
the total program run time. Note that while a multiply-add carries
out two floating point operations, it only counts as one
instruction, so this statistic may underestimate the number of
floating point operations per second. For multiprocess programs, the
resulting figure is a per-process average since the counts measured
by perfex are aggregates of the counts for all the threads. One
needs to multiply by the number of threads to get the total program
rate.
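A minimal sketch of this statistic, using the run time definition given earlier (cycles multiplied by the time per cycle). The function name and the sample counts are hypothetical.

```python
def mflops(grad_fp_instructions, cycles, clock_mhz):
    """Millions of graduated floating point instructions per second."""
    run_seconds = cycles / (clock_mhz * 1e6)
    return grad_fp_instructions / run_seconds / 1e6


# Invented counts: 100,000,000 floating point instructions in
# 400,000,000 cycles on a 200 MHz processor (a 2 second run).
print(mflops(100_000_000, 400_000_000, clock_mhz=200))
```

If every one of those instructions were a multiply-add, the true operation rate would be twice the reported figure.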
A statistic is only printed if counts for the events which define it have
been gathered.
-c <file>
Load a cost table from <file> (requires -y).
This option allows one to override the internal event costs used by
the -y option. <file> contains the list of event costs which are to
be overridden. This <file> needs to be in the same format as the
output produced by the -t option. Costs may be specified in units of
"clks" (machine cycles) or "nsec" (nanoseconds). One may override all
or only a subset of the default costs.
One may also use the file /etc/perfex.costs to override event costs.
If this file exists, any costs listed in it will override those
built into perfex. Costs supplied via the -c option will override
those provided by the /etc/perfex.costs file.
-t Print the cost table used for perfex -y cost estimates to stdout.
These internal costs may be overridden by specifying different
values in the file /etc/perfex.costs, or by using the -c <file>
option. Both <file> and /etc/perfex.costs must use the format
produced by the -t option. It is recommended that one capture this
output to a file and edit it to create a suitable file for
/etc/perfex.costs or the -c option. One does not have to specify
costs for every event, however. Lines corresponding to events whose
values one does not wish to override may simply be deleted from the
file.
FILES
/etc/perfex.costs
DEPENDENCIES
perfex only works on an R10000 system. The -mp option currently supports
only binaries linked -shared. This is due to a
dependency on libperfex.so. The options -s and -mp are currently
mutually exclusive.
LIMITATIONS
The signal control interface (-s) can control only the immediate target
process, not any of its descendants. This makes it unusable with
multiprocess targets in their parallel regions.
SEE ALSO
r10kcounters(5), libperfex(3), time(1), timex(1)