availmon(5) availmon(5)
NAME
availmon - overview of system availability monitoring facilities
DESCRIPTION
The availability monitor (availmon) is a set of programs that
collectively monitor and report the availability of a system and the
diagnosis of system crashes. For unexpected reboots, availmon identifies
the cause of the reboot by gathering diagnostic information from programs
such as icrash(1M) (which includes results from the FRU analyzer when
available) and from syslog (see syslog(3C)), together with system
configuration information from versions(1M), hinv(1M) and gfxinfo(1G).
Availmon can send availability and diagnostic information to various
locations, depending on configuration; it can provide local system
availability statistics and reboot history reporting, and it can provide
limited site-management facilities by collecting availmon information
from a set of systems into a single log file, and then reporting on the
composite system availability data. It also provides immediate
notification of important system log messages (those logged in
/var/adm/SYSLOG) by passing them through a syslogd(1M) filter.
All availmon capabilities are configurable using amconfig(1M). By
default, availmon does not automatically send reports on reboot; in all
cases, amregister(1M) must be run to enable automatic distribution of
reports. Apart from that, most configurable options are enabled for
high-end platforms/servers and disabled for low-end
platforms/workstations.
Availmon reporting centers around event records. Any system reboot is an
availmon event, whether a controlled shutdown or an "unscheduled" reboot,
such as a power interruption or a "crash". An event record contains the
time at which the system was previously booted, which starts the event
period, the time the event occurred, which ends the period of "uptime",
the reason for the event, and the time that the system was rebooted. If
the system stopped as a result of a hang, the exact instant at which it
stopped is not easily known; this time is estimated by a (configurable)
ticker daemon (see amtickerd(1M)).
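The fields of an event record described above can be pictured with a
small sketch. This is an illustration only, not availmon's actual record
layout: the class name, field names, and the downtime calculation are
assumptions made for the example, and times are taken to be seconds
since Jan 1, 1970.

```python
from dataclasses import dataclass


@dataclass
class EventRecord:
    """Hypothetical in-memory view of one availmon event (names assumed)."""
    prev_start: int          # time the system was previously booted
    event_time: int          # time the event occurred; ends the uptime period
    reason: str              # reason for the event
    restart: int             # time the system was rebooted
    estimated: bool = False  # True if event_time is a ticker estimate

    @property
    def uptime(self) -> int:
        # The event period runs from the previous boot to the event.
        return self.event_time - self.prev_start

    @property
    def downtime(self) -> int:
        # Assumed definition: gap between the event and the next boot.
        return self.restart - self.event_time


def record_after_hang(prev_start, last_tick, restart, reason="system reset"):
    """After a hang the stop time is unknown, so use the last ticker value."""
    return EventRecord(prev_start, last_tick, reason, restart, estimated=True)


rec = record_after_hang(prev_start=1000, last_tick=4600, restart=5000)
print(rec.uptime, rec.downtime, rec.estimated)
```

The accuracy of an estimated event time depends on how often the ticker
daemon records a value, which is why the ticker is configurable.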
Events are grouped as either "Service Action" events, or "Unscheduled"
events. Service Action events are controlled shutdowns, initiated by
operators through shutdown(1M), halt(1M) or init(1M). For such
controlled shutdowns, a (configurable) prompt is given to identify the
reason for the shutdown. Unscheduled events include system panics, power
failures, power cycles, and system resets (usually due to system hangs).
Panics are classified as due to hardware, due to software, or due to
unknown reasons. This distinction is based strictly on the results of the
FRU analyzer, if present.
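The grouping described above can be sketched as follows. The cause
strings and function names are hypothetical and do not correspond to
availmon's real event codes. Treating a missing FRU analyzer result as
"unknown" is an assumption of this sketch.

```python
from typing import Optional

# Hypothetical cause labels; availmon's real event codes are not shown here.
UNSCHEDULED_CAUSES = {"panic", "power failure", "power cycle", "system reset"}


def event_group(cause: str) -> str:
    """Map a cause to one of the two event groups."""
    if cause == "controlled shutdown":
        return "Service Action"
    if cause in UNSCHEDULED_CAUSES:
        return "Unscheduled"
    raise ValueError(f"unrecognized cause: {cause!r}")


def panic_kind(fru_result: Optional[str]) -> str:
    """Classify a panic strictly from the FRU analyzer result, if any."""
    if fru_result in ("hardware", "software"):
        return fru_result
    # Assumption: no analyzer, or an inconclusive result, means "unknown".
    return "unknown"


print(event_group("power cycle"), panic_kind(None))
```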
Availmon generates three types of reports: availability, diagnosis and
pager. Availability reports consist of the system serial number, full
hostname/internet address, the previous system start time, the time of
the event, the reason for the event (the event code), uptime, start time
(following the reboot), and a summary of the reason for the event where
relevant.
Diagnosis reports include all data from an availability report, and
additionally may contain the icrash analysis report, FRU analyzer result,
important syslog messages, and system hardware/software configuration and
version information. Important syslog messages include error messages
and all messages logged by sysctlrd and syslogd, since the last reboot.
Duplicated messages are eliminated even if not consecutive; the first
such message is retained with its time stamp, and the number of
duplicated messages and the last time stamp are appended. System
software version information is limited to version output for the
operating system and installed patches.
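The duplicate elimination described above can be sketched as follows.
This is a minimal illustration, not availmon's implementation: the
function name, the input format of (time stamp, message) pairs, and the
bracketed output format are all assumptions. The count shown is the
number of repeats after the first occurrence.

```python
def collapse_duplicates(entries):
    """Collapse repeated messages, consecutive or not.

    entries: iterable of (timestamp, message) pairs in log order.
    Returns one line per distinct message, in order of first appearance.
    """
    seen = {}  # message -> [first_ts, last_ts, occurrences]
    for ts, msg in entries:
        if msg in seen:
            seen[msg][1] = ts   # remember the most recent time stamp
            seen[msg][2] += 1
        else:
            seen[msg] = [ts, ts, 1]

    lines = []
    for msg, (first, last, count) in seen.items():
        if count == 1:
            lines.append(f"{first} {msg}")
        else:
            # Keep the first time stamp; append repeat count and last stamp.
            lines.append(f"{first} {msg} [{count - 1} duplicates, last at {last}]")
    return lines


log = [
    ("10:00", "disk error"),
    ("10:01", "link down"),
    ("10:05", "disk error"),
    ("10:09", "disk error"),
]
for line in collapse_duplicates(log):
    print(line)
# 10:00 disk error [2 duplicates, last at 10:09]
# 10:01 link down
```

Keying on the message text rather than comparing adjacent lines is what
allows non-consecutive duplicates to be merged.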
Pager reports are intended for "chatty pagers", and include only the
system hostname, a brief description of the reason for the event, and the
summary, if present.
Availability information for the local system is always permanently
stored in /var/adm/avail/availlog. Files in /var/adm/avail are
maintained by availmon and should not be deleted, modified, or moved.
The most recent reports are stored as availreport, diagreport and
pagerreport in the directory /var/adm/crash/. In addition, reports for
single-user events are stored under the same names with the suffix .su.
A copy of syslog messages is stored in /var/adm/avail/AMSYSTEMLOG, which
is rotated at regular intervals by a cron job (see crontab(1)).
CONFIGURATION
Once availmon is installed, "registration" is required before availmon
reports are automatically distributed, and configuration of local options
may also be desired. The most important configuration option is
autoemail, which enables automatic distribution. Normally,
amregister(1M) is run to accomplish initial email configuration and to
set autoemail to on. See also amconfig(1M) for detailed explanation of
availmon's configurable options and their exact default values.
Registration of a system can normally be accomplished simply by running:
     amregister -r
This assumes that the default configuration is acceptable and that the
local system is a relatively recent platform, where the system serial
number is machine readable (see amsysinfo(1M) for an exact list). For
the case where the serial number is not machine readable, see
amregister(1M) for configuration details. The default distribution of
email reports is to send a diagnosis report to availmon@csd.sgi.com,
which enters the report into the SGI Availmon Database.
Applying a common configuration to multiple systems is easily
accomplished by using amconfig(1M) on a single system to produce the
desired autoemail.list configuration file; copying the result,
/var/adm/avail/config/autoemail.list, to all systems, and then running
amregister -r on each system. To change any other availmon configuration
options, run amconfig(1M) appropriately on each system.
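The copy-then-register procedure above could be scripted roughly as
shown below. This sketch only builds the command lines for review and
does not run them. The use of rcp and rsh as the remote copy and remote
shell tools, and the host names, are assumptions; substitute whatever
your site uses.

```python
AUTOEMAIL_LIST = "/var/adm/avail/config/autoemail.list"


def rollout_commands(hosts, copy_cmd="rcp", shell_cmd="rsh"):
    """Return the commands that would push autoemail.list and register.

    copy_cmd and shell_cmd are assumed tools, not part of availmon.
    """
    commands = []
    for host in hosts:
        # Copy the file prepared with amconfig(1M) to the same path remotely.
        commands.append([copy_cmd, AUTOEMAIL_LIST, f"{host}:{AUTOEMAIL_LIST}"])
        # Then register that system so reports are distributed automatically.
        commands.append([shell_cmd, host, "amregister -r"])
    return commands


# Hypothetical host names, for illustration only.
for cmd in rollout_commands(["hosta", "hostb"]):
    print(" ".join(cmd))
```

Printing the commands first allows the list of systems to be checked
before anything is changed. Each command list can then be passed to
subprocess.run() once it has been reviewed.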
There are several other configuration options that can prove useful. One
is to configure sending availmon reports from one or more systems to a
standard system administrator email alias. This provides real-time
notification of system activity. Another similar option is to configure
availmon pager reports for real-time notification to "chatty" pagers.
Or, availmon diagnostic reports may be sent to a local support office, or
to a system administrator for detailed evaluation.
The site-management facilities of availmon are used by configuring
systems to send availmon reports to a "concentrator account". Such an
account is a common email alias on a single system that pipes incoming
email through amreceive(1M) and then appends it to an aggregate site log
file. See the -s option of amreport(1M) for reporting on site log files.
Availmon can also generate and send periodic status reports, which
indicate that a system is still running and "registered" to send email
reports. This is controlled by the statusinterval configuration value,
which defaults to 60 days. Such reports are sent by the availmon ticker
daemon, so they are sent only if the tickerd config flag is on.
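The conditions above amount to a simple check, sketched here. The
function and parameter names are hypothetical, times are taken to be
seconds since Jan 1, 1970, and the real ticker daemon may apply further
conditions not modelled here.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def status_report_due(now, last_sent, tickerd_on, statusinterval_days=60):
    """Decide whether a periodic status report should be sent.

    now, last_sent: times in seconds since Jan 1, 1970.
    tickerd_on: value of the tickerd config flag.
    """
    if not tickerd_on:
        # Status reports come from the ticker daemon, so none without it.
        return False
    return now - last_sent >= statusinterval_days * SECONDS_PER_DAY


# 61 days after the last report, with the default 60-day interval.
print(status_report_due(now=61 * SECONDS_PER_DAY, last_sent=0, tickerd_on=True))
```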
Even where sending of availmon reports is not enabled, local system
availability data is always maintained, and amreport(1M) can be run to
produce statistical or event detail reports for the local system. Such
reporting can be automated on a regular basis using the -f ("from")
argument to amreport(1M). It is also possible to manually send availmon
reports after any reboot using either amnotify(1M) or amsend(1M).
REPORT VIEWING
The amreport(1M) program reviews saved availability report information
and provides statistical and event history reports. By default, it
processes the availability data on the local system. It can also process
aggregate site log files; that is, an appended accumulation of availmon
reports from different systems.
amreport can be run interactively or it can generate statistical or event
history reports that are written to standard output. Interactively, it
presents a statistical summary and allows hierarchical selection and
display of a list of events or detail on particular events. Run
interactively on a site log file, it presents the same statistical or
event information either on all systems or on each system individually.
In either case, it can generate statistical, event list, event detail or
combined reports written to standard output.
amreport accepts -f "from" and/or -t "to" arguments which can be used on
the local system to bound the time period which is reported. This
capability can be used to generate regular statistical or event
list/detail reports.
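The effect of bounding a report by "from" and "to" times can be
illustrated with the sketch below. It does not reproduce amreport's
actual statistics or argument format. The dictionary keys, the choice to
filter on the event time, and the availability formula (uptime divided
by uptime plus downtime) are assumptions made for the example.

```python
def events_in_window(events, start=None, end=None):
    """Keep events whose event time lies in [start, end]; None is unbounded."""
    return [
        e for e in events
        if (start is None or e["event_time"] >= start)
        and (end is None or e["event_time"] <= end)
    ]


def availability_percent(events):
    """Assumed statistic: share of time the system was up, in percent."""
    up = sum(e["event_time"] - e["prev_start"] for e in events)
    down = sum(e["restart"] - e["event_time"] for e in events)
    total = up + down
    return 100.0 * up / total if total else None


# Two hypothetical events; times are seconds since Jan 1, 1970.
events = [
    {"prev_start": 0, "event_time": 900, "restart": 1000},
    {"prev_start": 1000, "event_time": 1950, "restart": 2000},
]
print(availability_percent(events))                          # 92.5
print(availability_percent(events_in_window(events, 1000)))  # 95.0
```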
Run interactively on the local system, amreport also supports resending
event data from selected historical events. This allows recreating prior
reports and resending them in the case where an email report may have
been lost. Some information included in the original report may not be
included in the resent report. This includes current status information,
such as hinv(1M) data, which may have changed from the time of the
original report to the current time (and which is therefore not included)
or information derived from files in /var/adm/crash/, such as icrash
files, which may have been removed.
FILES
/var/adm/avail/config/{autoemail,shutdownreason,tickerd,hinvupdate,livenotification}
     configuration files containing flag values for autoemail,
     shutdownreason, tickerd, hinvupdate, livenotification.
/var/adm/avail/config/statusinterval
     configuration file containing the value of statusinterval.
/var/adm/avail/config/autoemail.list
     configuration file containing the autoemail.list address lists.
/var/adm/avail/availlog
     primary log of the availability monitor
/var/adm/avail/AMSYSTEMLOG
     a copy of system log messages, which is maintained by availmon
/var/adm/avail/lasttick
     uptime in seconds since Jan 1, 1970 (written by tickerd)
/var/adm/crash/*
     availmon report files: availreport, diagreport, pagerreport,
     availreport.su, diagreport.su, pagerreport.su
/etc/init.d/availmon, /usr/etc/amstart
     init scripts that log start/stop and initiate notification
SEE ALSO
Mail(1), amconfig(1M), amnotify(1M), amparse(1M), amreceive(1M),
amregister(1M), amreport(1M), amsend(1M), amsysinfo(1M), amsyslog(1M),
amtickerd(1M), amtime1970(1M), chkconfig(1M), halt(1M), hinv(1M),
icrash(1M), init(1M), shutdown(1M), versions(1M), syslogd(1M),
syslog(3C), crontab(1).