cpqw(7) cpqw(7)
NAME
cpqw - Compaq Wellness Driver
DESCRIPTION
The Compaq Wellness Driver collects and monitors important
operational data on your server to ensure that the system is
``healthy.'' Any abnormal conditions are logged in the EISA
nonvolatile RAM (NVRAM) health log, and the information is
optionally sent to the Compaq Insight Manager via SNMP traps.
The health log can be displayed using inspect(1M).
Compaq servers are equipped with hardware and firmware that
monitor abnormal conditions such as out-of-range temperature
readings, fan failures, and ECC memory errors. The cpqw driver
monitors these conditions and reports their status to the
administrator by printing a message on the console and by
logging the condition in the EISA NVRAM.
A tightly coupled SNMP agent (a daemon process) opens the
driver and waits for abnormal conditions. If an abnormal
condition is detected, the SNMP agent issues a trap to the
Compaq Insight Manager (CIM).
cpqw acts as a multiplexer for three drivers: asr, csm, and
ecc_nmi.
asr Driver
This driver allows a user-level process to periodically update
the Automatic Server Recovery (ASR) timer, a hardware
``heartbeat'' counter that is loaded with an initial value and
then counts down. If the timer is not updated and counts down
to zero, the firmware assumes that the operating system is
locked up and attempts to reboot the system. Before the
reboot, asr displays a message on the console stating the
problem and makes an entry in the system health log. When ASR
reboots the system, all boards are reset.
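The countdown-and-reload behavior of the ASR timer can be
illustrated with a small simulation. This is only a sketch in
Python and not the driver's interface: the class AsrTimer, the
helper run(), and the tick granularity are all hypothetical
names and units chosen for illustration.

```python
class AsrTimer:
    """Toy model of a hardware countdown ("heartbeat") timer.

    Hypothetical illustration only; not the asr driver's API.
    """

    def __init__(self, initial_ticks):
        if initial_ticks <= 0:
            raise ValueError("initial_ticks must be positive")
        self.initial_ticks = initial_ticks
        self.remaining = initial_ticks
        self.expired = False

    def reload(self):
        """Called by the user-level process to show the OS is alive."""
        if not self.expired:
            self.remaining = self.initial_ticks

    def tick(self):
        """Count down by one tick; return True once the timer has expired."""
        if not self.expired:
            self.remaining -= 1
            if self.remaining == 0:
                # On real hardware, this is where the firmware would reboot.
                self.expired = True
        return self.expired


def run(timer, total_ticks, reload_every=None):
    """Simulate total_ticks ticks, reloading every reload_every ticks.

    Returns the tick number at which the timer expired, or None if
    it never expired during the simulated period.
    """
    for now in range(1, total_ticks + 1):
        if reload_every and now % reload_every == 0:
            timer.reload()
        if timer.tick():
            return now
    return None
```

With an initial value of 5 ticks, a process that reloads every
3 ticks keeps the timer from ever reaching zero, while a
process that stops reloading lets it expire.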
csm Driver
The Compaq Server Management (csm) driver monitors the
following aspects of the system via IRQ 13, which is shared
with the FPU. It writes its status to the EISA NVRAM.
Temperature sensors
If the normal operating temperature is
exceeded, or a cooling fan fails, the csm
driver displays a message to the console
stating the problem, makes an entry in the
system health log, and shuts the system down
(optionally) to avoid hardware damage. Use the
EISA configuration utility to control this
option.
Fan sensors If a cooling fan fails, csm displays a message
to the console stating the problem, makes an
entry in the system health log, and shuts the
system down (optionally) to avoid hardware
damage. Use the EISA configuration utility to
control the option.
Hobbs meter updates
Total system uptime is stored in nonvolatile
memory and can be displayed with the inspect
utility. The meter is a 4-byte counter (the
environment variable CPQHOB) on the ProLiant
system board that records the total time the
system has been running under an operating
system. Once a driver for the operating system
is installed, the driver causes this counter to
tick. The counter cannot be reset.
EISA bus utilization
The driver maintains EISA bus utilization
statistics, which it obtains by periodically
polling a hardware counter. The driver stores
the number of idle BCLKs (bus clock cycles) and
the poll frequency in a variable that can be
retrieved and displayed with the performance
monitor utility, rtpm.
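This page does not say how a utilization figure is derived
from the idle count and the poll frequency. One plausible
calculation is sketched below in Python. It assumes that the
idle count covers exactly one polling interval and that the
bus clock rate is known; the function name bus_utilization and
all of its parameters are hypothetical.

```python
def bus_utilization(idle_bclks, polls_per_second, bclk_hz):
    """Estimate bus utilization over one polling interval.

    Hypothetical sketch. Assumes idle_bclks counts the idle bus
    clock cycles seen during a single polling interval.

    idle_bclks       -- idle bus clock cycles in the interval
    polls_per_second -- how often the hardware counter is polled
    bclk_hz          -- bus clock rate in Hz (supplied by the caller)
    """
    if polls_per_second <= 0 or bclk_hz <= 0:
        raise ValueError("polls_per_second and bclk_hz must be positive")
    # Total bus clock cycles that elapse during one polling interval.
    total_bclks = bclk_hz / polls_per_second
    if not 0 <= idle_bclks <= total_bclks:
        raise ValueError("idle_bclks is out of range for this interval")
    # Utilization is the fraction of cycles that were not idle.
    return 1.0 - idle_bclks / total_bclks
```

For example, with a clock rate of 10,000,000 Hz polled 10
times per second, each interval spans 1,000,000 cycles, so an
idle count of 750,000 corresponds to 25 percent utilization.
The clock rate in this example is an illustrative value, not
one taken from this page.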
ecc_nmi Driver
If an ECC memory error occurs, the ecc_nmi driver logs the
error in the health log, including the address at which the
error occurred. If too many errors occur at the same memory
location, the driver disables ECC error interrupts to avoid
flooding the console with warnings (the hardware corrects the
ECC error automatically).
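The per-address throttling described above can be sketched as
follows. This Python model is an illustration only: the class
EccErrorTracker and the max_per_address threshold are
hypothetical, and this page does not state how many errors
count as ``too many.''

```python
from collections import Counter


class EccErrorTracker:
    """Toy model of per-address ECC error throttling.

    Hypothetical illustration; not the ecc_nmi driver's code.
    """

    def __init__(self, max_per_address):
        self.max_per_address = max_per_address  # assumed threshold
        self.counts = Counter()         # errors seen per address
        self.interrupts_enabled = True  # stands in for the hardware setting
        self.log = []                   # stands in for the health log

    def report(self, address):
        """Handle one corrected ECC error; return True if it was logged."""
        if not self.interrupts_enabled:
            # With interrupts disabled, the driver no longer sees errors.
            return False
        self.counts[address] += 1
        self.log.append(address)
        if self.counts[address] >= self.max_per_address:
            # Too many errors at one location: stop further warnings.
            self.interrupts_enabled = False
        return True
```

With a threshold of 3, the first three errors at one address
are logged, and later errors are no longer reported.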
Two error logs are maintained in the EISA NVRAM: the
Correctable Error Log and the Critical Error Log. Each log
consists of 16 entries in a circular queue. Each entry is 8
bytes long and carries a time stamp with a date and time.
The Error Information field provides specific information for
each error type. Diagnostics uses this information to resolve
the source of the error.
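A circular queue of 16 fixed-size, 8-byte entries can be
modeled as shown below. Only the entry count and entry size
come from this page. The field layout (a 4-byte time stamp, a
1-byte error type, and 3 bytes of error information) is a
hypothetical choice made so that the example adds up to 8
bytes; the real NVRAM layout is not documented here.

```python
import struct

# Hypothetical layout: 4-byte time stamp, 1-byte type, 3-byte info.
ENTRY_FORMAT = "<IB3s"
ENTRY_SIZE = struct.calcsize(ENTRY_FORMAT)  # 8 bytes
LOG_ENTRIES = 16


class ErrorLog:
    """Toy model of a 16-entry circular error log."""

    def __init__(self):
        self.buf = bytearray(ENTRY_SIZE * LOG_ENTRIES)
        self.next_slot = 0  # slot the next entry will be written to
        self.count = 0      # number of valid entries, at most 16

    def append(self, timestamp, error_type, info=b""):
        """Write one entry, overwriting the oldest when the log is full."""
        offset = self.next_slot * ENTRY_SIZE
        struct.pack_into(ENTRY_FORMAT, self.buf, offset,
                         timestamp, error_type, info)
        self.next_slot = (self.next_slot + 1) % LOG_ENTRIES
        self.count = min(self.count + 1, LOG_ENTRIES)

    def entries(self):
        """Return the valid entries as tuples, oldest first."""
        start = (self.next_slot - self.count) % LOG_ENTRIES
        result = []
        for i in range(self.count):
            slot = (start + i) % LOG_ENTRIES
            result.append(struct.unpack_from(ENTRY_FORMAT, self.buf,
                                             slot * ENTRY_SIZE))
        return result
```

After 20 entries have been appended, only the 16 most recent
remain, because entries 17 through 20 overwrite the 4 oldest.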
Some examples of NMI and ECC errors are:
Uncorrectable memory error
Expansion bus master timeout
Expansion bus slave timeout
Expansion board timeout
Expansion bus arbitration error
Processor cache parity timeout
Processor parity error
System concurrency error
Processor failure
Processor exception
Processor internal error
Critical temperature
Failsafe timer expiration
Abnormal program termination
System fan failure
UPS - battery depletion detected
REFERENCES
inspect(1M), rtpm(1M), crom(7)
NOTICES
This driver is supported only on applicable Compaq systems.
Copyright 1994 Novell, Inc.