cpqidamon(1M) cpqidamon(1M)
NAME
cpqidamon - Compaq Intelligent Disk Array Monitoring Agent
SYNOPSIS
cpqidamon start|stop [pollingtime]
DESCRIPTION
The IDA Monitoring agent monitors the condition of all IDA
controllers in the system. The agent will issue mail alarm
messages if it finds any failures or performance degradation
in the controller. The agent also identifies IDA hardware
characteristics. The information is collected and sent to the
COMPAQ Insight Manager (if installed) for monitoring.
IDA Monitoring Agent
This agent monitors the condition of all IDA controllers in
the system and issues mail alarm messages if it finds any
failures or performance degradation in the controller.
Starting/Stopping IDA Monitoring Agent
The IDA Monitoring agent is automatically started during the
system startup. You can manually start it by entering:
cpqidamon start pollingtime
where pollingtime is the number of seconds to wait between
data collection intervals. The minimum allowed value is 2
seconds. The default time is 30 seconds. To stop the agent,
enter:
cpqidamon stop
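The limits above can be checked before the agent is invoked. The
following is a minimal sketch, not part of the agent: the helper
names start_command and stop_command are hypothetical, and the only
facts taken from this page are the command syntax, the 2-second
minimum, and the 30-second default.

```python
# Hypothetical helpers: they only build argument lists, they do not run the agent.
MIN_POLL_SECONDS = 2       # minimum allowed polling time stated on this page
DEFAULT_POLL_SECONDS = 30  # default polling time stated on this page


def start_command(pollingtime=None):
    """Return the argument list for 'cpqidamon start [pollingtime]'."""
    if pollingtime is None:
        # Omit the argument; this page gives 30 seconds as the default.
        return ["cpqidamon", "start"]
    seconds = int(pollingtime)
    if seconds < MIN_POLL_SECONDS:
        raise ValueError(
            "pollingtime must be at least %d seconds" % MIN_POLL_SECONDS
        )
    return ["cpqidamon", "start", str(seconds)]


def stop_command():
    """Return the argument list for 'cpqidamon stop'."""
    return ["cpqidamon", "stop"]
```

The resulting list could be passed to subprocess.run on a system
where the agent is installed.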
Alarm Messages
The mail alarm messages issued by the monitoring agent, and
their causes, are as follows.
Logical Drive Status Change
The agent issues the following alarm message if it
detects a change in the status of a Compaq Drive Array
Logical Drive.
Logical Drive status change, device: ***, slot number:*,
drive number:*
status is now status
Copyright 1994 Novell, Inc. Page 1
The types of status are:
OK This message occurs whenever the logical drive
status returns to a normal state from any other
state. For example, in a fault-tolerant
configuration it is displayed after the resolution
of the "Drive Array Logical Drive Status Change -
REBUILDING" message.
FAILED
One or more physical drives have failed. Data is
no longer protected on the drive array. Replace
the failed drives.
UNCONFIGURED
A drive installed in the mass storage subsystem is
not configured. Use the Compaq EISA configuration
utility to configure the drive.
RECOVERING
A physical drive failed within the Compaq Drive
Array. The drive array is in a recovery mode. No
data has been lost because of the fault-tolerant
mode currently in use.
READY FOR REBUILD
A failed drive has been replaced, and the system
is ready to begin Automatic Data Recovery on the
logical drive. This warning message is displayed
only after a failed drive has been replaced.
REBUILDING
Automatic Data Recovery is underway. This warning
message may be displayed following the "Drive
Array Logical Drive Status Change - READY FOR
REBUILD" message.
WRONG DRIVE
The incorrect physical drive was replaced in an
array. This condition is critical.
BAD CONNECTION
The physical drive in a Compaq Drive Array is not
responding to commands from the array controller.
Several causes are possible.
The data cable connecting the drive to the array
has failed. Do not attempt to reseat any cables
while the server is on. Damage to the drives or
array controller will result.
The cable connecting to the array has become loose
at either end and must be reseated.
The cable connecting an external storage subsystem
to the server has become loose.
The power to the drive has been interrupted. This
can be caused by a loosened drive power supply
cable, or a failed power supply in the server or
disk subsystem.
OVERHEATING
The temperature inside a drive array enclosure has
risen above factory preset levels. If the
temperature continues to rise, damage to the
drives within the enclosure may result.
SHUTDOWN
The external drive array has stopped operating
because of elevated temperatures. Do not attempt
to operate the disk storage subsystem while
temperatures are elevated. Severe damage to the
disk will occur.
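A mail filter or log watcher could extract the fields of the Logical
Drive alarm with a pattern like the one below. This is a sketch under
assumptions: it takes the message layout shown earlier in this
section literally, the function name parse_logical_drive_alarm is
hypothetical, and the device name ida0 in the usage note is invented
for illustration.

```python
import re

# Status values listed on this page for the Logical Drive alarm.
LOGICAL_DRIVE_STATUSES = {
    "OK", "FAILED", "UNCONFIGURED", "RECOVERING", "READY FOR REBUILD",
    "REBUILDING", "WRONG DRIVE", "BAD CONNECTION", "OVERHEATING", "SHUTDOWN",
}

# Pattern assumed from the message layout printed on this page;
# whitespace and line breaks between fields are tolerated.
LOGICAL_DRIVE_ALARM = re.compile(
    r"Logical Drive status change,\s*"
    r"device:\s*(?P<device>\S+?),\s*"
    r"slot number:\s*(?P<slot>\S+?),\s*"
    r"drive number:\s*(?P<drive>\S+)\s+"
    r"status is now\s+(?P<status>.+?)\s*$",
    re.IGNORECASE | re.DOTALL,
)


def parse_logical_drive_alarm(text):
    """Return the alarm fields as a dict, or None if the text does not match."""
    match = LOGICAL_DRIVE_ALARM.search(text)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = fields["status"].upper()
    # Flag status values that are not among those documented above.
    fields["known_status"] = fields["status"] in LOGICAL_DRIVE_STATUSES
    return fields
```

For example, a message reading "Logical Drive status change, device:
ida0, slot number:3, drive number:1" followed by "status is now
REBUILDING" would yield the slot "3", the drive "1", and the status
"REBUILDING".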
Drive Array Physical Drive Status Change
The agent issues the following alarm message if it
detects a change in the status of a Compaq Drive Array
Physical Drive.
Physical Drive Status Change, device: ***, slot number:*,
drive:*
Status is now status
The values for status are:
OK This message indicates an improving condition.
This message is issued after a physical drive
fault has been corrected, or when you add a new
hot pluggable physical drive.
FAILED
A physical drive has failed in a mass storage
subsystem. In configurations that are not fault
tolerant, this status is critical. The mass
storage subsystem failed and server operation
stopped. You must replace the failed drive before
system operation can begin again. In fault
tolerant configurations, the overall system
condition is degraded, but still operational. You
may receive additional alarms, such as "Drive
Array Logical Drive Status Change - RECOVERING",
or "Drive Array Spare Drive status change -
ACTIVE".
Drive Array Physical Drive Threshold Exceeded
The agent issues the following alarm message if it
detects that a threshold for a physical drive has been
exceeded.
Physical Drive Threshold passed for device:***, slot
number:***, drive:*
The server issuing this alarm has a drive that exceeded
one or more factory preset thresholds for performance
degradation. Many Compaq high-performance drive array
hard drives are "stamped" by the drive manufacturer with
minimum performance characteristics. As a result of
normal wear and tear, the performance of a hard drive
may gradually deteriorate. If certain thresholds are
exceeded, the drive may not perform to specified levels,
and may be subject to hardware failure sometime in the
future. Drives that exceed these thresholds are
considered "failed", although true catastrophic failure
has not yet occurred.
Drive Array Accelerator Status Change
This agent issues the following alarm message if it
detects a change in the Drive Array Accelerator Status.
Accelerator Board status change for device:***, slot number:*,
Status is now status
The possible values for status are:
ENABLED
This status is informational and requires no
action. This alarm typically occurs when a Compaq
Array Accelerator set has fully recharged from a
discharged condition. The array accelerator is
ready to accept posted writes.
TEMPORARILY DISABLED
This status is non-critical, but the user should take
action soon. The Array Accelerator on the drive
array controller has been temporarily disabled
for one of the following reasons:
The accelerator is configured for a different
Drive Array Controller. Make sure the controller
is installed in the correct system.
Battery charge level is below 75 percent.
Sufficient resources to perform posted writes are
not available. This may be due to a current
rebuilding process.
At reset initialization, data was found to be
invalid at the mirror data compare test. Data
integrity was lost.
PERMANENTLY DISABLED
This status is critical; the user should take
immediate action. The write cache operations of
the Array Accelerator have been permanently
disabled for one of the following reasons:
Data was found at reset initialization in the
posted write memory, but the mirror data
compare test failed and the data was marked
as invalid. This is a possible data loss
circumstance.
Soft errors occurred when trying to read the same
data from both sides of the mirrored posted write
memory. This is a definite data loss circumstance.
Data could not be written to the posted write
memory in duplicate due to the detection of parity
errors. This is not a data loss circumstance.
A BMIC Set Configuration command was issued.
Posted write operations remain disabled until a
BMIC Set Posted Write command is issued once
again.
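The urgency attached to each Accelerator status above can be kept in
a small lookup table, for instance to decide how an alarm is routed.
This is an illustrative sketch only: the severity labels are taken
from the descriptions in this section, while the names
ACCELERATOR_SEVERITY and accelerator_severity are hypothetical.

```python
# Severity wording follows the status descriptions on this page.
ACCELERATOR_SEVERITY = {
    "ENABLED": "informational",          # requires no action
    "TEMPORARILY DISABLED": "non-critical",  # take action soon
    "PERMANENTLY DISABLED": "critical",  # take immediate action
}


def accelerator_severity(status):
    """Map an Accelerator status string to its severity, or 'unknown'."""
    return ACCELERATOR_SEVERITY.get(status.strip().upper(), "unknown")
```

A status string that this page does not list maps to "unknown"
rather than raising an error, so an unexpected alarm is not silently
dropped.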
Drive Array Accelerator Bad Data
The agent issues the following alarm message if it
detects the Drive Array Accelerator has lost battery
power.
Data may have been lost.
Accelerator Board Bad Data, device:***, slot number:*,
Accelerator lost battery power. Data loss possible
The possible reasons for this message are as follows.
If the system was without power for eight days, and the
battery packs were on (battery sets activate only if the
system loses power unexpectedly), any data stored in the
cache was lost.
The battery set may have problems. Check the battery
status for more information.
The Array Accelerator board has been replaced with a new
board that has discharged batteries. In this case, no
data is lost, and posted writes are automatically
enabled when the batteries reach full charge.
Drive Array Battery Status Change
The agent issues the following alarm message if it
detects a battery status change associated with the
Compaq Array Accelerator Write Cache Board.
Accelerator Board Battery Failed, device:***, slot number:*,
Battery failed, status: status
The possible values for status are:
OK This alarm indicates an improving condition. This
alarm is issued in response to a change in the
charge condition of the accelerator set.
Typically, the "Drive Array Accelerator Battery
Status-RECHARGING" alarm is issued before this
alarm.
RECHARGING
The Array Accelerator battery set has not fully
charged. This condition is not usually a cause
for concern. However, if you do not receive the
"Drive Array Accelerator Battery status - OK"
alarm within 36 hours after beginning the charge,
the battery pack status is set to failed, and you
will receive the "Drive Array Accelerator Battery
status - FAILED" alarm.
FAILED
The Array Accelerator can no longer protect data
in the cache in the event of a power interruption
to the server. Replace the Array Accelerator as
soon as possible. The replaced Array Accelerator
batteries must reach full charge before the
controller allows data to be written to the cache
memory subsystem.
Server performance may be affected during this
time. Another alarm, "Drive Array Accelerator
Battery status - OK" is issued, indicating that
the cache subsystem is now back in operation.
DEGRADED
The battery set in the Array Accelerator is
operating, but one of the batteries has failed to
recharge correctly. This condition jeopardizes
the integrity of the battery-backed cache.
Replace the Array Accelerator as soon as possible.
Drive Array Spare Drive Status Change
The agent issues the following alarm if it detects a
"Drive Array Spare Status Change" in a Compaq drive
array logical drive.
Spare Drive Status Change, device:***, slot number:*, drive
id:*.
Status is now status
The possible values for status are:
FAILED
This alarm does not indicate that the server or
the drive array failed. It indicates that a spare
drive in a drive array failed. Replace the spare
as soon as possible.
INACTIVE
This alarm signifies that the spare is in a ready
state. This alarm is typically issued if you
install or replace a spare drive in a Compaq Drive
Array. It is issued after you replace a failed
spare drive with a new one. This alarm may also
occur when a drive fails and the spare becomes
active, the failed drive is replaced, and the spare
eventually becomes inactive again.
BUILDING
This alarm is issued when a spare drive is brought
online to replace a failed drive, and the drive
array subsystem begins to build data onto the
spare.
ACTIVE
A physical drive failed. The array successfully
restored data onto a spare. That spare drive has
now become active and replaces the failed drive.
Storage System Fan Status Change
The agent issues the following alarm message when it
detects a change in "Storage system Fan Status" for the
SMART Controller.
Fan Status Change, device:***, slot number:***, bus number:*
Storage System fan status changed to status
The possible values for status are:
OK This alarm is issued when a change occurs in the
operating condition of a monitored server's
external storage system fan. A fan has been
replaced, or has otherwise returned to normal
operation. This alarm typically follows the
"Storage System Fan Status Change-FAILED" alarm.
FAILED
The storage subsystem internal fan failed and the
temperature may soon rise beyond factory preset
levels. The storage system may shut down
automatically to prevent damage to hardware and
data loss.
Storage System Temperature Failed
The agent issues the following alarm message when it
detects the internal temperature of the storage system has
risen beyond factory preset levels. This message
requires a SMART Controller in the system.
Storage System Temperature Failure, device:***, slot number:*,
bus number:*, storage system will be shut down
The system may shut down automatically to prevent damage
to hardware and data loss.
Storage System Temperature Degraded
The agent issues the following alarm message when it
detects the temperature inside a storage subsystem is
outside normal operating range. This message requires a
SMART Controller in the system.
Storage System Temperature Degraded, device:***,
Slot number:*, bus number:*,
Temperature is outside of normal range.
A fan may have failed. The storage subsystem is
operating without the proper cooling capacity, and
internal unit temperatures may soon rise beyond safe
levels. Take corrective action soon.
Storage System Temperature OK
The agent issues the following alarm message if it
detects temperatures that were abnormal have returned
to a normal state. This message requires a SMART
Controller in the system.
Compaq Storage System, device:***, slot number:*, bus number:*,
Storage system temperature OK
A fan may have failed.
This message indicates an improving condition, and
generally requires no further action.
Storage System Side Panel Removed
The agent issues the following alarm message when it
detects the side panel of the storage subsystem has been
removed. This message requires a SMART Controller in
the system.
Storage System side panel is removed, device:***,
Slot number:*, bus number:*,
A fan may have failed.
Storage System Side Panel In Place
The agent issues the following alarm message if it
detects the side panel that was removed has been
reinstalled. This message requires a SMART Controller in
the system.
Storage System side panel is in place, device:***,
Slot number:*, bus number:*,
This message indicates an improving condition, and
generally requires no further action.
Error Messages
The various error messages that the IDA Monitoring agent
software can produce are listed below, and information is
provided about how each problem can be resolved.
These messages appear in the Agent/Standard error log file.
The name of this file is /usr/bin/compaq/agenterrs.log.
IDA3001 cpqidamon: Return drive M & P threshold failed,
device: name, slot: number.
The drives in this controller have not been factory-stamped.
Run the COMPAQ Diagnostic Utility to stamp the drives, or
contact a COMPAQ service representative.
IDA3002 cpqidamon: Read drive capacity cmd failed,
device: name, slot: number, drive: number.
The agent cannot read the capacity of the drive. Check the
cables connected to the drive.
IDA3003 cpqidamon: Return drive M & P statistics since power-on failed,
device: name, slot: number.
The agent cannot read the M & P statistics of the drive(s).
Check all cables connected to the drive(s).
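To see which of the errors above occur most often, the error log can
be scanned for the IDA message codes. The sketch below assumes that
each logged entry contains the code followed by "cpqidamon:", as in
the messages shown in this section; the actual layout of
/usr/bin/compaq/agenterrs.log may differ, and the function name
count_error_codes is hypothetical.

```python
import re
from collections import Counter

# Assumed entry shape: a code such as IDA3001, then "cpqidamon:".
ERROR_LINE = re.compile(r"\b(IDA\d{4})\s+cpqidamon:")


def count_error_codes(lines):
    """Count occurrences of each IDA error code in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = ERROR_LINE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Possible use on a system where the log exists (path from this page):
#   with open("/usr/bin/compaq/agenterrs.log") as log:
#       print(count_error_codes(log))
```

Continuation lines such as "device: name, slot: number." carry no
code and are skipped, so each message is counted once.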
NOTICES
This command is only supported on applicable Compaq systems.