⇒ cpqidamon(1M) — UnixWare 2.01

       cpqidamon(1M)                                          cpqidamon(1M)


       NAME
             cpqidamon - Compaq Intelligent Disk Array Monitoring Agent

       SYNOPSIS
             cpqidamon start|stop [pollingtime]

       DESCRIPTION
             The IDA Monitoring agent monitors the condition of all IDA
             controllers in the system.  The agent will issue mail alarm
             messages if it finds any failures or performance degradation
             in the controller.  The agent also identifies IDA hardware
             characteristics.  The information is collected and sent to the
             COMPAQ Insight Manager (if installed) for monitoring.

          IDA Monitoring Agent
             This agent monitors the condition of all IDA controllers in
             the system and issues mail alarm messages if it finds any
             failures or performance degradation in a controller.

             Starting/Stopping IDA Monitoring Agent

             The IDA Monitoring agent is started automatically during
             system startup.  You can also start it manually by entering:

                         cpqidamon start pollingtime

             where pollingtime is the number of  seconds  to  wait  between
             data  collection  intervals.   The  minimum allowed value is 2
             seconds.  The default time is 30 seconds.  To stop the  agent,
             enter:

                         cpqidamon stop
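
             The start and stop invocations above can be sketched as a
             small Python helper that checks the polling time before
             the command is run.  The function names are hypothetical;
             only the command name, the 2-second minimum, and the
             30-second default come from this page.

```python
MIN_POLLING_SECONDS = 2  # minimum allowed value stated on this page


def build_start_command(pollingtime=None):
    """Return the argument list that starts the agent.

    pollingtime is the number of seconds between data collections.
    When it is None the argument is omitted, so the agent's own
    default (30 seconds according to this page) applies.
    """
    if pollingtime is None:
        return ["cpqidamon", "start"]
    if isinstance(pollingtime, bool) or not isinstance(pollingtime, int):
        raise TypeError("pollingtime must be a whole number of seconds")
    if pollingtime < MIN_POLLING_SECONDS:
        raise ValueError("pollingtime must be at least 2 seconds")
    return ["cpqidamon", "start", str(pollingtime)]


def build_stop_command():
    """Return the argument list that stops the agent."""
    return ["cpqidamon", "stop"]
```

             The returned list can be passed to subprocess.run on a
             system where cpqidamon is installed.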

             Alarm Messages

             The mail alarm messages issued by the monitoring agent, and
             their causes, are as follows.

             Logical Drive Status Change
                   The agent issues  the  following  alarm  message  if  it
                   detects  a  change in the status of a Compaq Drive Array
                   Logical Drive.

                               Logical Drive status change, device: ***, slot number:*,
                               drive number:*
                               status is now status


                           Copyright 1994 Novell, Inc.               Page 1

                  The types of status are:

                  OK    This message occurs  whenever  the  logical  drive
                        status  returns  to  a normal state from any other
                        state.   For  example,  in   a   fault   tolerance
                        configuration it is displayed after the resolution
                        of the "Drive Array Logical Drive Status change  -
                        REBUILDING" message.

                  FAILED
                        One or more physical drives have failed.  Data  is
                        no  longer  protected on the drive array.  Replace
                        the failed drives.

                  UNCONFIGURED
                        A drive installed in the mass storage subsystem is
                        not configured.  Use the Compaq EISA configuration
                        utility to configure the drive.

                  RECOVERING
                        A physical drive failed within  the  Compaq  Drive
                        Array.  The drive array is in a recovery mode.  No
                        data has been lost, due to the fault tolerant mode
                        currently in use.

                  READY FOR REBUILD
                        A failed drive has been replaced, and  the  system
                        is  ready  to begin Automatic Data Recovery on the
                        logical drive.  This warning message is  displayed
                        only after a failed drive has been replaced.

                  REBUILDING
                        Automatic Data Recovery is underway.  This warning
                        message  may  be  displayed  following  the "Drive
                        Array Logical Drive  Status  Change  -  READY  FOR
                        REBUILD" message.

                  WRONG DRIVE
                        The incorrect physical drive was  replaced  in  an
                        array.  This condition is critical.

                  BAD CONNECTION
                        The physical drive in a Compaq Drive Array is  not
                        responding  to commands from the array controller.
                        Several causes are possible.



                         The data cable connecting the drive to  the  array
                         has  failed.   Do not attempt to reseat any cables
                         while the server is on.  Damage to the  drives  or
                         array controller will result.

                         The cable connecting to the array has become loose
                         at either end and must be reseated.

                         The cable connecting an external storage subsystem
                         to the server has become loose.

                         The power to the drive has been interrupted.  This
                         can  be  caused  by  a loosened drive power supply
                         cable, or a failed power supply in the  server  or
                         disk subsystem.

                   OVERHEATING
                         The temperature inside a drive array enclosure has
                         risen   above   factory  preset  levels.   If  the
                         temperature  continues  to  rise,  damage  to  the
                         drives within the enclosure may result.

                   SHUTDOWN
                         The external drive  array  has  stopped  operating
                         because  of elevated temperatures.  Do not attempt
                         to  operate  the  disk  storage  subsystem   while
                         temperatures are  elevated.  Severe damage to  the
                         disk will occur.
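
             A mail filter could pick the fields out of the logical
             drive alarm with a regular expression.  The Python sketch
             below assumes that "***" is replaced by a device name and
             each "*" by a decimal number; this page does not specify
             the exact substitutions, and the function name is
             hypothetical.

```python
import re

# The ten status values listed on this page for logical drives.
LOGICAL_DRIVE_STATUSES = frozenset({
    "OK", "FAILED", "UNCONFIGURED", "RECOVERING", "READY FOR REBUILD",
    "REBUILDING", "WRONG DRIVE", "BAD CONNECTION", "OVERHEATING",
    "SHUTDOWN",
})

# Assumption: the device name contains no comma, and the slot and
# drive numbers are decimal integers.
_ALARM = re.compile(
    r"Logical Drive status change,\s*device:\s*(?P<device>[^,]+?)\s*,"
    r"\s*slot number:\s*(?P<slot>\d+)\s*,"
    r"\s*drive number:\s*(?P<drive>\d+)"
    r"\s+status is now\s+(?P<status>[A-Za-z ]+?)\s*$"
)


def parse_logical_drive_alarm(text):
    """Return the device, slot, drive, and status found in an alarm."""
    match = _ALARM.search(text)
    if match is None:
        raise ValueError("not a logical drive status change alarm")
    # Normalize case and collapse runs of spaces in the status.
    status = " ".join(match.group("status").upper().split())
    if status not in LOGICAL_DRIVE_STATUSES:
        raise ValueError("unknown logical drive status: " + status)
    return {
        "device": match.group("device"),
        "slot": int(match.group("slot")),
        "drive": int(match.group("drive")),
        "status": status,
    }
```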

             Drive Array Physical Drive Status Change
                   The agent issues  the  following  alarm  message  if  it
                   detects  a  change in the status of a Compaq Drive Array
                   Physical Drive.

                               Physical Drive Status Change, device: ***, slot number:*,
                               drive:*
                               Status is now status

                   The values for status are:

                   OK    This message  indicates  an  improving  condition.
                         This  message  is  issued  after  a physical drive
                         fault has been corrected, or when you  add  a  new
                         hot pluggable physical drive.




                  FAILED
                        A physical drive has  failed  in  a  mass  storage
                        subsystem.   In  configurations that are not fault
                        tolerant,  this  status  is  critical.   The  mass
                        storage  subsystem  failed  and  server  operation
                        stopped.  You must replace the failed drive before
                        system   operation  can  begin  again.   In  fault
                        tolerant  configurations,   the   overall   system
                        condition is degraded, but still operational.  You
                        may receive  additional  alarms,  such  as  "Drive
                        Array  Logical  Drive Status Change - RECOVERING",
                        or  "Drive  Array  Spare  Drive  status  change  -
                        ACTIVE".

            Drive Array Physical Drive Threshold Exceeded
                  The agent issues  the  following  alarm  message  if  it
                  detects that a threshold for a physical drive  has  been
                  exceeded.

                              Physical Drive Threshold passed for device:***, slot
                              number:***, drive:*

                  The server issuing this alarm has a drive that  exceeded
                  one  or  more  factory preset thresholds for performance
                  degradation.  Many Compaq high-performance  drive  array
                  hard drives are "stamped" by the drive manufacturer with
                  minimum performance characteristics.   As  a  result  of
                  normal  wear  and  tear, the performance of a hard drive
                  may gradually deteriorate.  If  certain  thresholds  are
                  exceeded, the drive may not perform to specified levels,
                  and may be subject to hardware failure sometime  in  the
                  future.    Drives   that  exceed  these  thresholds  are
                  considered "failed", although true catastrophic  failure
                  has not yet occurred.

            Drive Array Accelerator Status Change
                  The agent issues  the  following  alarm  message  if  it
                  detects a change in the Drive Array Accelerator Status.

                              Accelerator Board status change for device:***, slot number:*,
                              Status is now status

                  The possible values for status are:





                   ENABLED
                         This  status  is  informational  and  requires  no
                         action.  This alarm typically occurs when a Compaq
                         Array Accelerator set has fully recharged  from  a
                         discharged  condition.   The  array accelerator is
                         ready to accept posted writes.

                   TEMPORARILY DISABLED
                         This status is non-critical, but you  should  take
                         action  soon.   The Array Accelerator on the drive
                         array controller has  been  temporarily  disabled
                         for one of the following reasons:

                         The accelerator  is  configured  for  a  different
                         Drive  Array Controller.  Make sure the controller
                         is installed in the correct system.

                         Battery charge level is below 75 percent.

                         Sufficient resources to perform posted writes  are
                         not  available.   This  may  be  due  to a current
                         rebuilding process.

                         At reset initialization,  data  was  found  to  be
                         invalid  at  the  mirror  data compare test.  Data
                         integrity was lost.

                   PERMANENTLY DISABLED
                         This  status  is  critical;   you   should   take
                         immediate  action.   The write cache operations of
                         the  Array  Accelerator  have   been   permanently
                         disabled for one of the following reasons:

                         Data was found  at  reset  initialization  in  the
                         posted write memory, but the mirror  data  compare
                         test failed, so the data was  marked  as  invalid.
                         This is a possible data loss circumstance.

                         Soft errors occurred when trying to read the  same
                         data  from both sides of the mirrored posted write
                         memory.  This is a definite data loss circumstance.

                         Data could not be  written  to  the  posted  write
                         memory in duplicate due to the detection of parity
                         errors.  This is not a data loss circumstance.


                        A  BMIC  Set  Configuration  command  was  issued.
                        Posted  write  operations  remain disabled until a
                        BMIC Set  Posted  Write  command  is  issued  once
                        again.
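
            The urgency that this page attaches to each accelerator
            status can be written as a lookup table.  This is a
            minimal Python sketch; the wording of each entry
            paraphrases the text above, and the function name is
            hypothetical.

```python
# Urgency of each Array Accelerator status, as described on this page.
ACCELERATOR_STATUS_URGENCY = {
    "ENABLED": "informational: no action required",
    "TEMPORARILY DISABLED": "non-critical: take action soon",
    "PERMANENTLY DISABLED": "critical: take immediate action",
}


def accelerator_urgency(status):
    """Return the urgency text for an accelerator status string."""
    # Normalize case and collapse runs of whitespace before the lookup.
    key = " ".join(status.upper().split())
    try:
        return ACCELERATOR_STATUS_URGENCY[key]
    except KeyError:
        raise ValueError("unknown accelerator status: " + status) from None
```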

            Drive Array Accelerator Bad Data
                  The agent issues  the  following  alarm  message  if  it
                  detects  the  Drive  Array  Accelerator has lost battery
                  power.

                              Data may have been lost.
                              Accelerator Board Bad Data, device:***, slot number:*,
                              Accelerator lost battery power.  Data loss possible

                  The possible reasons for this message are as follows.

                  If the system was without power for eight days, and  the
                  battery packs were on (battery sets activate only if the
                  system loses power unexpectedly), any data stored in the
                  cache was lost.

                  The battery set may have problems.   Check  the  battery
                  status for more information.

                  The Array Accelerator board has been replaced with a new
                  board  that  has discharged batteries.  In this case, no
                  data  is  lost,  and  posted  writes  are  automatically
                  enabled when the batteries reach full charge.

            Drive Array Battery Status Change
                  The agent issues  the  following  alarm  message  if  it
                  detects  a  battery  status  change  associated with the
                  Compaq Array Accelerator Write Cache Board.

                              Accelerator Board Battery Failed, device:***, slot number:*,
                              Battery failed, status: status

                  The possible values for status are:

                  OK    This alarm indicates an improving condition.  This
                        alarm  is  issued  in  response to a change in the
                        charge   condition   of   the   accelerator   set.
                        Typically,  the  "Drive  Array Accelerator Battery
                        Status-RECHARGING" alarm  is  issued  before  this
                        alarm.



                   RECHARGING
                         The Array Accelerator battery set  has  not  fully
                         charged.   This  condition  is not usually a cause
                         for concern.  However, if you do not  receive  the
                         "Drive  Array  Accelerator  Battery  status  - OK"
                         alarm within 36 hours after beginning the  charge,
                         the  battery pack status is set to failed, and you
                         will receive the "Drive Array Accelerator  Status-
                         FAILED" alarm.

                   FAILED
                         The Array Accelerator can no longer  protect  data
                         in  the cache in the event of a power interruption
                         to the server.  Replace the Array  Accelerator  as
                         soon  as possible.  The replaced Array Accelerator
                         batteries  must  reach  full  charge  before   the
                         controller  allows data to be written to the cache
                         memory subsystem.

                         Server performance may  be  affected  during  this
                         time.   Another  alarm,  "Drive  Array Accelerator
                         Battery status - OK" is  issued,  indicating  that
                         the cache subsystem is now back in operation.

                   DEGRADED
                         The  battery  set  in  the  Array  Accelerator  is
                         operating,  but one of the batteries has failed to
                         recharge correctly.   This  condition  jeopardizes
                         the   integrity   of   the  battery-backed  cache.
                         Replace the Array Accelerator as soon as possible.
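
             The 36-hour recharge window described under RECHARGING
             can be checked with simple date arithmetic.  The Python
             sketch below assumes the caller records the time the
             RECHARGING alarm arrived; the function name is
             hypothetical.

```python
from datetime import datetime, timedelta

# This page says the battery pack is set to failed if the OK alarm
# has not arrived within 36 hours of the start of charging.
RECHARGE_LIMIT = timedelta(hours=36)


def recharge_overdue(charge_started, now):
    """Return True once more than 36 hours have passed without OK."""
    return now - charge_started > RECHARGE_LIMIT
```

             For example, a charge that began at midnight is overdue
             at 13:00 on the following day (37 hours later), but not
             at 11:00 (35 hours later).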

             Drive Array Spare Drive Status Change
                   The agent issues the following alarm if  it  detects  a
                   "Drive  Array  Spare  Status  Change"  in a Compaq drive
                   array logical drive.

                               Spare Drive Status Change, device:***, slot number:*, drive
                               id:*.
                               Status is now status*****

                   The possible values for status are:

                   FAILED
                         This alarm does not indicate that  the  server  or
                         the drive array failed.  It indicates that a spare
                         drive in a drive array failed.  Replace the  spare
                         as soon as possible.

                  INACTIVE
                        This alarm signifies that the spare is in a  ready
                        state.   This  alarm  is  typically  issued if you
                        install or replace a spare drive in a Compaq Drive
                        Array.   It  is  issued after you replace a failed
                        spare drive with a new one.  This alarm  may  also
                        occur  when  a  drive  fails and the spare becomes
                        active, the failed drive is replaced, and the spare
                        eventually becomes inactive again.

                  BUILDING
                        This alarm is issued when a spare drive is brought
                        online  to  replace  a failed drive, and the drive
                        array subsystem begins  to  build  data  onto  the
                        spare.

                  ACTIVE
                        A physical drive failed.  The  array  successfully
                        restored  data onto a spare.  That spare drive has
                        now become active and replaces the failed drive.
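
            The spare drive statuses above imply a life cycle:
            INACTIVE, then BUILDING, then ACTIVE, and back to
            INACTIVE once the failed drive is replaced.  The Python
            sketch below encodes one reading of that cycle.  Allowing
            FAILED from any other state is an assumption, not
            something this page states, and the names are
            hypothetical.

```python
# Expected next statuses for a spare drive, read from the status
# descriptions on this page.  FAILED as a successor of every other
# state is an assumption.
EXPECTED_NEXT = {
    "INACTIVE": {"BUILDING", "FAILED"},
    "BUILDING": {"ACTIVE", "FAILED"},
    "ACTIVE": {"INACTIVE", "FAILED"},
    "FAILED": {"INACTIVE"},  # the failed spare was replaced
}


def is_expected_transition(previous, current):
    """Return True if current is an expected successor of previous."""
    if previous not in EXPECTED_NEXT or current not in EXPECTED_NEXT:
        raise ValueError("unknown spare drive status")
    return current in EXPECTED_NEXT[previous]
```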

            Storage System Fan Status Change
                  The agent issues the following  alarm  message  when  it
                  detects  a change in "Storage system Fan Status" for the
                  SMART Controller.

                              Fan Status Change, device:***, slot number:***, bus number:*
                              Storage System fan status changed to status

                  The possible values for status are:

                  OK    This alarm is issued when a change occurs  in  the
                        operating   condition   of  a  monitored  server's
                        external storage  system  fan.   A  fan  has  been
                        replaced,  or  has  otherwise  returned  to normal
                        operation.   This  alarm  typically  follows   the
                        "Storage System Fan Status Change-FAILED" alarm.

                  FAILED
                        The storage subsystem internal fan failed and  the
                        temperature  may  soon  rise beyond factory preset
                        levels.   The  storage  system   may   shut   down
                        automatically  to  prevent  damage to hardware and
                        data loss.


             Storage System Temperature Failed
                   The agent issues the following  alarm  message  when  it
                   detects that the internal temperature  of  the  storage
                   system has risen beyond factory  preset  levels.   This
                   message requires a SMART Controller in the system.

                               Storage System Temperature Failure, device:***, slot number:*,
                               bus number:*, storage system will be shut down

                   The system may shut down automatically to prevent damage
                   to hardware and data loss.

             Storage System Temperature Degraded
                   The agent issues the following  alarm  message  when  it
                   detects that the temperature inside a storage  subsystem
                   is outside  the  normal  operating  range.  This message
                   requires a SMART Controller in the system.

                               Storage System Temperature Degraded, device:***,
                               Slot number:*, bus number:*,

                               Temperature is outside of normal range.

                   A  fan  may  have  failed.   The  storage  subsystem  is
                   operating  without  the  proper  cooling  capacity,  and
                   internal unit temperatures may  soon  rise  beyond  safe
                   levels.  Take corrective action soon.

             Storage System Temperature OK
                   The agent issues  the  following  alarm  message  if  it
                   detects  temperatures  that were  abnormal have returned
                   to a  normal  state.   This  message  requires  a  SMART
                   Controller in the system.

                               Compaq Storage System, device:***, slot number:*, bus number:*,
                               Storage system temperature OK

                   A fan may have failed.

                   This  message  indicates  an  improving  condition,  and
                   generally requires no further action.

             Storage System Side Panel Removed
                   The agent issues the following  alarm  message  when  it
                   detects the side panel of the storage subsystem has been
                   removed.  This message requires a  SMART  Controller  in
                   the system.

                              Storage System side panel is removed, device:***,
                              Slot number:*, bus number:*,

                  A fan may have failed.

            Storage System Side Panel In Place
                  The agent issues  the  following  alarm  message  if  it
                  detects that the side panel that was  removed  has  been
                  reinstalled.  This message requires a SMART Controller in
                  the system.

                              Storage System side panel is in place, device:***,
                              Slot number:*, bus number:*,

                  This  message  indicates  an  improving  condition,  and
                  generally requires no further action.

         Error Messages
            The various error messages that the IDA Monitoring agent
            software can produce are listed below, and information is
            provided about how each problem can be resolved.

            These messages appear in the Agent/Standard error log file.
            The name of this file is /usr/bin/compaq/agenterrs.log.

                        IDA3001 cpqidamon: Return drive M & P threshold failed,
                              device: name, slot: number.
                              The drives in this controller have not been factory-stamped.
                              Run COMPAQ Diagnostic Utility to stamp the drives.


            The drives in this controller have not  been  factory-stamped.
            Run the COMPAQ Diagnostic Utility  to  stamp  the  drives,  or
            contact a COMPAQ service representative.

                        IDA3002 cpqidamon: Read drive capacity cmd failed,
                              device: name, slot: number, drive: number.

            The agent cannot read the capacity of the  drive.   Check  the
            cables connected to the drive.

                        IDA3003 cpqidamon: Return drive M & P statistics since power-on failed,
                              device: name, slot: number.



             The agent cannot read the M & P statistics of  the  drive(s).
             Check all cables connected to the drive(s).
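
             The three error codes above can be matched to their
             suggested remedies when scanning the error log.  The
             Python sketch below assumes that each message starts on
             its own line with its IDA code followed by "cpqidamon:",
             as in the examples above; the function name is
             hypothetical.

```python
import re

# Remedies paraphrased from the error message descriptions on this page.
ERROR_ADVICE = {
    "IDA3001": "Run the COMPAQ Diagnostic Utility to stamp the drives.",
    "IDA3002": "Check the cables connected to the drive.",
    "IDA3003": "Check all cables connected to the drive(s).",
}

# Assumption: a message begins a line with its code, then "cpqidamon:".
_CODE = re.compile(r"^\s*(IDA\d{4})\s+cpqidamon:", re.MULTILINE)


def advice_for_log(text):
    """Return (code, advice) pairs in the order the codes appear."""
    results = []
    for match in _CODE.finditer(text):
        code = match.group(1)
        advice = ERROR_ADVICE.get(code, "No advice listed on this page.")
        results.append((code, advice))
    return results
```

             The text argument would typically be the contents of
             /usr/bin/compaq/agenterrs.log, the log file named above.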

       NOTICES
             This command is only supported on applicable Compaq systems.

Typewritten Software • bear@typewritten.org • Edmonds, WA 98026