failovermon(1M) DG/UX 5.4 Rel. 2.01 failovermon(1M)
NAME
failovermon - manage failover monitors
SYNOPSIS
failovermon -o add [ -i interval ] [ -r retries ] [ -l lost-pulse ] [
-g regain-pulse ] [ -b ] [ -s ] hostname
failovermon -o delete hostname
failovermon -o modify [ -i interval ] [ -r retries ] [ -l lost-pulse
] [ -g regain-pulse ] [ -bn ] [ -s ] hostname
failovermon -o list [ -qv ] [ hostname ... ]
failovermon -o start [ hostname ... ]
failovermon -o stop [ hostname ... ]
DESCRIPTION
failovermon provides operations for manipulating entries in the
failover monitors(4M) database as well as operations for starting and
stopping failovermon monitors. Failover monitors and their action
scripts (lost-pulse and regain-pulse) are set up and execute on the
system that is serving in the backup role. This system should already
have been set up for failover using the operator-initiated failover
operations through sysadm.
The failovermon process monitors the specified system with a
heartbeat message. This message is sent from the failovermon process
to the failoverd(1M) process on the host being monitored. The
heartbeat is sent over all communication paths that have been set up
for the host being monitored using the admfailoveraltcommpath(1M)
command. As long as at least one response is received by the monitor
the heartbeat is successful. The monitor then sleeps for the number
of seconds specified in its interval value.
If no response is received on any of the communications paths, the
retries value is examined to determine whether or not to declare the
host failed. If the retries value is zero, the monitor immediately
executes the lost-pulse script. If the retries value is not zero, the
monitor continues trying to communicate with the host until the
retries value is exceeded, and then executes the lost-pulse action
script.
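The retry rule above can be sketched as a small shell function. This is an illustration only, not the failovermon implementation: the function name pulse_events and the "ok"/"fail" result words are hypothetical, and the sketch assumes that "exceeded" means the first missed heartbeat plus retries further attempts must all go unanswered before the host is declared failed.

```shell
#!/bin/sh
# Sketch of the lost-pulse / regain-pulse decision (an interpretation of
# the text, not the actual failovermon code).
# Usage: pulse_events RETRIES RESULT...
#   RESULT is "ok"   (at least one path answered) or
#             "fail" (no path answered).
# Prints "lost-pulse" when the host is declared failed and
# "regain-pulse" when a failed host answers again.
pulse_events() {
    retries=$1
    shift
    misses=0        # consecutive heartbeats with no response
    failed=no       # has the host been declared failed?
    for result in "$@"; do
        if [ "$result" = ok ]; then
            if [ "$failed" = yes ]; then
                echo regain-pulse
            fi
            misses=0
            failed=no
        else
            misses=$((misses + 1))
            # Assumption: the first miss plus RETRIES retries must all fail.
            if [ "$failed" = no ] && [ "$misses" -gt "$retries" ]; then
                echo lost-pulse
                failed=yes
            fi
        fi
    done
}
```

With a retries value of 2, for example, "pulse_events 2 fail fail ok" prints nothing, because the host answers before the retries value is exceeded.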
The monitor continues to attempt to communicate with the failed host.
When communications are re-established the regain-pulse action script
is executed.
When a failovermon(1M) monitor is started, a child process is
fork(2)'ed and put in the background. The start operation will report
the monitor as started if this succeeds. The monitor can then fail if
the host it is supposed to monitor is not accepting communications.
In most cases when this occurs, you should check whether the
listen(1M) port monitor is running on the remote system. If it is not
running, you will get the "connection refused" error message on the
system console.
The failovermon monitor can be configured to monitor the host it is
running on. This type of monitoring is used to detect a system hang.
The monitor determines if the system it is invoked on has and can use
the wdt() driver. This driver is available for use on AV4600 and
above systems. The wdt() driver will internally reset a register
every second. If it fails to reset the timer in one second, it will
trigger a warm reset of the system. The failovermon monitor
communicates with the wdt() driver for a higher level of monitoring.
The failovermon process will attempt to open and close a file every
30 seconds. Upon successful completion, the failovermon process will
send a message to the wdt() driver indicating the system is alive.
If the wdt() driver does not get a message from the failovermon
process within 30 seconds of the last message, the wdt() driver will
initiate a system panic to alleviate the hang.
When the failovermon process is stopped or terminates abnormally, the
wdt() driver ceases the high level monitoring. The wdt() driver
continues to perform its lower level monitoring until the driver is
deconfigured from the system.
The failovermon monitor can be configured to be started when the
system is rebooted.
Operations
add Add a failovermon monitor entry for hostname to the
failover monitors database. This operation will optionally
allow the administrator to start the monitor at this time.
delete Delete a failovermon monitor entry for hostname from the
failover monitors database. This operation will also
terminate an existing monitor if one is running.
modify Modify a failovermon monitor entry for hostname. This
operation will optionally allow the administrator to
restart the current monitor (if one is running) or start
one using the new information.
list List failover monitors database entries. The list operation
reports the following monitor information to stdout:
the name of the host that is being monitored
a flag indicating whether a monitor is running
a flag indicating whether the monitor is brought up at system reboot time
the interval value
the retries value
the lost-pulse action script name
the regain-pulse action script name
With the "verbose" format (-v), information is printed in
aligned columns with headers. With the "quiet" format (-q)
headers are suppressed and each host entry is printed on a
separate line. If both -q and -v are specified, the output
will be in "quiet" format.
start Start a failovermon monitor for the specified host(s).
stop Stop a failovermon monitor for the specified host(s).
Options
The following options can be used with the add or modify operations:
-b Start on reboot. This option specifies that this monitor is
to be brought up when the system is rebooted.
-i interval
The time in seconds that the failovermon monitor waits
after receiving a reply to a handshake before initiating
the next handshake. The default is zero for an add
operation or the current interval value for a modify
operation.
-r retries
The number of times the failovermon monitor should retry
communication with the failoverd daemon of the specified
system before declaring the system failed. The default is
zero for an add operation or the current retries value for
a modify operation.
-l lost-pulse
The full pathname to the user-created script to be executed
when the monitor declares a system to be failed. This
script should contain an admfailoverdisk(1M) command line
to transfer the physical disks from the failed host to the
backup host. This script should also contain any system
setup required for the application or its users. The default
is /etc/failover/failovermon_lost_pulse for an add
operation or the current lost-pulse value for a modify
operation.
-g regain-pulse
The full pathname to the user-created script to be executed
when the monitor regains the pulse of the system it is
monitoring. This script should contain any actions that
should be performed when the heartbeat is regained (e.g.,
the administrator may want to shut down the application and
move the disks back to the original host). The default is
/etc/failover/failovermon_regain_pulse for an add operation
or the current regain-pulse value for a modify operation.
-s If specified on an add operation this option indicates that
the monitor should be started. If specified on a modify
operation this option indicates that the currently running
monitor should be stopped and restarted with the new
values. If no monitor is running, then one will be started.
The following option can be used with the modify operation:
-n Do not start on reboot. This option specifies that this
monitor is not to be brought up when the system is
rebooted.
The following options can be used with the list operation:
-q Quiet. Produce an unformatted listing with no headers,
fields delimited by a single space.
-v Verbose. Produce a formatted listing with headers and
aligned columns. This option is the default.
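Because the quiet format prints one space-delimited entry per line, it lends itself to processing in a shell script. The following is a minimal sketch, not part of failovermon: the function name running_hosts is hypothetical, the field order is assumed to match the order given under the list operation, and the literal value of the running flag (assumed here to be "yes") is a guess that should be checked against real output.

```shell
#!/bin/sh
# Print the names of hosts whose monitor is flagged as running, reading
# "failovermon -o list -q" style lines on standard input.
# ASSUMPTIONS: the fields are host, running flag, reboot flag, interval,
# retries, lost-pulse script, regain-pulse script (the order listed under
# the list operation), and the running flag equals the value in $1
# (default "yes"). Verify both against real output before relying on this.
running_hosts() {
    flag=${1:-yes}
    while read -r host running reboot interval retries lost regain; do
        [ -n "$host" ] || continue      # skip blank lines
        if [ "$running" = "$flag" ]; then
            echo "$host"
        fi
    done
}

# Possible usage:
#   failovermon -o list -q | running_hosts
```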
EXAMPLE
The following command line adds a failovermon monitor for a system
named hostA. The monitor sends a heartbeat message every 60 seconds
and retries the handshake message 3 times before executing the
/hostA_has_failed script. Should hostA return, the /hostA_is_back
script is executed:
failovermon -o add -i 60 -r 3 -l /hostA_has_failed -g /hostA_is_back hostA
The monitor can then be started with the following command line:
failovermon -o start hostA
To modify this monitor and restart it with an interval of 1200
seconds (i.e., 20 minutes), for example for off-peak monitoring, use
the following command line:
failovermon -o modify -i 1200 -s hostA
To stop this monitor, use the following command line:
failovermon -o stop hostA
FILES
/etc/failover/monitors
failover monitors database
DIAGNOSTICS
Warnings
- Cannot initiate connection with host <hostname>, retrying.
- A monitor for <hostname> is already running.
- An attempt was made to delete a monitors database entry that
did not exist.
Errors
- failovermon connection refused.
- Monitor for <hostname> not running.
- Monitor for <hostname> is already running.
- An attempt was made to add, delete, modify, or list a monitor
for an invalid host.
- An attempt was made to modify or list a monitors database
entry that did not exist.
- An attempt was made to add a monitors database entry that
already existed.
- The wdt() driver is not supported on this system.
Exit Codes
0 The operation was successful.
1 The operation was unsuccessful.
2 The operation failed due to access restrictions.
3 There was an error in the command line.
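A calling script can branch on these exit codes. The sketch below only maps a status to the meaning listed above; the function name failovermon_status_text is hypothetical and is not supplied with failovermon.

```shell
#!/bin/sh
# Map a failovermon exit status to the meaning listed under Exit Codes.
# Any other value is reported as undocumented rather than guessed at.
failovermon_status_text() {
    case "$1" in
        0) echo "operation successful" ;;
        1) echo "operation unsuccessful" ;;
        2) echo "access restrictions" ;;
        3) echo "command line error" ;;
        *) echo "undocumented exit status $1" ;;
    esac
}

# Possible usage:
#   failovermon -o start hostA
#   failovermon_status_text $?
```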
SEE ALSO
sysadm(1M), admfailoverdisk(1M), failoverd(1M), failover(4M).
NOTES
Super-user privilege is required for all operations except list.
It is possible for a system to be in a state where users get no
response but the monitor continues to detect a heartbeat. If you
detect this condition, reset or "hot-key" the hung system. The
monitor can then detect the failure and execute its lost-pulse
action script, so that the applications can be restarted while the
failed system is rebooted.
If you add additional communications paths to the failover
altcommpath database after a monitor has been started, you will need
to stop and start the monitor in order for those additional paths to
be used.
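The stop and start sequence for one or more hosts can be wrapped in a small function. This is a sketch under stated assumptions: the function name restart_monitors and the FAILOVERMON variable (an override hook for dry runs) are hypothetical, and only the documented stop and start operations are used.

```shell
#!/bin/sh
# Stop and then start the monitor for each named host so that newly
# added communication paths are used.
# FAILOVERMON is a hypothetical override for the command name, useful
# for dry runs; by default the real failovermon command is called.
restart_monitors() {
    cmd=${FAILOVERMON:-failovermon}
    for host in "$@"; do
        "$cmd" -o stop "$host" || return 1
        "$cmd" -o start "$host" || return 1
    done
}

# Possible usage:
#   restart_monitors hostA hostB
```

The function returns at the first failing operation, so a host whose monitor is not currently running is not started by it; start such a monitor directly with the start operation.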
If you intend to shut down a system that is being monitored and do not
want the monitor to detect the system being down and execute its
lost-pulse action script, you should stop the monitor before shutting
down the system.
Licensed material--property of copyright holder(s)