CRASH(8) — Silicon Graphics

NAME

crash − what to do when the system crashes

DESCRIPTION

This entry gives at least a few clues about how to proceed if the system crashes. It can’t pretend to be complete.

In restarting after a crash, always bring up the system single-user, as specified in boot(8). Then perform an fsck(1M) on all file systems which could have been in use at the time of the crash. If any serious file system problems are found, they should be repaired. When you are satisfied with the health of your disks, check and set the date if necessary, then come up multi-user.

For UNIX to boot at all, certain files (and the directories leading to them) must be intact. First, the initialization program /etc/init must be present and executable. For init to work correctly, /dev/console, /bin/sh and /bin/env must be present. If any of these is missing, the symptom is best described as thrashing: init goes into a fork/exec loop trying to create a Shell with proper standard input and output. The file /etc/rc should also be present and executable; without it the system will come up but will not be fully initialized.
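The check described above can be sketched as a small script run from a backup system against the damaged root. This is only an illustration: the function name check_boot_files and the idea of passing a mount point are assumptions made here, not part of any system utility. Only the file names come from the text.

```python
import os

# Paths named in the text above, relative to the root being inspected.
REQUIRED = ["etc/init", "dev/console", "bin/sh", "bin/env"]
RECOMMENDED = ["etc/rc"]  # boot proceeds without it, but incompletely
MUST_BE_EXECUTABLE = {"etc/init", "etc/rc"}


def check_boot_files(root="/"):
    """Return (path, problem) pairs for boot-critical files under `root`.

    Hypothetical helper: `root` is wherever the suspect root file system
    is visible, for example a mount point on a backup system.
    """
    problems = []
    for rel in REQUIRED + RECOMMENDED:
        path = os.path.join(root, rel)
        if not os.path.exists(path):
            if rel in REQUIRED:
                problems.append((path, "missing"))
            else:
                problems.append(
                    (path, "missing; system will not be fully initialized"))
        elif rel in MUST_BE_EXECUTABLE and not (os.stat(path).st_mode & 0o111):
            # No execute bit is set for owner, group, or others.
            problems.append((path, "not executable"))
    return problems
```

Calling check_boot_files with the directory where the damaged root is mounted returns an empty list when every file named above is in place.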

If you cannot get the system to boot, a runnable system must be obtained from a backup medium. The root file system may then be doctored as a mounted file system as described below. If there are any problems with the root file system, it is probably prudent to go to a backup system to avoid working on a mounted file system.

Repairing disks. The first rule to keep in mind is that an addled disk should be treated gently; it shouldn’t be mounted unless necessary, and if it is very valuable yet in quite bad shape, perhaps it should be copied before trying surgery on it. This is an area where experience and informed courage count for much.

fsck(1M) is adept at diagnosing and repairing file system problems. It first identifies all of the files that contain bad (out of range) blocks or blocks that appear in more than one file. Any such files are then identified by name and fsck requests permission to remove them from the file system. Files with bad blocks should be removed. In the case of duplicate blocks, all of the files except the most recently modified should be removed. The contents of the survivor should be checked after the file system is repaired to ensure that it contains the proper data. (Note that running fsck with the −n option will cause it to report all problems without attempting any repair.)
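The removal rule above can be illustrated with a toy model. This is not how fsck is implemented: the in-memory table of files, the function name classify_files, and the block-range parameters are all assumptions made for the example. It only shows the decision the operator is asked to make: remove files with out-of-range blocks, and among files sharing a block, keep the most recently modified one for later inspection.

```python
from collections import defaultdict


def classify_files(files, first_block, total_blocks):
    """Toy model of the bad/duplicate block decision.

    files: dict mapping name -> (mtime, [block numbers])   (hypothetical layout)
    Returns (remove, survivors): names to remove, and names kept after a
    duplicate-block conflict whose contents should be checked afterwards.
    """
    bad = set()
    owners = defaultdict(list)
    for name, (_mtime, blocks) in files.items():
        for block in blocks:
            if first_block <= block < total_blocks:
                owners[block].append(name)
            else:
                bad.add(name)  # out-of-range block: remove the file

    remove = set(bad)
    survivors = set()
    for names in owners.values():
        # Files with bad blocks are removed anyway, so they cannot survive.
        candidates = sorted(set(names) - bad)
        if len(set(names)) > 1 and candidates:
            keep = max(candidates, key=lambda n: files[n][0])
            survivors.add(keep)
            remove.update(n for n in candidates if n != keep)
    return sorted(remove), sorted(survivors - remove)
```

With files "a" (older) and "b" (newer) both claiming block 11, the sketch marks "a" for removal and lists "b" as the survivor to inspect.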

fsck will also report on incorrect link counts and will request permission to adjust any that are erroneous. In addition, it will reconnect any files or directories that are allocated but have no file system references to a “lost+found” directory. Finally, if the free list is bad (out of range, missing, or duplicate blocks) fsck will, with the operator’s concurrence, construct a new one.

Why did it crash? UNIX types a message on the console typewriter when it voluntarily crashes. Here is the current list of such messages. The message has the form “panic: ...”, possibly accompanied by other information. Left unstated in all cases is the possibility that hardware or software error produced the message in some unexpected way.

default_intr
An interrupt has occurred for which there is no device driver.

I/O err in swap
While swapping a user process, a hard error occurred on the swap disk.

parity error
A parity error occurred somewhere in the onboard memory. A message will precede this diagnostic to indicate where in physical memory the error occurred. Unfortunately, UNIX can’t diagnose the memory failure. If the error persists, the memory diagnostic should be used.

iinit
The system was not able to read the root file system. This could be either a hardware or a software problem, but it most likely means that either the disk drive or the root file system on it is damaged.

The following diagnostics indicate that something is wrong with the disk controller.

dsd: couldn’t start!
qicstart: couldn’t start!
dsd: no status posted
dsdstatus
dsdstart unknown type

riomap
The system attempted to issue a raw I/O request which was larger than the CPU can physically handle.

getmajor
While attempting to boot, the system configured a disk drive which had no entry in the bdevsw[] array (the array of block devices). The system was incorrectly configured.

out of memory during boot
The system is too large to run in the given memory.

The following diagnostics indicate that something is wrong with the buffer cache / inode tables.

devtab
bflush: bad free list
no fs
no imt
dsdattach: geteblk() failed

swap error: swapping beyond process
Something is wrong with the user memory management code.

timeout table overflow
The system attempted to put a time driven event on a queue, and there was no room in the queue. If this happens often, then the system has been incorrectly configured.

no procs
The system decided that it had a process slot available for a fork, and then redecided that it didn’t. If this happens often, then the system has been incorrectly configured.

init died!
The init process was killed. If it was killed by a user program or the Shell kill command, then nothing is wrong. If this happens during the boot procedure, then something is wrong with the root file system.

out of swap space
Too many processes for the given swap space. Try running a more modest number of processes.

The following diagnostics indicate problems with the kernel.

unexpected kernel trap
kernel address error
kernel bus error
defrasterfont: font too large

syscall
The system attempted to do a system call, which is not allowed. The system was incorrectly configured.

The following diagnostics indicate that the Ethernet hardware/software has a bug.

nxpresent: cleared
xns_ttstart

panic recursion
The system got a panic message while trying to inform the console of a panic.
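The list above can be condensed into a lookup table for anyone keeping console logs. The following is a minimal sketch: the function name explain_panic, the ASCII apostrophes, and the assumption that the logged line contains the text "panic:" followed directly by the message are illustrative choices, and the hints merely paraphrase the entries above.

```python
DISK = "disk controller problem"
CACHE = "buffer cache / inode table problem"
KERNEL = "kernel problem"
ETHER = "Ethernet hardware/software bug"
CONFIG = "if frequent, the system is incorrectly configured"

# Message text -> short hint, paraphrased from the list above.
PANIC_HINTS = {
    "default_intr": "interrupt with no device driver",
    "I/O err in swap": "hard error on the swap disk",
    "parity error": "onboard memory parity error; run the memory diagnostic",
    "iinit": "cannot read the root file system",
    "dsd: couldn't start!": DISK,
    "qicstart: couldn't start!": DISK,
    "dsd: no status posted": DISK,
    "dsdstatus": DISK,
    "dsdstart unknown type": DISK,
    "riomap": "raw I/O request larger than the CPU can handle",
    "getmajor": "disk drive has no bdevsw[] entry; incorrect configuration",
    "out of memory during boot": "system too large for the given memory",
    "devtab": CACHE,
    "bflush: bad free list": CACHE,
    "no fs": CACHE,
    "no imt": CACHE,
    "dsdattach: geteblk() failed": CACHE,
    "swap error: swapping beyond process": "user memory management code",
    "timeout table overflow": CONFIG,
    "no procs": CONFIG,
    "init died!": "init was killed; during boot, suspect the root file system",
    "out of swap space": "too many processes for the given swap space",
    "unexpected kernel trap": KERNEL,
    "kernel address error": KERNEL,
    "kernel bus error": KERNEL,
    "defrasterfont: font too large": KERNEL,
    "syscall": "incorrect configuration",
    "nxpresent: cleared": ETHER,
    "xns_ttstart": ETHER,
    "panic recursion": "panic while reporting a panic to the console",
}


def explain_panic(console_line):
    """Return (message, hint) for a logged panic line, or None if not a panic."""
    marker = "panic:"
    if marker not in console_line:
        return None
    message = console_line.split(marker, 1)[1].strip()
    # Try longer keys first so a short key never shadows a longer one.
    for key in sorted(PANIC_HINTS, key=len, reverse=True):
        if message.startswith(key):
            return key, PANIC_HINTS[key]
    return message, "not listed; possibly an unexpected hardware or software error"
```

A message that is not in the table is returned unchanged with a generic hint, since any of these diagnostics may also result from an unexpected hardware or software error.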
