Why computers crash

It’s safe to say that anyone who’s ever worked on a Windows system has seen the infamous “blue screen of death.” This solid blue screen with white lettering tells you that the system has crashed and gives you a couple of options. If you’re lucky, all you do is reboot and redo all the work you just lost, but you could find yourself dealing with major system corruption.

Simply put, the blue screen of death is just a serious error message, a sign that your computer has hung up due to an error. The Mac OS equivalent is a blank screen with a small text box containing a picture of a bomb with a lit fuse.

The upside of this unwelcome shutdown screen is that it contains some information about what caused the crash.

A “core dump” will often appear on the screen, with coded information from the system’s RAM. If you record what the screen shows, it might help you determine exactly what went wrong with your machine and prevent it from happening again.

An infinite loop is another of the many errors that can bring a computer to its knees.

A loop is a series of instructions that gets repeated until a specified condition is met. When that condition can’t be met, the loop cycles endlessly and never quits or moves to the next part of the program.
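As a minimal sketch of that idea, the Python snippet below counts upward until a value exactly equals a target. The function name `run_loop` and the `max_iterations` safety cap are illustrative additions, not part of any real system; the cap exists only so the demonstration itself cannot hang the way a genuine runaway loop would.

```python
def run_loop(start, step, target, max_iterations=1000):
    """Count from start in increments of step until the value equals target.

    max_iterations is a safety cap added for this sketch; a real runaway
    loop has no such guard. Returns (finished, iterations).
    """
    value = start
    iterations = 0
    while value != target:              # exit condition: value must equal target exactly
        if iterations >= max_iterations:
            return False, iterations    # the condition was never met, so give up
        value += step
        iterations += 1
    return True, iterations


print(run_loop(0, 2, 10))   # (True, 5): 0, 2, 4, 6, 8, 10 reaches the target
print(run_loop(1, 2, 10))   # (False, 1000): odd values never equal 10
```

In the second call the values skip from 9 to 11, so the exit condition can never be satisfied. Without the cap, the loop would run until something outside the program stopped it.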

Thrashing is another problem condition. Any computer has a finite amount of memory and processing capabilities. When a process or program (or, with respect to a server, a user) makes a request of the operating system that can’t be met, the operating system borrows the necessary resources from another process. But then the borrowed-from process asks for resources, and the operating system has to find them somewhere else.

Eventually, the entire system is looking for help, and the computer user is looking at a stagnant or blue screen.

Consider what happens when several users need lots of resources at the same time. Here, the operating system may give one process exclusive use of all its resources for a short period of time, then go and reallocate those to the next user, and so on.

When the system moves from one user to the next, however, it has to save everything the previous user was doing (such as data or the state of its processes) out to disk, which is relatively slow. Then it has to reload the next user’s saved data and programs from that same slow disk before any computing can resume.

Since intervals between changes or requests are measured in milliseconds, it’s easy to see that just the overhead of changing users and reallocating resources can consume virtually all of the computer’s time and capacity, so that little or no real work gets done.
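The arithmetic behind that claim can be sketched in a few lines of Python. The model is deliberately simple and the millisecond figures are hypothetical, chosen only for illustration: each cycle consists of one slice of real work followed by one swap to disk and back.

```python
def useful_fraction(time_slice_ms, switch_cost_ms):
    """Fraction of each work-plus-swap cycle that is spent on real work."""
    return time_slice_ms / (time_slice_ms + switch_cost_ms)


# Hypothetical figures, for illustration only.
print(useful_fraction(100, 1))   # swapping is cheap: roughly 99% real work
print(useful_fraction(10, 40))   # 0.2, so four-fifths of the time goes to swapping
print(useful_fraction(1, 99))    # 0.01, so almost nothing but overhead
```

Under these assumptions, once the swap costs more than the slice of work it enables, most of the machine’s time goes to moving data rather than computing.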

Finally, there’s the classic fatal error. There are certain commands that an ordinary user isn’t allowed to issue. These typically have to do with the operation of the hardware, memory and processing of the machine.

Sometimes, however, a program steps into that forbidden area and, to protect itself, the machine shuts down. That way, when you reboot, everything still works the way it should. Except for all the data you lost due to the shutdown.

Perhaps the best-liked feature of Windows 2000 has been its stability in the face of such errors, its ability to shut down just the offending process without forcing a reboot.
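The general idea of stopping only the offending process can be illustrated with a small Python sketch. It is not specific to Windows 2000; it assumes only an operating system that isolates processes from one another. The helper name `run_crashing_child` is hypothetical. The parent starts a child process that deliberately aborts, then carries on.

```python
import subprocess
import sys


def run_crashing_child():
    """Start a child Python process that aborts, and return its exit status."""
    completed = subprocess.run(
        [sys.executable, "-c", "import os; os.abort()"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return completed.returncode


status = run_crashing_child()
print("child exit status:", status)   # nonzero; the exact value depends on the platform
print("parent is still running")
```

The child’s failure shows up only as a nonzero exit status. The parent, and the rest of the system, keep going.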

Disks crash, too

There’s another type of computer problem that’s also called a crash, and it happens in hard disks.

Normally, the heads of a disk drive actually fly over the surface of the platters, never touching the magnetic media. But if there’s a sudden physical shock (say, you drop your laptop), the heads can touch the rapidly spinning platters. Such a disk crash (also called a head crash) usually causes a loss of data or program files and damage to both the platter and the head, inevitably necessitating a new drive.

The deadly embrace

Deadlock (which is sometimes called the deadly embrace) is another crippling condition. It occurs when two or more programs are each waiting for the others to complete or even just to produce a data value before proceeding.

The programs act like the overly congenial gophers in some Looney Tunes cartoons:

“Oh please, you first,” says one.

“No, no, I insist, you first,” says the other. And nothing goes anywhere.

Generally, deadlock occurs in systems that run multiple tasks or in servers with multiple clients. Operating systems and middleware that queues messages have attempted to eradicate deadlock, but it still pops up now and then.

There’s a historic reason why deadlock exists.

Early operating systems ran only one program at a time, and all of the resources in the system were automatically made available to that one program.

In order to run multiple programs at once, programmers figured out how to start fulfilling a program’s needs without giving it the system’s full resources. In mid-operation, a program can request additional resources as it needs them. But if another program or programs have already grabbed those resources, and are in turn waiting for resources the first program holds, the result is deadlock.
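A minimal Python sketch of that situation uses two threads and two locks as stand-ins for programs and resources. All names here (`demo_deadlock`, `lock_a`, `lock_b`) are hypothetical. Each thread grabs one lock and then asks for the other. The timeout is an addition made for this sketch so that it reports the standoff instead of hanging forever, as a true deadlock would.

```python
import threading


def demo_deadlock(timeout=0.5):
    """Two threads each hold one lock and wait (with a timeout) for the other."""
    lock_a = threading.Lock()
    lock_b = threading.Lock()
    holding = threading.Barrier(2)     # both threads hold their first lock
    attempted = threading.Barrier(2)   # both threads have finished trying
    results = {}

    def worker(name, first, second):
        with first:
            holding.wait()             # wait until each thread holds one lock
            got_second = second.acquire(timeout=timeout)
            results[name] = got_second
            attempted.wait()           # keep holding `first` until both attempts end
            if got_second:
                second.release()

    t1 = threading.Thread(target=worker, args=("one", lock_a, lock_b))
    t2 = threading.Thread(target=worker, args=("two", lock_b, lock_a))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return results


print(demo_deadlock())   # neither thread obtains its second lock
```

Both entries in the result are `False`: each thread holds what the other needs, so neither request can be granted.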

Beyond operating systems, deadlock can occur in databases and Web browsers. Generally, deadlock happens less often in newer applications because they impose a hierarchy, or fixed order, on resource requests, which skirts the problem in many cases.
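One common way to impose such a hierarchy is to give every resource a rank and always acquire resources in ascending rank, whatever order the program asked for them in. The sketch below assumes that approach; `RankedLock`, `with_all` and `run_demo` are hypothetical names invented for this example, not a standard API.

```python
import threading


class RankedLock:
    """A lock paired with a fixed rank (an illustrative convention)."""

    def __init__(self, rank):
        self.rank = rank
        self.lock = threading.Lock()


def with_all(locks, action):
    """Acquire every lock in ascending rank, run action, release in reverse."""
    ordered = sorted(locks, key=lambda ranked: ranked.rank)
    for ranked in ordered:
        ranked.lock.acquire()
    try:
        return action()
    finally:
        for ranked in reversed(ordered):
            ranked.lock.release()


def run_demo(rounds=1000):
    a, b = RankedLock(1), RankedLock(2)
    counter = {"value": 0}

    def bump():
        counter["value"] += 1

    def worker(wanted):
        for _ in range(rounds):
            with_all(wanted, bump)

    # The two threads ask for the locks in opposite orders.
    t1 = threading.Thread(target=worker, args=([a, b],))
    t2 = threading.Thread(target=worker, args=([b, a],))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return counter["value"]


print(run_demo())   # 2000
```

Because both threads end up taking the rank-1 lock first, neither can hold one lock while waiting on a thread that holds the other. The circular wait that defines deadlock cannot form.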