“Is software reliability important? Ask your grandma,” says operating systems guru Dr. Andrew Tanenbaum.
When consumers buy an electrical appliance such as a TV or stereo, they expect to bring it home, plug it in and see it work. And that is exactly what happens, for years on end. The same should be true of computers but is not, says Tanenbaum, author and Professor of Computer Science at Vrije Universiteit in Holland.
Tanenbaum used last week’s linux.conf.au in Australia to introduce his new metric: LFs — Lifetime Failures, which he says is the number of times software, particularly the operating system, has crashed in a user’s lifetime.
He said there was no reason why PC consumers should expect mediocrity from their operating systems. “A TV doesn’t have a reset button,” he said.
But how to do this?
“I think it is time we rethink operating systems,” he said. “We have to rethink where we are going in 2007. We have basically infinite hardware and the only reason it’s slow is because the performance is so bad.”
To this he added a disclaimer: “Performance for the most part isn’t an issue: bad code is.”
To illustrate the complexity of operating system software, he pointed out the rise in the amount of code for Microsoft’s Windows software over the past decade. Windows NT 3.5 started out with 6 million lines of code (LoC) in 1993. NT 4 in 1996 had 16 million LoC, Windows 2000 had 29 million LoC and XP had 50 million LoC.
With an average bug rate of anywhere from 10 to 75 per 1,000 LoC, the chance of errors and failures rises sharply as the code base grows, he said.
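To put those figures together, the short Python sketch below multiplies each quoted code size by the low and high ends of the quoted bug rate. It is only back-of-the-envelope arithmetic on the numbers cited in the talk, not a measurement of any real release, and the names `LOC_BY_RELEASE` and `bug_range` are made up for this illustration.

```python
# Back-of-the-envelope estimate: latent bugs = (LoC / 1000) * bugs per 1000 LoC.
# The LoC figures and the 10-75 range are the ones quoted in the talk;
# the variable and function names are illustrative assumptions.
LOC_BY_RELEASE = {
    "Windows NT 3.5": 6_000_000,
    "Windows NT 4": 16_000_000,
    "Windows 2000": 29_000_000,
    "Windows XP": 50_000_000,
}
BUGS_PER_KLOC_LOW = 10
BUGS_PER_KLOC_HIGH = 75


def bug_range(loc):
    """Return (low, high) estimated bug counts for a code base of `loc` lines."""
    kloc = loc // 1000
    return kloc * BUGS_PER_KLOC_LOW, kloc * BUGS_PER_KLOC_HIGH


for release, loc in LOC_BY_RELEASE.items():
    low, high = bug_range(loc)
    print(f"{release}: {low:,} to {high:,} estimated bugs")
```

On these assumptions, the 6 million lines quoted for NT 3.5 would imply 60,000 to 450,000 bugs, and the 50 million quoted for XP would imply 500,000 to 3.75 million.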
Tanenbaum was critical of software design today, saying applications had far too many features, many of them unnecessary. He said as software gets more bloated, it becomes less reliable, buggier and slower. “I think that is a bad direction to go into.”
He referred to RAID arrays and ECC memory as hardware devices which, when they encounter errors, can correct them on the fly.
“Correcting bad software on the fly surely should be easier than correcting bad hardware. So I think we need to go in the direction of self healing software,” he said.
To achieve a Lifetime Failures count of zero, he said systems needed to be small. This should start with minimizing the code in the OS kernel, which he also said needed to be modular.
The next step is to isolate components such as drivers and file systems so that problems, should they arise, can’t spread.
On this matter, Tanenbaum referred to the Principle of Least Authority (POLA): “Don’t give something more authority than it needs.” In this instance, the failure of one component should not crash other components in the OS.
RAID’s self healing ability was a perfect illustration: “If one drive fails it shouldn’t pollute the other drives.”
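The two ideas above, least authority and containment of failures, can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how MINIX 3 or any real kernel implements them: the `Component` and `Supervisor` classes and the operation names are hypothetical, and a single Python process cannot provide the memory isolation a real OS would need.

```python
class Component:
    """Hypothetical OS component limited to the operations it was granted."""

    def __init__(self, name, allowed_ops):
        self.name = name
        self.allowed_ops = frozenset(allowed_ops)
        self.restarts = 0

    def request(self, op):
        # Least authority: refuse anything outside the granted set.
        if op not in self.allowed_ops:
            raise PermissionError(f"{self.name} may not perform {op!r}")
        return f"{self.name}: {op} ok"


class Supervisor:
    """Runs tasks per component and contains any failure to that component."""

    def __init__(self):
        self.components = {}

    def register(self, component):
        self.components[component.name] = component

    def run(self, name, task):
        component = self.components[name]
        try:
            return task(component)
        except Exception as exc:
            # The failure is recorded and the component is "restarted";
            # other registered components are never touched.
            component.restarts += 1
            return f"{name} failed ({type(exc).__name__}); restarted"


supervisor = Supervisor()
supervisor.register(Component("disk_driver", {"read_block", "write_block"}))
supervisor.register(Component("net_driver", {"send_packet"}))

# The network driver oversteps its authority; only it is affected.
print(supervisor.run("net_driver", lambda c: c.request("write_block")))
# The disk driver keeps working as before.
print(supervisor.run("disk_driver", lambda c: c.request("read_block")))
```

The point of the sketch is the shape of the design: each component holds only the rights it needs, and a supervisor treats a misbehaving component as a local event rather than a system-wide crash.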
Tanenbaum said MINIX 3, the most recent version of the operating system which he created 20 years ago, and which aided Linus Torvalds in writing Linux, deploys many of the features he highlighted in his presentation. MINIX today is primarily used as a teaching tool for computer science students worldwide.
“Maybe the direction Linux could go would be [as] the system that is ultra reliable, that works all the time and has not got all the problems that you get in Windows,” he suggested.
Although it was only a suggestion, it may carry some weight.