The airline disruption was probably the most inconvenient problem caused by the addition of a second to atomic clocks, necessary to account for a minuscule accumulated slowing of the Earth’s rotation of about 1.4 milliseconds per year.
Australian airline Qantas reported two-hour delays as Saturday turned into Sunday UTC (noon, Sydney time), which echoed wider and diffuse Internet problems that temporarily downed a clutch of well-known websites and services including Reddit, LinkedIn, Foursquare, StumbleUpon and Cisco Systems Inc. videoconferencing.
“This incident was caused by the Linux bug triggered by the ‘leap second’ inserted into clocks worldwide on June 30th,” read an Amadeus statement on the problem that beautifully understates a complex if predictable software issue affecting some Linux-based programs while leaving others untouched.
Which programs were left exposed? Anything with a Java component, plus MySQL, Firefox, Thunderbird and Debian, the last of which caused server blades to ‘go dark’.
It shouldn’t have happened, as anyone who remembers the hugely expensive damp squib of the Y2K bug at the turn of the millennium will be reminding the engineering fraternity. In some cases, techies appear to have guessed that the problem was related to the leap second only because it occurred precisely at midnight UTC/GMT.
Although leap seconds caused by the need to compensate for the Earth’s rotation are extremely rare occurrences – the last whole-second adjustment would have happened in 1820 had atomic clocks and NTP (Network Time Protocol) servers existed – there have in fact been 25 leap seconds for other reasons since the beginning of atomically measured time in 1971.
Of course, Linux’s weakness was its sheer diversity rather than an inherent issue with the open source model of collaboration itself. Linux is a hugely important foundation of Internet services without that fact being obvious.
The preferred time adjustment technique is to add the extra second gradually, in tiny increments spread over a period of time, so that systems absorb it smoothly rather than in a single one-second jump. Such an approach is already used by Google’s NTP servers.
It now appears that not everyone got the memo.
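The gradual adjustment described above can be illustrated with a minimal sketch. It assumes a simple linear ramp over a 20-hour window ending at the leap second; the window length, the linear shape and the names `LEAP_INSTANT`, `SMEAR_WINDOW` and `smear_fraction` are illustrative assumptions, not details of Google’s or anyone else’s actual NTP configuration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical values chosen for illustration only.
LEAP_INSTANT = datetime(2012, 7, 1, 0, 0, 0, tzinfo=timezone.utc)  # end of the leap second
SMEAR_WINDOW = timedelta(hours=20)  # assumed length of the adjustment period


def smear_fraction(now, leap=LEAP_INSTANT, window=SMEAR_WINDOW):
    """Return how much of the extra second (0.0 to 1.0) has been absorbed at `now`.

    Before the window opens nothing has been added; after the leap instant
    the whole second has been added; in between it grows linearly.
    """
    start = leap - window
    if now <= start:
        return 0.0
    if now >= leap:
        return 1.0
    return (now - start) / window  # timedelta / timedelta yields a float


if __name__ == "__main__":
    for hours_before in (30, 20, 10, 5, 0):
        t = LEAP_INSTANT - timedelta(hours=hours_before)
        print(f"{hours_before:>2} h before leap: {smear_fraction(t):.2f} s absorbed")
```

With a ramp like this, no clock ever sees a repeated or 61st second; each served timestamp is off by at most one second during the window, which is the trade-off the technique accepts.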
“Initial reporting often fingered Java or even Cassandra as the culprit … but the actual problem was a kind of livelock in the Linux system calls responsible for timers,” wrote Jonathan Ellis in a blog post.
Although the Network Time Protocol, the most widely used mechanism to synchronize the time across the Internet, was designed to handle leap seconds, a number of popular Internet services briefly went offline after the second was inserted in their servers.
Reddit engineers had initially assumed that Cassandra, along with Java, was the source of its leap-second-related outage on Saturday. The problem wasn’t with either of those technologies, Ellis countered, but rather with the underlying OS. (Oracle, which manages Java, did not immediately respond to a request for comment.)
A system administrator would have first noticed the problem manifesting as an extremely high system load or even a system crash that could be traced back, via the normal administrative tools, to an application such as Cassandra, the Java Virtual Machine, Hadoop, or MySQL. The actual culprit, however, turned out to be a harder-to-pinpoint bug in the way Linux updated its clocks when a leap second was introduced, Ellis said.
Many found that restarting the application did not restore the server to normal operation. They could, however, remedy the issue by resetting the system clock or rebooting the server.
Whatever the cause, the bug disrupted the Saturday evenings of many a system administrator. On the Time-Nuts mailing list, one admin reported spending the evening rebooting hundreds of servers