VMware bug crashes servers

Many VMware customers Tuesday were prevented from logging onto their virtual servers as a bug distributed in a software update effectively stopped the boxes from powering up.

According to VMware, the issue involves ESX 3.5 Update 2 and ESXi 3.5 and affects customers powering on virtual machines (VMs) on hosts that have been upgraded with those releases. In a statement, VMware said it is “working on an immediate patch for customers in production. VMware expects to fix the issue in code in the next 36 hours once QA testing has been completed.”

The company says the date bug only affects customers that had updated their systems with the July 27 releases of ESX 3.5 Update 2 and ESXi 3.5, but VMware has not specified exactly how many customers that could be. VMware is sure to take a publicity hit with the news of a bug that slipped through its fingers, industry watchers say.

According to a blog posting by VMware CEO Paul Maritz, the bug was caused by a piece of code accidentally left enabled in the release version of Update 2. After the system clock passed Aug. 12, VMs that were powered off would not power up, suspended VMs would not come out of suspend mode, and VMs could not be migrated using VMotion.

“This certainly appears to be the most publicized bug for VMware so far, and I think it is damaging to VMware and virtualization as a whole. The hypervisor is the lowest software level on the server and if you have an issue like this, boom, all your infrastructure is down,” says Gary Chen, a senior analyst with Yankee Group. “Software will always have bugs, but a widespread issue like this that affects all VMs is really damaging, especially at this point in time where virtualization is starting to take off. VMware is going to have to fix this fast, provide an explanation, and outline what they will do to strengthen their QA in the future.”

Customers around the world have been affected and are sharing their experiences in VMware’s forum. One customer wrote: “We’ve just encountered a serious bug with our ESX cluster — serious enough that I thought I should post about it here as a prior warning for others running ESX 3.5 Update 2.” The customer goes on to describe the messages received from the VM, which in essence state that the product has expired.

John Sloan, senior analyst with Info-Tech Research Group in London, Ont., says it’s easy to overestimate the scope of the problem, since it only affects organizations running the latest version of ESX with the most up-to-date patches and most organizations are running older versions, but not its impact. The hypervisor layer is “a single point of vulnerability,” and a failure there will affect a lot of a company’s infrastructure, he said.

There’s increasing competition in the virtualization market with the release of Microsoft’s Hyper-V hypervisor. “All the players are under a lot of pressure to rush things out the door,” Sloan said. Stringent quality assurance is crucial.

“You want to make sure you can trust that product,” Sloan said. “The hypervisor is so critical.”

Microsoft has a history of rushing products, Sloan says (“You never trust Version 1”), but in the case of Hyper-V, the company was much more cautious, for example, holding off on VMotion-style live migration. It was a “remarkably restrained and unusual” approach for Microsoft.

According to Chen, the bug prevents customers from powering on a VM, but it doesn’t seem to affect VMs already running. A workaround that seems to be effective for now, Chen says, involves setting the date back, powering on the VM and then resetting the date. That may solve the problem in the moment, but Chen says customers may be wary of supporting a homogeneous virtual infrastructure going forward.
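To see why that workaround can work while running VMs stay unaffected, consider a minimal sketch of a date-gated check that runs only at power-on. This is an illustration, not VMware's code: the `VM` class, the `power_on` function and the error text are hypothetical, and the cutoff simply reuses the Aug. 12 date reported above, with the year 2008 assumed.

```python
from datetime import date

# Hypothetical cutoff: the Aug. 12 date reported in the article (year assumed).
EXPIRY = date(2008, 8, 12)


class VM:
    """Hypothetical stand-in for a virtual machine; not a VMware API."""

    def __init__(self, name):
        self.name = name
        self.running = False


def power_on(vm, today):
    """Start the VM unless the host date has passed the cutoff.

    The date is checked only here, at power-on. Nothing re-checks it
    afterwards, so a VM that is already running keeps running.
    """
    if today > EXPIRY:
        raise RuntimeError("product has expired")
    vm.running = True


# Workaround as described: with the host date set back, power-on succeeds,
# and the VM stays up after the clock is corrected.
vm = VM("web01")
power_on(vm, today=date(2008, 8, 11))
```

In this sketch, calling `power_on` with a date after the cutoff raises the error, which mirrors the "product has expired" messages customers reported. Where the real check lived and how it was implemented is not described in the reports.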

“As enterprises move towards a fully virtualized infrastructure, issues like this certainly will make people think about adopting multiple hypervisors and not putting all your eggs in one basket,” Chen says. “If you are 100 per cent virtualized using a single vendor, one software bug or an exploitable security flaw in the hypervisor could instantly freeze your entire infrastructure. These are the risks you take if you have a monoculture; we’ve seen it before with things like Windows, IE, etc.”

“VMware is steadfast in promoting that monoculture,” says Sloan, pointing out that while Microsoft’s and Citrix’s management software accommodates other hypervisors, VMware’s doesn’t.

For VMware customer Jake Seitz, enterprise architect at The First American Corp., in Santa Ana, Calif., this bug didn’t cause any problems in part because he had not upgraded his systems yet and in part because VMware contacted him Monday to alert him on how to avoid a problem by not powering down VMs with the update.

“We were proactively notified by VMware Monday. They told us these are the symptoms and what would happen if you powered down your virtual machines,” he explains. “They gave us the general prescription in terms of troubleshooting and things to avoid such as powering down.”

While this bug didn’t hit his environment, Seitz says it sounded like one of the worst to come out of VMware thus far.

“I would consider this a very severe bug, just by the nature of it. It sounded worse than previous bugs,” Seitz says. “Normally they are on top of their game so I am surprised that they missed this one, that it made it through.”
