IT World Canada

Netflix’s former chief cloud architect says the ‘fantasy’ of failover has to end

Rugged operation will be the big challenge for cloud computing over the next year, said a cloud computing guru in Vancouver last week.

Adrian Cockcroft, technology fellow at venture capital firm Battery Ventures, outlined some current trends in cloud computing in a talk at the Dun & Bradstreet Cloud Innovation Center on Thursday. Cockcroft was the chief architect of cloud services at Netflix, and helped to guide it through rapid growth, along with some nasty cloud outages along the way, so he knows the cloud inside out.

One trend he identified is a rise in SaaS investment as companies replace their own in-house systems with cloud-based services. “The typical on-premise SAP or Oracle set of services that a lot of companies run their business on is being replaced by a set of SaaS-based applications,” he said.

The rise of SaaS is happening in areas from marketing through to HR, but online services provided by web-scale companies are also targeting more entrenched industries, he suggested. Some people in the financial space are also moving their services into this realm.

“We’re starting to see a lot of banks, finance and insurance companies trying to get themselves into a much more agile space. They’re worried that they are getting left behind,” he warned.

There are really only two main players in the public cloud computing space: AWS and Microsoft’s Azure, said Cockcroft. He cited a Gartner Magic Quadrant that described the two as the only leaders – and which said AWS had a multi-year competitive advantage over Azure.

“Google has still got something to prove here. They still want to be relevant to the enterprise,” he added.

Netflix uses Amazon Web Services as its cloud provider. Several Amazon outages happened on Cockcroft’s watch, including a service degradation in October 2012 that took several large websites offline for hours. The popular video service stayed up, though.

“Most enterprises are more concerned about agility and the practical ability to move things around, rather than the actual keeping stuff up and running. There’s a lot more ability to distribute systems in the cloud because you can just fire up things all over the world,” Cockcroft said. Netflix uses Apache’s Cassandra for this purpose, but there are alternatives.

The bigger issue for many companies is making sure that their failover in the cloud actually works. Putting cloud failover mechanisms in place does not guarantee that they will work when they are needed.

“It’s like having a disaster recovery datacenter that you’ve never failed over to,” he said. “If they’ve never done the DR exercise, then it’s just a fantasy.”

Netflix proves that its systems are reliable by deliberately disrupting them at random. It does this using two programs: Chaos Monkey, which kills individual machines, and then Chaos Gorilla, which takes down larger parts of its infrastructure. Every quarter, it lets these programs loose on its infrastructure, to make sure that the systems fail over properly so that services keep running. That prepares it for real-world outages.
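
The idea of deliberately killing machines at random can be illustrated with a toy simulation. This is only a sketch of the principle, not Netflix's Chaos Monkey: the `Fleet` class, the `chaos_round` function and the instance names are all hypothetical, and the "service" here is simply a set of redundant instances that stays up as long as one of them is healthy.

```python
import random


class Fleet:
    """Toy model of a service running on several redundant instances (hypothetical)."""

    def __init__(self, instance_ids):
        self.healthy = set(instance_ids)

    def kill(self, instance_id):
        # Simulate an abrupt machine failure.
        self.healthy.discard(instance_id)

    def serve(self):
        # The service answers as long as at least one instance is healthy.
        return len(self.healthy) > 0


def chaos_round(fleet, rng):
    """Kill one randomly chosen healthy instance and report whether the service survived."""
    victim = rng.choice(sorted(fleet.healthy))
    fleet.kill(victim)
    return victim, fleet.serve()


if __name__ == "__main__":
    fleet = Fleet(["instance-a", "instance-b", "instance-c"])
    rng = random.Random()  # unseeded, so the victim differs from run to run
    victim, survived = chaos_round(fleet, rng)
    print(f"killed {victim}; service still up: {survived}")
```

In a real deployment the check after each kill would be an actual request against the running service, which is what turns a failover plan from an assumption into something that has been exercised.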

Such outages are common in cloud-based systems, and the uptime figures posted by cloud vendors can be misleading. For example, just over a year ago, Microsoft’s Office 365 service went down for nine consecutive hours. Yet the firm posted a 99.95% availability figure for that quarter. If a service is up for 99.95% of the time during a single quarter, then it can be down for roughly one hour and five minutes at most over that period.
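
The arithmetic behind that figure is easy to check. The sketch below assumes a 90-day quarter, which the vendor's figures do not specify, and the function names are illustrative only.

```python
def allowed_downtime_minutes(availability_pct: float, days: int = 90) -> float:
    """Maximum downtime, in minutes, consistent with an availability percentage.

    Assumes a 90-day quarter by default; real quarters run 90 to 92 days.
    """
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


def availability_pct(downtime_minutes: float, days: int = 90) -> float:
    """Availability percentage implied by a given amount of downtime."""
    total_minutes = days * 24 * 60
    return 100 * (1 - downtime_minutes / total_minutes)


if __name__ == "__main__":
    # Downtime budget at 99.95% availability over one quarter.
    print(f"{allowed_downtime_minutes(99.95):.1f} minutes")
    # Availability implied by a single nine-hour outage in the same quarter.
    print(f"{availability_pct(9 * 60):.2f}%")
```

Under that assumption the budget comes to about 64.8 minutes, while a nine-hour outage on its own would imply availability of about 99.58% if every minute of it counted as downtime.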

Some of the confusion here might lie in how uptime and downtime are defined. Microsoft has a specific designation called “service interruption/outage”, described as “slow, sluggish, or occasionally unresponsive for brief periods.” This is separate to another designation, “service interruption”, which means that users can’t access their email, documents, or presence information at all.

It isn’t clear whether the separate designation enables a service to be unresponsive for a short time without being technically down, as far as the firm is concerned, but it might explain the discrepancy in the uptime figures.

“AWS figure out a lot of availability things at scale. They have been very solid recently. A lot of the weak spots have been designed out in the last few years,” Cockcroft said, adding that Microsoft, by contrast, has had problems containing outages to a single region. “I think that Microsoft is still struggling a little bit with availability. It’s definitely a less mature system from that point of view.”

Microsoft declined to comment on this issue or to discuss how it categorizes downtime incidents.

Clearly, reliability will be an important topic for developers in the coming year as cloud-based systems continue to take hold. “Developers are concerned about being lean and rugged. There’s a lot of interest in that – extending the range of things that you have to understand in a developer environment,” Cockcroft said.

DevOps can play a part in that, creating stronger links between developers and operations staff that can make systems more adaptable, but security must be built in by design, he added.

This will be the next major challenge, now that infrastructure challenges in the cloud are “settling down”, Cockcroft said. Tools like Docker are removing much of the complexity from cloud deployment. He believes that Docker, which provides a container that enables companies to run cloud-based applications on any platform, will become a default way to deploy everything. “It’s turning into the standard building block,” he concluded.

 
