It is not often that retail stores unexpectedly lock their doors during business hours. More common, unfortunately, is the closing of on-line businesses due to equipment failure.
Companies running large Web sites face many challenges trying to keep their sites up and running 24/7. No one knows this better than Greg Weir, the Webmaster at Tucows in Toronto.
The site, which offers a mixture of editorial content, software, e-commerce services, music and a domain name registration service, has up to 100 million page views per month, Weir said. That’s a lot of traffic, and there would be a lot of unhappy customers if the site went down for even an hour.
A study done by Forrester Research Inc. in January found that companies such as Dell, Cisco and Intel, which make US$30 million to US$35 million a day in revenue, can lose up to US$1.5 million in revenue an hour when they are down. Other costs, which can’t be measured, include stock devaluation and a blow to reputation.
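The study’s per-hour figure lines up with simple arithmetic on its daily figure. A minimal check (the revenue numbers are the study’s; the calculation itself is illustrative):

```python
# Back-of-the-envelope check on the Forrester figures.
daily_revenue = 35_000_000          # upper end of the US$30-35M/day range
hourly_revenue = daily_revenue / 24
print(f"Revenue per hour: ${hourly_revenue:,.0f}")
# Roughly US$1.46 million, consistent with the study's
# "up to US$1.5 million an hour" loss figure.
```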
That’s why Tucows was careful when it came time to construct its Web site, taking into account everything that could go wrong.
And just about everything can go wrong, said Doug Hadden, the senior product manager for SiteAssure at Platform Computing Corp. in Markham, Ont. He went over some of the challenges faced by companies such as Tucows at a presentation at ISPCON in Toronto last month.
From a customer’s perspective, not being available is like being out of business, Hadden said.
Complex sites are multi-tiered, cluttered with heterogeneous devices and have a multitude of users in different time zones. And most Web companies are also dealing with limited budgets and a lack of IT expertise.
“One of the biggest issues is the fact that you may not have cross-platform expertise. It’s very difficult for you to find these mythical creatures that understand all your operating systems and components in your network,” Hadden said.
Although it’s sometimes possible to predict peak hours, a quick mention of your site in a chat room can unexpectedly drive up traffic. This means sites have to be prepared for hits to increase dramatically and without warning.
“In the old days, when you had a mainframe, you knew where the problem was if you had a performance issue. Today, with the idea of being able to plug in various components (and) various servers at various levels of your environment, you’ve added a whole tier of complexity that you didn’t have before,” Hadden said.
In order to deal with these myriad problems, companies need to take the time to design their systems properly, said Carl Howe, the e-business infrastructure research director at Cambridge, Mass.-based Forrester Research.
“You’ve got to design that stuff in, not just say, ‘Well, I want my uptime to be higher, I think I’ll buy this other product.’ That doesn’t get you there,” he said.
Tucows, for one, took the time.
The company, whose name initially stood for The Ultimate Collection of Winsock Software, has an extensive network of about 750 affiliates in more than 1,200 locations around the world.
Tucows’ content is distributed from core machines to dumb edge devices. Tucows owns the core boxes, but the affiliates, which include ISPs, telephone companies and cable companies, own the mirroring boxes. The affiliates primarily run Linux, although some also run various flavours of Unix, and a few have Windows NT boxes.
Tucows found it wasn’t possible to mandate the type of operating systems and software its affiliates chose to run.
“At one point, we asked our affiliates to run a particular version of Red Hat Linux, and we found that at that point a large number of them favoured a different distribution of Linux, or they liked Solaris, or they liked whatever,” Weir said.
The hardware is provided by the affiliates but the content is provided by Tucows through its core boxes. This content sits behind a firewall and is actually sourced from a database that Tucows refers to as the “playground.” That database is dumped on a daily basis to Tucows’ Web servers, which host an exact duplicate of the database content.
“That way we don’t have to worry about network traffic to the other machines,” Weir explained.
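The article doesn’t describe the dump mechanics, only that the playground database is copied daily to the Web servers. A minimal sketch of what such a nightly step might look like, assuming a MySQL-style dump pushed out with rsync (the tool choices, paths and hostnames here are all assumptions for illustration):

```python
# Hedged sketch of a daily "playground" dump-and-push step.
# The article only says the database is dumped daily to the Web
# servers; mysqldump/rsync and all paths here are assumed.
import subprocess

def dump_command(database, outfile):
    # mysqldump writes a full SQL copy of the content database;
    # --single-transaction avoids locking the live "playground".
    return ["mysqldump", "--single-transaction", database,
            f"--result-file={outfile}"]

def mirror_command(outfile, host):
    # rsync -a preserves timestamps, which matters because the
    # monitoring crawler checks file timestamps for freshness.
    return ["rsync", "-a", outfile, f"{host}:/var/www/playground.sql"]

def nightly_sync(database, outfile, hosts):
    subprocess.run(dump_command(database, outfile), check=True)
    for host in hosts:
        subprocess.run(mirror_command(outfile, host), check=True)
```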
A load-balancing machine connects to various servers that send out ad content. “Since the only content we have out there is on the Web, we only need to monitor port 80. And we need to monitor the content as well as the availability. We want to make sure that our affiliate sites have the most up-to-date content,” Weir said.
A Perl-based crawler monitors Web site availability and constantly connects to the affiliates to check if certain files are present. The time stamp on the files is also checked to make sure the mirroring is complete. A MySQL database is updated with site status.
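Tucows’ crawler is written in Perl, but the same check can be sketched in Python: probe port 80 on each affiliate, confirm a known file is present, read its timestamp, and record the result. The hostnames and sentinel file path below are hypothetical, and sqlite3 stands in for the MySQL status database so the sketch is self-contained:

```python
# Minimal sketch of a mirror-monitoring crawler. Hostnames, the
# sentinel path and the sqlite3 stand-in are all assumptions;
# Tucows' real crawler is Perl-based and writes to MySQL.
import http.client
import sqlite3

AFFILIATES = ["mirror1.affiliate.invalid", "mirror2.affiliate.invalid"]
SENTINEL = "/tucows/last_update.txt"  # hypothetical file written by each sync

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE status (host TEXT, ok INTEGER, last_modified TEXT)")

def check(host):
    """HEAD the sentinel file on port 80; return (ok, last_modified)."""
    try:
        conn = http.client.HTTPConnection(host, 80, timeout=10)
        conn.request("HEAD", SENTINEL)
        resp = conn.getresponse()
        # A missing file or a stale Last-Modified timestamp means
        # the mirror is absent or its sync is incomplete.
        return resp.status == 200, resp.getheader("Last-Modified")
    except OSError:
        return False, None

for host in AFFILIATES:
    ok, stamp = check(host)
    db.execute("INSERT INTO status VALUES (?, ?, ?)", (host, int(ok), stamp))
db.commit()
```

Recording both reachability and the timestamp matches the two things Weir says they monitor: availability and content freshness.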
“We have redundancy built into the network. By definition, our network is the ultimate redundancy because all of our content is on hundreds of Web servers around the world,” Weir said.
According to Forrester’s Howe, the most common cause of Web downtime is operational issues, not the attacks by crackers that are more commonly reported. Well-designed sites have multiple data centres and no single points of failure, he said.
“They run over 300 servers just to ensure that if any one server goes down, the others can pick up the slack. They run a tiered architecture whereby they’ve got separate servers for the Web site and then separate servers for transactions and separate servers for databases,” he continued.
Sites should also be designed for fail-overs so geographical issues, such as tornadoes in Kansas, don’t bring down the site.
But it’s also possible to go too far.
“Once you get past the three nines, 99.9 per cent availability, you’re reaching the point where you’re spending more on infrastructure than the additional revenue that you’re getting back. You get a cross-over point. If you take this to the absurd extreme – take the five nines, for example – you’re spending like $2,000 in infrastructure cost to get the next dollar’s worth of revenue from uptime. You’ve reached the point of diminishing returns,” Howe said.
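Howe’s point can be made concrete with some simple arithmetic: each extra nine cuts annual downtime tenfold, so it protects one-tenth the revenue of the nine before it, while the infrastructure needed to achieve it typically costs more, not less. An illustrative calculation using the US$1.5-million-an-hour loss rate cited earlier (the per-nine figures are arithmetic, not from the study):

```python
# Illustrative arithmetic behind the diminishing-returns argument.
# US$1.5M/hour is the loss rate cited earlier; the rest is derived.
HOURS_PER_YEAR = 24 * 365
LOSS_PER_HOUR = 1_500_000

for nines in (2, 3, 4, 5):
    availability = 1 - 10 ** -nines           # e.g. 3 nines -> 0.999
    downtime_h = HOURS_PER_YEAR * (1 - availability)
    print(f"{nines} nines: {downtime_h:7.2f} h/yr down, "
          f"~${downtime_h * LOSS_PER_HOUR:,.0f}/yr at risk")
```

At three nines that is under nine hours of downtime a year; the step from four to five nines protects only about US$1.2 million a year at this rate, which is where the cross-over Howe describes can appear.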
Tucows has hit what Howe would consider an ideal uptime – 99.9 per cent.