BBC tackles big data dilemma with smart IT structuring

BBC Online operates the website for the U.K.’s public service broadcaster. It’s one of the most-visited websites in the world, but obtains little of its revenue from traffic, so as demand for its services soars, it will become harder to pay the bills for bandwidth and web servers. However, the organization has found ways to cope with this problem. 

To keep the BBC independent of commercial interests and to make sure the public interest is served, the corporation is not permitted to carry advertising or sponsorship on U.K. media. Instead, it derives its income from a mandatory license fee: Every household in the U.K. that wants to watch or record live television pays £145.50 per year (US$232). Of that, £8.04 per license payer, or £199 million in total, went to BBC Online in the 2009-2010 financial year.

Web analytics firm Alexa puts the site in 44th place in its list of the top 500 sites ranked by traffic, and the traffic volumes and visitor numbers keep growing, according to the BBC’s Chief Technical Architect, Dirk-Willem van Gulik.

This leaves BBC Online with a dilemma: Increasing demand for its services is not matched by an increase in revenue, which is largely determined by the license fee. Since 2007, the website has generated some advertising revenue, but only from international visitors, who are not covered by the ban on advertising, and that has not been enough to close the gap between revenue and demand. To make matters worse, the BBC said in January that it will cut its online budget by 25 percent, reducing funding for online services by £34 million.

“For the BBC, broadcasting on the web is just like broadcasting on antennas, we have to do it to get our stuff out there but it is not our core business,” Van Gulik said in a telephone interview earlier this month. 

“The BBC has found ways to be incredibly efficient with its infrastructure. Where a company like Yahoo would use many tens of thousands of servers to serve just the U.K., the BBC would do that literally with a handful of them,” he said.

And that is where it gets “really fun and interesting,” said Van Gulik, one of the inventors of the Apache web server that is now used by 66 percent of the world’s biggest websites.

Because of its reliance on the license fee, the BBC cannot scale its online operations in the same way as companies such as Google or Yahoo.

“If they get twice as much traffic or ten times as much traffic they are all jumping in the streets because that means that they get ten times as much revenue because they have advertising and all sorts of other things,” he said. “Our income stays exactly the same, we don’t get a penny more. So when we get ten times as many users we have to figure out a way to do things ten times cheaper.”
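
The arithmetic behind that remark can be sketched in a few lines of Python. The budget figure is the 2009-2010 BBC Online funding cited earlier in this article; the audience size is a hypothetical number chosen only for illustration and does not come from the BBC.

```python
def cost_per_user(annual_budget_gbp: float, users: int) -> float:
    """Return the budget available per user when income is fixed."""
    return annual_budget_gbp / users


BUDGET_GBP = 199_000_000     # BBC Online funding, 2009-2010 (from the article)
BASELINE_USERS = 10_000_000  # hypothetical audience size, for illustration only

# Income stays the same while the audience grows tenfold.
baseline = cost_per_user(BUDGET_GBP, BASELINE_USERS)
after_growth = cost_per_user(BUDGET_GBP, BASELINE_USERS * 10)

# How many times cheaper each user has to be served after the growth.
required_saving = baseline / after_growth

print(f"Budget per user before growth: £{baseline:.2f}")
print(f"Budget per user after growth:  £{after_growth:.2f}")
print(f"Serving must get about {required_saving:.0f}x cheaper per user")
```

Whatever the real audience figure is, the ratio is the same: with a fixed budget, a tenfold rise in users leaves one tenth of the money per user.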

To complicate matters further, buying more server power is expensive.

To tackle this problem, the BBC created an in-house software engineering practice to write software that serves pages more efficiently and reduces the load on its servers.

“In the past three years we’ve moved to a dynamic three-tier stack, largely PHP, Java and MySQL, fronted by software load balancers, called the Platform,” Van Gulik explained.

“This Linux-based system is hosted at two data centers. It makes use of a lot of key-value stores, for example NoSQL products like CouchDB, clever use of message queues, lots of efficient automated build which is mostly based on Maven, automated test tools such as Hudson, simple fast caches such as Varnish, lots of SNMP and some solid Zenoss monitoring, novel log file handling using an internal system called teleport, and so on,” he said.
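
To give a rough idea of why a simple, fast cache in front of a dynamic stack reduces server load, here is a minimal Python sketch. It is not the BBC's code and does not use Varnish's configuration or API; the `TTLCache` class, the `render_page` function and the 30-second lifetime are all assumptions made up for this illustration.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Tiny in-memory cache that reuses each rendered page for ttl_seconds."""

    def __init__(
        self,
        render: Callable[[str], str],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._render = render
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}
        self.backend_calls = 0  # how often the dynamic tier was actually hit

    def get(self, path: str) -> str:
        now = self._clock()
        entry = self._store.get(path)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # cache hit: the backend is not touched
        # Cache miss or expired entry: render once and remember the result.
        self.backend_calls += 1
        body = self._render(path)
        self._store[path] = (now, body)
        return body


def render_page(path: str) -> str:
    """Hypothetical stand-in for the expensive dynamic tier."""
    return f"<html>page for {path}</html>"


cache = TTLCache(render_page, ttl_seconds=30.0)
for _ in range(1000):
    cache.get("/news")

print(f"1000 requests served with {cache.backend_calls} backend call(s)")
```

In this toy setup, repeated requests for the same page within the lifetime are answered from memory, so the dynamic tier does the expensive work once rather than once per visitor.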

The second thing the BBC did was develop servers in-house.

“When you are in the top 100 of the biggest websites in the world, you can’t buy your equipment off the shelf,” said Van Gulik. “It isn’t available commercially because it would have a market of only 100 customers. […] So that means that the really heavy lifting special stuff we actually have to construct in house and think about how to build that.” 

The next challenge for the BBC is digitizing the video archive it has gathered since 1927, now stored on 80 miles of shelving. That will require around 10 petabytes of storage. The BBC has already digitized older archives, stored on one-inch and two-inch tapes, and is now gradually processing the more modern formats.

The digitized data is stored on tape too, though, because tape consumes less power, and therefore costs less to run, than hard disk storage.

“Hard disks are nice but they take power and they break,” Van Gulik said.

Instead, he said, “We store it in very large tape robots the size of a small tennis field which is basically packed with tapes and lots of arms and robots moving around to move the tapes around.”

A hard disk containing one or two hours of high-definition video would cost between £30 and £40 a year in electricity alone. With overhead costs for the building and cooling, that amount would be even higher. If the data is stored on tapes in a tape robot, the cost for the same amount of data can be reduced to pennies, he said.
