Archiving the

If you’ve ever wondered what books Inc. was recommending five years ago, what the first Web cam was focused on or what was on more than 10 billion other Web pages dating back to 1996, you’re in luck. That’s exactly what’s offered by the Internet Archive’s free “Wayback Machine.”

The Wayback Machine, unveiled to the public last week, is an online archive of Web sites, as well as the public face of San Francisco-based Internet Archive, a non-profit organization dedicated to building a digital library for the public. A University of Maryland professor has been using the archive to index Hungarian texts, while researchers from Xerox Corp.’s Palo Alto Research Center (PARC) have used it to find out whether the dominance of English on the Web is killing off less widely used languages.

Thanks to the archive, we now know there are 1.5 million Hungarian-language pages on the Web. Xerox researchers also found that 201 other languages are represented on the Web and that they are thriving in the digital universe.

The sheer size of the Internet Archive almost guarantees there is something for everybody. Digitizing all the books in the U.S. Library of Congress would take up about 20 terabytes of storage space, a fraction of the more than 100 terabytes of information available through the Wayback Machine.

“Our opportunity is to not have to be selective,” Brewster Kahle, director of the Internet Archive, said at Wednesday night’s launch of the Wayback Machine at the University of California at Berkeley. “Our opportunity is not only to have it all, but to make it widely available.”

However, having it all isn’t easy – the Internet Archive is still growing by 12 terabytes a month, meaning that the material archived every two months contains more data than all the books in the Library of Congress. To keep up with this growth, the Internet Archive keeps all this information on a networked chain of 300 desktop PCs in the basement of a former military building in the Presidio of San Francisco.
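The comparison above can be checked with a few lines of arithmetic. This is a minimal sketch using only the figures quoted in the article; the constant names are illustrative, "more than 100 terabytes" is rounded down to 100, and the doubling estimate assumes the growth rate stays constant, which the article does not claim.

```python
# Figures quoted in the article, in terabytes.
LIBRARY_OF_CONGRESS_TB = 20   # all books in the Library of Congress, digitized
GROWTH_TB_PER_MONTH = 12      # reported monthly growth of the archive
ARCHIVE_SIZE_TB = 100         # "more than 100 terabytes", rounded down

# Material added in two months versus the Library of Congress estimate.
two_month_growth_tb = GROWTH_TB_PER_MONTH * 2
exceeds_library = two_month_growth_tb > LIBRARY_OF_CONGRESS_TB

# Assumption: growth stays constant. Months needed to add another 100 TB.
months_to_double = ARCHIVE_SIZE_TB / GROWTH_TB_PER_MONTH

print(two_month_growth_tb)         # 24
print(exceeds_library)             # True
print(round(months_to_double, 1))  # 8.3
```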

These aren’t your average desktops, though. Most of them have only a single 1.5GHz processor, but they also have 640MB of RAM and four 80GB hard drives, said Niall O’Driscoll, vice-president of engineering for Alexa Internet Corp., one of the companies behind the Internet Archive. Alexa, which Kahle also co-founded, was purchased by an online bookseller in 1999.

“There are basically 20 machines in the front line,” O’Driscoll said. “When you query it, it asks all 20 and one says ‘I have it,’ or ‘I know where to find it,’ and it redirects to the actual machine with the information.”
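The lookup O’Driscoll describes can be pictured with a short sketch. This is only an illustration of the idea, not the Internet Archive’s actual software: the class `FrontLineNode`, the function `route_query`, the machine names and the example URLs are all hypothetical, and the real system presumably queries its machines over the network rather than in a loop.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FrontLineNode:
    """Hypothetical front-line machine holding part of the URL index."""
    name: str
    # Maps an archived URL to the name of the storage machine that holds it.
    index: Dict[str, str] = field(default_factory=dict)

    def locate(self, url: str) -> Optional[str]:
        """Return the storage machine for `url`, or None if this node does not know."""
        return self.index.get(url)


def route_query(url: str, front_line: List[FrontLineNode]) -> Optional[str]:
    """Ask every front-line node, then return the first storage machine reported."""
    answers = [node.locate(url) for node in front_line]
    for target in answers:
        if target is not None:
            return target  # the request would be redirected to this machine
    return None  # no front-line node knows where the page is stored


# Build 20 front-line nodes; only one of them knows about the example URL.
front_line = [FrontLineNode(name=f"front-{i:02d}") for i in range(20)]
front_line[7].index["http://example.com/"] = "storage-142"

print(route_query("http://example.com/", front_line))         # storage-142
print(route_query("http://example.org/missing", front_line))  # None
```

The point of the design, as described, is that the front-line machines only need to know where a page lives; the data itself stays on the machine that stores it.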

Browsing the Web of yesteryear is a lot like looking through a huge collection of old newspapers without getting your fingers dirty: You never know what you’re going to find. One page Kahle came across was a 1996 White House statement from then-U.S. President Bill Clinton discussing aviation security following the crash of TWA Flight 800 off the coast of New York. “Shortly, I will submit to Congress a budget request for more than US$1 billion to expand our FBI anti-terrorism forces and to put the most sophisticated bomb detection machines in America’s airports,” Clinton said on Sept. 9, 1996.

“The overwhelming metaphor for the Internet is a library,” Stanford University law professor and author Lawrence Lessig said at the launch. Following that metaphor, the Internet Archive is “that quiet librarian working in a room making sure that you can have access to what you want access to,” he said.

During his presentation Wednesday night, Kahle also showed the audience other historical uses of the archive. These included watching the evolution of Microsoft Corp.’s privacy policy between 1996 and 2001, and viewing the original home page of a group of then-unknown Web designers called Heaven’s Gate, who later drew worldwide attention as a cult that committed group suicide in Southern California in 1997.

The Wayback Machine also houses four “special collections.” Three cover the terrorist attacks on Sept. 11, the history of the U.S. government on the Web and the vote controversy following the 2000 U.S. presidential election. The fourth, “Web Pioneers,” pays tribute to Web sites that “shaped the character of the Net in the early years.” Among those influential sites is the University of Cambridge Trojan Room Coffee Machine – the first Web cam, which traces its roots back to 1991.

“We’ve finally created a library for the world,” Lessig said.

The Internet Archive, in San Francisco, can be reached at

The Wayback Machine can be found at