No one really knows how many Web pages are out there, though the number is probably in the billions. We do know that they are served from the four corners of the earth, criss-cross the Internet in an apparently haphazard fashion, and live and die in the blink of an eye.
But documenting it all – that is another story.
Imagine trying to keep track of the Library of Congress, but with one caveat: there are no rules about borrowing and returning items, and there has never been any record of what was there in the first place. Welcome to the nefarious underbelly of the Internet and the working environment of your favourite search engine.
Like beer, most people seem to have a favourite search engine and, like beer, they are not quite sure why they use the one they do. Maybe they found a few obscure items when all other engines failed and decided to stick with that app. Or maybe they hate the other search engines and use their chosen one by default. Regardless, searching the Web without an engine is the closest thing there is to looking for a needle in a haystack.
But take heart: the search engines of today are vastly superior to their predecessors and are surprisingly accurate. Yes, you may get 1.7 million results when you type in “sports cars”, but if you look at the first 50 you will most likely find what you are after.
Though search engines vary in their exact internal workings and proprietary algorithms, they all have three fundamental parts: the spider, the index and the search engine software itself.
If it weren’t for links there would be no search engines. Links work much like the theory of six degrees of separation: starting at one site, you can reach almost any other by jumping from link to link. It may take a while, but it can be done.
This is how a spider or crawler searches the Web. It might start at Microsoft’s site and, after a few hundred jumps, find itself at the official site of WWF wrestling. This method, though accurate, is slow and thus problematic for something as dynamic as the Web.
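For the technically curious, the basic mechanics can be sketched in a few dozen lines. The Python snippet below is not how Lycos, Google or anyone else actually spiders the Web; it only illustrates the idea described above: fetch a page, pull out its links, and queue the ones you have not seen yet. The seed URL and page limit are hypothetical.

```python
# A minimal, illustrative crawler -- not any engine's actual spider. It starts
# from a seed URL and follows links breadth-first, the jumping-from-link-to-
# link behaviour described above. Real spiders add politeness delays,
# robots.txt handling, duplicate detection at scale and much more.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collect the href target of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, max_pages=50):
    """Breadth-first crawl: fetch a page, queue its links, repeat."""
    seen = {seed}
    queue = deque([seed])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception:
            continue  # unreachable or unreadable pages are simply skipped
        pages[url] = html
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            absolute = urljoin(url, href)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages


if __name__ == "__main__":
    fetched = crawl("https://example.com", max_pages=10)  # hypothetical seed
    print(f"Fetched {len(fetched)} pages")
```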
“That is always a problem that every search engine has…since the cycle is about a month,” said Mark Dykeman, Web producer for Sympatico-Lycos Inc. in Toronto, explaining that a spider takes about a month to travel the entire Web.
“There is no search engine in the world that can spider the entire Web fast enough to keep every link up-to-date.”
When the Web is re-crawled, dead links are automatically removed.
“We make sure that we re-crawl the Web frequently enough so that even if there is a dead page, it won’t last very long,” said Craig Silverstein, director of technology at Mountain View, Calif.-based Google.
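Neither company spells out exactly how dead pages get dropped, but the principle is simple enough to sketch: on a re-crawl, a page that no longer answers is removed from the index. The snippet below is only a naive illustration of that idea, using a lightweight HEAD request; it is not any engine’s actual mechanism.

```python
# Illustrative only: one way a re-crawl could notice dead pages, by issuing a
# lightweight HEAD request and dropping URLs that no longer answer. The
# engines quoted here do not describe their actual mechanism.
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def is_alive(url, timeout=5):
    """Return True if the URL still responds without an error status."""
    try:
        urlopen(Request(url, method="HEAD"), timeout=timeout)
        return True
    except (HTTPError, URLError, OSError):
        return False


def prune_dead_links(index):
    """index: {url: document}. Keep only the URLs that still respond."""
    return {url: doc for url, doc in index.items() if is_alive(url)}


print(is_alive("https://example.com"))  # True as long as the page answers
```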
finding your Web site
“If the information is out there, we will find it,” said Sandro Berardocco, general manager with Alta Vista Canada in Edmonton.
He said Alta Vista makes its own life easier by using its proprietary technology called CyberFence.
“It basically puts a fence around a geographical area. We have…determined Canada is a geographical area, so basically it puts a fence around Canada.” This means that when the company sends crawlers out, they find all publicly accessible Canadian Web sites, whether dot-com or dot-ca. There are currently about 30 million pages belonging to around 200,000 Canadian URLs, Berardocco said.
He won’t say how the technology works, only that it is extremely accurate. Not surprisingly, it is a popular option for Alta Vista’s visitors: about 80 per cent of them choose the Canadian-only search, he said.
Sympatico-Lycos also has the Canadian-only option, which is used by about 50 per cent of its visitors, Dykeman said.
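Since CyberFence itself is a secret, the snippet below is only a stand-in showing what a geographic fence does in principle: keep the pages whose hosts look Canadian, judged here naively by a dot-ca domain or a hand-maintained list of Canadian dot-com hosts. The real technology is presumably far more sophisticated.

```python
# NOT CyberFence -- Alta Vista keeps that technology secret. This is only a
# naive illustration of what a geographic fence does: keep pages whose host
# has a Canadian top-level domain or appears on a hand-maintained list of
# Canadian dot-com hosts (the list here is hypothetical).
from urllib.parse import urlparse

KNOWN_CANADIAN_DOTCOMS = {"www.example-canadian-store.com"}  # hypothetical


def looks_canadian(url):
    """Crude test: dot-ca host, or a dot-com host known to be Canadian."""
    host = urlparse(url).hostname or ""
    return host.endswith(".ca") or host in KNOWN_CANADIAN_DOTCOMS


def canadian_only(urls):
    """Filter a crawl frontier or a result list down to Canadian sites."""
    return [url for url in urls if looks_canadian(url)]


print(canadian_only([
    "http://www.sympatico.ca/news",
    "http://www.example.com/page",
]))  # keeps only the dot-ca URL
```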
Once the Web has been crawled, the results are put into an enormous indexed database, which acts as a repository of all the Web sites and pages the spider has found. The size of each database varies but, for example, Google says you can search 1,326,920,000 Web pages with its engine.
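At its simplest, that index is an inverted file: for every word, a list of the pages that contain it. The toy version below captures the shape of the structure; real engine indexes also record word positions, term weights and much more. The sample pages are hypothetical.

```python
# A toy inverted index: the structure described above, mapping each word to
# the set of pages that contain it.
import re
from collections import defaultdict


def build_index(pages):
    """pages: {url: text}. Returns {word: set of urls containing that word}."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


pages = {  # hypothetical crawled pages
    "http://a.example/cars": "fast sports cars for sale",
    "http://b.example/history": "a short history of sports in Canada",
}
index = build_index(pages)
print(sorted(index["sports"]))  # both pages mention sports
print(sorted(index["cars"]))    # only the first page mentions cars
```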
The last part of the search engine is the software designed to retrieve results for your request. This is the closely guarded algorithmic technology. In fact, Google is confident enough in its retrieval technology that it offers an “I’m feeling lucky” button which takes you directly to the first site on the retrieval list.
“I think ‘I’m feeling lucky’ is great…[but] the fact is it is not used that much,” Silverstein admitted.
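Continuing the toy index above, the retrieval step can be sketched as finding the pages that contain every query word and handing back the best one first — which is all “I’m feeling lucky” does with the top of the ranked list. The ranking below is a mere placeholder (an alphabetical sort); the real relevance formulas are the secret part.

```python
# Sketch of the retrieval step: find the pages that contain every query word,
# then hand back the top one. The "ranking" here is only a placeholder.
index = {  # a tiny inverted index like the one built above (hypothetical)
    "sports": {"http://a.example/cars", "http://b.example/history"},
    "cars": {"http://a.example/cars"},
}


def search(index, query):
    """Return the pages containing all the query words, best match first."""
    words = query.lower().split()
    if not words:
        return []
    results = index.get(words[0], set())
    for word in words[1:]:
        results = results & index.get(word, set())
    return sorted(results)  # stand-in for a real relevance ranking


def feeling_lucky(index, query):
    """Jump straight to the top result, as the button does."""
    results = search(index, query)
    return results[0] if results else None


print(search(index, "sports cars"))    # only the page containing both words
print(feeling_lucky(index, "sports"))  # the first of the two matching pages
```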
internal workings of an engine
Google uses something it calls PageRank to decide how high up in the results a page is listed. According to Silverstein, PageRank is a number that describes the quality of a page. The technology grew out of research done at Stanford University in Stanford, Calif.
The more pages that link to you, the better. If two pages have the same number of links pointing to them, the one whose referring links themselves carry the higher PageRank will be placed higher in the results.
“As a result, if you search for American Airlines you are more likely to get the company home page than a site where someone talks about a flight they took on American,” Silverstein explained.
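The idea Silverstein describes can be written down as a short power iteration: every page starts with an equal score, and on each pass it hands its score out to the pages it links to, so a page linked to by highly ranked pages ends up highly ranked itself. The sketch below uses the 0.85 damping factor from the original Stanford paper and a hypothetical link graph; Google’s production formula is, of course, not public.

```python
# A small power-iteration version of the PageRank idea: a page's score depends
# on how many pages link to it and on how highly those linking pages score.
def pagerank(links, damping=0.85, iterations=50):
    """links: {page: [pages it links to]}. Returns {page: rank score}."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, targets in links.items():
            if not targets:
                continue  # dangling pages are ignored in this toy version
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


links = {  # hypothetical link graph: two pages point at the airline home page
    "aa.example": ["news.example", "blog.example"],
    "news.example": ["aa.example"],
    "blog.example": ["aa.example"],
}
scores = pagerank(links)
print(max(scores, key=scores.get))  # aa.example comes out on top
```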
Alta Vista relies on similar, but equally guarded, technology built on complex algorithms of its own, Berardocco said. He said creating an accurate title tag for your site is very important, as is the number of sites that link to yours.
“The more popular your site is, the higher the rankings come up,” he said.
And to move your site up the rankings?
“We provide a whole section for Web masters on designing your page right and how to get yourself up the rankings,” he explained.
The results show that search engines are doing a good job.
“We average about two page views per query,” said Tasha Irvine, lead program manager for search and reference at MetaCrawler in Seattle.
“We would start to worry if we saw our page views per query up in the three or four numbers. That is when you start getting suspicious.”