With well over a billion indexable pages on the Internet, a lucky surfer may stumble upon the needle, but simply rooting around in the haystack is far more likely.
But companies that design search engines are determined to point people directly to the needle. Some use human editors to decide which results rank first; others use popularity as an indicator, placing the most-visited sites at the top; and still others – such as IBM – are writing algorithms to determine which sites are the most significant.
“Normally, when you do a search for something, you get zillions of answers back, and it doesn’t really give you a sense of what’s authoritative and not,” said Byron Dom, the manager of information management principles at IBM’s Almaden Research Centre in San Jose, Calif.
IBM researchers hypothesize that good hubs and good authorities point to each other. A hub is a site created by an organization or individual that pinpoints and links to good authorities on a given topic. A hub is like a restaurant critic who lists good restaurants; the authorities are the restaurants, Dom explained.
“What we have tried to do is tap into the endorsements people make when they put a link to other Web pages,” he said.
IBM’s search engine, code-named Clever, analyzes which Web pages link to which others. Every page has a hub score and an authority score. The hub score of a page is the sum of the authority scores of all the pages to which it links. And the authority score of a page is the sum of the hub scores of all the pages that link to it.
“Implicit in all of this is that most of the endorsements are made for positive reasons. Occasionally, somebody will say, ‘This is the worst page that I ever saw,’ and then make a link to it, but that’s a rare occurrence. And typically, it gets swamped out by the fact that most of the links are positive,” Dom said.
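The mutually reinforcing scoring Dom describes matches the published HITS algorithm. A minimal sketch of that iteration, using an invented toy link graph based on Dom's restaurant-critic analogy (the page names are illustrative, not from Clever):

```python
# Sketch of hub/authority scoring (the HITS algorithm).
# The toy link graph below is invented for illustration.

def hits(links, iterations=20):
    """links: dict mapping each page to the list of pages it links to."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking here.
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, ()))
                for p in pages}
        # Hub score: sum of the authority scores of pages linked to.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # Normalize so scores don't grow without bound across iterations.
        for scores in (auth, hub):
            norm = sum(v * v for v in scores.values()) ** 0.5
            for p in scores:
                scores[p] /= norm
    return hub, auth

# A critic page linking to several restaurants behaves as a hub;
# the restaurants it endorses accumulate authority.
links = {"critic": ["resto_a", "resto_b", "resto_c"],
         "blog":   ["resto_a", "resto_b"]}
hub, auth = hits(links)
```

Because the two scores feed each other, a page endorsed by strong hubs gains authority, which in turn strengthens the hubs that found it.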
Mountain View, Calif.-based Google Inc.’s search engine works on similar principles, although Google doesn’t incorporate the idea of hubs – its engine works with the premise that good authorities point to one another.
The more authoritative the pages are that link to any given site, the higher that site will be placed in Google’s search results. The more sites pointing to a given site, the more authoritative that site is considered to be.
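Google's published version of this recursive idea is the PageRank algorithm. A simplified sketch, with an invented toy graph and the standard damping factor as assumptions:

```python
# Simplified PageRank-style sketch of the "authorities endorse
# authorities" recursion. The damping factor and the toy link
# graph are illustrative assumptions, not Google's actual data.

def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page keeps a small baseline; the rest of its rank
        # is shared equally among the pages it links to.
        new = {p: (1 - damping) / n for p in pages}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new[t] += share
        rank = new
    return rank

# A page linked to by many pages (including well-linked ones)
# ends up with the highest rank.
links = {"a": ["popular"], "b": ["popular"],
         "c": ["popular", "a"], "popular": ["a"]}
rank = pagerank(links)
```

Note the circularity the article describes: a page's rank depends on the ranks of its linkers, which is why the computation is iterated until the scores settle.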
“Our users tell us that our algorithm delivers far more valuable results than other search engines out there,” said Cindy McCaffrey, director of corporate communications at Google.
At least one analyst agrees.
“If you use Google, in a lot of cases – it totally depends on the kind of search you want – you can get some pretty accurate results. Google really looks at the quality of results. With AltaVista, there isn’t any of this idea of authority, so you can have a lot of false hits,” said Kathleen Hall, an associate analyst for Giga Information Group in Cambridge, Mass.
But Google may not be the right search engine for all cases. “If you’re looking for something that’s more obscure, you might not find it on Google,” Hall added.
Timing is important
Other search engines, like Microsoft Corp.’s MSN Search, rely on human editors to ensure accurate results. Microsoft also considers factors such as the time of year when listing results, said BJ Riseland, the MSN product manager at Microsoft in Redmond, Wash.
Someone searching for “turkey” during the holiday season might be looking for turkey recipes, so during that time of year recipe sites are placed higher than sites on Turkey, the nation. But a quick link to results on the country is also provided.
But no matter how search engines go about ranking pages, chances are any one only covers a small portion of what’s available, said Steve Lawrence, a research scientist at NEC Research Institute in Princeton, N.J., who recently conducted a study of search engines along with fellow NEC scientist C. Lee Giles.
“It’s a common misperception that search engines index most of the Web; in fact they’re only indexing a small amount of what’s out there,” Lawrence said.
He found that even the best search engines only covered about 16 per cent of available pages. He estimates that there are currently more than 1.5 billion pages on the Internet. In mid-1999, when he did his study, he estimated there were about 800 million pages.
“At the moment, search engines are like phone books where most of the pages have been ripped out, with a bias towards (listing) the more popular people and companies – and they aren’t updated very often,” he said.
It can take up to six months for search engines to index new sites, and the Web is changing all the time.
“It’s not so much a case [of] search engines not increasing their databases over time. It’s more a case of the Web increasing in size faster than search engine databases. So they’re not able to keep up with all the documents out there.” So even as scientists, such as those at IBM, work to make search engines better, the number of pages on the Internet is growing exponentially, he said.