Finding stuff the Search Engines can't

By: Lee Ratzan  On: 14 Dec 2006  For: Computerworld (US online)

Just because a Web search engine can't find something doesn't mean it isn't there. You may be looking for info in all the wrong places.

The Deep Web is a vast information repository not always indexed by automated search engines but readily accessible to enlightened individuals.

The Shallow Web, also known as the Surface Web or Static Web, is a collection of Web sites indexed by automated search engines. A search engine bot or Web crawler follows URL links, indexes the content and then relays the results back to search engine central for consolidation and user query. Ideally, the process eventually scours the entire Web, subject to vendor time and storage constraints.
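
The crawl-and-index loop described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular search engine works: the `TOY_WEB` dictionary, its `.example` URLs and the `crawl` function are all hypothetical stand-ins for real pages and a real fetcher.

```python
from collections import deque

# Hypothetical toy Web: URL -> (page text, outgoing links). Not real sites.
TOY_WEB = {
    "a.example/home": ("welcome to the home page", ["a.example/news", "b.example/docs"]),
    "a.example/news": ("news about search engines", ["a.example/home"]),
    "b.example/docs": ("technical white papers", []),
    "c.example/db": ("records in a searchable database", []),  # nothing links here
}

def crawl(seed, web):
    """Follow links breadth-first from seed; return an index of word -> set of URLs."""
    index = {}
    seen = {seed}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        text, links = web[url]
        # Index the content of the page just visited.
        for word in text.split():
            index.setdefault(word, set()).add(url)
        # Queue every link that has not been seen yet.
        for link in links:
            if link in web and link not in seen:
                seen.add(link)
                queue.append(link)
    return index

index = crawl("a.example/home", TOY_WEB)
print(sorted(index["search"]))   # pages whose text contains "search"
print("database" in index)       # False: the unlinked page was never reached
```

The page at `c.example/db` exists in the toy Web, but because no crawled page links to it, none of its words ever enter the index.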

The crux of the process lies in the indexing. A bot does not report what it can't index. This was a minor issue when the early Web consisted primarily of static generic HTML code, but contemporary Web sites now contain multimedia, scripts and other forms of dynamic content.

The Deep Web consists of Web pages that search engines cannot or will not index. The popular term "Invisible Web" is actually a misnomer: the information is not invisible, it's just not indexed by bots. Depending on whom you ask, the Deep Web is five to 500 times as vast as the Shallow Web, thus making it an immense and extraordinary online resource. Do the math: If major search engines together index only 20% of the Web, then they miss 80% of the content.
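
The arithmetic is easy to check. The sketch below assumes, purely for illustration, that the whole Web is just the Shallow Web plus the Deep Web; the function names are made up for this example.

```python
def missed_fraction(indexed_fraction):
    """Share of the Web that goes unindexed when indexed_fraction is covered."""
    return 1.0 - indexed_fraction

def surface_share(deep_to_shallow_ratio):
    """Shallow Web's share of the total, assuming total = shallow + deep."""
    return 1.0 / (1.0 + deep_to_shallow_ratio)

print(f"{missed_fraction(0.20):.0%}")            # 80%
for ratio in (5, 500):
    print(ratio, f"{surface_share(ratio):.1%}")  # 16.7% for 5, 0.2% for 500
```

Under that assumption, a Deep Web five times the size of the Shallow Web leaves search engines with about one-sixth of the total; at 500 times, the indexed portion is a fraction of a percent.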

What makes it deep?

Search engines typically do not index the following types of Web sites:

-- Proprietary sites

-- Sites requiring a registration

-- Sites with scripts

-- Dynamic sites

-- Ephemeral sites

-- Sites blocked by local webmasters

-- Sites blocked by search engine policy

-- Sites with special formats

-- Searchable databases

Proprietary sites require a fee. Registration sites require a login or password. A bot can index script code (e.g., Flash, JavaScript), but it can't always ascertain what the script actually does. Some nasty script junkies have been known to trap bots within infinite loops.
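
One way a bot can get stuck is a script that generates a fresh link on every page, such as an endless "next page" chain. The sketch below is a hypothetical illustration: `trap_links` stands in for such a script, and `crawl_with_budget` shows one common defense, a cap on the number of pages fetched. Neither name comes from a real crawler.

```python
def trap_links(url):
    """Hypothetical script-generated page: each page links to a brand-new 'next' page."""
    page_number = int(url.rsplit("=", 1)[1])
    return [f"trap.example/calendar?page={page_number + 1}"]

def crawl_with_budget(seed, get_links, max_pages):
    """Follow links from seed, but stop once max_pages pages have been visited."""
    visited = []
    seen = set()
    frontier = [seed]
    while frontier and len(visited) < max_pages:
        url = frontier.pop()
        if url in seen:
            continue
        seen.add(url)
        visited.append(url)
        frontier.extend(get_links(url))
    return visited

visited = crawl_with_budget("trap.example/calendar?page=1", trap_links, max_pages=50)
print(len(visited))   # 50: the budget, not the site, ends the crawl
```

A "seen" set alone does not help here, because every URL in the chain is new; without the page budget the loop would never finish.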

Dynamic Web sites are created on demand and have no existence prior to the query and limited existence afterward (e.g., airline schedules). If you have ever noticed an interesting link on a news site but were unable to find it later in the day, then you have encountered an ephemeral Web site.

Webmasters can request that their sites not be indexed (the Robots Exclusion Protocol), and some search engines skip sites based on their own inscrutable corporate policies. Not long ago, search engines could not index PDF files, thus missing an enormous quantity of vendor white papers and technical reports, not to mention government documents. Special formats become less of an issue as index engines become smarter.
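
A webmaster's exclusion request usually takes the form of a robots.txt file, which a well-behaved bot consults before fetching a page. The sketch below uses Python's standard `urllib.robotparser` module; the robots.txt contents, the `ExampleBot` user agent and the `www.example.com` URLs are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: all bots are asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(url, agent="ExampleBot"):
    """Return True if the parsed rules permit this agent to fetch the URL."""
    return parser.can_fetch(agent, url)

print(allowed("https://www.example.com/index.html"))          # True
print(allowed("https://www.example.com/private/report.pdf"))  # False
```

The protocol is a request, not an access control: pages under `/private/` remain reachable to anyone with the URL, they are simply left out of a compliant bot's index.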


Lee Ratzan is a contributor to the International Data Group (IDG) News Service, which publishes global technology stories from bureaus around the world to more than 300 publications in more than 60 countries.
