The Web as we know it, the blogs we read, the information we search for on Google, the updates we post on Facebook and Twitter, is just the tip of the information iceberg. Hiding underneath, in the dark recesses where even search engine spiders fear to crawl, is the deep Web.
OK, the deep Web isn’t actually frightening. It’s just that, unlike the “surface Web,” the deep Web consists of pages and sites that search engines cannot index in the typical fashion because these pages are
- Dynamically created on the fly by user queries and forms that access deep databases
- Password-protected, whether on a private site or a subscription-only site
- “Hidden” because no other page links to them
- Not meant to be found because the Web developer added code that disallows search engines from indexing them
- Too new to have been indexed, particularly in the lightning-fast world of social media
- Embedded in multimedia file types not accessible to search engine crawlers
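The fourth item on the list, pages a developer has deliberately walled off, usually comes down to a robots.txt file. The sketch below uses Python's standard `urllib.robotparser` to show how a well-behaved crawler honors such rules; the rules and URLs here are hypothetical examples, not taken from any real site.

```python
# Minimal sketch: how a crawler decides whether it may fetch a page,
# based on a site's robots.txt rules (hypothetical rules shown here).
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /private/
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A well-behaved crawler skips anything the rules disallow, so these
# pages never make it into the search engine's index.
print(parser.can_fetch("*", "https://example.com/private/report.html"))  # False
print(parser.can_fetch("*", "https://example.com/public/about.html"))    # True
```

Pages blocked this way stay in the deep Web not because they are hard to reach, but because crawlers agree not to look.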
Researchers have estimated that the Web holds some 91,000 terabytes of information, of which only 167 terabytes sit on the surface. Much of this below-the-surface data is in databases. These databases can range from publication archives, catalogs, and image archives to sites we use every day, such as airline and stock market Web sites.
Indexing the Deep Web
While indexing the deep Web has its controversies (some database owners don’t want nonsubscribers to access content), there are plenty of companies forging into this space. Google explores the deep Web using HTML forms. If it finds a “high-quality” site with forms that don’t require user information (i.e., a login), its computers will perform a small number of queries on that form using words found on the site. If the resulting pages are “valid, interesting, and include content not in [Google’s] index,” they will be added to its main index.
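The form-probing idea described above can be sketched in a few lines. This is a toy illustration, not Google's actual system: it finds a simple GET form in a page and builds probe URLs by plugging in keywords drawn from the site's own text. All names and URLs are hypothetical.

```python
# Toy sketch of "surfacing" deep-Web content: locate a GET form, then
# generate query URLs from keywords found on the site (hypothetical data).
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin

class FormFinder(HTMLParser):
    """Record the action of a GET form and the name of its text input."""
    def __init__(self):
        super().__init__()
        self.action = None
        self.text_field = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and attrs.get("method", "get").lower() == "get":
            self.action = attrs.get("action")
        elif tag == "input" and attrs.get("type", "text") == "text":
            self.text_field = attrs.get("name")

page = '<form method="get" action="/search"><input type="text" name="q"></form>'
finder = FormFinder()
finder.feed(page)

# Probe the form with words drawn from the site's own pages; each probe
# URL may surface a dynamically generated page worth indexing.
keywords = ["archives", "catalog", "images"]
probes = [urljoin("https://example.com/", finder.action) + "?" +
          urlencode({finder.text_field: kw})
          for kw in keywords]
print(probes[0])  # https://example.com/search?q=archives
```

The resulting pages would then be fetched and, if they hold new content, added to the index, which is the gist of what the article describes Google doing at scale.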
Other sites, however, are even further along. On CompletePlanet.com, you can search or browse more than 70,000 databases and specialty search engines. It’s powered by BrightPlanet, a leading deep Web search company that counts the federal government as a client.
Deep Web Technologies is another company at the forefront of deep Web search, building federated search solutions for libraries and government and enterprise customers. Federated search takes a query and transmits it to several databases or sources and then merges the collected results, presenting them in an easy-to-digest fashion for the end user. Science.gov, Mednar.com, Biznar.com, and Scitopia.org are a few of the sites the company powers on science, medical, and business topics.
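The federated-search pattern described above, one query fanned out to many sources and the results merged, can be sketched briefly. The sources below are stand-in functions with made-up results, not real APIs.

```python
# Minimal sketch of federated search: send one query to several sources,
# then merge and rank the combined results (stand-in sources shown here).
def search_source_a(query):
    return [{"title": f"{query} overview", "score": 0.9, "source": "A"}]

def search_source_b(query):
    return [{"title": f"{query} dataset", "score": 0.7, "source": "B"},
            {"title": f"{query} review", "score": 0.8, "source": "B"}]

def federated_search(query, sources):
    # Fan the query out to every source, then merge into a single list
    # ranked by score so the end user sees one easy-to-digest result set.
    results = []
    for source in sources:
        results.extend(source(query))
    return sorted(results, key=lambda r: r["score"], reverse=True)

hits = federated_search("solar energy", [search_source_a, search_source_b])
print([h["title"] for h in hits])
# ['solar energy overview', 'solar energy review', 'solar energy dataset']
```

A production system would query the sources concurrently and deduplicate overlapping results, but the merge-and-rank step is the heart of the technique.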
DeepDyve.com takes the search indexing technology used in the Human Genome Project to index long passages by matching character patterns rather than interpreting semantics, allowing it to handle extremely long and complicated queries. The company, which has Apple cofounder Steve Wozniak on its advisory board, is currently partnered with about 30,000 journals and industry sources.
Accessing the Deep Web
These deep Web search engines are just one way to access the deep Web for research purposes. There are subscription-based vertical search engines such as Westlaw.com and LexisNexis. And there are smaller deep Web search engines, often housed at universities, such as OAIster.org, which has 1,100-plus contributing resources such as the Jet Propulsion Laboratory, the Public Broadcasting Service, the Library of Congress, and Infomine, a librarian-built tool for scholarly research.
In addition to using these tools, you can search for relevant databases using your favorite “surface” search engine and create your own list of deep Web sites. Another great resource is your local public or university library. Not only can libraries offer their patrons access to databases you might otherwise have to pay for, but research librarians, for whom this stuff is like catnip, can point you in the direction of the deepest of deep Web resources.
Lastly, check out this list of 99 deep Web resources to get started on your exploration of the invisible Web.