legal.online
Column 19, September 1996
Copyright 1996 Robert J. Ambrogi
`Crawling' For Needles In The
Internet Haystack
By Robert J. Ambrogi
There is a lot of information out there in Cyberspace -- if
only you can find it.
Organization of the Internet is anarchic, to say the least.
The resource you need could as easily be on a major university's
Web site as on some high school kid's home page.
Not so long ago, the only way to find resources on the
Internet was to use Archie or Veronica -- not the cartoon
characters, but software that allowed you to search FTP archives
for specific file names and Gopher menus for specific titles.
Now, there are tools that allow you to search through
virtually the full text of everything on the World Wide Web for
specific words and concepts.
These tools use what are called "crawlers" or
"spiders." Think of crawlers as little Web-surfing
robots, gathering everything they find into an enormous database.
That database becomes a searchable index of almost everything
available on the Web.
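
To make the idea concrete, here is a minimal sketch in Python of how a crawler might follow links and build a searchable index. It is an illustration only, not how any of the services below actually work. The PAGES table is a made-up, in-memory stand-in for the Web, and the function names crawl and search are hypothetical.

```python
from collections import deque
import re

# Hypothetical in-memory "Web": URL -> (page text, outgoing links).
PAGES = {
    "http://a.example/": ("Internet law resources",
                          ["http://b.example/", "http://c.example/"]),
    "http://b.example/": ("Federal court opinions and law journals",
                          ["http://a.example/"]),
    "http://c.example/": ("High school home page", []),
    "http://d.example/": ("Unlinked page about law", []),
}

def words_of(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def crawl(seed, pages):
    """Follow links breadth-first from seed; return word -> set of URLs."""
    index = {}
    seen = {seed}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        text, links = pages[url]
        for word in words_of(text):
            index.setdefault(word, set()).add(url)
        for link in links:
            if link in pages and link not in seen:
                seen.add(link)
                queue.append(link)
    return index

def search(index, query):
    """Return the sorted URLs that contain every word in the query."""
    words = words_of(query)
    if not words:
        return []
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return sorted(hits)

index = crawl("http://a.example/", PAGES)
print(search(index, "law"))
```

Note that the page at d.example never turns up, even though it mentions "law," because no other page links to it. A crawler can index only what it can reach.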
What follows is a guide to these search tools. The list is not
exhaustive, but offers an introduction to crawlers, both general
and legal.
Crawlers
Among the best of the crawlers are these:
- AltaVista, http://www.altavista.digital.com.
AltaVista was the bully on the block when it was
introduced in December 1995 by Digital Equipment Corp.,
if for no other reason than the magnitude of its
database, which claimed to index the full text of some 30
million Web pages.
- HotBot, http://www.hotbot.com.
Bullies age, and young upstarts move in. HotBot has
replaced AltaVista as the most complete index to the Web.
Launched in May 1996 by Wired Magazine, it claims
to have indexed the full text of 54 million Web pages.
- Infoseek Ultra, http://ultra.infoseek.com.
Still in testing, this new search engine from Infoseek
Corp. claims already to have found more than 80 million
Web pages and to have indexed more than 50 million. It
also claims to be the only real-time index of the Web,
meaning that you are searching the Web as it is today,
not as it was a week or a month ago.
- Lycos, http://lycos.cs.cmu.edu.
Lycos is one of the oldest search engines, operating
since 1994. Although it too claims to have indexed more
than 50 million Web pages, it differs from others in that
it indexes only an abstract of the page rather than the
full text.
- Excite NetSearch, http://www.excite.com.
Launched late in 1995, Excite quickly became one of the
most popular search engines. It is also a directory of
Web sites organized by topic, and it includes reviews of
many sites, including several in the legal field, written
by its own staff of reviewers.
- Webcrawler, http://webcrawler.com.
This is America Online's search engine. It describes its
index as "comprehensive, yet selective," which
translates to a database of only a half million Web
pages. A fun feature is the ability to search backwards
from a site, finding which sites have links leading to
it. A recent tryout, however, yielded inaccurate results.
- Yahoo, http://www.yahoo.com.
Yahoo is not a crawler; it is a directory. But as
the oldest Web directory and most likely the best known,
it warrants mention. It differs from crawlers in that it
does not search for Web sites; it relies on users to
submit sites. As a result, it is far from comprehensive.
(In fact, it includes a link to AltaVista for more
exhaustive searches.) Its strength is its well-organized
and easy-to-use catalog of Web sites. You can search the
catalog by key words, or simply browse.
The `Ambrogi' Test
It's not scientific, but it works for me. Because my surname is
somewhat uncommon, I use it as my search-engine acid test -- the
more matches for "ambrogi," the more thorough the
database. A recent search produced these results:
- HotBot, 382 matches.
- Excite, 235 matches.
- Infoseek Ultra, 165 matches.
- AltaVista, 159 matches.
- Lycos, 37 matches.
- Webcrawler, 6 matches.
- Yahoo, no matches.
Meta-Searchers
Think of these as one-stop shopping. They allow you to use
several search engines from one location, and sometimes in one
search.
- All-in-One Search Page, http://www.albany.net/allinone.
Various forms-based search engines are combined here in a
consistent interface. You still must search each service
separately, albeit all from a single page.
- SavvySearch, http://rampal.cs.colostate.edu:2000.
SavvySearch is an experimental search system that
queries multiple search engines simultaneously. At
present, it queries 19 search engines, including
AltaVista, Yahoo, FTPsearch95, the Virtual Software
Library, Excite, Lycos, Deja News, and OKRA. Results are
grouped by search engine.
- Search.com, http://www.search.com.
This is a collection of search tools on a single page
using a single interface. You can choose from any of a
number of search engines, including those described
above, but you can search only one at a time. There is
also a subject index of search tools.
- Webtaxi, http://www.webtaxi.com.
Webtaxi offers an array of search options. Its
"database dispatcher" offers a pop-up list of
search engines by category, while its
"supersearch" allows you to use several search
engines simultaneously. Webtaxi opens a new frame for
each search engine's results, allowing the user to
continue searching directly from that site. You must have
a browser that supports frames to use this site.
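
For the technically curious, the simultaneous style of meta-search described above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the actual design of SavvySearch or Webtaxi. The two "engines" are hypothetical stubs that search small made-up catalogs, and the names make_engine, ENGINES and meta_search are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def make_engine(catalog):
    """Build a stub engine that matches the query against page descriptions."""
    def engine(query):
        q = query.lower()
        return [url for url, desc in catalog.items() if q in desc.lower()]
    return engine

# Hypothetical engines, each with its own tiny catalog of URL -> description.
ENGINES = {
    "EngineA": make_engine({
        "http://one.example/": "Maritime law digest",
        "http://two.example/": "Gardening tips",
    }),
    "EngineB": make_engine({
        "http://three.example/": "Law school directory",
    }),
}

def meta_search(query, engines):
    """Send one query to every engine at once; group results by engine."""
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in engines.items()}
        return {name: future.result() for name, future in futures.items()}

results = meta_search("law", ENGINES)
for name, urls in results.items():
    print(name, urls)
```

Grouping the results by engine, as SavvySearch is described as doing, keeps it clear which service found which page. Merging them into one ranked list would require comparing scores across engines, which this sketch does not attempt.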
Legal Searchers
The problem with crawlers is that they can return too much
information. A key-word search of the Web can produce thousands
of matches. Law crawlers index only law-related Web sites, making
your search more targeted and the matches more likely to be
relevant to your query.
- LawCrawler, http://www.lawcrawler.com.
This appears to be the most comprehensive of the
legal-specific crawlers. It allows users to limit their
searches to particular servers, such as specific
countries, states or U.S. government agencies. For
example, you could search all servers maintained by the
U.S. Department of Commerce. LawCrawler is part of
FindLaw (http://www.findlaw.com),
a directory of legal sites on the Internet.
- Meta-Index for U.S. Legal Research, http://gsulaw.gsu.edu/metaindex.
This page presents search forms for many U.S.
government law sources; each form contains sample search
criteria. From here, you can search for opinions of the
U.S. Supreme Court and all federal circuit courts. A
legislative section allows searching of the U.S. Code, as
well as of bills and the full text of the Congressional
Record. There are also search forms for federal
regulations, people in law, and other legal sources. The
site is provided by the Georgia State University College
of Law.
- WashLawWEB, http://lawlib.wuacc.edu.
WashLawWEB offers full-text searching of many
Internet legal resources, including federal and state
case law, federal statutory and administrative materials,
and law journals that publish on the net.
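
The kind of server-limited search that LawCrawler offers can be pictured as a filter on the host name in each address. The Python sketch below is one way to do it and is only an assumption about the general technique, not a description of LawCrawler's own software. The function name limit_to_domains and all of the addresses are hypothetical.

```python
from urllib.parse import urlparse

def limit_to_domains(urls, domains):
    """Keep URLs whose host is one of the domains or a subdomain of one."""
    kept = []
    for url in urls:
        host = (urlparse(url).hostname or "").lower()
        if any(host == d or host.endswith("." + d) for d in domains):
            kept.append(url)
    return kept

# Hypothetical search results from a general-purpose crawler.
results = [
    "http://www.commerce.example/reports.html",
    "http://library.lawschool.example/cases.html",
    "http://www.other.example/commerce.example/page.html",
]

print(limit_to_domains(results, ["commerce.example"]))
```

Matching on the host name rather than on the whole address matters. The third address contains the text "commerce.example" in its path but lives on a different server, so it is correctly left out.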
Other Finding Tools
- Deja News, http://www.dejanews.com.
This service allows you to search the text of messages
posted to Internet news groups. Its archive includes more
than 25,000 news groups and more than 53 million
articles.
- Reference.com, http://www.reference.com.
Here you can search more than 16,000 news groups and a
number of publicly accessible mailing lists.
Robert J. Ambrogi, a lawyer in Rockport, Mass., is editor
of legal.online,
a monthly newsletter about the Internet published by Legal
Communications Ltd., Philadelphia. He can be reached by e-mail at
rambrogi@legaline.com
or by phone at (978) 546-7898.