What is almost certain is that this entry on search engines will soon be obsolete—so rapid and dynamic are the changes that affect this central technology and service on the Internet. Thus in the entry published in the last edition of this volume the name Google did not even appear, but just a few years later Google has become the leading search engine provider the world over. So what is a search engine?
Search engines are software systems that associate search words entered by a user looking for information with Web sites on the World Wide Web that contain the words of the query. To accomplish this linking, search engines must be backed by databases that hold the words Web sites use, stored as lists linked to the sites on which they appear. Search words may produce just a handful or a very large number of Web sites. The search word "supercalifragilisticexpialidocious" produced around 294,000 hits in 2006 on Google; the somewhat obscure and specialized word "nunciature" (the office or period of office of a nuncio) produced 82,400 hits; the word "nuncio" itself (an ambassador for the papacy) yielded 1,050,000 hits. The name Chu Yuan-chang, the 14th century founder of the Ming Dynasty in China, produced 725 hits. It is difficult to find stand-alone search words with a low number of hits; even misspellings bring rich results, because words are often misspelled on Web pages too and dutifully indexed by the search engines. This very wealth of hits makes it necessary for search engines to store additional information about every Web site so that the engine can present results in a rationally ranked order. Complex algorithms are used to rank hits. The principal method is to present first those sites that have been clicked on most frequently in the past; sites with more links to other sites also get preference, all other things being equal.
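The association between search words and the pages that contain them can be illustrated with a very small index. What follows is only a sketch under stated assumptions: the three pages and their addresses are invented, and the names build_index and search are hypothetical. Real engines use far larger and more elaborate structures, and this sketch does no ranking at all.

```python
from collections import defaultdict

# Hypothetical miniature "Web": page address -> page text (invented data).
PAGES = {
    "example.org/rome": "the nuncio is an ambassador for the papacy",
    "example.org/office": "the nunciature is the office of a nuncio",
    "example.org/ming": "chu yuan-chang founded the ming dynasty",
}

def build_index(pages):
    """Map every word to the set of pages that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Return the pages containing every word of the query, sorted by address."""
    words = query.lower().split()
    if not words:
        return []
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return sorted(hits)

index = build_index(PAGES)
print(search(index, "nuncio"))      # both pages that mention "nuncio"
print(search(index, "the office"))  # only the page containing both words
```

Because the index is built once and then consulted for each query, looking up a word does not require rereading every page, which is what makes answers in about a second plausible even for very large collections.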
A search engine, thus, requires its own internal logic and functionality, the software, and a database. But this database must first be built, maintained, updated, and grown as new sites are added to the Internet. Search engines, therefore, have a massive data acquisition function. In the early days the databases were built by people who scanned the web, followed links on Web sites, and indexed new pages they found. This technique is still in use with specialized Web sites and, until October 2002, was used by the world's second-ranking search engine, Yahoo. In the mid-2000s the databases of almost all search engines are built and maintained by search robots that seek out sites and capture their contents for indexing—unless the site itself prohibits this activity. The robots are themselves software programs. They are known as "crawlers" because they "crawl the Web" acquiring information. Alternatively, Web site owners can also register their sites with search engines—a technique used by commercial sites eager to be found.
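The crawling process described above, in which a robot follows links from page to page and skips sites that prohibit the activity, can be sketched as follows. This is an illustration only: the four pages, the robot name "sketch-bot," and the function crawl are invented, and the "Web" is held in memory rather than fetched over a network. The exclusion rules are read with RobotFileParser from Python's standard library, which interprets the robots.txt convention sites commonly use to restrict robots.

```python
from collections import deque
from urllib.robotparser import RobotFileParser

# Hypothetical in-memory "Web": address -> (page text, outgoing links).
WEB = {
    "http://example.org/": ("home page",
                            ["http://example.org/a", "http://example.org/private/x"]),
    "http://example.org/a": ("page a",
                             ["http://example.org/b", "http://example.org/"]),
    "http://example.org/b": ("page b", []),
    "http://example.org/private/x": ("secret", []),
}

# Exclusion rules as they might appear in the site's robots.txt file.
ROBOTS_TXT = ["User-agent: *", "Disallow: /private/"]

def crawl(start, web, robots_lines, agent="sketch-bot"):
    """Follow links breadth-first, capturing every page the rules allow."""
    rules = RobotFileParser()
    rules.parse(robots_lines)
    captured = {}
    queue = deque([start])
    seen = {start}
    while queue:
        url = queue.popleft()
        # Skip unknown addresses and pages the site has placed off limits.
        if url not in web or not rules.can_fetch(agent, url):
            continue
        text, links = web[url]
        captured[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return captured

captured = crawl("http://example.org/", WEB, ROBOTS_TXT)
print(sorted(captured))
```

The captured text would then be handed to the indexing step; the page under /private/ is never captured because the site's rules exclude it.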
Search engines are 1) technologies of searching, 2) databases in support of searching, and 3) services provided to users. Search engine owners can cover their costs by all three means. The technology they own can be licensed or deployed for others at a fee; the databases can be made available for money; and the services provided can be paid for through advertising. The most effective linking of the search function itself with advertising was pioneered by Google under the name "AdWords." Specific words are sold to advertisers. When a user runs a search containing those words, the advertisers' small ads are displayed with the search results. Advertisers pay a fee when the engine's users "click through" to the advertiser's own site. Other techniques make use of search words or phrases and display closely matching spot ads on the Web page.
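The mechanics of selling words and charging per click can be illustrated in a few lines. This is a simplified sketch and not a description of how Google's program is actually implemented: the advertisers, ads, fees, and the function names ads_for and record_click are all invented for the example.

```python
# Hypothetical data: search word -> (advertiser, ad text, fee per click in cents).
SOLD_WORDS = {
    "flowers": ("Acme Florist", "Fresh bouquets delivered", 25),
    "tires": ("Acme Tire", "Tires fitted while you wait", 40),
}

def ads_for(query):
    """Return the ads whose purchased word appears in the query."""
    return [SOLD_WORDS[w] for w in query.lower().split() if w in SOLD_WORDS]

def record_click(charges, ad):
    """Charge the advertiser only when a user clicks through."""
    advertiser, _text, fee_cents = ad
    charges[advertiser] = charges.get(advertiser, 0) + fee_cents
    return charges

charges = {}
shown = ads_for("cheap flowers online")  # displaying the ad costs nothing
record_click(charges, shown[0])          # the click is what gets billed
print(charges)
```

The point of the design is that the advertiser pays for visitors delivered rather than for ads merely displayed.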
ENGINES AND THE INTERNET
The Internet owes its dramatic growth to the development of search engines. The first such engine was Lycos, launched in mid-1994 with 54,000 documents. Using its crawler technology, it had expanded its database to 1.5 million documents by early 1995 and had 60 million by the end of 1996. Another claimant to the founding role was AltaVista, introduced in 1995 and still active on the Web. Until Lycos and AltaVista appeared, access to the Internet required advanced knowledge of Web addresses, and roaming the Internet involved following links from site to site as these referred to each other.
The value of the services provided by search engines becomes obvious with a few statistics. According to the Internet Systems Consortium (ISC), which conducts four surveys every year, in January 2006 around 395 million Internet hosts were in operation, each one hosting multiple sites, each site consisting of several Web pages on average. In 2006 even an extremely simple search, for the word "the," for instance, produced up to 17 billion hits on Google; AltaVista produced 7.4 billion, Ask.com 2.1 billion, and MSN 2.4 billion hits on the word. AltaVista uses Yahoo technology; Yahoo itself, asked to search for "the," simply shrugged off the labor and provided a single hit on a corporation with the THE acronym. Some estimates put the number of pages on the Internet at hundreds of billions, but as the ISC points out from a depth of survey experience, it is not possible to determine the actual size of the Internet. In any case, several hundred million hosts, never mind 17 billion pages, are already astronomically big numbers. The ability of search engines to provide access to such magnitudes in a matter of a second or so makes the Internet the useful phenomenon that it is. The ranking of hits, which actually reflects frequency of use by others, makes using very massive search results practical. Who, after all, can afford to review 60,000 hits, or even 700?
STRUCTURE OF THE INDUSTRY
Search Engine Watch, a Web journal concentrating on search engines and related matters, began operations in 1997, thus three years after the first search engine appeared. The company offers prizes, has public information as well as a membership service, and is an excellent source of information on developments in this field. Search Engine Watch (hereafter referred to as SEW) produces rankings and technical information about this industry. What follows has been gleaned largely from searchenginewatch.com.
SEW identifies Google, Yahoo, and Ask.com as the top search engines on the Internet. Ask.com may be more familiar to users as AskJeeves.com; the company simplified its name in 2006. All three of these leaders began with proprietary methods and technologies. Google's search engine is the most widely used by others under license. Yahoo, which began by using human indexers, began to shift its data acquisition processes to crawlers in October 2002 after a period of using Google technology. Ask.com's basic search engine was developed by Teoma, a company that it owns, but Ask also developed an expert-based indexing technique that, in the past, enabled it to serve more "human language" queries.
In a second tier SEW lists AllTheWeb.com (powered by Yahoo), AOL Search (powered by Google), and HotBot (using Google, Yahoo, and Teoma—currently merged with Ask).
Under a category SEW calls "Other Choices," it lists AltaVista (using Yahoo), Gigablast (a tiny engine with proprietary technology), LookSmart (compiled by people), Lycos (using HotBot and others), MSN (Microsoft's search engine, developing proprietary methods), Netscape (using Google), and Open Directory (using Google).
As is evident from this listing, the number of widely used proprietary technologies is much smaller than the number of search engines on offer; many engines on the Web rely on Google or Yahoo technology. But each of the search engines has its special features and add-ons.
Metacrawlers form a further category. Chris Sherman, writing for Search Engine Watch, defined it as follows: "Unlike search engines, metacrawlers don't crawl the Web themselves to build listings. Instead, they allow searches to be sent to several search engines all at once. The results are then blended together onto one page." Thus metacrawlers, also called metasearch engines, have carried the basic strategies of search engine companies a step further: they simply use other search engines, acting as intermediaries between the user and those engines. Sherman listed 21 such metacrawlers operating in 2005. Those that had won SEW awards included Dogpile, Vivisimo, Kartoo, Mamma, and Surfwax.
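How results from several engines might be blended onto one page can be sketched briefly. The source does not say which blending rule any particular metacrawler uses; the rule below, which credits each site with the reciprocal of its position in every list and then sorts by total credit, is simply one plausible assumption. The result lists and the function name blend are invented.

```python
# Hypothetical ranked results returned by two engines for the same query.
ENGINE_A = ["x.example", "y.example", "z.example"]
ENGINE_B = ["y.example", "w.example", "x.example"]

def blend(result_lists):
    """Merge ranked lists into one, without duplicates.

    Each site earns 1/position from every list in which it appears, so a
    site ranked highly by several engines rises to the top of the blend.
    """
    scores = {}
    for results in result_lists:
        for position, url in enumerate(results, start=1):
            scores[url] = scores.get(url, 0.0) + 1.0 / position
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda url: (-scores[url], url))

print(blend([ENGINE_A, ENGINE_B]))
```

Here y.example comes first because both engines rank it near the top, even though only one of them ranks it first.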
From the viewpoint of the small business hoping that its Web site is found as often as possible by searchers on the Web (traffic equals sales, after all), the chief issue regarding search engines is how to be found by them and—more importantly—how to be ranked high enough actually to be seen at all. Being 82nd in a list of 200 hits is almost equivalent to invisibility. On a typical Google search result, the entry will be on the 9th page—and rare the user who will examine nine pages of a search.
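The page on which a given hit lands follows from simple arithmetic. The short sketch below assumes ten results per page, a figure the text implies but does not state; the function name results_page is invented.

```python
def results_page(position, per_page=10):
    """Return the 1-based results page on which a hit at `position` appears."""
    if position < 1 or per_page < 1:
        raise ValueError("position and per_page must be positive")
    return (position - 1) // per_page + 1

print(results_page(82))  # 9: hits 81 through 90 share the ninth page
```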
Creating, promoting, and structuring a company's site for maximum visibility is a very complex subject and will require substantial homework or expert advice. A good beginning point is SEW's Web page entitled "Search Engine Submission Tips." It provides a systematic tutorial on the major aspects, including registering the site with search engines, which may be free or may have to be paid for; using advertising services such as Google's AdWords program; and internally structuring the Web site to present the most favorable features to Web crawlers. Rankings go up when a site offers multiple links to other sites, and also when many other sites point to one's own. Self-contained sites (one might say solitary or self-centered sites) tend to be ranked low. Search engines inherently favor a communal spirit of interconnectedness, the very essence of the Internet. The small business intent on maximizing its exposure should engage an experienced Web page design firm. Such organizations typically have the know-how to structure the Web page appropriately and also to guide the owner on additional steps to take.
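The effect of links on ranking can be illustrated with a toy link graph. This is a deliberately crude sketch: actual ranking formulas are proprietary and far more complex, and the rule used here (add a site's inbound and outbound link counts) is an assumption chosen only to mirror the two factors named above. The sites and the function names link_counts and rank_by_links are invented.

```python
# Hypothetical link graph: site -> the sites it links to.
LINKS = {
    "hub.example": ["shop.example", "news.example"],
    "news.example": ["shop.example", "hub.example"],
    "shop.example": ["hub.example"],
    "solo.example": [],
}

def link_counts(links):
    """Return {site: (inbound, outbound)} for every site in the graph."""
    inbound = {site: 0 for site in links}
    for targets in links.values():
        for target in targets:
            inbound[target] = inbound.get(target, 0) + 1
    return {site: (inbound[site], len(links.get(site, []))) for site in inbound}

def rank_by_links(links):
    """Order sites by inbound plus outbound links, most connected first."""
    counts = link_counts(links)
    return sorted(counts, key=lambda site: (-sum(counts[site]), site))

print(rank_by_links(LINKS))
```

The self-contained solo.example, which neither links out nor is linked to, falls to the bottom of the list.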
FRUSTRATIONS AND PLEASURES
A discussion of search engines would be incomplete without pointing to the frustrations and pleasures of using such services. Thus, for instance, it may be possible to find 700-some-odd pages on an ancient Chinese emperor, yet it is sometimes frustrating to seek a specific phrase, usually entered into the search engine between quotes, and get the standard "Your search - 'X' - did not match any documents." At the same time, it is often quite easy, remembering just a little snatch of a song's lyrics, to enter that truncated phrase and to get pages and pages of hits with the lyrics, and more: the music itself, played on the sound system to bring back the tune. This experience, whether in a serious business context or just for fun, is exhilarating. And things are moving so rapidly that by the time this text is out in print or visible on the Internet it may well be possible that search engines will provide genuinely helpful suggestions when the "did not match" message appears. Currently the advice is next to useless. But just wait a while.
"ISC Internet Domain Survey." Internet Systems Consortium. Available from http://www.isc.org/index.pl?/ops/ds/. Retrieved on 27 May 2006.
"Lycos: A brief history of the Lycos search engine." The Web Marketing Workshop. Available from http://www.websearchworkshop.co.uk/lycos_history.php. Retrieved on 27 May 2006.
"Search Engine Submission Tips." Search Engine Watch. Available from http://searchenginewatch.com/webmasters/. Retrieved on 25 May 2006.
SearchEngineWatch. Web Site. Available from http://searchenginewatch.com/. Retrieved on 26 May 2006.
Sherman, Chris. "Metacrawlers and Metasearch Engines." SearchEngineWatch. 23 March 2005. Available from http://searchenginewatch.com/links/article.php/2156241. Retrieved on 27 May 2006.
Sullivan, Danny. "Major Search Engines and Directories." SearchEngineWatch. 28 April 2004. Available from http://searchenginewatch.com/links/article.php/2156221. Retrieved on 26 May 2006.