CMSY129 Principles of the Internet
Week #3 - Search Engines
January 12 - January 18, 2002
There are millions of pages on the World Wide Web, with new pages added daily. Add up all the hyperlinks on those pages and the count will more than likely run into the billions. Most of those links lead absolutely nowhere, but you won't know for sure until you've wasted minutes, or even hours, waiting for pages to load.
Search engines can help you find what you're looking for, if you choose the right search engine for your needs and construct your search so that the pages that best match what you're looking for end up at the top of the list.
There are two types of search engines: category-based and indexed. Category-based sites such as Yahoo and Magellan have a staff of reviewers who organize the Internet into tree-style hierarchies. They're good at identifying general information, and group websites together under similar categories, such as Travel Resources, Real Estate Listings, and French Cooking. The results of your search will be a list of websites that cover the subject you're interested in. You can click on links to drill down to increasingly detailed divisions until you find the information you're looking for.
Indexed search engines such as Excite, Lycos, WebCrawler, and AltaVista use automated software programs, called web crawlers, spiders, or robots, to roam the Net and read the full contents of each site, not just the file names. These programs follow URLs across the Internet, analyzing millions of web pages and newsgroup postings and indexing all of the words they find. You enter terms in the search box, and the search engine returns links to pages that contain those terms. Indexed search engines find individual pages of a web site that match your search, even if the site itself has nothing to do with what you're looking for.
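To make the idea of "indexing all of the words" concrete, here is a minimal sketch of a word index in Python. It is only an illustration, not how any particular search engine is built: the page addresses and text in PAGES are made up, and build_index and lookup are hypothetical helper names.

```python
import re
from collections import defaultdict

# Hypothetical pages standing in for what a spider might have fetched.
PAGES = {
    "example.com/travel": "Hotels and restaurants in Sydney, Australia",
    "example.com/recipes": "French cooking: classic restaurant recipes",
    "example.com/space": "Voyager probes explore the solar system",
}

def build_index(pages):
    """Map each lowercase word to the set of pages that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index

INDEX = build_index(PAGES)

def lookup(term):
    """Return the sorted list of pages whose text contains the term."""
    return sorted(INDEX.get(term.lower(), set()))

if __name__ == "__main__":
    print(lookup("Voyager"))
    print(lookup("restaurants"))
```

Because the index is built once ahead of time, answering a search is a quick lookup of each word rather than a fresh read of every page. Note that this sketch treats "restaurant" and "restaurants" as different words.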
Several popular search engines combine both techniques on their home pages.
Category searches are ideal for broad, open-ended questions. For example, if you use Yahoo to plan a trip to Australia, it takes fewer than five clicks to reach lists of hotels, restaurants, and local destinations. But because the list of links is selective, you're less likely to find obscure sites. For all-encompassing searches, or to find answers to questions that aren't easily categorized, you're better off using an indexed search site.
Getting results with an indexed search engine takes practice. If the search terms are too broad, you'll end up with millions of hits in no particular order. It usually takes at least two search terms to return a useful list, and the best engines let you enter words or sites to ignore.
Rapid Growth of Sites on the Internet
In 1994, Matthew Gray, a webmaster at MIT, created a World Wide Web Wanderer robot to search out all the existing web sites. By November 1994 the robot had found all 870 web sites.
One year later (1995) there were 31,750 sites.
In 1996 there were 525,906 sites.
As of April 1999, there were 5,040,663 sites.
Source: Webtechniques magazine, July 1999.
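To get a feel for how fast that is, the short Python sketch below divides each count by the one reported before it. The figures are the ones quoted above; the variable and function names are made up for illustration, and the gaps between counts are not all one year long, so the results are ratios between reports rather than yearly rates.

```python
# Site counts as quoted above (label, number of sites).
SITE_COUNTS = [
    ("November 1994", 870),
    ("1995", 31_750),
    ("1996", 525_906),
    ("April 1999", 5_040_663),
]

def growth_factors(counts):
    """Return (earlier label, later label, later count / earlier count)."""
    factors = []
    for (old_label, old), (new_label, new) in zip(counts, counts[1:]):
        factors.append((old_label, new_label, new / old))
    return factors

if __name__ == "__main__":
    for old_label, new_label, factor in growth_factors(SITE_COUNTS):
        print(f"{old_label} to {new_label}: about {factor:.1f} times as many sites")
```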
Advanced Techniques
Boolean queries combine search terms using the logical operators AND, OR, and NOT to increase the likelihood that the information you're looking for will appear at the top of the results list. Every search engine uses a slightly different syntax, but the following techniques work on all the major search sites:
- use lowercase letters unless you want to restrict your search to proper names.
- the default operator is usually OR, which means if you enter two or more words the search engine will return any page that contains any of the words; use an ampersand or AND to find only pages that contain all the words you enter.
- use quotation marks to find pages that have a certain phrase (without quotes you'll get pages that contain any of the search words, even if they're in different paragraphs).
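The difference between OR, AND, and a quoted phrase can be sketched in a few lines of Python. This is a simplified illustration rather than any engine's real matching rules: the three page texts are invented, and search_or, search_and, and search_phrase are hypothetical names.

```python
import re

# Invented page texts, used only for illustration.
PAGES = {
    "a": "Voyager is a NASA probe exploring the solar system",
    "b": "Star Trek Voyager episode guide",
    "c": "NASA budget news; the solar panels on the probe system",
}

def words(text):
    """Return the set of lowercase words in a piece of text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def search_or(terms):
    """Pages containing ANY of the terms (the usual default)."""
    wanted = {t.lower() for t in terms}
    return sorted(p for p, text in PAGES.items() if words(text) & wanted)

def search_and(terms):
    """Pages containing ALL of the terms."""
    wanted = {t.lower() for t in terms}
    return sorted(p for p, text in PAGES.items() if wanted <= words(text))

def search_phrase(phrase):
    """Pages containing the words next to each other, in order."""
    needle = " ".join(phrase.lower().split())
    return sorted(p for p, text in PAGES.items()
                  if needle in " ".join(text.lower().split()))

if __name__ == "__main__":
    print(search_or(["voyager", "nasa"]))
    print(search_and(["voyager", "nasa"]))
    print(search_and(["solar", "system"]))
    print(search_phrase("solar system"))
```

Page c contains both "solar" and "system" but not side by side, so it matches the AND search and not the phrase search. The phrase check here is a plain text comparison, so it is cruder than what a real engine would do.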
Some search engines offer an advanced option that lets you narrow your search with precision. You can specify a site to search (with a URL), and tell the search engine to return only pages that were last updated within a specific range of dates. To filter out irrelevant sites, use the NOT operator (often expressed as a minus sign). On Excite, for example, search for Voyager AND NASA AND NOT Trek to find pages about real-life explorations of the solar system, without being distracted by hits for Star Trek fans. On AltaVista, the expression Voyager+NASA-Trek would produce similar results; add +host:nasa.gov to show only official pages.
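The Voyager example can be sketched in Python to show what the NOT operator does to a list of results, and how the same request is spelled in the two styles quoted above. The function names are hypothetical, the sample sentences are invented, and the two query formats simply copy the Excite and AltaVista examples in this section; check each engine's help page for its actual rules.

```python
import re

def matches(text, required, excluded=()):
    """True if text has every required word and none of the excluded ones."""
    found = set(re.findall(r"[a-z0-9]+", text.lower()))
    return (all(w.lower() in found for w in required)
            and not any(w.lower() in found for w in excluded))

def excite_style(required, excluded=()):
    """Spell the query with words, as in the Excite example above."""
    return " AND ".join(list(required) + ["NOT " + w for w in excluded])

def altavista_style(required, excluded=()):
    """Spell the query with plus and minus signs, as in the AltaVista example."""
    return "+".join(required) + "".join("-" + w for w in excluded)

if __name__ == "__main__":
    required, excluded = ["Voyager", "NASA"], ["Trek"]
    print(excite_style(required, excluded))
    print(altavista_style(required, excluded))
    print(matches("Voyager 2 is a NASA probe", required, excluded))
    print(matches("Star Trek Voyager and NASA trivia", required, excluded))
```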
Search Engine Comparisons
For tips specific to individual search engines, click here.
Search Engine Stats
Search engines don't actually search the entire web. The smaller indexes are biased towards more popular pages. If what you're looking for is in most of the engines, it may be harder to find in a larger engine.
Who Searches What - % of web searched by:
HotBot: 34%
AltaVista: 28%
Northern Light: 20%
Excite: 14%
InfoSeek: 10%
Lycos: 3%
Stats above are from CIO Web Business, Section 2, June 1, 1998.
What are people searching for?
Here are the most common key words searched:
http://www.searchenginewatch.com/facts/searches.html
Spy on people surfing in real time:
http://www.metaspy.com/
Reading: URLS with information on this subject
Self Test: Take the quiz for this unit
Assignments:
- Search Engines (100 points)
Do a comparison of five sites retrieved by different search engines, using the same search terms, and submit via this online form.
- ISP Comparison (50 points)
Compare the services and prices of three different internet service providers (ISPs) via this online form.
Click this link to review the grading system for this course.
Deadline: Plan to turn this in no later than January 18, 2002.
Email me if you have questions about this assignment.