University of Texas at Austin

Graduate School of Library and Information Science

 

LIS 385T: Information Architecture and Design

Instructor: Dr. Turnbull

 

Research Topic: Search Results

By Roger, Chia-hung Wei

Email Address: rogerwei@mail.utexas.edu

 

Assignment Due: October 22, 2002


1. What Search Results Are

Search results refer to the content that matches the user’s search query (Rosenfeld & Morville, 2002).  Search results consist of individual hits, each containing a link to a Web page along with descriptive information to help the user filter among the hits.

 

About 85% of web users surveyed use search engines and search services to find specific information (Kobayashi & Takeda, 2000).  Web search engines have become the largest and most frequently used tools for finding information on the web (Chowdhury & Chowdhury, 2001).  The results of a web search depend mainly on the search engine selected, because search engines differ in the way they choose, update, and index information on the web, as well as in the search and retrieval features they provide (Poulter, Tseng & Sargent, 1999; Glossbrenner & Glossbrenner, 1999).

 

The search engine creates the search results by checking its index and saving the matching entries (“Search Terms Glossary,” 1998).  The search engine then sorts them, usually based on its relevance algorithm, and generates HTML pages to display the search results.  The simplest results list consists of just the document title and a URL link to the document.  More helpful results include the meta description data or the first few lines of the document, the date modified, the file size, and a relevance ranking.

 

2. Brief History of Electronic Information Searching

 

3. Relevance & Ranking Process

Henninger (1999) describes the general searching and ranking process as follows: (1) match the search words to the words in the index; (2) process the search syntax; (3) construct a set of documents containing the words; (4) assign a weight to each word based on the number and position of the word in each document; (5) use the assigned weights to rank the documents based on their relevance to the search syntax; (6) display the search results as a list in ranked order.
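The six steps above can be illustrated with a minimal Python sketch.  It is not Henninger's algorithm or that of any real search engine: the toy documents, the function names, and in particular the weighting formula (occurrence count plus a small bonus for an early first position) are assumptions made only for illustration.

```python
from collections import defaultdict

# Toy collection; the document ids and texts are invented for illustration.
DOCS = {
    "d1": "search engines rank search results by relevance",
    "d2": "a meta search engine merges results from several engines",
    "d3": "library catalogs support browsing by subject",
}

def build_index(docs):
    """Map each word to {doc_id: [positions of the word in that document]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def search(query, index):
    """Return (doc_id, score) pairs, best match first."""
    # Steps 1-2: match query words against the index; the "syntax" here is
    # simply a list of words with no Boolean operators (an assumption).
    terms = query.lower().split()
    # Step 3: build the set of documents containing at least one query word.
    scores = defaultdict(float)
    for term in terms:
        for doc_id, positions in index.get(term, {}).items():
            # Step 4: weight by number of occurrences, plus a bonus when the
            # word first appears early in the document (assumed formula).
            scores[doc_id] += len(positions) + 1.0 / (1 + positions[0])
    # Steps 5-6: rank by total weight and return the list in ranked order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

INDEX = build_index(DOCS)
for doc_id, score in search("search results", INDEX):
    print(doc_id, round(score, 2))
```

In this toy collection, d1 ranks above d2 because it contains "search" twice and at the very start of the text, while d3 is not retrieved at all.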

 

When there are a large number of matches for a query, the search engines must rank the results by relevance score, sorting the results listing so that the document most likely to be relevant will appear first.  There are different algorithms to define relevance, and some are more productive than others.  What kind of algorithms work best for locating information on the web?  This has become a critical question given the heavy use of search on the web.

 

Relevance is an abstract measure of how well a document satisfies the user's information need (Weiss, 1997).  Ideally, search tools should be able to retrieve all of the relevant documents in the database for searchers.  Unfortunately, relevance is a subjective notion and difficult to quantify because only the searcher can actually define the relevance for the search results.

 

Because not all retrieved results are equally relevant to a query, a key capability of a search engine is to rank hits and place the most relevant hits high on the results list.  However, due to the competitive nature of the search engine industry, details of the retrieval and ranking algorithms are closely guarded as commercial secrets.  The relevance ranking of a document usually determines the order in which search results are presented to the user.  The known factors that determine whether or not the document is retrieved and the document’s ranking usually incorporate some combination of the following (Hock, 2001):

 

 

4. Re-ranking search results

Merged search results are produced by the re-ranking process of a meta search engine, which can be thought of as an umbrella site.  The user enters a query into the meta search engine, which transmits the query simultaneously to individual search engines.  After the results from the underlying search engines are received, the meta search engine combines them into a single ranked list using its own merging approach.  The results are then presented to the user.

 

However, results are not easy to merge well because of the various heterogeneities among the underlying search engines (Meng et al., 2002).  Normally, returned documents are ranked based on their local ranking scores or similarities.  Some of the underlying search engines make the local similarities of returned documents available to the user, but some do not.  Furthermore, the local similarities and the global similarities of the same document may be quite different.

 

Meng et al. (2002) indicate that existing result merging approaches can be classified into the following two types:

using additional information such as the quality of component databases. A variation converts local document ranks to similarities.
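The idea of converting local ranks to similarities before merging can be sketched in Python as follows.  This is only an illustrative assumption, not the method of Meng et al. (2002) or of any actual meta search engine: the linear rank-to-score conversion, the choice to sum scores for documents returned by more than one engine, and the engine names and URLs are all hypothetical.

```python
def rank_to_similarity(rank, list_length):
    """Convert a 1-based local rank into a score in (0, 1]; rank 1 maps to 1.0."""
    return (list_length - rank + 1) / list_length

def merge_results(results_by_engine):
    """Merge ranked URL lists (best first) from several engines into one list.

    results_by_engine: {engine_name: [url, ...]}
    """
    merged = {}
    for urls in results_by_engine.values():
        for rank, url in enumerate(urls, start=1):
            score = rank_to_similarity(rank, len(urls))
            # Assumption: a document returned by several engines gets the
            # sum of its converted scores, so agreement is rewarded.
            merged[url] = merged.get(url, 0.0) + score
    # Highest combined score first; ties are broken alphabetically by URL.
    return sorted(merged, key=lambda url: (-merged[url], url))

# Hypothetical engines and URLs, for illustration only.
example = {
    "engine_a": ["x.example/a", "x.example/b", "x.example/c"],
    "engine_b": ["x.example/b", "x.example/d"],
}
print(merge_results(example))
```

Here x.example/b is listed first in the merged list because both engines returned it, even though it was only ranked second by engine_a.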

 

5. Search Evaluation

The effectiveness of an information retrieval tool is usually measured by precision and recall.  Precision is the ratio of relevant documents retrieved to the total number of documents retrieved.  Recall is the proportion of all relevant documents in the database that the search retrieves.  Perfect precision and recall mean that every relevant document in the database is retrieved and that only relevant documents are included in the response.
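The two definitions above can be expressed as a short Python sketch.  The function name and the document ids are hypothetical, and the sketch assumes that the full set of relevant documents is known, which is rarely the case for a real database.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one search.

    retrieved: ids of the documents the search returned
    relevant:  ids of all relevant documents in the database
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Four documents retrieved, two of them relevant; five relevant documents exist.
print(precision_recall(["d1", "d2", "d3", "d4"],
                       ["d1", "d3", "d5", "d6", "d7"]))  # (0.5, 0.4)
```

In this example, half of the retrieved documents are relevant (precision 0.5), but the search found only two of the five relevant documents (recall 0.4).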

 

It is virtually meaningless to examine only a single measure of effectiveness (Yang, 1998).  This is because the two measures estimate the two conflicting goals of information retrieval.  Thus, measures must be examined in pairs. There are two issues concerning the concepts of relevance, recall, and precision as they are used for the evaluation of retrieval effectiveness.  First, it is not easy to define the meanings of relevancy and non-relevancy.  Second, it is difficult to measure recall since it is difficult to calculate the number of relevant documents that have not been retrieved in a database.

 

Precision and recall are inversely related (Baeza-Yates & Ribeiro-Neto, 1999).  Systems with simple search tools must give up precision in order to obtain high recall, and vice versa.  Neither is easy to measure in a very large database.  Recall, which requires identification of all relevant documents in the entire collection, is especially difficult to measure.  The problem lies in the need to estimate the number of relevant documents that have not been retrieved.  For traditional databases, recall represents the completeness of a search.  However, it is difficult to measure the coverage of databases for Web search engines.  There is little overlap in the relevant citations retrieved by different queries on the same topic.  Because the Web is a vast and unstructured database, recall in particular becomes meaningless there.

 

Given the dynamic nature of the web, it is difficult to create a large representative test collection and to assemble the resources needed for traditional evaluation methodologies.  Several researchers have proposed methods for web evaluation: Chowdhury & Soboroff (2002) present a method for automatically comparing search engines based on how they rank known item search results; Jansen (2000) investigates the effects of query structure on the results retrieved by Web-based information retrieval systems.

 

6. Search Results & Information Architecture

Information architecture is the science of organizing information to help people effectively meet their information needs (Hagedorn, 2000).  The process of developing an information architecture is based on an understanding of the content and the tools used to leverage that content (e.g., search, indexes).

 

Rosenfeld & Morville (2002), writing from an information architecture standpoint, point out that there are two main issues to consider when deciding how to display search results: which content components to display for each retrieved document, and how to list or group those results.  They also provide guidelines for these two issues.  Information architecture can therefore offer further considerations for configuring the way a search engine displays results.  Furthermore, when researchers cannot achieve a major breakthrough in retrieval algorithms, developments in information architecture can be used to help search tools display and organize the search results.  This indirectly increases users’ satisfaction in meeting their information needs.

 

7. Survey of Existing Search Tools on the Internet

Morville et al. (1999) claim that the current major search tools on the Internet can be categorized as virtual libraries/categories, Internet directories, and search engines.  The first two are primarily browsing tools, although many incorporate search features.  The third is primarily for specific searching.  Search engines can be categorized in different ways.  Two broad categories are search engines and meta search engines (Chowdhury & Chowdhury, 2001).  The former depend upon a robot to traverse the Web, following links between pages.  The latter are tools that allow users to conduct concurrent searches on more than one search engine.  Sullivan (2002), on the Search Engine Watch website, lists some top choices in various categories:

 

8. Interface at the Search Results Page

When users connect with a search engine, what they see is the interface.  The interface displays search results from input queries as a list of hits.  Van Duyne et al. (2002) provide several principles for arranging and categorizing search results:

In addition, Farkas & Farkas (2002) suggest several principles for the design of the results list page:

Rosenfeld & Morville (2002) offer a few variables to consider when designing the search interface:

 

Example 1: Google.com

How to Interpret your Search Results
Return to Google homepage. How to Interpret your Search Results. ... Hit enter
or click on the Google Search button for your list of relevant results. ...
www.google.com/help/interpret.html - 13k - Cached - Similar pages

 

Example 2: Hotbot.com

Aircraft, Airplane, Jets, Single Engine - Wings Online.com

Interactive ad system listing aircraft for sale, planes for sale, biplanes for sale, ultralight airplanes for sale, homebuilt aircraft for sale.
http://www.wingsonline.com/
See results from this site only.

From an interface design standpoint, Google provides users with a good search results interface (“Google: How to Interpret your Search Results,” 2002).  Google allows users to go to the document currently on the web or to go to a “cached” copy that Google stored when it retrieved the document.  When the server where the document resides is temporarily down, the cached copy becomes very useful for users.  Hotbot does not offer this function.

  

When users select the “Similar Pages” link for a particular result, Google automatically scouts the web for pages that are related to this result.  Hotbot does not offer this function.

 

Although Google has added advertising to its results pages, the advertisements do not include graphics like those found on most search engines.  This allows the results to load faster, which saves users time.  Hotbot, by contrast, includes graphical advertising on its results pages and always puts commercial links at the top, where users usually pay more attention.

 

The text that Google shows on the results page just after the title is not an abstract or description taken from the meta tags or the first sentence of the document.  Instead, Google shows the text surrounding the search term that caused the document to be retrieved.  Google highlights the retrieving word(s) by making them bold.  Hotbot does not make the search term bold.
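The kind of excerpt described above, in which the text around the matching term is shown with the term in bold, can be approximated by the following Python sketch.  It is not Google's implementation, which is not public; the function name, the fixed character window, and the use of HTML bold tags are assumptions, and a real system would also need to escape HTML in the page text.

```python
import re

def make_snippet(text, term, width=40):
    """Return the text around the first match of term, with matches in bold.

    width is the number of characters kept on each side of the match.
    Returns None if the term does not occur in the text.
    """
    match = re.search(re.escape(term), text, flags=re.IGNORECASE)
    if match is None:
        return None
    start = max(0, match.start() - width)
    end = min(len(text), match.end() + width)
    window = text[start:end]
    # Wrap every occurrence inside the window in <b> tags, keeping its case.
    highlighted = re.sub(
        re.escape(term),
        lambda m: "<b>" + m.group(0) + "</b>",
        window,
        flags=re.IGNORECASE,
    )
    # Mark truncation at either end with an ellipsis, as results pages do.
    prefix = "... " if start > 0 else ""
    suffix = " ..." if end < len(text) else ""
    return prefix + highlighted + suffix

page_text = "Hit enter or click on the Google Search button for your list of relevant results."
print(make_snippet(page_text, "relevant", width=30))
```

Because the excerpt is built around the matching term rather than taken from a stored description, the user can see at a glance why the document was retrieved.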

Although Hotbot offers a description of each document, users cannot tell why the document was retrieved.

  

 

One impressive feature of Google’s search pages is the bold and underlined search terms in the statistics bar.  By clicking on such terms, users are taken to useful sources on Dictionary.com, which provides extensive dictionary definitions, geographic descriptions, acronym identifications, etc.  Hotbot does not offer this function.

 

The “Statistics Bar” describes the search and indicates the number of results returned as well as the amount of time it took to complete the search.  Hotbot indicates the number of results only when few documents are retrieved.

 

 

References