Description-based Retrieval on the World Wide Web

Robert MacGregor
USC/Information Sciences Institute
macgregor@isi.edu

Introduction

As the volume of information available on the Web increases, the amount of useful information increases, but finding what you want, and only what you want, becomes progressively more difficult. An information retrieval tool is evaluated on recall (finding everything related to your inquiry) and precision (finding only what is relevant). The current generation of Web query tools has much room for improvement. Here is our informal estimate of their performance:

     Technology                        Recall      Precision
     Hyper-links (Web surfing)         Very Low    Medium
     Category Browsers (e.g., YAHOO)   Low         Low-Medium
     Web Search Engines                Medium      Low
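
To make the two measures concrete, here is a minimal Python sketch of how recall and precision could be computed for a single query. The function name and the document identifiers are hypothetical, and the set-based definition is the standard one rather than anything specific to the tools rated above.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: identifiers the tool returned
    relevant:  identifiers that actually answer the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    # Guard against division by zero when either set is empty.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents returned, 3 documents are truly relevant,
# and 2 of the returned documents are among the relevant ones.
precision, recall = precision_recall(
    retrieved={"d1", "d2", "d3", "d4"},
    relevant={"d1", "d2", "d5"},
)
print(precision, recall)
```

In this example half of what was returned is relevant (precision 2/4), and two of the three relevant documents were found (recall 2/3).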

We conjecture that Web query tools using indexing schemes based on category hierarchies (taxonomies) will proliferate, resulting in very large taxonomies, and large numbers of independently constructed and managed taxonomies.

Currently, large taxonomies, such as the Library of Congress subject heading index, are constructed manually. As the average size of a Web taxonomy grows from hundreds of nodes to millions of nodes, manual construction of taxonomies will become impossible. Instead, automated and semi-automated techniques will be necessary. Classification of new objects into a taxonomy must become automated, and unbalanced hierarchies must be able to restructure themselves. Also, as taxonomies grow in size, it becomes necessary to make increasingly fine-grained distinctions between information objects if they are to remain distinguishable to a search engine. Below, we introduce the notion of "content descriptions", and describe how they facilitate the construction of very large taxonomies.

Content Descriptions

A description is an annotation that advertises the contents of an information object. Objects are retrieved by matching their descriptions against a user's query. A keyword list attached to a text document represents an informal style of description. Keyword lists are inherently limited in their ability to describe the contents of an information object. For example, it is ambiguous whether a list of keywords represents the conjunction or the disjunction of the topics mentioned in the list. Keyword lists provide no means for indicating the semantic relationships that hold between keywords. Because of this informality and lack of expressive power, keyword-based search exhibits relatively low rates of precision. Increasing the expressive power and the formality of the semantics of descriptions enables increasingly precise retrieval schemes. The use of formal semantics also opens the door to the possibility of automated classification schemes.
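
The conjunction-versus-disjunction ambiguity can be illustrated with a small Python sketch. The documents, their keyword sets, and the function names are invented for illustration; the point is only that the same keyword list retrieves different results under the two readings.

```python
# Hypothetical documents, each advertised by an unordered keyword list.
DOCS = {
    "doc1": {"diabetes", "children", "support"},
    "doc2": {"diabetes", "diet"},
    "doc3": {"children", "education"},
}


def match_all(keywords, docs):
    """Conjunctive reading: a document must carry every keyword."""
    return sorted(name for name, terms in docs.items() if keywords <= terms)


def match_any(keywords, docs):
    """Disjunctive reading: a document must carry at least one keyword."""
    return sorted(name for name, terms in docs.items() if keywords & terms)


query = {"diabetes", "children"}
print(match_all(query, DOCS))  # ['doc1']
print(match_any(query, DOCS))  # ['doc1', 'doc2', 'doc3']
```

Neither reading captures the relationship "children who have diabetes"; a keyword list has no place to state it.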

Suppose we wish to advertise an organization under the description "support groups for parents of diabetic children". Given the following fragment of a taxonomy

     Support group
         Support group for recovering substance abusers
         Parental support group
             Support group for parents of congenitally-ill children
             Support group for single mothers

our description should classify as a subtopic below "support groups for parents of congenitally-ill children". If each of these descriptions has a formal definition, then this classification can be performed automatically. Here is a formal version of the description "support groups for parents of diabetic children", phrased in an OSQL-like syntax:

     select g in Support-Group
     where forall m in g.members
           always (exists c in m.children where c.diabetic)
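
The automatic classification step can be sketched in Python as a subsumption test. This is a deliberately simplified illustration, not the classifier technology the paper refers to. The assumptions are: every description is implicitly restricted to Support-Group; a description carries at most two constraints (a member type and a condition that some child of each member has); and the `IS_A` table, the class `Description`, and all function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical background knowledge: each term maps to its more general term.
# Placing "diabetic" under "congenitally-ill" follows the example in the text.
IS_A = {
    "diabetic": "congenitally-ill",
    "single-mother": "parent",
}


def is_a(specific: Optional[str], general: str) -> bool:
    """True if `specific` equals `general` or lies below it in IS_A."""
    while specific is not None:
        if specific == general:
            return True
        specific = IS_A.get(specific)
    return False


@dataclass(frozen=True)
class Description:
    name: str
    member_type: Optional[str] = None      # every member is of this type
    child_condition: Optional[str] = None  # every member has a child with this condition


def subsumes(general: Description, specific: Description) -> bool:
    """True if every constraint stated by `general` is implied by `specific`."""
    if general.member_type is not None and not is_a(
            specific.member_type, general.member_type):
        return False
    if general.child_condition is not None and not is_a(
            specific.child_condition, general.child_condition):
        return False
    return True


def most_specific_subsumers(new, taxonomy):
    """Nodes that subsume `new` and do not subsume any other subsumer of `new`."""
    subsumers = [node for node in taxonomy if subsumes(node, new)]
    return [n for n in subsumers
            if not any(m is not n and subsumes(n, m) for m in subsumers)]


TAXONOMY = [
    Description("Support group"),
    Description("Support group for recovering substance abusers",
                member_type="recovering-substance-abuser"),
    Description("Parental support group", member_type="parent"),
    Description("Support group for parents of congenitally-ill children",
                member_type="parent", child_condition="congenitally-ill"),
    Description("Support group for single mothers", member_type="single-mother"),
]

new = Description("Support group for parents of diabetic children",
                  member_type="parent", child_condition="diabetic")

for parent in most_specific_subsumers(new, TAXONOMY):
    print(parent.name)
```

Under these assumptions the new description lands below "Support group for parents of congenitally-ill children", as the text predicts. A real description language would support far richer constraints than the two fields used here.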

Million-node Taxonomies

Formal descriptions permit one to draw arbitrarily fine distinctions between pairs of information items, and they permit automatic categorization; both capabilities will be needed to manage very large taxonomies. They also provide the representational framework needed to generate "virtual nodes" used to reduce fan-out. Information retrieval techniques that introduce attribute-value pairs partially meet the same goals as our descriptions.

The catch with formal descriptions is that we need to develop automatic or semi-automatic ways to synthesize them. Some possibilities:

  1. Get Web advertisers "into the habit" of including a formal description of their information (analogous to including keywords in a document).
  2. Generate descriptions from the structured portions of semi-structured objects.
  3. Develop content-interpreters that can scan a document (e.g., text or an image) and produce a formal description that summarizes it.

Searching across Hundreds of Taxonomies

Processing a query like "Find documents about support groups for parents of diabetic children" first requires a means for finding all taxonomies that might contain relevant information. Then, the query engine would need to work its way down each such taxonomy, looking for nodes that match the query. A new generation of tools will be needed to perform these searches; traditional database query tools solve only a portion of the search problem. Classifier technology from the field of knowledge representation (KR) is directly relevant, although it will need to be industrialized (more powerful indexing techniques, integration with relational database management systems (RDBMSs), parallel search algorithms).
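
The two-stage search described above (select candidate taxonomies, then descend each one) can be sketched in Python as follows. This is a toy illustration under strong assumptions: each node's description is reduced to a set of features, a node "matches" when all of its features appear in the query, and the taxonomies, labels, and function names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    features: frozenset
    children: list = field(default_factory=list)


def deepest_matches(node, query):
    """Labels of the deepest nodes on each branch whose features all occur in `query`."""
    if not node.features <= query:
        return []  # prune: nothing below a non-matching node is visited
    hits = []
    for child in node.children:
        hits.extend(deepest_matches(child, query))
    return hits or [node.label]


def search(taxonomies, query):
    """Map each taxonomy that has at least one match to its deepest matching nodes."""
    results = {}
    for name, root in taxonomies.items():
        hits = deepest_matches(root, query)
        if hits:
            results[name] = hits
    return results


# Three hypothetical, independently managed taxonomies.
TAXONOMIES = {
    "community": Node("Organization", frozenset(), [
        Node("Support group", frozenset({"support-group"}), [
            Node("Parental support group",
                 frozenset({"support-group", "parents"})),
        ]),
        Node("Sports club", frozenset({"sports"})),
    ]),
    "medical": Node("Disease", frozenset(), [
        Node("Diabetes", frozenset({"diabetes"}), [
            Node("Diabetes in children", frozenset({"diabetes", "children"})),
        ]),
        Node("Asthma", frozenset({"asthma"})),
    ]),
    "cooking": Node("Recipes", frozenset({"food"})),
}

query = frozenset({"support-group", "parents", "diabetes", "children"})
print(search(TAXONOMIES, query))
```

The "cooking" taxonomy is rejected at its root, while the other two are descended only along branches that still match. An industrial version would replace the recursive walk with indexing and parallel search, as noted above.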

The Babel Problem

If each taxonomy uses its own unique vocabulary, and if there is no means for aligning that vocabulary with those used in other taxonomies, then we have the Babel situation, where everyone is speaking a different language and no one can understand a vocabulary other than their own. The most direct means to solve this is to adopt an overarching framework that defines the vocabulary used by everyone in all taxonomies (e.g., we all adopt the Library of Congress scheme, or the SENSUS taxonomy, or the Cyc taxonomy as the foundation for our individual taxonomies). A more likely scenario is that many semi-standardized taxonomies will exist for different domains (cf. the MeSH taxonomy used in the domain of medicine). Searches that span multiple taxonomies will require some means for adjusting to each of the different vocabularies. Merging of smaller taxonomies to form larger ones will be commonplace, and a new set of tools will be needed for performing this type of integration. The merging task will be simpler and the results more dependable if a formal semantics underlies each of the merging taxonomies.
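
One simple way to adjust between vocabularies is to map each local term to a shared concept and translate through it. The Python sketch below assumes such mappings already exist; the vocabulary names, the terms, and the shared concept identifiers are all invented for illustration and are not taken from any real taxonomy.

```python
# Hypothetical alignment tables: local term -> shared concept identifier.
TO_SHARED = {
    "clinic_vocab": {
        "juvenile diabetes": "diabetes-in-children",
        "carer group": "support-group",
    },
    "community_vocab": {
        "childhood diabetes": "diabetes-in-children",
        "support circle": "support-group",
    },
}


def translate(term, source, target):
    """Translate `term` from the `source` vocabulary into the `target` vocabulary.

    Returns None when the term has no shared concept or the target
    vocabulary has no term for that concept.
    """
    concept = TO_SHARED[source].get(term)
    if concept is None:
        return None
    for local_term, local_concept in TO_SHARED[target].items():
        if local_concept == concept:
            return local_term
    return None


print(translate("juvenile diabetes", "clinic_vocab", "community_vocab"))
print(translate("insulin pump", "clinic_vocab", "community_vocab"))
```

The first call finds a counterpart through the shared concept; the second returns None because the term was never aligned. A one-to-one term table like this is the weakest form of alignment, which is one reason an underlying formal semantics would make merging more dependable.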

Conclusions

Description-based technology offers a strategy for developing self-organizing Web indices, which is a prerequisite to the construction of very large taxonomies. The degree of retrieval precision obtainable using descriptions depends on the expressiveness of the description language and the degree of formality of the description semantics. The field of KR offers several technologies that will assist development of a new generation of Web query tools.