- Robert MacGregor
- USC/Information Sciences Institute
- macgregor@isi.edu
As the volume of information available on the Web increases, the
amount of useful information increases, but finding what you want, and
only what you want, becomes progressively more difficult. The
performance of an information retrieval tool is evaluated in terms of
recall (finding everything related to your inquiry) and
precision (finding only what is relevant). The current
generation of Web query tools has much room for improvement. Here is
our informal estimate of their performance:
Technology                        Recall     Precision
Hyper-links (Web surfing)         Very Low   Medium
Category Browsers (e.g., YAHOO)   Low        Low-Medium
Web Search Engines                Medium     Low
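The two measures defined above can be made concrete with a small calculation. The following is a minimal sketch, not part of the original estimate: the function name and the page identifiers are hypothetical, and the numbers are invented purely to illustrate the arithmetic.

```python
def recall_and_precision(retrieved, relevant):
    """Return (recall, precision) for a single query.

    retrieved: set of item ids the tool returned
    relevant:  set of item ids that actually answer the inquiry
    """
    hits = retrieved & relevant
    # Recall: fraction of the relevant items that were found.
    recall = len(hits) / len(relevant) if relevant else 1.0
    # Precision: fraction of the returned items that are relevant.
    precision = len(hits) / len(retrieved) if retrieved else 1.0
    return recall, precision


# Hypothetical query: 4 relevant pages exist; the tool returns 5 pages,
# 2 of which are relevant.
relevant = {"p1", "p2", "p3", "p4"}
retrieved = {"p1", "p2", "x1", "x2", "x3"}
recall, precision = recall_and_precision(retrieved, relevant)
print(recall, precision)  # 0.5 0.4
```

A tool can trade one measure for the other: returning more pages tends to raise recall and lower precision.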
We conjecture that Web query tools using indexing schemes based on
category hierarchies (taxonomies) will proliferate, resulting in very
large taxonomies, and large numbers of independently constructed and
managed taxonomies.

Currently, large taxonomies, such as the Library of Congress subject
heading index, are constructed manually. As the average size of a Web
taxonomy grows from hundreds of nodes to millions of nodes, manual
construction of taxonomies will become impossible. Instead, automated
and semi-automated techniques will be necessary. Classification of
new objects into a taxonomy must become automated, and unbalanced
hierarchies must be able to restructure themselves. Also, as
taxonomies grow in size, it becomes necessary to make increasingly
fine-grained distinctions between information objects if they are to
remain distinguishable to a search engine. Below, we introduce the
notion of "content descriptions", and describe how they facilitate the
construction of very large taxonomies.
A description is an annotation that advertises the contents of
an information object. Objects are retrieved by matching their
descriptions against a user's query. A keyword list attached to a
text document represents an informal style of description. Keyword
lists are inherently limited in their ability to describe the contents
of an information object. For example, it is ambiguous whether a list
of keywords represents the conjunction or the disjunction of the
topics mentioned in the list. Keyword lists provide no means for
indicating the semantic relationships that hold between keywords.
Because of this informality and lack of expressive power,
keyword-based search exhibits relatively low rates of precision.
Increasing the expressive power and the formality of the semantics of
descriptions enables increasingly precise retrieval schemes. The use
of formal semantics also opens the door to the possibility of
automated classification schemes.
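Both limitations of keyword lists can be seen in a few lines. This is an illustrative sketch only: the stop-word list, the `keyword_list` helper, and the three toy documents are assumptions made for the example, not part of any real indexer.

```python
STOPWORDS = frozenset({"for", "of"})


def keyword_list(phrase):
    """Reduce a phrase to an unordered keyword set (hypothetical indexer)."""
    return frozenset(w for w in phrase.lower().split() if w not in STOPWORDS)


# Two different topics collapse to the same keyword list, because the
# relationship between "parents", "diabetic" and "children" is lost.
a = keyword_list("support groups for parents of diabetic children")
b = keyword_list("support groups for diabetic parents of children")
same_keywords = (a == b)
print(same_keywords)  # True

# The same keyword list selects different objects depending on whether it
# is read as a conjunction or as a disjunction.
docs = {
    "doc1": {"diabetes", "children"},
    "doc2": {"diabetes"},
    "doc3": {"children"},
}
wanted = {"diabetes", "children"}
conjunctive = sorted(name for name, kw in docs.items() if wanted <= kw)
disjunctive = sorted(name for name, kw in docs.items() if wanted & kw)
print(conjunctive)  # ['doc1']
print(disjunctive)  # ['doc1', 'doc2', 'doc3']
```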
Suppose we wish to advertise
an organization under the description "support groups for parents
of diabetic children". Given the following fragment of a
taxonomy
- Support group
  - Support group for recovering substance abusers
  - Parental support group
    - Support group for parents of congenitally-ill children
    - Support group for single mothers
our description should classify as a subtopic below "support groups
for parents of congenitally-ill children". If each of these
descriptions has a formal definition, then this classification can be
performed automatically. Here is a formal version of the description
"support groups for parents of diabetic children", phrased in
an OSQL-like syntax:
    select g in Support-Group
    where forall m in g.members
          always (exists c in m.children where c.diabetic)
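To make the quantifier structure of this description explicit, here is a minimal Python sketch of the same condition evaluated as a predicate over in-memory objects. The `Person` and `SupportGroup` classes and the sample data are hypothetical; the OSQL-like query itself does not prescribe any particular data model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Person:
    name: str
    diabetic: bool = False
    children: List["Person"] = field(default_factory=list)


@dataclass
class SupportGroup:
    name: str
    members: List[Person] = field(default_factory=list)


def parents_of_diabetic_children(group):
    """forall m in group.members: exists c in m.children where c.diabetic."""
    return all(any(c.diabetic for c in m.children) for m in group.members)


def select(groups):
    """Return the groups that satisfy the description."""
    return [g for g in groups if parents_of_diabetic_children(g)]


# Hypothetical data: every member of group A has a diabetic child;
# the only member of group B does not.
bo, di, flo = Person("Bo", diabetic=True), Person("Di", diabetic=True), Person("Flo")
group_a = SupportGroup("A", [Person("Ann", children=[bo]), Person("Carl", children=[di])])
group_b = SupportGroup("B", [Person("Eve", children=[flo])])
selected = [g.name for g in select([group_a, group_b])]
print(selected)  # ['A']
```

Note that a universally quantified condition is vacuously true for a group with no members, so a practical description language would probably also require at least one member.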
Formal descriptions permit one to draw arbitrarily fine distinctions
between pairs of information items, and they permit automatic
categorization, both of which will be needed to manage very large
taxonomies. They also provide the representational framework needed
to generate "virtual nodes" used to reduce fan-out.
Information retrieval techniques that introduce attribute-value
pairs partially meet the same goals as our descriptions.
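Automatic categorization of this kind can be sketched with a deliberately simplified description language. In the sketch below, a description is a set of attribute-value pairs and a small term hierarchy supplies background knowledge; the attribute names, the hierarchy, and the assumption that "diabetic" is a kind of "congenitally-ill" (which the support-group example presupposes) are all illustrative choices, not the classifier technology the text refers to.

```python
# Hypothetical background knowledge: each term maps to its more general term.
TERM_PARENT = {"diabetic": "congenitally-ill", "single-mother": "parent"}


def is_a(term, ancestor):
    """True if term equals ancestor or lies below it in TERM_PARENT."""
    while term is not None:
        if term == ancestor:
            return True
        term = TERM_PARENT.get(term)
    return False


def subsumes(general, specific):
    """general subsumes specific if specific meets every constraint of general."""
    return all(
        attr in specific and is_a(specific[attr], value)
        for attr, value in general.items()
    )


def most_specific_subsumers(taxonomy, description):
    """Names of the nodes the new description should be placed directly below."""
    found = [n for n, d in taxonomy.items() if subsumes(d, description)]
    return [
        n for n in found
        if not any(m != n and subsumes(taxonomy[n], taxonomy[m]) for m in found)
    ]


taxonomy = {
    "Support group": {"kind": "support-group"},
    "Support group for recovering substance abusers":
        {"kind": "support-group", "member": "recovering-substance-abuser"},
    "Parental support group": {"kind": "support-group", "member": "parent"},
    "Support group for parents of congenitally-ill children":
        {"kind": "support-group", "member": "parent", "child": "congenitally-ill"},
    "Support group for single mothers":
        {"kind": "support-group", "member": "single-mother"},
}
new_description = {"kind": "support-group", "member": "parent", "child": "diabetic"}
placement = most_specific_subsumers(taxonomy, new_description)
print(placement)  # ['Support group for parents of congenitally-ill children']
```

The placement is computed from the definitions alone, without consulting the existing tree structure, which is what makes automatic insertion of new nodes possible.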
The catch with formal descriptions is that we need to develop
automatic or semi-automatic ways to synthesize them. Some
possibilities:
- Get Web advertisers "into the habit" of including a formal
description of their information (analogous to including keywords in a
document)
- Generate descriptions from the structured portions of
semi-structured objects
- Develop content-interpreters that can scan a document (e.g.,
text or an image) and produce a formal description that summarizes it.
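As an illustration of the second possibility, the sketch below derives attribute-value pairs from the structured header of a semi-structured object. The "Field: value" header format, the field names, and the sample text are assumptions made for the example; real semi-structured sources would need their own extraction rules.

```python
def description_from_headers(text):
    """Collect 'Field: value' lines that precede the first blank line."""
    description = {}
    for line in text.splitlines():
        if not line.strip():
            break  # a blank line ends the structured header
        key, sep, value = line.partition(":")
        if sep:  # ignore header lines that contain no colon
            description[key.strip().lower()] = value.strip()
    return description


# Hypothetical semi-structured object: a header followed by free text.
page = (
    "Type: support group\n"
    "Audience: parents of diabetic children\n"
    "\n"
    "We meet on the first Tuesday of every month.\n"
)
description = description_from_headers(page)
print(description)
# {'type': 'support group', 'audience': 'parents of diabetic children'}
```

The extracted pairs are still informal strings; turning them into a formal description would require mapping each value onto terms of the description language.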
To process a query like "Find documents about support groups for
parents of diabetic children" first requires a means for finding all
taxonomies that might contain relevant information. Then, the query
engine would need to work its way down each such taxonomy, looking for
nodes that match the query. A new generation of tools will be needed
to perform these searches (i.e., traditional database query tools
solve only a portion of the search problem). Classifier technology
from the field of knowledge representation (KR) is directly relevant,
although it will need to be industrialized (more powerful indexing
techniques, integration with relational database management systems
(RDBMSs), parallel search algorithms).
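The top-down search described above can be sketched as follows. This is a simplified model, not an industrial classifier: descriptions are reduced to feature sets, each child node is assumed to carry all of its parent's features, the query is assumed to have been expanded with implied features beforehand, and the node labels and the second taxonomy are hypothetical.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List


@dataclass
class Node:
    label: str
    features: FrozenSet[str]
    children: List["Node"] = field(default_factory=list)


def deepest_matches(node, query):
    """Labels of the most specific nodes whose features all occur in the query."""
    if not node.features <= query:
        # Children are more specific than their parent, so the whole
        # subtree can be pruned once the parent fails to match.
        return []
    below = [m for child in node.children for m in deepest_matches(child, query)]
    return below or [node.label]


def search(taxonomies, query):
    """Run the top-down search in every candidate taxonomy."""
    return {name: deepest_matches(root, query) for name, root in taxonomies.items()}


sg = frozenset({"support-group"})
support = Node("Support group", sg, [
    Node("Support group for recovering substance abusers", sg | {"substance-abuser"}),
    Node("Parental support group", sg | {"parent"}, [
        Node("Support group for parents of congenitally-ill children",
             sg | {"parent", "congenitally-ill-child"}),
        Node("Support group for single mothers", sg | {"parent", "single-mother"}),
    ]),
])
sports = Node("Sports club", frozenset({"sports-club"}))  # unrelated taxonomy

query = frozenset({"support-group", "parent", "congenitally-ill-child", "diabetic-child"})
results = search({"support": support, "sports": sports}, query)
print(results)
# {'support': ['Support group for parents of congenitally-ill children'], 'sports': []}
```

Pruning at the first non-matching node is what keeps the search tractable in a large taxonomy, and it is only sound if the hierarchy really is ordered by specificity.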
If each taxonomy uses its own unique vocabulary, and if there is
no means for aligning that vocabulary with that used in other
taxonomies, then we have the Babel situation where everyone is
speaking a different language, and no one can understand a vocabulary
other than their own. The most direct means to solve this is to adopt
an overarching framework that defines the vocabulary used by everyone
in all taxonomies (e.g., we all adopt the Library of Congress scheme,
or the SENSUS taxonomy, or the Cyc taxonomy as the foundation for our
individual taxonomies). A more likely scenario is that many
semi-standardized taxonomies will exist for different domains (cf.
the MeSH taxonomy used in the domain of medicine). Searches that span
multiple taxonomies will require some means for adjusting to each of
the different vocabularies. Merging of smaller taxonomies to form
larger ones will be commonplace, and a new set of tools will be
needed for performing this type of integration. The merging task will
be simpler and the results more dependable if a formal semantics
underlies each of the merging taxonomies.
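One simple way to adjust a query to several vocabularies is a per-taxonomy mapping table from shared terms to local terms. The sketch below assumes such tables already exist; the taxonomy names and all of the terms are hypothetical and are not taken from SENSUS, Cyc, or MeSH.

```python
# Hypothetical mappings from a shared vocabulary to each taxonomy's own terms.
TO_LOCAL = {
    "taxonomy_a": {
        "support-group": "Support group",
        "diabetes": "Diabetes (disease)",
    },
    "taxonomy_b": {
        "support-group": "Self-help organization",
    },
}


def translate(query_terms, mapping):
    """Split a query into (terms in the local vocabulary, terms with no mapping)."""
    translated = {mapping[t] for t in query_terms if t in mapping}
    untranslated = {t for t in query_terms if t not in mapping}
    return translated, untranslated


query = {"support-group", "diabetes"}
per_taxonomy = {name: translate(query, m) for name, m in TO_LOCAL.items()}
for name in sorted(per_taxonomy):
    local_terms, missing = per_taxonomy[name]
    print(name, sorted(local_terms), sorted(missing))
# taxonomy_a ['Diabetes (disease)', 'Support group'] []
# taxonomy_b ['Self-help organization'] ['diabetes']
```

Reporting the untranslated terms matters: a taxonomy that cannot express part of the query can only approximate the answer, which affects how its results should be merged with those of other taxonomies.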
Description-based technology offers a strategy for developing
self-organizing Web indices, which is a prerequisite to the
construction of very large taxonomies. The degree of retrieval
precision obtainable using descriptions depends on the expressiveness
of the description language and the degree of formality of the
description semantics. The field of KR offers several technologies
that will assist development of a new generation of Web query
tools.