Finding Information on the Web

A Knowledge Representation Approach

W. A. Woods (e-mail: william.woods@east.sun.com)
Sun Microsystems Laboratories
Chelmsford, Mass.

Current approaches to finding information on the web are to start somewhere and search by following hypertext links, or to go to a site that has indexed a large number of web pages and use a keyword search engine. Neither of these options is totally satisfactory when you have a specific information need that you'd like to answer quickly. Techniques from knowledge representation research have the potential to improve this situation by annotating documents with natural language descriptions of their content and taking account of the conceptual structures of those descriptions.

I have been doing research on what I call the "paraphrase problem" in information retrieval -- the fact that the words and concepts used by an information seeker are often different from the words and concepts used by the author of the desired material. There are three challenges here:

1. What information is required to connect the terms of a query to those of a relevant information item?
2. How can this information be organized and used efficiently to make the necessary connections between a request and the desired material?
3. To what extent can descriptions of the content of a document be automatically extracted from the document itself?

I have found that techniques from knowledge representation and natural language processing can make a useful contribution to solving the paraphrase problem. Specifically, the words and phrases used in a body of information can be organized into a structured conceptual taxonomy by the relationship of subsumption, which links more general concepts to the more specific concepts that they subsume. Searching this taxonomy efficiently finds paths connecting concepts in a request with related concepts in relevant texts, so that specific information items can be located in response to specific information requests.
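As a rough illustration of searching a taxonomy for a connecting path, the following sketch does a breadth-first search over subsumption links. The taxonomy contents, the SUBSUMES table, and the find_path function are hypothetical names and toy data chosen for this example; they are not the data structures of the actual system.

```python
from collections import deque

# Hypothetical toy taxonomy: each general concept maps to the more
# specific concepts it directly subsumes. The entries are assumptions
# made up for illustration only.
SUBSUMES = {
    "cleaning": ["automobile cleaning", "washing"],
    "automobile cleaning": ["car washing"],
    "automobile": ["car"],
}


def find_path(taxonomy, request, target):
    """Breadth-first search from a request concept down subsumption links.

    Returns the list of concepts linking `request` to `target`,
    or None if `target` is not subsumed by `request`.
    """
    queue = deque([[request]])
    visited = {request}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for child in taxonomy.get(path[-1], ()):
            if child not in visited:
                visited.add(child)
                queue.append(path + [child])
    return None


path = find_path(SUBSUMES, "cleaning", "car washing")
print(path)
```

A request for "cleaning" reaches a text indexed under "car washing" through the intermediate concept "automobile cleaning", while a request for "automobile" finds no path to it in this toy data.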

Structured conceptual taxonomies, which are a kind of principled semantic network, can be constructed automatically from words and phrases extracted automatically from text material, using knowledge bases of general semantic facts. For example, given the general semantic facts that "washing" is a kind of "cleaning" and "car" is a kind of "automobile", an algorithmic classification system can automatically classify "car washing" as a kind of "automobile cleaning" and link the latter as subsuming the former if both phrases occur in a body of material. These same algorithms can also detect occurrences of one of these concepts, given the other, if one occurs in the text and the other is used as a request. This makes it possible to annotate material with detailed descriptions of content using nonstandardized terminology and still make connections with the unconstrained terminology of a request.
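The "car washing" example can be sketched in a few lines. This is a minimal sketch under strong simplifying assumptions: phrases are compared word by word in the same positions, and the KIND_OF table holds only the two semantic facts mentioned above. The function names are hypothetical, and a real classifier would work on structured concepts rather than aligned word lists.

```python
# General semantic facts from the example above: each word maps to the
# more general words that directly subsume it ("kind of" links).
KIND_OF = {
    "washing": {"cleaning"},
    "car": {"automobile"},
}


def subsumes_word(general, specific):
    """True if `general` equals `specific` or is reachable via kind-of links."""
    if general == specific:
        return True
    seen = set()
    stack = [specific]
    while stack:
        word = stack.pop()
        if word in seen:
            continue
        seen.add(word)
        for parent in KIND_OF.get(word, ()):
            if parent == general:
                return True
            stack.append(parent)
    return False


def subsumes_phrase(general, specific):
    """Assumption: phrases have equal length and align word by word."""
    g, s = general.split(), specific.split()
    return len(g) == len(s) and all(
        subsumes_word(gw, sw) for gw, sw in zip(g, s)
    )


def build_taxonomy(phrases):
    """Link each phrase to the other phrases in the collection that subsume it."""
    return {
        p: sorted(q for q in phrases if q != p and subsumes_phrase(q, p))
        for p in phrases
    }


def retrieve(request, indexed_phrases):
    """Return indexed phrases that the request subsumes."""
    return sorted(p for p in indexed_phrases if subsumes_phrase(request, p))


taxonomy = build_taxonomy(["car washing", "automobile cleaning"])
hits = retrieve("automobile cleaning", ["car washing", "engine repair"])
print(taxonomy, hits)
```

Here "car washing" is linked under "automobile cleaning" without either phrase having been entered by hand, and a request phrased as "automobile cleaning" matches material described as "car washing". The sketch only retrieves phrases more specific than the request; matching in the other direction would need a second check.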

A rigorous semantics for structured concepts and the relationships among them makes it possible to automatically combine and organize information from different taxonomies distributed over the web into a single virtual taxonomy that is available to individual information seekers. Such a taxonomy can be used both to make connections between a request and possible answers and also as structure to support browsing in conceptual space.
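One way to picture the combination step is as a union of subsumption links drawn from separate taxonomies. The sketch below assumes each taxonomy is a plain mapping from a concept to its direct, more general parents and that identical phrases denote the same concept; the two "site" taxonomies, the concept "vehicle maintenance", and the function names are hypothetical and chosen only for illustration.

```python
def merge_taxonomies(*taxonomies):
    """Union the subsumption links of several taxonomies.

    Each taxonomy maps a concept to the set of concepts that directly
    subsume it. Assumes the same phrase means the same concept everywhere.
    """
    merged = {}
    for taxonomy in taxonomies:
        for concept, parents in taxonomy.items():
            merged.setdefault(concept, set()).update(parents)
            for parent in parents:
                merged.setdefault(parent, set())
    return merged


def ancestors(taxonomy, concept):
    """All concepts that subsume `concept`, directly or indirectly."""
    found = set()
    stack = list(taxonomy.get(concept, ()))
    while stack:
        node = stack.pop()
        if node not in found:
            found.add(node)
            stack.extend(taxonomy.get(node, ()))
    return found


def children(taxonomy, concept):
    """Direct specializations of `concept`, for browsing downward."""
    return {c for c, parents in taxonomy.items() if concept in parents}


# Hypothetical taxonomies published by two different sites.
site_a = {"car washing": {"automobile cleaning"}}
site_b = {"automobile cleaning": {"vehicle maintenance"}}

virtual = merge_taxonomies(site_a, site_b)
print(ancestors(virtual, "car washing"))
print(children(virtual, "vehicle maintenance"))
```

Neither site alone connects "car washing" to "vehicle maintenance", but the merged taxonomy does, and the children function gives a simple way to browse from a general concept to its specializations.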

For more information on knowledge representation, semantic networks, and structured conceptual taxonomies, see:

W. A. Woods, "Important Issues in Knowledge Representation," Proceedings of the IEEE, Vol. 74, No. 10 (October 1986), pp. 1322-1334. Reprinted in Peter G. Raeth (ed.), Expert Systems: A Software Methodology for Modern Applications, Los Alamitos: IEEE Computer Society Press, 1990, pp. 180-204.

W. A. Woods, "Understanding Subsumption and Taxonomy: A Framework for Progress," in John Sowa (ed.), Principles of Semantic Networks: Explorations in the Representation of Knowledge, San Mateo: Morgan Kaufmann, 1991, pp. 45-94.

W. A. Woods and James Schmolze, "The KL-ONE Family," Computers & Mathematics with Applications, Vol. 23, Nos. 2-5 (January-March 1992), special issue on Semantic Networks in Artificial Intelligence, Part 1, pp. 133-177. Also reprinted in Fritz Lehmann (ed.), Semantic Networks in Artificial Intelligence, Pergamon Press, 1992, pp. 133-177.