November 13th, 1997
4:15pm
Refreshments at 4:00pm
NE43 - 8th Floor Playroom
Currently, information on the Web can be accessed only by browsing or by keyword search. Ideally, one would like to use information on the Web to answer complex queries, that is, queries that require deduction to answer. One way to accomplish this is to translate several information sources into a single common knowledge base and then query that knowledge base. A number of "knowledge integration" systems that work this way have been built. However, they have had limited impact because the initial translation step is quite expensive in terms of human effort.
In my talk I will propose a new way of representing knowledge that is midway between the representation used by a conventional database and the representation used by a search engine. Specifically, I will propose representing information as a collection of documents organized into relations. I will then present a logic that uses this representation to efficiently approximate certain database operations by reasoning about the similarity of pairs of documents. (Similarity is measured using the vector space model, a metric widely used in statistical information retrieval.) I will argue that this scheme is much better suited to representing the sort of loosely coupled, heterogeneous information sources typically found on the Web. I will also argue that adopting this data model reduces the cost of knowledge integration by roughly an order of magnitude, making it possible to integrate significant numbers of sites into a single common knowledge base. Time and facilities permitting, I will demonstrate my system on a couple of sample domains.
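To make the idea of approximating a database operation by document similarity concrete, here is a minimal Python sketch of a "similarity join" between two columns of short text documents. It is an illustration only, not the system or logic described in the talk: the TF-IDF weighting formula, the function names (tokenize, tfidf_vectors, cosine, similarity_join), and the sample company names are all assumptions chosen for the example. The sketch scores every pair exhaustively, so it shows what is being approximated, not how to do it efficiently.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def tfidf_vectors(docs):
    """Map each document to a unit-length TF-IDF vector (dict: term -> weight).

    The weighting (1 + log tf) * log((n + 1) / df) is one common variant,
    assumed here for illustration.
    """
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = {t: (1 + math.log(c)) * math.log((n + 1) / df[t])
               for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors (a dot product)."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def similarity_join(left, right, k=3):
    """Return up to k (score, left_doc, right_doc) triples, best score first.

    Instead of requiring exact equality between join keys, each pair is
    ranked by the similarity of its two text fields.
    """
    vecs = tfidf_vectors(left + right)
    lv, rv = vecs[:len(left)], vecs[len(left):]
    pairs = [(cosine(a, b), l, r)
             for l, a in zip(left, lv)
             for r, b in zip(right, rv)]
    pairs.sort(key=lambda p: p[0], reverse=True)
    return [p for p in pairs[:k] if p[0] > 0]


# Two made-up "relations", each a single column of name strings.
site_a = ["Acme Software Inc", "General Widget Corporation"]
site_b = ["ACME software", "Widget Corp General", "Unrelated Bakery"]

for score, a, b in similarity_join(site_a, site_b):
    print(f"{score:.2f}  {a!r} ~ {b!r}")
```

In this sketch, names that share rare terms pair up even though no two strings are identical, which is the kind of matching an exact-equality join over independently maintained sites would miss.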