Adaptive Information Filtering with Minimal Instruction
Progress Report: January 1, 2002óJune 30, 2002
Tommi Jaakkola and Tomaso Poggio
This project concerns with automated methods for finding a few pieces of relevant information (such as research articles) within a large dataset of predominantly incomplete and superficially similar information (such as technical report archives). While many such information filtering tasks vary considerably depending on the context, the primary challenges associated with automated techniques are often shared across different tasks. In this project, we develop the foundations for adaptive information retrieval tools with the ability to function accurately with minimal instruction of what is relevant, learn from related filtering problems, and make use of any optional feed-back automatically queried from the user
Progress Through June 2002
Our progress during the six month period has been in three key areas: 1) active querying of information, 2) estimation with predominantly incomplete information, and 3) collaborative filtering. We provide here a brief description of the results with links to associated publications and presentations.
1. Active querying of information
We formulate the problem of retrieving information from a database (or web) as an active learning problem, where the user is automatically queried for additional information in response to a user-initiated query. The information is elicited from the user at multiple levels of abstraction to quickly determine the set of elements that the user is after. The interaction with the user defines a restricted information channel, which we exploit to minimize the overall time of the exchange. The overall framework has been described in previous reports along with the resulting publications.
For this framework to be applicable we must understand the structure and relations among the elements in the database. When the structure is unknown we seek to recover the structure with minimal effort. We approach this sub-problem also as an active learning problem. This sub-problem can be cast more generally as the problem of recovering the structure of a graphical model (Bayesian network) with minimal number of queries. We have developed a computationally feasible approach to this problem. The key idea is to ensure that we can maintain and update only a few candidate structures at each stage and minimize an appropriate measure of disagreement among the candidates. The details can be found in
Steck and Jaakkola, "Unsupervised Active Learning in Large Domains", Proceedings of the Eighteenth Annual Conference on Uncertainty in Artificial Intelligence, 2002.
We are also in the process of extending the overall active retrieval framework to cases where the documents (or elements in the database) naturally belong to overlapping sets of subtopics. This changes the type of information the user may be after, the set of queries that can be performed, as well as how we can expect the user to respond to the new queries. We are pursuing this work in collaboration with Dr. Naonori Ueda (NTT) whose recent work on multi-category labeling complements our progress on active learning.
2. Estimation with predominantly incomplete information
The available data for text filtering involves predominantly incomplete or fragmented information. A typical realization of this problem involves a few annotated documents exemplifying the filtering task and a large database of unannotated documents. In general, incorporating large amounts of incomplete data can either dramatically increase or decrease the filtering performance.
We have developed two fundamentally new approaches to this problem. First, we address the inherent instability of exploiting heterogeneous sources of information in estimation. Our approach (partially described in previous reports) efficiently recovers a continuum of estimation solutions at varying levels of mixing of the sources. This permits us to identify and avoid any instabilities due to mixing. The paper is available from
Corduneanu and Jaakkola, "Continuation Methods for Mixing Heterogeneous Sources", Proceedings of the Eighteenth Annual Conference on Uncertainty in Artificial Intelligence, 2002.
The associated UAI 2002 conference presentation is also available
We have given several seminar talks about this approach including those at MIT, University of Massachusetts at Amherst, and Yale University.
Our second approach characterizes how unlabeled documents can be tied to their hypothetical labels. The key intuition is that any tight clusters of documents should possess unambiguous labels. We can express this idea mathematically as a regularization approach, where the marginal distribution of documents constrains how the conditional distribution of labels given documents is chosen. The regularization limits the amount of information that related documents can contain about the labels. The details can be found in
Szummer and Jaakkola, "Information Regularization with Partially Labeled Data", submitted, 2002.
3. Collaborative filtering
It is often beneficial to solve multiple filtering tasks concurrently so as to exploit any similarities and regularities across the tasks. Collaborative filtering exploits this idea in contexts that involve multiple users or a single user across multiple filtering criteria. We have developed a new approach to this problem by casting the joint filtering task as a latent matrix factorization problem. The resulting low rank matrix captures the regularities across the related filtering tasks. The underlying mathematical problem and our solution is described in
Srebro and Jaakkola, "Generalized Low-Rank Approximations", submitted, 2002.
Research Plan for the Next Six Months
There are several key goals that we hope to achieve over the next six months:
In collaboration with Dr. Naonori Ueda, we seek to extend the overall active learning framework to settings where the documents possess multiple simultaneous category labels (topics). More precisely, we extend the framework to efficiently annotate documents in large databases with multiple topic labels. A related problem is to efficiently identify documents (in unannotated databases) that conform to user specified constraints on the sub-topics.
We hope to release software for combining heterogeneous data sources. Our method provides a stable solution to a wide variety of estimation tasks involving multiple sources of information of varying type or quality.
We extend and test our information regularization approach in filtering tasks involving large numbers of unlabeled documents. The computational complexity of this approach is inherently tied to the resolution at which we wish to solve the problem and does not scale directly with the size of the dataset.
We employ the new matrix factorization approach to concurrently solve multiple related filtering tasks. The goal is to extend and test the approach so that it can be appropriately integrated with the active learning framework.
A more longer term goal involves the design and implemention of user interfaces supporting the overall interactive approach to information retrieval. The basic elements for this are already in place; our on-going work progressively refines these elements.