Adaptive Information Filtering with Minimal Instruction
Progress Report: July 1, 2002óDecember 31, 2002
Tommi S. Jaakkola and Tomaso Poggio
This project concerns with automated methods for finding a few pieces of relevant information (such as research articles) within a large dataset of predominantly incomplete and superficially similar information (such as technical report archives). While many such information filtering tasks vary considerably depending on the context, the primary challenges associated with automated techniques are often shared across different tasks. In this project, we develop the foundations for adaptive information retrieval tools with the ability to function accurately with minimal instruction of what is relevant, learn from related filtering problems, and make use of any optional feed-back automatically queried from the user
Progress Through December 2002
We have made progress on several fronts and outline here three areas in particular: 1) classification with unlabeled examples, 2) collaborative filtering, and 3) feature induction for text classification.
Classification with unlabeled examples. We have developed a new principle for combing unlabeled and labeled examples for the purpose of accurate classification of documents. The key motivation for this work is that the labels or relevance assessments that one is interested in learning to predict are often largely missing in the available data. Standard statistical methods are therefore not effective in this context as they have been developed primarily under the assumption that the key information is missing in only a small fraction of the available cases. Our principle is the first to articulate in probabilistic terms how the conditional distribution of labels given documents -- quantity needed for classification -- should in general relate to the marginal distribution of documents or the large number of unannoted documents. The principle is cast in terms of regularization theory and penalizes information about the labels that is introduced beyond the few available labeled examples. This ensures that the classification decisions for unlabeled documents are based firmly on the available information. Effectively, the regularization principle states how the conditional probabilities should be interpolated over the unlabeled points on the basis of the few labeled examples. More generally, the principle is formulated in terms of Tikhonov regularization theory.
(While information regularization was already mentioned in an earlier report, the idea has been extended considerably in the last six months.)
We are actively developing efficient algorithms for exploiting this principle in practical applications. We currently have two publications pertaining to early versions of the information regularization work:
Ph.D. thesis by Martin Szummer (jointly advised by T. Jaakkola and T. Poggio):
Final version of the paper presented at the NIPS conference:
Other material currently in preparation will be distributed as they are completed.
Collaborative filtering. Collaborative filtering provides a simple and efficient way of exploiting what is common to groups of decision makers. A typical collaborative filtering approach involves finding a low rank factorization of the data matrix containing user identities and their preference decisions. The low rank factorization that should capture shared features across users is, however, often estimated by assuming a simple squared loss between the data matrix and the predictions. This error measure is mathematically convenient but rarely reflects the structure of the problem. More appropriate error measures such as those based on probabilistic classification methods make the low rank factorization problem somewhat more challenging from a computational perspective. In particular, we can no longer obtain the solution in closed form and have to find the underlying low rank representation with iterative algorithms. We have developed several new algorithms for finding low rank matrix factorizations in the more general contexts relevant for collaborative filtering. The simplest such algorithms cast the iterative steps of the estimation process in terms of incomplete data estimation and formulate an EM algorithm for finding the low rank matrix factorizations. The mathematical foundation of these ideas can be found in the following technical report
N. Srebro and T. Jaakkola, "Generalized Low-Rank Approximations", AI Memo 2003-001.
The paper detailing an application of these ideas to collaborative filtering is under preparation and will be made available shortly.
Feature induction for text classification. An important problem in document classification is to find the set of text features that are useful for classification, both in general and in the context of a specific classification task. We cast the problem in terms of data compression and systematically search for features in the documents so as to best compress the labels (or summaries) that one is interested in predicting. The features come in two forms: those that are common to a number of classification tasks, and those that are useful only within a single task. Generic features that are useful across tasks are much less costly to introduce (they need to be encoded only once) but may not suffice for the specific task at hand.
A brief and preliminary description of some of these ideas can be found in:
J. Rennie and T. Jaakkola, ìFeature induction for text classificationî, AI Abstract 2003,
Research Plan for the Next Six Months
Our plan for the next six months consists primarily of the following tasks:
Developing and testing efficient algorithms for exploiting information regularization principle in large scale applied problems. Specifically, we hope to solve the regularization problem within a constrained class of conditional probabilities.
More extensive adaptation of generalized low rank estimation algorithms to collaborative filtering tasks.
Formulating and testing active learning methods in combination with the information regularization framework. This approach would not only make an appropriate use of the available unlabeled documents but would also exploit resulting classification decisions to iteratively find the most informative new documents to label.
We seek to extend the overall active learning framework to efficiently annotate documents in large databases with multiple topic labels. This would be done in collaboration with Dr. Naonori Ueda.