Haystack: Per-User Information Evironments
Progress Report: January 1, 2000June 30, 2000
David R. Karger and Lynn Andrea Stein
The Haystack project is investigating the ways in which electronic infrastructure can be used to triangulate among the different knowledges of an individual, of collegial communities to which we belong, and of the world at large. Its infrastructure consists of a community of independent, interacting information repositories, each customized to its individual user. It provides automated data gathering (through active observation of user activity), customized information collection, and adaptation to individual query needs. It also facilities inter-haystack collaboration, exposing (public subsets of) the tremendous wealth of individual knowledge that each of us currently has locked up in our personal information spaces. The Haystack-NTT project involves augmenting its customization, learning and adaptation, and inter-haystack communication.
Progress Through June 2000
In the first six months of this grant, we rewrote the core of our Haystack system to create a privileged kernel. that assures the persistence and transaction-safety of the data in a Haystack. We also began to investigate query self-organization of an individuals haystack in response to queries.
In the past six months, we have completed this work and in addition built on this functionality by providing rudimentary learning capabilities for haystack, by adding facilities to combine both structured and unstructured queries, and by integrating additional input facilities. Finally, Haystack can now archive substantial corpora (5K documents).
The Haystack data model is semi-structured; it allows both arbitrary text and structured relations such as authorship. During the past six months, we developed a system that performs database-like queries on semistructured data using current relational database technologies. To achieve this goal, we designed a database schema that would allow us to store our data model. We also specified the format in which the user can enter database queries and implemented procedures that translate these user queries into SQL. Finally, we designed and integrated these structural searches with standard text search within Haystack.
In a separate project, we added several machine learning techniques to Haystack. Now, when a user makes a query on a topic similar to a previous query, the system uses any relevance feedback available from that prior query to provide an improved result set for the current query. The core learning module was designed to be modular and extensible so that more sophisticated learning algorithms and techniques can be easily implemented in the future. Testing of our system shows that learning based on relevance feedback even using the simplest algorithm provides some improvement on the results of the queries.
Haystack is a ubiquitous information repository; it aims to collect all information with which a user interacts. In a final project over the past six months, we integrated scanning and OCR capabilities into Haystack. This extends the reach of Haystacks suite of information management tools beyond the purely electronic realm and into the paper office. The scan-to-haystack pipeline is fully automated; dropping a document into the scanners feeder results in archival of the documents content.
Research Plan for the Next Six Months
Over the next six months, we will work on a variety of back end enhancements and improvements to increase the robustness of the system; we will begin construction of a new front end designed to allow increased user customizability; and we will continue to improve the performance of Haystacks archive and query functions to increase its utility over increasingly larger corpora.
On the back end, we will begin work on a more robust transaction manager to increase the systems fault tolerance and allow for backing out of erroneously derived data. We will also start to design a versioning system that will make it possible to query previous system states. Finally, we will undertake a re-modularization of the kernel system to allow increased extensibility as well as further evolution of the system.
We also plan to build a more object-oriented user interface that will display each object in the way best suited for that object. We intend to explore a new user interface for haystack based on hints. We expect to record a users preferences regarding the display of a particular object e.g., arrangement of individual objects within a folder as Haystack data for use by multiple display tools as well as for potential use by some of Haystacks retrieval services.
By the end of December 2000, we expect to be able to demonstrate a robust Haystack system running on a sizable corpus exploiting several of these new features.