6.891 (Fall 2003):
Instructions for Assignment II (Class project)
Instructor: Michael Collins
Instructions
Overview: The second (and final) assignment for 6.891 (fall 2003) is
a class project. This assignment will make up 70% of the final grade (the
remaining 30% depends on the first assignment).
The idea behind the project is to take some idea/technique/problem
from the class, and develop it further. Most likely the project will
involve an implementation of a machine learning approach to some
natural language problem, followed by experimentation with the
approach; but more theoretical projects are also possible.
Relation to current/previous research. The project can be
related to your current research. However, please do not submit any
work which you have completed prior to taking the course.
Due date: The project is due by Wednesday December 10th.
Collaboration policy: Group projects are fine, but a project with n
collaborators should represent n times the work of a single-person
project. Each group should hand in one written report. You should aim
to partition the work in such a way that different people in the group
are working on clearly defined "components" of the system. The final
report should identify exactly who contributed which part of the project.
Project proposal: Please send me a project proposal of around
two paragraphs by Monday, November 9th. If you're not sure
what you'd like to do, and would like to discuss possible projects,
send me email by that date to set up a meeting
(I'll be available for meetings Thursday/Friday November 6th/7th,
and most days of the week of November 9th).
If you're planning a group project, please let me know the people
involved, and how you plan to partition the work.
The final report: The final report for the project should be
around 10 pages in length (12pt, single spaced), excluding figures.
Project examples:
Here I'll sketch examples of projects that you might consider. These
examples are intended both to give you concrete suggestions and to
illustrate the kind of project that would be suitable. You could pick
a project from the list below (in which case, arrange an appointment
with me so we can go over the details), or you could choose your own
project.
-
Features for Parse Reranking.
Recently, in the global linear models section of class,
we've described reranking methods which represent parse trees using
feature-vector representations. One learning algorithm we went over
was boosting. Both datasets (n-best parse trees) and boosting code
are available. The aim of this project would be to investigate how
different parse-tree features affect the performance of the reranking
approach. Given the data and code currently available, this project
would involve: 1) writing code that identifies features in parse
trees; 2) experimentation investigating how different feature choices
affect performance.
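As a rough illustration of step 1), counting context-free rule occurrences is one simple class of parse-tree features. The tree encoding and feature names below are invented for the sketch; they are not the format of the actual course data.

```python
# A sketch of feature extraction for parse reranking: one feature per
# context-free rule occurring in the tree. Trees are encoded as nested
# tuples (label, child, child, ...), with words as plain strings; this
# encoding and the feature names are illustrative only.

from collections import Counter

def rule_features(tree, feats=None):
    """Count a RULE:X->Y1_Y2 feature for every rule occurrence in the tree."""
    if feats is None:
        feats = Counter()
    if isinstance(tree, str):        # a leaf word contributes no rule
        return feats
    label, children = tree[0], tree[1:]
    child_labels = [c if isinstance(c, str) else c[0] for c in children]
    feats["RULE:%s->%s" % (label, "_".join(child_labels))] += 1
    for child in children:
        rule_features(child, feats)
    return feats

# Example tree for "the dog barks":
tree = ("S", ("NP", ("D", "the"), ("N", "dog")), ("VP", ("V", "barks")))
feats = rule_features(tree)
```

Richer feature classes (bigrams of non-terminals, lexical head dependencies, and so on) would slot into the same recursion.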
-
Algorithms for Parse Reranking.
We've described various algorithms applied to parameter
estimation for parse
reranking: the perceptron, boosting, and log-linear models, for
example. Other methods are possible: for example, Winnow (an algorithm
that is related to the perceptron but has interesting properties), or
stochastic gradient descent. These algorithms haven't been applied to
the reranking problem in the past, but it would be interesting to see how they
perform. Datasets are available as files containing the parse data
that is input to the parameter estimation code.
If you're interested in this project, I'll give you pointers to
methods such as Winnow and stochastic gradient descent which you might
apply.
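For concreteness, the perceptron update for reranking can be sketched as below; the sparse feature-dict representation of each candidate is assumed for illustration, and the same loop is a natural starting point for swapping in Winnow-style or stochastic gradient updates.

```python
# A sketch of perceptron training for reranking. Each training example is
# (candidates, gold_index), where each candidate parse is a sparse feature
# dict; this representation is assumed for illustration, not the actual
# file format of the course datasets.

def score(w, feats):
    return sum(w.get(f, 0.0) * v for f, v in feats.items())

def perceptron_rerank(train, epochs=5):
    w = {}
    for _ in range(epochs):
        for candidates, gold in train:
            # pick the highest-scoring candidate under the current weights
            pred = max(range(len(candidates)),
                       key=lambda i: score(w, candidates[i]))
            if pred != gold:
                # move weights toward the gold parse, away from the mistake
                for f, v in candidates[gold].items():
                    w[f] = w.get(f, 0.0) + v
                for f, v in candidates[pred].items():
                    w[f] = w.get(f, 0.0) - v
    return w

# Toy data: the gold candidate always contains feature "a".
train = [([{"a": 1.0}, {"b": 1.0}], 0),
         ([{"b": 1.0}, {"a": 1.0}], 1)]
w = perceptron_rerank(train)
```

A Winnow variant would replace the additive updates with multiplicative ones; a stochastic gradient version would update toward the gradient of a log-linear objective on each example instead of only on mistakes.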
-
Building a Machine Translation System.
Resources for building statistical machine
translation systems exist in publicly available sources on the web:
for example, parallel aligned corpora in various language pairs; the
Giza++ system for training various IBM Models; and the ISI rewrite
decoder. In this project the goal would be to put together a
machine translation system using these tools.
-
Alignments in Machine Translation.
In Machine Translation Part IV, we went over phrasal
models for machine translation. These methods used heuristic methods
which started from the IBM model alignments, and then searched for
alignments between phrases in two languages (e.g., see the Koehn,
Och and Marcu paper). In this project you could use the
Giza++ system to induce initial, IBM Model alignments, then work
on heuristics to improve these alignments, and to find phrasal pairs.
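As a starting point, the standard symmetrization idea can be sketched as follows; this is a simplified toy version of the grow heuristics described in the Koehn, Och and Marcu paper, not their exact algorithm, with alignments represented as sets of word-index pairs.

```python
# A simplified sketch of alignment symmetrization: start from the
# intersection of the two directional IBM-model alignments, then grow
# with union points adjacent to an already-accepted point. A toy version
# of the grow heuristics in the Koehn, Och and Marcu paper.

def symmetrize(e2f, f2e):
    """e2f, f2e: sets of (i, j) word-index pairs from the two directions."""
    alignment = set(e2f & f2e)        # high-precision starting point
    union = e2f | f2e
    added = True
    while added:
        added = False
        for (i, j) in sorted(union - alignment):
            neighbours = {(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)}
            if neighbours & alignment:
                alignment.add((i, j))
                added = True
    return alignment

# Invented example alignments for a 3-word / 3-word sentence pair:
e2f = {(0, 0), (1, 1), (2, 1)}
f2e = {(0, 0), (1, 1), (1, 2)}
sym = symmetrize(e2f, f2e)
```

Phrase-pair extraction would then read consistent rectangular blocks off the symmetrized alignment.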
-
Language Modeling.
In Lecture 2 we went over language modeling techniques,
the goal being to come up with a statistical model with low
"perplexity" on naturally occurring text. Much of this work has been
done on English datasets; other languages might present very different
challenges. (For example, in Chinese it is not clear where word
boundaries occur, and in languages such as Czech the rich morphology
means that there is a huge number of possible word forms.) The aim of
this project would be to investigate language modeling for one of
these languages which has different properties from English. We should
be able to find relatively large datasets for this problem in a number
of languages.
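To make perplexity concrete, a minimal bigram model with add-one smoothing looks like this. The toy corpus is invented for the sketch; a real project would use the better smoothing techniques from Lecture 2 on much larger datasets.

```python
# A minimal bigram language model with add-one smoothing, to make the
# notion of perplexity concrete. Toy data, invented for illustration.

import math
from collections import Counter

def train_bigram(sentences):
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        toks = ["<s>"] + s + ["</s>"]
        unigrams.update(toks[:-1])
        bigrams.update(zip(toks[:-1], toks[1:]))
    vocab_size = len({w for s in sentences for w in s} | {"</s>"})
    return unigrams, bigrams, vocab_size

def perplexity(sentences, model):
    unigrams, bigrams, V = model
    log_prob, n = 0.0, 0
    for s in sentences:
        toks = ["<s>"] + s + ["</s>"]
        for u, w in zip(toks[:-1], toks[1:]):
            p = (bigrams[(u, w)] + 1.0) / (unigrams[u] + V)  # add-one
            log_prob += math.log(p)
            n += 1
    return math.exp(-log_prob / n)

data = [["the", "dog", "barks"], ["the", "cat", "sleeps"]]
model = train_bigram(data)
pp = perplexity(data, model)
```

For Chinese, the tokenization step itself becomes part of the model; for Czech, the vocabulary-size term V explodes, which is exactly where the modeling challenge lies.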
-
Implementing a Perceptron Tagger.
We just went over the idea of using the perceptron
algorithm to train a tagger that uses "local" feature
representations. In this project the goal would be to implement such a
tagger. If implemented in sufficiently general form, the method could
be applied to a number of tasks. The project could involve replicating
results for the tagger on previously studied problems such as POS
tagging of English; or applying the tagger to new problems such as
tagging in a language other than English.
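The core of such a tagger can be sketched as below, with a deliberately minimal, invented feature set (current word and previous tag) and greedy left-to-right decoding; a full project would use richer features and Viterbi decoding.

```python
# A sketch of a perceptron-trained tagger with "local" feature
# representations: the current word and the previous tag, decoded
# greedily left to right. The feature set is minimal and invented.

def features(word, prev_tag):
    return ["WORD=" + word,
            "PREV=" + prev_tag,
            "WORD+PREV=%s_%s" % (word, prev_tag)]

def tag(sentence, w, tagset):
    tags, prev = [], "<s>"
    for word in sentence:
        best = max(tagset, key=lambda t: sum(w.get((f, t), 0.0)
                                             for f in features(word, prev)))
        tags.append(best)
        prev = best
    return tags

def train(data, tagset, epochs=5):
    w = {}
    for _ in range(epochs):
        for sentence, gold in data:
            pred = tag(sentence, w, tagset)
            prev_g = prev_p = "<s>"
            for word, g, p in zip(sentence, gold, pred):
                if g != p:          # perceptron update on errors only
                    for f in features(word, prev_g):
                        w[(f, g)] = w.get((f, g), 0.0) + 1.0
                    for f in features(word, prev_p):
                        w[(f, p)] = w.get((f, p), 0.0) - 1.0
                prev_g, prev_p = g, p
    return w

# Invented toy data with a determiner/noun tagset:
data = [(["the", "dog"], ["D", "N"]),
        (["the", "cat"], ["D", "N"])]
tagset = ["D", "N"]
w = train(data, tagset)
```

Because the features are computed locally, applying the tagger to a new task or language mostly means rewriting the features function.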
-
Learning to Parse Efficiently.
The lexicalized statistical parsers we described earlier
in class (for example Eugene Charniak's parser, or the parser I
developed) can be computationally quite intensive when searching for
the most likely parse for a sentence. The search for the most likely
parse frequently involves tens of thousands of partial hypotheses
(entries in the dynamic programming
structures). Something that hasn't been investigated much in the past
is the idea of using machine learning to improve parsing
efficiency. This could be accomplished by using learning methods
to build a module that identifies partial hypotheses during search
which are very unlikely to be part of the final parse. You could use
the parser described in my thesis as the baseline system for this
problem; the goal would be to improve efficiency without compromising
accuracy. This approach could well lead to big efficiency gains.
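One way to set this module up, sketched below with invented features, weights, and hypothesis representation, is a binary classifier (here a simple logistic scorer) that predicts whether a chart entry will survive into the final parse, and prunes entries below a probability threshold.

```python
# A sketch of learned pruning for parsing: a logistic scorer over
# features of each partial hypothesis (chart entry), used to discard
# entries judged unlikely to appear in the final parse. The features,
# weights, and hypothesis representation are invented for illustration.

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def prune(hypotheses, w, threshold=0.1):
    """hypotheses: list of (id, feature_dict); keep the ids the model
    judges sufficiently likely to survive into the final parse."""
    kept = []
    for hid, feats in hypotheses:
        z = sum(w.get(f, 0.0) * v for f, v in feats.items())
        if sigmoid(z) >= threshold:
            kept.append(hid)
    return kept

# Invented example: a weight penalising one toy feature.
w = {"SPAN_LEN": -1.0}
hypotheses = [("h1", {"SPAN_LEN": 1.0}), ("h2", {"SPAN_LEN": 10.0})]
kept = prune(hypotheses, w)
```

Training data for the classifier could be generated by running the baseline parser and labeling each chart entry by whether it appears in the final parse; the threshold then trades efficiency against the risk of pruning away a correct entry.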