6.891 (Fall 2003):
Instructions for Assignment II (Class project)
Instructor: Michael Collins
Instructions
Overview: The second (and final) assignment for 6.891 (fall 2003) is
a class project. This assignment will make up 70% of the final grade (the
remaining 30% depends on the first assignment).
The idea behind the project is to take some idea/technique/problem
from the class, and develop it further. Most likely the project will
involve an implementation of a machine learning approach to some
natural language problem, followed by experimentation with the
approach; but more theoretical projects are also possible.
Relation to current/previous research. The project can be
related to your current research. However, please do not submit any
work which you have completed prior to taking the course.
Due date: The project is due by Wednesday December 10th.
Collaboration policy: Group projects are fine, but a project with n
collaborators should represent n times the work of a single-person
project. Each group should hand in one written report. You should aim
to partition the work in such a way that different people in the group
are working on clearly defined "components" of the system. The final
report should identify exactly who contributed which part of the project.
Project proposal: Please send me a project proposal of around
two paragraphs by Monday, November 9th. If you're not sure
what you'd like to do, and would like to discuss possible projects,
send me email by that date to set up a meeting
(I'll be available for meetings Thursday/Friday November 6th/7th,
and most days of the week of November 9th).
If you're planning a group project, please let me know the people
involved, and how you plan to partition the work.
The final report: The final report for the project should be
around 10 pages in length (12pt, single spaced), excluding figures.
Project examples:
Here I'll sketch examples of projects that you might consider. These
examples are intended both to give you concrete suggestions and to
illustrate the kind of project that would be suitable. You could pick
a project from the list below (in which case, arrange an appointment
with me so we can go over the details), or you could choose your own
project.
-
Features for Parse Reranking.
Recently, in the global linear models section of class,
we've described reranking methods which represent parse trees using
feature-vector representations. One learning algorithm we went over
was boosting. Both datasets (n-best parse trees) and boosting code
are available. The aim of this project would be to investigate how
different parse-tree features affect the performance of the reranking
approach. Given the data and code currently available, this project
would involve: 1) writing code that identifies features in parse
trees; 2) experimentation investigating how different feature choices
affect performance.
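As a rough illustration of step 1), counting context-free rule occurrences is one simple class of parse-tree features. The tree encoding and feature names below are invented for the sketch; they are not the format of the actual course data.

```python
# A sketch of feature extraction for parse reranking: one feature per
# context-free rule occurring in the tree. Trees are encoded as nested
# tuples (label, child, child, ...), with words as plain strings; this
# encoding and the feature names are illustrative only.

from collections import Counter

def rule_features(tree, feats=None):
    """Count a RULE:X->Y1_Y2 feature for every rule occurrence in the tree."""
    if feats is None:
        feats = Counter()
    if isinstance(tree, str):        # a leaf word contributes no rule
        return feats
    label, children = tree[0], tree[1:]
    child_labels = [c if isinstance(c, str) else c[0] for c in children]
    feats["RULE:%s->%s" % (label, "_".join(child_labels))] += 1
    for child in children:
        rule_features(child, feats)
    return feats

# Example tree for "the dog barks":
tree = ("S", ("NP", ("D", "the"), ("N", "dog")), ("VP", ("V", "barks")))
feats = rule_features(tree)
```

Richer feature classes (bigrams of non-terminals, lexical head dependencies, and so on) would slot into the same recursion.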
-
Algorithms for Parse Reranking.
We've described various algorithms applied to parameter
estimation for parse
reranking: the perceptron, boosting, and log-linear models, for
example. Other methods are possible: for example, Winnow (an algorithm
that is related to the perceptron but has interesting properties), or
stochastic gradient descent. These algorithms haven't been applied to
the reranking problem in the past, but it would be interesting to see how they
perform. Datasets are available as files containing the parse data
that is input to the parameter estimation code.
If you're interested in this project, I'll give you pointers to
methods such as Winnow and stochastic gradient descent which you might
apply.
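For concreteness, the perceptron update for reranking can be sketched as below; the sparse feature-dict representation of each candidate is assumed for illustration, and the same loop is a natural starting point for swapping in Winnow-style or stochastic gradient updates.

```python
# A sketch of perceptron training for reranking. Each training example is
# (candidates, gold_index), where each candidate parse is a sparse feature
# dict; this representation is assumed for illustration, not the actual
# file format of the course datasets.

def score(w, feats):
    return sum(w.get(f, 0.0) * v for f, v in feats.items())

def perceptron_rerank(train, epochs=5):
    w = {}
    for _ in range(epochs):
        for candidates, gold in train:
            # pick the highest-scoring candidate under the current weights
            pred = max(range(len(candidates)),
                       key=lambda i: score(w, candidates[i]))
            if pred != gold:
                # move weights toward the gold parse, away from the mistake
                for f, v in candidates[gold].items():
                    w[f] = w.get(f, 0.0) + v
                for f, v in candidates[pred].items():
                    w[f] = w.get(f, 0.0) - v
    return w

# Toy data: the gold candidate always contains feature "a".
train = [([{"a": 1.0}, {"b": 1.0}], 0),
         ([{"b": 1.0}, {"a": 1.0}], 1)]
w = perceptron_rerank(train)
```

A Winnow variant would replace the additive updates with multiplicative ones; a stochastic gradient version would update toward the gradient of a log-linear objective on each example instead of only on mistakes.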
-
Building a Machine Translation System.
Resources for building statistical machine
translation systems exist in publicly available sources on the web:
for example, parallel aligned corpora in various language pairs; the
Giza++ system for training various IBM Models; and the ISI rewrite
decoder. In this project the goal would be to put together a
machine translation system using these tools.
-
Alignments in Machine Translation.
In Machine Translation Part IV, we went over phrasal
models for machine translation. These methods used heuristic methods
which started from the IBM model alignments, and then searched for
alignments between phrases in two languages (e.g., see the Koehn,
Och and Marcu paper). In this project you could use the
Giza++ system to induce initial, IBM Model alignments, then work
on heuristics to improve these alignments, and to find phrasal pairs.
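As a starting point, the standard symmetrization idea can be sketched as follows; this is a simplified toy version of the grow heuristics described in the Koehn, Och and Marcu paper, not their exact algorithm, with alignments represented as sets of word-index pairs.

```python
# A simplified sketch of alignment symmetrization: start from the
# intersection of the two directional IBM-model alignments, then grow
# with union points adjacent to an already-accepted point. A toy version
# of the grow heuristics in the Koehn, Och and Marcu paper.

def symmetrize(e2f, f2e):
    """e2f, f2e: sets of (i, j) word-index pairs from the two directions."""
    alignment = set(e2f & f2e)        # high-precision starting point
    union = e2f | f2e
    added = True
    while added:
        added = False
        for (i, j) in sorted(union - alignment):
            neighbours = {(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)}
            if neighbours & alignment:
                alignment.add((i, j))
                added = True
    return alignment

# Invented example alignments for a 3-word / 3-word sentence pair:
e2f = {(0, 0), (1, 1), (2, 1)}
f2e = {(0, 0), (1, 1), (1, 2)}
sym = symmetrize(e2f, f2e)
```

Phrase-pair extraction would then read consistent rectangular blocks off the symmetrized alignment.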
-
Language Modeling.
In Lecture 2 we went over language modeling techniques,
the goal being to come up with a statistical model with low
"perplexity" on naturally occurring text. Much of this work has been
done on English datasets; other languages might present very different
challenges. (For example, in Chinese it is not clear where word
boundaries occur, and in languages such as Czech the rich morphology
means that there is a huge number of possible word forms.) The aim of
this project would be to investigate language modeling for one of
these languages which has different properties from English. We should
be able to find relatively large datasets for this problem in a number
of languages.
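To make perplexity concrete, a minimal bigram model with add-one smoothing looks like this. The toy corpus is invented for the sketch; a real project would use the better smoothing techniques from Lecture 2 on much larger datasets.

```python
# A minimal bigram language model with add-one smoothing, to make the
# notion of perplexity concrete. Toy data, invented for illustration.

import math
from collections import Counter

def train_bigram(sentences):
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        toks = ["<s>"] + s + ["</s>"]
        unigrams.update(toks[:-1])
        bigrams.update(zip(toks[:-1], toks[1:]))
    vocab_size = len({w for s in sentences for w in s} | {"</s>"})
    return unigrams, bigrams, vocab_size

def perplexity(sentences, model):
    unigrams, bigrams, V = model
    log_prob, n = 0.0, 0
    for s in sentences:
        toks = ["<s>"] + s + ["</s>"]
        for u, w in zip(toks[:-1], toks[1:]):
            p = (bigrams[(u, w)] + 1.0) / (unigrams[u] + V)  # add-one
            log_prob += math.log(p)
            n += 1
    return math.exp(-log_prob / n)

data = [["the", "dog", "barks"], ["the", "cat", "sleeps"]]
model = train_bigram(data)
pp = perplexity(data, model)
```

For Chinese, the tokenization step itself becomes part of the model; for Czech, the vocabulary-size term V explodes, which is exactly where the modeling challenge lies.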
-
Implementing a Perceptron Tagger.
We just went over the idea of using the perceptron
algorithm to train a tagger that uses "local" feature
representations. In this project the goal would be to implement such a
tagger. If implemented in sufficiently general form, the method could
be applied to a number of tasks. The project could involve replicating
results for the tagger on previously studied problems such as POS
tagging of English; or applying the tagger to new problems such as
tagging in a language other than English.
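The core of such a tagger can be sketched as below, with a deliberately minimal, invented feature set (current word and previous tag) and greedy left-to-right decoding; a full project would use richer features and Viterbi decoding.

```python
# A sketch of a perceptron-trained tagger with "local" feature
# representations: the current word and the previous tag, decoded
# greedily left to right. The feature set is minimal and invented.

def features(word, prev_tag):
    return ["WORD=" + word,
            "PREV=" + prev_tag,
            "WORD+PREV=%s_%s" % (word, prev_tag)]

def tag(sentence, w, tagset):
    tags, prev = [], "<s>"
    for word in sentence:
        best = max(tagset, key=lambda t: sum(w.get((f, t), 0.0)
                                             for f in features(word, prev)))
        tags.append(best)
        prev = best
    return tags

def train(data, tagset, epochs=5):
    w = {}
    for _ in range(epochs):
        for sentence, gold in data:
            pred = tag(sentence, w, tagset)
            prev_g = prev_p = "<s>"
            for word, g, p in zip(sentence, gold, pred):
                if g != p:          # perceptron update on errors only
                    for f in features(word, prev_g):
                        w[(f, g)] = w.get((f, g), 0.0) + 1.0
                    for f in features(word, prev_p):
                        w[(f, p)] = w.get((f, p), 0.0) - 1.0
                prev_g, prev_p = g, p
    return w

# Invented toy data with a determiner/noun tagset:
data = [(["the", "dog"], ["D", "N"]),
        (["the", "cat"], ["D", "N"])]
tagset = ["D", "N"]
w = train(data, tagset)
```

Because the features are computed locally, applying the tagger to a new task or language mostly means rewriting the features function.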
-
Learning to Parse Efficiently.
The lexicalized statistical parsers we described earlier
in class (for example Eugene Charniak's parser, or the parser I
developed) can be computationally quite intensive when searching for
the most likely parse for a sentence. The search for the most likely
parse frequently involves tens of thousands of partial hypotheses
(entries in the dynamic programming
structures). Something that hasn't been investigated much in the past
is the idea of using machine learning to improve parsing
efficiency. This could be accomplished by using learning methods
to build a module that identifies partial hypotheses during search
which are very unlikely to be part of the final parse. You could use
the parser described in my thesis as the baseline system for this
problem; the goal would be to improve efficiency without compromising
accuracy. This approach could well lead to big efficiency gains.
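One way to set this module up, sketched below with invented features, weights, and hypothesis representation, is a binary classifier (here a simple logistic scorer) that predicts whether a chart entry will survive into the final parse, and prunes entries below a probability threshold.

```python
# A sketch of learned pruning for parsing: a logistic scorer over
# features of each partial hypothesis (chart entry), used to discard
# entries judged unlikely to appear in the final parse. The features,
# weights, and hypothesis representation are invented for illustration.

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def prune(hypotheses, w, threshold=0.1):
    """hypotheses: list of (id, feature_dict); keep the ids the model
    judges sufficiently likely to survive into the final parse."""
    kept = []
    for hid, feats in hypotheses:
        z = sum(w.get(f, 0.0) * v for f, v in feats.items())
        if sigmoid(z) >= threshold:
            kept.append(hid)
    return kept

# Invented example: a weight penalising one toy feature.
w = {"SPAN_LEN": -1.0}
hypotheses = [("h1", {"SPAN_LEN": 1.0}), ("h2", {"SPAN_LEN": 10.0})]
kept = prune(hypotheses, w)
```

Training data for the classifier could be generated by running the baseline parser and labeling each chart entry by whether it appears in the final parse; the threshold then trades efficiency against the risk of pruning away a correct entry.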