6.867 Project
Project due date: Wednesday, December 3
As a part of the assigned work for this course, you are required to
complete a project of your own choosing that is based on the material
of this course. The premise of the project must be closely related to
some aspect of the material but may explore an avenue that was left
unaddressed in class.
Project type and policies
There are various types of projects you can consider:
- The project may be very practical in terms of applying
techniques you have learned in the course to a real problem such as
classification of email messages.
- The project may involve designing or adapting existing
algorithms to a novel class of problems. For example, how might we
solve multiple related classification tasks? How can we improve
document clustering by designing a new clustering metric?
- The project may consist of a theoretical analysis of a method we
have discussed. For example, this may be in terms of complexity,
convergence, etc.
- The project can be a theoretical or more applied survey of a
branch of machine learning that we didn't go through in detail. For
example, you may write about the use of machine learning in natural
language processing or review sample complexity of machine learning
algorithms.
The project can be related to your research area (if you have
one). Do not submit anything you have completed prior to attending the
course. You also should not submit a project that is largely a
collaborative effort with people outside the course. For example, if
your research involves other people in a larger project, you could
propose to address a slightly different question (still related to
your research) but one that you are pursuing alone or in collaboration
with other students taking the course.
You can and are encouraged to collaborate with other students. If
you do, we ask that you outline the role of each person in the
project. Projects involving more than one person have to scale in
``size'' with the number of people.
Project proposal:
In order to help guide your choice of a project, you are strongly
encouraged to submit a brief project proposal (at most one or two
paragraphs) that describes your idea for the project, the work you
intend to perform (in rough terms), and all the people involved in the
project. You should submit the proposal via email to
6.867-staff@ai.mit.edu.
Project size and the final report:
We expect the ``size'' of your project to be roughly the amount of
work required for one and a half homework assignments. The project,
however, should be in some sense
``complete''. By this we mean that you should not ignore relevant
machine learning issues. In the final report you shouldn't just say
what you did but also why it was a reasonable thing to do given the
course material.
The final report should include at most four (4) pages of text
(12pt font, single spaced) per person (not including figures). You
shouldn't worry about getting ``great'' results. The idea and your
understanding of the machine learning issues involved are much more
important than getting ``great'' results.
Some examples:
There are many avenues that you may pursue for this project and we
encourage you to be creative even if you don't think you'll
necessarily get ``great'' results. Here are some ideas (the list is by
no means comprehensive):
- Comparison of algorithms: Throughout the course,
we've been discussing various algorithms and their properties, but
only on occasion have we applied these algorithms to real sets of
data. Often, algorithms don't work as expected and may need to be
adapted or modified to better fit the assumptions
inherent in the problem or the available data. What work needs to be
done to adapt a model to an interesting set of data that you've found?
How do various algorithms perform on the same set of data? Which
properties of the various algorithms account for such
performance?
- Missing information: Various real world classification
problems involve missing components in the input vectors. How can you
deal with such missing information? Do you expect your method to
degrade rapidly if more information is missing?
- Clustering metric: How do we cluster various types of
examples such as sequences? Can you devise a clustering metric or a
clustering algorithm that is appropriate in such cases? What if we
know that the examples can be transformed in various ways (e.g.,
translation of images) without changing their ``essence''? How can we
incorporate such prior knowledge into a clustering algorithm?
- The choice of the kernel function in SVMs: The kernel
function in SVMs defines how examples are to be compared. How do we
choose the kernel function? How could we adjust the kernel function if
we thought it should have a particular form? Can you adapt/design a
kernel function to a specific problem we are interested in solving?
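As a starting point for the kernel question in the last item above, the following is a minimal sketch of building a new kernel out of existing ones. It relies on the standard fact that a nonnegative weighted sum of valid kernels is itself a valid kernel. The function names, the `weight` and `gamma` parameters, and the toy points are illustrative assumptions, not part of the course material.

```python
import math

def rbf_kernel(x, z, gamma=0.5):
    """Gaussian (RBF) kernel: exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def linear_kernel(x, z):
    """Linear kernel: the ordinary inner product of x and z."""
    return sum(a * b for a, b in zip(x, z))

def combined_kernel(x, z, weight=0.5, gamma=0.5):
    """Convex combination of the RBF and linear kernels.

    `weight` in [0, 1] is a hypothetical knob that controls how much
    the comparison relies on local similarity (RBF) versus the
    global linear structure.
    """
    return weight * rbf_kernel(x, z, gamma) + (1.0 - weight) * linear_kernel(x, z)

def gram_matrix(points, kernel):
    """Matrix of all pairwise kernel values K[i][j] = kernel(x_i, x_j)."""
    return [[kernel(x, z) for z in points] for x in points]

# Toy data, made up purely for illustration.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
K = gram_matrix(points, combined_kernel)
```

A Gram matrix such as `K` is what an SVM solver that accepts precomputed kernels would consume. A project along these lines might compare several values of `weight`, or replace one of the component kernels with one tailored to the problem at hand.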
Other project ideas include:
- Simple language modeling using Markov models/HMMs
- Improve classification by using EM with unlabeled data
- Selecting the number of mixture components based on data
- Clustering input attributes, designing clustering metrics
- Image/email/biosequence classification
- Detecting abnormal/novel examples in a stream of data
- Recognition of acoustic features (towards speech recognition)
- Creating a backgammon/go/chess/etc. player
Some data repositories you might find useful:
UCI ML Repository (Various)
UCI KDD Repository (Various)
Protein data bank
20 Newsgroups (Text)
Reuters Documents (Text)
Genome data
Gene expression data