6.863J/9.611J Natural Language Processing


Parsing projects



 

  1. [For syntax fans] Extend the ‘slash’ rules from Lab 3b to handle all of the roughly 100 distinct sentences in the original Generalized Phrase Structure Grammar paper, either with or without features.  The reference: Gerald Gazdar, “Unbounded dependencies and coordinate structure,” Linguistic Inquiry, 12, 155-184 (1981). Reprinted in Walter J. Savitch, Emmon Bach, William Marsh and Gila Safran-Naveh, eds., The Formal Complexity of Natural Language, Dordrecht: Reidel, 183-226 (1987).  This is fun if you like syntax (trust me).
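To give the flavor of the slash mechanism: a rule like S/NP means “an S with an NP gap somewhere inside,” and a metarule derives the slashed rules from the base grammar (A → … B … yields A/NP → … B/NP …, with the gap discharged where NP/NP → ε would apply). Here is a minimal sketch, assuming a toy grammar and hand-derived slash rules (not the actual Lab 3b rules), run through a small CKY recognizer:

```python
# Toy sketch: CKY recognition over hand-derived GPSG-style slash rules.
# Category names like 'S/NP' are plain strings; the grammar below is an
# illustrative assumption, not the Lab 3b grammar.

LEXICON = {"who": {"NP"}, "you": {"NP"}, "like": {"V"}}

BINARY = [                     # A -> B C
    ("S",    "NP", "VP"),
    ("VP",   "V",  "NP"),
    ("S/NP", "NP", "VP/NP"),   # slash feature passed down one branch
    ("TOP",  "NP", "S/NP"),    # filler rule: wh-phrase + clause with gap
]
UNARY = [("VP/NP", "V")]       # VP -> V NP with the NP gap discharged

def unary_closure(cats):
    """Close a set of categories under the unary (gap-discharge) rules."""
    cats, changed = set(cats), True
    while changed:
        changed = False
        for a, b in UNARY:
            if b in cats and a not in cats:
                cats.add(a)
                changed = True
    return cats

def recognize(words, goal="TOP"):
    """Plain CKY over the binary rules, with unary closure on each cell."""
    n, chart = len(words), {}
    for i, w in enumerate(words):
        chart[(i, i + 1)] = unary_closure(LEXICON.get(w, set()))
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            k, cats = i + width, set()
            for j in range(i + 1, k):
                for a, b, c in BINARY:
                    if b in chart[(i, j)] and c in chart[(j, k)]:
                        cats.add(a)
            chart[(i, k)] = unary_closure(cats)
    return goal in chart[(0, n)]

print(recognize("who you like".split()))   # object gap: filler-gap link found
print(recognize("you like who".split()))   # no gap: not a TOP, just an S
```

The unbounded part of “unbounded dependency” falls out for free: the slash feature can be passed down through any number of S/NP and VP/NP layers before being discharged.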
  2. Build a set of feature-based context-free rules that will handle Spanish, Italian, German, Hebrew, Gaelic, Russian, Esperanto, … sentences.  Please include sentences with conjunctions (and discuss this with me in advance). (Yes, people have done Gaelic in the past.)  (For Russian, see the last item – it is a ‘free’ word order language.)
  3. Examine the Scheme parser in http://www.mit.edu/~niyogi/minimal.htm  (See http://www.mit.edu/~niyogi/m0.scm ) and (i) add a probabilistic component to the rules; or (ii) add a well-formed substring table to the method it uses for parsing.  This method is based on very recent linguistic theories (‘minimalism’) – I can give you references on that, too.
  4. Link grammars: Look at another method for parsing distinct from forming ‘tree structures’ – e.g., ‘link grammar’, which builds syntactic relations between words, rather than trees.  Working parser for windoze/linux/etc is available, along with many details; see http://www.link.cs.cmu.edu/link/. Compare this to ‘traditional’ context-free parsing for natural languages – which one works better?  How could you tell?
  5. Investigate optimizations/efficiency hacks for Earley parsing:  precomputing ‘chains’ of rules; lookahead; using an ‘ill-formed’ substring table; matrix multiplication. I can give you a reference for this from Graham, Harrison, and Ruzzo.  
  6. [For psychologically oriented folks] Push deeper into the distinctions between shift-reduce/shift-shift conflicts as psychologically valid models for human sentence processing – implement the 2-stage Chomsky-Miller parser, based on intonational structure/shallow phrase breaks, followed by tree ‘readjustment’.
  7. Extend a feature parser so that it can be used in a ‘relaxation mode’ for grammar checking (a la Microsoft Word).
  8. Compare ‘deep’ (ordinary context-free) vs. ‘shallow’ chunking parsers for a task like message understanding: see the parser in the course locker /mit/6.863/sco1h/ (and see me for assistance in using it); there is an associated paper by S. Abney  scol.ps in this directory under the /doc subdir. This parser is also on the CD that was handed out at the beginning of the term.
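To see what “shallow” means concretely, here is a minimal sketch of one level of an Abney-style chunking cascade: a finite-state NP chunker over POS tags (the tags and the NP pattern DT? JJ* NN+ are simplifying assumptions; a real system would run a tagger first and stack several cascade levels):

```python
# Sketch: one level of a chunking cascade -- greedy finite-state NP
# chunking over (word, tag) pairs, pattern NP -> DT? JJ* NN+.

def np_chunks(tagged):
    """Return the NP chunks found by a single greedy left-to-right pass."""
    chunks, i, n = [], 0, len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "DT":               # optional determiner
            j += 1
        while j < n and tagged[j][1] == "JJ":  # any number of adjectives
            j += 1
        k = j
        while k < n and tagged[k][1] == "NN":  # one or more nouns
            k += 1
        if k > j:                              # found at least one noun
            chunks.append(" ".join(w for w, _ in tagged[i:k]))
            i = k
        else:
            i += 1                             # not an NP start; skip a word
    return chunks

TAGGED = [("the", "DT"), ("old", "JJ"), ("man", "NN"),
          ("saw", "VB"), ("a", "DT"), ("boat", "NN")]
print(np_chunks(TAGGED))   # ['the old man', 'a boat']
```

Unlike a full context-free parse, this never commits to attachment decisions – which is exactly the trade-off the project asks you to evaluate on a message-understanding task.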
  9. Build a parser that directly computes precedence and dominance relations, rather than using context-free rules, and see if you can modify chart parsing to handle this. [for algorithm/programming fans]
  10. Parallel context-free parsing: using a chart parser, or parallel stacks.  Harder: fold probabilistic rules into the system.   
  11. Implement a method to adjudicate between structurally different parses using semantic information on-line – e.g.,  use the semantic interpreter to rule out certain subtrees as the Earley algorithm tries to build them, much as token lookahead is used, to ‘fix’ examples like “I shot the elephant in my pajamas.”
  12. [For linguistically oriented folks] Use a feature-based grammar system to implement a part of a ‘real’ linguistic theory, like Chomsky’s ‘Government-Binding’ theory. I have a reference paper for this as well, to see which aspects should be implemented.
  13. Probabilistic context-free parsing:  See http://www.cog.brown.edu/~mj/code/inside-outside.tgz for a version written in C.  Project: apply this to a corpus of your choosing.
  14. Apple Pie probabilistic parser: see http://nlp.cs.nyu.edu/app/index.html for a downloadable probabilistic parser; a windoze executable is available.  Project: apply this to a corpus of your choosing, and evaluate the results.
  15. Probabilistic context-free parsing to find rules from treebanks: see  http://www.cog.brown.edu/~mj/code/cky.tbz  along with the rule induction method in  http://www.cog.brown.edu/~mj/code/johnson-97.pdf  
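The core computation behind items 13–15 fits in a short sketch: Viterbi CKY over a PCFG in Chomsky normal form, returning the probability of the best parse. The toy grammar and probabilities below are made up for illustration; the packages linked above do the same thing at treebank scale, with the rule probabilities induced from data:

```python
# Sketch: Viterbi CKY over a toy PCFG in CNF.  best[(i, k, cat)] holds the
# probability of the best derivation of cat over words[i:k].
from collections import defaultdict

PRULES = [                     # (lhs, rhs, probability); toy numbers
    ("S",  ("NP", "VP"), 1.0),
    ("VP", ("V", "NP"),  0.7),
    ("VP", ("VP", "PP"), 0.3),
    ("NP", ("NP", "PP"), 0.2),
    ("NP", ("she",),     0.4),
    ("NP", ("fish",),    0.4),
    ("PP", ("P", "NP"),  1.0),
    ("V",  ("eats",),    1.0),
    ("P",  ("with",),    1.0),
]

def viterbi_cky(words):
    n = len(words)
    best = defaultdict(float)          # missing entries default to prob 0.0
    for i, w in enumerate(words):      # lexical rules fill the diagonal
        for lhs, rhs, p in PRULES:
            if rhs == (w,):
                best[(i, i + 1, lhs)] = max(best[(i, i + 1, lhs)], p)
    for width in range(2, n + 1):      # binary rules combine smaller spans
        for i in range(n - width + 1):
            k = i + width
            for j in range(i + 1, k):
                for lhs, rhs, p in PRULES:
                    if len(rhs) == 2:
                        score = p * best[(i, j, rhs[0])] * best[(j, k, rhs[1])]
                        if score > best[(i, k, lhs)]:
                            best[(i, k, lhs)] = score
    return best[(0, n, "S")]

print(viterbi_cky("she eats fish".split()))   # 1.0 * 0.4 * (0.7 * 1.0 * 0.4)
```

Keeping backpointers alongside `best` recovers the tree itself; the inside-outside code above additionally sums (rather than maximizes) over derivations to re-estimate the rule probabilities.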
  16. Build an automatic phrase structure compiler: a system that (i) takes as input tree structures that a linguist would build and outputs the associated context-free rules for building those trees; and then (ii) given a collection of tree structures, induces a metarule that generates that collection.
  17. Parsing and free word order languages.   We can generalize context-free rules so that they have ‘free’ right-hand sides, in the form of sets, e.g., VP → {V, NP} means either VP → V NP  or  VP → NP V. Generalize the Earley parser so that it can handle rules of this sort. Establish the time complexity of this extended algorithm. Apply it to a language with ‘free’ word order, like Russian, etc.
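The naive baseline makes the challenge clear: expand each set-valued rule into all of its orderings and hand the result to an ordinary parser. A minimal sketch (rule names are illustrative):

```python
# Naive baseline for set-valued rules: expand VP -> {V, NP} into every
# ordering, so any off-the-shelf CFG parser (e.g. Earley) applies unchanged.
from itertools import permutations

def expand_set_rules(set_rules):
    """Turn (lhs, set-of-rhs-symbols) pairs into ordinary CFG rules."""
    cfg = []
    for lhs, rhs in set_rules:
        for order in permutations(sorted(rhs)):   # sorted for determinism
            cfg.append((lhs, list(order)))
    return cfg

RULES = expand_set_rules([("VP", {"V", "NP"}),
                          ("S",  {"NP", "VP", "AdvP"})])
print(len(RULES))   # 2! + 3! = 8 ordinary rules
```

A rule with n right-hand-side symbols blows up into n! ordinary rules, which is exactly why the project asks you to generalize the Earley items themselves – e.g., replace the dotted position with the set of symbols still unconsumed – and then work out the time complexity of the result.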




 