0.0 Course & Homework Discussion
1.0 Suggestions/Complaints
0.1.0 Welcome to the 6.863 Course Discussion Board.  Feel free to post comments and replies, or questions about the homework for anyone to answer.  This discussion board is optionally anonymous, so feel free to post your questions.

Thanks and Welcome to Natural Language Processing!

0.2.0 Course Survey

If you filled out the course survey before 4 today, you'll have to redo it.  We're sorry for any inconvenience.

- Catherine
Catherine
1.1.0 If you have a problem with the course software, please email nlp-dev@mit.edu.  For any other problems, including webpage and lab issues, please email me directly.

0.3.0 Documentation - Is there any documentation for the lexicon file format?  

Also, what does the output of the 'tracing' window mean?

0.4.0 > - section 'what you must do' 
> refers to a non-existent section 4 
>(there are 2 sections 3 so i assume 
>that the second one is meant)

It refers to section 3.

> - the same section says that there
> are 5+1 questions in section 4 - i
> can only locate 4 of them

There are only 4 questions.

- Catherine
akiezun
0.3.1.0 The tracing window shows you what the automaton is doing.  The letters in brackets are the letters it is currently trying.  The numbers with arrows are the states that the machines are transitioning between (in order).

Example: [1->2] [1->1] means that the first automaton is going from state 1 to state 2, and the second is staying in state 1.
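
To make the bracket notation concrete, here is a minimal sketch that spells out a trace string in words. It assumes only the format shown in the example above (one "[from->to]" group per automaton, in order); the function name describe_trace is hypothetical and is not part of pykimmo.

```python
import re

# Assumed trace format, based only on the example above:
# one "[from->to]" group per automaton, listed in order.
TRANSITION = re.compile(r"\[(\d+)->(\d+)\]")

def describe_trace(trace):
    """Return one human-readable line per automaton transition."""
    lines = []
    for index, (src, dst) in enumerate(TRANSITION.findall(trace), start=1):
        if src == dst:
            lines.append(f"automaton {index}: stays in state {src}")
        else:
            lines.append(f"automaton {index}: state {src} -> state {dst}")
    return lines

for line in describe_trace("[1->2] [1->1]"):
    print(line)
```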

For the lexicon file - you can look at /mit/6.863/pykimmo/english.lex

0.5.0 Do we need other rules besides the ones explicitly mentioned??

On p.16 of the handout, it mentioned that cojas must be parsed as coja+as. Do we have to follow that? Or can we parse it as cog+as and apply g:j mutation?

In general, do we just have to be able to recognize the words in spanish.rec? Or do we also have to pass some generation tests? If so, can you give us examples?

0.5.1.0 Hi - in case Catherine hasn't answered this: it's fine to follow cog+as
and use g->j mutation, as long as you
can get coja+as to work out (ie,
surface as its correct spelling).
There are no explicitly provided
generation tests (ie, hidden ones),
because there are multiple ways
to build the underlying forms.  Your
underlying forms should try to mesh
with the 'linguistics' - eg g->j, as
above.

0.5.1.1.0 What do you mean by "get coja+as to work out"?

0.5.1.1.1.0 As it said in the original:
"(ie,surface as its correct spelling). "
This is what 'work out' (= generate properly) comes to.

0.5.1.1.1.1.0 so if our lexical root for coger is cog, do we still need to have coja+as generate properly?

0.6.0 I looked up cuezas on verba.org and it showed the conjugation to be cozamos for 1p pl.  Should we use this conjugation, or the one given in the handout to make it simpler?

0.5.1.1.1.1.1.0 Your job is  to make sure that
your roots + affixes surface properly,
so if you posit 'cog' as the root, and
then 'coj+as' as the root plus the
suffix, it should surface as 'cojas'.
But this all has to fit into the
rest of your system and the other
spelling change rules.  (Of course,
when you are done, recognizing
'cojas' should return your root
plus the suffix - ie, it should be the
exact inverse of generation.)
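
As a rough illustration of "recognition is the exact inverse of generation", here is a toy sketch in plain Python. It is not PC-KIMMO rule syntax, and the names ROOTS, SUFFIXES, generate and recognize, as well as the tiny suffix list, are assumptions made up for the example. The single spelling rule is the g->j change discussed above: root-final g surfaces as j when the suffix starts with a or o.

```python
import re

# Hypothetical mini-lexicon, for illustration only.
ROOTS = {"cog"}
SUFFIXES = {"o", "es", "e", "as", "a"}

def generate(lexical):
    """Map a lexical form such as 'cog+as' to its surface spelling."""
    # g surfaces as j when the suffix begins with a or o.
    surface = re.sub(r"g\+(?=[ao])", "j+", lexical)
    # The morpheme boundary is not spelled on the surface.
    return surface.replace("+", "")

def recognize(surface):
    """Return every root+suffix analysis that generates the surface form."""
    analyses = []
    for root in ROOTS:
        for suffix in SUFFIXES:
            lexical = f"{root}+{suffix}"
            if generate(lexical) == surface:
                analyses.append(lexical)
    return sorted(analyses)

print(generate("cog+as"))
print(recognize("cojas"))
```

Recognition here is defined by running generation over candidate analyses, so the two directions cannot drift apart. A real two-level system gets the same guarantee from the automata rather than from brute-force search.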

0.6.1.0 Either one is OK.  Actually, it's a good example of what happens if your 'field linguist' gets the data wrong. See if you can do cozamos anyway.


0.7.0 there's a final exam time posted on the registrar's page. it's the 21st.
we actually have our projects due by the 14th, is that right?

0.8.0 FYI: The link off lab 2 for Eric Brill's "Transformation-Based Error-Driven Learning and Natural Language Processing" is broken......

0.8.1.0 Thanks, checking on it

0.8.2.0 Fixed!


0.7.1.0 There's no final exam, so we'll
get the Registrar to remove that, thanks
for the tip.


0.9.0 For lab 2b, I'm running the java program to find differences between our tagged texts and the gold standard, and something is going wrong. First of all, for a few of the  texts all that appeared in the output html file was "Premature end of tagged input". Secondly, some of the files for individual texts display more errors for one text than the file for the "all" text displays for all the texts combined. So something appears not to be working right. Any suggestions?

0.9.1.0 Can you send me the files you are working with and the results you received?

- Catherine

0.9.2.0 I think the problem is fixed now - try again.

0.9.2.1.0 I'm also experiencing the same problem as of 1:16AM 02/27/04.  More specifically, I'm finding that some of the individual files like ce04.raw had more differences than all.raw.  The same error message appears: "Premature end of POS - line:1233" in the html file generated from comparing the tagged hmm text of ce04.raw to ce04.pos

0.10.0 I'm still getting more errors in individual files, such as wsj_1975, than in the "all" file.

0.11.0 I'm experiencing errors in Part II when running compare-taggers.pl

For cj01, sw2019, and all texts that were tagged by HMM, compare-taggers.pl outputted warnings like below:

~~~~~~~~~~~~~~ for cj01 ~~~~~~~~~~~~~~~

/mit/6.863/tagging> compare-taggers.pl -h -m -k -x ~/6.863/lab2/hmm/cj01_hmm.txt cj01.pos > ~/6.863/lab2/hmm/kappa/cj01_hmm_kappa.csv
WARNING: different number of words on lines.
  6_CD '_''._. 7_CD  
 6'.7/CD 
WARNING: different number of words on lines.
  18_CD '_''._. 5_CD  
 18'.5/CD 

~~~~~~~~~~~~~~ for sw2019 ~~~~~~~~~~~~~~

/mit/6.863/tagging> compare-taggers.pl -h -m -k -x ~/6.863/lab2/hmm/sw2019_hmm.txt sw2019.pos > ~/6.863/lab2/hmm/kappa/sw2019_hmm_kappa.csv
WARNING: different number of words on lines.
 the_DT shap_NN -_: ,_,  the_DT shape_NN  of_IN the_DT ,_,
the/DT shap-/NN ,/, the/DT shape/NN of/IN the/DT ,/,
WARNING: different number of words on lines.
V_NN  -_: er_NN  ,_, and_CC  it_PRP   's_VBZ  ,_,  it_PRP 
V/NN -er/NN ,/, and/CC it/PRP 's/BES ,/, it/PRP
WARNING: different number of words on lines.
uh_UH ,_,  she_PRP  ,_,  she_PRP   was_VBD alm_RB -_: ,_,
uh/UH ,/, she/PRP ,/, she/PRP was/VBD alm-/XX ,/,

~~~~~~~~~~~~~~ for all ~~~~~~~~~~~~~~~

/mit/6.863/tagging> compare-taggers.pl -h -m -k -x ~/6.863/lab2/hmm/all_hmm.txt all.pos > ~/6.863/lab2/hmm/kappa/all_hmm_kappa.csv
WARNING: different number of words on lines.
  6_CD '_''._. 7_CD  
 6'.7/CD 
WARNING: different number of words on lines.
  18_CD '_''._. 5_CD  
 18'.5/CD 
WARNING: different number of words on lines.
 the_DT shap_NN -_: ,_,  the_DT shape_NN  of_IN the_DT ,_,
the/DT shap-/NN ,/, the/DT shape/NN of/IN the/DT ,/,
WARNING: different number of words on lines.
V_NN  -_: er_NN  ,_, and_CC  it_PRP   's_VBZ  ,_,  it_PRP 
V/NN -er/NN ,/, and/CC it/PRP 's/BES ,/, it/PRP
WARNING: different number of words on lines.
uh_UH ,_,  she_PRP  ,_,  she_PRP   was_VBD alm_RB -_: ,_,
uh/UH ,/, she/PRP ,/, she/PRP was/VBD alm-/XX ,/,
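
One way to narrow down warnings like these is to check token counts line by line before running the comparison. The sketch below is a hypothetical helper, not part of compare-taggers.pl; it assumes both files have already been read into lists of lines and that tokens are separated by whitespace.

```python
def token_count_mismatches(tagged_lines, gold_lines):
    """Return (line_number, tagged_count, gold_count) for each pair of
    lines whose whitespace-separated token counts differ."""
    mismatches = []
    for number, (tagged, gold) in enumerate(zip(tagged_lines, gold_lines), start=1):
        n_tagged = len(tagged.split())
        n_gold = len(gold.split())
        if n_tagged != n_gold:
            mismatches.append((number, n_tagged, n_gold))
    return mismatches

# One of the line pairs from the warnings above.
tagged = ["the_DT shap_NN -_: ,_,  the_DT shape_NN  of_IN the_DT ,_,"]
gold = ["the/DT shap-/NN ,/, the/DT shape/NN of/IN the/DT ,/,"]
print(token_count_mismatches(tagged, gold))
```

In this pair the mismatch comes from the hyphen: one file splits "shap-" into two tokens and the other keeps it as one, so the word counts differ even though the text is the same.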


0.12.0 I don't quite understand what the dimensions of the "histogram" for the first bullet in Part II should be.  Can you please clarify by stating exactly what kind of histogram you want?  Thank you.

0.13.0 I am also having problems in Part II of lab2b with compare-taggers.pl. 
(1)First of all, for the "all" file I got a kappa of 0.9127 with the HMM tagger (and something similar with the brill tagger) but a kappa of only 0.8810 for "wsj_1975", which really doesn't sound right to me. 
(2)Secondly, I'm also getting errors such as:

WARNING: different number of words on lines.
  6_CD '_''._. 7_CD
 6'.7/CD
WARNING: different number of words on lines.
  18_CD '_''._. 5_CD
 18'.5/CD
WARNING: different number of words on lines.
 the_DT shap_NN -_: ,_,  the_DT shape_NN  of_IN the_DT ,_,
the/DT shap-/NN ,/, the/DT shape/NN of/IN the/DT ,/,
WARNING: different number of words on lines.
V_NN  -_: er_NN  ,_, and_CC  it_PRP   's_VBZ  ,_,  it_PRP
V/NN -er/NN ,/, and/CC it/PRP 's/BES ,/, it/PRP
WARNING: different number of words on lines.
uh_UH ,_,  she_PRP  ,_,  she_PRP   was_VBD alm_RB -_: ,_,
uh/UH ,/, she/PRP ,/, she/PRP was/VBD alm-/XX ,/,


0.11.1.0 Try using compare-taggers.old.pl
and see if you get the same error
msg.  (It doesn't have the same
nice csv output or -x switch,
however - it just does
histograms) - pls tell me and
catherine whether you get the
same errors with this.
As for all.pos - that is something
for you to figure out, in fact.
Part of the exercise...makes it
truly realistic, actually.

berwick@ai.mit.edu
0.13.2.0 Try using compare-taggers.old.pl
and see if you get the same error msgs.
(You won't have the -x switch for that
older pgm).
whatever the results, pls can you
email me and catherine if there's any
difference?
thanks, bob

berwick@ai.mit.edu
0.14.0 I am getting most of my kappa values for both taggers in the .92-.94 range, with no significant difference between genres. In the lab handout it said that typically excellent values for kappa are above .8. Something doesn't seem right......

0.15.0 It says in the lab handout that the confusion matrix coefficients in the output of compare-taggers are a count of the number of times that one tag was incorrectly substituted for another - but the coefficients are not whole numbers, so they can't be a count. Are they some sort of percentage, or what?

0.15.1.0 Sorry, a brain-o.  They are normalized
counts. Bob
berwick@ai.mit.edu
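
To show how a confusion "count" can end up as a fraction, here is a minimal sketch of one possible normalization: each (gold tag, predicted tag) error count divided by the total number of errors. This is an assumption for illustration only; the thread does not say how compare-taggers.pl normalizes, and the function name normalized_confusions is hypothetical.

```python
from collections import Counter, defaultdict

def normalized_confusions(gold_tags, predicted_tags):
    """Return {(gold, predicted): fraction of all tagging errors}."""
    counts = defaultdict(Counter)
    for gold, predicted in zip(gold_tags, predicted_tags):
        if gold != predicted:
            counts[gold][predicted] += 1
    # Assumed normalization: divide by the total number of errors.
    total = sum(sum(row.values()) for row in counts.values())
    return {
        (gold, predicted): n / total
        for gold, row in counts.items()
        for predicted, n in row.items()
    }

gold = ["NN", "NN", "VB", "DT"]
predicted = ["VB", "VB", "NN", "DT"]
print(normalized_confusions(gold, predicted))
```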
0.12.1.0 X axis:  the different corpora
(no scale of course)
Y axis: kappa values
2 histogram bars for each corpus,
one for Brill, one for HMM

0.14.1.0 You are getting exactly the right answers.
Both taggers do very well.  You have
to dig down into details to see any
differences, with some corpus sub-comparisons.
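
For reference, here is a minimal sketch of how a kappa value can be computed from two tag sequences. It assumes the standard Cohen's kappa definition, (observed agreement - chance agreement) / (1 - chance agreement); whether compare-taggers.pl computes exactly this is not confirmed in this thread, and the function name cohen_kappa is hypothetical.

```python
from collections import Counter

def cohen_kappa(tags_a, tags_b):
    """Cohen's kappa for two equal-length sequences of tags."""
    if len(tags_a) != len(tags_b) or not tags_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(tags_a)
    # Fraction of positions where the two taggings agree.
    observed = sum(a == b for a, b in zip(tags_a, tags_b)) / n
    # Agreement expected by chance, from each tagging's tag frequencies.
    freq_a = Counter(tags_a)
    freq_b = Counter(tags_b)
    expected = sum(freq_a[tag] * freq_b[tag] for tag in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both taggings use a single identical tag
    return (observed - expected) / (1 - expected)

gold = ["DT", "NN", "VB", "NN"]
tagger = ["DT", "NN", "VB", "VB"]
print(cohen_kappa(gold, tagger))
```

Because kappa subtracts chance agreement, two taggers that are both very accurate will both score high, which fits the point above that the differences only show up in the details.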


0.11.1.1.0 I used compare-taggers.old.pl without the -x switch and still got the same errors.  Are these errors important?  If not, I can proceed with the lab.  Thank you.

0.16.0 Lab 2b: Part I
In discussing the tagging errors for brill, do we have to trace through how it made the error (like in lab2a)?

0.17.0 Last Week's Lecture slides?
Can you post them online so we can use them as reference for Lab 2?

0.16.1.0 It would be good to discuss *why*
the errors occurred... which
means figuring out where
it went astray.
thanks, bob

berwick@ai.mit.edu
0.17.1.0 Yes, I'm sorry I've been slow - 
I am posting these now.
Bob

berwick@ai.mit.edu
0.9.2.1.1.0 Still having problems (Mar 01, 21:25)
I just reran the java program on my HMM files.
I'm getting the "Premature end of POS" error on:
ce04.html:Premature end of POS - line:1233
cf05.html:Premature end of POS - line:1177
cj01.html:Premature end of POS - line:1128
cm05.html:Premature end of POS - line:1288
cp01.html:Premature end of POS - line:1318
cr03.html:Premature end of POS - line:1348

0.17.2.0 This is now done.

0.18.0 In Lab2b, part 1, we are instructed to include "Log files exhibiting the tagging errors you are discussing"
What is meant by 'log file'?
gerber@mit.edu
0.18.1.0 You should link to the output of the program comparing the two taggers.

0.9.2.1.1.1.0 me too

0.13.2.1.0 same errors with the old.pl and no -x flag - no difference

0.19.0 LAB 3A: Earley Parsing

For Question 8, it asks us to use the sentence "I saw my dog with a cookie". However, in previous sections of the lab, we have been using "John ate the cookie on the table", should we use this instead since we need to compare?

0.19.1.0 Ah, probably my brain-o.  You should
use comparable sentences to compare,
of course... 
best, bob

berwick@ai.mit.edu
0.20.0 How can we incorporate the shots of the shift reduce parser state into our html document lab writeup? 

0.21.0 What's going on with Lab 3b? It was supposed to be posted on Wednesday. When will it get posted? When will it be due?

0.22.0 Project Ideas? Would anyone like to start discussing some project ideas?

0.23.0 How do I comment out a line in an Earley parser .grm file?  Thanks.

0.23.1.0 Hi, use the python comment character #
Put it at the beginning of the line & leave a space, eg
# S -> NP VP

bob

berwick@ai.mit.edu
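
The comment convention described above can be sketched as follows. This is an illustration only, not the course parser's actual loader: the function name active_rules and the sample rules are hypothetical, and the sketch treats any line whose first non-blank character is # as a comment.

```python
def active_rules(grammar_text):
    """Return the grammar lines that are not blank and not commented out."""
    rules = []
    for line in grammar_text.splitlines():
        stripped = line.strip()
        # Skip blank lines and lines commented out with '#'.
        if not stripped or stripped.startswith("#"):
            continue
        rules.append(stripped)
    return rules

sample = """\
S -> NP VP
# VP -> V NP
NP -> Det N
"""
print(active_rules(sample))
```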
0.24.0 --drawtrees (-d) option doesn't seem to be working for earley in batch mode.  Here is the command I'm typing:

earley --batch=... --drawtrees

I've also tried -d instead of --drawtrees as well as putting --drawtrees and -d before --batch to no avail.  In all these cases, the parser just outputs the parses in text but doesn't show any trees in any gui window.  Any suggestions?  Thank you.

0.22.1.0 Sure! I'd love to get an early start on this final project, and avoid all nighters in May.....

0.20.1.0 Hi, did catherine answer
this for you?  If you are
using Xwindows, you can use
the Xwindows windows snapshot
tools, to capture 'png' files...
if this is all a new topic
to you, that's ok - i can
tell you more. (or ask catherine)
best, bob

berwick@ai.mit.edu
0.24.1.0 I wrote the parser, and you're right, batch doesn't work with drawtrees. I didn't code it because I didn't see why anyone would want it.

What do you think it should do?

0.25.0 Is there any way of finding out how we did on the labs we've handed in so far? I never got back any grades / comments.....

0.26.0 In lab 3b, everywhere where it says to hand in "output from the parser" showing that it has parsed certain sentences correctly, do we just put in the postscript picture of the parse tree(s) produced, or do you also want to see anything else, like the chart matrix or the edges produced during the parse? 
Also, what do we put in our lab report to show that a sentence was rejected by the parser?

0.26.1.0 Hi,
The chart matrix or the edges
would serve to show both
that you have the right rules
for the correct parses
and that your grammar rejects
the 'starred' sentences...
i'll put that clarification
in, thanks. bob

berwick@ai.mit.edu
0.24.1.1.0 Hi, for the lab report, we're required to output final parser charts for a bunch of sentences for each question.  It would be very labor intensive to do this manually as there are 6-10 sentences for each question.  

So if you can code something that outputs the final parser charts for each sentence of a batch run to a user supplied directory, that would be very helpful.  

Thank you.

0.27.0 Project list? What's the link?

0.24.1.1.1.0 By final parser charts I mean the edges

0.28.0 I can't screen-capture the whole set of edges in the Earley final chart panel because of the scrolling.  Even if I resize the window, the chart area doesn't scale.  Any suggestions?

0.25.1.0 If you never got a grade for lab 1a, please email me since I probably sent it to an address which you do not use.  I only got the grading metrics from Bob for the other labs a day or two ago and couldn't grade them until I had that information.

I have a midterm on Wednesday and hope to have all the labs I have been given metrics for (1b, 2a and 2b) graded by Thursday.  I will then send all sorts of information to each one of you on how you're doing.

- Catherine

0.28.1.0 Use text batch mode.  I don't think you ever need to show all the edges.  

- Catherine

0.26.2.0 Catherine is right:
I take it back. It's a bad idea
to use screen capture; use batch
mode.
bob


0.27.1.0 Check the front page, it's up now.

- Catherine

1.2.0 I could be blind but I don't see the file lab3b-sentences.txt in the python-earley directory.  Is it somewhere else or under a different name?

0.29.0 Where is the full list of sentences we have to handle for Lab 3b? I don't see it under 6.863/python-earley

0.30.0 Lab3b-Grammar1:what is meant by full preposition for the verb "thought"? Can we assume it means just full sentence for this part of the lab?

0.31.0 Lab3b-Grammar1:do we have to handle "believed"?

0.32.0 Is anyone looking for another final project member? While flexible, I'm more interested in the Semantic Interpretation or System Building... Please email me at fumi@mit.edu, thanks!
fumi@mit.edu
0.33.0 For the final project, if anyone is interested in building a smarter search engine using NLP, please email me at wlz@mit.edu.  I'm thinking for this project, maybe attack a piece of this problem -- possibly restricting to a subject (i.e. shopping) or functional (i.e. query processor or spider) domain.

0.34.0 Any idea when lab 4a will be posted? I assume it won't really be up this Monday, right?

0.30.1.0 A typo.  It should read "full proposition",
so that means 'a full sentence'
bob


0.29.1.0 Permit bits fixed - try it now:
sentences-lab3b.txt
bob


0.31.1.0 You should be able to - it's the
same as 'think'. but do 'think'
at least.
bob


0.34.1.0 Hi,
Spot-on.. this won't go out until Weds.,
Apr 14 and again it's a warmup exercise for 4b.
bob


1.2.1.0 Hi, not your fault... the permission
bits were off.  Hard to see thru
the unix haze for anyone...
The file is there under both names:
lab3b-sentences.txt and sentences-lab3b.txt.

bob


0.35.0 Where is lambda.pdf posted?

0.35.1.0 On: www.ai.mit.edu/courses/6.863/lambda.pdf

bob


0.37.0 The problem statement of 2.6.1 states that we should modify the grammar from section 2.3 with features to handle the sentences of 2.3.  But in sentences-lab3b.txt in the course directory, you've indicated that 2.6 should handle all of 2.5 sentences also.  So should we handle the sentences up to 2.3 or up to 2.5 (+ the new sentence "To whom did Poirot...")?  Thank you.


0.37.1.0 Hi, your grammar should be able
to do everything up to 2.5 +
the new sentence, "To whom did Poirot.."
Does that help?
best, bob


0.38.0 Would anyone like to compare answers for the grammar size calculations we need to do in section 2.5 (with vs w/o features)?

0.39.0 Will labs 4a and 4b be shorter than the others, since we have so little time to do them?

0.39.1.0 Will there be lab5? The labs are very long...

0.40.0 Section 2.6 of Lab3b: are we still allowed to use WhNP nonterminal? we just have to remove S-WhNP and V-WhNP, right?

0.41.0 What is the grading scheme for lab3b? Will it be sufficient to support the cases in lab3b-sentences.txt?

0.39.1.1.0 Hi - there's no lab 5. 
The last lab is the final project.
best, bob


0.39.2.0 I hope so! -- we can play it by ear,
see how it goes.  
bob


0.40.1.0 Yes, you can do it with WhNP,
but it's preferable if you use an
NP with a "WH" feature...
bob


0.41.1.0 Yes, that's fine - just the lab3b
sentences suffice.
bob

0.41.2.0 But remember you need to answer the questions asked in the lab too...

- Catherine

0.42.0 When will Lab 4a be released?  Thank you.

0.43.0 Course Grading - is there anything quiet/shy individuals can do to make up for the class participation portion of the grade?

0.42.1.0 I don't know.  Our group is in the process of moving to the stata center...

0.44.0 When I try to run the program "semantic.py -h" in "/mit/6.863/python-semantics/", I get the following error:

/usr/bin/env: No such file or directory

I appreciate any help.  Thanks.

0.44.1.0 This has been corrected in the lab.

0.44.1.1.0 nope, i still see the same problem

0.44.1.1.1.0 my bad. it works. remember to type 'semantic', not 'semantic.py'

0.45.0 Problem with semantic: When I try to run semantic I get an error "Error: Could not write to temp directory (/tmp).". This happens both when I try to run "semantic lab_rules.py" from the course directory and when I try to run my own rule file. Typing "semantic -h" does work, though. What should I do?


0.46.0 What is the last day for handing in Lab4 if one has a lot of late days left?

0.46.1.0 For lab 4, it's the last week
of class - this friday.
thanks, bob