Ling239e: Grammar Engineering

Homework Assignment for Week 9

Due: Tuesday, March 9 (by midnight)
Submit assignments electronically to both professors (rmkaplan@stanford.edu and thking@stanford.edu)


Turn in:
1. the final grammar you end up with for the generation part (eng-week9.lfg)
2. your eng-week9-test.lfg.regen file
Please name your grammar file name-eng-week9.lfg, replacing "name" with your own name.
Exercises on:
PART 1: Generation Grammar
PART 2: Generating Paradigms
PART 3: Testsuites

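There is one grammar this week, eng-week9.lfg; it uses the same default tokenizer and morphology as last time (default-parse-tokenizer.fsmfile and basic.english.fst).

You also need a generation tokenizer (default-gen-tokenizer.fst) and the testfile for Part 3 (eng-week9-test.lfg).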

If you put a file called .xlerc in the directory with your grammar, and that file contains the line:

  create-parser eng-week9.lfg

then whenever you start xle in that directory, it will automatically load eng-week9.lfg. This will save a lot of time when making and testing changes.


PART 1: Making a Generation Grammar

For this part, you should use the grammar eng-week9.lfg, the tokenizers default-parse-tokenizer.fsmfile and default-gen-tokenizer.fst, and the morphological guesser basic.english.fst.

This part will probably take much longer than the other two.

The current grammar is very similar to the one you created for last week's exercise. Up to this point, we have only used the grammars for parsing. Now we will use them to generate. Doing so will involve adding a set of OT marks to constrain the output of the generator by selectively removing or dispreferring parts of the grammar in generation.

As a first step, do:

   create-generator eng-week9.lfg

You should see a message like:

   parse error at ;, maybe near line 459, column 34 in 
        /tilde/thking/gram-eng04/hw/week9/eng-week9.lfg

   Errors found in lexicon; please correct and retry

Correct the error.

EXERCISE 1: Blocking generation of ungrammatical input

This grammar allows mismatched subject-verb agreement. This is good for parsing, but not for generation. Try:

   regenerate "they laughs."

Four strings will be generated: two with laugh and two with laughs (ignore the optional period; we will deal with that later). We only want to generate the correct form with laugh. To do this, you need to put an OT mark in the GENOPTIMALITYORDER ranking in the CONFIG. This should be a NOGOOD mark. Since this construction already has an OT mark associated with it, just use that one.

Reload the grammar and try regenerating the string again. NOTE: When generating, you must reload the grammar even if you only make changes to the lexicon; otherwise the changes will not take effect.

You should now see:

  % regenerate "they laughs."
  parsing {they laughs.}
  *1 solutions, 0.05 CPU seconds, 29 subtrees unified

  They {laugh.|laugh}

Use an ungrammatical (*) parsing OT mark so that you can parse:

   girl laughs.

Add this OT mark to the GENOPTIMALITYORDER so that your grammar will not generate these strings. Unlike the subject-verb example, the generator cannot correct this error (we will do this in Part 2). Instead, you should see:

  % regenerate "girl laughs."
  parsing {girl laughs.}
  *1 solutions, 0.04 CPU seconds, 27 subtrees unified
  Generator failed!
  { }

EXERCISE 2: (Dis)preferring options

In some cases the multiple possibilities that the generator produces are relatively legitimate. However, for practical reasons you may not want to have so many options. You can use the OT marks to disprefer or prefer these options. In many cases, this involves introducing new OT marks that occur only in the GENOPTIMALITYORDER and not the parsing ranking.

This grammar parses sentences with and without ending punctuation. Try:

  regenerate "they laugh"

  regenerate "who laughs"

  regenerate "they laugh."

  regenerate "who laughs?"

Put in an OT mark so that all four sentences still parse but only the versions with final punctuation are generated. Use a preference mark to prefer the punctuation. You should get:

  % regenerate "they appear"
  parsing {they appear}
  1 solutions, 0.03 CPU seconds, 17 subtrees unified

  They appear.
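
If it helps to picture what the two kinds of mark do to the generator's candidate strings, here is a toy model in Python. It is only a sketch under simplifying assumptions: it filters finished candidate strings rather than parts of the grammar, it treats each mark as simply present or absent, and the mark names (HypBadAgreement, HypFinalPunct) are made up for illustration. It is not how XLE applies OT marks internally.

```python
def select_outputs(candidates, nogood, preferred):
    """Toy model: drop NOGOOD candidates, then apply preference marks in rank order.

    candidates: list of (string, set of mark names)
    nogood:     set of mark names treated as NOGOOD
    preferred:  list of preference mark names, highest-ranked first
    """
    # A candidate carrying any NOGOOD mark is removed outright.
    survivors = [(s, marks) for s, marks in candidates if not (marks & nogood)]
    for mark in preferred:
        favoured = [(s, marks) for s, marks in survivors if mark in marks]
        if favoured:  # only narrow the set when some candidate carries the mark
            survivors = favoured
    return [s for s, _ in survivors]


# Hypothetical mark names; the real ones are whatever your grammar defines.
candidates = [
    ("They laugh.", {"HypFinalPunct"}),
    ("They laugh", set()),
    ("They laughs.", {"HypBadAgreement", "HypFinalPunct"}),
    ("They laughs", {"HypBadAgreement"}),
]

print(select_outputs(candidates,
                     nogood={"HypBadAgreement"},
                     preferred=["HypFinalPunct"]))
```

With these made-up marks, only "They laugh." survives: the NOGOOD mark removes both laughs candidates, and the preference mark then picks the punctuated version.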

This grammar allows the POS and labeled bracketing markup that was in the last assignment. This is fine for parsing, but we do not want to generate this markup. To see why, do:

  regenerate "happy boys laughed and laughed."

The result is pretty horrifying. Constrain the grammar to block the generation of POS tags and brackets (labeled or otherwise). Make sure that you can still parse:

  [LAB-NP happy  girls] laughed.

  happy POS-A girls laughed.

but that when you generate from these you only produce the versions without the markup. You should now get:

   % regenerate {happy boys laughed and laughed.}
   parsing {happy boys laughed and laughed.}
   2 solutions, 0.13 CPU seconds, 144 subtrees unified

   Happy boys laughed and laughed.

Adverbs are notorious for appearing in many positions. With this grammar, it is possible to parse:

  quickly they laugh.

  they quickly laugh.

  they laugh quickly.

Constrain the generation so that you only generate the last of these, with the VP-final adverb. For example, you want:

  % regenerate "they quickly laughed."
  parsing {they quickly laughed.}
  1 solutions, 0.04 CPU seconds, 37 subtrees unified

  They laughed quickly.

Turn in: The new version of your grammar.

PART 2: Generating Paradigms and Fixing Generation Input

XLE can be used to produce paradigms. This is done by allowing the generator to add and remove specified features. For example, if we allow XLE to add and remove the NUM feature, then we can produce singular and plural forms from one input. Make your .xlerc look like:

  create-parser eng-week9.lfg
  create-generator eng-week9.lfg

  set-gen-adds remove "NUM"
  set-gen-adds add "NUM"

Then do:

  regenerate "the boys laugh."

This should produce two strings, one with a plural subject and one with a singular one. This also allows XLE to "correct" the nouns without specifiers by making them plural. To see this, do:

  regenerate "girl laughs."

Your task is to get XLE to produce verb paradigms. In this grammar, all of the tense and aspect information is under TNS-ASP. So, alter the set-gen-adds remove to remove the TNS-ASP feature. You will now not be able to generate anything with a verb in it because XLE has removed this information. You need to alter the set-gen-adds add to add all these features back in. Note that adding TNS-ASP back in is not enough; you need to specify each feature you want to add. To list multiple features, you do:

  set-gen-adds add "AAA BBB CCC"

where AAA, BBB, and CCC are the features in question. Assuming that you left in the ability to add and remove NUM, you should get something like:

  % regenerate "the boys laugh."
  parsing {the boys laugh.}
  1 solutions, 0.10 CPU seconds, 26 subtrees unified

  The 
  { boy 
      { {was|is} laughing.
       |laughed.
       |laughs.
       |did laugh.}
   |boys {were laughing.|did laugh.|laughed.|laugh.}}
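
To count or inspect the individual strings packed into output like this, you can expand the brace notation yourself. The Python sketch below assumes that { ... } groups alternatives separated by |, that groups may nest, and that line breaks and indentation in the output are just layout. These assumptions are read off the examples in this handout, not taken from an XLE specification, and the function names are made up.

```python
import re


def expand(text: str) -> list:
    """Expand {a|b} alternations into a flat list of strings."""
    results, pos = _parse_sequence(text, 0)
    if pos != len(text):
        raise ValueError(f"unexpected {text[pos]!r} at position {pos}")
    # Collapse layout whitespace so each result is a plain one-line string.
    return [re.sub(r"\s+", " ", r).strip() for r in results]


def _parse_sequence(text, pos):
    """Read literals and groups until '|', '}' or end; return (strings, new_pos)."""
    results = [""]
    while pos < len(text) and text[pos] not in "|}":
        if text[pos] == "{":
            alternatives, pos = _parse_group(text, pos + 1)
            results = [r + a for r in results for a in alternatives]
        else:
            end = pos
            while end < len(text) and text[end] not in "{|}":
                end += 1
            results = [r + text[pos:end] for r in results]
            pos = end
    return results, pos


def _parse_group(text, pos):
    """Read 'alt|alt|...}' starting just after '{'; return (strings, new_pos)."""
    alternatives = []
    while True:
        strings, pos = _parse_sequence(text, pos)
        alternatives.extend(strings)
        if pos >= len(text):
            raise ValueError("missing closing '}'")
        if text[pos] == "}":
            return alternatives, pos + 1
        pos += 1  # skip the '|' separating alternatives


paradigm = """The
{ boy
    { {was|is} laughing.
     |laughed.
     |laughs.
     |did laugh.}
 |boys {were laughing.|did laugh.|laughed.|laugh.}}"""

for sentence in expand(paradigm):
    print(sentence)
```

Applied to the paradigm above, this lists nine sentences, five with boy and four with boys.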

Turn in: Put a copy of your .xlerc as a comment at the top of your grammar file. Remember that comments are enclosed in double quotes:

   " comment ".

PART 3: Testsuites and Regeneration

This part is largely a walk-through to show you what types of statistics XLE provides when you parse a testfile. You will need the testfile eng-week9-test.lfg. When you look at the file, you will notice that comments begin with a hash mark and that the sentences are separated by blank lines. In addition, there are parse statistics in parentheses after each sentence. For example:

  they laugh. (1 0.10 19)

The first number is the number of parses, the second is the time in CPU seconds, and the third is a complexity measure. Some sentences have four numbers:

  they laughs. (0! *1 0.04 29)

The first number, which is followed by a !, indicates the number of parses that were expected.

Parse the testfile with your grammar using:
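
If you want to work with these annotations outside XLE, the format is simple enough to pick apart with a regular expression. The Python sketch below handles only the two shapes shown above. The names ParseStats and split_test_line are made up for illustration, and recording the * before the solution count as a flag is an assumption based on how it appears in the examples.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParseStats:
    expected: Optional[int]  # the count before '!', if one is given
    starred: bool            # True when the solution count is prefixed with '*'
    solutions: int           # number of parses
    cpu_seconds: float       # time in CPU seconds
    complexity: int          # the complexity measure


# Matches "(1 0.10 19)" and "(0! *1 0.04 29)" at the end of a line.
_STATS = re.compile(
    r"\((?:(\d+)!\s+)?(\*?)(\d+)\s+(\d+(?:\.\d+)?)\s+(\d+)\)\s*$"
)


def split_test_line(line: str):
    """Return (sentence, ParseStats) for a line such as 'they laugh. (1 0.10 19)'."""
    match = _STATS.search(line)
    if match is None:
        raise ValueError(f"no parse statistics found in: {line!r}")
    expected, star, solutions, cpu, complexity = match.groups()
    stats = ParseStats(
        expected=int(expected) if expected is not None else None,
        starred=(star == "*"),
        solutions=int(solutions),
        cpu_seconds=float(cpu),
        complexity=int(complexity),
    )
    return line[:match.start()].strip(), stats


print(split_test_line("they laugh. (1 0.10 19)"))
print(split_test_line("they laughs. (0! *1 0.04 29)"))
```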

  parse-testfile eng-week9-test.lfg

XLE will report the numbers that your grammar gets. Minimally, it should complain about an error when it parses the ones with 0!; this will be indicated by ERROR. If your grammar gives a different number of solutions to a sentence, you will see MISMATCH. Look through the statistics at the bottom to get a feel for the kinds of information being reported. You will also now have four testfiles instead of one:

  eng-week9-test.lfg         
  eng-week9-test.lfg.new
  eng-week9-test.lfg.errors  
  eng-week9-test.lfg.stats

The .lfg one is the original. The .new one has the sentences with parse statistics from your grammar. The .errors one lists only those sentences with different parse numbers from the original. The .stats one lists just the new statistics without the sentences themselves.

Finally, do the following (do not do this until you have finished Part 1, since otherwise you will get huge amounts of output):

   regenerate-testfile eng-week9-test.lfg

This will parse each sentence, regenerate from the first parse it gets, and report back if the strings match. Note that almost nothing will match because the generation tokenizer is capitalizing the first word while the original strings do not have initial caps. You should see things like:

  Regenerating 'they laugh.'

  They laugh.

  1:REGENERATION DID NOT MATCH 'they laugh.'.
  ((1) (1 0.10 19) -> (1 0.17 43) (2 words))
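
To check by hand whether a reported mismatch is only the capitalization issue, you can compare the two strings while ignoring the case of the first character. This is a small Python sketch for your own post-processing, not something XLE does; the function name is made up.

```python
def matches_ignoring_initial_cap(original: str, regenerated: str) -> bool:
    """Compare two strings, ignoring the case of the first character only."""
    def lower_first(s: str) -> str:
        return s[:1].lower() + s[1:]

    return lower_first(original.strip()) == lower_first(regenerated.strip())


print(matches_ignoring_initial_cap("they laugh.", "They laugh."))
print(matches_ignoring_initial_cap("they laugh", "They laugh."))
```

The first comparison succeeds because only the initial capital differs; the second fails because the final period is a real difference.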

Turn in: your eng-week9-test.lfg.regen
This is the file that XLE produces to report on regeneration results.


If you have any questions, you can send us email (rmkaplan@stanford.edu and thking@stanford.edu), call us (Ron: 812-4348; Tracy: 812-4808), or talk to us during office hours.