Ling187/287: Grammar Engineering

Homework Assignment for Week 7

Due: Wednesday, May 24 (by midnight)
Submit assignments electronically to both professors (kaplan "at" parc.com and thking "at" parc.com)


Turn in:
  1. the final grammar you end up with for the sublexical rule part (eng-week7.lfg)
Please name your grammars name-eng-week7.lfg (with your own name in place of name).
Exercise on:
  PART 1: sublexical rules (connecting fsts to xle)
  PART 2: generation

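There are two grammars this week (eng-week7-sublex.lfg and eng-week7-gen.lfg), as well as three fsts to use with the grammars (default-parse-tokenizer.fsmfile, english.morph.regular.fst, and default-gen-tokenizer.fst).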

If you put a file called xlerc in the directory with your grammar and in xlerc you put:

  create-parser eng-week7-sublex.lfg

then whenever you start xle in that directory, it will automatically load eng-week7-sublex.lfg. This will save a lot of time when making and testing changes.

The two parts are completely separate and so can be done in either order.


PART 1: Sublexical Rules

For this part, you should use the grammar eng-week7-sublex.lfg and the tokenizer default-parse-tokenizer.fsmfile and the morphological guesser english.morph.regular.fst.

The tokenizer and morphology are already hooked up to the grammar by the MORPHOLOGY section at the top of the grammar (you can read about MORPHOLOGY sections in the XLE documentation and in the slides, but you won't need it for this assignment). Take a look at the section but don't edit it.

Load up the grammar. To get a feel for what the morphology is doing, try some morphemes commands in XLE:

  morphemes push
  morphemes pushed
  morphemes pushing
  morphemes red
  morphemes redder
  morphemes reddest
  morphemes quickly

Note the issues with trying to guess a stem based on the surface form. For example, redder is stemmed as both redd and red. This results in more ambiguity than you might expect; it shows up as different PREDs in the f-structures. You will mainly need to worry about this when parsing verbs. The verbs are all listed in the lexicon, and when you put them in sentences you will need to use regular forms (thinked instead of thought). Also, the fst morphology has no idea which part of speech a form belongs to, so redder is correctly analyzed as a comparative adjective but can also be a base form verb or a singular noun. This also increases ambiguity.
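
The ambiguity described above can be illustrated outside XLE with a toy suffix stripper. The following Python sketch is not the english.morph.regular.fst guesser: the suffix list and the labels (adjective comparative, verb base, and so on) are assumptions made up for illustration. It only shows why a surface form such as redder ends up with more than one candidate stem and more than one part of speech.

```python
# Toy suffix-stripping guesser (illustration only, not the XLE morphology).
# The suffixes and labels below are hypothetical.
SUFFIXES = [
    ("er", "adjective comparative"),
    ("est", "adjective superlative"),
    ("s", "noun plural"),
    ("s", "verb 3sg present"),
    ("ed", "verb past"),
    ("ing", "verb progressive"),
]


def guess(word):
    """Return a sorted list of (stem, label) guesses for a surface form."""
    # A guesser cannot rule out that the whole word is an unanalysed base form.
    analyses = {
        (word, "adjective positive"),
        (word, "noun singular"),
        (word, "verb base"),
    }
    for suffix, label in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            stem = word[: -len(suffix)]
            analyses.add((stem, label))
            # A doubled final consonant may come from a spelling rule
            # (red + er -> redder), so also try the undoubled stem.
            if len(stem) >= 3 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                analyses.add((stem[:-1], label))
    return sorted(analyses)


for stem, label in guess("redder"):
    print(stem, label)
```

Both red and redd come out as comparative stems, and redder itself survives as a possible base form of any category, which is the same kind of over-generation you will see in the morphemes output.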

EXERCISE 1: Building sublexical c-structures

There is a list of tags at the bottom of the file and an entry for -unknown. The first task is to make sublexical rules for A, N, and V. There is a sample sublexical rule for ADV that you can use as a model.

At first, work on making basic lexical entries and sublexical rules so that you get a valid c-structure. Do not worry about the f-structure yet. The adjective (A) rule will be the simplest (but a bit more complicated than the adverb due to the possibility of having a comparative or superlative tag). First do:

  morphemes red
  morphemes redder
  morphemes reddest

to see all the possible tag sequences for adjectives. Then look at the category of these tags in the lexicon and put the categories into the sublexical rule (remember to add on the _BASE). You can change the category names if you want to. You should be able to get a tree for:

  
  A: red
  A: redder
  A: reddest

Next look at the nouns. Do:

  
  morphemes dog
  morphemes dogs

to see all the possible tag sequences for nouns. Then decide what category to assign to the relevant tags and put the categories into the sublexical rule (remember to add on the _BASE). You should be able to get a tree for:

  
  N: dog
  N: dogs

Finally, do the V sublexical rule. Do:

  
  morphemes laugh
  morphemes laughs
  morphemes laughed
  morphemes laughing

to see all the possible tag sequences for verbs. Note that forms like laugh get two sets of verbal tags. Then decide what category to assign to the relevant tags and put the categories into the sublexical rule (remember to add on the _BASE). You should be able to get a tree for each of the following (two for laugh):

  
  V: laugh
  V: laughs
  V: laughed
  V: laughing

EXERCISE 2: The f-structures

At this point, you will be able to build up c-structures, but the f-structures are going to be a bit off. The task is to modify the entries for the tags and the -unknown to fix them up. In particular, you will be adding equations like (^ NUM)=pl.

First look at the adjectives. At this point, positive (red), comparative (redder), and superlative (reddest) adjectives all get the same f-structure. Add a DEGREE feature to comparatives and superlatives, with the values compar and super, to differentiate them from the standard positive forms. There is no tag for the positive forms on which you could put an equation giving DEGREE the value positive. Instead, modify either the ADJECTIVE template or the A rule to put in a DEGREE positive feature when there is no compar or super value. That is, you want the f-structure for red to look like:

  [ PRED 'red'
    DEGREE positive ]
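
The logic of "use the tag's value if there is one, otherwise default to positive" can be written down as a small Python sketch. This is only a model of the intended result, not XLE notation, and the tag names +Comp and +Sup are assumptions; check the morphemes output for the tags your morphology actually produces.

```python
# Toy model of defaulting a feature value (illustration only, not XLE notation).
# The tag names are hypothetical stand-ins for whatever the morphology emits.
DEGREE_BY_TAG = {"+Comp": "compar", "+Sup": "super"}


def adjective_fstructure(stem, tags):
    """Build a dict standing in for an adjective f-structure."""
    fs = {"PRED": stem}
    for tag in tags:
        if tag in DEGREE_BY_TAG:
            fs["DEGREE"] = DEGREE_BY_TAG[tag]
    # No tag supplied a DEGREE value, so fall back to the default.
    fs.setdefault("DEGREE", "positive")
    return fs


print(adjective_fstructure("red", []))
print(adjective_fstructure("red", ["+Comp"]))
print(adjective_fstructure("red", ["+Sup"]))
```

The point to carry over to the grammar is that the default must apply only when no other DEGREE value is present, so that redder and reddest do not end up with conflicting values.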

Next do the nouns. You will want to be able to parse:

  
  NP: girls
  NP: a girl
  NP: girl

All three should now get a valid c-structure. Make sure that they get the correct NUM feature, PERS feature, and an NTYPE common (we only have common nouns). Note that, based on the tags, there is no way to tell if a noun is a count noun or a mass noun and hence whether it needs a determiner in the singular.

Finally, work on the verbs. Your grammar should be able to parse:

  she laughs.
  she laughed.
  she is laughing.
  she eats bananas.
  she eated bananas.
  she is eating bananas.
  she wants to eat.
  she thinked that she wanted to eat.

Try to get the tensed forms first and then deal with the infinitive. Don't worry about the fact that -ed verbs like laughed are stemmed as both past tense and base form verbs; just make sure that when they have past tense tags the f-structure reflects this and that when they have infinitive tags the f-structure reflects this (and that only the infinitive versions can appear with to). The feature declaration is set up to allow the types of structures you have been working with already. Don't worry about complicated things like they will have been being eaten; just stick with a single auxiliary, as in previous grammars:

  [ PRED 'some_verb'
    TNS-ASP [ TENSE pres/past
              ASPECT prog/perf/simple ]
    VFORM prog/pass/perf
    PASSIVE + ]

Make sure that your grammar cannot parse:

  she laugh.
  they laughs.

Your grammar should be able to parse all types of silly things like:

  mtjesk thinked that the tnejsk tejsks want to eat a thejk tehjh.

as long as you use regular forms of the few verbs it knows. These verbs are listed in the lexicon.

EXERCISE 3: Guessing verbs

If you got the above part to work, this should not be difficult.

Modify the -unknown lexical entry to allow the grammar to guess verbs as being either transitive or intransitive. You should be able to parse:

  she jjjs.
  she jjjs him.
  they jjj.
  they jjjed.
  she is jjjing.

Make a testsuite of all the examples used in this assignment. Add a few real-world sentences (e.g. from a web page or something you are reading). Turn the fragments back on (change the OPTIMALITYORDER by re-commenting the Fragment and Token marks). Run the testsuite through your grammar to see how much ambiguity there is and what analyses you are getting. The ambiguity will probably be pretty impressive.
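
If you collect the examples in a script rather than by hand, something like the following Python sketch may help. It assumes that a testsuite is a plain text file with one item per block and blank lines between items; that format is an assumption here, so check the XLE documentation before relying on it. The sentence list is only a sample taken from this assignment.

```python
# Sketch of assembling a testsuite (assumed format: one item per block,
# separated by blank lines). The list is a sample, not the full assignment.
EXAMPLES = [
    "she laughs.",
    "she eated bananas.",
    "she thinked that she wanted to eat.",
    "she jjjs him.",
    "they jjjed.",
]


def format_testsuite(items):
    """Join test items with blank lines, dropping empty and duplicate items."""
    seen = []
    for item in items:
        item = item.strip()
        if item and item not in seen:
            seen.append(item)
    return "\n\n".join(seen) + "\n"


print(format_testsuite(EXAMPLES), end="")
```

Redirect the printed text into a file in the grammar directory and add your real-world sentences to the list before running it.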

If you want to see more about xfst for building morphologies and other things, there are some sample exercises at XRCE Grenoble (including the script for an fst that calculates which coins you can use to buy a 65 cent soda). They also have more information on FSTs in general there.

PART 2: Generation

For this part, you should use the grammar eng-week7-gen.lfg and the tokenizers default-parse-tokenizer.fsmfile and default-gen-tokenizer.fst.

Up until this point, we have only used the grammars in a parsing direction. Now we will use them to generate. Doing so will involve adding a set of OT marks to constrain the output of the generator by selectively removing or dispreferring parts of the grammar in generation.

As a first step, do:

   create-generator eng-week7-gen.lfg

You should see an error message with a line number telling you where the error is. Correct the error and restart to make sure it is really fixed.

EXERCISE 1: Blocking generation of ungrammatical input

This grammar allows mismatched subject-verb agreement. This is good for parsing, but not for generation. Try:

   regenerate "they laughs."

There will be several strings generated. First get rid of the ones with sleep (hint: look carefully at the lexical entries).

There are both singular and plural forms being generated. We only want to generate the correct form with laugh. To do this, you need to put an OT mark in the GENOPTIMALITYORDER ranking in the CONFIG. This should be a NOGOOD mark. Since this construction already has an OT mark associated with it (BadVAgr), just use that one.

Reload the grammar and try regenerating the string again. NOTE: When generating you must reload the grammar even if you only make changes to the lexicon; otherwise the changes are not in effect.

You should now see (strings may be in a different order; that is unimportant):

  % regenerate "they laughs."
  parsing {they laughs.}
  1 solutions, 0.05 CPU seconds, 29 subtrees unified

  They {laugh.|laugh|Laugh!}

Use a parsing (ungrammatical/dispreference) OT mark to make it so that you can parse count nouns without specifiers like:

   girl laughs.

Add this OT mark to the GENOPTIMALITYORDER so that your grammar will not generate these strings. Unlike the subject-verb example, the generator cannot correct this error (we will do this in Exercise 3). Instead, you should see:

  % regenerate "girl laughs."
  parsing {girl laughs.}
  1 solutions, 0.04 CPU seconds, 27 subtrees unified
  Generator failed!
  { }
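
The difference between the two outcomes in this exercise can be seen in a toy model of a NOGOOD mark. The Python sketch below is not how XLE is implemented, and the mark name NoDet is a hypothetical stand-in for whatever mark you chose for nouns without specifiers. Each candidate is a string paired with the OT marks its analysis carries.

```python
# Toy model of NOGOOD marks in a generation ranking (not XLE itself).
def filter_nogood(candidates, nogood_marks):
    """Keep only the strings whose analyses carry none of the NOGOOD marks.

    candidates: list of (string, set_of_marks) pairs.
    """
    return [text for text, marks in candidates if not (marks & nogood_marks)]


# Both strings realize the same f-structure; only one carries BadVAgr.
agreement = [
    ("They laugh.", set()),
    ("They laughs.", {"BadVAgr"}),
]
print(filter_nogood(agreement, {"BadVAgr"}))

# Every string for this f-structure carries the (hypothetical) NoDet mark,
# so nothing survives, which corresponds to "Generator failed!".
bare_noun = [
    ("Girl laughs.", {"NoDet"}),
    ("Girl laughs", {"NoDet"}),
]
print(filter_nogood(bare_noun, {"BadVAgr", "NoDet"}))
```

In the first case an unmarked string exists for the same f-structure, so the generator appears to correct the input. In the second case there is no unmarked alternative, so generation fails.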

EXERCISE 2: (Dis)preferring options

In some cases the multiple possibilities that the generator produces are relatively legitimate. However, for practical reasons you may not want to have so many options. You can use the OT marks to disprefer or prefer these options. In many cases, this involves introducing new OT marks that occur only in the GENOPTIMALITYORDER and not the parsing ranking.

This grammar parses sentences with and without ending punctuation. Try:

  regenerate "they laugh"

  regenerate "who laughs"

  regenerate "they laugh."

  regenerate "who laughs?"

  regenerate "they laugh!"

Put in an OT mark so that all five sentences still parse but you only generate the versions with final punctuation. Use a preference mark to prefer the punctuation. For an input like they appear, you should get:

  % regenerate "they appear"
  parsing {they appear}
  1 solutions, 0.03 CPU seconds, 17 subtrees unified

  They appear.
  They appear!
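
A preference mark behaves differently from a NOGOOD mark: it does not remove anything outright, it only lets marked candidates win over unmarked ones. The Python sketch below is a simplified model with a single mark, and the mark name PunctPref is hypothetical.

```python
# Toy model of a single preference mark (simplified, not XLE itself).
def prefer(candidates, preference_mark):
    """If any candidate carries the mark, keep only those; otherwise keep all.

    candidates: list of (string, set_of_marks) pairs.
    """
    preferred = [text for text, marks in candidates if preference_mark in marks]
    if preferred:
        return preferred
    return [text for text, _ in candidates]


candidates = [
    ("They appear", set()),
    ("They appear.", {"PunctPref"}),
    ("They appear!", {"PunctPref"}),
]
print(prefer(candidates, "PunctPref"))
```

Because the unpunctuated string is only dispreferred, not blocked, the mark can live in the generation ranking alone and the parser will still accept input without final punctuation.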

Adverbs are notorious for appearing in many positions. With this grammar, it is possible to parse:

  quickly they laugh.

  they quickly laugh.

  they laugh quickly.

Constrain the generation so that you only generate the last of these, with the VP-final adverb. For example, you want:

  % regenerate "they quickly laughed."
  parsing {they quickly laughed.}
  1 solutions, 0.04 CPU seconds, 37 subtrees unified

  They laughed quickly.

Don't worry if you also get:

   They did laughed quickly.

As we will see in the next exercise, the grammar is a bit too loose and allows this possibility.

EXERCISE 3: Generating Paradigms and Fixing Generation Input

XLE can be used to produce paradigms. This is done by allowing the generator to add and remove specified features. For example, if we allow XLE to add and remove the NUM feature, then we can produce singular and plural forms from one input. Make your .xlerc look like:

  create-parser eng-week7-gen.lfg
  create-generator eng-week7-gen.lfg

  set-gen-adds remove "NUM"
  set-gen-adds add "NUM"

Then do:

  regenerate "the girls laugh."

This should produce two strings, one with a plural subject and one with a singular one. This also allows XLE to "correct" the nouns without specifiers by making them plural. To see this, do:

  regenerate "girl laughs."

Your task is to get XLE to produce verb paradigms. In this grammar, the tense and aspect information is under TNS-ASP and VFORM. So, alter the set-gen-adds remove to remove the TNS-ASP feature. You will now not be able to generate anything with a verb in it because XLE has removed this information. You need to alter the set-gen-adds add to add back in all these features. Note that adding back in TNS-ASP and VFORM is not enough; you need to specify each feature you want to add in (but not the atomic values). To list multiple features, you do:

  set-gen-adds add "AAA BBB CCC"

where AAA, BBB, and CCC are the features in question. Assuming that you left in the ability to add and remove NUM, you should get something like:

  % regenerate "the girls laugh."
  parsing {the girls laugh.}
  1 solutions, 0.10 CPU seconds, 26 subtrees unified

  The 
  { girl 
      { {was|is} laughing.
       |laughed.
       |laughs.
       |did laugh.}
   |girls {were laughing.|did laugh.|laughed.|laugh.}}

There may be some strange combinations like The girls did laughed. These indicate that the parsing grammar is not written tightly enough; you do not need to fix this.
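
The idea behind paradigm generation (remove some features from the input, then let the generator fill them back in with every value the grammar allows) can be modelled in a few lines of Python. This is a sketch of the concept only: the flat dictionary, the feature names, and the value lists are simplifications chosen for illustration, and XLE itself works on full f-structures and only produces strings the grammar licenses.

```python
# Toy model of removing features and re-adding them with all possible values.
from itertools import product


def paradigm(fstructure, free_features, values):
    """Return every structure obtained by re-filling the removed features."""
    # Step 1: remove the features that the generator is allowed to change.
    base = {k: v for k, v in fstructure.items() if k not in free_features}
    # Step 2: add them back with every combination of candidate values.
    names = sorted(free_features)
    results = []
    for combo in product(*(values[name] for name in names)):
        results.append({**base, **dict(zip(names, combo))})
    return results


parsed = {"PRED": "laugh", "NUM": "pl", "TENSE": "pres"}
candidate_values = {"NUM": ["sg", "pl"], "TENSE": ["pres", "past"]}

for fs in paradigm(parsed, {"NUM", "TENSE"}, candidate_values):
    print(fs)
```

This also shows why removing a feature without adding it back leaves nothing to generate from: step 2 would have no values to combine with the remaining structure.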

Put a copy of your xlerc as a comment at the top of your grammar file. Remember that comments are enclosed in double quotes:

   " comment ".

Turn in:


If you have any questions, you can send us email or call us (Ron: 812-4348; Tracy: 812-4808).