Ling239e: Grammar Engineering

Homework Assignment for Week 6

Due: Tuesday, February 17 (by midnight)
Submit assignments electronically to both professors (rmkaplan@stanford.edu and thking@stanford.edu)


Turn in:
  1. the final FST infile you made (eng-week6.infile)
  2. the final grammar you end up with for the sublexical rule part (eng-week6.lfg)
Please name your files name-eng-week6.infile and name-eng-week6.lfg.
Exercises on:
PART 1: sublexical rules (connecting fsts to xle)
PART 2: building FSTs

This week there is one FST infile and one grammar, as well as two FSTs to use with the grammar:

  eng-week6.infile
  eng-week6.lfg
  default-parse-tokenizer.fsmfile
  basic.english.fst

If you put a file called .xlerc in the directory with your grammar and in .xlerc you put:

  create-parser eng-week6.lfg

then whenever you start xle in that directory, it will automatically load eng-week6.lfg. This will save a lot of time when making and testing changes.


PART 1: Sublexical Rules

For this part, you should use the grammar eng-week6.lfg and the tokenizer default-parse-tokenizer.fsmfile and the morphological guesser basic.english.fst.

The tokenizer and morphology are already hooked up to the grammar by the MORPHOLOGY section at the top of the grammar (you can read about MORPHOLOGY sections in the XLE documentation and in the slides, but you won't need it for this assignment). Take a look at the section but don't edit it.

Load up the grammar. To get a feel for what the morphology is doing, try some morphemes commands in XLE:

  morphemes push
  morphemes pushed
  morphemes pushing
  morphemes red
  morphemes redder
  morphemes reddest
  morphemes Mary
  morphemes quickly

Note that this guesser is really primitive. For example, redder stems to redd because the guesser just strips off the er and does not know to undouble the consonant. The only time you need to worry about this is when parsing verbs. The verbs are all listed, and when you put them in sentences you will need to use regularized forms (thinked instead of thought).
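To make the redd behavior concrete, here is a minimal Python sketch of a guesser that only strips suffixes. It is not the contents of basic.english.fst: the suffix list and the function name guess_stems are assumptions made for illustration, and the real guesser also produces tags, which are left out here.

```python
# Hypothetical suffix list; the real guesser's inventory is not shown in this handout.
SUFFIXES = ["est", "ing", "er", "ed", "s"]

def guess_stems(word):
    """Return the word itself plus every stem obtained by stripping one listed suffix."""
    stems = [word]
    for suffix in SUFFIXES:
        # Strip the suffix blindly: no consonant undoubling, no irregular forms.
        if word.endswith(suffix) and len(word) > len(suffix):
            stems.append(word[: -len(suffix)])
    return stems

print(guess_stems("redder"))   # ['redder', 'redd']
print(guess_stems("thinked"))  # ['thinked', 'think']
```

Because nothing undoubles the consonant, redder yields redd rather than red, while a regularized form such as thinked strips cleanly to think.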

EXERCISE 1: Building sublexical c-structures

There is a list of tags commented out at the bottom of the grammar file, along with an entry for -unknown. The first task is to make sublexical rules for A, N, and V. There is a sample sublexical rule for ADV that you can use as a model.

At first, work on making basic lexical entries and sublexical rules so that you get a valid c-structure. Do not worry about the f-structure yet. The adjective (A) rule will be the simplest. First do:

  morphemes red
  morphemes redder
  morphemes reddest

to see what the possible tag sequences for adjectives are. Then decide which category to assign to each relevant tag and put the categories into the sublexical rule (remember to add on the _BASE). You should be able to get a tree for:

  A: red
  A: redder
  A: reddest

Next look at the nouns. Do:

  morphemes Mary
  morphemes dog
  morphemes dogs
  morphemes IBM

to see what the possible tag sequences for nouns are. Then decide which category to assign to each relevant tag and put the categories into the sublexical rule (remember to add on the _BASE). You should be able to get a tree for:

  N: Mary
  N: dog
  N: dogs
  N: IBM

Finally, do the V sublexical rule. Do:

  morphemes laugh
  morphemes laughs
  morphemes laughed
  morphemes laughing

to see what the possible tag sequences for verbs are. Note that forms like laugh get two sets of verbal tags. Then decide which category to assign to each relevant tag and put the categories into the sublexical rule (remember to add on the _BASE). You should be able to get a tree for each of the following (two trees for laugh):

  V: laugh
  V: laughs
  V: laughed
  V: laughing

EXERCISE 2: The f-structures

At this point, you will be able to build up c-structures, but the f-structures are going to be a bit off. The task is to modify the entries for the tags and the -unknown to fix them up.

First look at the adjectives. Positive (red), comparative (redder), and superlative (reddest) adjectives currently all get the same f-structure. Add a DEGREE feature to comparatives and superlatives with the values compar and super to differentiate them from the standard positive forms.

The morphology gives a special tag to proper nouns like Mary. Add a feature NTYPE with the value proper to these forms.

Parse:

  NP: girls
  NP: a girl
  NP: girl

All three should now get a valid c-structure. Make sure that they get the correct NUM feature and that NP: girl does not get a valid f-structure. When doing this last part, make sure that you can still parse:

  NP: Mary

Hint: one way to do this is to put a disjunction in the N entry for -unknown.

Finally, work on the verbs. Your grammar should be able to parse:

  she laughs.
  she laughed.
  she is laughing.
  she eats bananas.
  she eated bananas.
  she is eating bananas.
  she wants to eat.
  she sayed that she wants to eat.

Try to get the tensed forms first and then deal with the infinitive. Make sure that your grammar cannot parse:

  she laugh.
  they laughs.
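The reason these two strings should fail is feature clash: the subject and the verb contribute incompatible agreement values. The Python sketch below is a toy illustration of that idea and not XLE notation. The PERS feature, the dictionaries, and the function names are all assumptions for the example; only NUM appears in this handout.

```python
def unify(f, g):
    """Merge two flat feature dicts; return None if a shared feature has different values."""
    result = dict(f)
    for attr, value in g.items():
        if attr in result and result[attr] != value:
            return None
        result[attr] = value
    return result

# Hypothetical subject features.
SHE = {"PERS": "3", "NUM": "sg"}
THEY = {"PERS": "3", "NUM": "pl"}

# Hypothetical constraints each verb form places on its subject (a disjunction of options).
LAUGHS = [{"PERS": "3", "NUM": "sg"}]
LAUGH = [{"NUM": "pl"}, {"PERS": "1", "NUM": "sg"}, {"PERS": "2", "NUM": "sg"}]

def agrees(subject, verb_options):
    """True if at least one option of the verb unifies with the subject's features."""
    return any(unify(subject, option) is not None for option in verb_options)

print(agrees(SHE, LAUGHS))   # True
print(agrees(SHE, LAUGH))    # False
print(agrees(THEY, LAUGHS))  # False
print(agrees(THEY, LAUGH))   # True
```

In the grammar itself, the corresponding constraints would be stated in the entries for the verbal tags rather than in a table like this one.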

Your grammar should be able to parse all types of silly things like:

  Mtjesk thinked that the tnejsk tejsks want to eat a thejk tehjh.

as long as you use regular forms of the few verbs it knows.

EXTRA CREDIT: Guessing verbs

If you got the above part to work, this should not be difficult.

Modify the -unknown lexical entry to allow the grammar to guess verbs as being either transitive or intransitive. You should be able to parse:

  she jjjs.
  she jjjs him.
  they jjj.
  they jjjed.
  she is jjjing.

Turn in: The new version of your grammar.

PART 2: Building FST scripts

For this part, you should use the FST infile eng-week6.infile.

This part is basically an xfst walkthrough. The program you need is in:

  /afs/ir.stanford.edu/class/linguist239e/xerox-solaris 

You should be able to just work in that directory if you want to.

EXERCISE 1: A noun phrase chunker

Start up xfst. If you are in the above directory, do:

  ./xfst

It should print a message and give you a prompt that says:

  xfst[0]:

We are going to build a small finite state machine that takes determiners (d), adjectives (a), nouns (n), and prepositional phrases (p) in certain sequences and brackets them with { }. This simulates an NP chunker.

At the prompt type in:

   regex [ (d) a* n+ p* -> %{ ... %} ] ;

To see what this does, at the prompt (which should now be xfst[1]:) type:

  apply down

You can now type in sequences of the four symbols and see what it does. Do not put in spaces between the symbols. Try:

  n
  nn
  an
  daanppp

There is a bit of a problem in that for an NP chunker you usually want the longest possible NP you can build. This fsm is building all possible NPs from the string. This can be fixed.

Do control-d (^-d) to get out of the apply down mode. You should be back at the xfst[1]: prompt. Now type in the same command only with an @ before the arrow; this tells xfst to take only the longest match.

  regex [ (d) a* n+ p* @-> %{ ... %} ] ;

Once again, type:

  apply down

Now try:

  danp
  daanpvnnpdan

The first will give one big NP. The second will build three NPs and leave the v stranded in the middle.
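If you want to check the longest-match results outside xfst, the following Python sketch simulates them with the standard re module. It is only an approximation under a stated assumption: Python matches leftmost and greedily rather than by true longest match, but for this particular pattern the two give the same bracketing. The function name chunk is made up for the example.

```python
import re

# (d) a* n+ p* from the xfst regex, rewritten in Python re syntax.
NP = re.compile(r"d?a*n+p*")

def chunk(tags):
    """Wrap each leftmost, greedy NP match in braces; other symbols pass through unchanged."""
    return NP.sub(lambda m: "{" + m.group(0) + "}", tags)

print(chunk("danp"))          # {danp}
print(chunk("daanpvnnpdan"))  # {daanp}v{nnp}{dan}
```

This sketch does not reproduce the first regex (the one without @), which returns every possible bracketing rather than a single one.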

To get out, do control-d and then quit at the prompt.
There is nothing to turn in for this exercise.

EXERCISE 2: Using infiles

If you are going to make big FSTs, you do not want to have to type everything in at the command line all the time. Instead, you want to work on and store your commands in a file. xfst allows you to load ascii text files with the commands you want to use. You can then manipulate them in xfst, save them as fsts, etc.

Look at the infile eng-week6.infile. It maps back and forth between the surface forms of leave/left/leaf and the stem forms plus some very basic tags. To load it up do:

  ./xfst -l eng-week6.infile

You should see a prompt that says:

  xfst[1]:

You can now do:

  apply up

and at the prompt can enter the surface forms to get the stem and tags. For example:

  leaves

If you enter a form that was not on the right hand side of the .x. in the infile, then it returns nothing.

Do control-d to exit the apply up and do:

  apply down

Here you can enter the stem form with the tag and get the surface form. Do not put any space between the stem and the tag. Try:

  leave+VBZ
  leave+NNS
  left+JJ

Do control-d to get out of the apply down and type quit to get out of xfst.
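One way to think about apply up and apply down is that the transducer is a set of (stem+tag, surface form) pairs that can be looked up from either side. The Python sketch below models that with two dictionaries. The pairs are assumed for illustration and may not match what is actually in eng-week6.infile, and the function names are made up.

```python
from collections import defaultdict

# Assumed (lexical, surface) pairs; the real infile may list different ones.
PAIRS = [
    ("leave+VBZ", "leaves"),
    ("leave+NNS", "leaves"),
    ("leaf+NNS", "leaves"),
    ("left+JJ", "left"),
]

def build(pairs):
    """Index the same relation in both directions."""
    up, down = defaultdict(list), defaultdict(list)
    for lexical, surface in pairs:
        up[surface].append(lexical)
        down[lexical].append(surface)
    return up, down

UP, DOWN = build(PAIRS)

def apply_up(surface):
    """Surface form -> all stem+tag analyses (empty list if the form is not listed)."""
    return UP.get(surface, [])

def apply_down(lexical):
    """Stem+tag -> all surface forms (empty list if the analysis is not listed)."""
    return DOWN.get(lexical, [])

print(apply_up("leaves"))     # ['leave+VBZ', 'leave+NNS', 'leaf+NNS']
print(apply_up("dogs"))       # []
print(apply_down("left+JJ"))  # ['left']
```

An unlisted form such as dogs returns nothing in either direction, which is the same behavior you see in xfst until you add the form to the infile.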

The task is to modify the infile to add:

Put all of these forms in the same regex that is already there. The | is the disjunction symbol (there is one at the end of each line except the last). Square brackets are used for grouping.

Load up the script again:

  ./xfst -l eng-week6.infile

Do apply up and try out some forms such as:

  churches
  dogs
  trampled
  trampling
  flew
  redder

Turn in: The new version of your infile.

If you want to play around more with xfst, there are some sample exercises at XRCE Grenoble. They also have more information on FSTs in general there. The two exercises here are based on ones on that site.


If you have any questions, you can send us email (rmkaplan@stanford.edu and thking@stanford.edu), call us (Ron: 812-4348; Tracy: 812-4808), or talk to us during office hours.