Due: Tuesday, March 2 (by midnight)
Submit assignments electronically to both professors
(rmkaplan@stanford.edu and thking@stanford.edu)
Turn in: the final grammar you end up with for the sublexical rule part (eng-week8.lfg)
Please name your grammar name-eng-week8.lfg
Exercises on:
PART 1: Fragments
PART 2: Shallow markup
There is one grammar this week; it uses the same default tokenizer and morphology as last time.
If you put a file called .xlerc in the directory with your grammar and in .xlerc you put:
create-parser eng-week8.lfg
then whenever you start xle in that directory, it will automatically load eng-week8.lfg. This will save a lot of time when making and testing changes.
For this part, you should use the grammar eng-week8.lfg and the tokenizer default-parse-tokenizer.fsmfile and the morphological guesser basic.english.fst.
The goal of this exercise is to add a fragment grammar to the current grammar to improve robustness. If you look at the CONFIG, you will see that the REPARSECAT has already been defined:
REPARSECAT FRAGMENTS.
In addition, there is already a lexical entry for -token:
-token TOKEN * (^ TOKEN)= %stem; ONLY.
although this will need to be augmented slightly.
Modify the current FRAGMENTS rule (which just goes to FALSE) to cover categories such as NP and VP. Use at least four categories that you think will be useful. Note that in order to parse a VP as a FRAGMENT, you will need to provide a null subject for the verb in the VP disjunct. The basic form of a FRAGMENTS rule is:
FRAGMENTS -->
   { XP: (^ FIRST)=!
   | YP: (^ FIRST)=!
   | TOKEN: (^ FIRST)=! }
   (FRAGMENTS: (^ REST)=!).
where TOKEN is a specially defined category to match things that do not fit in the XP or YP possibilities. Try parsing some things with your new grammar such as:
the the girl laughs.
the girl ! laughs.
? [ thesk-Tehjsk .
NOTE: To parse the last one, you have to surround the string with {} instead of "". (XLE is a bit picky about what square brackets mean and if you have one in initial position it gets confused.) For example:
parse {? [ thesk-Tehjsk .}
You should be getting a lot of parses. This is because nothing constrains the FRAGMENTS rule to build as few chunks as possible or to avoid tokens unless necessary.
Add OT marks to the FRAGMENTS rule and to the -token entry to constrain your rule. Make the OT marks ungrammatical ones by prefixing them with an *; this way you will be able to tell quickly if you have triggered the fragment grammar. For the "sentence":
the the girl laughs.
your grammar should get *1 parse (plus a lot of suboptimals). For other things, you may still be getting a lot of parses, but they should be fewer than you were getting before the OT marks were added.
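As a rough sketch of how the pieces above might fit together, the fragment below instantiates the basic FRAGMENTS form with four categories and attaches OT marks. The choice of NP, VP, PP, and S, the mark names Fragment and Token, the 'pro' null subject, and the OPTIMALITYORDER line are all assumptions made for illustration, not the expected solution. In particular, the grammar's CONFIG may already list other marks, in which case the new ones would be added to the existing line rather than replacing it.

```lfg
"Sketch only: the categories and the mark names are assumptions."
FRAGMENTS -->
   { NP: (^ FIRST)=!
         Fragment $ o::*
   | VP: (^ FIRST)=!
         (! SUBJ PRED)='pro'    "null subject for the verb in the VP"
         Fragment $ o::*
   | PP: (^ FIRST)=!
         Fragment $ o::*
   | S:  (^ FIRST)=!
         Fragment $ o::*
   | TOKEN: (^ FIRST)=!
         Fragment $ o::* }
   (FRAGMENTS: (^ REST)=!).

"Lexicon: the -token entry with an additional mark of its own."
-token TOKEN * (^ TOKEN)=%stem Token $ o::*; ONLY.

"CONFIG: declare both marks as ungrammatical by prefixing them with *."
OPTIMALITYORDER *Token *Fragment.
```

Because every chunk contributes one Fragment mark, analyses with fewer chunks incur fewer marks, and the separate Token mark makes a TOKEN chunk costlier than a phrasal one.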
Use the grammar you created for Part 1 as the input grammar for Part 2. You will just be making some additional changes. This part does not depend on what you did for the previous part.
NOTE: if in xle you type:
set-OT-rank Fragment NOGOOD
where Fragment is the name of the OT mark you used to constrain your FRAGMENTS rule, then xle will not parse any FRAGMENTS. This will make debugging much easier. You can add this line into your .xlerc so that when you reload the FRAGMENTS will automatically be turned off.
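Putting the two .xlerc lines mentioned so far into one file might look like the sketch below. It assumes the grammar file is called eng-week8.lfg and that the OT mark on your FRAGMENTS rule is named Fragment; substitute your own names if they differ.

```tcl
# .xlerc, placed in the same directory as the grammar
# Load the grammar automatically whenever xle starts here.
create-parser eng-week8.lfg
# Turn fragment parses off while debugging (assumes the mark is named Fragment).
set-OT-rank Fragment NOGOOD
```

Removing or commenting out the last line turns the fragment grammar back on the next time xle is started.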
The goal of this exercise is to modify METARULEMACRO to allow for labelled bracketing. We are going to do this in two steps: (1) allow for bracketing of constituents; (2) allow for optional labels that match only specific categories.
Add a disjunct to METARULEMACRO that allows square brackets around any category. You will need to put in lexical entries for the square brackets. You should be able to parse things like:
[the girl] laughs.
the girl [devoured a banana] .
[the girl] [devoured [a banana]].
NOTE: To parse these, you have to surround the string with {} instead of "". (XLE is a bit picky about what square brackets mean and if you have one in initial position it gets confused.) For example:
parse {[the girl] laughs.}
You should not be able to parse things with brackets around non-constituents:
the [girl devoured] a banana.
where "not able to parse" means getting 0 parses (not even a tree) if you have the fragments turned off and some type of fragment parse if they are turned on.
Once your unlabelled brackets are working, you can add in the labels. The lexical entries for the labels are already in the grammar as:
LAB-NP CAT[NP] *; ONLY.
LAB-VP CAT[VP] *; ONLY.
LAB-PP CAT[PP] *; ONLY.
You want to add these into METARULEMACRO optionally after the opening left square bracket. The fact that these are complex categories (which were discussed in class but never used in the homeworks before) allows you to write a very succinct rule to match the category. In particular, that part of the rule should look like:
(CAT[_CAT])
where the _CAT is going to match the _CAT that is an argument of METARULEMACRO. By adding this in, your grammar should now parse all the unlabelled bracket examples above and:
[LAB-NP the girl] laughed.
the girl [LAB-VP laughed].
[LAB-NP the girl] [LAB-VP laughed].
It should not be able to parse things like:
[LAB-AP the girl] laughed.
the girl [LAB-NP laughed].
where "not able to parse" means getting 0 parses (not even a tree) if you have the fragments turned off and some type of fragment parse if they are turned on.
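One possible shape for the bracketing disjunct described above is sketched here. It is an illustration under assumptions, not the expected answer: LSB and RSB are hypothetical category names for the square-bracket lexical entries you add yourself, the parameter list (_CAT _BASECAT _RHS) is the usual one but may differ in eng-week8.lfg, and the disjuncts already present in the macro (such as coordination) are left out.

```lfg
"Sketch only: LSB and RSB are hypothetical categories for [ and ]."
METARULEMACRO(_CAT _BASECAT _RHS) =
   { _RHS                        "ordinary expansion of the rule"
   | LSB (CAT[_CAT]) _CAT RSB    "bracketed constituent with an optional label"
   }.
```

Since CAT[_CAT] is instantiated with the same _CAT as the bracketed constituent, a LAB-NP label can only sit inside brackets that surround an NP, which is why a string like the girl [LAB-NP laughed] finds no matching rule.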
Parse a few of these ungrammatical ones with the FRAGMENTS turned on. You may notice some rather odd FRAGMENTS. To rein these in, restrict METARULEMACRO slightly by blocking FRAGMENTS as a _CAT from coordination and from the labelled bracketing. Do this. You should use the constraint on SC-COORD in METARULEMACRO as a model.
In the bigger grammars, POS tagging is constrained by the finite-state machines in the morphconfig as described in class. In this exercise, we are going to mock up POS tagging in the sublexical rules so that you can see the ways in which having POS tags can constrain ambiguity.
There are three POS tags in the lexicon already:
POS-N TAG-N *; ONLY.
POS-A TAG-A *; ONLY.
POS-V TAG-V *; ONLY.
The goal is to be able to parse strings like:
the happy POS-A girl POS-N laughed POS-V.
But not be able to parse strings like:
the girl POS-A laughed.
the girl laughed POS-A.
To do this, you need to modify the A, N, and V sublexical rules. In particular, keep the sublexical part that is already there the same. You want to add a disjunct to each rule that allows the category, e.g., V, to be that category followed by the correct part of speech tag. You will end up with trees that look like:
     N
    / \
   N   TAG-N
   |     |
 girl  POS-N
where you can click on the N over girl to expand out the sublexical parts. For example:
                       N
                  /         \
             N                  TAG-N
        /    |     \              |
N_BASE  N_SFX_BASE  N_NUM_BASE  POS-N
   |         |           |
 girl      +Noun        +Sg
If you are having trouble with these, (1) make sure to turn off the fragments as mentioned above and (2) try parsing them at a relatively low level until you get the trees you want. For example:
NP: girls POS-N
AP: happy POS-A
should get a tree and an f-structure in this grammar.
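The trees above suggest a rule of roughly the following shape for N. This is a sketch under assumptions: the first disjunct stands in for whatever sublexical expansion the grammar already has (the three daughters are read off the expanded tree and may not match the actual rule), and only the second disjunct is new.

```lfg
"Sketch only: the first disjunct represents the existing rule, unchanged."
N --> { N_BASE N_SFX_BASE N_NUM_BASE    "existing sublexical expansion"
      | N TAG-N                         "new: a noun followed by its POS tag"
      }.
```

The A and V rules would follow the same pattern with TAG-A and TAG-V. Because the tag category is fixed in each rule, a POS-A after a noun has no rule to attach it, which is what rules out the girl POS-A laughed.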
Once you have these in place, you should get two parses when you parse:
the happy girls laughed.
NP: the tractor trailer
but only one parse when you parse:
the happy POS-A girls laughed.
NP: the tractor POS-N trailer
Finally, try out the combination of labelled bracketing and POS tagging. For:
[LAB-NP the happy POS-A girls] devoured [LAB-NP the bananas in the park].
you should get one parse.
Turn in: The new version of your grammar.
If you have any questions, you can send us email (rmkaplan@stanford.edu and thking@stanford.edu), call us (Ron: 812-4348; Tracy: 812-4808), or talk to us during office hours.