XLE

XLE consists of cutting-edge algorithms for parsing and generating Lexical Functional Grammars (LFGs) along with a rich graphical user interface for writing and debugging such grammars. It is the basis for the Parallel Grammar Project, which is developing industrial-strength grammars for English, French, German, Norwegian, Japanese, and Urdu. XLE is written in C and uses Tcl/Tk for the user interface. It currently runs on Solaris Unix, Linux, and Mac OS X.

One of the main goals of the XLE is to parse and generate with LFGs efficiently. This is difficult because the LFG formalism, like most unification-based grammar formalisms, is NP complete. This means that in the worst case the time that it takes to parse or generate with an LFG can be exponential in the length of the input. However, natural languages are mostly context-free equivalent, and one should be able to parse them in mostly cubic time. XLE is designed to automatically take advantage of context-freeness in the grammar of a natural language so that it typically parses in cubic time and generates in linear time. This lets grammar writers write grammars in an expressive formalism without necessarily sacrificing performance.

How the Parser Works

There are three key ideas that XLE uses to make its parser efficient. The first idea is to pay careful attention to the interface between the phrasal and functional constraints. In particular, XLE processes all of the phrasal constraints first using a chart, and then uses the results to decide which functional constraints to process. This is more efficient than interleaving phrasal and functional constraints since the phrasal constraints can be processed in cubic time, whereas the functional constraints may take exponential time. Thus, the phrasal constraints make a good polynomial filter for the functional constraints.

The second key idea is to use contexted unification to merge multiple feature structures together into a single, packed feature structure. For instance, here is a contexted feature structure for "They saw the girl with the telescope":

In this sentence, "saw" is ambiguous between the past tense of "saw" (labeled a:1) and the present tense of "see" (a:2). Also, "with the telescope" is either an adjunct of "saw" (b:1) or "the girl" (b:2). The contexts a:1 and a:2 are mutually exclusive. Similarly for b:1 and b:2. Furthermore, a:1 and a:2 are independent from b:1 and b:2. So this feature structure represents four different solutions. XLE uses a contexted feature structure like this for each edge in a chart to represent all of the possible feature structures that the edge can have.

The third key idea is to use lazy contexted copying during unification. Lazy contexted unification only copies up as much of the two daughter feature structures of a subtree as is needed to determine whether the feature structures are unifiable. If the interactions between the daughter feature structures are bounded, then XLE will do a bounded amount of work per subtree. Since there are at most a cubic number of subtrees in the chart, then XLE can parse sentences in cubic time when the interactions between daughter feature structures are bounded.

Here is a scatterplot of XLE applied to a corpus of sentences from the documentation of the HomeCentre, a multi-function copier/printer device marketed by Xerox. Each sentence is plotted according to the number of local subtrees that it has (on the x-axis) and the time that it takes to parse the sentence (on the y-axis). If the time grows linearly in the number of subtrees, then this means that a bounded amount of work is being done per subtree and so XLE is parsing in cubic time. By doing a linear regression we find that 79 percent of the time can be explained by the number of subtrees in this corpus. The data points that are higher than expected correspond to sentences that have a lot of coordination and/or long-distance dependencies. These constructions require a lot of the feature structure to be copied up, and so they increase the effective grammar constant.

Ambiguity Management

XLE produces a packed representation of all possible solutions as its output. This has a number of advantages over trying to pick the "right" solution:

Robustness

XLE uses a form of Optimality Theory that allows the grammar writer to indicate that certain constructions are dispreferred. This means that the grammar can parse sentences that are ill-formed in specific ways. For instance, the grammar may be able to parse sentences with subject-verb disagreements (e.g. "They walks"). Because the ill-formed analyses are dispreferred, they are not included in the output if there are well-formed analyses present. In addition, XLE has the capability of producing well-formed fragments if the grammar doesn't cover the entire input. The combination of these two capabilities makes XLE robust in the face of ill-formed inputs and shortfalls in the coverage of the grammar.

Documentation

Documentation for the current version of XLE is available here.

Obtaining XLE

XLE is available for non-commercial purposes by filling out this license and mailing it to maxwell "at" parc.com. Then contact miriam.butt "at" uni-konstanz.de to obtain XLE. Here is the list of current licensees.

If you are interested in using XLE for educational purposes or purely academic research, then we recommend that you look at the Grammar Writer's Workbench. The Grammar Writer's Workbench is available at no cost by anonymous FTP under a simple license.

Acknowledgements

The XLE project began in October of 1993 and was originally a joint project between the NLTT group at PARC and the MLTT group in Grenoble. It evolved from the Grammar Writer's Workbench, which was implemented by Ron Kaplan and John Maxwell. John Maxwell was the technical lead for XLE and also designed the algorithms and implemented the user interface. Hadar Shemtov did the initial implementation of the parser and generator, and Max Copperman wrote the LFG formalism parser and the bracketing tool. Paula Newman wrote the morphology and lexicon components and also worked on the generator. Andreas Eisele implemented complex categories and parameterized rules. Martin Kay wrote a prototype system for doing transfer in Prolog. Dick Crouch rewrote the transfer system and is also responsible for the rest of the Prolog subsystem. Alex Vasserman wrote the code for stochastic disambiguation under the supervision of Stefan Riezler. Ron Kaplan and Annie Zaenen provided managerial support. XLE makes heavy use of the c-fsm package for processing the morphology, which was written in C by Lauri Karttunen and others in Grenoble based on work done by Ron Kaplan in Interlisp.

Recent Papers

See PARC's publication database.

Older Papers

For more information, contact maxwell "at" parc.com.