About NLTT
NLTT is one of the oldest research groups at PARC, and it has been the
source of innovations in computational linguistics and linguistic
theory that have been broadly influential. Past achievements include:
- We developed the theory, algorithms, and engineering platforms for
Finite-State Morphology. This is now a standard technology for
low-level text processing tasks such as spell-checking, stemming,
named-entity identification, OCR language modeling, and information
extraction. It has been embedded in many commercial products, both from
Xerox and through licenses and spin-off companies. These include
the Terminology Suite from Xerox MKMS, LinguistX from Inxight, and
TextBridge from Scansoft.
Finite-state databases for 12 Western and 4 Asian languages are
currently offered for sale, and preliminary versions of databases for
Eastern European languages and Arabic have been constructed by
developers in the Grenoble MLTT laboratory, a part of Xerox Research
Centre Europe.
- In collaboration with colleagues at MIT and Stanford, we created the
theory of Lexical Functional Grammar, an approach to grammatical
description that enables the
mapping from linguistic expressions to canonical representations to be
characterized as a system of constraints. This theory has been used
to solve descriptive problems in a wide range of the world's
languages, and has proven to be suitable for languages as different as
English, Japanese, Warlpiri (an Australian aboriginal language), and
Urdu (an Indo-Aryan language spoken in South Asia). There are now
also several textbooks on LFG, and it is taught in many universities.
Of equal importance, the constraint-based approach makes it very easy
to provide a computational interpretation for the theory.
Many different research groups have constructed implementations; these
include the Grammar Writer's Workbench for Lexical Functional Grammar,
originally developed as a research platform by NLTT and now used
primarily for teaching purposes.
Our current research centers on XLE,
a full-scale implementation of Lexical Functional Grammar. Given an
LFG grammar and lexicon for a particular language, this system does
high-speed parsing (converting strings to canonical
meaning-representations) and generation (converting
meaning-representations to the strings that express them) for that
language. XLE also includes an efficient, ambiguity-enabled ordered
rewrite system that processes parser output to produce deeper semantic
and Knowledge Representation structures and to support applications
such as sentence condensation, machine translation, and entity
relation extraction.
- We discovered a general solution to the most difficult computational
problem in language processing, the problem of dealing in a practical
but still correct way with all of the ambiguity that occurs in natural
language sentences. Ambiguity gives rise to exponential--and hence
intractable--computations in most implementations, but our method of
Disjunctive Constraint Satisfaction, implemented in the XLE system,
provides polynomial time bounds for the constructions that typically
appear in
human languages. Our method gives us a systematic "ambiguity
management" capability that enables us to avoid exponential explosions
even when alternative linguistic interpretations are passed to other
components of a larger application.
- We have developed a new model of statistical disambiguation for
constraint-based parsing. This
project applies statistical machine learning techniques to
automatically induce disambiguation routines for broad-coverage
constraint-based parsing of the kind that the XLE system carries out with
the Pargram grammars. Current work on statistical estimation and
disambiguation can be summarized under the name COMET (Conditional
Maximum-Entropy Estimation from Truncated Data). The key points of this
research are flexibility and robustness in the design of the
disambiguation model, the possibility of automatic inference of
disambiguation weights by statistical estimation from partially
annotated / truncated data, and improved disambiguation performance by
discriminative estimation.
- We developed a new approach to the interface between syntax and
semantics, allowing a logical
representation of a linguistic structure to be derived on the basis of
its syntactic structure. Knowledge-based systems can then internalize
and reason about these logical formulas. In this theory of "glue
semantics", linear logic is used to do the interpretation, and this
provides an abstract and simple way of dealing with many semantic
difficulties. The XLE-based glue semantics implementation takes
advantage of our expertise in ambiguity management to rapidly derive
semantic representations for input sentences.
- We have developed a custom implementation of a scalable,
high-performance search engine optimized for semantically based
information retrieval. This system indexes the semantic
representations that XLE generates for each sentence in a corpus,
capturing the logical relations among concepts. When a user enters a
query, the linguistic structure of the query is compared with the
structures in the index, and the system finds the most relevant
matches based on the meaning of the text. By designing the engine
from the ground up to efficiently index and compare semantic
representations, we are able to perform deep semantic search at speeds
comparable to state-of-the-art full-text search systems. The semantic
information is compressed into a format that enables the system to
quickly locate sections of text that are likely to contain relevant
passages. These candidate passages are then aligned using a
unification-based algorithm that selects those that are actually
responsive to the query.
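The statistical disambiguation item in the list above rests on conditional maximum-entropy (log-linear) models. As a minimal sketch of that general model family, and not of COMET itself, the following scores each candidate parse of one sentence by a weighted sum of its features and normalizes over the candidates. The feature names, feature values, and weights are hypothetical and chosen only for illustration.

```python
import math


def parse_probabilities(candidates, weights):
    """Conditional log-linear model over the parses of ONE sentence.

    p(parse | sentence) is proportional to exp(sum_i w_i * f_i(parse)).
    `candidates` maps a parse label to its feature dictionary.
    """
    scores = [
        sum(weights.get(name, 0.0) * value for name, value in features.items())
        for features in candidates.values()
    ]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    return {label: e / total for label, e in zip(candidates, exps)}


# Hypothetical features for two parses of an attachment ambiguity.
candidates = {
    "pp_attaches_to_verb": {"pp_on_verb": 1.0, "depth": 3.0},
    "pp_attaches_to_noun": {"pp_on_noun": 1.0, "depth": 4.0},
}
# Hypothetical weights; in practice these would be estimated from data.
weights = {"pp_on_verb": 0.8, "pp_on_noun": 0.3, "depth": -0.1}

probs = parse_probabilities(candidates, weights)
best = max(probs, key=probs.get)
print(best, probs)
```

Because the normalization runs only over the parses of the sentence at hand, the model is discriminative: it learns to prefer the right parse among the alternatives the grammar licenses, rather than modeling the sentences themselves.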
Our current activities center on well-engineered, comprehensive
systems that provide efficient implementations for linguistic
processing. The speed and space reductions that we achieved in the
finite-state arena are what enabled the large array of commercial
products. Our ambiguity management techniques are embedded and
exploited in the XLE system; whereas alternative systems typically
exhibit exponential behavior and can only operate on short sentences
or with specialized grammars, XLE performs well on long, real-world
sentences and on full, broad-coverage grammars with semantic
specifications.
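The contrast drawn above between exponential enumeration and managed ambiguity can be illustrated with a toy packed representation. This is a sketch under a strong simplifying assumption, namely that the choice points are independent and scored locally; Disjunctive Constraint Satisfaction as implemented in XLE handles interacting choices and is not reproduced here. The labels and scores are hypothetical.

```python
from itertools import product
from math import prod

# Packed form: one entry per choice point, each listing its alternatives
# with a hypothetical local score. Storage grows with the SUM of the
# alternatives, while the number of readings grows with their PRODUCT.
packed = [
    [("saw=verb", 0.9), ("saw=noun", 0.1)],
    [("PP->verb", 0.6), ("PP->noun", 0.4)],
    [("bank=river", 0.3), ("bank=finance", 0.7)],
]


def count_readings(packed):
    """Count the readings without enumerating them."""
    return prod(len(alternatives) for alternatives in packed)


def best_reading(packed):
    """Pick the best alternative at each choice point (linear work).

    Valid only because the choice points are assumed independent.
    """
    return [max(alts, key=lambda alt: alt[1])[0] for alts in packed]


def best_reading_by_enumeration(packed):
    """Baseline: expand every full reading (exponential work)."""
    best = max(product(*packed), key=lambda combo: prod(s for _, s in combo))
    return [label for label, _ in best]


print(count_readings(packed))
print(best_reading(packed))
```

Both functions select the same reading on this input, but only the enumerating baseline has to visit every combination; a component that consumes the packed form directly never pays that cost.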
The XLE system makes it (relatively) easy to write computational
grammars for different languages. At Xerox we have produced
large-scale LFG grammars of Chinese, English and French, and our
collaborators in the Parallel Grammar Project have produced
substantial grammars for Arabic, German, Japanese, and Norwegian.
These and other grammars, including
Turkish and Urdu, are continually under development.
These grammars and the XLE implementation are resources that can be
combined with other modules to make up various applications. For
example, we have constructed a semantics-based question answering
system that makes use of the ambiguity-management features of our
algorithms to map text into semantic (and then Knowledge
Representation) structures using the XLE parser and ordered rewrite
system. These structures are input to our scalable, high-performance
search engine. When a user enters a query into the system, the
linguistic structure of the query is compared with the structures in
the index, and the system finds the most relevant matches based on the
meaning of the text. These candidate passages are then aligned using
a unification-based algorithm that selects those that are responsive
to the query, and these are returned to the user. (See also our work on
Machine Translation.)
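The two-stage retrieval described above (index lookup for candidates, then unification-based alignment) can be sketched as follows. This is an illustrative toy and not the NLTT engine: the tuple encoding of semantic facts, the `?`-prefixed query variables, and all function names are assumptions made for the example.

```python
from collections import defaultdict


def is_var(term):
    """Query variables are written with a leading '?' (a convention of this sketch)."""
    return isinstance(term, str) and term.startswith("?")


def unify_fact(pattern, fact, bindings):
    """Extend `bindings` so that `pattern` matches `fact`, or return None."""
    if len(pattern) != len(fact):
        return None
    extended = dict(bindings)
    for p, f in zip(pattern, fact):
        if is_var(p):
            if extended.setdefault(p, f) != f:
                return None  # variable already bound to something else
        elif p != f:
            return None
    return extended


def match(patterns, facts, bindings=None):
    """Backtracking search for one consistent binding of all patterns."""
    bindings = {} if bindings is None else bindings
    if not patterns:
        return bindings
    for fact in facts:
        extended = unify_fact(patterns[0], fact, bindings)
        if extended is not None:
            result = match(patterns[1:], facts, extended)
            if result is not None:
                return result
    return None


def build_index(corpus):
    """Inverted index from each term to the sentences whose facts mention it."""
    index = defaultdict(set)
    for sentence_id, facts in corpus.items():
        for fact in facts:
            for term in fact:
                index[term].add(sentence_id)
    return index


def search(query, corpus, index):
    """Stage 1: cheap candidate lookup. Stage 2: unification filter."""
    constants = {t for pattern in query for t in pattern if not is_var(t)}
    postings = [index.get(c, set()) for c in constants]
    candidates = set.intersection(*postings) if postings else set(corpus)
    results = {}
    for sentence_id in sorted(candidates):
        bindings = match(query, corpus[sentence_id])
        if bindings is not None:
            results[sentence_id] = bindings
    return results


# Hypothetical semantic facts for three sentences.
corpus = {
    "s1": [("buy", "e1"), ("agent", "e1", "kim"), ("theme", "e1", "book")],
    "s2": [("buy", "e2"), ("agent", "e2", "lee"), ("theme", "e2", "car"),
           ("source", "e2", "kim")],
    "s3": [("sell", "e3"), ("agent", "e3", "kim"), ("theme", "e3", "book")],
}
index = build_index(corpus)

# "What did Kim buy?"
query = [("buy", "?e"), ("agent", "?e", "kim"), ("theme", "?e", "?what")]
print(search(query, corpus, index))
```

In this toy data, sentence s2 mentions every constant in the query and so survives the index lookup, but it is rejected in the second stage because Kim is the source rather than the agent of the buying event. That is the kind of distinction a purely term-based match cannot make.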
To summarize, we have developed a suite of concepts, theories,
algorithms, and implementations that make it easy to construct
descriptions of new, even "exotic" languages and to perform efficient, deep
analysis and generation with those descriptions in order to solve
practical language processing problems. More information about these
issues and our approach to them can be found in the
selected bibliography of NLTT publications.
Natural Language Theory and Technology
Palo Alto Research Center
3333 Coyote Hill Rd.
Palo Alto, CA 94304 USA
fax: (650) 812-4374
Last updated: Wednesday, 11-Oct-2006 08:57:58 PDT.