COMET - Conditional Maximum Entropy
Estimation from Truncated Data
A Machine Learning Tool for Statistical Disambiguation
in Constraint-Based Parsing
Research Goal
Highly precise disambiguation of analyses is a key problem and
prerequisite for real-world applications of broad-coverage
parsing. The goal of the COMET project is to apply statistical machine
learning techniques to induce disambiguation routines for
broad-coverage constraint-based parsers automatically from data. The
grammar and parsing tools used in this project are developed in the
Pargram project for the XLE parsing system.
Current work on statistical estimation and disambiguation can be
summarized under the name COMET - Conditional Maximum Entropy
Estimation from Truncated Data.
Estimation and Disambiguation Tools
Depending on the availability of data (fully labeled, partially
labeled, unlabeled) and the complexity of the ambiguity space of the
grammars, different estimators have been invented, implemented, and evaluated.
- COMET - Conditional Maximum Entropy Estimation from Truncated
Data: The COMET estimator and disambiguator performs discriminative estimation of maximum entropy models from partially labeled data. For such truncated data, the discriminative (conditional) criterion is defined with respect to the set of grammar parses consistent with the treebank annotations. That is, the treebank annotations are used to guide the discriminative estimation of an exponential probability model on linguistically fine-grained parses. This estimator and disambiguator has been trained and tested in a large-scale experiment using the UPenn Wall Street Journal data set.
References:
- Stefan Riezler, Tracy H. King, Ronald M. Kaplan, Richard Crouch, John T. Maxwell III, Mark Johnson (2002). Parsing the Wall Street Journal using a Lexical-Functional Grammar and Discriminative Estimation Techniques. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL'02), Philadelphia, PA.
- Richard Crouch, Ronald M. Kaplan, Tracy H. King, Stefan Riezler (2002). A Comparison of Evaluation Metrics for a Broad Coverage Stochastic Parser. In Proceedings of the Workshop on "Parseval and Beyond" at the 3rd International Conference on Language Resources and Evaluation (LREC'02), Las Palmas, Spain.
- The Pseudo-Likelihood Estimator: This estimator assumes
fully labeled data and performs maximum a posteriori estimation of
conditional probabilities p(x|y) of analyses x given
their corresponding sentences y. Conditioning the parse
space on the sentences of the training corpus both makes estimation
tractable and yields improved performance due to the discriminative
approach of the estimation procedure.
References:
- LoLIDa - Estimating Log-Linear Models from Incomplete Data:
This estimator applies to the case where no training labels are
available, i.e., there is no corpus of parsed and manually
disambiguated sentences. It uses the EM algorithm (Dempster et al.,
1977) to maximize posterior marginal probabilities of the corpus of
parsed sentences.
References:
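The conditional criterion described for the COMET estimator above can be illustrated with a small sketch. For each sentence y, the model assigns p(x|y) proportional to exp(w · f(x)) over all grammar parses x, and training maximizes the log-probability of the set of parses consistent with the annotation. Everything in the block is an assumption made for illustration: the function names, the two-feature toy corpus, and the use of plain unregularized gradient ascent in place of whatever optimizer and prior the actual system uses. With exactly one consistent parse per sentence, the same objective reduces to the fully labeled conditional case.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def score(weights, features):
    """Unnormalized log-score w . f(x) of one parse."""
    return sum(w * f for w, f in zip(weights, features))

def expected_features(weights, parses):
    """Feature expectation under the log-linear model restricted to `parses`."""
    scores = [score(weights, f) for f in parses]
    log_z = log_sum_exp(scores)
    probs = [math.exp(s - log_z) for s in scores]
    return [sum(p * f[k] for p, f in zip(probs, parses))
            for k in range(len(weights))]

def log_likelihood(weights, corpus):
    """Sum over sentences of log p(consistent parses | sentence)."""
    total = 0.0
    for parses, consistent in corpus:
        all_scores = [score(weights, f) for f in parses]
        cons_scores = [all_scores[i] for i in consistent]
        total += log_sum_exp(cons_scores) - log_sum_exp(all_scores)
    return total

def gradient(weights, corpus):
    """E[f | consistent parses] - E[f | all parses], summed over sentences."""
    grad = [0.0] * len(weights)
    for parses, consistent in corpus:
        e_cons = expected_features(weights, [parses[i] for i in consistent])
        e_all = expected_features(weights, parses)
        for k in range(len(weights)):
            grad[k] += e_cons[k] - e_all[k]
    return grad

def train(corpus, dim, steps=200, learning_rate=0.5):
    """Plain gradient ascent (an illustrative stand-in for a real optimizer)."""
    weights = [0.0] * dim
    for _ in range(steps):
        g = gradient(weights, corpus)
        weights = [w + learning_rate * gk for w, gk in zip(weights, g)]
    return weights

def disambiguate(weights, parses):
    """Index of the highest-scoring parse of a sentence."""
    return max(range(len(parses)), key=lambda i: score(weights, parses[i]))

# Hypothetical toy corpus: each entry is (feature vectors of all parses,
# indices of the parses consistent with the partial annotation).
TOY_CORPUS = [
    ([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0, 2]),  # two parses remain consistent
    ([[1.0, 0.0], [0.0, 1.0]], [0]),                 # fully disambiguated
]

if __name__ == "__main__":
    w = train(TOY_CORPUS, dim=2)
    print("weights:", w)
    print("log-likelihood:", log_likelihood(w, TOY_CORPUS))
    print("best parse, sentence 1:", disambiguate(w, TOY_CORPUS[0][0]))
```

The gradient has the usual form for log-linear models with partially observed labels: the feature expectation over the consistent parses minus the expectation over all parses. Because the toy data are separable, the weights keep growing with more steps; a practical setup would typically add a regularizing prior.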
For more information and references, visit the NLTT page.
Last modified: Friday, 06-Oct-2006 13:01:07 PDT