Abstract
A fundamental problem in statistical parsing is the choice of criteria and algorithms used to estimate the parameters in a model. The predominant approach in computational linguistics has been to use a parametric model with some variant of maximum-likelihood estimation. The assumptions under which maximum-likelihood estimation is justified are arguably quite strong. This chapter discusses the statistical theory underlying various parameter-estimation methods, and gives algorithms that rely on alternatives to (smoothed) maximum-likelihood estimation. We first give an overview of results from statistical learning theory. We then show how important concepts from the classification literature, specifically generalization results based on margins on training data, can be derived for parsing models. Finally, we describe parameter-estimation algorithms motivated by these generalization bounds.
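For concreteness (this illustration is ours, not taken from the chapter's preview): in a linear parsing model with feature map Φ and weight vector w, the margin on a training example (x_i, y_i) is typically defined as m_i = w·Φ(x_i, y_i) − max over candidate parses y ≠ y_i of w·Φ(x_i, y), so a positive margin means the gold parse outscores every competing parse. A minimal Python sketch of a perceptron-style estimator in this spirit, assuming a hypothetical candidate generator gen(x) and feature map phi(x, y):

```python
# Illustrative sketch, not the chapter's implementation.
# Assumptions: gen(x) enumerates candidate parses for sentence x, and
# phi(x, y) maps a (sentence, parse) pair to a sparse feature dict.

from collections import defaultdict

def dot(w, feats):
    """Inner product between a weight dict and a sparse feature dict."""
    return sum(w[f] * v for f, v in feats.items())

def perceptron_train(examples, gen, phi, epochs=5):
    """examples: list of (sentence, gold_parse) pairs."""
    w = defaultdict(float)
    for _ in range(epochs):
        for x, y_gold in examples:
            # Decode: pick the highest-scoring candidate under current weights.
            y_hat = max(gen(x), key=lambda y: dot(w, phi(x, y)))
            if y_hat != y_gold:
                # Additive update: toward the gold parse, away from the error.
                for f, v in phi(x, y_gold).items():
                    w[f] += v
                for f, v in phi(x, y_hat).items():
                    w[f] -= v
    return w
```

The appeal of updates of this kind is that their convergence and generalization guarantees depend on the margin achievable on the training data, rather than on the parametric model being correctly specified, which is the sense in which such estimators are distribution-free.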
Copyright information
© 2004 Kluwer Academic Publishers
Cite this chapter
Collins, M. (2004). Parameter Estimation for Statistical Parsing Models: Theory and Practice of Distribution-Free Methods. In: Bunt, H., Carroll, J., Satta, G. (eds) New Developments in Parsing Technology. Text, Speech and Language Technology, vol 23. Springer, Dordrecht. https://doi.org/10.1007/1-4020-2295-6_2
Print ISBN: 978-1-4020-2293-7
Online ISBN: 978-1-4020-2295-1