Parameter Estimation for Statistical Parsing Models: Theory and Practice of Distribution-Free Methods

Chapter in New Developments in Parsing Technology

Part of the book series: Text, Speech and Language Technology (TLTB, volume 23)

Abstract

A fundamental problem in statistical parsing is the choice of criteria and algorithms used to estimate the parameters in a model. The predominant approach in computational linguistics has been to use a parametric model with some variant of maximum-likelihood estimation. The assumptions under which maximum-likelihood estimation is justified are arguably quite strong. This chapter discusses the statistical theory underlying various parameter-estimation methods, and gives algorithms which depend on alternatives to (smoothed) maximum-likelihood estimation. We first give an overview of results from statistical learning theory. We then show how important concepts from the classification literature, specifically generalization results based on margins on training data, can be derived for parsing models. Finally, we describe parameter-estimation algorithms which are motivated by these generalization bounds.
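The margin notion mentioned in the abstract can be made concrete. In the linear-model setting of the reranking work cited below, with a feature mapping Φ, weight vector w, and candidate set GEN(x) (this notation is assumed from that literature, not spelled out by the abstract itself), the margin on a training example (x_i, y_i) is the score gap between the correct parse and its closest competitor:

$$
m_i \;=\; \langle \mathbf{w}, \Phi(x_i, y_i)\rangle \;-\; \max_{y \in \mathrm{GEN}(x_i),\, y \neq y_i} \langle \mathbf{w}, \Phi(x_i, y)\rangle .
$$

Generalization bounds of the kind the abstract refers to typically bound test error in terms of how many training examples have small or negative margin m_i.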
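As a rough, self-contained illustration of one estimation algorithm in this family, here is a minimal sketch of a perceptron update for reranking candidate parses, in the spirit of Collins (2002b) and Collins and Duffy (2002) cited below. The data layout (one sparse feature dict per candidate) and the function name are illustrative assumptions, not the chapter's own interface:

```python
from collections import defaultdict

def perceptron_rerank(train, n_epochs=5):
    """Perceptron training over reranking examples.

    `train` is a list of (candidates, gold_index) pairs, where `candidates`
    holds one sparse feature dict per candidate parse and `gold_index`
    marks the correct parse. This layout stands in for the abstract
    candidate set GEN(x) and feature mapping Phi(x, y) of the chapter's
    setting.
    """
    w = defaultdict(float)
    for _ in range(n_epochs):
        for candidates, gold_index in train:
            # Score every candidate parse under the current weights.
            scores = [sum(w[f] * v for f, v in cand.items())
                      for cand in candidates]
            predicted = max(range(len(candidates)), key=scores.__getitem__)
            if predicted != gold_index:
                # Mistake-driven update: promote the gold parse's features,
                # demote those of the wrongly top-scored parse.
                for f, v in candidates[gold_index].items():
                    w[f] += v
                for f, v in candidates[predicted].items():
                    w[f] -= v
    return w
```

In practice Collins (2002b) averages the weight vector over all updates (the averaged/voted perceptron of Freund and Schapire, 1999), which markedly improves generalization; the sketch above omits that for brevity.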

References

  • Abney, S. (1997). Stochastic attribute-value grammars. Computational Linguistics, 23, 597–618.

  • Anthony, M. and P. L. Bartlett. (1999). Neural Network Learning: Theoretical Foundations. Cambridge University Press.

  • Bartlett, P. L. (1998). The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2):525–536.

  • Block, H. D. (1962). The perceptron: A model for brain functioning. Reviews of Modern Physics, 34, 123–135.

  • Bod, R. (1998). Beyond Grammar: An Experience-Based Theory of Language. CSLI Publications/Cambridge University Press.

  • Booth, T. L., and Thompson, R. A. (1973). Applying probability measures to abstract languages. IEEE Transactions on Computers, C-22(5), 442–450.

  • Collins, M. (1999). Head-Driven Statistical Models for Natural Language Parsing. PhD Dissertation, University of Pennsylvania.

  • Collins, M. (2000). Discriminative reranking for natural language parsing. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000), pages 175–182. San Francisco: Morgan Kaufmann.

  • Collins, M., and Duffy, N. (2001). Convolution kernels for natural language. In Dietterich, T. G., Becker, S., and Ghahramani, Z., (eds.) Advances in Neural Information Processing Systems 14 (NIPS 14). MIT Press, Cambridge, MA.

  • Collins, M., Schapire, R. E., and Singer, Y. (2002). Logistic regression, AdaBoost and Bregman distances. Machine Learning, 48(1–3):253–285.

  • Collins, M., and Duffy, N. (2002). New ranking algorithms for parsing and tagging: Kernels over discrete structures, and the voted perceptron. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL 2002), pages 263–270. San Francisco: Morgan Kaufmann.

  • Collins, M. (2002a). Ranking algorithms for named-entity extraction: Boosting and the voted perceptron. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL 2002), pages 489–496. San Francisco: Morgan Kaufmann.

  • Collins, M. (2002b). Discriminative training methods for hidden markov models: Theory and experiments with the perceptron algorithm. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002), pages 1–8.

  • Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3):273–297.

  • Crammer, K., and Singer, Y. (2001). On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265–292.

  • Cristianini, N. and Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines (and other Kernel-Based Learning Methods). Cambridge University Press.

  • Della Pietra, S., Della Pietra, V., and Lafferty, J. (1997). Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19, 380–393.

  • Devroye, L., Györfi, L., and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Springer.

  • Demiriz, A., Bennett, K. P., and Shawe-Taylor, J. (2001). Linear programming boosting via column generation. Machine Learning, 46(1):225–254.

  • Elisseeff, A., Guermeur, Y., and Paugam-Moisy, H. (1999). Margin error and generalization capabilities of multiclass discriminant systems. Technical Report NeuroCOLT2, 1999-051.

  • Freund, Y. and Schapire, R. (1997). A decision-theoretic generalization of online learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139.

  • Freund, Y. and Schapire, R. (1999). Large margin classification using the perceptron algorithm. Machine Learning, 37(3):277–296.

  • Freund, Y., Iyer, R., Schapire, R. E., and Singer, Y. (1998). An efficient boosting algorithm for combining preferences. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML 1998), pages 170–178. Morgan Kaufmann.

  • Friedman, J. H., Hastie, T., and Tibshirani, R. (2000). Additive logistic regression: A statistical view of boosting. Annals of Statistics, 28(2), 337–374.

  • Joachims, T. (1998). Making large-scale SVM learning practical. In (Scholkopf et al., 1998), pages 169–184.

  • Johnson, M., Geman, S., Canon, S., Chi, Z., and Riezler, S. (1999). Estimators for stochastic “unification-based” grammars. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL 99), pages 535–541. San Francisco: Morgan Kaufmann.

  • Lafferty, J. (1999). Additive models, boosting, and inference for generalized divergences. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory (COLT’99), pages 125–133.

  • Lafferty, J., McCallum, A., and Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), pages 282–289. Morgan Kaufmann.

  • Lebanon, G., and Lafferty, J. (2001). Boosting and maximum likelihood for exponential models. In Dietterich, T. G., Becker, S., and Ghahramani, Z., (eds.) Advances in Neural Information Processing Systems 14 (NIPS 14). MIT Press, Cambridge, MA.

  • Littlestone, N., and Warmuth, M. (1986). Relating data compression and learnability. Technical report, University of California, Santa Cruz.

  • Novikoff, A. B. J. (1962). On convergence proofs on perceptrons. In Proceedings of the Symposium on the Mathematical Theory of Automata, Vol XII, 615–622.

  • Platt, J. (1998). Fast training of support vector machines using sequential minimal optimization. In (Scholkopf et al., 1998), pages 185–208.

  • Ratnaparkhi, A., Roukos, S., and Ward, R. T. (1994). A maximum entropy model for parsing. In Proceedings of the International Conference on Spoken Language Processing (ICSLP 1994), pages 803–806. Yokohama, Japan.

  • Ratnaparkhi, A. (1996). A maximum entropy part-of-speech tagger. In Proceedings of the 1996 Conference on Empirical Methods in Natural Language Processing (EMNLP 1996), pages 133–142.

  • Rosenblatt, F. (1958). The Perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65, 386–408.

  • Schapire, R., Freund, Y., Bartlett, P., and Lee, W. S. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651–1686.

  • Scholkopf, B., Burges, C., and Smola, A. (eds.). (1998). Advances in Kernel Methods — Support Vector Learning, MIT Press.

  • Shawe-Taylor, J., Bartlett, P. L., Williamson, R. C., and Anthony, M. (1998). Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926–1940.

  • Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27(11):1134–1142.

  • Vapnik, V. N., and Chervonenkis, A. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications, 16(2):264–280.

  • Vapnik, V. N. (1998). Statistical Learning Theory. New York: Wiley.

  • Walker, M., Rambow, O., and Rogati, M. (2001). SPoT: A trainable sentence planner. In Proceedings of the 2nd Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL 2001), pages 17–24.

  • Zhang, T. (2002). Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2:527–550.

Copyright information

© 2004 Kluwer Academic Publishers

Cite this chapter

Collins, M. (2004). Parameter Estimation for Statistical Parsing Models: Theory and Practice of Distribution-Free Methods. In: Bunt, H., Carroll, J., Satta, G. (eds) New Developments in Parsing Technology. Text, Speech and Language Technology, vol 23. Springer, Dordrecht. https://doi.org/10.1007/1-4020-2295-6_2

  • DOI: https://doi.org/10.1007/1-4020-2295-6_2

  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-1-4020-2293-7

  • Online ISBN: 978-1-4020-2295-1
