Abstract
A fundamental problem in statistical parsing is the choice of criteria and algorithms used to estimate the parameters in a model. The predominant approach in computational linguistics has been to use a parametric model with some variant of maximum-likelihood estimation. The assumptions under which maximum-likelihood estimation is justified are arguably quite strong. This chapter discusses the statistical theory underlying various parameter-estimation methods, and gives algorithms that rely on alternatives to (smoothed) maximum-likelihood estimation. We first give an overview of results from statistical learning theory. We then show how important concepts from the classification literature, specifically generalization results based on margins on training data, can be derived for parsing models. Finally, we describe parameter-estimation algorithms motivated by these generalization bounds.
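For concreteness (this illustration is ours, not taken from the chapter's preview): in a linear parsing model with feature map Φ and weight vector w, the margin on a training example (x_i, y_i) is typically defined as m_i = w·Φ(x_i, y_i) − max over candidate parses y ≠ y_i of w·Φ(x_i, y), so a positive margin means the gold parse outscores every competing parse. A minimal Python sketch of a perceptron-style estimator in this spirit, assuming a hypothetical candidate generator gen(x) and feature map phi(x, y):

```python
# Illustrative sketch, not the chapter's implementation.
# Assumptions: gen(x) enumerates candidate parses for sentence x, and
# phi(x, y) maps a (sentence, parse) pair to a sparse feature dict.

from collections import defaultdict

def dot(w, feats):
    """Inner product between a weight dict and a sparse feature dict."""
    return sum(w[f] * v for f, v in feats.items())

def perceptron_train(examples, gen, phi, epochs=5):
    """examples: list of (sentence, gold_parse) pairs."""
    w = defaultdict(float)
    for _ in range(epochs):
        for x, y_gold in examples:
            # Decode: pick the highest-scoring candidate under current weights.
            y_hat = max(gen(x), key=lambda y: dot(w, phi(x, y)))
            if y_hat != y_gold:
                # Additive update: toward the gold parse, away from the error.
                for f, v in phi(x, y_gold).items():
                    w[f] += v
                for f, v in phi(x, y_hat).items():
                    w[f] -= v
    return w
```

The appeal of updates of this kind is that their convergence and generalization guarantees depend on the margin achievable on the training data, rather than on the parametric model being correctly specified, which is the sense in which such estimators are distribution-free.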
Copyright information
© 2004 Kluwer Academic Publishers
Cite this chapter
Collins, M. (2004). Parameter Estimation for Statistical Parsing Models: Theory and Practice of Distribution-Free Methods. In: Bunt, H., Carroll, J., Satta, G. (eds) New Developments in Parsing Technology. Text, Speech and Language Technology, vol 23. Springer, Dordrecht. https://doi.org/10.1007/1-4020-2295-6_2
Print ISBN: 978-1-4020-2293-7
Online ISBN: 978-1-4020-2295-1