Abstract
This paper presents a study of the use of neural probabilistic models in a syntactic-based language model. The neural probabilistic model uses a distributed representation of the items in the conditioning history and is effective at capturing long-range dependencies. Employing neural network based models in the syntactic language model enables it to make efficient use of the large amount of information available in a syntactic parse when estimating the probability of the next word in a string. Several scenarios for integrating neural networks into the syntactic language model are presented, accompanied by derivations of the associated training procedures. Experiments on the UPenn Treebank and the Wall Street Journal corpus show significant improvements in perplexity and word error rate (WER) over the baseline structured language model (SLM). Furthermore, comparisons with standard and neural-network-based N-gram models with arbitrarily long contexts show that syntactic information is in fact very helpful in estimating word string probabilities. Overall, our neural syntactic language model achieves the best published perplexity and WER results for the given data sets.
Additional information
This work was supported by the National Science Foundation under grant No. IIS-0085940.
Editors:
Dan Roth and Pascale Fung
Cite this article
Emami, A., Jelinek, F. A Neural Syntactic Language Model. Mach Learn 60, 195–227 (2005). https://doi.org/10.1007/s10994-005-0916-y