Abstract
Predicting the next word, letter, or phrase while a user is typing is a valuable tool for improving the user experience. Users frequently communicate, write reviews, and express their opinions on social media platforms, often while on the move. It has therefore become necessary to provide an application that reduces typing effort and spelling errors when time is limited. Because all kinds of social media platforms are used so extensively, text data keeps growing in size, and implementing a text prediction application is difficult given the volume of text that must be processed for language modeling. The primary objective of this paper is to process a large text corpus and implement a probabilistic model, such as N-grams, that predicts the next word from the user's input. In this exploratory research, n-gram models are discussed and evaluated using Good-Turing estimation, the perplexity measure, and the type-to-token ratio.
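The ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy corpus, the function names, and the whitespace tokenizer are all assumptions made for illustration. It builds a bigram model, ranks next-word candidates by frequency, computes the type-to-token ratio, and estimates the Good-Turing probability mass of unseen bigrams as N1/N. The perplexity function uses add-one (Laplace) smoothing as a simpler stand-in for full Good-Turing smoothing.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Assumption: lowercase whitespace tokenization; a real pipeline cleans more.
    return text.lower().split()

def type_token_ratio(tokens):
    # Number of distinct word types divided by the total number of tokens.
    return len(set(tokens)) / len(tokens)

def bigram_counts(tokens):
    # Map each word to a Counter of the words observed immediately after it.
    following = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        following[prev][nxt] += 1
    return following

def predict_next(following, word, k=3):
    # Rank continuations of `word` by raw bigram frequency.
    return [w for w, _ in following.get(word, Counter()).most_common(k)]

def good_turing_unseen_mass(following):
    # Good-Turing estimate of the total probability of unseen bigrams:
    # N1 / N, where N1 is the number of bigram types seen exactly once
    # and N is the total number of bigram tokens.
    counts = [c for ctr in following.values() for c in ctr.values()]
    n1 = sum(1 for c in counts if c == 1)
    return n1 / sum(counts)

def perplexity(following, vocab_size, tokens):
    # Bigram perplexity with add-one smoothing (a stand-in, not Good-Turing).
    pairs = list(zip(tokens, tokens[1:]))
    log_prob = 0.0
    for prev, nxt in pairs:
        ctr = following.get(prev, Counter())
        p = (ctr[nxt] + 1) / (sum(ctr.values()) + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(pairs))

# Hypothetical toy corpus, used only to exercise the functions.
tokens = tokenize("the cat sat on the mat the cat ate the fish")
following = bigram_counts(tokens)

print(predict_next(following, "the", k=1))
print(type_token_ratio(tokens))
print(good_turing_unseen_mass(following))
print(perplexity(following, len(set(tokens)), tokens))
```

On a corpus this small almost every bigram occurs once, so the estimated unseen mass is large. On a large corpus the same quantities indicate how much probability a smoothing method has to reserve for word sequences that were never observed.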
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Avasthi, S., Chauhan, R., Acharjya, D.P. (2021). Processing Large Text Corpus Using N-Gram Language Modeling and Smoothing. In: Goyal, D., Gupta, A.K., Piuri, V., Ganzha, M., Paprzycki, M. (eds) Proceedings of the Second International Conference on Information Management and Machine Intelligence. Lecture Notes in Networks and Systems, vol 166. Springer, Singapore. https://doi.org/10.1007/978-981-15-9689-6_3
DOI: https://doi.org/10.1007/978-981-15-9689-6_3
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-9688-9
Online ISBN: 978-981-15-9689-6
eBook Packages: Intelligent Technologies and Robotics, Intelligent Technologies and Robotics (R0)