Abstract
Many digitized Hindi text documents are generated daily on government sites, news portals, and in the public and private sectors, and these need to be classified effectively into mutually exclusive, pre-defined categories. Many Hindi text processing systems already exist for information retrieval, machine translation, text summarization, simplification, keyword extraction, and related parsing and linguistic tasks, but there is still wide scope for classifying the text extracted from Hindi documents into pre-defined categories using a classifier. This paper proposes a Hindi Text Classification model that accepts a set of known Hindi documents, preprocesses them at the document, sentence, and word levels, extracts features, and trains an SVM classifier, which then classifies a set of unknown Hindi documents. Text classification is challenging in Hindi because of its large set of conjuncts and letter combinations, its sentence structure, and its multisense words. The experiments were performed on a set of four Hindi documents in two categories, which the SVM classified with 100% accuracy.
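The pipeline described in the abstract (preprocess known documents, extract features, train an SVM, classify unknown documents) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the chapter's implementation: the abstract does not give the preprocessing rules, the feature set, or the SVM settings, so the stop-word list, the bag-of-words features, the linear SVM trained by subgradient descent on the hinge loss, and every function name below are hypothetical. The four toy training documents are invented for the example and are not the chapter's data.

```python
import re
import unicodedata

# Hypothetical stop-word list; the chapter's actual list is not given.
STOP_WORDS = {"ने", "को", "में", "है", "और", "का", "की", "के", "से", "पर"}
PUNCTUATION = ",;:\"'()-"

def preprocess(document):
    """Document level: normalize Unicode. Sentence level: split on the
    danda and other terminators. Word level: split on whitespace (so
    matras and conjuncts stay attached to their word), drop stop words."""
    text = unicodedata.normalize("NFC", document)
    sentences = [s for s in re.split(r"[।?!]", text) if s.strip()]
    tokens = []
    for sentence in sentences:
        for word in sentence.split():
            word = word.strip(PUNCTUATION)
            if word and word not in STOP_WORDS:
                tokens.append(word)
    return tokens

def build_vocabulary(token_lists):
    """Map each distinct training token to a feature index."""
    vocab = {}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab

def vectorize(tokens, vocab):
    """Bag-of-words term counts; unseen words are ignored."""
    vector = [0.0] * len(vocab)
    for token in tokens:
        if token in vocab:
            vector[vocab[token]] += 1.0
    return vector

def train_linear_svm(X, y, epochs=50, lr=0.1, lam=0.01):
    """Linear SVM via subgradient descent on the regularized hinge loss.
    Labels in y must be +1 or -1."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            margin = label * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            w = [wi * (1 - lr * lam) for wi in w]  # L2 regularization step
            if margin < 1:  # inside the margin or misclassified
                w = [wi + lr * label * xi for wi, xi in zip(w, x)]
                b += lr * label
    return w, b

def classify(document, vocab, w, b, positive="sports", negative="health"):
    """Label an unknown document by the sign of the decision function."""
    x = vectorize(preprocess(document), vocab)
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return positive if score >= 0 else negative

# Invented toy corpus: four known documents in two categories.
TRAIN = [
    ("भारत ने क्रिकेट मैच जीता। टीम ने अच्छा खेल दिखाया।", "sports"),
    ("खिलाड़ी ने मैच में शतक बनाया। टीम खुश है।", "sports"),
    ("डॉक्टर ने मरीज को दवा दी। अस्पताल में इलाज हुआ।", "health"),
    ("मरीज को बुखार है। डॉक्टर ने दवा लिखी।", "health"),
]

token_lists = [preprocess(doc) for doc, _ in TRAIN]
vocab = build_vocabulary(token_lists)
X = [vectorize(tokens, vocab) for tokens in token_lists]
y = [1 if label == "sports" else -1 for _, label in TRAIN]
w, b = train_linear_svm(X, y)

for unknown in ["टीम ने मैच जीता।", "डॉक्टर ने दवा दी।"]:
    print(unknown, "->", classify(unknown, vocab, w, b))
```

With so few documents and disjoint vocabularies per category, the toy data is trivially separable, so perfect accuracy here says nothing about performance on a realistic corpus.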
Copyright information
© 2019 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Puri, S., Singh, S.P. (2019). An Efficient Hindi Text Classification Model Using SVM. In: Peng, SL., Dey, N., Bundele, M. (eds) Computing and Network Sustainability. Lecture Notes in Networks and Systems, vol 75. Springer, Singapore. https://doi.org/10.1007/978-981-13-7150-9_24
DOI: https://doi.org/10.1007/978-981-13-7150-9_24
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-7149-3
Online ISBN: 978-981-13-7150-9
eBook Packages: Engineering, Engineering (R0)