An Efficient Hindi Text Classification Model Using SVM

  • Conference paper
Computing and Network Sustainability

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 75)

Abstract

Digitized Hindi text documents are generated daily on Government sites, on news portals, and in the public and private sectors, and they need to be classified effectively into mutually exclusive, pre-defined categories. Many Hindi text processing systems already exist in application domains such as information retrieval, machine translation, text summarization, simplification, keyword extraction, and related parsing and linguistic tasks, yet there remains wide scope for classifying the extracted text of Hindi documents into pre-defined categories using a classifier. This paper proposes a Hindi Text Classification model that accepts a set of known Hindi documents, preprocesses them at the document, sentence, and word levels, extracts features, and trains an SVM classifier, which then classifies a set of unknown Hindi documents. Such text classification is challenging in Hindi because of its large set of conjuncts and letter combinations, its sentence structure, and its multisense words. The experiments were performed on a set of four Hindi documents from two categories, which the SVM classified with 100% accuracy.
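The pipeline outlined in the abstract (preprocess at document, sentence, and word levels, extract features, train an SVM, classify unknown documents) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not state the feature set, kernel, or training procedure, so the term-frequency features, the linear SVM trained by subgradient descent on the hinge loss, the hyperparameters, the tiny stop-word list, the toy documents, and the "sports"/"health" category names are all assumptions made for the example.

```python
import re
import unicodedata
from collections import Counter

# Hypothetical stop-word list; a real system would need a much fuller one.
STOP_WORDS = {"ने", "को", "में", "की", "का", "के", "और", "है", "से", "पर"}
# Sentence delimiters: danda, double danda, and common Latin punctuation.
SENTENCE_END = re.compile(r"[।॥?!.]")
# Devanagari block without danda (U+0964) and double danda (U+0965).
# Python's \w does not match Devanagari vowel signs, so a range is used.
WORD = re.compile(r"[\u0900-\u0963\u0966-\u097F]+")
LABELS = {1: "sports", -1: "health"}  # assumed category names


def preprocess(document):
    """Document level: normalize; sentence level: split; word level: filter."""
    text = unicodedata.normalize("NFC", document)
    sentences = [s for s in SENTENCE_END.split(text) if s.strip()]
    tokens = []
    for sentence in sentences:
        tokens.extend(t for t in WORD.findall(sentence) if t not in STOP_WORDS)
    return tokens


def build_vocabulary(token_lists):
    """Map every distinct training token to a feature index."""
    terms = sorted({t for tokens in token_lists for t in tokens})
    return {term: i for i, term in enumerate(terms)}


def vectorize(tokens, vocab):
    """Term-frequency vector; tokens outside the vocabulary are ignored."""
    vec = [0.0] * len(vocab)
    for term, count in Counter(tokens).items():
        if term in vocab:
            vec[vocab[term]] = float(count)
    return vec


def train_linear_svm(X, y, lr=0.1, lam=0.01, epochs=20):
    """Linear SVM: per-sample subgradient steps on the L2-regularized hinge loss."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            margin = label * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            w = [wi * (1.0 - lr * lam) for wi in w]  # regularization shrink
            if margin < 1.0:  # sample violates the margin: move toward it
                w = [wi + lr * label * xi for wi, xi in zip(w, x)]
                b += lr * label
    return w, b


def classify(document, vocab, w, b):
    """Return the assumed category name for an unseen document."""
    x = vectorize(preprocess(document), vocab)
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return LABELS[1] if score >= 0 else LABELS[-1]


# Toy training set (invented for illustration): four documents, two categories.
TRAIN_DOCS = [
    "भारत ने क्रिकेट मैच जीता। टीम ने अच्छा खेल दिखाया।",
    "डॉक्टर ने मरीज को दवा दी। अस्पताल में इलाज हुआ।",
    "खिलाड़ी ने मैच में शतक बनाया। टीम की जीत हुई।",
    "मरीज को बुखार है। डॉक्टर ने दवा और आराम की सलाह दी।",
]
TRAIN_LABELS = [1, -1, 1, -1]

token_lists = [preprocess(d) for d in TRAIN_DOCS]
vocab = build_vocabulary(token_lists)
X = [vectorize(tokens, vocab) for tokens in token_lists]
w, b = train_linear_svm(X, TRAIN_LABELS)

for unknown in ["टीम ने क्रिकेट मैच में जीत दर्ज की।",
                "अस्पताल में डॉक्टर ने मरीज का इलाज किया।"]:
    print(classify(unknown, vocab, w, b), "<-", unknown)
```

With so few documents and almost disjoint vocabularies per category, the data is trivially separable, so perfect accuracy on a set this small says little about how a classifier would behave on a larger, noisier corpus.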

Author information

Corresponding author

Correspondence to Shalini Puri.

Copyright information

© 2019 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Puri, S., Singh, S.P. (2019). An Efficient Hindi Text Classification Model Using SVM. In: Peng, SL., Dey, N., Bundele, M. (eds) Computing and Network Sustainability. Lecture Notes in Networks and Systems, vol 75. Springer, Singapore. https://doi.org/10.1007/978-981-13-7150-9_24

  • DOI: https://doi.org/10.1007/978-981-13-7150-9_24

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-13-7149-3

  • Online ISBN: 978-981-13-7150-9

  • eBook Packages: Engineering, Engineering (R0)
