News Articles Classification Using Random Forests and Weighted Multimodal Features

  • Conference paper
Multidisciplinary Information Retrieval (IRFC 2014)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8849)

Abstract

This research investigates the problem of news article classification. Classification is performed using N-gram textual features extracted from the article text and visual features generated from one representative image. The application domain is news articles written in English that belong to four categories (Business-Finance, Lifestyle-Leisure, Science-Technology, and Sports) and were downloaded from three well-known news websites (BBC, Reuters, and The Guardian). Various classification experiments were performed with the Random Forests machine learning method. Using the N-gram textual features alone led to much higher accuracy (84.4%) than using the visual features alone (53%). Combining the N-gram textual features with the visual features raised accuracy slightly further (86.2%). The main contribution of this work is the introduction of a news article classification framework based on Random Forests and multimodal features (textual and visual), as well as a late fusion strategy that makes use of the operational capabilities of Random Forests.
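
To make the idea of weighted late fusion concrete, the following is a minimal sketch and not the authors' implementation. It assumes that two separately trained models (one on textual features, one on visual features) each return a per-class probability for an article, and that per-class modality weights are already available. The function names, the probability values, and the 0.75/0.25 weights are all hypothetical and chosen only for illustration; the abstract does not state how the weights are computed.

```python
from typing import Dict

# The four categories named in the abstract.
CATEGORIES = ["Business-Finance", "Lifestyle-Leisure", "Science-Technology", "Sports"]


def late_fusion(
    p_text: Dict[str, float],
    p_visual: Dict[str, float],
    w_text: Dict[str, float],
    w_visual: Dict[str, float],
) -> Dict[str, float]:
    """Fuse per-class probabilities from two modalities with per-class weights.

    For each class, the fused score is the weighted average of the two
    modality scores. Weights are assumed to be non-negative and not both zero.
    """
    fused = {}
    for c in CATEGORIES:
        total = w_text[c] + w_visual[c]
        fused[c] = (w_text[c] * p_text[c] + w_visual[c] * p_visual[c]) / total
    return fused


def predict(fused: Dict[str, float]) -> str:
    """Return the class with the highest fused score."""
    return max(fused, key=fused.get)


# Hypothetical per-class probabilities for one article (illustrative only).
p_text = {"Business-Finance": 0.10, "Lifestyle-Leisure": 0.15,
          "Science-Technology": 0.30, "Sports": 0.45}
p_visual = {"Business-Finance": 0.40, "Lifestyle-Leisure": 0.20,
            "Science-Technology": 0.25, "Sports": 0.15}

# Hypothetical weights: the textual modality is trusted more than the visual one.
w_text = {c: 0.75 for c in CATEGORIES}
w_visual = {c: 0.25 for c in CATEGORIES}

fused = late_fusion(p_text, p_visual, w_text, w_visual)
label = predict(fused)
print(label)
```

In this example the visual model alone would favor Business-Finance, but the more heavily weighted textual model shifts the fused decision to Sports. In practice the weights could differ per class, for instance if one modality is known to be more reliable for some categories than for others.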

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Liparas, D., HaCohen-Kerner, Y., Moumtzidou, A., Vrochidis, S., Kompatsiaris, I. (2014). News Articles Classification Using Random Forests and Weighted Multimodal Features. In: Lamas, D., Buitelaar, P. (eds) Multidisciplinary Information Retrieval. IRFC 2014. Lecture Notes in Computer Science, vol 8849. Springer, Cham. https://doi.org/10.1007/978-3-319-12979-2_6

  • DOI: https://doi.org/10.1007/978-3-319-12979-2_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-12978-5

  • Online ISBN: 978-3-319-12979-2

  • eBook Packages: Computer Science, Computer Science (R0)
