News Articles Classification Using Random Forests and Weighted Multimodal Features

Liparas, Dimitris; HaCohen-Kerner, Yaakov; Moumtzidou, Anastasia; Vrochidis, Stefanos; Kompatsiaris, Ioannis

doi:10.1007/978-3-319-12979-2_6

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8849)

Included in the following conference series:

Information Retrieval Facility Conference

Abstract

This research investigates the problem of news article classification. Classification is performed using N-gram textual features extracted from the article text and visual features generated from one representative image. The application domain is English-language news articles belonging to four categories (Business-Finance, Lifestyle-Leisure, Science-Technology and Sports), downloaded from three well-known news websites (BBC, Reuters and The Guardian). Various classification experiments were performed with the Random Forests machine learning method. Using the N-gram textual features alone yielded much higher accuracy (84.4%) than using the visual features alone (53%), while combining both feature types yielded slightly higher accuracy still (86.2%). The main contribution of this work is a news article classification framework based on Random Forests and multimodal (textual and visual) features, together with a late fusion strategy that makes use of the operational capabilities of Random Forests.
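
As an illustration of what a weighted late fusion step can look like, the following minimal Python sketch combines the per-class probabilities of one text model and one visual model through a weighted average. The abstract does not specify how the modality weights are computed; deriving them from a per-modality accuracy estimate (for example an out-of-bag estimate, which Random Forests can provide) is an assumption made here. The function names and the probability values in the usage note are hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# The four categories named in the abstract.
CLASSES = ["Business-Finance", "Lifestyle-Leisure", "Science-Technology", "Sports"]


def modality_weights(accuracy: Dict[str, float]) -> Dict[str, float]:
    """Normalise per-modality accuracy estimates so the weights sum to 1.

    Assumption: weights proportional to accuracy. The paper may use a
    different weighting scheme.
    """
    total = sum(accuracy.values())
    if total <= 0:
        raise ValueError("accuracy estimates must sum to a positive value")
    return {name: value / total for name, value in accuracy.items()}


def late_fusion(probs: Dict[str, List[float]], weights: Dict[str, float]) -> List[float]:
    """Weighted average of the class-probability vectors of each modality."""
    fused = [0.0] * len(CLASSES)
    for name, vector in probs.items():
        if len(vector) != len(CLASSES):
            raise ValueError(f"{name}: expected {len(CLASSES)} probabilities")
        for i, p in enumerate(vector):
            fused[i] += weights[name] * p
    return fused


def predict(probs: Dict[str, List[float]], weights: Dict[str, float]) -> str:
    """Return the category with the highest fused probability."""
    fused = late_fusion(probs, weights)
    best = max(range(len(CLASSES)), key=lambda i: fused[i])
    return CLASSES[best]
```

For example, `modality_weights({"text": 0.844, "visual": 0.53})` gives the text model the larger weight, and `predict` can then be called with one probability vector per modality. Because fusion happens on the model outputs rather than on concatenated feature vectors, each modality can keep its own feature space and its own Random Forest.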

Author information

Authors and Affiliations

Information Technologies Institute, Centre for Research and Technology Hellas, Thermi-Thessaloniki, Greece
Dimitris Liparas, Anastasia Moumtzidou, Stefanos Vrochidis & Ioannis Kompatsiaris
Dept. of Computer Science, Jerusalem College of Technology - Lev Academic Center, 21 Havaad Haleumi St., P.O.B. 16031, 9116001, Jerusalem, Israel
Yaakov HaCohen-Kerner

Editor information

Editors and Affiliations

Institute of Informatics, Tallinn University, Narva mnt 25, Tallinn, Estonia
David Lamas
Insight Centre for Data Analytics, Unit for Natural Language Processing, IDA, National University of Ireland, Business Park, Lower Dangan, Galway, Ireland
Paul Buitelaar

Cite this paper

Liparas, D., HaCohen-Kerner, Y., Moumtzidou, A., Vrochidis, S., Kompatsiaris, I. (2014). News Articles Classification Using Random Forests and Weighted Multimodal Features. In: Lamas, D., Buitelaar, P. (eds) Multidisciplinary Information Retrieval. IRFC 2014. Lecture Notes in Computer Science, vol 8849. Springer, Cham. https://doi.org/10.1007/978-3-319-12979-2_6

DOI: https://doi.org/10.1007/978-3-319-12979-2_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-12978-5
Online ISBN: 978-3-319-12979-2
eBook Packages: Computer Science, Computer Science (R0)
