Classification of Imbalanced Big Data using SMOTE with Rough Random Forest
Tanuja Das1, Abhinandan Khan2, Goutam Saha3

1Tanuja Das, Department of Information Technology, Gauhati University Institute of Science and Technology, Guwahati, Assam, India.
2Abhinandan Khan*, Department of Computer Science and Technology, University of Calcutta, Acharya Prafulla Chandra Roy Siksha Prangan, JD-2, Sector-III, Saltlake, Kolkata, India.
3Goutam Saha, Department of Information Technology, North-Eastern Hill University, Shillong, Meghalaya, India.
Manuscript received on November 21, 2019. | Revised Manuscript received on December 15, 2019. | Manuscript published on December 30, 2019. | PP: 5174-5184  | Volume-9 Issue-2, December, 2019. | Retrieval Number: B4096129219/2019©BEIESP | DOI: 10.35940/ijeat.B4096.129219
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)

Abstract: Learning from datasets is an important research topic today. Of the various data mining tools available for this purpose, none works satisfactorily on imbalanced data, mainly because such data contain minority classes that can hamper the learning process. Beyond its large volume, Big Data is also characterised by velocity and variety. The Synthetic Minority Oversampling Technique (SMOTE) is a widely used technique for balancing imbalanced data. Here, we focus on extending this technique to the Big Data environment by combining it with the rough random forest (RRF). This hybrid approach, comprising the SMOTE and RRF algorithms for learning from imbalanced datasets, has been applied to various benchmark datasets from the KEEL Dataset Repository. The results obtained are satisfactory. The velocity aspect of Big Data has been handled by applying this method to a dynamic stock market dataset. The results obtained have been verified against popular stock market websites.
Keywords: Big data, rough set theory, random forest, rough random forest, SMOTE, stock market data.
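
For readers unfamiliar with SMOTE, the sketch below illustrates only its core oversampling step: each synthetic sample is placed at a random point on the line segment between a minority-class sample and one of its k nearest minority-class neighbours. It is a minimal illustration of the general technique, not the SMOTE with RRF method of this paper. The function name `smote`, its parameters `n_synthetic`, `k` and `seed`, and the toy data are all assumptions made for the example.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Return n_synthetic new points interpolated between minority samples.

    Illustrative sketch only: needs at least two minority samples, uses
    Euclidean distance, and searches neighbours by brute force.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # Rank the other minority samples by distance to the chosen base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        # Pick one of the k nearest neighbours at random.
        neighbour = rng.choice(others[:k])
        # gap in [0, 1) fixes the position on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical toy minority class with two numeric features.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote(minority, n_synthetic=6, k=2)
```

Because every synthetic point is a convex combination of two existing minority samples, it always lies within the region already spanned by the minority class. In a pipeline of the kind the abstract describes, the oversampled set would then be passed to the classifier.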