Abstract
Every day, huge volumes of sensory, transactional, and web data are continuously generated as streams, which need to be analyzed online as they arrive. Streaming data can be considered one of the main sources of what is called big data. While predictive modeling for data streams and big data has received a lot of attention over the last decade, many research approaches are designed for well-behaved, controlled problem settings and overlook important challenges imposed by real-world applications. This article discusses eight open challenges for data stream mining. Our goal is to identify gaps between current research and meaningful applications, highlight open problems, and define new application-relevant research directions for data stream mining. The identified challenges cover the full cycle of knowledge discovery and involve problems such as protecting data privacy, dealing with legacy systems, handling incomplete and delayed information, analyzing complex data, and evaluating stream mining algorithms. The resulting analysis is illustrated by practical applications and provides general suggestions for future research in data stream mining.