research-article

Open challenges for data stream mining research

Published: 25 September 2014

Abstract

Every day, huge volumes of sensory, transactional, and web data are continuously generated as streams, which need to be analyzed online as they arrive. Streaming data can be considered one of the main sources of what is called big data. While predictive modeling for data streams and big data has received a lot of attention over the last decade, many research approaches are designed for well-behaved, controlled problem settings and overlook important challenges imposed by real-world applications. This article discusses eight open challenges for data stream mining. Our goal is to identify gaps between current research and meaningful applications, highlight open problems, and define new application-relevant research directions for data stream mining. The identified challenges cover the full cycle of knowledge discovery and involve problems such as protecting data privacy, dealing with legacy systems, handling incomplete and delayed information, analyzing complex data, and evaluating stream mining algorithms. The resulting analysis is illustrated by practical applications and provides general suggestions concerning lines of future research in data stream mining.


    • Published in

      ACM SIGKDD Explorations Newsletter, Volume 16, Issue 1
      Special issue on big data
      June 2014
      63 pages
      ISSN:1931-0145
      EISSN:1931-0153
      DOI:10.1145/2674026

      Copyright © 2014 Authors

      Publisher

      Association for Computing Machinery

      New York, NY, United States

