Abstract
Recent years have seen a large body of research on mining concept-drifting data streams. A primary assumption in much of this work is that the up-to-date data chunk and the yet-to-come data chunk share identical distributions, so a classifier that performs well on the up-to-date chunk will also predict accurately on the yet-to-come chunk. This “stationary assumption”, however, does not capture the reality of concept drift in data streams. More recently, a “learnable assumption” has been proposed that allows the distribution of each data chunk to evolve randomly. Although this assumption can describe concept drift, it is still inadequate for real-world data streams, which usually suffer from noisy data as well as drifting concepts. In this paper, we propose a Realistic Assumption, which asserts that the difficulties of mining data streams are caused mainly by both concept drift and noisy data chunks. Accordingly, we present a new Aggregate Ensemble (AE) framework, which trains base classifiers using different learning algorithms on different data chunks. All base classifiers are then combined into a classifier ensemble through model averaging. Experimental results on synthetic and real-world data show that AE is superior to other ensemble methods under our new realistic assumption for noisy data streams.
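The aggregate idea described in the abstract (one model per combination of data chunk and learning algorithm, combined by model averaging) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the learning algorithms or the averaging weights, so the two base learners (`CentroidLearner`, `KNNLearner`), the equal-weight average, the function names, and the toy chunks are all assumptions chosen only for illustration.

```python
import math
from collections import Counter


class CentroidLearner:
    """Hypothetical base learner: scores classes by distance to class centroids."""

    def fit(self, X, y):
        sums, counts = {}, Counter(y)
        for xi, yi in zip(X, y):
            acc = sums.setdefault(yi, [0.0] * len(xi))
            for j, v in enumerate(xi):
                acc[j] += v
        self.centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}
        return self

    def predict_proba(self, x):
        # Softmax over negative Euclidean distances to each class centroid.
        w = {c: math.exp(-math.dist(x, m)) for c, m in self.centroids.items()}
        total = sum(w.values())
        return {c: v / total for c, v in w.items()}


class KNNLearner:
    """Hypothetical base learner: class frequencies among the k nearest points."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        self.data = list(zip(X, y))
        return self

    def predict_proba(self, x):
        nearest = sorted(self.data, key=lambda p: math.dist(x, p[0]))[: self.k]
        votes = Counter(label for _, label in nearest)
        return {c: n / len(nearest) for c, n in votes.items()}


def train_aggregate_ensemble(chunks, learner_factories):
    """Train one model for every (data chunk, learning algorithm) pair."""
    return [make().fit(X, y) for X, y in chunks for make in learner_factories]


def predict(models, x):
    """Model averaging with equal weights (an assumption of this sketch)."""
    avg = Counter()
    for m in models:
        for c, p in m.predict_proba(x).items():
            avg[c] += p / len(models)
    return max(avg, key=avg.get), dict(avg)


# Toy stream of two chunks; the second contains one deliberately flipped label.
chunk_1 = ([(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)], [0, 0, 0, 1, 1, 1])
chunk_2 = ([(0, 0.5), (1, 1), (0.5, 0), (5, 5.5), (6, 6), (5.5, 5)],
           [0, 0, 1, 1, 1, 1])  # (0.5, 0) is mislabelled as class 1

models = train_aggregate_ensemble([chunk_1, chunk_2], [CentroidLearner, KNNLearner])
label_low, proba_low = predict(models, (0.2, 0.2))
label_high, proba_high = predict(models, (5.5, 5.5))
```

In this toy setup the mislabelled point only sways the nearest-neighbour model trained on the second chunk, and averaging over all four models dilutes its effect. Whether that holds on real streams depends on the learners, the chunk sizes, and the noise level.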
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zhang, P., Zhu, X., Shi, Y., Wu, X. (2009). An Aggregate Ensemble for Mining Concept Drifting Data Streams with Noise. In: Theeramunkong, T., Kijsirikul, B., Cercone, N., Ho, TB. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2009. Lecture Notes in Computer Science, vol 5476. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01307-2_109
DOI: https://doi.org/10.1007/978-3-642-01307-2_109
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-01306-5
Online ISBN: 978-3-642-01307-2
eBook Packages: Computer Science, Computer Science (R0)