ABSTRACT
In this overview paper we motivate the need for a new model of data processing and survey the research issues it raises. In this model, data does not take the form of persistent relations but instead arrives in multiple, continuous, rapid, time-varying data streams. In addition to reviewing past work relevant to data stream systems and current projects in the area, the paper explores topics in stream query languages, new requirements and challenges in query processing, and algorithmic issues.