ABSTRACT
A bottom-k sketch is a summary of a set of items with nonnegative weights that supports approximate query processing. A sketch is obtained by associating with each item in a ground set an independent random rank, drawn from a probability distribution that depends on the item's weight, and including the k items with the smallest rank values.
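The construction described above can be illustrated with a minimal Python sketch. The abstract does not fix the rank distribution, so this example assumes exponentially distributed ranks with rate equal to the item's weight, which is one common choice; the function name `bottom_k_sketch` and the dictionary input format are hypothetical.

```python
import heapq
import random


def bottom_k_sketch(weights, k, seed=0):
    """Return the k (rank, item, weight) triples with the smallest ranks.

    `weights` maps each item to a nonnegative weight.
    Assumption: ranks are exponential with rate equal to the weight,
    so heavier items tend to receive smaller ranks.
    """
    rng = random.Random(seed)
    ranked = []
    for item, w in weights.items():
        if w <= 0:
            continue  # zero-weight items are skipped in this illustration
        rank = rng.expovariate(w)  # independent random rank for this item
        ranked.append((rank, item, w))
    # Keep the k smallest ranks, in ascending rank order.
    return heapq.nsmallest(k, ranked)
```

Calling `bottom_k_sketch({"a": 1.0, "b": 5.0, "c": 2.0}, k=2)` returns two triples in ascending rank order; which items are selected depends on the random seed.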
Bottom-k sketches are an alternative to k-mins sketches [9], which consist of the minimum-ranked item in each of k independent rank assignments, and to min-hash sketches [5], where hash functions replace random rank assignments. Sketches support approximate aggregations, including the weight and selectivity of a subpopulation. Coordinated sketches of multiple subsets over the same ground set support subset-relation queries such as Jaccard similarity or the weight of the union. All-distances sketches are applicable to datasets whose items lie in some metric space, such as data streams (time) or networks. They compactly encode the respective plain sketches of all neighborhoods of a location, and they support queries posed over time windows or neighborhoods as well as time-decaying and spatially decaying aggregates.
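As an illustration of a subset-relation query over coordinated sketches, the following Python sketch estimates Jaccard similarity for unweighted sets. It assumes coordination is achieved by a shared hash function, so an item receives the same rank in every set that contains it; the helper names `rank`, `sketch`, and `jaccard_estimate` are hypothetical, and the estimator shown is the standard bottom-k one rather than necessarily the one analyzed in the paper.

```python
import hashlib


def rank(item):
    """Shared hash-based rank in [0, 1); identical for an item in every set."""
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def sketch(items, k):
    """Bottom-k sketch of an unweighted set: the k smallest (rank, item) pairs."""
    return sorted((rank(x), x) for x in set(items))[:k]


def jaccard_estimate(sketch_a, sketch_b, k):
    """Estimate |A ∩ B| / |A ∪ B| from two coordinated bottom-k sketches."""
    # The k smallest entries of the merged sketches form the sketch of A ∪ B.
    union_sketch = sorted(set(sketch_a) | set(sketch_b))[:k]
    if not union_sketch:
        return 0.0
    # An entry of the union sketch lies in A ∩ B exactly when both sketches hold it.
    common = set(sketch_a) & set(sketch_b)
    hits = sum(1 for entry in union_sketch if entry in common)
    return hits / len(union_sketch)
```

When both sets have at most k items, the sketches contain every item and the estimate equals the exact Jaccard similarity; for larger sets the accuracy improves as k grows.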
An important advantage of bottom-k sketches, established in a line of recent work, is that they yield much tighter estimators for several basic aggregates. To materialize this benefit, we must adapt traditional k-mins applications to use bottom-k sketches. We propose all-distances bottom-k sketches, and we develop and analyze data structures that incrementally construct bottom-k sketches and all-distances bottom-k sketches.
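One simple way to maintain a plain bottom-k sketch incrementally is a bounded max-heap keyed by rank, sketched below in Python. This is an illustration only, not the data structures developed in the paper: it assumes each item arrives once, uses exponential ranks with rate equal to the weight, and does not cover all-distances sketches. The class name `StreamingBottomK` is hypothetical.

```python
import heapq
import random


class StreamingBottomK:
    """Maintain the k smallest-ranked items seen so far in a stream."""

    def __init__(self, k, seed=0):
        if k < 1:
            raise ValueError("k must be at least 1")
        self.k = k
        self._rng = random.Random(seed)
        # heapq is a min-heap, so ranks are negated to keep the
        # largest retained rank at the root: (-rank, item, weight).
        self._heap = []

    def add(self, item, weight):
        """Process one arriving item (assumed not seen before)."""
        if weight <= 0:
            return  # zero-weight items are skipped in this illustration
        rank = self._rng.expovariate(weight)  # assumed rank distribution
        entry = (-rank, item, weight)
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, entry)
        elif rank < -self._heap[0][0]:
            # New rank beats the current k-th smallest: replace the root.
            heapq.heapreplace(self._heap, entry)

    def sketch(self):
        """Return the current sketch as (rank, item, weight), ascending by rank."""
        return sorted((-neg, item, w) for neg, item, w in self._heap)
```

Each arrival costs O(log k) time in the worst case, and the sketch never holds more than k entries.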
Another advantage of bottom-k sketches is that when the data is represented explicitly, they can be obtained much more efficiently than k-mins sketches. We show that k-mins sketches can be derived from the respective bottom-k sketches, which enables the use of bottom-k sketches with off-the-shelf k-mins estimators. (In fact, we obtain tighter estimators, since each bottom-k sketch is a distribution over k-mins sketches.)
References
- N. Alon, N. Duffield, M. Thorup, and C. Lund. Estimating arbitrary subset sums with few probes. In Proceedings of the 24th ACM Symposium on Principles of Database Systems, pages 317--325, 2005.
- K. Bharat and A. Z. Broder. Mirror, mirror on the web: A study of host pairs with replicated content. In Proceedings of the 8th International World Wide Web Conference (WWW), pages 501--512, 1999.
- A. Broder. Filtering near-duplicate documents. In FUN, 1998.
- A. Z. Broder. On the resemblance and containment of documents. In Proceedings of the Compression and Complexity of Sequences, pages 21--29. ACM, 1997.
- A. Z. Broder. Identifying and filtering near-duplicate documents. In Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching, volume 1848 of LNCS, pages 1--10. Springer, 2000.
- A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-wise independent permutations. Journal of Computer and System Sciences, 60(3):630--659, 2000.
- B. Chazelle and L. Guibas. Fractional cascading: I. A data structuring technique. Algorithmica, 1(2):133--162, 1986.
- Y.-J. Chiang and R. Tamassia. Dynamic algorithms in computational geometry. Proceedings of the IEEE, 80(9):1412--1434, 1992.
- E. Cohen. Size-estimation framework with applications to transitive closure and reachability. J. Comput. System Sci., 55:441--453, 1997.
- E. Cohen and H. Kaplan. Efficient estimation algorithms for neighborhood variance and other moments. In Proc. 15th ACM-SIAM Symposium on Discrete Algorithms. ACM-SIAM, 2004.
- E. Cohen and H. Kaplan. Bottom-k sketches: Better and more efficient estimation of aggregates. In Proceedings of the ACM SIGMETRICS'07 Conference, 2007. Poster.
- E. Cohen and H. Kaplan. Sketches and estimators for subpopulation weight queries. Manuscript, 2007.
- E. Cohen and H. Kaplan. Spatially-decaying aggregation over a network: Model and algorithms. J. Comput. System Sci., 73:265--288, 2007.
- E. Cohen and M. Strauss. Maintaining time-decaying stream aggregates. In Proc. of the 2003 ACM Symp. on Principles of Database Systems (PODS 2003). ACM, 2003.
- E. Cohen, Y.-M. Wang, and G. Suri. When piecewise determinism is almost true. In Proc. Pacific Rim International Symposium on Fault-Tolerant Systems, pages 66--71, December 1995.
- M. T. de Berg, O. Schwarzkopf, M. J. van Kreveld, and M. Overmars. Computational Geometry: Algorithms and Applications. Springer-Verlag, 2000.
- J. R. Driscoll, N. Sarnak, D. Sleator, and R. Tarjan. Making data structures persistent. J. of Computer and System Science, 38:86--124, 1989.
- N. Duffield, M. Thorup, and C. Lund. Flow sampling under hard resource constraints. In Proceedings of the ACM IFIP Conference on Measurement and Modeling of Computer Systems (SIGMETRICS/Performance), pages 85--96, 2004.
- P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. J. Comput. System Sci., 31:182--209, 1985.
- H. Kaplan and M. Sharir. Randomized incremental constructions of three-dimensional convex hulls and planar Voronoi diagrams, and approximate range counting. In SODA '06: Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 484--493, New York, NY, USA, 2006. ACM Press.
- D. Mosk-Aoyama and D. Shah. Computing separable functions via gossip. In Proceedings of the ACM PODC'06 Conference, 2006.
- R. Motwani, E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P. Indyk, J. Ullman, and C. Yang. Finding interesting associations without support pruning. IEEE Transactions on Knowledge and Data Engineering, 13:64--78, 2001.
- N. T. Spring and D. Wetherall. A protocol-independent technique for eliminating redundant network traffic. In Proceedings of the ACM SIGCOMM'00 Conference. ACM, 2000.
- M. Szegedy. The DLT priority sampling is essentially optimal. In Proc. 38th Annual ACM Symposium on Theory of Computing. ACM, 2006.
Index Terms
- Summarizing data using bottom-k sketches