DOI: 10.1145/1281100.1281133
Article

Summarizing data using bottom-k sketches

Published: 12 August 2007

ABSTRACT

A bottom-k sketch is a summary of a set of items with nonnegative weights that supports approximate query processing. A sketch is obtained by associating with each item in a ground set an independent random rank, drawn from a probability distribution that depends on the weight of the item, and including the k items with the smallest rank values.
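The construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abstract does not fix the rank distribution, so the example assumes exponentially distributed ranks with rate equal to the item weight (one common choice), and the function name `bottom_k_sketch` and the dictionary input are hypothetical.

```python
import heapq
import random


def bottom_k_sketch(weights, k, seed=0):
    """Return the k (rank, item) pairs with the smallest ranks.

    Assumption (not stated in the abstract): the rank of an item with
    weight w is drawn from an exponential distribution with rate w, so
    heavier items tend to receive smaller ranks.
    """
    rng = random.Random(seed)
    ranked = []
    for item, w in weights.items():
        if w <= 0:
            # A zero-weight item would get an infinite rank; skip it.
            continue
        ranked.append((rng.expovariate(w), item))
    # nsmallest returns the pairs sorted by increasing rank.
    return heapq.nsmallest(k, ranked)


weights = {"a": 5.0, "b": 1.0, "c": 0.0, "d": 2.5, "e": 0.1}
sketch = bottom_k_sketch(weights, k=3)
```

With fewer than k positive-weight items, the sketch simply contains all of them.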

Bottom-k sketches are an alternative to k-mins sketches [9], which consist of the k minimum-ranked items in k independent rank assignments, and to min-hash sketches [5], where hash functions replace random rank assignments. Sketches support approximate aggregations, including the weight and selectivity of a subpopulation. Coordinated sketches of multiple subsets over the same ground set support subset-relation queries such as Jaccard similarity or the weight of the union. All-distances sketches are applicable to datasets where items lie in some metric space, such as data streams (time) or networks. They compactly encode the respective plain sketches of all neighborhoods of a location, and they support queries posed over time windows or neighborhoods as well as time-decaying and spatially decaying aggregates.
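As an illustration of a subset-relation query on coordinated sketches, the following Python sketch estimates Jaccard similarity for unweighted sets. It assumes coordination is achieved by giving every item the same hash-based rank in every sketch; the helper names (`rank`, `sketch`, `jaccard_estimate`) and the SHA-256 rank function are hypothetical choices, and the estimator shown is a standard one for bottom-k samples rather than necessarily the one analyzed in the paper.

```python
import hashlib
import heapq


def rank(item, salt="shared"):
    """Map an item to a pseudo-random rank in [0, 1), identical across sketches."""
    digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def sketch(items, k):
    """Bottom-k sketch of an unweighted set: the k smallest ranks."""
    return set(heapq.nsmallest(k, (rank(x) for x in set(items))))


def jaccard_estimate(sk_a, sk_b, k):
    """Estimate |A ∩ B| / |A ∪ B| from two coordinated bottom-k sketches."""
    # The k smallest ranks of the merged sketches form a bottom-k
    # sketch of the union A ∪ B.
    union_bottom = heapq.nsmallest(k, sk_a | sk_b)
    if not union_bottom:
        return 0.0
    # A rank in the union sketch belongs to A ∩ B exactly when it
    # appears in both input sketches.
    both = sum(1 for r in union_bottom if r in sk_a and r in sk_b)
    return both / len(union_bottom)


k = 64
a = sketch(range(1, 11), k)   # {1, ..., 10}
b = sketch(range(6, 16), k)   # {6, ..., 15}
estimate = jaccard_estimate(a, b, k)
```

When both sets have at most k items the sketches hold every rank, so the estimate equals the exact similarity; for larger sets it is an approximation whose error shrinks as k grows.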

An important advantage of bottom-k sketches, established in a line of recent work, is much tighter estimators for several basic aggregates. To materialize this benefit, we must adapt traditional k-mins applications to use bottom-k sketches. We propose all-distances bottom-k sketches and develop and analyze data structures that incrementally construct bottom-k sketches and all-distances bottom-k sketches.
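To make the idea of incremental construction concrete, here is a minimal Python sketch that maintains a bottom-k sketch as (rank, item) pairs arrive one at a time. It is not the data structure developed in the paper: it is a plain bounded max-heap, the class name `StreamingBottomK` is hypothetical, and it assumes each item is offered at most once with a rank assigned elsewhere.

```python
import heapq


class StreamingBottomK:
    """Keep the k smallest-ranked items seen so far.

    Assumption: items are distinct and ranks are supplied by the caller.
    """

    def __init__(self, k):
        if k < 1:
            raise ValueError("k must be at least 1")
        self.k = k
        # heapq is a min-heap, so ranks are negated to keep the
        # largest retained rank at the root.
        self._heap = []

    def offer(self, rank, item):
        """Process one arrival; return True if the sketch changed."""
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, (-rank, item))
            return True
        if rank < -self._heap[0][0]:
            # The new rank beats the current k-th smallest rank.
            heapq.heapreplace(self._heap, (-rank, item))
            return True
        return False

    def sketch(self):
        """Return the retained (rank, item) pairs by increasing rank."""
        return sorted((-neg, item) for neg, item in self._heap)


# Deterministic, pairwise-distinct ranks used only for this demonstration.
pairs = [((i * 7919) % 1009 / 1009, i) for i in range(1000)]
stream = StreamingBottomK(10)
for r, i in pairs:
    stream.offer(r, i)
```

Each arrival costs O(log k) time in the worst case, and an arrival whose rank is not below the current k-th smallest is rejected in constant time.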

Another advantage of bottom-k sketches is that when the data is represented explicitly, they can be obtained much more efficiently than k-mins sketches. We show that k-mins sketches can be derived from respective bottom-k sketches, which enables the use of bottom-k sketches with off-the-shelf k-mins estimators. (In fact, we obtain tighter estimators, since each bottom-k sketch is a distribution over k-mins sketches.)

References

  1. N. Alon, N. Duffield, M. Thorup, and C. Lund. Estimating arbitrary subset sums with few probes. In Proceedings of the 24th ACM Symposium on Principles of Database Systems, pages 317--325, 2005.
  2. K. Bharat and A. Z. Broder. Mirror, mirror on the web: A study of host pairs with replicated content. In Proceedings of the 8th International World Wide Web Conference (WWW), pages 501--512, 1999.
  3. A. Broder. Filtering near-duplicate documents. In FUN, 1998.
  4. A. Z. Broder. On the resemblance and containment of documents. In Proceedings of the Compression and Complexity of Sequences, pages 21--29. ACM, 1997.
  5. A. Z. Broder. Identifying and filtering near-duplicate documents. In Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching, volume 1848 of LNCS, pages 1--10. Springer, 2000.
  6. A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-wise independent permutations. Journal of Computer and System Sciences, 60(3):630--659, 2000.
  7. B. Chazelle and L. Guibas. Fractional cascading: I. A data structuring technique. Algorithmica, 1(2):133--162, 1986.
  8. Y.-J. Chiang and R. Tamassia. Dynamic algorithms in computational geometry. Proceedings of the IEEE, 80(9):1412--1434, 1992.
  9. E. Cohen. Size-estimation framework with applications to transitive closure and reachability. J. Comput. System Sci., 55:441--453, 1997.
  10. E. Cohen and H. Kaplan. Efficient estimation algorithms for neighborhood variance and other moments. In Proc. 15th ACM-SIAM Symposium on Discrete Algorithms. ACM-SIAM, 2004.
  11. E. Cohen and H. Kaplan. Bottom-k sketches: Better and more efficient estimation of aggregates. In Proceedings of the ACM SIGMETRICS'07 Conference, 2007. Poster.
  12. E. Cohen and H. Kaplan. Sketches and estimators for subpopulation weight queries. Manuscript, 2007.
  13. E. Cohen and H. Kaplan. Spatially-decaying aggregation over a network: model and algorithms. J. Comput. System Sci., 73:265--288, 2007.
  14. E. Cohen and M. Strauss. Maintaining time-decaying stream aggregates. In Proc. of the 2003 ACM Symp. on Principles of Database Systems (PODS 2003). ACM, 2003.
  15. E. Cohen, Y.-M. Wang, and G. Suri. When piecewise determinism is almost true. In Proc. Pacific Rim International Symposium on Fault-Tolerant Systems, pages 66--71, December 1995.
  16. M. T. de Berg, O. Schwarzkopf, M. J. van Kreveld, and M. Overmars. Computational Geometry: Algorithms and Applications. Springer-Verlag, 2000.
  17. J. R. Driscoll, N. Sarnak, D. Sleator, and R. Tarjan. Making data structures persistent. J. of Computer and System Science, 38:86--124, 1989.
  18. N. Duffield, M. Thorup, and C. Lund. Flow sampling under hard resource constraints. In Proceedings of the ACM IFIP Conference on Measurement and Modeling of Computer Systems (SIGMETRICS/Performance), pages 85--96, 2004.
  19. P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. J. Comput. System Sci., 31:182--209, 1985.
  20. H. Kaplan and M. Sharir. Randomized incremental constructions of three-dimensional convex hulls and planar Voronoi diagrams, and approximate range counting. In SODA '06: Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm, pages 484--493, New York, NY, USA, 2006. ACM Press.
  21. D. Mosk-Aoyama and D. Shah. Computing separable functions via gossip. In Proceedings of the ACM PODC'06 Conference, 2006.
  22. R. Motwani, E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P. Indyk, J. Ullman, and C. Yang. Finding interesting associations without support pruning. IEEE Transactions on Knowledge and Data Engineering, 13:64--78, 2001.
  23. N. T. Spring and D. Wetherall. A protocol-independent technique for eliminating redundant network traffic. In Proceedings of the ACM SIGCOMM'00 Conference. ACM, 2000.
  24. M. Szegedy. The DLT priority sampling is essentially optimal. In Proc. 38th Annual ACM Symposium on Theory of Computing. ACM, 2006.

      • Published in

        PODC '07: Proceedings of the twenty-sixth annual ACM symposium on Principles of distributed computing
        August 2007
        424 pages
        ISBN:9781595936165
        DOI:10.1145/1281100

        Copyright © 2007 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        Overall Acceptance Rate: 740 of 2,477 submissions, 30%
