Article

Summarizing data using bottom-k sketches

Authors:
Edith Cohen (AT&T Labs-Research),
Haim Kaplan (Tel Aviv University)

PODC '07: Proceedings of the twenty-sixth annual ACM symposium on Principles of distributed computing, August 2007, Pages 225–234. https://doi.org/10.1145/1281100.1281133

Published: 12 August 2007

ABSTRACT

A bottom-k sketch is a summary of a set of items with nonnegative weights that supports approximate query processing. A sketch is obtained by associating with each item in a ground set an independent random rank, drawn from a probability distribution that depends on the weight of the item, and including the k items with the smallest rank values.
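The construction described above can be illustrated with a minimal sketch. The abstract does not fix the rank distribution, so the exponential ranks (rate equal to the item's weight) used here are an assumption; the function name `bottom_k_sketch` and the `seed` parameter are hypothetical and chosen only for illustration.

```python
import heapq
import math
import random


def bottom_k_sketch(weights, k, seed=0):
    """Return the k (rank, item) pairs with the smallest ranks, in rank order.

    weights: dict mapping item -> nonnegative weight.
    Assumption: ranks are exponential with rate equal to the weight,
    so heavier items tend to receive smaller ranks.
    """
    rng = random.Random(seed)
    ranked = []
    for item, w in weights.items():
        if w <= 0:
            continue  # a zero-weight item is never sampled
        u = 1.0 - rng.random()  # uniform in (0, 1], avoids log(0)
        ranked.append((-math.log(u) / w, item))
    return heapq.nsmallest(k, ranked)


if __name__ == "__main__":
    weights = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0}
    for rank, item in bottom_k_sketch(weights, k=2):
        print(f"{item}: rank {rank:.4f}")
```

Because a single rank assignment is shared by all k sampled items, the sketch contains k distinct items whenever the set has at least k items of positive weight.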

Bottom-k sketches are an alternative to k-mins sketches [9], which consist of the minimum-ranked item in each of k independent rank assignments, and to min-hash sketches [5], where hash functions replace random rank assignments. Sketches support approximate aggregations, including the weight and selectivity of a subpopulation. Coordinated sketches of multiple subsets over the same ground set support subset-relation queries such as Jaccard similarity or the weight of the union. All-distances sketches are applicable to datasets whose items lie in some metric space, such as data streams (time) or networks. They compactly encode the respective plain sketches of all neighborhoods of a location, and they support queries posed over time windows or neighborhoods as well as time-decaying and spatially decaying aggregates.
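As an illustration of a subset-relation query on coordinated sketches, the following sketch estimates Jaccard similarity for the uniform-weight case. It is a common textbook-style estimator, not necessarily the one analyzed in the paper. Coordination is obtained here by deriving each item's rank from a hash of the item, so the same item receives the same rank in every subset; the helper names `rank`, `sketch`, and `jaccard_estimate` and the `salt` parameter are hypothetical.

```python
import hashlib


def rank(item, salt="demo"):
    """Hash-based rank in [0, 1); identical for the same item in every subset."""
    digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def sketch(items, k):
    """Bottom-k sketch of a set with uniform weights, as sorted (rank, item) pairs."""
    return sorted((rank(x), x) for x in set(items))[:k]


def jaccard_estimate(sk_a, sk_b, k):
    """Estimate |A ∩ B| / |A ∪ B| from the bottom-k sketches of A and B.

    The k smallest entries of the two sketches combined form the bottom-k
    sketch of the union; an entry of it lies in the intersection exactly
    when it appears in both input sketches.
    """
    union_sketch = sorted(set(sk_a) | set(sk_b))[:k]
    if not union_sketch:
        return 0.0
    in_both = set(sk_a) & set(sk_b)
    return sum(1 for entry in union_sketch if entry in in_both) / len(union_sketch)


if __name__ == "__main__":
    a = sketch(range(0, 1000), k=64)
    b = sketch(range(500, 1500), k=64)
    print(jaccard_estimate(a, b, k=64))  # the true similarity is 1/3
```

When both sets have at most k items the sketches hold every item, and the estimate equals the exact similarity.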

An important advantage of bottom-k sketches, established in a line of recent work, is much tighter estimators for several basic aggregates. To materialize this benefit, we must adapt traditional k-mins applications to use bottom-k sketches. We propose all-distances bottom-k sketches and develop and analyze data structures that incrementally construct bottom-k sketches and all-distances bottom-k sketches.
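One simple way to build an all-distances bottom-k sketch incrementally is sketched below. It is not the data structure developed in the paper: it assumes items arrive already sorted by increasing distance from the location (for a stream, by increasing age) and that ranks have been assigned beforehand. The names `all_distances_sketch` and `sketch_of_neighborhood` are hypothetical.

```python
import heapq


def all_distances_sketch(items_by_distance, k):
    """Build an all-distances bottom-k sketch.

    items_by_distance: iterable of (distance, item, rank) tuples, assumed to be
    sorted by increasing distance. An item is kept when its rank is among the
    k smallest ranks of all items at most as far away as it is.
    """
    smallest = []  # negated ranks: a max-heap over the k smallest ranks so far
    ads = []
    for dist, item, r in items_by_distance:
        if len(smallest) < k:
            heapq.heappush(smallest, -r)
            ads.append((dist, item, r))
        elif r < -smallest[0]:
            heapq.heapreplace(smallest, -r)  # evict the current k-th smallest rank
            ads.append((dist, item, r))
    return ads


def sketch_of_neighborhood(ads, radius, k):
    """Recover the plain bottom-k sketch of all items within the given radius."""
    return sorted((r, item) for dist, item, r in ads if dist <= radius)[:k]


if __name__ == "__main__":
    data = [(1, "a", 0.5), (2, "b", 0.9), (3, "c", 0.1),
            (4, "d", 0.7), (5, "e", 0.3), (6, "f", 0.8)]
    ads = all_distances_sketch(data, k=2)
    print([item for _, item, _ in ads])
    print(sketch_of_neighborhood(ads, radius=4, k=2))
```

Each arriving item costs one heap operation, and the stored entries suffice to answer a bottom-k query for every radius, which is the sense in which the structure encodes the plain sketches of all neighborhoods.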

Another advantage of bottom-k sketches is that when the data is represented explicitly, they can be obtained much more efficiently than k-mins sketches. We show that k-mins sketches can be derived from the respective bottom-k sketches, which enables the use of bottom-k sketches with off-the-shelf k-mins estimators. (In fact, we obtain tighter estimators, since each bottom-k sketch is a distribution over k-mins sketches.)

References

[1] N. Alon, N. Duffield, M. Thorup, and C. Lund. Estimating arbitrary subset sums with few probes. In Proceedings of the 24th ACM Symposium on Principles of Database Systems, pages 317--325, 2005.
[2] K. Bharat and A. Z. Broder. Mirror, mirror on the web: A study of host pairs with replicated content. In Proceedings of the 8th International World Wide Web Conference (WWW), pages 501--512, 1999.
[3] A. Broder. Filtering near-duplicate documents. In FUN, 1998.
[4] A. Z. Broder. On the resemblance and containment of documents. In Proceedings of the Compression and Complexity of Sequences, pages 21--29. ACM, 1997.
[5] A. Z. Broder. Identifying and filtering near-duplicate documents. In Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching, volume 1848 of LNCS, pages 1--10. Springer, 2000.
[6] A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-wise independent permutations. Journal of Computer and System Sciences, 60(3):630--659, 2000.
[7] B. Chazelle and L. Guibas. Fractional cascading: I. A data structuring technique. Algorithmica, 1(2):133--162, 1986.
[8] Y.-J. Chiang and R. Tamassia. Dynamic algorithms in computational geometry. Proceedings of the IEEE, 80(9):1412--1434, 1992.
[9] E. Cohen. Size-estimation framework with applications to transitive closure and reachability. J. Comput. System Sci., 55:441--453, 1997.
[10] E. Cohen and H. Kaplan. Efficient estimation algorithms for neighborhood variance and other moments. In Proc. 15th ACM-SIAM Symposium on Discrete Algorithms. ACM-SIAM, 2004.
[11] E. Cohen and H. Kaplan. Bottom-k sketches: Better and more efficient estimation of aggregates. In Proceedings of the ACM SIGMETRICS'07 Conference, 2007. Poster.
[12] E. Cohen and H. Kaplan. Sketches and estimators for subpopulation weight queries. Manuscript, 2007.
[13] E. Cohen and H. Kaplan. Spatially-decaying aggregation over a network: model and algorithms. J. Comput. System Sci., 73:265--288, 2007.
[14] E. Cohen and M. Strauss. Maintaining time-decaying stream aggregates. In Proc. of the 2003 ACM Symp. on Principles of Database Systems (PODS 2003). ACM, 2003.
[15] E. Cohen, Y.-M. Wang, and G. Suri. When piecewise determinism is almost true. In Proc. Pacific Rim International Symposium on Fault-Tolerant Systems, pages 66--71, December 1995.
[16] M. T. de Berg, O. Schwarzkopf, M. J. van Kreveld, and M. Overmars. Computational Geometry: Algorithms and Applications. Springer-Verlag, 2000.
[17] J. R. Driscoll, N. Sarnak, D. Sleator, and R. Tarjan. Making data structures persistent. J. of Computer and System Science, 38:86--124, 1989.
[18] N. Duffield, M. Thorup, and C. Lund. Flow sampling under hard resource constraints. In Proceedings of the ACM IFIP Conference on Measurement and Modeling of Computer Systems (SIGMETRICS/Performance), pages 85--96, 2004.
[19] P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. J. Comput. System Sci., 31:182--209, 1985.
[20] H. Kaplan and M. Sharir. Randomized incremental constructions of three-dimensional convex hulls and planar Voronoi diagrams, and approximate range counting. In SODA '06: Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm, pages 484--493, New York, NY, USA, 2006. ACM Press.
[21] D. Mosk-Aoyama and D. Shah. Computing separable functions via gossip. In Proceedings of the ACM PODC'06 Conference, 2006.
[22] R. Motwani, E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P. Indyk, J. Ullman, and C. Yang. Finding interesting associations without support pruning. IEEE Transactions on Knowledge and Data Engineering, 13:64--78, 2001.
[23] N. T. Spring and D. Wetherall. A protocol-independent technique for eliminating redundant network traffic. In Proceedings of the ACM SIGCOMM'00 Conference. ACM, 2000.
[24] M. Szegedy. The DLT priority sampling is essentially optimal. In Proc. 38th Annual ACM Symposium on Theory of Computing. ACM, 2006.

Index Terms

Summarizing data using bottom-k sketches
1. Information systems
  1. Information storage systems
    1. Record storage systems
      1. Record storage alternatives
2. Mathematics of computing
  1. Probability and statistics

Published in
PODC '07: Proceedings of the twenty-sixth annual ACM symposium on Principles of distributed computing
August 2007, 424 pages
ISBN: 9781595936165
DOI: 10.1145/1281100
General Chair: Indranil Gupta (UIUC, USA)
Program Chair: Roger Wattenhofer (ETH Zurich, Switzerland)
Copyright © 2007 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
all-distances sketches
bottom-k sketches
data streams
Acceptance Rates
Overall Acceptance Rate: 740 of 2,477 submissions, 30%