ABSTRACT
Inspired by the great success of machine learning in the past decade, people have been thinking about the possibility of improving the theoretical results by exploring data distribution. In this paper, we revisit a fundamental problem called Distributed Tracking (DT) under an assumption that the data follows a certain (known or unknown) distribution, and propose a number Data-dependent algorithms with improved theoretical bounds. Informally, in the DT problem, there is a coordinator and k players, where the coordinator holds a threshold N and each player has a counter. At each time stamp, at most one counter can be increased by one. The job of the coordinator is to capture the exact moment when the sum of all these k counters reaches N. The goal is to minimise the communication cost. While our first type of algorithms assume the concrete data distribution is known in advance, our second type of algorithms can learn the distribution on the fly. Both of the algorithms achieve a communication cost bounded by O(k log log N) with high probability, improving the state-of-the-art data-independent bound O(k log N/k). We further propose a number of implementation optimisation heuristics to improve both efficiency and robustness of the algorithms. Finally, we conduct extensive experiments on three real datasets and four synthetic datasets. The experimental results show that the communication cost of our algorithms is as least as $20%$ of that of the state-of-the-art algorithms.
Supplemental Material
- Anders Aamand, Piotr Indyk, and Ali Vakilian. 2019. (Learned) Frequency Estimation Algorithms under Zipfian Distribution. CoRR, Vol. abs/1908.05198 (2019). arxiv: 1908.05198Google Scholar
- Jean-Yves Audibert, Ré mi Munos, and Csaba Szepesvá ri. 2009. Exploration-exploitation tradeoff using variance estimates in multi-armed bandits. Theor. Comput. Sci., Vol. 410, 19 (2009), 1876--1902.Google ScholarDigital Library
- Irwan Bello, Hieu Pham, Quoc V. Le, Mohammad Norouzi, and Samy Bengio. 2017. Neural Combinatorial Optimization with Reinforcement Learning. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24--26, 2017, Workshop Track Proceedings.Google Scholar
- Fan R. K. Chung and Lincoln Lu. 2006. Survey: Concentration Inequalities and Martingale Inequalities: A Survey. Internet Mathematics, Vol. 3, 1 (2006), 79--127.Google ScholarCross Ref
- Graham Cormode. 2013. The continuous distributed monitoring model. SIGMOD Record, Vol. 42, 1 (2013), 5--14.Google ScholarDigital Library
- Graham Cormode, Minos N. Garofalakis, S. Muthukrishnan, and Rajeev Rastogi. 2005. Holistic Aggregates in a Networked World: Distributed Tracking of Approximate Quantiles. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Baltimore, Maryland, USA, June 14--16, 2005. 25--36.Google ScholarDigital Library
- Graham Cormode, S. Muthukrishnan, and Ke Yi. 2011. Algorithms for distributed functional monitoring. ACM Trans. Algorithms, Vol. 7, 2 (2011), 21:1--21:20.Google ScholarDigital Library
- Nikos Giatrakos, Antonios Deligiannakis, Minos N. Garofalakis, Izchak Sharfman, and Assaf Schuster. 2012. Prediction-based geometric monitoring over distributed data streams. In Proceedings of the International Conference on Management of Data, SIGMOD, Scottsdale, AZ, USA, May 20--24, 2012. 265--276.Google ScholarDigital Library
- Chen-Yu Hsu, Piotr Indyk, Dina Katabi, and Ali Vakilian. 2019. Learning-Based Frequency Estimation Algorithms. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6--9, 2019.Google Scholar
- Zengfeng Huang, Ke Yi, and Qin Zhang. 2019. Randomized Algorithms for Tracking Distributed Count, Frequencies, and Ranks. Algorithmica, Vol. 81, 6 (2019), 2222--2243.Google ScholarDigital Library
- Ram Keralapura, Graham Cormode, and Jeyashankher Ramamirtham. 2006. Communication-efficient distributed monitoring of thresholded counts. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Chicago, Illinois, USA, June 27--29, 2006. 289--300.Google ScholarDigital Library
- Elias B. Khalil, Hanjun Dai, Yuyu Zhang, Bistra Dilkina, and Le Song. 2017. Learning Combinatorial Optimization Algorithms over Graphs. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4--9 December 2017, Long Beach, CA, USA. 6348--6358.Google Scholar
- David Kotz, Tristan Henderson, Ilya Abyzov, and Jihwang Yeo. 2009. CRAWDAD dataset dartmouth/campus (v. 2009-09-09).Google Scholar
- Tim Kraska, Alex Beutel, Ed H. Chi, Jeffrey Dean, and Neoklis Polyzotis. 2018. The Case for Learned Index Structures. In Proceedings of the 2018 International Conference on Management of Data, SIGMOD 2018. 489--504.Google ScholarDigital Library
- Michael Mitzenmacher. 2018. A Model for Learned Bloom Filters and Optimizing by Sandwiching. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3--8 December 2018, Montré al, Canada. 462--471.Google Scholar
- Miao Qiao, Junhao Gan, and Yufei Tao. 2016. Range Thresholding on Streams. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016. 571--582.Google ScholarDigital Library
- Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer Networks. In Annual Conference on Neural Information Processing Systems (NeuIPS), December 7--12, 2015, Montreal, Quebec, Canada. 2692--2700.Google Scholar
Index Terms
- Learning Based Distributed Tracking
Recommendations
Randomized algorithms for tracking distributed count, frequencies, and ranks
PODS '12: Proceedings of the 31st ACM SIGMOD-SIGACT-SIGAI symposium on Principles of Database SystemsWe show that randomization can lead to significant improvements for a few fundamental problems in distributed tracking. Our basis is the count-tracking problem, where there are k players, each holding a counter ni that gets incremented over time, and ...
Perfect $L_p$ Sampling in a Data Stream
In this paper, we resolve the one-pass space complexity of perfect $L_p$ sampling for $p \in (0,2)$ in a stream. Given a stream of updates (insertions and deletions) to the coordinates of an underlying vector $f \in \mathbb{R}^n$, a perfect $L_p$ sampler ...
Tight bounds for Lp samplers, finding duplicates in streams, and related problems
PODS '11: Proceedings of the thirtieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systemsIn this paper, we present near-optimal space bounds for Lp-samplers. Given a stream of updates (additions and subtraction) to the coordinates of an underlying vector x in Rn, a perfect Lp sampler outputs the i-th coordinate with probability xipxpp. In ...
Comments