ABSTRACT
Tachyon is a distributed file system enabling reliable data sharing at memory speed across cluster computing frameworks. While caching today improves read workloads, writes are either network or disk bound, as replication is used for fault-tolerance. Tachyon eliminates this bottleneck by pushing lineage, a well-known technique, into the storage layer. The key challenge in making a long-running lineage-based storage system is timely data recovery in case of failures. Tachyon addresses this issue by introducing a checkpointing algorithm that guarantees bounded recovery cost and resource allocation strategies for recomputation under commonly used resource schedulers. Our evaluation shows that Tachyon outperforms in-memory HDFS by 110x for writes. It also improves the end-to-end latency of a realistic workflow by 4x. Tachyon is open source and is deployed at multiple companies.
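To make the lineage idea concrete, here is a minimal sketch of a store that keeps a single in-memory copy of each derived file and records how it was produced, so that lost data can be recomputed rather than read back from a replica. It is an illustration only: `LineageStore` and its methods are hypothetical names, not Tachyon's API, and checkpoints are taken manually here, whereas the paper's algorithm decides what to checkpoint in order to bound recovery cost.

```python
from typing import Callable, Dict, List, Tuple


class LineageStore:
    """Toy store: one in-memory copy per file, plus the lineage needed to rebuild it."""

    def __init__(self) -> None:
        self.memory: Dict[str, bytes] = {}    # volatile copies
        self.durable: Dict[str, bytes] = {}   # sources and checkpoints (assumed to survive failures)
        # output name -> (function that produced it, names of its inputs)
        self.lineage: Dict[str, Tuple[Callable[..., bytes], List[str]]] = {}
        self.recomputations = 0

    def put_source(self, name: str, value: bytes) -> None:
        # Source data is assumed to already live in durable storage.
        self.memory[name] = value
        self.durable[name] = value

    def write(self, name: str, fn: Callable[..., bytes], inputs: List[str]) -> None:
        # Record the lineage first, then keep the output in memory only (no replication).
        self.lineage[name] = (fn, list(inputs))
        self.memory[name] = fn(*[self.read(i) for i in inputs])

    def checkpoint(self, name: str) -> None:
        # Persisting an intermediate file cuts the lineage chain that recovery must replay.
        self.durable[name] = self.read(name)

    def lose_memory(self) -> None:
        # Simulate a failure that wipes every in-memory copy.
        self.memory.clear()

    def read(self, name: str) -> bytes:
        if name in self.memory:
            return self.memory[name]
        if name in self.durable:
            value = self.durable[name]
        else:
            # Recover by re-running the recorded computation on (recursively recovered) inputs.
            fn, inputs = self.lineage[name]
            value = fn(*[self.read(i) for i in inputs])
            self.recomputations += 1
        self.memory[name] = value
        return value


store = LineageStore()
store.put_source("raw", b"a b c")
store.write("upper", lambda raw: raw.upper(), ["raw"])
store.write("joined", lambda up: up.replace(b" ", b"-"), ["upper"])
store.checkpoint("upper")
store.lose_memory()
result = store.read("joined")
```

Because `upper` was checkpointed before the simulated failure, reading `joined` replays only the last step of the chain. Without the checkpoint, both steps would be recomputed, which is the recovery cost that grows unbounded in a long-running system.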
Index Terms
- Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks