research-article
Open Access

Scaling Reliably: Improving the Scalability of the Erlang Distributed Actor Platform

Published: 17 August 2017

Abstract

Distributed actor languages are an effective means of constructing scalable reliable systems, and the Erlang programming language has a well-established and influential model. While the Erlang model conceptually provides reliable scalability, it has some inherent scalability limits and these force developers to depart from the model at scale. This article establishes the scalability limits of Erlang systems and reports the work of the EU RELEASE project to improve the scalability and understandability of the Erlang reliable distributed actor model.

We systematically study the scalability limits of Erlang and then address the issues at the virtual machine, language, and tool levels. More specifically: (1) We have evolved the Erlang virtual machine so that it can work effectively in large-scale single-host multicore and NUMA architectures. We have made important changes and architectural improvements to the widely used Erlang/OTP release. (2) We have designed and implemented Scalable Distributed (SD) Erlang libraries to address language-level scalability issues and provided and validated a set of semantics for the new language constructs. (3) To make large Erlang systems easier to deploy, monitor, and debug, we have developed and made open source releases of five complementary tools, some specific to SD Erlang.

Throughout the article we use two case studies to investigate the capabilities of our new technologies and tools: a distributed hash table based Orbit calculation and Ant Colony Optimisation (ACO). Chaos Monkey experiments show that two versions of ACO survive random process failure and hence that SD Erlang preserves the Erlang reliability model. While we report measurements on a range of NUMA and cluster architectures, the key scalability experiments are conducted on the Athos cluster with 256 hosts (6,144 cores). Even for programs with no global recovery data to maintain, SD Erlang partitions the network to reduce network traffic and hence improves performance of the Orbit and ACO benchmarks above 80 hosts. ACO measurements show that maintaining global recovery data dramatically limits scalability; however, scalability is recovered by partitioning the recovery data. We exceed the established scalability limits of distributed Erlang, and do not reach the limits of SD Erlang for these benchmarks at this scale (256 hosts, 6,144 cores).

            • Published in

              ACM Transactions on Programming Languages and Systems  Volume 39, Issue 4
              December 2017
              191 pages
              ISSN:0164-0925
              EISSN:1558-4593
              DOI:10.1145/3133234

              Copyright © 2017 ACM

              Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

              Publisher

              Association for Computing Machinery

              New York, NY, United States

              Publication History

              • Published: 17 August 2017
              • Accepted: 1 June 2017
              • Revised: 1 April 2017
              • Received: 1 December 2015

              Qualifiers

              • research-article
              • Research
              • Refereed
