MultiMLton: A multicore-aware runtime for Standard ML

Published online by Cambridge University Press: 18 June 2014

K. C. SIVARAMAKRISHNAN
Affiliation:
Purdue University, West Lafayette, IN, USA (e-mail: chandras@purdue.edu)
LUKASZ ZIAREK
Affiliation:
SUNY Buffalo, NY, USA (e-mail: lziarek@buffalo.edu)
SURESH JAGANNATHAN
Affiliation:
Purdue University, West Lafayette, IN, USA (e-mail: suresh@cs.purdue.edu)

Abstract

MultiMLton is an extension of the MLton compiler and runtime system that targets scalable, multicore architectures. It provides specific support for ACML, a derivative of Concurrent ML that allows for the construction of composable asynchronous events. To effectively manage asynchrony, we require the runtime to efficiently handle potentially large numbers of lightweight, short-lived threads, many of which are created specifically to deal with the implicit concurrency introduced by asynchronous events. Scalability demands also dictate that the runtime minimize global coordination. MultiMLton therefore implements a split-heap memory manager that allows mutators and collectors running on different cores to operate mostly independently. More significantly, MultiMLton exploits the premise that there is a surfeit of available concurrency in ACML programs to realize a new collector design that completely eliminates the need for read barriers, a source of significant overhead in other managed runtimes. These two symbiotic features, a thread design specifically tailored to support asynchronous communication and a memory manager that exploits lightweight concurrency to greatly reduce barrier overheads, are MultiMLton's key novelties. In this article, we describe the rationale, design, and implementation of these features, and provide experimental results over a range of parallel benchmarks and different multicore architectures, including an 864-core Azul Vega 3 and a 48-core non-coherent Intel SCC (Single-chip Cloud Computer), that justify our design decisions.

Type
Articles
Copyright
Copyright © Cambridge University Press 2014 
