ABSTRACT
The C++ Actor Framework (CAF) was designed to support multiple, exchangeable schedulers, with random work stealing (RWS) as the default choice for load balancing. RWS scales very well, and choosing a random victim keeps scheduling simple and requires minimal information. On the downside, it ignores data locality and misses opportunities to improve application performance.
In this paper, we contribute a locality-guided scheduling approach that exploits knowledge about the host system to adapt runtime deployment and thereby improves the performance of actor-based applications. We implement and thoroughly analyze a CAF scheduler that considers the trade-off between communication locality and execution locality. The former describes the locality of communicating actors, while the latter describes the locality between a worker, which executes an actor, and the location of its data. Extensive performance evaluations show a performance gain of up to 25% for data-intensive applications on a 64-core NUMA machine.
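To make the contrast between random victim selection and locality-aware selection concrete, the following is a minimal sketch. It is not the scheduler evaluated in the paper and does not use the CAF API. The names `worker_info`, `pick_random_victim`, and `victim_order` are hypothetical, and the sketch assumes a single level of locality (the NUMA node a worker is pinned to). It does not model the trade-off between communication locality and execution locality.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical worker description: only the NUMA node the worker is pinned to.
struct worker_info {
  int numa_node;
};

// Random work stealing: choose any worker other than `self` uniformly at random.
template <class Rng>
std::size_t pick_random_victim(const std::vector<worker_info>& workers,
                               std::size_t self, Rng& rng) {
  assert(workers.size() > 1 && self < workers.size());
  std::uniform_int_distribution<std::size_t> dist(0, workers.size() - 2);
  std::size_t idx = dist(rng);
  return idx >= self ? idx + 1 : idx; // shift past our own index
}

// Locality-aware variant (assumption: one NUMA level): try workers on the
// same node first, then all remaining workers. Each group is shuffled so
// that stealing stays randomized within a locality level.
template <class Rng>
std::vector<std::size_t> victim_order(const std::vector<worker_info>& workers,
                                      std::size_t self, Rng& rng) {
  assert(self < workers.size());
  std::vector<std::size_t> local_group;
  std::vector<std::size_t> remote_group;
  for (std::size_t i = 0; i < workers.size(); ++i) {
    if (i == self)
      continue;
    if (workers[i].numa_node == workers[self].numa_node)
      local_group.push_back(i);
    else
      remote_group.push_back(i);
  }
  std::shuffle(local_group.begin(), local_group.end(), rng);
  std::shuffle(remote_group.begin(), remote_group.end(), rng);
  local_group.insert(local_group.end(), remote_group.begin(),
                     remote_group.end());
  return local_group;
}
```

A worker that runs out of work would walk the list returned by `victim_order` front to back, so it only reaches across NUMA nodes after every nearby worker has been tried.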
Index Terms
- Locality-guided scheduling in CAF