ABSTRACT
Modern out-of-order processors have increased capacity to exploit instruction-level parallelism (ILP) and memory-level parallelism (MLP), e.g., by using wide superscalar pipelines and vector execution units, as well as deep buffers for in-flight memory requests. These resources, however, are often poorly utilized on workloads with large working sets, e.g., in-memory databases, key-value stores, and graph analytics, because compilers and hardware struggle to expose ILP and MLP from the instruction stream automatically.
In this paper, we introduce the IMLP (Instruction and Memory Level Parallelism) task programming model. IMLP tasks execute as coroutines that yield execution at annotated long-latency operations, e.g., memory accesses, divisions, or unpredictable branches. IMLP tasks are interleaved on a single thread, and integrate well with thread parallelism and vectorization. Our DSL embedded in C++, Cimple, allows exploration of task scheduling and transformations, such as buffering, vectorization, pipelining, and prefetching.
We demonstrate state-of-the-art performance on core algorithms used in in-memory databases that operate on arrays, hash tables, trees, and skip lists. Cimple applications reach 2.5× throughput gains over hardware multithreading on a multi-core system and a 6.4× single-thread speedup.
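The task model described in the abstract can be illustrated with a minimal hand-written sketch. This is not Cimple's syntax or generated code. It assumes a hypothetical `SearchTask` state machine that performs one binary search, issues a prefetch before each array access, and then yields. A hypothetical round-robin scheduler, `interleaved_lower_bound`, keeps `width` such tasks in flight on a single thread so that their cache misses can overlap. `__builtin_prefetch` is a GCC/Clang builtin and is compiled out on other compilers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical task: one binary search written as a resumable state machine.
// Each call to step() runs until the next long-latency access, then yields.
struct SearchTask {
    const std::vector<std::uint32_t>* data = nullptr;
    std::uint32_t key = 0;
    std::size_t lo = 0, hi = 0, mid = 0;
    bool pending = false;    // true when a prefetch for data[mid] was issued
    bool done = false;
    std::size_t result = 0;  // index of the first element >= key

    // Returns true when the search has finished, false when it yields.
    bool step() {
        if (pending) {
            // Resume point: consume the element requested before the yield.
            if ((*data)[mid] < key) lo = mid + 1; else hi = mid;
            pending = false;
        }
        if (lo >= hi) { result = lo; done = true; return true; }
        mid = lo + (hi - lo) / 2;
#if defined(__GNUC__) || defined(__clang__)
        __builtin_prefetch(&(*data)[mid]);  // start the memory access early
#endif
        pending = true;
        return false;  // yield so that other tasks can run during the miss
    }
};

// Hypothetical scheduler: interleaves up to `width` searches on one thread.
std::vector<std::size_t> interleaved_lower_bound(
        const std::vector<std::uint32_t>& data,
        const std::vector<std::uint32_t>& keys,
        std::size_t width) {
    assert(width > 0 && "need at least one in-flight task");
    std::vector<std::size_t> out(keys.size());
    std::vector<SearchTask> slots;
    std::vector<std::size_t> owner;  // which key each slot is working on
    std::size_t next = 0;

    auto make = [&](std::size_t i) {
        SearchTask t;
        t.data = &data;
        t.key = keys[i];
        t.hi = data.size();
        return t;
    };

    // Fill the initial window of in-flight tasks.
    while (slots.size() < width && next < keys.size()) {
        slots.push_back(make(next));
        owner.push_back(next);
        ++next;
    }

    std::size_t active = slots.size();
    while (active > 0) {
        for (std::size_t s = 0; s < slots.size(); ++s) {
            if (slots[s].done) continue;
            if (slots[s].step()) {
                out[owner[s]] = slots[s].result;
                if (next < keys.size()) {
                    slots[s] = make(next);  // reuse the slot for a new key
                    owner[s] = next;
                    ++next;
                } else {
                    --active;
                }
            }
        }
    }
    return out;
}
```

In this sketch, `width` plays the role of the number of interleaved tasks. A useful value depends on how many outstanding memory requests the hardware can track, so it is best treated as a tuning parameter.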