Abstract
Commodity many-core hardware is now mainstream, but parallel programming models still lag behind in exploiting application parallelism efficiently. There are (at least) two principal reasons for this. First, real-world programs often take the form of a deeply nested composition of parallel operators, but mapping the available parallelism to the hardware requires a set of transformations that are tedious to do by hand and beyond the capability of the common user. Second, the best optimization strategy, such as what to parallelize and what to efficiently sequentialize, is often sensitive to the input dataset and therefore requires multiple code versions that are optimized differently, which also raises maintainability problems.
This article presents three array-based applications from the financial domain that are suitable for GPGPU execution. Common benchmark-design practice has been to provide the same code for the sequential and parallel versions, optimized for only one class of datasets. In comparison, we document (1) all available parallelism via nested map-reduce functional combinators, in a simple Haskell implementation that closely resembles the original code structure, (2) the invariants and code transformations that govern the main trade-offs of a data-sensitive optimization space, and (3) target CPU and multiversion GPGPU code, together with an evaluation that demonstrates optimization trade-offs and other difficulties. We believe that this work provides useful insight into the language constructs and compiler infrastructure capable of expressing and optimizing such applications, and we report in-progress work in this direction.
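As a rough illustration of what "nested map-reduce functional combinators" can look like, the following minimal Haskell sketch prices a toy instrument as an outer map-reduce over simulated paths, each of which is itself reduced. It is not taken from the benchmark: the names `pathPayoff`, `price`, `Version`, and `chooseVersion`, the averaging payoff, and the thread-count threshold are all hypothetical, chosen only to show the shape of the nesting and of a dataset-sensitive version choice.

```haskell
module Main where

import Data.List (foldl')

-- Hypothetical payoff: the mean of the values along one simulated path.
-- This is the inner reduce; it assumes a non-empty path.
pathPayoff :: [Double] -> Double
pathPayoff xs = foldl' (+) 0 xs / fromIntegral (length xs)

-- Outer map over paths, followed by an outer reduce (the average payoff).
-- Both levels are parallel in principle; here they simply run sequentially.
price :: [[Double]] -> Double
price paths =
  foldl' (+) 0 (map pathPayoff paths) / fromIntegral (length paths)

-- Hypothetical code versions: exploit only the outer map and sequentialize
-- the inner reduce, or exploit parallelism at every level.
data Version = OuterOnly | AllLevels
  deriving (Eq, Show)

-- Assumed heuristic: if the outer dimension alone can occupy the hardware,
-- sequentialize the inner level; otherwise use all levels.
chooseVersion :: Int -> Int -> Version
chooseVersion hwThreads outerSize
  | outerSize >= hwThreads = OuterOnly
  | otherwise              = AllLevels

main :: IO ()
main = do
  print (price [[1, 2, 3], [4, 5, 6]])
  print (chooseVersion 1024 4096)
  print (chooseVersion 1024 16)
```

In this sketch both versions would compute the same value; the choice only affects which level of the nest is mapped to hardware parallelism, which is why the best choice can depend on the shape of the input dataset.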
Index Terms
- FinPar: A Parallel Financial Benchmark