ABSTRACT
High-performance quantum circuit simulation on classical HPC systems remains imperative in the NISQ era. Observing that the major obstacle to scalable state-vector quantum simulation is the massive, fine-grained, irregular data exchange with remote nodes, in this paper we present SV-Sim, which applies PGAS-based communication models (i.e., direct peer access for intra-node CPUs/GPUs and SHMEM for inter-node CPU/GPU clusters) to efficient general-purpose quantum circuit simulation. Through an orchestrated design based on device function pointers, SV-Sim abstracts various quantum gates across multiple heterogeneous backends, including IBM/Intel/AMD CPUs, NVIDIA/AMD GPUs, and Intel Xeon Phi, in a unified framework, while still delivering outstanding performance and a tractable interface to higher-level quantum programming environments such as IBM Qiskit, Microsoft Q#, and Google Cirq. By circumventing the lack of polymorphism in GPUs and leveraging device-initiated one-sided communication, SV-Sim can process circuits that are dynamically generated in Python using a single GPU/CPU kernel, without the need for expensive JIT compilation or runtime parsing, which significantly reduces programming complexity and improves performance for QC simulation. This is especially appealing for variational quantum algorithms, where the circuits are synthesized online at each iteration. Evaluations on the latest NVIDIA DGX-A100, V100-DGX-2, ALCF Theta, OLCF Spock, and OLCF Summit HPCs show that SV-Sim can deliver scalable performance on various state-of-the-art HPC platforms, offering a useful tool for quantum algorithm validation and verification. SV-Sim has been released at http://github.com/pnnl/sv-sim. A version specifically adapted for Q#/QDK is also provided.
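The dispatch idea described in the abstract can be illustrated with a small host-side sketch. This is not SV-Sim's code: it is a plain-Python analogue in which each gate is a record holding a reference to its update function, and one generic loop walks the gate list, so a circuit built at run time needs no parsing or recompilation. All names here (`op_x`, `op_h`, `op_rz`, `simulate`, `partner_is_remote`) are hypothetical, and the partitioning helper assumes the state vector is split into equal contiguous chunks over a power-of-two number of processing elements.

```python
import math

# Hypothetical gate kernels (illustrative names, not SV-Sim's API).
# Each one updates the amplitude pair (i0, i1) that differs only in the target qubit.
def op_x(sv, i0, i1, theta):
    sv[i0], sv[i1] = sv[i1], sv[i0]

def op_h(sv, i0, i1, theta):
    s = 1.0 / math.sqrt(2.0)
    a, b = sv[i0], sv[i1]
    sv[i0], sv[i1] = s * (a + b), s * (a - b)

def op_rz(sv, i0, i1, theta):
    sv[i0] *= complex(math.cos(theta / 2), -math.sin(theta / 2))
    sv[i1] *= complex(math.cos(theta / 2), math.sin(theta / 2))

def simulate(n_qubits, circuit):
    """Run a circuit given as a list of (op, target, control, theta) records.

    `op` is a function reference, standing in for a device function pointer.
    `control` is -1 for an uncontrolled gate.
    """
    sv = [0j] * (1 << n_qubits)
    sv[0] = 1 + 0j
    for op, target, control, theta in circuit:  # one generic loop for every gate type
        stride = 1 << target
        for i0 in range(1 << n_qubits):
            if i0 & stride:
                continue  # visit each pair once, from the index with target bit 0
            if control >= 0 and not (i0 >> control) & 1:
                continue  # controlled gate: act only where the control bit is 1
            op(sv, i0, i0 | stride, theta)
    return sv

def partner_is_remote(n_qubits, n_pes, target):
    """True if a gate on `target` pairs amplitudes held by different PEs.

    Assumes equal contiguous chunks and n_pes a power of two.
    """
    local_bits = n_qubits - (n_pes.bit_length() - 1)
    return target >= local_bits

# Usage: a Bell-state circuit assembled at run time (H on qubit 0, then CX 0 -> 1).
bell = [(op_h, 0, -1, 0.0), (op_x, 1, 0, 0.0)]
state = simulate(2, bell)
```

Under these assumptions, gates on the high-order qubits are the ones whose amplitude pairs straddle partitions, which is where one-sided remote reads and writes would be needed in a distributed setting.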
Index Terms
- SV-sim: scalable PGAS-based state vector simulation of quantum circuits