ABSTRACT
Random Forests (RFs) are a commonly used machine learning method for classification and regression tasks spanning a variety of application domains, including bioinformatics, business analytics, and software optimization. While prior work has focused primarily on improving performance of the training of RFs, many applications, such as malware identification, cancer prediction, and banking fraud detection, require fast RF classification.
In this work, we accelerate RF classification on GPU and FPGA. To support large datasets efficiently, we propose a hierarchical memory layout suited to the GPU/FPGA memory hierarchy. We design three RF classification code variants based on this layout, and we investigate GPU- and FPGA-specific considerations for these kernels. Our experimental evaluation, performed on an Nvidia Titan Xp GPU and on a Xilinx Alveo U250 FPGA accelerator card using publicly available datasets on the scale of millions of samples and tens of features, covers four aspects. First, we evaluate the performance benefits of our hierarchical data structure over the standard compressed sparse row (CSR) format. Second, we compare our GPU implementation with cuML, a machine learning library targeting Nvidia GPUs. Third, we explore the performance/accuracy tradeoff resulting from the use of different tree depths in the RF. Finally, we perform a comparative performance analysis of our GPU and FPGA implementations. Our evaluation shows that our code variants outperform the CSR baseline on both GPU and FPGA, with the GPU delivering the best overall performance. For high accuracy targets, our GPU implementation yields a 5–9× speedup over CSR and up to a 2× speedup over Nvidia's cuML library.
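To make the classification step concrete, the sketch below shows a minimal CUDA kernel for RF inference over a flattened, CSR-style tree encoding of the kind the baseline comparison above refers to. The ForestView structure, the node encoding (the right child stored adjacent to the left), and all identifiers are illustrative assumptions; the abstract does not specify the internals of the proposed hierarchical layout, so this reflects a generic flattened baseline rather than the paper's data structure.

```cuda
// Hypothetical flattened decision-tree layout: each tree is stored as
// parallel arrays of node records, and tree_offsets[] marks where each
// tree's nodes begin. Names and node encoding are assumptions for
// illustration, not the paper's actual data structure.
#include <cuda_runtime.h>

struct ForestView {
    const int*   feature;      // feature[n]: feature tested at node n (-1 for a leaf)
    const float* threshold;    // threshold[n]: split value at node n
    const int*   left;         // left[n]: index of left child (right child assumed at left[n] + 1)
    const int*   label;        // label[n]: predicted class if node n is a leaf
    const int*   tree_offsets; // tree_offsets[t]: index of tree t's root node
    int          num_trees;
    int          num_classes;
};

// One thread classifies one sample: traverse every tree root-to-leaf,
// then take a majority vote over the per-tree predictions.
__global__ void rf_classify(ForestView f,
                            const float* __restrict__ samples, // row-major [num_samples x num_features]
                            int num_samples, int num_features,
                            int* __restrict__ predictions)
{
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= num_samples) return;

    const float* x = samples + (size_t)s * num_features;
    int votes[32] = {0};                      // assumes num_classes <= 32

    for (int t = 0; t < f.num_trees; ++t) {
        int n = f.tree_offsets[t];            // start at this tree's root
        while (f.feature[n] >= 0) {           // descend until a leaf is reached
            bool go_left = x[f.feature[n]] <= f.threshold[n];
            n = go_left ? f.left[n] : f.left[n] + 1;
        }
        ++votes[f.label[n]];
    }

    int best = 0;                             // majority vote across trees
    for (int c = 1; c < f.num_classes; ++c)
        if (votes[c] > votes[best]) best = c;
    predictions[s] = best;
}
```

In this one-thread-per-sample formulation, neighboring threads take different, data-dependent paths through each tree; this irregular memory access pattern is precisely what a layout tailored to the GPU/FPGA memory hierarchy aims to mitigate.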