A software-based dynamic-warp scheduling approach for load-balancing the Viola–Jones face detection algorithm on GPUs

https://doi.org/10.1016/j.jpdc.2013.01.012

Abstract

Face detection is a key component in applications such as security surveillance and human–computer interaction systems, and real-time recognition is essential in many scenarios. The Viola–Jones algorithm is an attractive means of meeting the real-time requirement, and has been widely implemented on custom hardware, FPGAs, and GPUs. We demonstrate a GPU implementation that achieves competitive performance at low development cost. Our solution addresses the irregularity inherent in the algorithm with a novel dynamic warp scheduling approach that eliminates thread divergence. The new scheme also employs a thread pool mechanism, which significantly reduces the cost of creating, switching, and terminating threads. Compared to static thread scheduling, our dynamic warp scheduling approach reduces the execution time by a factor of 3. To maximize detection throughput, we also run on multiple GPUs, realizing 95.6 FPS on 5 Fermi GPUs.

Highlights

► We introduce a novel technique to load-balance the Viola–Jones algorithm.
► We compare different implementations of the Viola–Jones algorithm.
► We conduct experiments on both pre-Fermi and Fermi architectures.
► We run the face detector on multiple GPUs.
► We demonstrate that the new technique better utilizes the GPU resources.

Introduction

Face detection is a major component of many applications, such as security surveillance and human–computer interaction systems, and real-time recognition is essential in many scenarios. Prior work has employed various approaches [16], [17], including the well-known Viola–Jones algorithm [22], [21], which provides fast and robust face detection. This algorithm has been widely implemented and deployed on diverse architectural platforms.

Conventional multicore implementations are the most convenient approaches; however, they deliver limited frame rates that do not meet the real-time requirement [6], [14]. Application-specific hardware designs can provide much higher performance. Unfortunately, a custom design is expensive since even a minor change requires the device to be re-fabricated. Reconfigurable devices, such as Field Programmable Gate Arrays (FPGAs), are more cost effective since the programmer may reconfigure the device in software yet still realize hardware performance [2], [1]. However, even an FPGA design requires a significant engineering effort, due to the complexity of correctly synthesizing a register-transfer level design that meets area and timing constraints.

Recently, it has been shown that implementing the Viola–Jones face detection algorithm on Graphics Processing Units (GPUs) is a more cost-effective solution than FPGAs [7], [18], [8]. In particular, the Single Instruction Multiple Data (SIMD) execution model of a GPU's streaming processors is well suited to the large amount of data parallelism inherent in the Viola–Jones algorithm. When SIMD instructions run efficiently, GPUs deliver better performance than conventional MIMD (Multiple Instruction Multiple Data) multicore processors for a given power and density budget [9]. In terms of programmability, the high-level APIs supported by GPU programming environments, e.g., CUDA and OpenCL, afford greater flexibility than dedicated devices and FPGAs. However, the GPU programming model requires the programmer to be aware of the underlying thread scheduler. For example, CUDA groups scalar threads into warps, each a set of 32 threads that execute identical instructions on a SIMD pipeline. Thus, control-flow divergence within a warp should be avoided, since it serializes thread execution and drastically penalizes performance.
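
As a concrete illustration of this constraint, the following minimal CUDA sketch (a hypothetical kernel, not taken from the paper) contains a data-dependent branch; lanes of the same 32-thread warp that take different paths are executed one path after the other, so a fully divergent warp pays roughly the cost of both branches.

    // Minimal sketch of intra-warp divergence (hypothetical kernel, not the
    // paper's code). Lanes whose condition differs within one 32-thread warp
    // are serialized: the taken path runs first, then the not-taken path.
    __global__ void divergent_kernel(const int *work, int *out, int n)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= n) return;

        if (work[tid] > 0)          // lanes with positive input execute this path...
            out[tid] = 2 * work[tid];
        else                        // ...then the remaining lanes execute this one,
            out[tid] = 0;           // so the warp pays for both branches.
    }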

The Viola–Jones face detection algorithm exhibits parallelism across image regions called search windows, so threads can operate on these windows independently and concurrently [18], [7]. To avoid control-flow divergence, each thread is statically assigned a fixed window to work on. Unfortunately, this straightforward design does not cope well with the irregular load distribution of the Viola–Jones face detection algorithm. In particular, the amount of work varies greatly among threads, since windows that are unlikely to contain a face require much less computation time than the others. The result is a heavy load imbalance among threads that are co-scheduled to execute SIMD instructions. Moreover, many virtualized threads are spawned to work for very short durations, introducing heavy scheduling overheads.
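
The sketch below illustrates this static mapping as we understand it; the Window type and the passes_stage stage test are illustrative stubs rather than the paper's code. One thread owns one window, walks the cascade, and exits as soon as a stage rejects the window, so the running times of the 32 lanes in a warp can differ widely.

    // Static thread scheduling sketch (illustrative types and stubs, not the
    // paper's implementation): one thread is bound to one search window.
    struct Window { int x, y, scale; };            // hypothetical window descriptor

    __device__ bool passes_stage(const Window &w, int stage)
    {
        // Placeholder standing in for Haar-feature evaluation of one cascade stage.
        return ((w.x + w.y + w.scale + stage) & 3) != 0;
    }

    __global__ void detect_static(const Window *windows, int num_windows,
                                  int num_stages, unsigned char *is_face)
    {
        int w = blockIdx.x * blockDim.x + threadIdx.x;
        if (w >= num_windows) return;

        int stage = 0;
        // Most windows fail within a few stages; a warp must wait for its
        // slowest lane (a face-like window) to finish every stage.
        while (stage < num_stages && passes_stage(windows[w], stage))
            ++stage;

        is_face[w] = (stage == num_stages) ? 1 : 0;
    }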

In this paper, we present a dynamic, software-managed thread allocation strategy based on warp scheduling that we call dynamic warp scheduling. This approach dynamically assigns windows in units of a warp. Threads within a warp that do not find a face in a window can be reused to work on another window. This design offers three important advantages over the static approach. First, it eliminates load imbalance by dynamically distributing windows to available computing resources. Second, it avoids control-flow divergence by letting threads within a warp work on common windows in lockstep. Third, it significantly reduces the cost of creating, switching, and terminating threads since it uses a smaller number of reusable threads.
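
The following hedged sketch shows the warp-granularity work fetching that underlies this idea, reusing the hypothetical Window and passes_stage stubs from the static sketch above. A small pool of persistent warps repeatedly reserves a batch of 32 windows from a global counter, so threads are reused rather than created and destroyed per window; the per-stage replacement of rejected windows described in the paper is not reproduced here. The warp-wide broadcast uses __shfl_sync, which requires a Kepler-or-later GPU; on the Fermi-era devices used in the paper a shared-memory broadcast would serve the same purpose.

    // Dynamic warp scheduling sketch (illustrative only): persistent warps pull
    // warp-sized batches of windows from a shared counter and loop until the
    // work runs out. Reuses the Window / passes_stage stubs defined above.
    #define WARP_SIZE 32

    __device__ int next_window;     // reset to 0 from the host before each launch

    __global__ void detect_dynamic(const Window *windows, int num_windows,
                                   int num_stages, unsigned char *is_face)
    {
        int lane = threadIdx.x & (WARP_SIZE - 1);

        for (;;) {
            int base = 0;
            if (lane == 0)          // one lane reserves the next batch of 32 windows
                base = atomicAdd(&next_window, WARP_SIZE);
            base = __shfl_sync(0xffffffffu, base, 0);   // broadcast to all lanes
            if (base >= num_windows) return;            // no work left: retire warp

            int w = base + lane;
            if (w < num_windows) {
                int stage = 0;
                while (stage < num_stages && passes_stage(windows[w], stage))
                    ++stage;
                is_face[w] = (stage == num_stages) ? 1 : 0;
            }
            // The warp loops back for another batch, so a fixed thread pool is
            // reused instead of spawning one short-lived thread per window.
        }
    }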

We implemented the dynamic warp scheduling approach with the CUDA programming model on NVIDIA GPU platforms, though we believe that this technique is also applicable to similar programming models such as OpenCL on other SIMD architectures such as AMD Radeon GPUs. Experimental results demonstrated that the dynamic warp scheduling design greatly outperforms the static thread scheduling approach. In particular, on single NVIDIA Tesla C1060 and C2050 GPUs, the dynamic warp scheduling variant is, respectively, 3.1 and 2.6 times faster than an optimized implementation of the static approach.

We improved detection throughput still further by running with multiple devices, each located on a different host, that communicate by passing messages. With 5 Fermi GPUs, our implementation realizes a 4.6 times speedup, and runs at 95.6 FPS. These results include the cost to send the initial data to the device, to copy the detected faces back to the host, and to gather the partial results to the root host node.
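
A minimal sketch of this organization, assuming MPI as the message-passing layer and one rank per GPU-equipped host; face_detect_on_gpu and the Face record are placeholders for the CUDA detector and its output, not the paper's API.

    // Multi-GPU sketch (one MPI rank per host/GPU). Each rank detects faces in
    // its share of the search windows, then the root gathers the partial results.
    #include <mpi.h>
    #include <vector>

    struct Face { int x, y, w, h; };

    // Stub standing in for the CUDA detector running on this rank's GPU.
    static std::vector<Face> face_detect_on_gpu(int rank, int nprocs)
    {
        (void)rank; (void)nprocs;
        return std::vector<Face>();
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        std::vector<Face> local = face_detect_on_gpu(rank, nprocs);
        int local_bytes = (int)(local.size() * sizeof(Face));

        // The root learns how many bytes each rank contributes, then gathers them.
        std::vector<int> counts(nprocs), displs(nprocs);
        MPI_Gather(&local_bytes, 1, MPI_INT, counts.data(), 1, MPI_INT,
                   0, MPI_COMM_WORLD);

        std::vector<Face> all;
        if (rank == 0) {
            int total = 0;
            for (int i = 0; i < nprocs; ++i) { displs[i] = total; total += counts[i]; }
            all.resize(total / (int)sizeof(Face));
        }
        MPI_Gatherv(local.data(), local_bytes, MPI_BYTE,
                    all.data(), counts.data(), displs.data(), MPI_BYTE,
                    0, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }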

The remainder of this paper proceeds as follows. Section 2 presents prior work in accelerating face and object detection. Section 3 describes the Viola–Jones face detection algorithm. In Section 4 we briefly present the GPU architecture and analyze different approaches to parallelizing the algorithm. We provide details of our GPU implementations in Section 5. Section 6 compares the results of the different implementations. Finally, Section 7 concludes the paper.

Section snippets

Related work

Conventional multicore implementations of the Viola–Jones face detection algorithm obtain limited frame rates: 1.78 FPS with OpenCV 1.1 on VGA image sizes [6]. Much work has been done to increase performance through application-specific hardware designs. Theocharides et al. [19] presented an ASIC architecture that heavily exploits parallelism of the AdaBoost face recognition technique by parallelizing accesses to image data. They demonstrated a computation rate of 2 FPS but the image sizes

The face detection algorithm

At a high level, the Viola–Jones face detection algorithm scans an image with a window looking for features of a human face. If enough of these features are found, then the particular window of the image is determined to be a face. To account for faces of different sizes, the window is scaled and the process is repeated. Each window scale progresses through the algorithm independently of the other scales. To reduce the number of features that each window needs to check, the window passes through a cascade of classifier stages: a window that fails any stage is rejected immediately, so only face-like windows are examined by the later, more expensive stages.
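
To make the cascade structure concrete, here is a simplified sketch in C-style device code; the real detector evaluates trained Haar features with multiple weighted rectangles per feature, whereas this version uses a single rectangle sum per feature and illustrative parameter arrays.

    // Simplified cascade sketch (illustrative, not the trained cascade).
    // A rectangle sum is computed in O(1) from a precomputed integral image.
    struct HaarRect { int x, y, w, h; };

    __host__ __device__ inline float rect_sum(const float *ii, int stride,
                                              int wx, int wy, HaarRect r)
    {
        int x1 = wx + r.x, y1 = wy + r.y;
        int x2 = x1 + r.w, y2 = y1 + r.h;
        return ii[y2 * stride + x2] - ii[y1 * stride + x2]
             - ii[y2 * stride + x1] + ii[y1 * stride + x1];
    }

    // A window is a face only if it passes every stage; any failed stage rejects
    // it immediately, which is why most windows are cheap and a few are expensive.
    __host__ __device__ bool window_is_face(const float *ii, int stride,
                                            int wx, int wy,
                                            const HaarRect *features,
                                            const float *feature_thresh,
                                            const int *stage_len,
                                            const float *stage_thresh,
                                            int num_stages)
    {
        int f = 0;
        for (int s = 0; s < num_stages; ++s) {
            float stage_sum = 0.0f;
            for (int i = 0; i < stage_len[s]; ++i, ++f)
                stage_sum += (rect_sum(ii, stride, wx, wy, features[f])
                              >= feature_thresh[f]) ? 1.0f : 0.0f;
            if (stage_sum < stage_thresh[s])
                return false;               // early rejection
        }
        return true;
    }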

GPU platform and parallel approaches

In this section, we present the GPU hardware architecture and CUDA programming challenges associated with the SIMD execution model. We then analyze promising mappings of the Viola–Jones face detection parallelism to processing elements of a GPU.

Implementation

We compare four different implementation strategies: static scheduling, dynamic scheduling, enhanced reuse via shared memory, and a multi-GPU work assignment. We refer to the static work partitioning strategy as static thread scheduling and the dynamic work strategy as dynamic warp scheduling.
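
A hedged sketch of the shared-memory reuse idea follows; the Feature layout, STAGE0_FEATURES, and the cooperative copy pattern are illustrative assumptions about how classifier constants read by every window in a thread block can be staged once in on-chip memory, rather than the paper's actual kernel.

    // Shared-memory reuse sketch (illustrative layout): the classifier table
    // read by every window in the block is copied once into on-chip shared
    // memory, then reused by all threads.
    #define STAGE0_FEATURES 64

    struct Feature { int x, y, w, h; float threshold; };

    __global__ void detect_shared(const Feature *g_features, float *out,
                                  int num_windows)
    {
        __shared__ Feature s_features[STAGE0_FEATURES];

        // Cooperative copy: each thread loads a strided slice of the table.
        for (int i = threadIdx.x; i < STAGE0_FEATURES; i += blockDim.x)
            s_features[i] = g_features[i];
        __syncthreads();

        int w = blockIdx.x * blockDim.x + threadIdx.x;
        if (w >= num_windows) return;

        // Placeholder use of the staged table; a real kernel would evaluate the
        // cascade for window w against s_features here.
        out[w] = s_features[w % STAGE0_FEATURES].threshold;
    }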

Testbed

Our experimental testbed was a cluster, named Dirac, located at the National Energy Research Scientific Computing Center (NERSC). Dirac includes NVIDIA Tesla GPUs connected with QDR InfiniBand switches. 44 of Dirac's nodes are equipped with a C2050 (Fermi, compute capability 2.0), and 4 other nodes with a C1060 (compute capability 1.3). We used both types of devices in our experiments. The SMs on the C2050 contain 32 cores and the device memory bandwidth is 144 GB/sec. The SMs on the C1060 contain 8 cores and the device memory bandwidth is 102 GB/sec.

Conclusion

We have presented dynamic warp scheduling as a means of maintaining balanced workloads in the Viola–Jones face detection algorithm on Graphics Processing Units. By avoiding excessive load imbalance, our dynamic warp scheduling approach reduces the running time of static thread scheduling by up to a factor of 3.1. We also parallelized our design across multiple GPUs and realized a 4.6 times speedup on 5 Fermi GPUs. Our combined techniques enabled a face detection rate of 95.6 FPS on VGA images.

Acknowledgments

This research began in the Autumn of 2009, as a class project in CSE 260, Parallel Computation, in the Department of Computer Science and Engineering at the University of California, San Diego (UCSD) [3]. The code development was performed on an NVIDIA Tesla system located at UCSD and supported by NSF DMS/MRI Award 0821816. This research used computational resources of the National Energy Research Scientific Computing Center (the Dirac GPU cluster), which is supported by the Office of Science of the U.S. Department of Energy.

References (23)

  • J. Cho et al.

    Parallelized architecture of multiple classifiers for face detection

  • J. Cho et al.

    FPGA-based face detection system using Haar classifiers

  • CSE 260, Department of Computer Science and Engineering, University of California, San Diego. [Online]. Available:...
  • Dirac, NERSC. [Online]. Available: http://www.nersc.gov/users/computational-systems/dirac/,...
  • C. Gao, S.-L. Lu, Novel FPGA-based Haar classifier face detection algorithm acceleration, in: International Conference...
  • J.P. Harvey, GPU Acceleration of Object Classification Algorithms Using NVIDIA CUDA, Master’s thesis, Rochester...
  • D. Hefenbrock et al.

    Accelerating Viola–Jones face detection to FPGA-level using GPUs

  • J. Kong, Y. Deng, GPU accelerated face detection, in: 2010 International Conference on Intelligent Control and...
  • J. Meng et al.

    Dynamic warp subdivision for integrated branch and memory divergence tolerance

    SIGARCH Comput. Archit. News

    (2010)
  • Message Passing Interface Forum. [Online]. Available: http://www.mpi-forum.org,...
  • V. Nair et al.

    An FPGA-based people detection system

    EURASIP J. Appl. Signal Process.

    (2005)

    Tan Nguyen is a Ph.D. student under the supervision of Dr. Scott Baden in the Department of Computer Science and Engineering, University of California, San Diego. His research interests include massively parallel programming, latency hiding, and automatic source-to-source transformation. He received his B.S. degree in Computer Science and Engineering from the Ho Chi Minh City University of Technology, Vietnam. Tan Nguyen is a fellow of the Vietnam Education Foundation (VEF), cohort 2009.

    Daniel Hefenbrock received his B.S. and M.S. degrees in IT-Systems Engineering from the University of Potsdam, Germany in 2008 and 2011, respectively. In 2009/2010 he was a visiting graduate student at the Department of Computer Science and Engineering of the University of California, San Diego. He currently works at Microsoft as a Software Development Engineer in the area of cloud storage technologies.

    Jason Oberg received his B.S. degree in Computer Engineering from the University of California, Santa Barbara in 2009. He is currently pursuing his Ph.D. degree, working with Ryan Kastner, in the Department of Computer Science and Engineering, University of California, San Diego. His primary research interests include hardware and embedded system security using information flow tracking, and high-performance computing, specifically with field-programmable gate arrays (FPGAs).

    Ryan Kastner received the Ph.D. degree in Computer Science from the University of California, Los Angeles. He is currently a Professor with the Department of Computer Science and Engineering, University of California, San Diego. His current research interests include many aspects of embedded computing systems, including reconfigurable architectures, digital signal processing, and security.

    Scott B. Baden is Professor of Computer Science and Engineering at the University of California, San Diego, where he has been a member of the faculty since 1990. He received the Ph.D. in Computer Science from the University of California, Berkeley in 1987. Dr. Baden’s research is in high performance and parallel computation. His research focuses on programming abstractions, domain-specific translation, performance programming, adaptive and data centric applications and algorithm design. Dr. Baden is a member of IEEE (Senior member) and SIAM, and a Senior Fellow at the San Diego Supercomputer Center. He is a founding member of UCSD’s Computational Science, Mathematics, and Engineering Program (CSME). Dr Baden is an active proponent of International Education and an avid photographer.
