Abstract

In recent years, vehicle detection and classification have become essential tasks of intelligent transportation systems, and real-time, accurate vehicle detection from image and video data for traffic monitoring remains challenging. The most noteworthy challenges are operating in real time to accurately locate and classify vehicles in traffic flows and handling the total occlusions that hinder vehicle tracking. Real-time traffic monitoring systems have attracted significant attention from traffic management departments, and digitally processing and analyzing traffic videos in real time is crucial for extracting reliable data on traffic flow. Therefore, this study presents a real-time traffic monitoring system based on a virtual detection zone, a Gaussian mixture model (GMM), and You Only Look Once (YOLO) convolutional neural networks to increase vehicle counting and classification efficiency. The GMM and the virtual detection zone are used for vehicle counting, and YOLO is used to classify vehicles. Moreover, the distance and time traveled by a vehicle are used to estimate its speed. In this study, the Montevideo Audio and Video Dataset (MAVD), the GRAM Road-Traffic Monitoring (GRAM-RTM) data set, and our own collected data sets are used to verify the proposed method. Experimental results indicate that the proposed method with YOLOv4 achieved the highest classification accuracy of 98.91% and 99.5% on the MAVD and GRAM-RTM data sets, respectively. Moreover, the proposed method with YOLOv4 also achieves the highest classification accuracy of 99.1%, 98.6%, and 98% in daytime, nighttime, and rainy conditions, respectively. In addition, the average absolute percentage error of vehicle speed estimation with the proposed method is about 7.6%.

1. Introduction

Traffic monitoring with an intelligent transportation system provides solutions to various challenges, such as vehicle counting, speed estimation, accident detection, and assisted traffic surveillance [15]. A traffic monitoring system essentially serves as a framework to detect the vehicles that appear in a video image and estimate their position while they remain in the scene. In complex scenes with various vehicle models and high vehicle density, accurately locating and classifying vehicles in traffic flows is difficult [6, 7]. Moreover, vehicle detection is limited by environmental changes, differing vehicle features, and relatively low detection speeds [8]. Therefore, an algorithm with the capabilities of real-time computation and accurate vehicle detection must be developed for a real-time traffic monitoring system, and the accurate and quick detection of vehicles from traffic images or videos has both theoretical and practical significance.

With the rapid development of computer vision and artificial intelligence technologies, object detection algorithms based on deep learning have been widely investigated. Such algorithms extract features automatically through machine learning; thus, they possess a powerful image abstraction ability and an automatic high-level feature representation capability. Several excellent object detection networks, such as the single-shot detector (SSD) [9], Fast R-CNN [10], YOLOv3 [11], and YOLOv4 [12], have been implemented for traffic detection using deep learning object detectors [13]. For example, Biswas et al. [14] implemented SSD to estimate traffic density. Yang et al. [15] proposed a multitasking-capable Faster R-CNN method that uses a single image to generate three-dimensional (3D) space coordinate information for an object with monocular vision to facilitate autonomous driving. Huang et al. [8] applied the single-stage deep neural network YOLOv3 to data sets generated in different environments to improve its real-time detection accuracy. Hu et al. [16] proposed an improved YOLOv4-based video stream vehicle target detection algorithm to address the problem of detection speed. In addition, the most noteworthy challenges associated with traffic monitoring systems are real-time operation for accurately locating and classifying vehicles in traffic flows and total occlusions that hinder vehicle tracking. YOLO, a regression-based, high-performance algorithm, is therefore well suited to the real-time detection of and statistics collection from vehicle flows.

The robustness of YOLOv3 and YOLOv4 in road marking detection improves their accuracy in small target detection, and a model based on the TensorFlow framework has been used to enhance the real-time monitoring of traffic flow in an intelligent transportation system [17]. The YOLOv3 network comprises 53 layers. It uses the feature pyramid network to handle general multiscale object detection problems, such as pedestrian detection, and deep residual network (ResNet) ideas to extract image features, thereby achieving a trade-off between detection speed and detection accuracy [18]. In addition to leveraging anchor boxes with predesigned scales and aspect ratios to predict vehicles of different sizes, YOLOv3 and YOLOv4 can realize real-time vehicle detection with a top-down architecture [19]. Moreover, a real-time vehicle detection and classification system can perform foreground extraction, vehicle detection, vehicle feature extraction, and vehicle classification [20]. To test the proposed method for vehicle classification, a vehicle-feature-based virtual detection zone and virtual detection line, which are predefined for each frame in a video, are used for vehicle feature computation [21]. Grents et al. [22] proposed a video-based system that uses a convolutional neural network to count vehicles, classify vehicles, and determine the vehicle speed. Tabassum et al. [23, 24] applied YOLO and a transfer learning approach to recognize and classify native vehicles on Bangladeshi roads. Therefore, YOLO is a suitable basis for real-time vehicle detection and classification in traffic monitoring.

To address vehicle counting and classification problems in real-time traffic monitoring, this study presents a real-time traffic monitoring system based on a virtual detection zone, a Gaussian mixture model (GMM), and YOLO to increase vehicle counting and classification efficiency. The GMM and the virtual detection zone are used for vehicle counting, and YOLO is used to classify vehicles. Moreover, the distance and time traveled by a vehicle are used to estimate its speed. The major contributions of this study are as follows: (1) a real-time traffic monitoring system is developed to perform real-time vehicle counting, vehicle speed estimation, and vehicle classification; (2) the virtual detection zone, GMM, and YOLO are used to increase vehicle counting and classification efficiency; (3) the distance and time traveled by a vehicle are used to estimate the vehicle speed; and (4) the MAVD, GRAM-RTM, and our own collected data sets are used to compare various methods, with the proposed method using YOLOv4 achieving the highest classification accuracy on all three data sets.

The remainder of this study is organized as follows. Section 2 describes the materials and methods, including data set preparation, vehicle counting method, and vehicle classification method. Section 3 presents the results of and a discussion on the proposed real-time vehicle counting, speed estimation, and classification system based on a virtual detection zone and YOLO. Finally, Section 4 presents a few concluding remarks and an outline for future research on real-time traffic monitoring.

2. Materials and Methods

To count vehicles from traffic videos, this study proposes a real-time vehicle counting, speed estimation, and classification system based on the virtual detection zone and YOLO. We combined a vehicle detection method with a classification system on the basis of two conditions between the virtual detection zone and the virtual detection lane line. To detect vehicles, a Gaussian mixture model (GMM) is applied to detect moving objects in each frame of a traffic video. Figure 1 shows a flowchart of the vehicle counting and classification process used in the proposed real-time vehicle counting, speed estimation, and classification system. In this study, traffic videos are first collected for image data training and vehicle classification verification. Next, the GMM and virtual detection zone are used for vehicle counting. Finally, YOLO is used to perform vehicle classification in real time. The three steps are described as follows:

Part 1: Collect traffic videos from online cameras.

In this study, traffic videos were collected from online cameras and used for image data training and vehicle classification verification, as described in Section 2.1.

Part 2: Perform vehicle counting using the GMM and virtual detection zone.

To realize real-time vehicle counting, object detection and recognition are performed. A virtual detection lane line and a virtual detection zone are used to perform vehicle counting and speed estimation, as described in Sections 2.2 and 2.4, respectively.

Part 3: Perform vehicle classification and speed estimation using the YOLOv3 and YOLOv4 algorithms.

2.1. Data Set Preparation

The data set used in this study was prepared by collecting traffic videos recorded with online cameras installed along various roads in Taiwan. Image data were extracted from the traffic videos using a script, and labeling was performed using the open-source annotation tool LabelImg [25]. According to the common vehicle types on the road announced by the Directorate General of Highways, Ministry of Transportation and Communications (MOTC) in Taiwan, this study divides vehicles into six classes, namely, sedans, trucks, scooters, buses, hlink cars, and flink cars, in the training process; the vehicle lengths of these six classes are listed in Table 1. In this study, we used YOLO to perform vehicle classification without using the length of the vehicle.
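As a concrete illustration of the frame-extraction step, the sketch below pulls still images from a recorded traffic video with OpenCV (Python 3); the file names, output folder, and sampling interval are assumptions for illustration, not the script actually used in this study.

```python
import os
import cv2

def extract_frames(video_path, out_dir, every_n_frames=15):
    """Save one frame every `every_n_frames` frames (about two per second at 30 fps)."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:                      # end of video or read error
            break
        if idx % every_n_frames == 0:
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:06d}.jpg"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved

# Hypothetical usage: the saved images are then annotated with the labeling tool.
extract_frames("road_camera.mp4", "dataset/images")
```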

2.2. Vehicle Counting

To count vehicles, a GMM is used for background subtraction in complex environments to identify the regions of moving objects. The GMM is reliable for background extraction and foreground segmentation, so the characteristics of a moving object in video surveillance are easier to detect [26, 27]. The virtual detection zone is predefined in each video and used for vehicle feature computation. When a vehicle enters the virtual detection zone and the virtual detection lane line, the GMM is used for vehicle counting. The vehicle counting window is depicted in Figure 2.
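A minimal sketch of this counting step is given below, using OpenCV 4's MOG2 background subtractor as the GMM together with a rectangular virtual detection zone. The zone coordinates, blob-area threshold, and video file are illustrative assumptions, and the actual system additionally uses the virtual detection lane line and object association to avoid counting the same vehicle in consecutive frames.

```python
import cv2

cap = cv2.VideoCapture("traffic.mp4")
gmm = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16, detectShadows=True)
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
zone = (100, 300, 500, 400)            # virtual detection zone: x1, y1, x2, y2 (assumed)
count = 0

while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = gmm.apply(frame)                                   # GMM foreground mask
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)     # suppress small noise
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    for c in contours:
        if cv2.contourArea(c) < 1500:                         # ignore tiny blobs and shadows
            continue
        x, y, w, h = cv2.boundingRect(c)
        cx, cy = x + w // 2, y + h // 2
        if zone[0] <= cx <= zone[2] and zone[1] <= cy <= zone[3]:
            count += 1                   # simplistic per-frame count; real system tracks vehicles

cap.release()
print("foreground detections inside the zone:", count)
```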

2.3. Vehicle Detection and Classification

This study uses the YOLO algorithm to classify vehicles into six classes. A visual classifier based on the YOLO algorithm is used to verify the vehicle classification capability on the collected videos. Figure 3 depicts the architecture of this visual classifier, which classifies each vehicle into one of the six classes. In the training process, when a vehicle belonging to one of the six classes is detected, all bounding boxes are extracted, their classes are manually labeled, and the labeled data are passed to the YOLO model for classifying the vehicle.
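The sketch below shows one way to run a trained Darknet YOLO model for the six-class vehicle classification described above, using OpenCV 4's DNN module. The config, weights, and threshold values are placeholders; this is only an assumed inference path, not the authors' exact code, which was trained and run under the Darknet framework.

```python
import cv2
import numpy as np

net = cv2.dnn.readNetFromDarknet("yolov4.cfg", "yolov4.weights")   # placeholder file names
classes = ["sedan", "truck", "scooter", "bus", "hlink car", "flink car"]

def classify_vehicles(frame, conf_threshold=0.5, nms_threshold=0.4):
    """Return (class name, confidence, box) for each detected vehicle in one frame."""
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)
    outputs = net.forward(net.getUnconnectedOutLayersNames())

    h, w = frame.shape[:2]
    boxes, confidences, class_ids = [], [], []
    for out in outputs:
        for det in out:                       # det = [cx, cy, bw, bh, objectness, class scores...]
            scores = det[5:]
            class_id = int(np.argmax(scores))
            conf = float(scores[class_id]) * float(det[4])
            if conf < conf_threshold:
                continue
            cx, cy, bw, bh = det[:4] * np.array([w, h, w, h])
            boxes.append([int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)])
            confidences.append(conf)
            class_ids.append(class_id)

    keep = cv2.dnn.NMSBoxes(boxes, confidences, conf_threshold, nms_threshold)
    return [(classes[class_ids[i]], confidences[i], boxes[i]) for i in np.array(keep).flatten()]
```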

The YOLOv3 model architecture displayed in Figure 4 was used in this study. Images of size 416 × 416 px were input into the Darknet-53 network. This feature extraction network comprises 53 convolutional layers and is thus called Darknet-53 [11]. In Darknet-53, alternating convolution kernels are used, and each convolution layer is followed by a batch normalization layer. The leaky rectified linear unit function is used as the activation function, the pooling layer is discarded, and the stride of the convolution kernel is increased to reduce the size of the feature map. The YOLOv3 model uses residual (ResNet-style) connections for feature extraction and subsequently uses feature pyramid top-down and lateral connections to generate three feature maps with sizes of 13 × 13 × 1024, 26 × 26 × 512, and 52 × 52 × 256. The final output depth is (5 + C) × 3, where C is the number of classes; that is, for each of the three regression bounding boxes, the model predicts four basic box parameters, the credibility (objectness) of the box, and the probability of each class being contained in the bounding box. YOLOv3 uses the sigmoid function to score each class. When a class score is higher than the threshold, the object is considered to belong to that category, and any object can simultaneously have multiple class identities without conflict.
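As a quick check of these dimensions, the snippet below computes the output depth and the total number of candidate boxes for a 416 × 416 input with the six vehicle classes used in this study.

```python
num_classes = 6                       # sedan, truck, scooter, bus, hlink car, flink car
depth = (5 + num_classes) * 3         # 4 box coords + 1 objectness + class scores, for 3 anchors
grid_sizes = [13, 26, 52]             # the three YOLOv3 output scales for a 416 x 416 input
total_boxes = sum(s * s * 3 for s in grid_sizes)
print(depth, total_boxes)             # -> 33 channels per grid cell, 10647 candidate boxes
```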

The loss function of YOLOv3 is mainly divided into four parts. Part A is the loss of the identified center coordinates, used to predict (x, y) of the bounding box; it is only valid for the box responsible for the predicted target. Part B is the loss of the width and height (w, h) of the predicted bounding box; because an error of a given size matters more for small boxes than for large ones, the square roots of the width and height are predicted instead of the width and height themselves. Part C is the loss of the predicted object category: each grid cell is a candidate, and if the center of a detected object falls in a cell, that cell is responsible for predicting the bounding box (x, y, w, h) together with the category information of the object in the image. Part D is the loss of the credibility of the predicted object, i.e., the confidence computed for each bounding box indicating whether the box contains an object; when no object is predicted, a credibility prediction penalty weighted by λ_noobj is applied. The overall loss is defined as follows:

$$\begin{aligned}
\mathcal{L} ={}& \lambda_{\mathrm{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{\mathrm{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{obj}} \left[ \left( \sqrt{w_i} - \sqrt{\hat{w}_i} \right)^2 + \left( \sqrt{h_i} - \sqrt{\hat{h}_i} \right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\mathrm{obj}} \sum_{c \in \mathrm{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2 \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{obj}} \left( C_i - \hat{C}_i \right)^2 + \lambda_{\mathrm{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{noobj}} \left( C_i - \hat{C}_i \right)^2,
\end{aligned}$$

where (x_i, y_i) is the location of the centroid of the anchor box, (w_i, h_i) are the width and height of the anchor box, C_i is the objectness, i.e., the confidence score of whether there is an object or not, and the term over p_i(c) is the classification loss. Hatted symbols denote the corresponding predicted values, and the indicator 1_{ij}^{obj} selects the anchor box responsible for an object.

YOLOv4 is the latest algorithm in the YOLO series. Built on the basis of YOLOv3, it scales both up and down and is applicable to small and large networks while maintaining optimal speed and accuracy; the network architecture is shown in Figure 5. Compared with YOLOv3, YOLOv4-tiny extends the original design: a cross stage partial (CSP) structure is added to the original Darknet-53 network, and the backbone is CSPOSANet, built from the Cross Stage Partial Network (CSPNet) and the One-Shot Aggregation Network (OSANet) plus Partial in Computational Blocks (PCB) technology. CSPNet can be applied to different CNN architectures to reduce the number of parameters and the amount of computation while improving accuracy. OSANet is derived from the OSA module in VoVNet, whose central idea improves on the DenseNet module: at the end, all layers are connected so that the number of input channels is consistent with the number of output channels. PCB technology makes the model more flexible because the structure can be adjusted to achieve the best accuracy-speed balance.

The loss function of YOLOv4 consists of three parts: classification loss, regression loss, and confidence loss [28]. The classification loss and confidence loss remain the same as in the YOLOv3 model, but the complete intersection over union (CIoU) is used in place of the mean-squared error (MSE) to optimize the regression loss [29]. The CIoU loss function is shown as follows:

$$\mathcal{L}_{\mathrm{CIoU}} = 1 - IoU + \frac{\rho^2\left(b, b^{gt}\right)}{c^2} + \alpha v, \qquad v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2, \qquad \alpha = \frac{v}{(1 - IoU) + v},$$

where ρ(b, b^{gt}) is the distance between the centers of the predicted box b and the ground-truth box b^{gt}, c is the diagonal length of the smallest box enclosing the two boxes, and v measures the consistency of their aspect ratios. The prediction is organized over S² grids (S × S); each grid generates B candidate boxes, and each candidate box yields a corresponding bounding box through the network, so S × S × B bounding boxes are formed. If there is no object (noobj) in a box, only the confidence loss of that box is calculated. The confidence loss function uses the cross-entropy error and is divided into two parts, one for boxes containing an object (obj) and one for noobj boxes; the noobj part is weighted by the coefficient λ to reduce the contribution of the noobj calculation. The classification loss function also uses the cross-entropy error. When the j-th anchor box of the i-th grid is responsible for a certain ground truth, the bounding box generated by this anchor box contributes to the classification loss.
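For reference, a plain-Python sketch of the CIoU computation in the equation above is given below for two boxes in (center x, center y, width, height) form; the variable names and small epsilon terms are illustrative and are not taken from the authors' implementation.

```python
import math

def ciou_loss(box_pred, box_gt):
    """CIoU loss for two boxes given as (cx, cy, w, h); illustrative sketch only."""
    px, py, pw, ph = box_pred
    gx, gy, gw, gh = box_gt

    # Corner coordinates of both boxes.
    p_x1, p_y1, p_x2, p_y2 = px - pw / 2, py - ph / 2, px + pw / 2, py + ph / 2
    g_x1, g_y1, g_x2, g_y2 = gx - gw / 2, gy - gh / 2, gx + gw / 2, gy + gh / 2

    # Intersection and union areas -> IoU.
    inter_w = max(0.0, min(p_x2, g_x2) - max(p_x1, g_x1))
    inter_h = max(0.0, min(p_y2, g_y2) - max(p_y1, g_y1))
    inter = inter_w * inter_h
    union = pw * ph + gw * gh - inter
    iou = inter / (union + 1e-9)

    # Squared center distance and squared diagonal of the smallest enclosing box.
    rho2 = (px - gx) ** 2 + (py - gy) ** 2
    c_w = max(p_x2, g_x2) - min(p_x1, g_x1)
    c_h = max(p_y2, g_y2) - min(p_y1, g_y1)
    c2 = c_w ** 2 + c_h ** 2 + 1e-9

    # Aspect-ratio consistency term and its trade-off weight.
    v = (4 / math.pi ** 2) * (math.atan(gw / (gh + 1e-9)) - math.atan(pw / (ph + 1e-9))) ** 2
    alpha = v / ((1 - iou) + v + 1e-9)

    return 1 - iou + rho2 / c2 + alpha * v
```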

2.4. Speed Estimation

The real-time vehicle speed is also calculated in this study. Figure 6 shows the video images taken along the direction parallel to the length of the car (defined as the y-axis) and parallel to the width of the car (defined as the x-axis). First, as per the scale in the video, the yellow line (referred to as L) in the red circle has a length of 4 m in accordance with traffic laws. A GMM is used to draw a virtual detection zone (blue box) on the road to be tested (referred to as Q). The green box is the car frame (referred to as C), and the midpoint of the car is Ct. The scale s relates a length in the video (in pixels, px) to the actual length (in meters, m); it is computed for the blue box (s_Q) and for the green box (s_C). The parameter Δs denotes the increase or decrease of the scale per unit length along the y-axis. If the scale condition is satisfied, the speed calculation is performed using equation (4).

To calculate the line segment through Ct parallel to L (referred to as L_Ct), the algorithm computes the pixel distance y between A and B. Then, y is restored through its scale relationship to the actual length x, where x denotes the distance traveled by the vehicle in Q.

In the calculation process, the program determines the frame rate of the video (fps) and counts the number of frames p for which the vehicle travels in Q. The travel time t of the vehicle from A to B in Q is obtained from equation (6):

$$t = \frac{p}{\mathrm{fps}}.$$

The vehicle speed (in m/s) is then calculated with equation (7), and equation (8) converts it from m/s to km/h:

$$v = \frac{x}{t}, \qquad v_{\mathrm{km/h}} = 3.6\,v.$$
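The following sketch puts equations (6)-(8) together; the pixel-to-meter scale, frame count, and frame rate in the usage example are assumed values, not measurements from the study.

```python
def estimate_speed_kmh(pixel_distance, pixels_per_meter, frame_count, fps):
    """Speed estimate from the distance and time a vehicle spends in the detection zone Q."""
    distance_m = pixel_distance / pixels_per_meter   # x: meters traveled in Q
    travel_time_s = frame_count / fps                # t = p / fps      (equation (6))
    speed_ms = distance_m / travel_time_s            # v = x / t        (equation (7))
    return speed_ms * 3.6                            # m/s -> km/h      (equation (8))

# Example with assumed values: 240 px at 20 px/m over 15 frames of a 30 fps video
# is 12 m in 0.5 s, i.e., 24 m/s ≈ 86.4 km/h.
print(estimate_speed_kmh(240, 20, 15, 30))
```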

3. Results and Discussion

All experiments in this study were performed using the YOLO algorithm under the Darknet framework, and the program was written in Python 2.7. To validate the real-time traffic monitoring system, we used real-world data sets to perform vehicle detection, vehicle counting, speed estimation, and classification. Three test data sets were used to evaluate the proposed method. The first was mainly derived from traffic video images of online cameras on various roads in Taiwan, and it contains 12,761 training images and 3,190 testing images. The second was the Montevideo Audio and Video Dataset (MAVD), which contains data on different levels of traffic activity and social use characteristics in Montevideo city, Uruguay [30]. The third was the GRAM Road-Traffic Monitoring (GRAM-RTM) data set [21], which has four categories (i.e., cars, trucks, vans, and big-trucks); the total number of different objects in each sequence is 256 for M-30, 235 for M-30-HD, and 237 for Urban 1. In this study, the definition of accuracy is based on the classification of vehicles in the database. In the video verification, a vehicle count is considered correct if the manual classification and the classification produced by the proposed system agree; otherwise, it is treated as an incorrect count.

3.1. Vehicle Counting

Seven input videos of the road, each ranging in length between 3 and 5 minutes, were recorded at 10 am and 8 pm. In addition, eleven input videos of the road in the rain were also recorded for testing. Each frame in these traffic videos was captured at 30 fps. The first experimental results of real-time vehicle counting using the proposed method during the day are summarized in Table 2. The symbols S and L denote small and large vehicles, respectively. The vehicle counting accuracy of the proposed method at 10 am was 95.5%. The second experimental results of real-time vehicle counting using the proposed method during the night are summarized in Table 3. The vehicle counting accuracy of the proposed method at 8 pm was 98.5%. In addition, the third experimental results of real-time vehicle counting using the proposed method in the rain are summarized in Table 4. The vehicle counting accuracy of the proposed method was 94%. Screenshots of vehicle detection with the proposed real-time vehicle counting and classification system are depicted in Figure 7, where the detected vehicles are represented as green rectangles.

Vehicle counting in online videos can be delayed by network stoppages, and the target vehicle may be blocked by other vehicles on the screen, which causes counts to be missed. In addition, poor lighting at night and in the rain affects the vehicle recognition capabilities of YOLOv3 and YOLOv4. These challenges can be mitigated by using a stable network connection and adjusting the camera brightness, respectively. A key contribution of this study is addressing the problem of unclear recognition in the rain.

3.2. Speed Estimation

In this subsection, the vehicle speed is estimated using the proposed method. Table 5 lists the actual and estimated speeds of the vehicles. The results indicate that the average absolute percentage error of vehicle speed estimation was about 7.6%. Using online video for vehicle speed estimation can cause large speed errors due to network delays; therefore, network stability is essential for reducing the percentage error in the speed estimation.
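For clarity, the error metric can be computed as below; the speed values here are made-up illustrations, not the entries of Table 5.

```python
# Average absolute percentage error between actual and estimated speeds (km/h).
actual = [62.0, 55.0, 70.0, 48.0]          # illustrative ground-truth speeds
estimated = [58.1, 59.2, 64.9, 51.3]       # illustrative estimated speeds

mape = sum(abs(a - e) / a for a, e in zip(actual, estimated)) / len(actual) * 100
print(f"average absolute percentage error: {mape:.1f}%")   # about 7.0% for these made-up values
```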

3.3. Comparison Results Using the MAVD and GRAM-RTM Data Sets

The MAVD traffic data set [30] and the GRAM Road-Traffic Monitoring (GRAM-RTM) data set [21] were used for evaluating the vehicle counting performance of the proposed method. The videos were recorded with a GoPro Hero 3 camera at a frame rate of 30 fps and a resolution of 1920 × 1080 px. We analyzed 10 videos, and the vehicle counting accuracy of the proposed method at 10 am for the MAVD traffic data set was 93.84%. Vehicle classification results of the proposed method on the MAVD traffic data set are listed in Table 6.

In summary, three data sets, namely, MAVD, GRAM-RTM, and our own collected data sets, were used to compare the proposed method with the Fast R-CNN method [10]. The MAVD training and testing samples contain vehicles belonging to four categories (i.e., cars, buses, motorcycles, and trucks). The GRAM-RTM data set has four categories (i.e., cars, trucks, vans, and big-trucks). The total number of different objects in each sequence is as follows: 256 for M-30, 235 for M-30-HD, and 237 for Urban 1. Table 7 shows the classification accuracy results for the three data sets using various methods. In Table 7, the proposed method with YOLOv4 achieved the highest classification accuracy of 98.91% and 99.5% on the MAVD and GRAM-RTM data sets, respectively. Moreover, three different environments (i.e., daytime, nighttime, and rainy conditions) are used to verify the proposed method. Experimental results indicate that the proposed method with YOLOv4 also achieves the highest classification accuracy of 99.1%, 98.6%, and 98% in daytime, nighttime, and rainy conditions, respectively.

Recently, researchers have adopted various methods for vehicle classification using the GRAM-RTM data set, such as Fast R-CNN [10], CNN [31], and DNN [32]. Therefore, we used the same GRAM-RTM data set to compare the proposed method with these methods. Table 8 shows the comparison results, which indicate that the proposed method with YOLOv4 performs better than the other methods.

4. Conclusions

In this study, a real-time traffic monitoring system based on a virtual detection zone, GMM, and YOLO is proposed for increasing the vehicle counting and classification efficiency. The GMM and the virtual detection zone are used for vehicle counting, and YOLO is used to classify vehicles. Moreover, the distance and time traveled by a vehicle are used to estimate its speed. The MAVD, GRAM-RTM, and our own collected data sets are used to verify the proposed method. Experimental results indicate that the proposed method with YOLOv4 achieved the highest classification accuracy of 98.91% and 99.5% on the MAVD and GRAM-RTM data sets, respectively. Moreover, the proposed method with YOLOv4 also achieves the highest classification accuracy of 99.1%, 98.6%, and 98% in daytime, nighttime, and rainy conditions, respectively. In addition, the average absolute percentage error of vehicle speed estimation with the proposed method is about 7.6%. Therefore, the proposed method can be applied to vehicle counting, speed estimation, and classification in real time.

However, the proposed method has a few limitations. The vehicles appearing in the video are assumed to be inside the virtual detection zone; thus, the width of the virtual detection zone should be sufficiently large for counting the vehicles. In future work, we will focus on algorithm acceleration and model simplification.

Data Availability

The MAVD and GRAM-RTM traffic data sets are available at https://zenodo.org/record/3338727#.YBD8B-gzY2w and https://gram.web.uah.es/data/datasets/rtm/index.html.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this study.

Acknowledgments

This research was funded by the Ministry of Science and Technology of the Republic of China, grant number MOST 110-2221-E-167-031-MY2.