1. Introduction
Camellia oleifera fruits can be used to extract tea oil [
1]. Tea oil is a high-quality edible oil with clear colour, rich nutrient content and storage resistance, so it has good economic benefits and market prospects [
2]. However, manual picking requires high labour costs because of the complex growth environment of
Camellia oleifera, such as hills and mountains [
3]. Therefore, the mechanisation of harvesting
Camellia oleifera fruits should be improved to increase efficiency, save labour cost and promote the development of the Camellia industry [
4]. A key technology of automatic picking is to rapidly detect the location of the fruit in field environments. Therefore, a rapid and accurate method for detection of
Camellia oleifera fruits in complex field environments should be developed.
Existing methods for crop detection mainly rely on imaging technology [
5,
6,
7]. Traditional image detection algorithms are mainly based on the colour and shape of the target, which are different from other objects [
8]. These algorithms rely on many fixed thresholds, which limits their application: they are prone to false detections in complex scenes and lack sufficient robustness [
9]. Deep learning algorithms have been widely used in crop detection to effectively extract target features in complex scenes and overcome the limitations of traditional algorithms [
10]. The convolutional neural network (CNN), a kind of feedforward neural network (FNN), involves convolution computation and has a deep structure. As a representative deep learning algorithm, the CNN has been widely used in classification [
11], localisation [
12], detection [
13] and segmentation [
14] of crops and fruits. A variety of CNN-based detection algorithms, such as YOLO v3 [
15], YOLOv5 [
16,
17] and Faster R-CNN [
18], have been used to detect fruit targets. Therefore, in the present study, image detection technology and CNN will be used to detect
Camellia oleifera fruits.
YOLO is a commonly used single-stage target detection algorithm characterised by high speed and high accuracy [
19,
20]. It exhibits satisfactory performance in detecting small and occluded targets in complex field environments and has better detection speed than other deep learning algorithms [
21,
22]. YOLOv7 is the latest detector in the YOLO series. This network is designed with a trainable bag-of-freebies, which enables real-time detectors to greatly improve accuracy without increasing the inference cost. It also introduces extend and compound scaling methods, which reduce the number of parameters and calculations of the detector and thereby greatly improve the detection speed [
23]. At present, YOLOv7, as a brand-new detector, has not been applied to fruit detection. Therefore, in the present work, YOLOv7 was used to detect
Camellia oleifera fruits.
Camellia oleifera orchards have a complex environment, where an equisized image of
Camellia oleifera fruit will be disturbed by sidelight, backlight, slight occlusion and heavy occlusion, which will lead to false detection or missed detection of targets. The training image should include more scenes to extract features and overcome the interference of complex scenes [
24]. However, the number of images is limited due to constraints of orchard and acquisition time when
Camellia oleifera fruit images are collected in the field. Therefore, existing research on deep learning usually enhances existing data to obtain more training data and achieve better generalisation of the neural network. Traditional data augmentation methods include mirroring, rotating, changing brightness and adding noise [
25]. Mosaic is a new data augmentation method for mixing multiple images, and it greatly enriches the background of detected objects [
26]. These methods can increase the number of datasets and improve the robustness of the detection models in complex scenes [
27]. In this study, the traditional and Mosaic data augmentation methods are combined to develop a detection model for
Camellia oleifera fruit in complex scenes.
To solve the difficulty of Camellia oleifera fruit detection in complex environments, this study proposes a detection method based on imaging technology and YOLOv7 network combined with image augmentation. This study aims to: (1) acquire and pre-process Camellia oleifera fruit images in complex conditions to establish detection datasets; (2) develop a YOLOv7 detection model and compare its performance with Faster RCNN, YOLO v3 and YOLOv5s models in complex environment; and (3) build an augmented dataset by combining multiple augmentation methods, compare the performance of YOLOv7 models based on augmented and original datasets and select the optimal model.
2. Materials and Methods
2.1. Acquisition of Camellia oleifera Fruit Images
The fruits of
Camellia oleifera in standardised planting orchards were used as the research object. Original images of
Camellia oleifera fruits were collected from orchards in Qinglongwan Ecological Garden, Tongcheng City, Anhui Province and planting bases in Yongzhou City, Hunan Province. In standard planting mode, the row spacing of Camellia oleifera trees was 2 m, the plant spacing was about 1 m and the tree height was about 1.8–2.5 m. All images of
Camellia oleifera fruit were obtained in August 2021. Image acquisition was conducted in the morning, noon and afternoon under sunny and cloudy weather and natural light condition in the field. A total of 100
Camellia oleifera fruit trees with good growth were selected by random sampling, and the tree age was 8 to 10 years. None of the
Camellia oleifera trees had been picked in the same year, which ensured that the growth form of the fruit was not destroyed. Images were captured from different angles at shooting distances of 0.5–1.5 m, with the camera 1–2 m above the ground. The acquired images covered the following conditions: slight occlusion, heavy occlusion, overlapping fruit, natural light angle, sidelight angle, backlight angle, etc. Examples of acquired images are shown in
Figure 1. Slight occlusion is when the part of the fruit occluded by branches and leaves is less than one third of the total area. Heavy occlusion is when the part of the fruit occluded by branches and leaves is more than one third and less than two thirds of the total area. Sidelight angle is when the angle between the lens direction and the direct sunlight direction is 90° when shooting the images. Backlight angle is when the angle between the lens direction and the direct sunlight direction is 180° when shooting the images. Eight to 12 images of
Camellia oleifera fruit were taken for each fruit tree. A total of 873 images of
Camellia oleifera fruit were obtained after removing the blurred or repeated images. A single-lens reflex camera (Canon 200DII, Canon Inc., Tokyo, Japan) in “AUTO” mode with a resolution of 4608 × 3456 pixels was used to acquire the images saved in JPG format.
2.2. Image Preprocessing and Dataset Partitioning
Firstly, 200 images (50 of slight occlusion, 50 of heavy occlusion, 50 of sidelight angle and 50 of backlight angle) were randomly selected from the 873 images as the test set to evaluate the generalisation of the detection model. The remaining 673 images were randomly divided into a training set (606 images) and a validation set (67 images) at a ratio of 9 to 1. It was ensured that no image was repeated among the training, validation and test sets to prevent overfitting of the model [
28].
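The partitioning described above can be sketched as follows. This is a minimal illustration rather than the script used in the study: the function name, the fixed seed and the use of integer image ids are assumptions, and the sketch draws the 200 test images uniformly at random instead of drawing 50 per scene condition.

```python
import random


def split_dataset(image_ids, test_size=200, val_ratio=0.1, seed=0):
    """Split image ids into disjoint training, validation and test lists."""
    rng = random.Random(seed)          # fixed seed (assumed) for repeatability
    ids = list(image_ids)
    rng.shuffle(ids)
    test, rest = ids[:test_size], ids[test_size:]
    n_val = int(len(rest) * val_ratio)  # 673 remaining images -> 67 for validation
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test


train, val, test = split_dataset(range(873))
print(len(train), len(val), len(test))
```

Because the three lists are slices of one shuffled list, no image can appear in more than one subset.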
Image data annotation software ‘LabelImg’ was used to draw the outer rectangle of the
Camellia oleifera fruit target in all images of the training set to complete the manual labelling of the fruit. Images were labelled based on the smallest surrounding rectangle of the
Camellia oleifera fruit to ensure that the rectangle contained as little background area as possible. Examples of labelled
Camellia oleifera fruit images are shown in
Figure 2. XML format files were generated after the annotations were saved [
29].
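LabelImg saves such annotations in the Pascal VOC XML layout by default, so the labelled rectangles can be read back with the Python standard library. The following is a minimal sketch: the sample annotation, the file name and the class name "camellia_fruit" are hypothetical, and the conversion to normalised centre/width/height values is shown only as one common label layout for YOLO-style training, not as the procedure used in this study.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation in Pascal VOC layout (values invented for illustration).
SAMPLE_XML = """<annotation>
  <filename>example.jpg</filename>
  <size><width>4608</width><height>3456</height><depth>3</depth></size>
  <object>
    <name>camellia_fruit</name>
    <bndbox><xmin>1200</xmin><ymin>900</ymin><xmax>1500</xmax><ymax>1180</ymax></bndbox>
  </object>
</annotation>"""


def read_boxes(xml_text):
    """Return a list of (class_name, xmin, ymin, xmax, ymax) tuples."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        bb = obj.find("bndbox")
        coords = tuple(int(bb.find(tag).text)
                       for tag in ("xmin", "ymin", "xmax", "ymax"))
        boxes.append((obj.find("name").text, *coords))
    return boxes


def to_normalised(box, img_w, img_h):
    """Convert corner coordinates to normalised (x_centre, y_centre, width, height)."""
    xmin, ymin, xmax, ymax = box
    return ((xmin + xmax) / 2 / img_w, (ymin + ymax) / 2 / img_h,
            (xmax - xmin) / img_w, (ymax - ymin) / img_h)


boxes = read_boxes(SAMPLE_XML)
print(boxes)
print(to_normalised(boxes[0][1:], 4608, 3456))
```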
2.3. Data Augmentation
When establishing a deep learning-based object detection model for
Camellia oleifera fruit, a high-quality dataset with a large amount of image data can improve the quality of model training and prediction accuracy. Therefore, the acquired
Camellia oleifera fruit images should be augmented [
30].
Several image augmentation methods were applied to the 606 images of the training set to improve the generalisation ability and avoid overfitting of the detection model. These methods were implemented in the PyCharm environment with its related image processing libraries. The image augmentation methods included horizontal mirroring, vertical mirroring, brightness enhancement and reduction, multi-angle rotation (90°, 180°, 270°), adding Gaussian noise and Mosaic data augmentation. The detailed steps of the image augmentation methods are as follows.
Multi-angle rotation of an image enables the deep learning model to learn more object features in different positions and directions during training. The OpenCV functions “cv2.getRotationMatrix2D” and “cv2.warpAffine” were called from Python to rotate the original image. Image rotation was conducted by setting the parameter “angle” of the function to 90°, 180° and 270°.
Image mirroring (horizontal and vertical mirroring) can increase the viewing angles of the Camellia oleifera fruit. The OpenCV function ‘flip’ was used to mirror the original image. For horizontal mirroring, the left and right parts of the image were swapped symmetrically about the vertical centre axis; this was achieved by setting the flip parameter (named ‘flipCode’ in OpenCV, referred to here as ‘dim’) to 1. For vertical mirroring, the upper and lower parts of the image were swapped symmetrically about the horizontal centre axis.
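When an image is rotated or mirrored, its labelled rectangles must be transformed in the same way. The sketch below shows the box arithmetic for mirroring and for right-angle rotations in plain Python. It is an illustration under stated assumptions, not the code of this study: boxes are (xmin, ymin, xmax, ymax) in pixels with the origin at the top-left corner, rotation is counter-clockwise, and a 90° turn is assumed to swap the image width and height rather than crop to the original canvas. All function names are hypothetical.

```python
def hflip_box(box, img_w):
    """Mirror a box about the vertical centre axis (horizontal mirroring)."""
    xmin, ymin, xmax, ymax = box
    return (img_w - xmax, ymin, img_w - xmin, ymax)


def vflip_box(box, img_h):
    """Mirror a box about the horizontal centre axis (vertical mirroring)."""
    xmin, ymin, xmax, ymax = box
    return (xmin, img_h - ymax, xmax, img_h - ymin)


def rot90_ccw_box(box, img_w):
    """Rotate a box 90 degrees counter-clockwise: (x, y) -> (y, img_w - x)."""
    xmin, ymin, xmax, ymax = box
    return (ymin, img_w - xmax, ymax, img_w - xmin)


def rotate_box(box, img_w, img_h, angle):
    """Rotate by a multiple of 90 degrees; returns the box and the new (w, h)."""
    if angle % 90 != 0:
        raise ValueError("only multiples of 90 degrees are handled here")
    for _ in range((angle // 90) % 4):
        box = rot90_ccw_box(box, img_w)
        img_w, img_h = img_h, img_w   # each quarter turn swaps width and height
    return box, (img_w, img_h)


box = (100, 50, 300, 150)             # hypothetical fruit box in a 640 x 480 image
print(hflip_box(box, 640))
print(vflip_box(box, 480))
for angle in (90, 180, 270):
    print(angle, rotate_box(box, 640, 480, angle))
```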
Image brightness was enhanced and reduced. The complex light conditions of the plantation caused differences in Camellia oleifera fruit images, thereby interfering with the detection results. Therefore, the values of the three channels of each pixel of the original image were multiplied by 1.5 and 0.5 to enhance and reduce the brightness of the image, respectively. This method improved the robustness of the model.
Gaussian noise was also added to the image. Unclear or blurred images caused by shaking of the equipment or the branches would affect the accuracy of the detection model. Gaussian noise with a parameter ‘sigma’ of 25 was added to the original image to simulate the low-quality images that the model may encounter in practical applications.
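The brightness and noise operations can be sketched with NumPy as follows. The sketch assumes 8-bit images, zero-mean noise and clipping of the result to the range 0–255; none of these details are stated in the text, and the function names are hypothetical.

```python
import numpy as np


def scale_brightness(image, factor):
    """Multiply all three channels by `factor` and clip to the 8-bit range."""
    out = image.astype(np.float32) * factor
    return np.clip(out, 0, 255).astype(np.uint8)


def add_gaussian_noise(image, sigma=25.0, seed=None):
    """Add zero-mean Gaussian noise (assumed) with standard deviation `sigma`."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, sigma, size=image.shape)
    out = image.astype(np.float32) + noise
    return np.clip(out, 0, 255).astype(np.uint8)


image = np.full((4, 4, 3), 200, dtype=np.uint8)   # small synthetic test image
brighter = scale_brightness(image, 1.5)           # 200 * 1.5 = 300, clipped to 255
darker = scale_brightness(image, 0.5)             # 200 * 0.5 = 100
noisy = add_gaussian_noise(image, sigma=25.0, seed=0)
print(brighter[0, 0], darker[0, 0], noisy.dtype, noisy.shape)
```

Converting to floating point before the arithmetic avoids the wrap-around that would occur if values above 255 were stored directly in an 8-bit array.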
Mosaic data augmentation was performed with reference to the CutMix data augmentation method. During training, the input size of the model was taken as S × S, and a 2S × 2S grey image was used as a canvas. A point within the rectangle framed by point A (S/2, S/2) and point B (3S/2, 3S/2) was selected as the reference point. Four images were randomly selected and stitched onto the canvas around this point by random scaling, cropping and arrangement. Image regions and labelled boxes falling beyond the canvas were ignored. Because each mosaic image contains content from four images, Mosaic data augmentation increased the amount of training data in each batch without increasing the batch size, which reduced the memory requirements of the model. As a result, the mean and variance of each feature layer calculated during the batch normalisation (BN) operation were closer to the mean and variance of the entire dataset. Mosaic data augmentation also enriched the background of the image, and the image formed by splicing multiple images added numerous small-object Camellia oleifera fruit, thereby improving the detection accuracy of the detection model.
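The canvas geometry of Mosaic augmentation can be sketched as follows. This is a simplified illustration, not the implementation used in the study: it only selects the reference point and derives the four canvas regions that the stitched images would occupy, and it shows one possible way of discarding label boxes that fall outside a region. The function names and the region naming are hypothetical.

```python
import random


def mosaic_layout(S, rng=None):
    """Pick a reference point in [S/2, 3S/2]^2 and split the 2S x 2S canvas."""
    rng = rng or random.Random()
    cx = rng.randint(S // 2, 3 * S // 2)
    cy = rng.randint(S // 2, 3 * S // 2)
    side = 2 * S
    regions = {                      # (xmin, ymin, xmax, ymax) on the canvas
        "top_left": (0, 0, cx, cy),
        "top_right": (cx, 0, side, cy),
        "bottom_left": (0, cy, cx, side),
        "bottom_right": (cx, cy, side, side),
    }
    return (cx, cy), regions


def clip_box(box, region):
    """Clip a label box to a region; return None when nothing remains."""
    xmin, ymin = max(box[0], region[0]), max(box[1], region[1])
    xmax, ymax = min(box[2], region[2]), min(box[3], region[3])
    if xmax <= xmin or ymax <= ymin:
        return None
    return (xmin, ymin, xmax, ymax)


centre, regions = mosaic_layout(640, random.Random(0))
print(centre)
for name, region in regions.items():
    print(name, region)
```

The four regions always tile the whole canvas, so every pixel of the 2S × 2S image is covered by exactly one of the four source images.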
The final augmented training set consists of 5854 images, including 606 original images and 5248 enhanced images. The detailed distribution of the augmented training set is shown in
Figure 3.
2.4. YOLOv7 Network Architecture
YOLOv7, the latest detector with the YOLO architecture, is an object detection network with fast detection speed and high precision that is easy to train and deploy. In the range of 5–160 FPS, it surpasses currently known object detectors in both speed and accuracy. At the same model volume, the network is 120% faster (in FPS) than YOLOv5, and its test results on the MS COCO dataset outperform those of the YOLOv5 detector [
31].
Figure 4 shows the network structure of YOLOv7.
Based on the structure diagram, the YOLOv7 network consists of three parts, namely, input network, backbone network and head network. The YOLOv7 network firstly pre-processed the image, resized it to 640 × 640 × 3 and inputted it into the backbone network. The CBS composite module, efficient layer aggregation networks (ELAN) module and MP module alternately reduced the length and width of the feature map by 1/2, and the number of the output channels was increased to twice the number of input channels. As shown in
Figure 5, the CBS composite module performed convolution, BN and an activation function on the input feature map. YOLOv7, like YOLOv5, used SiLU as the activation function. The ELAN module used expand, shuffle and merge cardinality to continuously improve the learning ability of the network without destroying the original gradient path, thereby improving the accuracy of the network. The ELAN structure was composed of different convolutions. Group convolution was used to expand the channels and cardinality of the computational blocks while keeping the number of channels in each set of feature maps the same as in the original architecture. Finally, the number of channels output by the ELAN module was twice that of the input. The upper branch of the MP module halved the length and width of the feature map by a maxpooling operation and halved the channels by convolution. The lower branch halved the channels by the first convolution, and the second convolution, with a kernel size of 3 and a stride of 2, halved the length and width of the feature map. The outputs of the upper and lower branches were then concatenated, giving an output feature map with half the length and width and the same number of channels as the input.
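The shape bookkeeping of the MP module can be checked with the usual output-size formula for convolution and pooling, out = floor((n + 2p − k) / s) + 1. The sketch below is illustrative only: the maxpooling kernel of 2 with stride 2 and the padding of 1 for the 3 × 3 convolution are assumptions, since the text gives only the kernel size and stride of the second convolution, and the function names are hypothetical.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def mp_module_shape(channels, size):
    """Return (channels, size) after an MP module, following the description."""
    # Upper branch: maxpooling (kernel 2, stride 2 assumed), then a 1 x 1
    # convolution that halves the channels.
    upper = (channels // 2, conv_out(size, 2, 2, 0))
    # Lower branch: 1 x 1 convolution halves the channels, then a 3 x 3
    # convolution with stride 2 (padding 1 assumed) halves the size.
    lower = (channels // 2, conv_out(conv_out(size, 1, 1, 0), 3, 2, 1))
    if upper[1] != lower[1]:
        raise ValueError("branches must agree in spatial size to be concatenated")
    return (upper[0] + lower[0], upper[1])   # concatenation along the channels


print(mp_module_shape(256, 80))
```

With these assumptions both branches produce the same spatial size for even inputs, so concatenating them restores the input channel count at half the resolution.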
Based on the three-layer output in the backbone network, the head network continued to output three layers of feature maps of different sizes. After the Repconv module adjusted the final number of the output channels, three layers of convolution operation of kernel_size = 1 (1 × 1) were used to proceed to objectness, class and bbox prediction tasks for image detection to obtain results. The head network consists of SPPCSPC module, a series of CBS modules, MP module, Catconv module and three subsequent Repconv modules. The SPPCSPC module is similar to the SPPF used by YOLOv5 to increase the receptive field of a network. Firstly, the input feature map with a size of 512 × 20 × 20 was obtained and subjected to three convolution operations. Maxpooling operations with kernel size of 5, 9 and 13 were performed (for different kernel sizes, padding is adaptive) three times. Finally, the feature map with a size of 512 × 20 × 20 was obtained by combining the results with only 1 × 1 convolution operation data without pooling. The SPPCSPC module can obtain multi-scale object information while keeping the size of the feature maps unchanged. YOLOv7 was used to develop a more standardised model with a re-parameterised structure, namely, Repconv structure [
32]. It increased the training time and improved the inference effect [
33]. During training, a whole module was split into multiple identical or different branches, namely, a 3 × 3 convolution + BN, a 1 × 1 convolution + BN and a BN layer (present when the input and output channels were the same), to obtain the training model. During inference, the three branches were re-parameterised: their parameters were converted equivalently into those of a single 3 × 3 convolution. The multi-branch training model was thus transformed into a high-speed single-branch inference model. The final deployed model retained the high accuracy and other excellent properties of the multi-branch model while maintaining high efficiency, giving a good balance of speed and accuracy.
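The equivalence behind re-parameterisation follows from the linearity of convolution, which can be demonstrated on a deliberately simplified case. The sketch below uses a single channel and omits the BN layers (in practice each BN would first be folded into the weights and bias of its branch); it is an illustration of the principle, not the Repconv implementation, and all names are hypothetical.

```python
import numpy as np


def conv2d_same(x, kernel):
    """Naive single-channel 2D cross-correlation with zero padding."""
    k = kernel.shape[0]
    pad = k // 2
    xp = np.pad(x, pad)
    out = np.zeros_like(x, dtype=np.float64)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + k, j:j + k] * kernel)
    return out


def fuse_branches(k3, k1):
    """Merge a 3 x 3 branch, a 1 x 1 branch and an identity branch."""
    fused = k3.copy()
    fused[1, 1] += k1[0, 0]   # 1 x 1 kernel zero-padded to 3 x 3
    fused[1, 1] += 1.0        # identity expressed as a 3 x 3 kernel with centre 1
    return fused


rng = np.random.default_rng(0)
x = rng.normal(size=(6, 6))
k3 = rng.normal(size=(3, 3))
k1 = rng.normal(size=(1, 1))

multi_branch = conv2d_same(x, k3) + conv2d_same(x, k1) + x   # training form
single_branch = conv2d_same(x, fuse_branches(k3, k1))        # inference form
print(np.allclose(multi_branch, single_branch))
```

The single fused kernel reproduces the sum of the three branches, which is why the deployed model can keep the trained behaviour while running only one convolution per module.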
2.5. Training Platform and Parameter Settings
Based on the PyTorch deep learning framework, training and testing were performed on a desktop computer with the Windows 10 operating system, an Intel Core i7-7800X CPU and 32 GB of RAM. An NVIDIA GeForce RTX 3060 Ti graphics card with 8 GB of video memory was selected to meet the GPU computing requirements. Python 3.8 was used as the programming language. The software tools included CUDA 11.3, cuDNN 8.2, OpenCV 3.4.5 and Visual Studio 2017.
In this study, the YOLOv7 network was trained for Camellia oleifera fruit detection through transfer learning. The number of training epochs was 300, the batch size was set to 8 and the input size was set to 640 × 640. Regularisation was performed each time through the BN layer to update the model’s weights. The momentum factor (momentum) was set to 0.937, and the weight decay rate (decay) was set to 0.0005. The initial learning rate was set to 0.01, and the augmentation coefficients of hue (H), saturation (S) and value (V) were 0.015, 0.7 and 0.4, respectively. During the training process, the TensorBoard visualisation tool was used to record data and observe the loss, and the model weights were saved at every epoch.
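For reference, the settings above can be collected in a single configuration object. The sketch below is illustrative: the key names are hypothetical, the value 0.01 is interpreted as the initial learning rate (an assumption), and the helper simply shows how many weight updates one epoch implies for the original and augmented training sets when the last partial batch is kept.

```python
import math

# Training settings as described in the text; key names are illustrative only.
TRAIN_CONFIG = {
    "epochs": 300,
    "batch_size": 8,
    "img_size": 640,
    "lr0": 0.01,            # interpreted as the initial learning rate (assumption)
    "momentum": 0.937,
    "weight_decay": 0.0005,
    "hsv_h": 0.015,         # hue augmentation coefficient
    "hsv_s": 0.7,           # saturation augmentation coefficient
    "hsv_v": 0.4,           # value augmentation coefficient
}


def iterations_per_epoch(num_images, batch_size):
    """Number of batches per epoch when the last partial batch is kept."""
    return math.ceil(num_images / batch_size)


for name, n_images in (("original", 606), ("augmented", 5854)):
    print(name, iterations_per_epoch(n_images, TRAIN_CONFIG["batch_size"]))
```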
2.6. Establishment and Evaluation Indicators of Model
2.6.1. Establishment of Model
The establishment of
Camellia oleifera fruit object detection model was divided into training and testing stages. The YOLOv7 neural network was trained using the training set, and the evaluation indicators were verified on the validation set after model weights were obtained. Finally, the model with the best performance weight was selected as the preliminary model for object detection for
Camellia oleifera fruit. In the testing phase, the detection model was run on the test set. The prediction results of the models applied to new data were evaluated to ensure the generalisation ability for application to picking machines in the future. The workflow is illustrated in
Figure 6. The final output of the neural network is the detection box of the identified
Camellia oleifera fruit object and the probability (confidence) that the identified object belongs to a specific category.
2.6.2. Evaluation Indicators of Model
The Complete Intersection over Union (CIoU) loss function was used to quantitatively compare the error between the prediction and calibration boxes [
34,
35].
Figure 7 illustrates the parameters required to calculate CIoU based on the model prediction and calibration boxes, where A is the calibration box, B is the prediction box,
l1 is the distance between the centre points of box A and B,
l2 is the diagonal length of the minimum bounding rectangle of box A and B.
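CIoU was calculated as follows, with w and h denoting the width and height of the corresponding box:

$$IoU = \frac{|A \cap B|}{|A \cup B|}$$

$$CIoU = IoU - \frac{l_1^2}{l_2^2} - \alpha v$$

$$v = \frac{4}{\pi^2}\left(\arctan\frac{w_A}{h_A} - \arctan\frac{w_B}{h_B}\right)^2$$

$$\alpha = \frac{v}{(1 - IoU) + v}$$

$$Loss_{CIoU} = 1 - CIoU$$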
where
v is the similarity of aspect ratio of box A and B and
α is the balance factor between the loss caused by
v and
IoU.
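A plain-Python sketch of this calculation is given below. It follows the standard CIoU definition (IoU minus the squared centre distance normalised by the squared diagonal of the minimum bounding rectangle, minus αv) and is not taken from the training code of this study. Boxes are assumed to be (xmin, ymin, xmax, ymax) with positive width and height, and the function names are hypothetical.

```python
import math


def iou(a, b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def ciou(a, b):
    """Complete IoU between calibration box a and prediction box b."""
    i = iou(a, b)
    # l1^2: squared distance between the two box centres.
    l1_sq = (((a[0] + a[2]) - (b[0] + b[2])) ** 2
             + ((a[1] + a[3]) - (b[1] + b[3])) ** 2) / 4.0
    # l2^2: squared diagonal of the minimum bounding rectangle of both boxes.
    ew = max(a[2], b[2]) - min(a[0], b[0])
    eh = max(a[3], b[3]) - min(a[1], b[1])
    l2_sq = ew ** 2 + eh ** 2
    # v: aspect-ratio similarity term; alpha: its balance factor.
    v = (4.0 / math.pi ** 2) * (
        math.atan((a[2] - a[0]) / (a[3] - a[1]))
        - math.atan((b[2] - b[0]) / (b[3] - b[1]))) ** 2
    denom = (1.0 - i) + v
    alpha = v / denom if denom > 0 else 0.0   # identical boxes give 0 / 0
    return i - l1_sq / l2_sq - alpha * v


def ciou_loss(a, b):
    return 1.0 - ciou(a, b)


print(ciou((0, 0, 2, 2), (0, 0, 2, 2)), ciou((0, 0, 2, 2), (1, 1, 3, 3)))
```

Unlike plain IoU, the value stays informative for boxes that do not overlap: the centre-distance term makes CIoU negative and increasingly so as the boxes move apart.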
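In this paper, Precision, Recall, Mean Average Precision (mAP) and F1 score were used to accurately and objectively evaluate the performance of the model. Precision is the most common evaluation index; it is the number of correctly detected targets divided by the total number of detected targets. In general, the higher the Precision is, the better the detection effect will be. Precision is a very intuitive evaluation index, but high Precision alone does not fully describe model performance. Therefore, mAP, Recall and F1 score were introduced for comprehensive evaluation. Precision, Recall, mAP and F1 score were calculated as follows, with AP denoting the average precision of one class and N the number of object classes:

$$Precision = \frac{TP}{TP + FP}$$

$$Recall = \frac{TP}{TP + FN}$$

$$AP = \int_0^1 Precision(Recall)\, d(Recall)$$

$$mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i$$

$$F1\ score = \frac{2 \times Precision \times Recall}{Precision + Recall}$$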
where
TP (True Positive) represents the number of
Camellia oleifera fruit objects that are correctly detected;
FP (False Positive) represents the number of other objects detected as
Camellia oleifera fruit; and
FN (False Negative) represents the number of
Camellia oleifera fruit that are undetected/missed.
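The count-based indicators can be sketched in a few lines of Python, using Precision = TP / (TP + FP), Recall = TP / (TP + FN) and F1 as their harmonic mean. The counts in the example are hypothetical; the final line recomputes the F1 score from the Precision (94.21%) and Recall (93.13%) reported for the YOLOv7 model in the Conclusions as a consistency check.

```python
def precision(tp, fp):
    """Share of detections that are real fruit."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Share of real fruit that was detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p, r):
    """Harmonic mean of Precision and Recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts for illustration only.
tp, fp, fn = 90, 10, 5
p, r = precision(tp, fp), recall(tp, fn)
print(round(p, 4), round(r, 4), round(f1_score(p, r), 4))

# Consistency check against the reported YOLOv7 Precision and Recall.
print(round(100 * f1_score(0.9421, 0.9313), 2))
```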
4. Conclusions
A real-time and accurate detection method based on YOLOv7 target detection network and multiple data augmentation was proposed to realize the detection of Camellia oleifera fruit in complex scenes of orchard. Firstly, the images of Camellia oleifera fruits were collected, and the detection model of Camellia oleifera fruits was established by YOLOv7 network, which was compared with YOLOv5s, YOLOv3-spp and Faster R-CNN target detection networks. The results showed that the YOLOv7 model has the best performance with mAP of 95.74%, F1 score of 93.67%, Precision of 94.21%, Recall of 93.13% and the average detection time of 0.025 s. The dataset was further augmented by rotation, mirroring, adding Gaussian noise, increasing or decreasing image brightness and mosaic augmentation methods, and the DA-YOLOv7 detection model was established by using the augmented dataset and the YOLOv7 network. Data augmentation can effectively improve the detection ability of the model. The optimal Camellia oleifera fruit detection model was DA-YOLOv7 model with mAP of 96.03%, Precision of 94.76%, Recall of 95.54% and F1 score of 95.15%. In summary, the YOLOv7 target detection network combined with multiple data augmentation can accurately and quickly detect Camellia oleifera fruit in complex scenes. This method has a good application prospect in mechanical harvesting operation. In the future work, we plan to combine the proposed model with the end-effector to realize detection and positioning of fruit, and further adjust the picking angle and the position of the end-effector. At the same time, this study provides a theoretical reference for detection and automatic harvesting of other fruits.