Abstract

Deep learning (DL) is widely used in ship detection, but effective classification still faces problems such as inaccurate object feature extraction and inconspicuous feature information in deep layers. To address these problems, we propose a YOLOv7-residual convolutional block attention module (YOLOv7-RCBAM), which combines a convolutional attention mechanism and residual connections with YOLOv7. First, to accelerate training, the parameters in the backbone network of the pretrained model are frozen by transfer learning, and the model is fine-tuned for training. Second, to enhance the relevance of channel-dimension features, an attention mechanism with residual connectivity is adopted. Finally, a feature fusion attention mechanism is introduced to improve effective feature extraction. The effectiveness of the proposed method is fully validated on the SeaShips dataset. The results show that the YOLOv7-RCBAM model achieves better performance, with a mAP of 97.59%, and effectively extracts object features in deep layers. Meanwhile, the YOLOv7-RCBAM model can accurately locate ships in complex environments with darkness and noise, with the mAP reaching 96.13%, achieving effective ship classification detection.

1. Introduction

With the development of image recognition technology, video surveillance has been applied in the field of maritime supervision and service. It plays a key role in tasks such as ship traffic flow statistics and ship collision prevention. Real-time detection and intelligent tracking of moving ships in complex environments are an important basis for improving the efficiency of maritime supervision. However, traditional methods generally suffer from problems such as slow training speed and weak interference resistance, which make it hard to detect and track moving ships with high accuracy [1, 2].

Recently, with the rapid development of DL in various fields [3-5], it has made significant breakthroughs in image detection, gradually solving the problems of slow training speed and low detection accuracy in object detection. Detection algorithms based on DL fall into two categories: (1) two-stage algorithms based on candidate regions, which first select candidate regions and then locate and classify the objects within them; typical algorithms are R-CNN [6], Faster R-CNN [7], R-FCN [8], Mask R-CNN [9], etc. (2) One-stage algorithms based on regression classification, which combine the selection of candidate regions with localization and classification to improve training speed; typical algorithms include SSD [10] and YOLO [11]. Compared with two-stage algorithms, one-stage algorithms fuse the detection steps of generating and refining bounding boxes, accelerating training while maintaining stable detection accuracy. Therefore, we use a one-stage algorithm to achieve ship classification detection.

However, traditional detection algorithms still have many problems. It is difficult for a model to focus on object information under the interference of complex environments, and object feature information becomes inconspicuous after being disturbed [12], leading to inaccurate localization and classification. Moreover, deep feature extraction easily leads to the loss of feature information [13].

YOLOv7 [14] offers advantages in both speed and accuracy. It introduced a reparameterized module [15, 16] that replaces the original module to reduce parameters and improve inference speed, and it introduced the efficient long-range attention network (ELAN) module [17] in place of the CSP module in the backbone network, which enhances feature extraction and improves the use of parameters and computation. Trainable bag-of-freebies [14] were proposed so that detection accuracy can be improved without increasing inference cost. With these advantages, YOLOv7 is well suited to our method.

Based on an analysis of the disadvantages of YOLOv7, we introduce transfer learning [18], residual connections [19], the convolutional block attention module (CBAM) [20], and feature fusion [21]. Transfer learning can freeze part of the model parameters to improve training speed and extract richer features through fine-tuning. CBAM adaptively weights feature values to enhance important ship features and restrain interference from the background. A reasonable combination of residual connectivity and CBAM [22] prevents multiple feature recalibrations from reducing deep feature responses. Feature fusion improves the representation ability of the object. Based on the above methods, YOLOv7-RCBAM is proposed, which effectively improves ship detection accuracy. Specifically, the contributions of this article are summarized as follows:
(1) A convolutional attention mechanism block combined with residual connections is proposed. The improved method effectively extracts object feature information and focuses on foreground information.
(2) An improved YOLOv7 model based on double transfer learning is proposed, which introduces a feature fusion attention mechanism to improve the richness of feature extraction and solve the problem of feature disappearance caused by overly deep networks.
(3) Image enhancement is performed to simulate ship detection in dark, rainy, and foggy environments, verifying that the method has strong interference resistance.

The experimental results show that RCBAM outperforms the other attention mechanisms, with an average improvement of 0.53% on the dataset. YOLOv7-RCBAM outperforms other classical YOLO methods with a mAP of 97.59%, improving classification detection. In complex environments, the ship detection mAP still reaches 96.13%, which verifies the strong interference resistance of YOLOv7-RCBAM.

The rest of this paper is organized as follows. Section 2 presents the current related works on ship detection. Section 3 introduces the design of the model, including CBAM, transfer learning, and the network architecture of YOLOv7-RCBAM. Section 4 conducts related experiments. Section 5 presents the conclusion and future works.

2. Related Works

Most traditional ship detection methods are based on synthetic aperture radar (SAR) images [23, 24]. SAR is an active microwave earth observation device with all-weather, all-day operation and a certain penetration capability, and it can obtain images of ships similar to optical photographs. The development of DL has also provided effective assistance for ship detection in SAR images, so many methods based on SAR images have been put forward.

Li et al. [25] proposed a region-based convolutional neural network (CNN) detection method that effectively extracts SAR image features at each scale, replacing the region-of-interest pooling layer with RoIAlign to reduce quantization error. Yang et al. [26] proposed CPS-Det, an anchor-free method using rotatable bounding boxes for SAR ship detection; it improves speed and introduces a scheme for calculating angle loss to improve the accuracy of angle prediction. He et al. [27] proposed a feature distillation framework to enhance mid-to-low-resolution ship detection. Yue et al. [28] proposed a two-stage SAR ship detection network that mainly captures small objects by generating high-quality anchors and improves the feature pyramid network by inserting a receptive field enhancement module, which enriches the feature map. However, these methods have difficulty adaptively weighting foreground and background information because redundant image information is not handled properly, and small ships remain hard to detect in SAR images.

To address small ship detection, the following works were introduced. Chen et al. [29] proposed a method combining a generative adversarial network (GAN) with a CNN-based detection approach, which alleviates the problem of a limited number of small-ship samples. To improve the detection accuracy of small ship objects, Zhou et al. [30] improved the YOLOv5s algorithm by optimizing the loss function and expanding the receptive field at the spatial pyramid pooling (SPP) layer. Although these methods target small-scale ships, they cannot adaptively weight background information. To address feature information redundancy, the following works introduce attention mechanisms. Li et al. [31] proposed an improved YOLOv3-tiny network for real-time transport ship classification in waterway and river video surveillance images; it introduced CBAM to adjust the feature weights of the channel and spatial dimensions so that the model can focus on the ship object in the image. Han et al. [32] proposed a ShipYOLO model to improve detection speed and accuracy; they designed a new backbone network and an amplified receptive field module to improve the acquisition of small-scale ships, and they used an attention mechanism and ResNet's shortcut idea to improve the feature pyramid structure. Liu et al. [33] proposed a YOLOv4 method that applies reverse depthwise separable convolution (RDSC) to the backbone network and feature fusion network, reducing the number of weights to increase detection speed; it addresses low detection accuracy in complex environments.

From the above works, it can be seen that attention mechanisms can be embedded in DL models to effectively extract key information, and lightweight networks improve training speed. However, the above methods still suffer from inconspicuous attention features, and lightweight models lead to insufficiently rich feature extraction. We propose YOLOv7-RCBAM based on video surveillance images to improve the richness of extracted feature information, improve network training speed, and enhance the anti-interference ability of the model.

3. The Design of YOLOv7-RCBAM

Input: Training set X and ground-truth labels Y
Output: Trained model
Initialization: learning rate lr, batch size b, number of epochs E, and initial weights W.
For epoch = 1 to E do
 //YOLOv7-RCBAM
 Preprocess by mosaic: X' = ImagePreprocessing(X)
 Extract feature information in backbone: features = Backbone(X')
 Fuse features by feature fusion module: fused = FeatureFusion(features)
 Attention module extracts features: attention = RCBAM(fused)
 Objectness, anchor_shape, feature_cls, feature_loc = YOLOHead(attention)
 //Calculate loss and gradient descent
 Compute loss and update weights by backpropagation: W = Backpropagation(W, loss)
End: save weights W of the model

We propose the method as follows. First, RCBAM is introduced to extract important features for model fine-tuning. Second, transfer learning is introduced to freeze parameters after pretraining the backbone network. Finally, feature information is enhanced by a feature fusion method to avoid feature disappearance in the deep network. The proposed methods are based on YOLOv7, which offers advantages in both speed and accuracy: it introduced a reparameterized module to reduce parameters and improve inference speed, introduced the ELAN module in the backbone network to enhance feature extraction and improve the use of parameters and computation, and proposed trainable bag-of-freebies so that detection accuracy can be improved without increasing inference cost. The specific process and the diagram of YOLOv7-RCBAM are shown in Algorithm 1 and Table 1.
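To make the procedure concrete, the following PyTorch-style sketch renders the main loop of Algorithm 1. The backbone, feature_fusion, rcbam, and head submodules, the composite yolo_loss, and the optimizer settings are hypothetical stand-ins for the corresponding stages, not the exact implementation; mosaic augmentation, learning-rate scheduling, and anchor matching are omitted.

import torch

def train(model, loader, yolo_loss, epochs=20, lr=1e-2):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.937)
    for epoch in range(epochs):
        for images, targets in loader:              # mosaic preprocessing in the loader
            features = model.backbone(images)       # multi-scale backbone features
            fused = model.feature_fusion(features)  # neck: upsample / ELAN-H / MP-2
            attended = [model.rcbam[i](f) for i, f in enumerate(fused)]  # RCBAM per scale
            preds = model.head(attended)            # objectness, anchors, class, location
            loss = yolo_loss(preds, targets)        # composite YOLO loss
            optimizer.zero_grad()
            loss.backward()                         # backpropagation
            optimizer.step()                        # gradient descent update
    torch.save(model.state_dict(), "yolov7_rcbam.pt")  # save trained weights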

The overall framework of our method is shown in Figures 1 and 2. The YOLOv7-RCBAM network structure mainly includes the backbone feature extraction module, SPPCSPC, the feature fusion module, RCBAM, reparameterization (REP) modules, and YOLO detection modules for regressing object information.

The backbone network adopts CBS modules, ELAN, and max-pooling (MP) blocks. A CBS module is composed of a convolutional layer, batch normalization (BN), and a SiLU layer for feature extraction. The ELAN is composed of stacked CBS modules for changing the channels and extracting feature information, and the MP block is composed of a MaxPool layer and a CBS module for downsampling. The SPPCSPC module performs downsampling through max-pooling layers of different sizes and CBS layers, effectively increasing the receptive field and separating the most salient contextual features. In the neck, the feature fusion module includes upsampling, ELAN-H layers, and MP-2 layers; it changes the channels via the upsample and MP-2 modules, extracts features via the ELAN-H module, and performs feature fusion and communication, conveying semantics and strengthening the extraction of multiscale targets. RCBAM is added before the REP module to enhance the feature maps. Finally, to extract and smooth the features, the REP module, composed of conv layers and BN layers, is placed before the prediction head.
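For reference, the CBS block described here (convolution, batch normalization, SiLU) can be written in a few lines of PyTorch; the default kernel size and stride below are common choices and should be read as assumptions:

import torch.nn as nn

class CBS(nn.Module):
    """Conv + BatchNorm + SiLU, the basic feature-extraction block."""
    def __init__(self, c_in, c_out, kernel_size=3, stride=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, kernel_size, stride,
                              padding=kernel_size // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))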

3.1. Residual Convolutional Block Attention Module

In traditional object detection algorithms, feature extraction is usually performed on global information. The major shortcoming of such methods is the loss of extracted feature information in deep layers, which makes it difficult to focus on key objects. An effective feature map is necessary for a deep network. To improve the algorithm's focus on the object, we introduce RCBAM to enhance the foreground response of the ship.

The attention mechanism is a special module used to calculate the weights of input data and has been developed into a variety of attention mechanisms [34-36] used in DL. CBAM includes two independent submodules: the channel attention mechanism (CAM) [37] module and the spatial attention mechanism (SAM) [38] module. The CAM adaptively assigns weights to the received feature map in the channel dimension, while the SAM adaptively assigns weights in the spatial dimension. We find that repeated feature recalibration decreases the deep feature response and affects the detection result, so improving the CAM module can help improve the accuracy of ship detection.

The basic structure of the residual channel attention mechanism (RCAM) is shown in Figure 3. The CAM adds a parallel maximum pooling layer. The input feature map compresses global information through the pooling layers and then passes through a two-layer convolution with activation functions, respectively:

$$F_{\mathrm{avg}} = \mathrm{Conv}(\delta(\mathrm{Conv}(\mathrm{AvgPool}(F)))), \qquad F_{\mathrm{max}} = \mathrm{Conv}(\delta(\mathrm{Conv}(\mathrm{MaxPool}(F))))$$

where $F$ is the input of the CAM, $\mathrm{AvgPool}$ and $\mathrm{MaxPool}$ are the pooling layers, $\delta$ is the activation function, and $\mathrm{Conv}$ is the convolution layer. The outputs are $F_{\mathrm{avg}}$ and $F_{\mathrm{max}}$.

Then, the channel information is recalibrated by matrix point multiplication. The attention map produced by the sigmoid activation function is denoted as $M_c(F)$, and the recalibrated feature is

$$M_c(F) = \sigma(F_{\mathrm{avg}} + F_{\mathrm{max}}), \qquad F' = M_c(F) \otimes F.$$

Finally, we stack four convolutional batch-normalization SiLU (CBS) modules, which help extract effective features in the deep layers. The first CBS module is used to smooth the feature map, and the last three are used to extract features while maintaining the original channel dimension:

$$F'' = \mathrm{CBS}_4(\mathrm{CBS}_3(\mathrm{CBS}_2(\mathrm{CBS}_1(F')))).$$

After that, we merge the feature map recalibrated by the CAM with the feature extraction map by dot multiplication. The result therefore fully considers the guidance of global information and effectively highlights the discriminative feature information of the ship. The feature map $F''$ is formed after feature extraction by the four CBS modules. Because successive repeated feature recalibration operations can lower the response values of deep features and thus degrade detection, we introduce a residual connection between successive feature recalibrations, following the idea of residual learning:

$$F_{\mathrm{out}} = F' + M_c(F'') \otimes F''.$$

The residual connection fuses the extracted information and prevents the loss of feature information caused by recalibration, improving the feasibility of optimization while preserving the original information. In the end, RCBAM is formed by connecting the output of the residual channel attention module (RCAM) to the SAM.
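A minimal PyTorch sketch of the RCAM under the equations above. The CBAM-style shared MLP with reduction ratio 16 and the 1x1-then-3x3 kernel pattern of the four CBS modules are assumptions not fixed by the text; the full RCBAM would pass this output to a standard CBAM spatial attention module.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """CBAM-style channel attention: parallel avg/max pooling + shared MLP."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, x):
        avg = self.mlp(F.adaptive_avg_pool2d(x, 1))
        mx = self.mlp(F.adaptive_max_pool2d(x, 1))
        return torch.sigmoid(avg + mx)     # attention map M_c(x)

class RCAM(nn.Module):
    """Residual channel attention: two recalibrations bridged by a residual,
    with four CBS modules extracting deep features in between."""
    def __init__(self, channels):
        super().__init__()
        self.cam1 = ChannelAttention(channels)
        self.cam2 = ChannelAttention(channels)
        # Four CBS modules (Conv + BN + SiLU); the kernel pattern is an assumption
        self.cbs = nn.Sequential(*[
            nn.Sequential(
                nn.Conv2d(channels, channels, k, padding=k // 2, bias=False),
                nn.BatchNorm2d(channels),
                nn.SiLU(),
            )
            for k in (1, 3, 3, 3)
        ])

    def forward(self, x):
        f1 = self.cam1(x) * x              # first recalibration: F' = M_c(F) (x) F
        f2 = self.cbs(f1)                  # deep extraction: F''
        return f1 + self.cam2(f2) * f2     # residual: F_out = F' + M_c(F'') (x) F''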

3.2. Double Transfer Strategy

In DL, a large number of parameters must be trained when facing a large dataset. Training from scratch easily leads to slow training and poor interference resistance. Therefore, transfer learning is introduced in this paper to improve the training process and enhance interference resistance.

Model-based transfer learning, also called parameter-based transfer learning, shares common knowledge between the original task and the target task at the model level. Transfer learning based on shared parameters achieves knowledge transfer by freezing the common parameters of parts of the model. The premise of sharing parameters is that the learning tasks have similar features, so that the corresponding model parameters remain consistent.

Since the base model does not readily identify ships, it performs poorly on fine-grained ship detection, so model fine-tuning is introduced. As shown in Figure 4, the VOC2007 dataset (dataset A) is loaded into the original model for training to obtain the pretraining parameters; on this basis, the SeaShips dataset (dataset B) is input for classification training so that the model improves at ship category recognition. To further improve the accuracy of ship classification detection and increase the training speed, we add an attention mechanism to the tail of the model and train on ship classification dataset B with the backbone network frozen.
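A sketch of the freezing step of this second transfer stage in PyTorch; the "backbone." parameter-name prefix, the weights filename, and the optimizer settings are assumptions about the implementation:

import torch

def freeze_backbone_and_finetune(model, weights_path, lr=1e-3):
    """Stage 2 of the double transfer strategy: load pretrained weights,
    freeze the backbone, and fine-tune only the neck, RCBAM, and head."""
    model.load_state_dict(torch.load(weights_path))
    for name, param in model.named_parameters():
        if name.startswith("backbone."):   # name prefix is an assumption
            param.requires_grad = False    # frozen: keep pretrained features
    # Hand only the remaining trainable parameters to the optimizer
    return torch.optim.SGD(
        (p for p in model.parameters() if p.requires_grad),
        lr=lr, momentum=0.937,
    )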

3.3. Feature Fusion Module

The features extracted by a single attention mechanism are not obvious. To deepen the features that the attention mechanism attends to, the outputs of the attention modules are fused, which enhances the feature representation [39, 40]. Given the lack of focus in different feature fusion modules, we introduce fusion of identical attention outputs to enhance the effective features and prevent feature disappearance in the deep network.

We add the improved convolutional attention mechanism to the feature maps with 128, 256, and 512 channels and fuse two identical attention-weighted feature maps through a merging operation to enhance the features extracted by the attention mechanism. The resulting feature map is

$$F_{\mathrm{fused}} = \mathrm{RCBAM}(F) \oplus \mathrm{RCBAM}(F),$$

where $\oplus$ denotes the element-wise merging operation.

The feature fusion attention module effectively prevents the loss of deep information so that the model can learn rich features and attend to target features. In contrast to stacking attention features, which changes the feature dimension and loses feature information, fusing attention features better preserves the feature dimension and information.
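A sketch of this fusion under the reading that the merging operation is element-wise addition, which preserves the 128/256/512 channel dimensions; whether the two attention branches share weights is not specified, so the sketch uses two independent instances of the RCAM sketched in Section 3.1:

import torch.nn as nn

class FusionAttention(nn.Module):
    """Apply the attention module twice to the same feature map and merge
    the two attention-weighted copies element-wise, preserving the
    channel dimension (128/256/512)."""
    def __init__(self, channels):
        super().__init__()
        self.branch_a = RCAM(channels)   # RCAM as sketched in Section 3.1
        self.branch_b = RCAM(channels)

    def forward(self, x):
        # Element-wise addition keeps the dimension, unlike channel stacking
        return self.branch_a(x) + self.branch_b(x)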

4. Experiments

4.1. Datasets and Evaluation Metrics

At present, several datasets include ship categories, such as the COCO, VOC, CIFAR-10, and SeaShips [41] datasets. COCO and VOC have only one ship class (boat), so ship classification experiments cannot be performed on them. The SeaShips dataset contains 7000 open-source images covering six types of ships, with 948 passenger ships, 4398 ore carriers, 3010 general cargo ships, 4380 fishing boats, 1802 container ships, and 3904 bulk cargo carriers in total, so the SeaShips dataset is chosen for our experiments. The dataset is divided into a training set and a test set in a ratio of 4:1, and the training set is further divided into a subtraining set and a validation set at 4:1. The subtraining set contains 4480 images, the validation set contains 1120 images, and the test set contains 1400 images.
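The 4:1 and 4:1 partition can be reproduced with a simple random split; in this sketch the seed and file layout are assumptions:

import random

def split_seaships(image_paths, seed=0):
    """7000 images -> 5600 train / 1400 test (4:1), then the training set
    -> 4480 subtrain / 1120 validation (4:1)."""
    rng = random.Random(seed)
    paths = list(image_paths)
    rng.shuffle(paths)
    n_test = len(paths) // 5           # 1400 of 7000
    test, train = paths[:n_test], paths[n_test:]
    n_val = len(train) // 5            # 1120 of 5600
    val, subtrain = train[:n_val], train[n_val:]
    return subtrain, val, test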

We select a variety of evaluation metrics to demonstrate the superiority of the model, including precision ($P$), recall ($R$), average precision ($\mathrm{AP}$), and mean average precision ($\mathrm{mAP}$). The main metrics are as follows:
(1) AP refers to the area under the precision-recall (P-R) curve, balancing $P$ and $R$ to represent the detection quality of each category:

$$\mathrm{AP} = \int_0^1 P(R)\, dR.$$

(2) mAP refers to the mean of the AP over all categories:

$$\mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{AP}_i,$$

where $N$ is the number of object categories.
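These two formulas translate directly into code; the sketch below uses a plain trapezoidal rule over the P-R curve rather than VOC-style interpolated AP, which is a simplification:

import numpy as np

def average_precision(precision, recall):
    """AP: area under the P-R curve, integrated over recall."""
    order = np.argsort(recall)                 # sort points by recall
    p = np.asarray(precision)[order]
    r = np.asarray(recall)[order]
    return float(np.trapz(p, r))               # trapezoidal integration

def mean_average_precision(ap_per_class):
    """mAP: mean of AP over all N object categories."""
    return sum(ap_per_class) / len(ap_per_class)

# Example: six ship classes with hypothetical per-class AP values
# mean_average_precision([0.98, 0.97, 0.96, 0.99, 0.97, 0.98])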

4.2. Experiment Analysis of Ship Classification Detection

In this paper, we set 20 training epochs and mainly use mAP values to assess model detection accuracy. First, to verify the advantage of our method over different advanced attention mechanisms, we compare the accuracy on various types of ships. We then compare our method with other YOLO methods to verify its superiority. Finally, to verify the method's detection ability in complex environments, the images are enhanced before training the model.

4.2.1. Comparison of Different Attention Mechanisms

To verify the effectiveness of the proposed attention mechanism, we use YOLOv7 as the base model and add efficient channel attention (ECA), squeeze-and-excitation attention (SE), CBAM, and RCBAM to it, respectively.

The experimental results show that our method performs well. As shown in Table 2, compared with the AP of the original YOLOv7 model, the AP of all types of ships improves to different degrees. The ECA and SE models focus on the channel dimension and ignore the spatial dimension in ship images, resulting in low AP for all types of ship detection. Compared with CBAM, our method improves the channel attention mechanism, which retains the detailed information of various ships and improves detection accuracy. However, in the ore carrier category, large ships are more likely to have background mistaken for ship due to occlusion by the external environment, resulting in a decrease in detection accuracy. As can be seen from Table 3, our method achieves the highest mAP of 97.59% among the attention mechanisms. Precision and recall increase to 96.00% and 94.14%, and the F1 value reaches 95.00%.

4.2.2. Comparison of Different Models

To prove that our method has a good detection effect compared with other object detection models, we use the SeaShips dataset for model training and evaluation. The models compared are YOLOv4, YOLOv5, and YOLOv7. The experimental results are shown in Table 4 and Figure 5.

The experimental results in Table 4 and Figure 5 show that our method improves the AP of various types of ships to different degrees compared with the other models, although for the general cargo ship class our method is 0.34% lower than YOLOv5. Compared with YOLOv4, YOLOv5, and YOLOv7, our method adds an improved attention mechanism that reduces the extraction of redundant information and focuses on ship target features, significantly improving the AP and mAP values. The improvement over YOLOv4 is particularly significant, with mAP improving by 3.62%; compared with YOLOv5, mAP improves by 0.79% to reach a maximum of 97.59%, reflecting the superiority of the model's detection accuracy. Figure 6 shows the detection results for the six types of ships. YOLOv5 locates wrong bounding boxes for the bulk cargo carrier, indicating worse detection performance. YOLOv4 detects bounding boxes repeatedly: it can correctly locate and classify the ship target, but due to inaccurate NMS, the redundant bounding boxes are not correctly eliminated. YOLOv7 predicts an oversized bounding box for the ore carrier, showing that it can locate the target but mismatches the background. Our method focuses on object features and enhances feature representation, so it correctly classifies and locates the ships.

4.2.3. Test of Model Interference Resistance

As shown in Figure 7, to verify the interference resistance of the model, we apply cutout to the images to occlude the ships, reduce the image brightness to simulate night, and increase the image noise to simulate rainy and foggy conditions. Each ship image is enhanced randomly: one of cutout, brightness change, or added noise is selected at random and applied at a random scale. This increases the richness of the dataset and enables the model to adapt to ship detection in different scenarios.
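A sketch of this random per-image enhancement, selecting one of cutout, darkening, or additive noise at a random scale; the parameter ranges are assumptions:

import random
import numpy as np

def random_enhance(image, rng=random):
    """Randomly apply one of cutout / darkening / noise to an HxWx3 uint8 image."""
    img = image.astype(np.float32)
    choice = rng.choice(["cutout", "brightness", "noise"])
    if choice == "cutout":                      # occlude a random patch
        h, w = img.shape[:2]
        size = rng.randint(h // 8, h // 3)      # random-scale occlusion
        y, x = rng.randrange(h - size), rng.randrange(w - size)
        img[y:y + size, x:x + size] = 0
    elif choice == "brightness":                # simulate night scenes
        img *= rng.uniform(0.3, 0.7)
    else:                                       # simulate rain/fog with noise
        img += np.random.normal(0, rng.uniform(10, 30), img.shape)
    return np.clip(img, 0, 255).astype(np.uint8)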

The interference resistance of the model is verified after image enhancement of the original data, as shown in the figures.

As shown in Figure 8, after image enhancement, it becomes harder for the model to detect ship objects in dark, noisy, and occluded conditions: the model's mAP decreases by 4.83%, and its detection precision and recall decrease by 0.42% and 1.46%, respectively. As shown in Table 5, compared with the other models, our method leads with a mAP of 96.13%, indicating that the model effectively recalls ship target features and correctly classifies ships. Facing complex environments, our method focuses on the ship target itself during detection and still achieves high precision. This shows that the method is interference-resistant and can adapt to ship classification detection in various environments. From the results in Figure 9, we can see that our model correctly locates ship targets and classifies ship types under weak light and strong noise, and it can still locate and classify ships despite target occlusion, overlap of multiple targets, and small target ships in the image.

4.2.4. Comparison with Previous State-of-the-Art Approaches

To verify that our method has a good detection effect, we compare it with previous SOTA approaches on the SeaShips dataset. The models compared are a small ship detection method [42], ShipYOLO [32], and enhanced YOLOv3-tiny [31]. The results are shown in Table 6.

The experimental results in Table 6 show that our method achieves the highest mAP of 97.59% among the compared methods. Compared with the small ship detection method, our method improves the mAP by 1.24%. The small ship detection and ShipYOLO methods neglect feature extraction in deep layers, leading to lower accuracy. The enhanced YOLOv3-tiny method reduces the number of parameters to improve detection speed, but its mAP is 0.59% lower than ours. Although these methods add attention mechanisms to focus on ship objects, our RCBAM performs better, since it extracts deep features with CBS modules and fuses the extracted information through residual connections. Moreover, the feature recalibration operations help improve the feasibility of optimization, and the feature fusion attention module effectively captures rich deep features.

5. Conclusion

In this paper, aiming at the problems of inaccurate object feature extraction and inconspicuous feature information in deep layers, a YOLOv7-RCBAM ship detection method is proposed. In our design, the RCBAM module is introduced to extract object feature information effectively, and the double transfer learning and feature fusion attention fuse the feature information and avoid feature loss in deep layers. The effectiveness of YOLOv7-RCBAM is verified by case studies on several experimental datasets in this paper. Compared with other benchmark models and state-of-the-art methods, our method has better detection accuracy and anti-interference ability. In future work, the real-time performance of ship detection is particularly important; we will focus on maintaining detection accuracy while reducing the number of model parameters, to improve training speed and achieve lightweight models.

Data Availability

The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Authors’ Contributions

J.C. and H.F. wrote the main manuscript text, X.L., Y.H., and H.L. listed the data, and H.L. and W.H. reviewed and edited the manuscript. All authors have read and agreed to the published version of the manuscript.

Acknowledgments

This work was supported by the Guangzhou Science and Technology Key R&D Program (grant number 202206010022), the Innovation Team Project of Ordinary University of Guangdong Province (grant number 2020KCXTD017), the Guangdong Special Project in Key Field of Artificial Intelligence for Ordinary University (grant number 2019KZDZX1004), and the Guangzhou Key Laboratory Construction Project (grant number 202002010003).