Introduction

Global trade relies on the international shipping industry, which has been implicated in the spread of many marine non-indigenous species (NIS) around the world1,2. Modern vessels have two primary pathways for translocating NIS, namely (i) as stowaways in ballast water, or (ii) attached to the vessel surface as biofouling3; examples of each follow. Ballast water was the likely vector for zebra mussels to spread from Europe to the Great Lakes in North America4, where they have led to increases in toxic blue-green algae5 and cost industry more than $200 million per year in maintaining water intake structures6. Biofouling is one of the most significant pathways for the spread of non-indigenous seaweeds7,8,9, which can outcompete native species10, make native kelp forests less resilient11 and adversely impact fishing and tourism operations12,13.

Although vessels are incentivised to manage their biofouling to reduce hydrodynamic drag and fuel costs3,14, it is a challenging undertaking and biofouling can occur even on hulls that employ current best practice15,16. The primary method of biofouling management is the regular application of anti-fouling coatings, which either contain biocides, such as copper, or create a surface that sheds organisms or deters attachment, slowing the accumulation of biofouling17. A vessel’s operating profile contributes to fouling risk, with extended periods of inactivity being associated with higher biofouling pressure18. Niche areas, such as sea chests, propellers and other complex surface structures, are at high risk of becoming fouled because they offer a sheltered environment in which fouling organisms can establish. They are also a lower priority for management, as they contribute less to hydrodynamic drag than the flat surfaces of the hull19.

There is growing interest among biosecurity regulators in improving management of the biofouling pathway20,21. New Zealand has implemented a clean hull standard that sets requirements for vessels to manage biofouling and proposed a clean hull threshold to determine the potential biosecurity risk of a vessel22. For vessels staying longer than three weeks, or visiting areas other than those designated as places of first arrival, any macrofouling except goose barnacles is considered a biosecurity risk, while short-stay vessels are subject to macrofouling coverage thresholds16,23. In implementing this policy, regulators have stressed the vessel management requirements rather than the thresholds, as even with current best management practices ships can become fouled16. Australia has also proposed requirements for vessels to implement biofouling management practices or provide evidence that their fouling is appropriately controlled21.

In-water inspections are the best way to verify that biofouling standards are being met and to collect the data needed to measure the effectiveness of biofouling management practices. However, in-water inspections are expensive, require specialist dive teams operating in an environment with a number of health and safety risks, and restrict the activities a vessel can undertake while they are conducted24. A biofouling expert also needs either to be present during the inspection or to review the images and footage gathered afterwards, which can be a costly and time-consuming process. An alternative is to employ an underwater drone or remotely operated vehicle (ROV), which would enhance data collection opportunities but also potentially increase the burden on the expert interpreting the data.

In this paper we explore the potential for deep learning, a type of machine learning that models phenomena using deep neural networks, to automate or assist the analysis of biofouling inspection data. In the last decade deep learning has revolutionised computer vision; indeed, many regard AlexNet, the 2012 winner of the ImageNet visual recognition challenge25, as the watershed moment for deep learning26. AlexNet was among the first deep convolutional neural networks (CNNs)27, an architecture that is particularly suited to computer vision tasks. Our approach is motivated by the many successful applications of deep CNNs to complex image recognition tasks, from identifying wild animals in camera trap images28,29 to identifying coral species30.

A prominent example of automating biofouling image analysis is CoralNet, a machine learning method initially designed for annotating benthic surveys of coral reefs using randomly placed annotation points31. CoralNet has been applied to assess the coverage of different species and higher-level taxonomic groups in fouling communities on oil platforms on the UK continental shelf, using images taken by ROVs32. Our aim in the current study was to develop a method that could be used to assess biosecurity risk, and in this context CoralNet was less suitable. Most of the vessel hull images available for developing our method had limited biofouling coverage. Unlike coral reefs and oil platforms, vessels are not stationary and actively manage their biofouling. This makes sampling error an important consideration for annotation point approaches such as CoralNet; because CNNs consider the whole image, they do not share this weakness.

Determining the potential biosecurity risk of a vessel also does not require the identification of particular species. It has been found that there is a positive relationship between the degree of biofouling present on a vessel and the number of NIS present23. This has led many jurisdictions, such as New Zealand, to require biofouling to be managed holistically rather than targeting specific species16. Species-based approaches also scale poorly in the marine context, as there is a large number of species that can be observed in these communities, the taxonomy is highly complex, and previously unobserved species are common20,21. Instead, we aimed to identify the presence and severity of biofouling. This is a much simpler problem, and makes our approach more resilient to these taxonomic issues.

Methods

Dataset

We assembled a dataset of 10,263 images collected from in-water surveys of around 300 commercial and recreational vessels. This dataset comprised images provided by three jurisdictions, namely: the Australian Department of Agriculture, Water and the Environment (DAWE), the New Zealand Ministry for Primary Industries (MPI), and the California State Lands Commission (CSLC). Examples from the CSLC dataset are available in the literature33, and the MPI dataset has previously been used to inform vessel biofouling management in New Zealand16,23,34.

Each image was labelled using the six-class Level of Fouling (LoF) scheme35, with these annotations provided by the jurisdiction that supplied the images. Additionally, DAWE was able to provide annotations for all three datasets using a Simplified Level of Fouling (SLoF) scale (Table 1). This scale was based on the LoF scheme, but collapsed the six levels into pairs to create a three-class scale. Due to inconsistencies in the LoF labelling across the three jurisdictions, we used the SLoF labels in this study. The SLoF scheme also provided the simplest possible set of annotations that supported our goal of identifying images with fouling present and highlighting images with severe fouling.

The SLoF labels provided by DAWE consisted of two sets of annotations from several experts from Ramboll New Zealand, who held qualifications in marine biology and had extensive experience working with biofouling imagery. The first set involved three experts grading a 120-image subset of the DAWE data, constructed by stratified random sampling to ensure balance across LoF classes. We refer to these as the expert-group labels. In the second set, one of these experts graded the full amalgamated dataset of 10,263 images; we call these the expert labels. The examples and user interface used to facilitate labelling the images are given in the supporting information.

The SLoF labelled dataset was highly imbalanced, with most images in class SLoF 0, compared with ~20% in SLoF 1 and ~10% in SLoF 2 (Table 2). Example images and their SLoF labels are provided in Fig. 1, which illustrates the variation in lighting conditions, anti-fouling coating quality, niche areas and biofouling organisms present in the dataset.

We divided the overall dataset of 10,263 images into a training set and a test set, as is standard practice in machine learning to enable proper evaluation. The test set consists of the 120 expert-group images plus 721 further images from 14 vessels selected to span varying degrees of fouling as determined by SLoF; it was constructed to challenge the machine learning model with different styles of vessel niches and fouling communities. This gave a total of 841 test images; the remaining data were used both to train the deep learning models and to perform five-fold cross-validation for hyperparameter tuning.
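As a hedged illustration of this split (not the authors' exact pipeline, and assuming a hypothetical labels.csv file with image, slof and split columns), the held-out test set and five cross-validation folds could be constructed as follows:

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Illustrative sketch only: hold out the designated test images and build
# five stratified cross-validation folds from the remaining labelled images.
df = pd.read_csv("labels.csv")                        # hypothetical label file
test = df[df["split"] == "test"]                      # 841 held-out test images
train = df[df["split"] != "test"].reset_index(drop=True)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(skf.split(train, train["slof"])):
    print(f"fold {fold}: {len(train_idx)} training / {len(val_idx)} validation images")
```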

Machine learning

A machine learning algorithm typically learns by training on a set of examples. We present to the algorithm a set of images with their accompanying SLoF labels (i.e. 0, 1 or 2), and we want it to accurately label images outside of this training set, i.e., to generalise to never-before-seen images.

The setup so far makes the problem a classic supervised learning task. However, unlike most image classification problems, our classes are ordinal: mistaking an image of SLoF 2 for SLoF 0 is a larger error than mistaking an image of SLoF 1 for SLoF 0. This challenge is analogous to the APTOS 2019 Blindness Detection Kaggle competition36, which asked participants to build a model that grades the severity of a disease in images on an integer scale. Many of the best performing Kaggle entries used regression losses rather than classification losses, and we follow the same approach here as it allows the relative magnitude of errors to be captured naturally.
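As a minimal sketch of why a regression loss suits ordinal labels (illustrative only, using PyTorch's smooth-L1 loss in isolation rather than the full training setup):

```python
import torch
import torch.nn as nn

# Two images are both predicted as nearly clean (raw output ~0.1); the one whose
# true label is SLoF 2 incurs a larger loss than the one whose true label is
# SLoF 1, so the relative magnitude of the error is captured by the regression loss.
pred = torch.tensor([0.1, 0.1])           # raw model outputs
target = torch.tensor([1.0, 2.0])         # true SLoF labels as floats
per_image_loss = nn.SmoothL1Loss(reduction="none")(pred, target)
print(per_image_loss)                     # the second value is the larger one
```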

To measure model performance, we consider our three-class problem as two separate binary classification tasks: (1) identify fouled images (SLoF \(= 0\) versus SLoF \(>0\)) and (2) identify heavily fouled images (SLoF \(=2\) versus SLoF \(< 2\)). This allows us to measure the effectiveness of our model as a classifier without choosing arbitrary class thresholds. Instead of the more commonly used receiver operating characteristic (ROC) curve, we use the average precision metric, as it gives a better indication of classifier performance when classes are imbalanced37,38. We apply the average precision metric to each of the two binary classification tasks, and report their mean as an overall indicator of performance.
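A minimal sketch of this evaluation (illustrative values; scikit-learn's average_precision_score stands in for whatever implementation was actually used):

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Raw regression outputs and SLoF labels for a handful of hypothetical images.
scores = np.array([0.1, 0.8, 1.6, 0.3, 1.9, 0.2])
labels = np.array([0,   1,   2,   0,   2,   0])

ap_fouled = average_precision_score(labels > 0, scores)    # task 1: SLoF > 0
ap_heavy = average_precision_score(labels == 2, scores)    # task 2: SLoF = 2
overall = (ap_fouled + ap_heavy) / 2                       # reported overall metric
```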

Given that we are working with image data, the natural deep learning architecture to use is the convolutional neural network (CNN). A CNN comprises an input layer, which in our case is an RGB image, and an output layer, which here is a single raw number related to the SLoF class of the image. Between these are multiple hidden layers, connected in sequence, that make up the architecture of the network. Each layer performs an operation on the output of the previous layer, such as a convolution, pooling operation or matrix multiplication, and the nature of these operations is determined by trainable weights26. The creators of AlexNet demonstrated that stacking a large number of these layers greatly improves the performance of CNNs on image-recognition tasks27.

Training a CNN involves many components: the choice of network architecture; a method of optimising the weights of the network (the optimiser); a differentiable function that describes network performance for a given configuration of weights (the loss function); the optimiser parameters; an image augmentation pipeline; and a learning rate schedule that modifies the size of each weight update over each epoch (i.e., pass through the training data). Together these components determine the quality of the trained network. The term hyperparameter is often used to refer to the parameters of the optimiser, the learning rate scheduler and so on. The number of possible combinations of these design components is extremely large, and the search space that can be explored to find the best combination is limited by the available computing power.

We trained and tested our deep learning models with PyTorch39, an open-source deep learning library developed by Facebook. We began the model building process by conducting a learning rate test40, using stochastic gradient descent (SGD) as the optimiser and a default set of optimiser parameters taken from the APTOS challenge. The result of this test informed a quasi-random search for the best optimiser parameters41, drawing candidates from a Sobol sequence42 to provide more even coverage of the search space than random sampling. Each candidate was evaluated by training the model for a small number of epochs, and the best sets of optimiser parameters were carried forward for further exploration alongside the default set.
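A sketch of this quasi-random search, with illustrative parameter names and ranges (not the ranges actually searched); PyTorch's SobolEngine supplies the Sobol sequence:

```python
import torch

# Draw candidate optimiser settings from a Sobol sequence; each dimension of the
# unit cube is mapped onto a (log-scaled) parameter range. Ranges are illustrative.
sobol = torch.quasirandom.SobolEngine(dimension=3, scramble=True, seed=0)
candidates = sobol.draw(16)                       # 16 quasi-random points in [0, 1)^3

for u in candidates:
    lr = 10 ** (-4 + 3 * u[0].item())             # learning rate in [1e-4, 1e-1)
    momentum = 0.80 + 0.19 * u[1].item()          # momentum in [0.80, 0.99)
    weight_decay = 10 ** (-6 + 4 * u[2].item())   # weight decay in [1e-6, 1e-2)
    # ...train with SGD(lr=lr, momentum=momentum, weight_decay=weight_decay)
    #    for a few epochs and record the cross-validated average precision...
```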

We then tested performance for different combinations of the training components. We considered the mean squared error and smooth-L1 losses, which we weighted by class frequency to remove the bias introduced by the imbalance of the dataset43. In addition to SGD, we tested the Adaptive Moment Estimation (Adam)44, Rectified Adam (RAdam)45 and Adam with decoupled weight decay (AdamW)46 optimisers. Several learning rate schedules were examined, including multi-step learning rate decay, one-cycle47 and cosine annealing48. Image augmentation pipelines are important for preventing CNNs from overfitting to the training data, and we tried two pipelines of differing complexity taken from the APTOS competition. These applied operations to the training images that did not change their class, such as rotations, random cropping, and colour and contrast adjustments.
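A sketch of the class-frequency weighting of the loss (illustrative counts, not the exact dataset figures):

```python
import torch
import torch.nn as nn

# Weight the per-image smooth-L1 loss by inverse class frequency so that the
# rarer SLoF 1 and SLoF 2 images are not swamped by the SLoF 0 majority.
counts = torch.tensor([7000.0, 2100.0, 1000.0])          # images per SLoF class (illustrative)
class_weights = counts.sum() / (len(counts) * counts)    # inverse-frequency weights

def weighted_smooth_l1(pred, target):
    """pred: raw network outputs; target: integer SLoF labels."""
    per_image = nn.SmoothL1Loss(reduction="none")(pred, target.float())
    return (per_image * class_weights[target]).mean()

loss = weighted_smooth_l1(torch.tensor([0.2, 1.4]), torch.tensor([0, 2]))
```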

We considered off-the-shelf network architectures, starting with the small resnet18 residual network, which we used to test every possible combination of the training components above. The residual network architecture was introduced to address the vanishing gradient problem in networks with many layers by allowing inputs to skip layers, and took first place in the 2015 ImageNet classification challenge49. Once the best training components were identified, we trained larger and more modern architectures on larger images, allowing us to determine whether increasing the image size from 256\(\times\)256 to 512\(\times\)512 pixels improved performance. These architectures included the “ResNeXt” squeeze-and-excitation networks, which build upon the residual learning idea and introduce a squeeze-and-excitation block that models relationships between feature channels50,51. We also tested the Inception architecture, which attempts to identify features at different scales by applying convolution layers with several differently sized kernels in parallel52, and EfficientNets, which incorporate some of these ideas into an architecture designed to scale optimally and efficiently as the number of layers is increased53. A summary of the network architectures used in this paper and the Python packages used to implement them is provided in Table 3.

We used pre-trained ImageNet weights to initialise all of our networks. These weights are created by training networks on the ImageNet database, which contains millions of images spanning a thousand categories54, and we downloaded them for each architecture through the neural network packages in Table 3. This common practice, known as transfer learning, reduces the number of epochs required to reach a performance plateau and improves results on small datasets28. All network weights were trained except those of the batch-normalisation layers, which are best estimated on large datasets such as ImageNet.
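A sketch of this initialisation for the resnet18 member (recent torchvision API; the other architectures are loaded analogously through the packages in Table 3):

```python
import torch.nn as nn
from torchvision import models

# Start from ImageNet weights, replace the classification head with a single
# regression output, and freeze the batch-normalisation parameters.
model = models.resnet18(weights="IMAGENET1K_V1")     # older torchvision: pretrained=True
model.fc = nn.Linear(model.fc.in_features, 1)        # one raw SLoF score per image

for module in model.modules():
    if isinstance(module, nn.BatchNorm2d):
        for param in module.parameters():
            param.requires_grad = False              # keep BatchNorm weights fixed
# (To also freeze the running statistics, put these modules in eval() mode after
#  each call to model.train().)
```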

The final step was creating a network ensemble. This is a technique in which the class of an image is predicted by multiple networks whose outputs are combined to obtain better performance28. We took the simplest approach, which is to average the raw network outputs. We identified the best performing ensemble by testing every combination of the networks trained at a particular image size, giving 510 possible ensembles to test per image resolution. The full details of the model fitting process are provided in the supporting documentation.
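A sketch of the averaging and combination search, with dummy validation outputs standing in for the real predictions:

```python
import itertools
import torch

# Raw validation outputs of three hypothetical member networks on four images.
model_outputs = {
    "resnet18":        torch.tensor([0.2, 1.1, 1.8, 0.4]),
    "inceptionv4":     torch.tensor([0.1, 0.9, 2.1, 0.3]),
    "efficientnet-b4": torch.tensor([0.3, 1.2, 1.7, 0.5]),
}

def ensemble_score(members):
    """Average the raw regression outputs of the chosen member networks."""
    return torch.stack([model_outputs[m] for m in members]).mean(dim=0)

# Score every combination of two or more members; the combination with the best
# cross-validated mean average precision would be kept.
names = list(model_outputs)
for r in range(2, len(names) + 1):
    for members in itertools.combinations(names, r):
        scores = ensemble_score(members)
        # ...evaluate mean average precision of `scores` against the labels...
```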

Thresholding to create a classifier

The raw output of our model is a single number, which needs to be thresholded to map back to the SLoF classes. The precision-recall curve created by combining the validation cross-folds is used to guide this mapping (Fig. 2). In particular, the curve highlights the trade-off between precision and recall when choosing a threshold: a high-precision classifier will capture only some of the positive results, while a high-recall classifier will capture most of the positive results along with many false positives. For illustrative purposes we selected three classifiers to explore, namely a high-precision classifier chosen with a 50% recall threshold, a high-recall classifier chosen with a 95% recall threshold and a balanced classifier with an 80% recall threshold.
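A sketch of how such a threshold can be read off the validation precision-recall curve (scikit-learn's precision_recall_curve used purely for illustration, with dummy scores):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def threshold_at_recall(y_true, y_score, target_recall):
    """Return the score threshold whose recall is closest to the target recall."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # precision and recall have one more entry than thresholds; drop the last point.
    idx = np.argmin(np.abs(recall[:-1] - target_recall))
    return thresholds[idx], precision[idx], recall[idx]

# Dummy validation scores for the "fouling present" task.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.8, 1.4, 0.3, 0.2, 1.9, 1.1])
for target in (0.50, 0.80, 0.95):
    print(target, threshold_at_recall(y_true, y_score, target))
```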

Comparison to experts

Perfect agreement within the SLoF labelling scheme is unlikely even among biofouling experts due to its subjectivity, so the frequency at which experts agree with each other is a useful benchmark for evaluating the performance of our models. The 120-image expert-group dataset was graded by three experts, yielding a total of 720 expert-group label pairs; these were obtained by pairing the labels of each expert with the annotations provided by the other two, and repeating the process for every expert. We also paired the expert-group labels with the expert labels and with the model labels, providing 360 label pairs each for comparing the performance of the expert and the model against the expert-group.

We assessed the significance of differences in precision and recall with a two-sided Fisher’s exact test55, implemented with the fisher.test function in R56, under the null hypothesis that, with the expert-group labels treated as ground truth, the precision or recall achieved by the expert-group labels is no different from that achieved by the other label sets. We also used the two one-sided tests (TOST) approach to test for non-inferiority57, using the TOSTER R package58; here the null hypothesis was that the agreement observed among the expert-group labels is at least 5% higher than the agreement observed between each of the other label sets and the expert-group. A separate non-inferiority test was necessary because the absence of a significant difference does not allow us to conclude that two distributions are similar59. We took p < 0.05 to signify statistical significance.
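For illustration, an equivalent two-sided Fisher's exact test can be run in Python on a 2×2 table of agreement counts (the counts below are invented, not the study's results; the actual analysis used fisher.test in R):

```python
from scipy.stats import fisher_exact

# Rows: label set compared against the expert-group (treated as ground truth);
# columns: number of label pairs that agree / disagree. Counts are invented.
table = [[655, 65],    # expert-group pairs: agree, disagree
         [108, 12]]    # model pairs:        agree, disagree
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
```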

Results

Model thresholding and performance

Our best performing model based on five-fold cross-validation was an ensemble consisting of the resnet18, se_resnext50_32x4d, se_resnext101_32x4d, inceptionv4, inceptionresnetv2, efficientnet-b4 and efficientnet-b5 CNN architectures (see Table 3) with an input image size of 512\(\times\)512 pixels, trained using the AdamW optimiser with tuned hyperparameters, a batch size of 64, the smooth-L1 loss, a multi-step learning rate decay schedule and the more complex set of image augmentations. This gave a final mean average precision of 0.796 (standard deviation 0.023), a substantial improvement over the 0.632 (standard deviation 0.025) obtained during hyperparameter tuning. The full results of the model fitting process are provided in the supporting documentation. The results for each binary classification problem with this model on our validation and test datasets are shown in Table 4. The classifiers show better results on the test dataset, which is promising for the generalisability of our model.

Inter-rater reliability

We found that experts agree most often on images showing clean or heavily fouled hulls, while images containing only some fouling were more likely to receive inconsistent grades (Fig. 3). Overall, experts showed 89% agreement for both tasks (95% CI: 87–92%). Because we considered every combination of experts, the recall and precision calculated for each task were the same. Experts achieved 91% precision and recall for identifying images containing fouling (95% CI: 88–94%) and 87% for images containing heavy fouling (95% CI: 82–90%) (Table 5).

When the rate of agreement between the expert and the expert-group was compared, the non-inferiority test indicated that the agreement was similar to within a margin of at most 5% worse (p = 0.009–0.054). This similarity can also be seen in the confusion matrix between the expert-group and expert labels (Fig. 3).

Expert agreement is a useful benchmark for our computer vision model, and depending on the thresholds chosen to create a classifier, different outcomes were found (Table 5). Choosing an 80% recall threshold for both tasks resulted in a classifier with agreement similar to that of the experts, to within a margin of at most 5% worse (p = 0.001–0.0014); the results for this classifier are shown as a confusion matrix in Fig. 3. Using a 95% recall threshold instead produced significantly higher recall with respect to the expert-group labels (p < 0.001–0.004) at the cost of significantly lower precision (p < 0.001). Conversely, using a much lower recall threshold of 50% resulted in significantly higher precision (p = 0.001–0.036), with a corresponding decrease in recall (p < 0.001).

Discussion

In this study we applied deep learning methods to identify the presence and severity of biofouling on ship hulls, and compared our performance to a group of experts. We were able to train neural networks that obtain similar results to these experts, which is highly promising as it suggests that, under the study conditions, automated analysis of biofouling in images is feasible and effective. We also demonstrated that, where high precision or high recall is desired for a particular application, classifiers can be created that outperform experts on that property, allowing the behaviour of the classifiers to be tuned accordingly. For example, when screening vessels for biosecurity risk it may be desirable to have a classifier with higher recall so that few images with severe fouling are missed. Conversely, for an activity where intervention capacity is limited, a classifier with higher precision would be more appropriate.

In practice, the performance of the method will be affected by image quality, lighting and water turbidity. The dataset that we used to train and test the model is diverse and includes images taken under a range of conditions, as shown in Fig. 1 and the supporting information. It is encouraging that despite this the model was able to achieve close to expert accuracy, suggesting that, for the images labelled by the expert-group, our approach accounts for fouling in poor quality images about as well as an expert can. This may be a best-case scenario, and performance will differ under other operational settings. However, the deep convolutional neural networks that we have trained can easily be fine-tuned to incorporate more challenging examples if needed, so the models can be readily adapted to the required conditions.

Fine-tuning with local examples will also likely improve performance when deploying the model in areas where no training data were collected, as the fouling communities present on vessels may differ from those in the training dataset. However, this issue is somewhat mitigated by our focus on classifying the overall coverage of biofouling in an image rather than the particular species present. This also allows us to side-step the challenge of identifying species or species groups, which would likely have required a larger set of images to obtain similar results32. Our approach of simply looking at the overall level of fouling is more robust for the purpose of identifying biosecurity risk: even if a particular type of organism occurs infrequently or not at all in the training data, our models may generalise information gained from other types of biofouling to detect that the hull is fouled.

The effectiveness of vessel biofouling management activities in reducing biosecurity risk is currently a key knowledge gap for regulators, making it difficult to determine which combination of activities will provide confidence that a vessel is low risk. Our model could provide a cheaper and more reliable way to identify the most effective management strategies if combined with standardised vessel sampling protocols24,60, clear definitions of vessel biosecurity risk, such as the clean hull standard for New Zealand16, collection of management data, and ongoing in-water vessel inspections. It would also support more consistent assessment of management strategies across different organisations, which is difficult to achieve with expert assessments. Building this evidence base would also benefit industry, as it would provide a basis from which to work towards regulatory alignment between jurisdictions.

In-water cleaning and hull grooming are increasingly important biofouling management activities, as regular cleaning can limit biofouling accumulation and provide options where the anti-fouling coatings of a vessel are no longer effective or have failed61,62. However, cleaning also presents a biosecurity risk, because it can release viable propagules and detached organisms may remain viable63,64,65. One way to manage this risk is to consider the biofouling state of the vessel before setting conditions on in-water cleaning or grooming activities. For example, New Zealand recommends that when macrofouling of international origin is cleaned in-water, the biological waste must be captured and disposed of on land or rendered non-viable, whereas this is not necessary if only a slime layer is present66. Automatic detection of biofouling using the deep learning tools developed in this paper could be a cost-effective and reliable way for regulators and industry to process the outcomes of biofouling inspections for this purpose.

So far we have only tested our model on static images. Since videos are constructed using a stream of images, our model should be readily adaptable to videos as well. However, further work is needed to address issues such as identifying the frames in which the camera is directed towards a vessel hull as opposed to open water or where image quality is poor, which is a common issue when analysing stills obtained from ROV footage32. The video format would also offer the opportunity to incorporate information from future and previous frames to improve and smooth fouling estimates, and ideas from current action recognition methods could potentially be applied67.

Our SLoF labelling scheme relates only to the percentage cover of macrofouling present within an image; it could be used more rigorously to determine the absolute biosecurity risk of a vessel if the area of hull captured in each image could be estimated. Given that in-water inspection methods are expected to vary greatly between jurisdictions, being able to do this without scale bars would be a major advantage. One possibility would be to train a deep neural network on images of vessel hulls taken with multiple cameras, building a model that estimates depth from a single image68.

Figure 1 Example images of different SLoF classes.

Table 1 Simplified Level of Fouling (SLoF) scale.
Table 2 Breakdown of the number of images by SLoF in the crossvalidation and test datasets.
Table 3 Summary of neural network architectures used in model building.
Figure 2 Precision-recall curve for the model using validation data from each cross-fold.

Table 4 Precision and recall of the classifiers at the chosen recall thresholds on the validation and testing datasets.
Figure 3 Confusion matrices using the SLoF score on the expert-group dataset, comparing labels between the group of experts, the expert that annotated the full dataset, and the model annotations.

Table 5 Precision and recall for expert-group, expert versus expert-group and classifier versus expert-group label pairs.