Fisheries Research

Volume 200, April 2018, Pages 49-60

Review
Robust clustering methodology for multi-frequency acoustic data: A review of standardization, initialization and cluster geometry

https://doi.org/10.1016/j.fishres.2017.12.013

Highlights

  • Cluster geometry and size need to be considered for clustering.

  • Multi-frequency acoustic data have non-spherical clusters of uneven size.

  • Non-standardized Sv differences have to be employed as variables.

  • A new initialization technique directs the clustering towards the global minimum.

  • A clustering review with a practical example is included.

Abstract

Clustering is a useful unsupervised technique for the identification of acoustic groups in multi-frequency echograms based on frequency response. K-Means is the best-known clustering technique but imposes strong requirements, such as clusters of equal size and spherical shape. Initialization is a common problem in clustering, as convergence is usually guaranteed only to a local minimum, so the initial centroids must be located near the global minimum. Expectation-Maximization (EM) clustering also requires a good set of initial centroids, but allows the identification of clusters with different statistical distributions. This work compares these techniques on a case with several acoustic signatures presenting different cluster sizes and distributions. The main issues treated in this manuscript are: pre-processing of acoustic data for clustering; initialization of centroids with theoretical scattering models; and the need to consider the geometry of the clusters in addition to their means, including variance (spread around the mean), orientation (correlation between variables), spherical or ellipsoidal shape (difference in variance between variables) and cluster size (number of observations). EM clustering is the only technique that properly separates the acoustic signatures (and noise) after using the supervised initialization presented in this study.

Introduction

Fisheries acoustics is a discipline that examines fishes and plankton species based on their scattering properties, using the measured scattering intensity known as volume backscatter (Sv, dB re m−1) (Simmonds and MacLennan, 2005). The identification of acoustic echotraces has traditionally been conducted through net sampling, known as ‘ground-truthing’. However, linking acoustic and net data is complicated due to, among other things, net avoidance and acoustic shadowing of species with lower scatter. Net sampling of deep-distributed species such as mesopelagic fish often challenges the available logistics. In addition, sampling in acoustic surveys is often directed at schools/layers with higher scatter, as echotraces of lower numerical density, or those containing species with lower scatter, are more difficult to spot. A priori knowledge of the location of different species or acoustic typologies in the echogram allows the proper sampling of all the desired targets (when biological information is also needed), and may be used to make commercial fishing more efficient, reducing by-catch. The identification of acoustic groups from acoustic data without ground-truthing requires an unsupervised technique. Ideally, a fast and computationally inexpensive methodology such as clustering is desired.

Clustering is an unsupervised machine learning technique that groups data according to similarity in the variables provided as input. As an unsupervised method, there are no labeled training data guiding the algorithm toward a particular solution. Several papers have summarized the main clustering techniques (Banerjee and Davé, 2012, Xiao and Yu, 2012), which can be divided into hard clustering, where one data point can belong to only one cluster, and fuzzy or soft clustering, where each data point may belong to several clusters through a membership function. The second group handles overlapping clusters better and is less sensitive to noise, as the influence of noise is split equally among groups.

The most well-known clustering techniques were designed for data without noise or outliers (Xiao and Yu, 2012). Robust variants have since been developed to cope with real measurement data containing noise. As shown in this paper, most clustering algorithms must also be robust to initialization (the estimation of the initial centroids). Furthermore, the geometric characteristics of the data, such as cluster size and shape, are often overlooked. For instance, the most popular algorithm, K-Means, requires data with clusters of equal size and variance (spherical clusters). Different clustering algorithms or distance measures can lead to very different results (Jain et al., 2004). There is no single algorithm suitable for all applications; knowledge of the data and checking of each algorithm's requirements reveal the most suitable one. This work focuses on that analysis for fisheries acoustic data.

The incorporation of several frequencies into fisheries and plankton acoustics gave birth to what is known as multi-frequency methods, where the difference between frequencies is employed to identify acoustic groups, comparing their spectrum with theoretical scattering models. Species are categorized into three acoustic groups: gas-bearing (including a swim bladder or pneumatophore), fluid-like (with a weak acoustic signal, such as krill and copepods), and elastic-shelled (pteropod type) (Stanton et al., 1996). The first group presents a resonance peak at a frequency that depends on swim bladder size (near 18 kHz for lantern fish and around 4 kHz for small pelagic fish). The second and third groups present scatter that increases with frequency, with the curve shifted in frequency according to organism length.

For vessel-borne echosounders, Sv is measured within a volume that increases with depth. Assuming only one acoustic typology is present in the volume, Sv depends on the scatter of one single organism (target strength, TS) and its numerical density ρ, following the equation Sv = TS + 10*log10(ρ). To remove the numerical density dependence, the Sv of a reference frequency, usually 38 kHz for historical reasons (it was the most common first frequency onboard research vessels), is subtracted from each Sv. The result is known as the frequency response, FR = Svi − Sv38 = TSi − TS38, which reduces the number of variables to the number of frequencies minus one (as FR(38) will be equal to 0 for all data points, and thus will have little influence on the clustering; see the discussion for further information). Typical working frequencies are 18, 38, 70, 120, 200 and 333 kHz but, as the usable range (depth if vertically orientated) decreases at higher frequencies, the number of frequencies that can be employed depends on the depth of the targeted species. Sv data are thus a type of curve data, like time series, where the trend (over frequency instead of time) is used to identify groups, but unlike time series, frequency is a dependent variable, while time is not (Pereira, 2013). The dependence of Sv values on frequency (serial correlation) has been modeled for the different acoustic groups; see, for example, Peña and Calise (2016) for the krill model adapted to short-length species and Peña et al. (2014) for mesopelagic fish models. As in time series, frequency shifts are bound to appear due to length differences of organisms (reflected in the TS value), as well as vertical offsets due to numerical density differences (the 10*log10(ρ) term). Calculating the FR removes that offset and achieves some translation invariance, similarly to detrending in time series. The frequency shift is minimal for similar sizes, but can be the key to differentiating species with a similar FR tendency but very different size, such as krill (∼2–4 cm) and Mysidacea (∼0.5–2.5 cm). The frequency spectrum (the variation of FR with frequency) has to be maintained in pre-processing and considered in the clustering.
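As a concrete illustration of this pre-processing, the following minimal sketch computes FR from per-frequency Sv echograms. The array names and synthetic values are hypothetical; a real pipeline would first align the echograms pixel-by-pixel and remove noise.

```python
import numpy as np

# Hypothetical Sv echograms (depth x ping), one per frequency, in dB re m^-1,
# assumed already aligned and noise-filtered.
rng = np.random.default_rng(0)
freqs = [18, 38, 70, 120, 200]  # kHz
sv = {f: rng.uniform(-90, -60, size=(100, 50)) for f in freqs}

ref = 38  # reference frequency (kHz)

# Frequency response FR(f) = Sv_f - Sv_ref, which removes the common
# 10*log10(rho) numerical-density term.
fr = {f: sv[f] - sv[ref] for f in freqs if f != ref}

# Stack into an (n_pixels, n_freqs - 1) feature matrix for clustering.
X = np.stack([fr[f].ravel() for f in sorted(fr)], axis=1)
print(X.shape)  # (5000, 4): one row per pixel, one column per non-reference frequency
```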

In fisheries acoustics, data noise is often classified as background noise and impulse noise (Ryan et al., 2015). Background noise refers to ambient and vessel noise that affects all pings and varies in intensity and pattern with vessel speed, propeller pitch, bottom depth, number of vessels in the area, etc. (Peña, 2016). Impulse noise is usually caused by interference with another acoustic device and affects only a few pings. Several algorithms have been published to remove background and impulse noise (Ryan et al., 2015, Peña, 2016). Data with a very low threshold also include white noise, a random signal with equal intensity at all frequencies, i.e. a sequence of serially uncorrelated random values with zero mean and finite variance. This noise needs to be accounted for when modeling acoustic data. The sample unit considered in this paper is the pixel, i.e. each data point in the 2D echogram as sampled by the echosounder. For an EK60 with 1 ms pulse duration, a pixel has a vertical length of ∼19 cm. The horizontal length changes with beam width and depth due to the conical shape of the acoustic beam. For a 7° beam, the horizontal length is ∼12 m at 100 m depth and ∼61 m at 500 m. Each pixel represents a particular sampled volume that changes with distance to the transducer and beam angle. Differences in sampled volume between frequencies need to be accounted for when comparing pixels, particularly in cases of small echotraces.
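The pixel dimensions quoted above can be reproduced with simple geometry. The sketch below assumes a sound speed of 1500 m/s, a conical beam, and the EK60 convention of four samples per pulse duration (so the vertical sample size is c·(τ/4)/2, and the horizontal size is the beam diameter 2·depth·tan(beamwidth/2)); these assumptions are ours, not stated in the text.

```python
import math

C = 1500.0  # assumed sound speed (m/s)

def pixel_size(depth_m, pulse_ms=1.0, beam_deg=7.0, samples_per_pulse=4):
    """Approximate EK60 pixel dimensions at a given depth.

    Vertical size: one sample, i.e. (pulse / samples_per_pulse) * c / 2.
    Horizontal size: diameter of a conical beam, 2 * depth * tan(beam / 2).
    """
    tau = pulse_ms / 1000.0
    vertical = C * (tau / samples_per_pulse) / 2            # ~0.19 m for 1 ms
    horizontal = 2 * depth_m * math.tan(math.radians(beam_deg) / 2)
    return vertical, horizontal

for depth in (100, 500):
    v, h = pixel_size(depth)
    print(f"{depth:>4} m: vertical ~{v:.2f} m, beam diameter ~{h:.1f} m")
# 100 m: vertical ~0.19 m, beam diameter ~12.2 m
# 500 m: vertical ~0.19 m, beam diameter ~61.2 m
```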

The aim of this paper is to study the behavior of clustering techniques with multi-frequency acoustic data: very noisy data with clusters that can have very different sizes (proportions of echogram pixels). A very robust initialization procedure, based on theoretical models, that properly locates centroids and provides an estimate of the number of clusters is presented. The use of standardization is also analyzed. The paper is organized as follows: a short summary of clustering methods and their requirements is given, focusing on two techniques: K-Means (KM) and Expectation-Maximization (EM) clustering (also known as Gaussian Mixture Model or GMM). KM and EM clustering have already been used with acoustic data (see Section 1.3) and are both included in the top ten algorithms in data mining (Wu et al., 2008). The geometry of clusters is defined and shown with examples. A review of clustering applied to multi-frequency acoustic data is then given. The material and methods section presents the novel technique to initialize centroids. Finally, the two techniques are compared using a challenging example and the suggested initialization method.

Clustering techniques can be classified by approach into center-based techniques, where each cluster is represented by its center, such as K-Means (Lloyd, 1982); density-based clustering like DBSCAN (Arlia and Coppola, 2001), where clusters are defined as areas of higher density surrounded by areas of lower density; and distribution-based techniques, where clusters are defined as objects belonging to the same distribution. Gaussian mixture models fitted with an Expectation-Maximization (EM) algorithm (Krishnan and McLachlan, 1997) belong to the last category and allow clusters to have different variances, densities and sizes. Density-based clustering also allows the separation of clusters of different size, but requires the calculation of distances between all pairs of data points, which is too computationally expensive with acoustic data.
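To make the three families concrete, the following hedged sketch runs one representative of each on the same synthetic data using scikit-learn; the dataset and parameter values are illustrative only and not from the study.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, DBSCAN
from sklearn.mixture import GaussianMixture

# Synthetic data: three blobs with unequal spread and size.
X, _ = make_blobs(n_samples=[500, 200, 50],
                  centers=[[0, 0], [5, 5], [9, 0]],
                  cluster_std=[1.0, 0.5, 2.0], random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)  # center-based
db = DBSCAN(eps=0.7, min_samples=10).fit_predict(X)                  # density-based (-1 = noise)
gm = GaussianMixture(n_components=3, covariance_type='full',
                     random_state=0).fit_predict(X)                  # distribution-based

print("KMeans cluster sizes:", np.bincount(km))
print("DBSCAN labels found (incl. noise):", np.unique(db))
print("GMM cluster sizes:", np.bincount(gm))
```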

Two critical aspects of clustering techniques are the pre-allocation of the number of clusters and the initialization of the centroids. Pre-selecting the number of clusters K is still a very challenging problem in clustering. The available techniques to estimate K are based on comparing different runs of the algorithm, which makes them cumbersome. Even though several cluster validity indices (CVIs) exist, they are inefficient when clusters differ widely in density or size (Zalik, 2010). They are usually based on maximizing compactness and minimizing overlap among clusters, but in the presence of noise, overlap is prone to appear. Distances between centroids do not take into account cluster shape and dispersion: points from two neighboring but compact clusters can be better separated than points from two dispersed clusters that overlap, even when the latter's centroids are far apart. Using only centroid information (as in the Davies-Bouldin measure (DB) (Davies and Bouldin, 1979), the Hartigan index (Ha) (Hartigan, 1975) or the Krzanowski-Lai index (KL) (Krzanowski and Lai, 1988)) is not sufficient to interpret the geometrical structure of the data, and therefore not sufficient to separate the clusters. The elbow method, one of the most common CVIs, based on the variance curve, was found to be unsuitable for several datasets by Milligan and Cooper (1985) and, as seen in Santos and Embrechts (2014) with 30 benchmark datasets, no cluster validity index is perfect.
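As an illustration of why estimating K is cumbersome, the sketch below implements the elbow heuristic on synthetic data, using the <10% variance-drop rule quoted later for Ross et al. (2013); note that it requires one full clustering run per candidate K.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with four well-separated clusters.
X, _ = make_blobs(n_samples=1000, centers=4, random_state=1)

# Within-cluster sum of squares (inertia) for K = 1..9: one full run per K.
inertia = [KMeans(n_clusters=k, n_init=10, random_state=1).fit(X).inertia_
           for k in range(1, 10)]

# Elbow rule: keep adding clusters while the relative drop in the
# variance curve is at least 10%.
drops = [(inertia[j] - inertia[j + 1]) / inertia[j] for j in range(len(inertia) - 1)]
k_est = next((j + 1 for j, d in enumerate(drops) if d < 0.10), len(inertia))
print("estimated number of clusters:", k_est)
```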

In general, clustering algorithms only guarantee convergence to the closest local minimum, so the initial location of the centroids must ensure that this minimum is also the global minimum. MacQueen (1967) suggested choosing K random observations as initial centroids, but different initialization runs may generate rather different clusters, and denser clusters have a higher probability of attracting one or two centroids.

Center-based techniques assume all clusters are spherical (equal variance-covariance). Standardization/normalization (centering each variable to 0 and scaling by its standard deviation or range) is often used to equalize the variance of all variables. Multi-frequency echograms often present differences in variance, as each frequency has a different sensitivity to noise and a different directivity, among other things; but, as shown below, standardization alters the FR spectrum. In addition, standardization works globally on the dataset, and even after applying it, clusters may present different distributions. The cluster size (or prior probability) is often required to be equal, i.e. each cluster must have a roughly equal number of observations. This is often not true with acoustic data, as in the example presented.
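A small numeric illustration of this point: z-scoring each FR column rescales the frequencies unequally, so the shape of the FR spectrum, which is what identifies the acoustic group, is distorted. The values below are synthetic and chosen only to show the effect.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical FR matrix: rows are pixels, columns FR(18), FR(70), FR(120).
# Column variances differ, e.g. because noise sensitivity differs per frequency.
fr = np.column_stack([rng.normal(-2.0, 1.0, 1000),   # FR(18), sd 1 dB
                      rng.normal( 3.0, 4.0, 1000),   # FR(70), sd 4 dB
                      rng.normal( 5.0, 8.0, 1000)])  # FR(120), sd 8 dB

z = (fr - fr.mean(axis=0)) / fr.std(axis=0)  # per-column standardization

# One pixel's FR spectrum before and after standardization: the relative
# differences between frequencies (the curve shape) are no longer preserved.
print("raw FR spectrum :", np.round(fr[0], 2))
print("standardized    :", np.round(z[0], 2))
```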

K-Means or Lloyd's algorithm (Lloyd, 1982) is the most popular clustering algorithm. The steps of K-Means are:

1. Randomly choose K items and make them the initial centroids.

2. For each point, find the nearest centroid and assign the point to that cluster.

3. Update the centroid of each cluster as the mean of the observations in that cluster.

4. Repeat steps 2 and 3 until no point switches clusters or the maximum number of iterations is reached.

KM uses hard membership, i.e. each data point is assigned to exactly one cluster.
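A minimal NumPy sketch of these four steps (an illustration, not the implementation used in this work) could read:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=None):
    """Minimal sketch of Lloyd's algorithm (steps 1-4 above), hard membership."""
    rng = np.random.default_rng(seed)
    # Step 1: choose K observations at random as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop when no point switches clusters.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: update each centroid as the mean of its assigned points.
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Example: labels, centroids = kmeans(np.random.rand(500, 4), k=3, seed=0)
```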

Expectation-Maximization (EM) clustering (Krishnan and McLachlan, 1997) is a distribution-based clustering method where clusters are defined by how likely their objects are to belong to the same distribution. Overfitting is avoided by constraining the algorithm to a specific number of Gaussian distributions (Gaussian mixture models).

EM clustering is a soft-membership clustering technique that allows clusters to have different sizes and statistical distributions. Instead of maximizing the differences in means between clusters, the EM clustering algorithm computes probabilities of cluster memberships based on one or more probability distributions. The objective of this clustering algorithm is to maximize the overall probability or likelihood of the data, given the resulting clusters. KM is a variant of EM clustering, with the assumption that clusters are spherical (with identical variance-covariance matrices for each cluster, assuming Gaussian distribution).

Each iteration includes two steps (Expectation and Maximization), and the algorithm finishes when the distribution parameters converge or the maximum number of iterations is reached.

E-Step: In the E-step, the cluster memberships are estimated given the observed data and the current estimates of the model parameters. This step estimates the probability of each element x belonging to each cluster Ck:

P(x|Ck) = (2π)^(−d/2) |Σk|^(−1/2) exp(−½ (x − μk)^t Σk^(−1) (x − μk))

where P(x|Ck) are the mixture components, d is the dimension, t denotes the transpose, and μk and Σk are the mean and covariance matrix of cluster Ck. The “membership weights” are calculated as:

wik = P(zik = 1|Ck) = αk Pk(xi|zk, Ck) / Σm=1..K αm Pm(xi|zm, Cm)

zi = (zi1, …, ziK) is a vector of K binary indicator variables that are mutually exclusive and exhaustive (i.e. one and only one of the zik is equal to 1, and the others are 0); z is a random variable representing the identity of the mixture component that generated x. The αk are the mixture weights, representing the probability that a randomly selected x was generated by component k, with Σk=1..K αk = 1.

M-Step: The M-step estimates the parameters of the probability distribution of each class for the next iteration, with αk = Nk/N, where Nk is the number of elements assigned to component k and N the total number of elements. The class means μk and covariances Σk are calculated as:

μk = (1/Nk) Σi=1..N wik xi  and  Σk = (1/Nk) Σi=1..N wik (xi − μk)(xi − μk)^t
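A minimal sketch of these E- and M-steps for full covariance matrices follows. It uses random initialization for brevity only, whereas the point of this paper is that initialization should come from theoretical scattering models; the small regularization term is our addition for numerical stability.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iter=100, seed=0):
    """Minimal EM for a Gaussian mixture with full covariances,
    mirroring the E- and M-steps above."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, size=k, replace=False)].copy()       # means
    sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d)] * k)    # covariances
    alpha = np.full(k, 1.0 / k)                                # mixture weights

    for _ in range(n_iter):
        # E-step: membership weights w_ik proportional to alpha_k * P(x_i|C_k).
        w = np.column_stack([alpha[j] * multivariate_normal.pdf(X, mu[j], sigma[j])
                             for j in range(k)])
        w /= np.maximum(w.sum(axis=1, keepdims=True), 1e-300)

        # M-step: alpha_k = N_k / N, then the weighted means and covariances.
        Nk = w.sum(axis=0)
        alpha = Nk / n
        mu = (w.T @ X) / Nk[:, None]
        for j in range(k):
            diff = X - mu[j]
            sigma[j] = (w[:, j, None] * diff).T @ diff / Nk[j] + 1e-6 * np.eye(d)
    return w, mu, sigma, alpha

# Example: w, mu, sigma, alpha = em_gmm(np.random.rand(500, 3), k=2)
```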

Clustering can also be classified based on the covariance structure considered (Erar, 2011). The full covariance matrix of each cluster is Σk = λk Dk Ak DkT, where the eigenvalues λk specify the cluster size, the eigenvector matrix Dk indicates the orientation (DkT is its transpose), and Ak specifies the cluster shape. With spherical clusters, all clusters have an equal shape with a diagonal covariance matrix (no correlation between variables), where the cluster size may be equal for all clusters (Σk = λI) or different (Σk = λkI). Diagonal clusters present different variances per variable, with fixed cluster size and shape (Σk = λB), varying shape but fixed cluster size (Σk = λBk), or both varying (Σk = λkBk). In the latter case the clusters are elliptical but parallel to the axes. With Σk = λDADT the clusters are elliptical, but the same covariance structure applies to all clusters. General models with full covariance do not constrain the covariance matrix to be diagonal, allowing correlation between variables (Σk = λk Dk Ak DkT). Fig. 1 shows three different cluster geometries and the corresponding covariance matrices. The left case presents a spherical cluster with variance equal to one for both variables (diagonal values). The middle figure includes a non-spherical cluster (different variances on the x and y axes) but no correlation (ellipse parallel to one of the axes). The right plot presents a non-spherical cluster with correlation between variables (the off-diagonal terms of the covariance matrix are non-zero). Changing the sign of the correlation would draw a cluster oriented in the opposite direction (top left to bottom right).
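The decomposition Σk = λk Dk Ak DkT can be made concrete with a short sketch that builds the three geometries of Fig. 1; the numeric values are illustrative. As an aside, the covariance_type options of scikit-learn's GaussianMixture ('spherical', 'diag', 'tied', 'full') correspond to the families just described.

```python
import numpy as np

def make_cov(lam, theta_deg, shape):
    """Build a 2D covariance Sigma = lam * D A D^T: lam sets the size,
    D (a rotation by theta) the orientation, and A = diag(shape) the shape."""
    t = np.radians(theta_deg)
    D = np.array([[np.cos(t), -np.sin(t)],
                  [np.sin(t),  np.cos(t)]])
    A = np.diag(shape)
    return lam * D @ A @ D.T

spherical = make_cov(1.0, 0.0,  (1.0, 1.0))   # Fig. 1 left: identity covariance
diagonal  = make_cov(1.0, 0.0,  (4.0, 1.0))   # middle: elongated, axis-aligned
rotated   = make_cov(1.0, 45.0, (4.0, 1.0))   # right: correlated variables

for name, cov in [("spherical", spherical), ("diagonal", diagonal), ("rotated", rotated)]:
    print(name, "\n", np.round(cov, 2))
```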

This paper is focused on the application of robust partitional (non-hierarchical) clustering techniques to fisheries and plankton multi-frequency acoustic data at the pixel level, for the identification of acoustic groups based exclusively on their FR. Previous works in this area include Anderson et al. (2007), Woillez et al. (2012) and Ross et al. (2013). Ross et al. (2013) applied KM to broadband data (71 frequencies), comparing the use of absolute Sv, FR and RGB. FR data were calculated by subtracting the maximum Sv of each observation. Although they called this pre-processing normalization, note that it was applied not to columns but to rows, and was thus a decentering technique that removes the numerical density term. The RGB data were created by shrinking the number of variables (71 frequencies) into a three-dimensional color-based space that represents the general tendency of the spectrum. They used random initialization and the elbow method (<10% variation of the variance curve) to estimate the number of clusters. Anderson et al. (2007) employed the EM clustering algorithm with acoustic data, but used Sv values as variables and initialized centroids with the clusters found by a KM pre-processing step. They employed a version of the Bayesian Information Criterion (BIC) that considers the sum of the probabilities of all points belonging to their allocated cluster to estimate the number of clusters. Woillez et al. (2012) combined unsupervised and supervised learning by joining training on labeled data with clustering of unlabeled data; FR data were employed. The unsupervised portion used EM clustering initialized with KM (with no mention of how KM was initialized). The BIC method was used to estimate the number of clusters.

A similar application with monofrequency acoustic data, using Sv at different depths as variables, was presented in Behagle et al. (2016) and Boersch-Supan et al. (2017). Behagle et al. (2016) estimated the number of clusters with the Calinski criterion, which considers the within- and between-group dispersion. Boersch-Supan et al. (2017) clustered vertical profiles of mesopelagic acoustic data using K-medoids (equivalent to KM with the L1 distance). They employed the silhouette technique to estimate the number of clusters. The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation), calculated with any distance metric, such as the L2 distance. Thus, only means are considered.
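A minimal example of the silhouette technique for choosing the number of clusters follows; it uses K-Means with the L2 distance rather than K-medoids, and synthetic data, purely for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=600, centers=3, random_state=3)

# Mean silhouette (cohesion vs. separation, here with the L2 distance)
# for each candidate K; the maximum suggests the number of clusters.
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=3).fit_predict(X)
    print(k, round(silhouette_score(X, labels, metric='euclidean'), 3))
```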

Similar works have been applied to a mixture of external variables (temperature, salinity, etc.) and aggregated acoustic data, usually at the school level (Campanella and Taylor, 2016, Cox et al., 2010, Fablet et al., 2009), but also to data averaged in ‘nodes’ defining larger aggregations (Buelens et al., 2009) or even to fish acoustic tracks (Rakowitz et al., 2012). Clustering employing the ‘kernel trick’ was used in Buelens et al. (2009). Although the kernel trick allows the non-linear projection of data into a subspace where clusters can be linearly separated, it requires the calculation of distances between all pairs of points, which makes it impractical for large datasets, such as acoustic data at the pixel level. In that publication, the echogram was pre-clustered into nodes according to different smoothing and averaging techniques, which is not comparable with the current study. Hierarchical clustering is not considered in this review, although it has been applied to mixtures of environmental and aggregated acoustic data (Bertrand et al., 1999, Domokos, 2009, Doray et al., 2009).


Example data

Simulated data including four clusters were first employed to determine separately the influence of the different geometrical parameters of clusters on KM. Only the most informative examples are shown. Then, multi-frequency acoustic data recorded during the SCAPA surveys (four seasonally distributed research surveys, in February, April, July and November 2015, carried out to study the structure and carbon pathways of the planktonic food web) were employed with KM and EM clustering. Data from
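The exact simulation parameters are not given in this snippet; the following sketch generates four clusters with unequal sizes and covariance geometries in the same spirit, with all values assumed.

```python
import numpy as np

rng = np.random.default_rng(42)

# Four simulated clusters with unequal sizes and covariance geometries
# (parameters assumed, not taken from the study).
specs = [
    (5000, [0, 0], [[1.0, 0.0], [0.0, 1.0]]),   # large spherical cluster
    ( 500, [6, 6], [[4.0, 0.0], [0.0, 0.5]]),   # small, axis-aligned ellipse
    ( 200, [0, 8], [[2.5, 1.5], [1.5, 2.5]]),   # small, correlated variables
    (  50, [9, 0], [[0.3, 0.0], [0.0, 0.3]]),   # very small, compact cluster
]
X = np.vstack([rng.multivariate_normal(m, c, size=n) for n, m, c in specs])
y = np.repeat(np.arange(4), [s[0] for s in specs])
print(X.shape, np.bincount(y))
```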

Results

The influence of cluster geometry on clustering was analyzed with simulated data and K-Means. The results show that the most influential aspect is clearly cluster size, which greatly conditions the random initialization process. Cluster variance and orientation are also relevant. The upper plots in Fig. 3 show the resulting clustering of four clusters of different cluster size, with and without standardization. The random initialization always locates more than one centroid in one of the more

Discussion

This work evaluates the clustering performance of two algorithms (KM and EM clustering) with acoustic data, focusing on pre-processing, cluster geometry and initialization. The use of Sv as variables produces clusters that depend on numerical density, as in Anderson et al. (2007), with clusters named ‘low Sv’ and ‘high Sv’. The use of absolute Sv data in Ross et al. (2013) was equivalent to the clusters found using only one frequency, proving that the clustering was based on Sv intensity. Table 7 in

Conclusions

Selecting the correct clustering algorithm for the data at hand, based on its requirements, is key to an accurate result. Non-spherical clusters and correlation between variables in multi-frequency acoustic data need to be considered for clustering. A good initialization is also essential to locate the centroids near the global minimum for all techniques; a new methodology based on theoretical scattering models is presented. Differences of Sv (FR) are necessary to remove the dependence on numerical density and

Acknowledgments

We thank all the scientists and crew involved in the SCAPA project (CTM2013-45089-R) for their collaboration.

References (41)

  • A. Banerjee et al., Robust clustering, WIREs Data Min. Knowl. Discov. (2012)
  • A. Bertrand et al., Acoustic characterisation of micronekton distribution in French Polynesia, Mar. Ecol. Prog. Ser. (1999)
  • B. Buelens et al., Kernel methods for the detection and classification of fish schools in single-beam and multibeam acoustic data, ICES J. Mar. Sci. (2009)
  • L. Calise et al., Sensitivity investigation of the SDWBA Antarctic krill target strength model to fatness, material contrast and orientation, CCAMLR Sci. (2011)
  • M. Chen, Matlab File Exchange (2016)
  • D.L. Davies et al., A cluster separation measure, IEEE Trans. Pattern Anal. Mach. Intell. (1979)
  • D.A. Demer et al., Calibration of Acoustic Instruments, Tech. Rep., ICES Coop. Res. Rep. No. 326 (2015)
  • R. Domokos, Environmental effects on forage and longline fishery performance for albacore (Thunnus alalunga) in the American Samoa Exclusive Economic Zone, Fish. Oceanogr. (2009)
  • M. Doray et al., The influence of the environment on the variability of monthly tuna biomass around a moored, fish-aggregating device, ICES J. Mar. Sci. (2009)
  • B. Erar, Mixture Model Cluster Analysis Under Different Covariance Structures Using Information Complexity (2011)