Review

Robust clustering methodology for multi-frequency acoustic data: A review of standardization, initialization and cluster geometry
Introduction
Fisheries acoustics is a discipline that examines fishes and plankton species based on their scattering properties, using the measured scattering intensity known as volume backscatter (S, dB re 1 m−1) (Simmonds and MacLennan, 2005). The identification of acoustic echotraces has traditionally been conducted through net sampling, known as ‘ground-truthing’. However, linking acoustic and net data is complicated due to, among other things, net avoidance and acoustic shadowing of species with lower scatter. Net sampling of deep-distributed species such as mesopelagic fish often challenges the available logistics. In addition, sampling in acoustic surveys is often directed at schools/layers with higher scatter, as echotraces of lower numerical density, or those that contain species with lower scatter, are more difficult to spot. A priori knowledge of the location of different species or acoustic typologies in the echogram allows the proper sampling of all the desired targets (when biological information is also needed), and may be used to make commercial fishing more efficient, reducing by-catch. Identifying acoustic groups from acoustic data without ground-truthing requires an unsupervised technique. Ideally, a quick and not very computationally demanding methodology, such as clustering, is desired.
Clustering is an unsupervised machine learning technique that groups data according to similarity in the variables provided as input. As an unsupervised method, there is no labeled training data orienting the algorithm towards a particular solution. Several papers have summarized the main clustering techniques (Banerjee and Davé, 2012, Xiao and Yu, 2012), which can be divided into hard clustering, where one data point can only belong to one cluster, and fuzzy or soft clustering, where each data point may belong to several clusters through a membership function. The second group handles overlapping clusters better and is less sensitive to noise, as the influence of noise is split among groups.
The most well-known clustering techniques have been designed for data without noise or outliers (Xiao and Yu, 2012). Robust variations have subsequently been developed to adapt to real measurement data that contain noise. As shown in this paper, most clustering algorithms must also be robust to initialization (initial centroid estimation). Furthermore, the geometric characteristics of the data used, such as cluster size and shape, are often overlooked. For instance, the most popular algorithm, K-Means, requires data with clusters of equal size and variance (spherical clusters). Different clustering algorithms or distance measures can lead to very different results (Jain et al., 2004). There is no single algorithm suitable for all applications and thus, knowledge of the data and checking of the requirements will reveal the most suitable one. This work focuses on that analysis for fisheries acoustic data.
The incorporation of several frequencies into fisheries and plankton acoustics gave birth to what is known as multi-frequency methods, where the difference between frequencies is employed to identify acoustic groups, comparing their spectrum with theoretical scattering models. Species are categorized into three acoustic groups: gas-bearing (including a swim bladder or pneumatophore), fluid-like (with a weak acoustic signal, such as krill and copepods), and elastic shell (pteropod type) (Stanton et al., 1996). The first group presents a resonance peak at a particular frequency that depends on swim bladder size (near 18 kHz for lantern fish and around 4 kHz for small pelagic fish). The second and third groups present increasing scatter with frequency, shifted in frequency with length. For vessel-borne echosounders, S is measured within a volume that increases with depth. Assuming only one acoustic typology is present in the volume, S depends on the scatter of one single organism (target strength, TS) and its numerical density ρ, following the equation S = TS + 10 log10(ρ). To remove the numerical density dependence, the S of a reference frequency is subtracted from each S, usually 38 kHz for historical reasons (as it was the most common first frequency onboard research vessels). The results are known as the Frequency Response, FR(f) = S(f) − S(38 kHz), which reduces the number of variables to the number of frequencies minus one (as FR(38) takes the same value for all data points and thus has little influence on the clustering; see discussion for further information). Typical working frequencies are 18, 38, 70, 120, 200 and 333 kHz but, as the usable range (depth if vertically orientated) decreases at higher frequencies, the number of frequencies that can be employed depends on the depth of the targeted species.
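As a toy illustration, the FR computation is a per-pixel subtraction of the reference frequency in the dB domain. The S values below are invented for the example; only the frequencies and the 38 kHz reference come from the text.

```python
# Hypothetical volume backscatter readings (S, dB re 1 m^-1) for one pixel,
# keyed by frequency in kHz; the values are invented for illustration.
s = {18: -68.2, 38: -70.5, 70: -72.1, 120: -74.8, 200: -77.3}

ref = 38  # reference frequency (kHz)

# FR(f) = S(f) - S(38): subtracting in the dB domain removes the common
# numerical-density term, leaving (number of frequencies - 1) variables.
fr = {f: value - s[ref] for f, value in s.items() if f != ref}
```

Subtraction in dB is equivalent to a ratio in the linear domain, which is why the operation cancels the density term shared by all frequencies.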
FR data are thus a type of curve data, like time series, where the trend (with frequency instead of time) is used to identify groups; but unlike time series, frequency is a dependent variable, while time is not (Pereira, 2013). The dependence of values on frequency (serial correlation) has been modeled for the different acoustic groups. See, for example, Peña and Calise (2016) for the krill model adapted to short-length species and Peña et al. (2014) for mesopelagic fish models. As in time series, frequency shifts are bound to appear due to length differences of organisms (reflected in the TS value), as well as vertical offsets due to numerical density differences (the 10 log10(ρ) term). Calculating the FR removes that offset and achieves some translation invariance, in a similar way to detrending in time series. The frequency shift is minimal for similar sizes, but could be the key to differentiating species with a similar FR tendency but very different size, such as krill (∼2–4 cm) and Mysidacea (∼0.5–2.5 cm). The frequency spectrum (FR variation with frequency) has to be maintained in pre-processing and considered in the clustering.
In fisheries acoustics, data noise is often classified as background noise and impulse noise (Ryan et al., 2015). Background noise refers to ambient and vessel noise that affects all pings and varies in intensity and pattern with vessel speed, propeller pitch, bottom depth, number of vessels in the area, etc. (Peña, 2016). Impulse noise is usually caused by interference with another acoustic device and affects only a few pings. Several algorithms have been published to remove background and impulse noise (Ryan et al., 2015, Peña, 2016). Data with a very low threshold also include white noise, a random signal having equal intensity at different frequencies: a sequence of serially uncorrelated random data with zero mean and finite variance. This noise needs to be accounted for when modeling acoustic data. The sample unit considered in this paper is the pixel, i.e. each data point in the 2D echogram as sampled by the echosounder. For an EK60 with 1 ms pulse duration, a pixel has a vertical length of ∼19 cm. The horizontal length changes with beam width and depth due to the conical shape of the acoustic beam. For a 7° beam, the horizontal length is ∼12 m at 100 m depth and ∼61 m at 500 m. Each pixel represents a particular sampled volume that changes with distance to the transducer and beam angle. Differences in sampled volume between frequencies need to be accounted for when comparing pixels, particularly in cases of small echotraces.
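The across-beam width of a pixel follows directly from the cone geometry. A minimal sketch (the 7° beamwidth comes from the text; the function name is ours):

```python
import math

def footprint_width(range_m, beam_deg=7.0):
    """Across-beam width (m) of a conical beam at a given range:
    w = 2 * r * tan(beamwidth / 2), so the width grows linearly with range."""
    return 2.0 * range_m * math.tan(math.radians(beam_deg / 2.0))

# For a 7 degree beam the width is ~12 m at 100 m range.
w100 = footprint_width(100.0)
```

Because the width is linear in range, the sampled volume per pixel grows quickly with depth, which is why volume differences between frequencies (each with its own beamwidth) matter for small echotraces.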
The aim of this paper is to study the behavior of clustering techniques with multi-frequency acoustic data: very noisy data with clusters that can have very different sizes (proportions of echogram pixels). A very robust initialization procedure based on theoretical models, which properly locates centroids and provides an estimate of the number of clusters, is presented. The use of standardization is also analyzed. The paper is organized as follows: a short summary of clustering methods and their requirements is given, focusing on two techniques: K-Means (KM) and Expectation-Maximization (EM) clustering (also known as Gaussian Mixture Model or GMM). KM and EM clustering have already been used with acoustic data (see Section 1.3) and are both included in the top ten algorithms in data mining (Wu et al., 2008). The geometry of clusters is defined and shown with examples. A review of clustering applied to multi-frequency acoustic data is then given. The material and methods section presents the novel technique to initialize centroids. Finally, the two techniques are compared using a challenging example and the suggested initialization method.
Clustering techniques can be classified based on the clustering approach as center-based techniques, where one cluster is represented by its center, such as K-Means (Lloyd, 1982); density-based clustering like DBSCAN (Arlia and Coppola, 2001), where clusters are defined as areas of higher density surrounded by lower density areas; and distribution-based techniques, with clusters defined as objects belonging to the same distribution. Gaussian mixture models fitted with an Expectation-Maximization (EM) algorithm (Krishnan and McLachlan, 1997) are included in the last category, and allow clusters to have different variances, densities and sizes. Density-based clustering also allows the separation of clusters of different size, but requires the calculation of distances between all pairs of data points, which is too computationally expensive with acoustic data.
Two of the critical aspects of clustering techniques are the pre-allocation of the number of clusters and the initialization of the centroids. Pre-selecting the number of clusters K is still a very challenging problem in clustering. The available techniques to estimate K are based on comparing different runs of the algorithm, which makes them cumbersome. Even though several cluster validity indices (CVIs) exist, they are inefficient when clusters widely differ in density or size (Zalik, 2010). They are usually based on maximizing compactness and minimizing overlap among clusters, but in the presence of noise, overlapping is prone to appear. Distances between centroids do not take into account cluster shape and dispersion: two compact neighboring clusters may be well separated even when their centroids are close, while two dispersed clusters may overlap despite the distance between their centroids being large. Using only centroid information (such as with the Davies-Bouldin measure (DB) (Davies and Bouldin, 1979), the Hartigan index (Ha) (Hartigan, 1975) or the Krzanowski-Lai index (KL) (Krzanowski and Lai, 1988)) is not sufficient to interpret the geometrical structure of the data, and therefore not sufficient to assess the separation between clusters. The elbow method, one of the most common CVIs, based on the variance curve, was found to be unsuitable for several datasets in Milligan and Cooper (1985) and, as seen in Santos and Embrechts (2014) with 30 benchmark datasets, no cluster validation index is perfect.
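To make the limitation concrete, a simplified sketch of the Davies-Bouldin index (our own minimal implementation, using the mean distance to the centroid as the dispersion measure) shows that it only combines centroid distances with one scalar dispersion per cluster:

```python
import math

def davies_bouldin(clusters, centroids):
    """Simplified Davies-Bouldin index (lower is better). Each cluster is a
    list of points; dispersion is the mean distance to the centroid, so the
    arrangement (shape, orientation) of points around each centroid is ignored."""
    k = len(centroids)
    sigma = [sum(math.dist(p, c) for p in cl) / len(cl)
             for cl, c in zip(clusters, centroids)]
    db = 0.0
    for i in range(k):
        # Worst-case similarity ratio of cluster i with any other cluster.
        db += max((sigma[i] + sigma[j]) / math.dist(centroids[i], centroids[j])
                  for j in range(k) if j != i)
    return db / k
```

Two compact, well-separated clusters score low, but the index never sees how the points are distributed around each centroid, which is the geometrical information the text argues is missing.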
In general, clustering algorithms only guarantee convergence to the closest local minimum, so the initial location of the centroids must ensure that this minimum is also the global one. MacQueen (1967) suggested choosing K random observations as initial centroids, but different initialization runs may generate rather different clusters, and denser clusters have a higher probability of attracting more than one centroid.
Center-based techniques assume all clusters are spherical (equal variance-covariance). Standardization/normalization (centering each variable to 0 and scaling by its standard deviation or range) is often used to equalize the variances of all variables. Multi-frequency echograms often present differences in variance, as each frequency has a different sensitivity to noise and directivity, among other things; but, as shown below, standardization alters the FR spectrum. In addition, standardization works globally on the dataset, so even after applying it, clusters may present different distributions. The cluster size (or prior probability) is often required to be equal, i.e. each cluster has a roughly equal number of observations. This is often not true with acoustic data, as in the example presented.
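A toy example (the FR values are invented) shows how column-wise standardization distorts the within-pixel spectrum that the clustering is supposed to read:

```python
import statistics

# Rows are pixels, columns are two hypothetical FR variables (invented values).
pixels = [[0.0, 0.0],   # flat spectrum
          [2.0, 8.0]]   # rising spectrum (fluid-like tendency)

# Column-wise z-scoring: center each variable and scale by its (population) std.
cols = list(zip(*pixels))
mean = [statistics.fmean(c) for c in cols]
std = [statistics.pstdev(c) for c in cols]
standardized = [[(v - m) / s for v, m, s in zip(row, mean, std)]
                for row in pixels]

# The rising spectrum [2, 8] becomes flat after standardization, because the
# higher-variance column is shrunk more than the other.
```

The per-column scaling is applied irrespective of the within-row trend, so a spectrum shape that distinguishes acoustic groups can be flattened or inverted.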
K-Means or Lloyd's algorithm (Lloyd, 1982) is the most popular clustering algorithm. The steps of K-Means are:
1. Randomly choose K items and make them the initial centroids.
2. For each point, find the nearest centroid and assign the point to that cluster.
3. Update the centroid of each cluster as the mean of the observations in that cluster.
4. Repeat steps 2 and 3 until no point switches clusters or the maximum number of iterations is reached.
KM uses hard membership, i.e. each data point is assigned to exactly one cluster.
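The steps above can be sketched in a few lines of pure Python. For reproducibility this sketch defaults to taking the first K points as initial centroids instead of step 1's random choice; random (MacQueen) or model-based initialization would replace that line.

```python
def kmeans(points, k, max_iter=100, init=None):
    """Minimal K-Means (Lloyd's algorithm) sketch with hard membership."""
    # Step 1: initial centroids (first k points here for reproducibility).
    centroids = list(init) if init is not None else list(points[:k])
    for _ in range(max_iter):                       # step 4: iterate
        clusters = [[] for _ in range(k)]
        for p in points:                            # step 2: nearest centroid
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        # Step 3: recompute each centroid as the mean of its members.
        new = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centroids[j]
               for j, cl in enumerate(clusters)]
        if new == centroids:                        # step 4: stop when stable
            break
        centroids = new
    return centroids, clusters
```

Each point contributes to exactly one centroid update, which is the hard-membership behavior described above.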
Expectation-Maximization (EM) clustering (Krishnan and McLachlan, 1997) is a distribution-based clustering method where clusters are defined based on how likely their objects are to belong to the same distribution. Overfitting is avoided by constraining the algorithm to a specific number of Gaussian distributions (Gaussian mixture models).
EM clustering is a soft-membership clustering technique that allows clusters to have different sizes and statistical distributions. Instead of maximizing the differences in means between clusters, the EM clustering algorithm computes probabilities of cluster memberships based on one or more probability distributions. The objective of this clustering algorithm is to maximize the overall probability or likelihood of the data, given the resulting clusters. KM is a variant of EM clustering, with the assumption that clusters are spherical (with identical variance-covariance matrices for each cluster, assuming Gaussian distribution).
Each iteration includes two steps (Expectation and Maximization) and the algorithm finishes when the distribution parameters converge or reach the maximum number of iterations.
E-Step: In the E-step, the probability of each element x belonging to each cluster Ck is estimated given the observed data and the current estimates of the model parameters:

P(x|Ck) = (2π)^(−d/2) |Σk|^(−1/2) exp(−(1/2) (x − μk)^t Σk^(−1) (x − μk))

where the P(x|Ck) are the mixture components, d indicates the dimension, t the transpose, and μk and Σk are the mean vector and covariance matrix of cluster Ck. The “membership weights” are calculated as follows:

wik = P(zik = 1|xi) = αk P(xi|Ck) / Σj αj P(xi|Cj)

where zi is a vector of K binary indicator variables that are mutually exclusive and exhaustive (i.e. one and only one of the zik's is equal to 1, and the others are 0); z is a random variable representing the identity of the mixture component that generated x. The αk are the mixture weights, representing the probability that a randomly selected x was generated by component k, with Σk αk = 1.
M-Step: The M-step estimates the parameters of the probability distribution of each class for the next step:

αk = Nk / N, with Nk = Σi wik

where Nk is the number of elements assigned to component k and N the total number of elements. The cluster means μk and covariances Σk are calculated as

μk = (1/Nk) Σi wik xi
Σk = (1/Nk) Σi wik (xi − μk)(xi − μk)^t
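The two steps can be sketched for the one-dimensional case in pure Python (scalar variances stand in for the full covariance matrices Σk; function and variable names are ours):

```python
import math

def em_gmm_1d(data, means, sds, weights, iters=50):
    """EM for a 1-D Gaussian mixture: the E-step computes membership weights
    w_ik, the M-step re-estimates alpha_k, mu_k and sigma_k from them."""
    k = len(means)
    for _ in range(iters):
        # E-step: w_ik proportional to alpha_k * P(x_i | C_k).
        W = []
        for x in data:
            p = [w * math.exp(-(x - m) ** 2 / (2 * s ** 2))
                 / (s * math.sqrt(2 * math.pi))
                 for m, s, w in zip(means, sds, weights)]
            tot = sum(p)
            W.append([pj / tot for pj in p])
        # M-step: alpha_k = N_k / N with N_k = sum_i w_ik, then means/variances.
        nk = [sum(row[j] for row in W) for j in range(k)]
        weights = [n / len(data) for n in nk]
        means = [sum(row[j] * x for row, x in zip(W, data)) / nk[j]
                 for j in range(k)]
        sds = [max(math.sqrt(sum(row[j] * (x - means[j]) ** 2
                                 for row, x in zip(W, data)) / nk[j]), 1e-6)
               for j in range(k)]
    return means, sds, weights
```

Unlike KM, every point contributes (softly) to every component's update through its membership weight, and each component keeps its own variance and mixture weight.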
Clustering can also be classified based on the covariance structure considered (Erar, 2011). The full covariance matrix of each cluster can be written as Σk = λk Dk Ak Dk^t, where the eigenvalue-based scalar λk specifies the cluster size, the matrix of eigenvectors Dk indicates the orientation (Dk^t is its transpose), and the diagonal matrix Ak specifies the cluster shape. With spherical clusters, all clusters have an equal shape with a diagonal covariance matrix (no correlation between variables), where the cluster size may be equal for all clusters (Σk = λI) or different (Σk = λkI). Diagonal clusters present different variances per variable, with fixed cluster size and shape (Σk = λB), varying shape but fixed cluster size (Σk = λBk), or both varying (Σk = λkBk). In the latter case, the clusters are elliptical but parallel to the axes. With Σk = λDAD^t the clusters are elliptical, but the same covariance structure applies to all clusters. General models with full covariance do not constrain the covariance matrix to being diagonal, allowing correlation between variables (Σk = λk Dk Ak Dk^t). Fig. 1 shows three different cluster geometries and the corresponding covariance matrices. The left case presents a spherical cluster with variance equal to one for both variables (diagonal values). The middle figure includes a non-spherical cluster (difference in variance between the x and y axes) but no correlation (parallel to one of the axes). The right plot presents a non-spherical cluster with correlation between variables (the non-diagonal terms of the covariance matrix are non-zero). Changing the correlation sign would draw a cluster oriented in the opposite direction (top left to bottom right).
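A small sketch (2x2 case, helper name ours) builds a covariance matrix from the size/shape/orientation decomposition Σ = λ D A D^t:

```python
import math

def covariance_from_geometry(size, shape, angle_deg):
    """Build a 2x2 covariance matrix as lambda * D * A * D^t:
    `size` (lambda) scales the cluster, `shape` = (a1, a2) is the diagonal
    of A (relative axis lengths), and `angle_deg` sets the orientation D
    (a rotation matrix of eigenvectors)."""
    t = math.radians(angle_deg)
    D = [[math.cos(t), -math.sin(t)], [math.sin(t), math.cos(t)]]
    A = [[shape[0], 0.0], [0.0, shape[1]]]
    # Sigma = size * D @ A @ D^t, written out for the 2x2 case.
    DA = [[sum(D[i][k] * A[k][j] for k in range(2)) for j in range(2)]
          for i in range(2)]
    return [[size * sum(DA[i][k] * D[j][k] for k in range(2)) for j in range(2)]
            for i in range(2)]
```

With a zero angle and equal shape terms the result is spherical (diagonal, equal variances); a non-zero angle with unequal shape terms produces non-zero off-diagonal terms, i.e. the tilted, correlated cluster of the right-hand case; flipping the angle's sign flips the correlation sign.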
This paper is focused on the application of robust partitional (non-hierarchical) clustering techniques to fisheries and plankton multi-frequency acoustic data at the pixel level, for the identification of acoustic groups exclusively based on their FR. Previous works in this area include Anderson et al. (2007), Woillez et al. (2012) and Ross et al. (2013). Ross et al. (2013) applied KM to broadband data (71 frequencies), comparing the use of absolute S, FR and RGB. FR data were calculated by subtracting the maximum S of each observation. Although they termed this pre-processing normalization, note that it was applied not to columns but to rows, and was thus a de-centering technique that removes the numerical density term. The RGB data were created by shrinking the number of variables (71 frequencies) into a three-dimensional color-based space that represents the general tendency of the spectrum. They used random initialization and the elbow method (taking <10% variation of the variance curve as the estimate of the number of clusters). Anderson et al. (2007) employed the EM clustering algorithm with acoustic data, but using S values as variables and initializing centroids with the clusters found by a KM pre-processing. They employed a version of the Bayesian Information Criterion (BIC) that considers the sum of the probabilities of all points belonging to their allocated cluster to estimate the number of clusters. Woillez et al. (2012) combined unsupervised and supervised learning by joining training of labeled data with clustering of unlabelled data; FR data were employed. The unsupervised portion used EM clustering initialized with KM (with no mention of how KM was initialized). The BIC method was used to estimate the number of clusters.
A similar application with monofrequency acoustic data, using S at different depths as variables, was presented in Behagle et al. (2016) and Boersch-Supan et al. (2017). Behagle et al. (2016) estimated the number of clusters with the Calinski criterion, which considers the within- and between-group dispersion. Boersch-Supan et al. (2017) clustered vertical profiles of mesopelagic acoustic data using K-medoids (equivalent to KM with the L1 distance). They employed the silhouette technique to estimate the number of clusters. The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation), calculated with any distance metric, such as the L2 distance. Thus, only means are considered.
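For reference, the silhouette value can be sketched as follows (our own minimal implementation using the L2 distance; singleton clusters are given a value of 0 by convention):

```python
import math

def mean_silhouette(points, labels):
    """Mean silhouette over all points: s_i = (b_i - a_i) / max(a_i, b_i),
    where a_i is the mean distance to the point's own cluster (cohesion) and
    b_i the smallest mean distance to any other cluster (separation)."""
    clusters = {}
    for p, lb in zip(points, labels):
        clusters.setdefault(lb, []).append(p)
    total = 0.0
    for p, lb in zip(points, labels):
        own = clusters[lb]
        if len(own) == 1:
            continue  # silhouette of a singleton is 0 by convention
        a = sum(math.dist(p, q) for q in own if q is not p) / (len(own) - 1)
        b = min(sum(math.dist(p, q) for q in cl) / len(cl)
                for lb2, cl in clusters.items() if lb2 != lb)
        total += (b - a) / max(a, b)
    return total / len(points)
```

Values approach 1 for compact, well-separated labelings and turn negative when points sit closer to another cluster than to their own; note that only distances are used, with no account of cluster shape or orientation.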
Similar works have been applied to a mixture of external variables (temperature, salinity, etc.) and aggregated acoustic data, usually at the school level (Campanella and Taylor, 2016, Cox et al., 2010, Fablet et al., 2009), but also to data averaged in ‘nodes’ defining larger aggregations (Buelens et al., 2009) or even to fish acoustic tracks (Rakowitz et al., 2012). Clustering employing the ‘kernel trick’ was used in Buelens et al. (2009). Although the kernel trick allows the non-linear projection of data into a subspace where clusters can be linearly separated, it requires the calculation of distances between all pairs of points, which makes it impractical for large datasets, such as acoustic data at the pixel level. In that publication, the echogram was pre-clustered into nodes according to different smoothing and averaging techniques, which is not comparable with the current study. Hierarchical clustering is not considered in this review, although it has been applied to a mixture of environmental and aggregated acoustic data (Bertrand et al., 1999, Domokos, 2009, Doray et al., 2009).
Example data
Simulated data including four clusters were first employed to separately determine the influence of the different geometrical parameters of clusters in KM. Only the most informative examples are shown. Then, multi-frequency acoustic data recorded during the SCAPA surveys, four seasonally distributed (February, April, July and November 2015) research surveys carried out to study the structure and carbon pathways of the planktonic foodweb, were employed with KM and EM clustering. Data from
Results
The influence of cluster geometry on clustering was analyzed with simulated data and K-Means. The results show that the most influential aspect is clearly cluster size, which greatly conditions the random initialization process. Cluster variance and orientation are also relevant. The upper plots in Fig. 3 show the resulting clustering of four clusters of different cluster size, with and without standardization. The random initialization always locates more than one centroid in one of the more
Discussion
This work evaluates the clustering performance of two algorithms (KM and EM clustering) with acoustic data, focusing on pre-processing, cluster geometry and initialization. The use of S as variables produces clusters that depend on numerical density, as in Anderson et al. (2007), with clusters named ‘low S’ and ‘high S’. The use of absolute S data in Ross et al. (2013) was equivalent to clusters found using only one frequency, showing that the clustering was driven by S intensity. Table 7 in
Conclusions
Selecting the correct clustering algorithm for the data, based on its requirements, is key for an accurate result. Non-spherical clusters and correlation between variables in multi-frequency acoustic data need to be considered for clustering. A good initialization is also essential to locate the centroids near the global minimum for all techniques; a new methodology based on theoretical scatter models is presented. Differences of S (FR) are necessary to remove the dependence on numerical density and
Acknowledgments
We thank the collaboration of all scientists and crew involved in the SCAPA project (CTM2013-45089-R).
References (41)
- Behagle et al. (2016). Acoustic micronektonic distribution is structured by macroscale oceanographic processes across 20–50° S latitudes in the South-Western Indian Ocean. Deep Sea Res. Part I: Oceanogr. Res. Pap.
- Boersch-Supan et al. (2017). The distribution of pelagic sound scattering layers across the southwest Indian Ocean. Deep Sea Res. Part II: Top. Stud. Oceanogr.
- Campanella and Taylor (2016). Investigating acoustic diversity of fish aggregations in coral reef ecosystems from multi-frequency fishery sonar surveys. Fish. Res.
- Cox et al. (2010). Three-dimensional observations of swarms of Antarctic krill (Euphausia superba) made using a multi-beam echosounder. Deep Sea Res. Part II: Top. Stud. Oceanogr.
- Peña (2016). Incrementing the data quality of multi-frequency echograms using the Adaptive Wiener Filter (AWF) denoising algorithm. Deep Sea Res. Part I: Oceanogr. Res. Pap.
- Peña and Calise (2016). Use of SDWBA predictions for acoustic volume backscattering and the Self-Organizing Map to discern frequencies identifying Meganyctiphanes norvegica from mesopelagic fish species. Deep Sea Res. Part I: Oceanogr. Res. Pap.
- Rakowitz et al. (2012). Use of high-frequency imaging sonar (DIDSON) to observe fish behaviour towards a surface trawl. Fish. Res.
- Ross et al. (2013). On the use of high-frequency broadband sonar to classify biological scattering layers from a cabled observatory in Saanich Inlet, British Columbia. Methods Oceanogr.
- Anderson et al. (2007). Classifying multi-frequency fisheries acoustic data using a robust probabilistic classification technique. J. Acoust. Soc. Am.
- Arlia and Coppola (2001). Experiments in parallel clustering with DBSCAN.
- Banerjee and Davé (2012). Robust clustering. WIREs Data Min. Knowl. Discov.
- Bertrand et al. (1999). Acoustic characterisation of micronekton distribution in French Polynesia. Mar. Ecol. Prog. Ser.
- Buelens et al. (2009). Kernel methods for the detection and classification of fish schools in single-beam and multibeam acoustic data. ICES J. Mar. Sci.
- Sensitivity investigation of the SDWBA Antarctic krill target strength model to fatness, material contrast and orientation. CCAMLR Sci.
- Matlab File Exchange
- Davies and Bouldin (1979). A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell.
- Calibration of Acoustic Instruments. Tech. Rep., ICES Coop. Res. Rep. No. 326.
- Domokos (2009). Environmental effects on forage and longline fishery performance for albacore (Thunnus alalunga) in the American Samoa Exclusive Economic Zone. Fish. Oceanogr.
- Doray et al. (2009). The influence of the environment on the variability of monthly tuna biomass around a moored, fish-aggregating device. ICES J. Mar. Sci.
- Erar (2011). Mixture Model Cluster Analysis Under Different Covariance Structures Using Information Complexity.