
Boosting the performance of anomalous diffusion classifiers with the proper choice of features


Published 26 May 2022. © 2022 The Author(s). Published by IOP Publishing Ltd.
Special issue: Characterisation of Physical Processes from Anomalous Diffusion Data.
Citation: Patrycja Kowalek et al 2022 J. Phys. A: Math. Theor. 55 244005. DOI 10.1088/1751-8121/ac6d2a

Abstract

Understanding and identifying different types of single molecules' diffusion that occur in a broad range of systems (including living matter) is extremely important, as it can provide information on the physical and chemical characteristics of particles' surroundings. In recent years, an ever-growing number of methods have been proposed to overcome some of the limitations of the mean-squared displacement approach to tracer diffusion. In March 2020, the anomalous diffusion (AnDi) challenge was launched by a community of international scientists to provide a framework for an objective comparison of the available methods for AnDi. In this paper, we introduce a feature-based machine learning method developed in response to task 2 of the challenge, i.e. the classification of different types of diffusion. We discuss two sets of attributes that may be used for the classification of single-particle tracking data. The first one was proposed as our contribution to the AnDi challenge. The second one is the result of our attempt to improve the performance of the classifier after the deadline of the competition. Extreme gradient boosting was used as the classification model. Although the deep-learning approach constitutes the state-of-the-art technology for data classification in many domains, we deliberately decided to pick this traditional machine learning algorithm due to its superior interpretability. After the extension of the feature set, our classifier achieved an accuracy of 0.83, which is comparable with the top methods based on neural networks.


Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

Single-particle tracking (SPT) is a popular method for observing molecular dynamics in a wide range of materials, including living cells [1, 2]. Typically, an SPT experiment results in trajectories of tracers, i.e. time series of their consecutive positions, the analysis of which allows one to extract properties of the molecules and their environment (e.g. density of obstacles, velocity).

The key first step in the analysis of SPT data is to connect the observed motion of particles with the available models of diffusion [3], since it already sheds light on the mechanical properties of the particles' surroundings [4]. The most common approach to characterize the diffusion of particles is the mean-squared displacement (MSD) [3, 5, 6], which measures their deviations with respect to a reference position in time. An MSD growing linearly in time corresponds to Brownian motion (a.k.a. normal diffusion), which describes the movement of a microscopic particle as a consequence of thermal forces. Any deviation from that linear behavior is called anomalous diffusion (AnDi) [7], which can occur in physical, biological, or geological systems. Those deviations are usually represented by a power-law scaling of the MSD (i.e. $\mathrm{MSD}(t)\propto {t}^{\alpha }$) [3, 5, 6], with α being the AnDi exponent. A sublinear MSD (α < 1) corresponds to subdiffusion. It indicates trapped particles [8, 9], particles that collide with obstacles [10, 11] or particles slowed down by the viscoelastic properties of the environment [12]. Subdiffusion can take place in crowded cellular fluids and membranes [13], and can be caused by crowding and complex interactions with the cytoskeleton and macromolecular complexes: in the cytoplasm [14–16], the nucleus [17] and the plasma membrane [18–20]. A superlinear MSD (α > 1) is referred to as superdiffusion. In this case, the movement of particles is usually faster than in normal diffusion and frequently directed to a specific location. Examples of superdiffusion include Lévy flights [21, 22], active cytoplasmic flows and transport mediated by molecular motors [23, 24], particle motion in random potentials or in turbulent flows [25, 26], foraging by animals and humans [27], and many others [28].

Despite its simplicity, the MSD-based analysis of SPT trajectories is truly challenging. The experimental trajectories are very often too short to extract meaningful information from the MSD. Moreover, the finite precision of tracer localization adds a term to the MSD, which can limit the interpretation of data [29–31]. Consequently, several other approaches going beyond the MSD have been introduced. For instance, the radius of gyration [32], the velocity autocorrelation function [33, 34], the time-dependent directional persistence of trajectories [35], the distribution of directional changes [36], the mean maximum excursion method [37], or the fractionally integrated moving average framework [38] may efficiently replace the MSD estimator for classification purposes. The full distribution of displacements may be fitted to a mixed model in order to extract differences in diffusive behavior between subsets of particle ensembles [39]. Heterogeneity within single trajectories can be checked with hidden Markov models [40, 41].

In the last few years, machine learning (ML) has become an interesting alternative for the analysis of AnDi. This approach is very appealing because, in contrast to the analytical methods, it does not require explicit rules for data processing. Instead, an ML algorithm can learn those rules directly from a series of data. ML is already known to excel in different domains including computer vision [42], speech recognition [43] and natural language processing [44]. First attempts to apply these methods to SPT data turned out to be very promising, even though the characterization of diffusion remains very challenging. Bayesian approaches [45–47], random forests [48–52], gradient boosting [49–52], neural networks [53], and deep neural networks [49, 54–57] have been used to either classify the trajectories or to extract quantitative information about them.

The increasing number of methods, both analytical and ML-based, and the lack of an objective comparison between them have resulted in the launch of the AnDi challenge by a team of international scientists [58]. The goal of the challenge was at least two-fold: to provide a framework to benchmark the existing and new methods for the analysis of AnDi and to spur the invention of new high-performance methods. The challenge itself was divided into three main tasks: (1) inference of the AnDi exponent α from the trajectories provided by the organizers, (2) classification of trajectories into five different models of diffusion, and (3) detection of points within single trajectories, at which the anomalous exponent α and/or the diffusion coefficient D change. From the results of the challenge, it follows that the ML methods based on neural networks outperform the more traditional statistics-based approaches [59]. However, it should be emphasized here that the latter usually offer a deeper insight into the mechanisms of classification.

The main obstacle limiting the deployment of ML to trajectory analysis is the availability of decent training data. The experimental trajectories come without ground-truth labels (otherwise we would not need any new method of analysis). As a consequence, synthetic trajectories generated by computer simulations of various diffusion models are typically used as training data. The organizers of the AnDi challenge have identified five models of particular importance for the interpretation of experimental results: continuous time random walk (CTRW) [60], fractional Brownian motion (FBM) [61], Lévy walk (LW) [26, 62, 63], annealed transient time motion (ATTM) [64], and scaled Brownian motion (SBM) [65]. To ensure that all participants of the challenge use the same data for inventing/improving their methods, a Python package with the implementation of the models has been provided by the organizers.

This paper describes the details of the method we proposed in response to the AnDi challenge. We focus on the classification of 2D trajectories, i.e. a part of task 2 within the challenge. In our recent studies [49, 51, 52, 57], we already investigated the applicability of both feature-based models and deep learning techniques to AnDi. The first approach requires a set of human-engineered features to characterize each trajectory. Vectors of these features are then used as input for a classification model. Deep learning methods, in contrast, extract significant features from raw data on their own, without any effort from a human expert. The deep learning approach yielded slightly better performance than the feature-based ones, in line with the general findings of the AnDi challenge, but we decided to further elaborate on the latter due to its superior interpretability. Our method turned out to be inferior to those of the winning teams, but still offered a reasonable performance [59]. Moreover, further investigation of the method after the challenge allowed us to significantly improve its accuracy by adding some new features and to achieve performance similar to the best methods (0.83 accuracy in task 2 versus 0.88 accuracy of the winners).

The paper is organized as follows. The diffusion models used for training and the classification method are introduced in section 2. Section 3 briefly summarizes the dataset used for training. Results of our analysis are presented in section 4. Finally, some concluding remarks are given.

2. Models and methods

2.1. Anomalous exponent α

The most popular method of deducing the particles' type of motion from their trajectories is based on the MSD [3, 5, 6, 66], defined as

$\mathrm{MSD}(t)=E\left[{\Vert {X}_{t}-{X}_{0}\Vert }^{2}\right],\qquad (1)$

where ${({X}_{t})}_{t > 0}$ is a particle's trajectory, ||⋅|| stands for the Euclidean distance and E is the ensemble average. Since in many experiments only a limited number of trajectories is observed, the time-averaged MSD (TAMSD) calculated from a single trajectory is usually used as the estimator of MSD,

$\mathrm{TAMSD}(n{\Delta}t)=\frac{1}{N-n+1}{\sum }_{i=0}^{N-n}{\Vert {X}_{i+n}-{X}_{i}\Vert }^{2}.\qquad (2)$

The trajectory consists of N + 1 consecutive two-dimensional positions Xi = (xi , yi ) (i = 0, ..., N) recorded with a constant time interval Δt. The quantity n is the time lag between the initial and the final positions of the particle. If the underlying process is ergodic and has stationary increments, TAMSD converges to the theoretical MSD [67].
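As an illustration, the following minimal numpy sketch implements equations (2) and (3); the function names, the lag range and the pure power-law fit (without a noise term) are our own illustrative choices, not the exact implementation used in this work.

```python
import numpy as np

def tamsd(traj, n):
    """Time-averaged MSD (equation (2)) of a 2D trajectory at integer lag n."""
    disp = traj[n:] - traj[:-n]                 # displacements over lag n
    return np.mean(np.sum(disp ** 2, axis=1))   # mean squared Euclidean norm

def estimate_alpha(traj, dt=1.0, max_lag=10):
    """Least-squares fit of log TAMSD versus log lag, cf. equation (3)."""
    lags = np.arange(1, max_lag + 1)
    msds = [tamsd(traj, n) for n in lags]
    slope, _ = np.polyfit(np.log(lags * dt), np.log(msds), 1)
    return slope

rng = np.random.default_rng(0)
bm = np.cumsum(rng.normal(size=(1000, 2)), axis=0)  # ordinary Brownian motion
print(estimate_alpha(bm))                           # should be close to 1
```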

AnDi refers to a broad class of processes that deviate from the standard Brownian motion. Those deviations display an asymptotic power-law dependence of the MSD on the time lag,

$\mathrm{MSD}(t)\sim {K}_{\alpha }{t}^{\alpha },\qquad (3)$

where Kα is the generalized diffusion coefficient and α is the anomalous exponent. As already mentioned in the introduction, the value of the latter may be used to classify the trajectories into normal diffusion (α = 1), subdiffusion (α < 1), and superdiffusion (α > 1).

2.2. Stochastic models of diffusion

Particles' movements that exhibit deviations from the linear behavior of the MSD may be described by multiple models, depending on some specific properties of the corresponding trajectories. Five such processes have been identified within the AnDi challenge as crucial for the interpretation of the experimental data [58, 59]: CTRW, FBM, LW, ATTM and SBM (see figure 1 for example trajectories). In this section, the main characteristics of those models are briefly summarized.

Figure 1. Examples of trajectories generated with different anomalous exponents by the five models of diffusion: CTRW, FBM, LW, ATTM, and SBM.

2.2.1. Continuous time random walk

CTRW defines a large family of random walks with arbitrary displacement density for which the time spent between subsequent steps (i.e. the waiting time) is a stochastic variable [60]. Here, we will consider a particular instance of CTRW with the waiting times sampled from a power-law distribution $\psi (t)\sim {t}^{-\sigma }$, and the displacements drawn from a Gaussian distribution with mean zero and variance D, $N(0,\sqrt{D})$. In this case, we have α = σ − 1.
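Under these conventions, a CTRW trajectory can be sampled in a few lines. The sketch below is our own minimal construction (the function name and the unit lower cut-off of the waiting-time law are assumptions), drawing Pareto-type waiting times by inverse-CDF sampling and accumulating Gaussian steps; it is not the andi-dataset implementation.

```python
import numpy as np

def ctrw_2d(n_steps, sigma=1.5, D=1.0, rng=None):
    """Sketch of a 2D CTRW: waiting times with density psi(t) ~ t^(-sigma)
    for t >= 1, Gaussian steps of variance D; expected alpha = sigma - 1."""
    rng = rng or np.random.default_rng()
    # inverse-CDF sampling of a Pareto law with tail index sigma - 1
    waits = (1.0 - rng.random(n_steps)) ** (-1.0 / (sigma - 1.0))
    times = np.cumsum(waits)                    # jump times
    steps = rng.normal(0.0, np.sqrt(D), size=(n_steps, 2))
    return times, np.cumsum(steps, axis=0)      # positions after each jump

times, traj = ctrw_2d(500, sigma=1.5)           # subdiffusive: alpha = 0.5
```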

2.2.2. Fractional Brownian motion

FBM [61] is the solution of the stochastic differential equation

$\mathrm{d}{X}_{t}=\sigma \,\mathrm{d}{B}_{t}^{H}.\qquad (4)$

Here, σ > 0 is the scale coefficient, which relates to the diffusion coefficient D via $\sigma =\sqrt{2D}$. H ∈ (0, 1) is the Hurst parameter and ${B}_{t}^{H}$ is a continuous-time, zero-mean Gaussian process starting at zero, with the following covariance function

$E\left[{B}_{t}^{H}{B}_{s}^{H}\right]=\frac{1}{2}\left({\vert t\vert }^{2H}+{\vert s\vert }^{2H}-{\vert t-s\vert }^{2H}\right).\qquad (5)$

The value of H determines the type of diffusion in the process. For $H< \frac{1}{2}$, FBM produces subdiffusion (regime with a negatively correlated noise). For $H > \frac{1}{2}$, FBM generates superdiffusive motion (positively correlated noise). It reduces to the normal diffusion at $H=\frac{1}{2}$.

2.2.3. Lévy walk

LW [26, 62, 63] is another model of a random walk, but in contrast to CTRW, the waiting times and step lengths are coupled. In this work, we assume that the waiting times are power-law distributed (as in the case of CTRW, $\psi (t)\sim {t}^{-\sigma }$) and that the displacements and waiting times are correlated in such a way that the probability to move a step δx in time t and stop at the new position to wait for a new random event is ${\Psi}(\delta x,t)=\frac{1}{2}\psi (t)\delta (\vert \delta x\vert -vt)$, where v is the velocity. Then, one can show that the anomalous exponent α = 2 for 0 < σ < 1 and α = 3 − σ for 1 < σ < 2. Note also that the distribution of displacements is not Gaussian.

2.2.4. Annealed transient time motion

ATTM describes the motion of a Brownian particle with a time-varying diffusion coefficient [64]. The tracer performs Brownian motion with a random diffusion coefficient D1 for a random time t1, then with D2 for t2 and so on. The diffusion coefficients are drawn from a power-law distribution $P(D)\sim {D}^{\sigma -1}$. If the random times are sampled from a distribution with the expected value $E[t\vert D]={D}^{-\gamma }$ (σ < γ < σ + 1), then we have α = σ/γ. Following reference [59], we will assume that $P(t\vert D)\propto \delta (t-{D}^{-\gamma })$.

2.2.5. Scaled Brownian motion

The SBM is a process described by the Langevin equation with a time-dependent diffusivity K(t),

$\frac{\mathrm{d}X(t)}{\mathrm{d}t}=\sqrt{2K(t)}\,\xi (t),\qquad (6)$

where ξ(t) is white Gaussian noise. If the diffusivity has the form K(t) = αKα tα−1, the MSD follows the power-law MSD(t) ∝ Kα tα .

2.3. Classification model

In machine learning, all classification algorithms may be divided into two classes. Traditional machine learning comprises statistical learning methods which do not operate on raw data. Instead, each data sample is characterized by a vector of human-engineered features or attributes. Those vectors are then used as input for the classifier. The second class consists of deep learning methods that identify and extract features on their own. The representation of data is constructed automatically, and there is no need for complex preprocessing as in the case of the feature-based methods.

Deep learning is nowadays the state-of-the-art technology for data classification in many domains and somewhat overshadows the qualities of classical machine learning. However, the choice of a suitable classification method is usually more subtle than simply looking at its performance. Many deep learning methods indeed do an excellent job with respect to the predictions but are extremely complex to interpret. To give an example, the ResNet18 architecture we considered in one of our previous attempts at trajectory classification [57] originally had 11,220,420 parameters. We were able to reduce their number down to 399,556 with a positive impact on accuracy. Although it was an impressive achievement, interpretation of all of the remaining parameters is still almost impossible. Thus, if one wants to comprehend the decisions made by an ML algorithm, one may be tempted to go for feature-based methods and gain in interpretability at the expense of accuracy.

Being aware of this trade-off, we deliberately decided to follow the interpretability path in our contribution to the AnDi challenge and chose the extreme gradient boosting (XGB) algorithm with decision trees as base learners [68] as the classification method. We have already used a similar approach [49, 51, 52], albeit for different sets of data.

XGB is an ensemble method, which combines multiple decision trees to obtain better predictive performance. A single decision tree [69] is fairly simple to build. The original dataset is split into smaller subsets based on the values of a given feature. The process is recursively repeated until the resulting subsets are homogeneous (all samples from the same class) or further splitting does not improve the classification performance. A splitting feature for each step is chosen according to a similarity score measure [70].

Single decision trees are famous for their ease of interpretation. However, they are sensitive to even small variations of data and prone to overfitting. That is why ensembles (forests) of trees are usually used instead as reliable classifiers. In the case of the XGB algorithm, the trees are built in a stage-wise fashion. We start with a single tree trained on random subsets of data and features, and then at every step a new tree is added to the ensemble. This new tree learns from the mistakes committed by the previous trees.
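In Python, this stage-wise ensemble is available through the xgboost package. The sketch below is only schematic: the file names are placeholders, and the shown hyperparameter values are arbitrary rather than the tuned values reported in section 4.

```python
import numpy as np
from xgboost import XGBClassifier

# Placeholder files: one human-engineered feature vector per trajectory,
# labels 0..4 encoding the five diffusion models.
X_train = np.load("features_train.npy")
y_train = np.load("labels_train.npy")

clf = XGBClassifier(
    n_estimators=200,      # trees added stage-wise, each correcting the last
    max_depth=10,
    learning_rate=0.3,     # the learning rate ("eta")
    subsample=0.8,         # random subset of samples per tree
    colsample_bytree=0.8,  # random subset of features per tree
)
clf.fit(X_train, y_train)
```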

2.4. Diffusion characteristics for AnDi challenge

The main challenge of feature-based classification methods is the proper choice of features that can differentiate between samples belonging to different classes. The impact of their choice on the final performance of the classifier was already discussed in reference [52]. In this work, we continue the search for a robust set of features for the classification of different diffusion modes.

In this section, we briefly introduce features that have been used for the AnDi challenge. Since we were not satisfied with the performance of the resulting classifier in task 2 [59], after the end of the challenge we further searched for features that would improve the classification results. Those additional features will be described in section 2.5. Both the original feature set for the challenge and its extension are listed in table 1.

Table 1. The features used to characterize the SPT trajectories. The original set of features for the AnDi challenge (left column, see section 2.4 for definitions) has been extended afterwards (right column, see section 2.5 for further details) to improve the performance of the classifier.

Original features | Additional features
Anomalous exponent | D'Agostino–Pearson test statistic
Diffusion coefficient | Kolmogorov–Smirnov statistic against χ2 distribution
Asymmetry | Noah exponent
Efficiency | Moses exponent
Empirical velocity autocorrelation function | Joseph exponent
Fractal dimension | Detrending moving average
Maximal excursion | Average moving window characteristics
Mean maximal excursion | Maximum standard deviation
Mean gaussianity |
MSD ratio |
Kurtosis |
Statistics based on p-variation |
Straightness |
Trappedness |

2.4.1. Anomalous exponent

In our set of features, we included four estimates of the anomalous exponent α (see equation (3)), calculated using the following methods:

  • the standard estimation, based on fitting the empirical TAMSD from equation (2) to equation (3),
  • three estimation methods proposed for trajectories with noise [71], under the assumption that the noise is normally distributed with zero mean:
    • using estimator
      Equation (7)
      with ${n}_{\mathrm{max}}$ equal to 0.1 times the trajectory length, rounded down to the nearest integer (but not less than 4),
    • simultaneous fitting of the parameters $\hat{D}$, $\hat{\alpha }$ and $\hat{\sigma }$ in the equation
      $\mathrm{TAMSD}(n{\Delta}t)=2d\hat{D}{(n{\Delta}t)}^{\hat{\alpha }}+2d{\hat{\sigma }}^{2},\qquad (8)$
      where d denotes the dimension, D is the diffusion coefficient and σ2 the variance of the noise,
    • simultaneous fitting of the parameters $\hat{D}$ and $\hat{\alpha }$ in the equation
      $\mathrm{TAMSD}(n{\Delta}t)=2d\hat{D}{(n{\Delta}t)}^{\hat{\alpha }}.\qquad (9)$

2.4.2. Diffusion coefficient

We used the diffusion coefficient extracted from the fit of the empirical TAMSD to equation (3).

2.4.3. Asymmetry

The asymmetry of a trajectory can detect directed motion. According to Saxton [32], it can be derived from the gyration tensor, which describes the second moments of positions of a particle. For a 2D random walk of N steps, the tensor is given by

$T=\begin{pmatrix}\frac{1}{N}{\sum }_{j=1}^{N}{\left({x}_{j}-\langle x\rangle \right)}^{2} & \frac{1}{N}{\sum }_{j=1}^{N}\left({x}_{j}-\langle x\rangle \right)\left({y}_{j}-\langle y\rangle \right)\\ \frac{1}{N}{\sum }_{j=1}^{N}\left({x}_{j}-\langle x\rangle \right)\left({y}_{j}-\langle y\rangle \right) & \frac{1}{N}{\sum }_{j=1}^{N}{\left({y}_{j}-\langle y\rangle \right)}^{2}\end{pmatrix},\qquad (10)$

where $\langle x\rangle =(1/N){\sum }_{j=1}^{N}{x}_{j}$ is the average of x coordinates over all steps in the random walk. We define the asymmetry as [72]

$a=-\mathrm{log}\left(1-\frac{{\left({\lambda }_{1}-{\lambda }_{2}\right)}^{2}}{2{\left({\lambda }_{1}+{\lambda }_{2}\right)}^{2}}\right),\qquad (11)$

where λ1 and λ2 are the principal radii of gyration, i.e. the eigenvalues of the tensor T.
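A compact numpy version of equations (10) and (11), with the formula for a as reconstructed above and a hypothetical function name, could look as follows.

```python
import numpy as np

def asymmetry(traj):
    """Asymmetry feature from the eigenvalues of the 2D gyration tensor,
    following equations (10) and (11)."""
    T = np.cov(traj.T, bias=True)          # 2x2 tensor of second moments
    lam1, lam2 = np.linalg.eigvalsh(T)     # principal radii of gyration
    return -np.log(1.0 - (lam1 - lam2) ** 2 / (2.0 * (lam1 + lam2) ** 2))
```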

2.4.4. Efficiency

Efficiency E measures the linearity of a trajectory. It relates the net squared displacement of a particle to the sum of squared step lengths,

$E=\frac{{\Vert {X}_{N}-{X}_{0}\Vert }^{2}}{N{\sum }_{i=1}^{N}{\Vert {X}_{i}-{X}_{i-1}\Vert }^{2}}.\qquad (12)$

It may help to detect directed motion (superdiffusion).

2.4.5. Empirical velocity autocorrelation function

The empirical velocity autocorrelation function [73] for lag 1 and point n is defined as

${\chi }_{n}=\frac{1}{N-n}{\sum }_{i=0}^{N-n-1}\left({X}_{i+n+1}-{X}_{i+n}\right)\cdot \left({X}_{i+1}-{X}_{i}\right).\qquad (13)$

It can be used to distinguish subdiffusion processes. In our model, we used χn for points n = 1 and n = 2.

2.4.6. Fractal dimension

Fractal dimension is a measure of the space-filling capacity of a pattern (a trajectory in our case). According to Katz and George [74], the fractal dimension of a planar curve may be calculated as

${D}_{f}=\frac{\mathrm{log}\,N}{\mathrm{log}\left(Nd{L}^{-1}\right)},\qquad (14)$

where L is the total length of the trajectory, N is the number of steps, and d is the largest distance between any two positions. It takes values around 1 for straight trajectories (i.e. directed motion), around 2 for random ones (normal diffusion), and around 3 for constrained trajectories (subdiffusion).
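For completeness, here is a small numpy/scipy sketch of this estimate, using the form of equation (14) as reconstructed above; the function name is ours.

```python
import numpy as np
from scipy.spatial.distance import pdist

def fractal_dimension(traj):
    """Katz-George estimate of equation (14) for a planar trajectory."""
    steps = np.linalg.norm(np.diff(traj, axis=0), axis=1)
    L = steps.sum()                  # total length of the path
    N = len(steps)                   # number of steps
    d = pdist(traj).max()            # largest distance between any two positions
    return np.log(N) / np.log(N * d / L)
```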

2.4.7. Maximal excursion

Maximal excursion of the particle is given by the formula:

$\mathrm{ME}=\frac{{\mathrm{max}}_{i}\Vert {X}_{i+1}-{X}_{i}\Vert }{\Vert {X}_{N}-{X}_{0}\Vert }.\qquad (15)$

It should detect relatively long jumps (in comparison to the overall displacement).

2.4.8. Mean maximal excursion

According to reference [37], the mean maximal excursion is usually a better observable than MSD to determine the anomalous exponent α. Given the largest distance traveled by a particle,

${D}_{N}={\mathrm{max}}_{i}\Vert {X}_{i}-{X}_{0}\Vert ,\qquad (16)$

the mean maximal excursion is defined as its standardized value, i.e.:

$\mathrm{MME}=\frac{{D}_{N}}{{\hat{\sigma }}_{N}}.\qquad (17)$

Here, ${\hat{\sigma }}_{N}$ is a consistent estimator of the standard deviation of DN ,

${\hat{\sigma }}_{N}^{2}=\frac{1}{2N}{\sum }_{i=1}^{N}{\Vert {X}_{i}-{X}_{i-1}\Vert }^{2}.\qquad (18)$

2.4.9. Mean gaussianity

Gaussianity g(n) [75] checks the Gaussian statistics of increments of a trajectory and is defined as

$g(n)=\frac{\langle {r}_{n}^{4}\rangle }{2{\langle {r}_{n}^{2}\rangle }^{2}}-1,\qquad (19)$

where $\langle {r}_{n}^{k}\rangle $ denotes the kth moment of the trajectory at time lag n:

$\langle {r}_{n}^{k}\rangle =\frac{1}{N-n+1}{\sum }_{i=0}^{N-n}{\Vert {X}_{i+n}-{X}_{i}\Vert }^{k}.\qquad (20)$

Gaussianity for normal diffusion is equal to 0. The same result should be obtained for FBM, since its increments follow Gaussian distribution. Other types of motion should show deviations from zero.

Instead of looking at gaussianities at single time lags, we will include the mean over all lags as one of the features:

$\langle g\rangle =\frac{1}{N-1}{\sum }_{n=1}^{N-1}g(n).\qquad (21)$

2.4.10. Mean-squared displacement ratio

MSD ratio gives information about the shape of the corresponding MSD curve. We will define it as

$\mathrm{MSDR}=\frac{\langle {r}_{{n}_{1}}^{2}\rangle }{\langle {r}_{{n}_{2}}^{2}\rangle }-\frac{{n}_{1}}{{n}_{2}},\qquad (22)$

where n1 < n2. MSDR is zero for normal diffusion (α = 1), while MSDR ⩾ 0 for sub- and MSDR ⩽ 0 for superdiffusion. We simply took n2 = n1 + 1 and calculated an averaged ratio for every trajectory.

2.4.11. Kurtosis

Kurtosis gives insight into the asymmetry and peakedness of the distribution of points within a trajectory [72]. To calculate it, the position vectors Xi are projected onto the dominant eigenvector $\vec{r}$ of the gyration tensor (10),

${x}_{i}^{p}={X}_{i}\cdot \vec{r}.\qquad (23)$

Kurtosis is then defined as the fourth moment of ${x}_{i}^{p}$,

$K=\frac{1}{N}{\sum }_{i=1}^{N}\frac{{\left({x}_{i}^{p}-{\bar{x}}^{p}\right)}^{4}}{{\sigma }_{{x}^{p}}^{4}},\qquad (24)$

with ${\bar{x}}^{p}$ being the mean projected position and ${\sigma }_{{x}^{p}}$—the standard deviation of xp .

2.4.12. Statistics based on p-variation

The empirical p-variation is given by the formula [76, 77]:

${V}_{m}^{(p)}={\sum }_{k=0}^{\lfloor N/m\rfloor -1}{\Vert {X}_{(k+1)m}-{X}_{km}\Vert }^{p}.\qquad (25)$

These statistics can be used to detect fractional Lévy stable motion (including FBM). We defined six features based on ${V}_{m}^{(p)}$:

  • power γp fitted to p-variation for lags 1 to 5 [76],
  • statistics P used in reference [52], based on the monotonicity changes of ${V}_{m}^{(p)}$ as a function of m:
    Equation (26)

2.4.13. Straightness

Straightness S measures the average direction change between subsequent steps. It relates the net displacement of a particle to the sum of all step lengths,

$S=\frac{\Vert {X}_{N}-{X}_{0}\Vert }{{\sum }_{i=1}^{N}\Vert {X}_{i}-{X}_{i-1}\Vert }.\qquad (27)$

2.4.14. Trappedness

With trappedness we will refer to the probability that a diffusing particle is trapped in a bounded region with radius r0. A comparison of analytical and Monte Carlo results for confined diffusion allowed Saxton [32] to estimate this probability with

$P=1-\mathrm{exp}\left(0.2048-0.25117\,\frac{Dt}{{r}_{0}^{2}}\right).\qquad (28)$

Here, r0 is approximated by half of the maximum distance between any two positions along a given trajectory. D in equation (28) is estimated by fitting the first two points of the MSD curve (i.e. it is the so called short-time diffusion coefficient).

2.5. Additional diffusion characteristics

In order to improve the performance of the original classifier, we searched for further feature candidates after the AnDi challenge. The additional features extending the basic set are described below.

2.5.1. D'Agostino-Pearson test statistic

The D'Agostino–Pearson κ2 test statistic [78, 79] is a goodness-of-fit measure that aims to establish whether or not a given sample comes from a normal distribution. It is defined as

${\kappa }^{2}={Z}_{1}{\left({g}_{1}\right)}^{2}+{Z}_{2}{\left(K\right)}^{2},\qquad (29)$

where K is the sample kurtosis given by equation (24) and ${g}_{1}={m}_{3}/{m}_{2}^{3/2}$ is the sample skewness with mj being the jth sample central moment. The transformations Z1 and Z2 should bring the distributions of the skewness and kurtosis as close to the standard normal as possible. Their definitions may be found elsewhere [78, 79]. This feature should help to distinguish ATTM and SBM from other motions.

2.5.2. Kolmogorov–Smirnov statistic against χ2 distribution

The Kolmogorov–Smirnov statistic quantifies the distance between the empirical distribution function of the sample Fn (X) and the cumulative distribution function G(X) of a reference distribution,

${D}_{n}={\mathrm{sup}}_{X}\,\vert {F}_{n}(X)-G(X)\vert .\qquad (30)$

Here, n is the number of observations (i.e. the length of a trajectory). The value of this statistic for the empirical distribution of squared increments of a trajectory against the χ2 distribution has been taken as the next feature. The rationale behind this choice is that for a Gaussian trajectory the theoretical distribution of squared increments is precisely the χ2 distribution.
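As a sketch of how such a feature can be computed, scipy's kstest can be compared against a χ2 law with two degrees of freedom (one per coordinate); the per-coordinate standardization of the increments is our own assumption.

```python
import numpy as np
from scipy import stats

def ks_chi2_feature(traj):
    """KS distance between the standardized squared increments of a 2D
    trajectory and a chi-square law with 2 degrees of freedom."""
    incr = np.diff(traj, axis=0)
    incr = incr / incr.std(axis=0)        # unit variance per coordinate
    sq = np.sum(incr ** 2, axis=1)        # squared increment lengths
    return stats.kstest(sq, "chi2", args=(2,)).statistic
```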

2.5.3. Noah, Moses, and Joseph exponents

For processes with stationary increments, there are in principle two mechanisms that violate the Gaussian central limit theorem and produce anomalous scaling of the MSD: long-time increment correlations or a fat-tailed increment distribution (in the latter case the second moment is divergent) [80]. These mechanisms are referred to as the Joseph and Noah effects, respectively. FBM is the prototypical process that exhibits the Joseph effect. LW, on the other hand, is an example of a process with the Noah effect. An anomalous scaling can also be induced by a non-stationary increment distribution [80]. In this case, we deal with the Moses effect. It should help to handle ATTM and SBM trajectories.

All three effects may be quantified by exponents, which will be used as features in our extended set of attributes. Given a stochastic process Xt and the corresponding increment process δt (τ) = Xt+τ − Xt , the Joseph, Moses and Noah exponents are defined as follows.

  • (a)  
    Joseph exponent J is estimated from the ensemble average of the rescaled range statistic,
    $E\left[{R}_{t}/{S}_{t}\right]\sim {t}^{J},$
    where ${R}_{t}={\mathrm{max}}_{1\leqslant s\leqslant t}[{X}_{s}-\frac{s}{t}{X}_{t}]-{\mathrm{min}}_{1\leqslant s\leqslant t}[{X}_{s}-\frac{s}{t}{X}_{t}]$ and St is the standard deviation of the process Xt .
  • (b)  
    Moses exponent M is determined from the scaling of the ensemble probability distribution of the sum of the absolute values of the increments, which can be estimated by the scaling of the median m of the probability distribution of ${Y}_{t}={\sum }_{s=0}^{t-1}\vert {\delta }_{s}\vert $, i.e. $m({Y}_{t})\sim {t}^{M+1/2}$.
  • (c)  
    Noah exponent L is extracted from the scaling of the ensemble probability distribution of the sum of the squared increments, which can be estimated by the scaling of the median of the probability distribution of ${Z}_{t}={\sum }_{s=0}^{t-1}{\delta }_{s}^{2}$, i.e. $m({Z}_{t})\sim {t}^{2L+2M-1}$.

2.5.4. Detrending moving average

The detrending moving average (DMA) statistic [81, 82] is given by

$\mathrm{DMA}(\tau )=\frac{1}{N-\tau +1}{\sum }_{i=\tau }^{N}{\left({X}_{i}-{\bar{X}}_{i}^{\tau }\right)}^{2},\qquad (31)$

for τ = 1, 2, ..., where ${\bar{X}}_{i}^{\tau }$ is a moving average of τ observations, i.e. ${\bar{X}}_{i}^{\tau }=\frac{1}{\tau +1}{\sum }_{j=0}^{\tau }{X}_{i-j}$.

As mentioned in reference [82], the DMA-based statistical tests can help in the detection of the SBM. In our model, we used two values of DMA for each trajectory as input features, namely DMA(1) and DMA(2).
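A direct numpy transcription of equation (31) for a single coordinate could look as follows; the function name and the convolution-based moving average are our own choices.

```python
import numpy as np

def dma(x, tau):
    """Detrending moving average statistic (equation (31)) for one coordinate,
    with the moving average taken over the trailing tau + 1 observations."""
    kernel = np.ones(tau + 1) / (tau + 1)
    xbar = np.convolve(x, kernel, mode="valid")   # trailing moving averages
    return np.mean((x[tau:] - xbar) ** 2)         # x[tau + k] ends window k
```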

2.5.5. Average moving window characteristics

As the moving average methods have already been successfully applied to many problems (see for example reference [83] for change point detection), we decided to accommodate them in our feature set as well. We added eight features based on the formula

Equation (32)

where ${\bar{X}}^{(m)}$ denotes a statistic of the process calculated within a window of length m and sgn is the signum function. In particular, we used the mean and the standard deviation for $\bar{X}$ and calculated the moving window (MW) characteristics with windows of lengths m = 10 and m = 20, separately for the x and y coordinates.

2.5.6. Maximum standard deviation

The idea of the MW helped us to introduce another two features based on the standard deviation σm of the process calculated within a window of length m. They are given by

$\frac{{\mathrm{max}}_{i}\,{\sigma }_{m}(i)}{\sigma }\qquad (33)$

and

$\frac{{\mathrm{max}}_{i}\,\vert {\sigma }_{m}(i+1)-{\sigma }_{m}(i)\vert }{\sigma },\qquad (34)$

where σ denotes the overall standard deviation of the sample. We used the window of length m = 3 and calculated the features for both coordinates separately. They should improve the detection of ATTM type of movements.

3. Data

XGB, i.e. the algorithm we decided to use, is an example of a supervised model. In other words, it requires a training dataset consisting of trajectories and the corresponding labels indicating their diffusion type (i.e. the diffusion model).

We used a synthetic set of trajectories produced with the andi-dataset package [84] to train the classifier. Once the trajectories were generated, a vector of features was calculated for each of them in the preprocessing phase. The resulting set consisting of vectors of attributes and their labels was used as input for the XGB model.

We generated 3 × 105 trajectories for each diffusion model. Those trajectories were evenly distributed over 20 ranges of lengths, from very short ones (n ∈ ⟨10, 50)) to really long ones (n ∈ (951, 1000⟩). The whole dataset was divided into training, test, and validation subsets in the proportion of 0.7:0.15:0.15, respectively. The classifier is initially fit on the training data. The validation set provides an unbiased evaluation of a model while tuning its hyperparameters. Test data is used to provide an unbiased assessment of the final model. Stratified sampling was used to ensure a proper balance of data in the subsets. It should be noted that in this article, the dataset is significantly larger than the one used for the AnDi challenge [58] (only 7 × 104 trajectories).
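The stratified 0.7 : 0.15 : 0.15 split can be reproduced, for instance, with two chained calls to scikit-learn's train_test_split; here X and y stand for the precomputed feature vectors and labels, and the random seed is arbitrary.

```python
from sklearn.model_selection import train_test_split

# X: one feature vector per trajectory, y: diffusion-model labels
# (both assumed to be precomputed numpy arrays).
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)
```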

The parameters used to generate the trajectories are shown in table 2. Their ranges have been constrained by the organizers of the challenge. The range of diffusion exponents prevents stationary paths. All trajectories were standardized so that the distribution of displacements over the unit time is characterized by the standard deviation equal to 1.

Table 2. Parameters of the diffusion models used to generate the synthetic trajectories.

Parameter | Meaning | Parameter range
α | Diffusion exponent | ⟨0.05, 2⟩
n | Length of a trajectory | ⟨10, 1000)
σnoise | Standard deviation (level) of noise | {0.1, 0.5, 1}

To better mimic experimental conditions, each trajectory was contaminated with a finite localization error. For that purpose, a random number drawn from a normal distribution N(0, σnoise) was added to each coordinate of the trajectory.

4. Results

We used the implementation of XGB from the xgboost module [70]. All computations were carried out on a cluster of 24 CPUs with a total memory of 25 GB. A randomized search strategy over the hyperparameter space of the classifier (the RandomizedSearchCV function in the scikit-learn module [85]) was deployed in order to optimize the model. The final values of the hyperparameters are summarized in table 3. The parameter min_child_weight indicates the minimum sum of instance weights (corresponding to the number of instances) needed in a child node of a single decision tree. If the tree partition step results in a leaf node with a sum of instance weights less than this value, the building process will give up further partitioning. max_depth determines the maximum depth of a tree. The value of gamma indicates the minimum loss reduction required to make a further split on a leaf node of the tree. The evaluation metric for the validation step is specified by eval_metric; the mean absolute error (mae) is used in our case. eta is the learning rate, used to prevent overfitting of the classifier. Finally, base_score is the global bias, i.e. the initial prediction score of all instances.
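A schematic of such a search is shown below; the sampled ranges are illustrative, not the exact grid we explored.

```python
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

param_space = {                     # illustrative ranges only
    "min_child_weight": randint(1, 10),
    "max_depth": randint(3, 12),
    "gamma": randint(0, 6),
    "learning_rate": [0.1, 0.3, 0.5],
}
search = RandomizedSearchCV(XGBClassifier(), param_space, n_iter=50, cv=3)
search.fit(X_train, y_train)        # data from the split sketched in section 3
print(search.best_params_)
```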

Table 3. Hyperparameters of the final XGB models used for classification.

Hyperparameter | Value
min_child_weight | 7
max_depth | 10
gamma | 4
eval_metric | mae
eta | 0.3
base_score | 1

The values of the hyperparameters were optimized for the original set of features. However, the same values of hyperparameters were also used with the extended set for comparison purposes.

4.1. Performance of the classifiers

The performance of the classifiers in task 2 of the AnDi challenge was evaluated with the F1 score,

${F}_{1}=\frac{\mathrm{TP}}{\mathrm{TP}+\frac{1}{2}\left(\mathrm{FP}+\mathrm{FN}\right)},\qquad (35)$

where TP, FP, and FN are the true positive, false positive, and false negative rates, respectively. Since we are dealing here with a multilabel classification problem, a micro-averaged F1 score was calculated to assess the aggregated contributions of all classes (i.e. the sums of all TP, FP and FN were first determined and then inserted into equation (35)). For balanced sets (same number of samples for each class), like those of the AnDi challenge, the micro-averaged F1 score coincides with the accuracy of the classifier.
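In scikit-learn this corresponds to the following (clf, X_test and y_test as in the earlier sketches):

```python
from sklearn.metrics import accuracy_score, f1_score

y_pred = clf.predict(X_test)
# Micro-averaging pools TP, FP and FN over all five classes before applying
# equation (35); for balanced data it coincides with plain accuracy.
print(f1_score(y_test, y_pred, average="micro"))
print(accuracy_score(y_test, y_pred))
```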

Results for the performance of the classifiers are shown in table 4. As we can see, the model with the original set of features achieves an accuracy of 0.73. After adding the features described in section 2.5 we observe a significant improvement of its performance. The accuracy of the extended version (0.83) is comparable with the best methods of the AnDi challenge [59].

Table 4. Micro-averaged F1 score (i.e. accuracy) of both the original and the extended classifiers (see sections 2.4 and 2.5 for more details).

Base XGB model | Extended XGB model
0.73 | 0.83
Figure 2. Normalized confusion matrices of the models. Rows correspond to the true labels and columns to the predicted ones.

Inspection of the confusion matrices of the classifiers may give us additional insight into the partial contributions to the overall performance. To recall, an element cij of the confusion matrix indicates how many observations known to belong to class i (true label) are predicted to be in class j (predicted label) [68]. Normalized confusion matrices for both models are shown in figure 2. The best performance of the base model is observed for LWs. Among all LW samples in the test data, 92% were correctly recognized. Most of the misclassified LW trajectories were assigned to FBM (5%) and SBM (2%). The accuracy of the classifier for CTRW (85%) and SBM (81%) is smaller, but still quite good. The incorrectly classified CTRW trajectories were usually confused with ATTM. In the case of SBM, many trajectories were assigned to the ATTM (11%) and FBM (6%) classes.

We did not expect such a moderate performance of the original classifier for FBM (72%). In our earlier attempts [49, 51, 52], we usually got much higher performance for this type of motion. However, it should be stressed that previously we trained our classifiers on a different set of trajectories than the one used for the AnDi challenge. Moreover, we usually used longer trajectories. Thus, it seems that the features in the original set lack the power to differentiate FBM successfully from the other types of motion considered in this paper, in particular from SBM (19% of misclassifications) and LW (7%).

Finally, it should be emphasized that the performance of the base classifier for ATTM is unsatisfactory (37%). Most of the ATTM trajectories are confused with SBM (37%) and CTRW (22%). It is clear that the base model cannot separate trajectories originating from the different models with a time-varying diffusion coefficient D.

Adding the features from section 2.5 to the input vectors has significantly improved the overall accuracy of the classifier. As can be seen in the right plot of figure 2, all of the partial contributions to the accuracy improved as well. Although the performance for ATTM remains weak (61% only), it is much higher than in the previous case. As for the other classes, the modified classifier achieves very good accuracies for LW (98%) and CTRW (94%) and good ones for SBM (83%) and FBM (82%). The biggest challenge is still to distinguish ATTM and FBM from SBM trajectories.

4.2. Impact of trajectory lengths

In figure 3, the F1 scores for different ranges of the trajectory lengths in the case of the extended model are shown. Not surprisingly, the classifier struggles a lot with very short trajectories. In this regime, many of the paths are indistinguishable from normal diffusion. Moreover, in the case of the ATTM and SBM models, the information contained in the short trajectories may not be enough for detecting the time-varying diffusion coefficient D. Interestingly, the accuracy of the FBM classification increases with the sample length to reach around 90% for trajectories longer than 500 steps.

Figure 3. F1 scores as functions of different lengths of the trajectories. Columns represent the F1 score for each type of the motion, rows correspond to ranges of the lengths.

It is worth mentioning that the difficulties with very short trajectories are not specific to the XGB in particular or to feature-based methods in general. All methods presented in the challenge were afflicted by similar problems in this regime [59].

4.3. Impact of the level of noise

In figure 4, the F1 scores for different levels of noise (i.e. different values of its standard deviation) are presented in the case of our extended model. While the classifier performs quite well for CTRW and LW even in the case of high noise levels, its accuracy significantly decreases for the remaining diffusion models. It seems that the classifier is not able to capture enough characteristics of the ATTM, SBM and FBM types of movements from noisy trajectories in order to successfully distinguish them from each other. Interestingly, in a series of computer experiments we observed that a classifier trained on very noisy data still performs reasonably well on a test set with low levels of noise.

Figure 4. F1 scores for different levels of trajectories' noise. Columns represent F1 scores for each type of the motion, rows correspond to standard deviations of noise.

4.4. Feature importance

One of the benefits of feature-based methods is the relative ease of their interpretation. In particular, it is possible to assign a score to each of the features indicating how important these features are to the model's predictions. This importance helps to extract relevant knowledge concerning the relationships contained in the data or learned by the model. But it can also help in the selection of the meaningful features for a simplified version of the classifier.

One of the popular feature assessment techniques is the permutation feature importance [68]. It is defined as the decrease in a classifier's score when the values of a single feature are randomly shuffled. Since this procedure breaks the relationship between the feature and the target, the drop in the score indicates how much the classifier depends on the feature. In table 5, the 20 most important features in our model according to the permutation method are presented. It is worth noting that the features added in the extended version of the model play the key role among them. In particular, the average MW features have the highest score, followed by the D'Agostino–Pearson test statistic κ2.
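With scikit-learn, such a ranking may be obtained roughly as follows; X_val, y_val and the list feature_names are assumed to be available from the earlier sketches.

```python
from sklearn.inspection import permutation_importance

result = permutation_importance(clf, X_val, y_val, n_repeats=10, random_state=0)
ranking = result.importances_mean.argsort()[::-1]
for i in ranking[:20]:   # 20 most important features, as in table 5
    print(feature_names[i], result.importances_mean[i], result.importances_std[i])
```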

Table 5. Permutation feature importance values for the XGB classifier.

Model characteristic | Mean feature importance | Std. deviation
mw_y_mean10 | 0.1089 | 0.0015
mw_x_mean10 | 0.1013 | 0.0008
dagostino_y | 0.0779 | 0.0016
dagostino_x | 0.0727 | 0.0010
M | 0.0645 | 0.0003
mw_y_mean20 | 0.0516 | 0.0009
mw_x_mean20 | 0.0497 | 0.0007
max_std_y | 0.0422 | 0.0013
max_std_x | 0.0402 | 0.0010
alpha | 0.0295 | 0.0008
p_var_1 | 0.0283 | 0.0005
fractal_dimension | 0.0272 | 0.0007
mean_gaussianity | 0.0240 | 0.0004
vac_lag_1 | 0.0211 | 0.0013
ksstat_chi2 | 0.0194 | 0.0011
max_std_change_x | 0.0178 | 0.0012
max_std_change_y | 0.0111 | 0.0007
max_ts | 0.0078 | 0.0007
L | 0.0074 | 0.0005
D | 0.0068 | 0.0007

Since the permutation importance is known to have several drawbacks when working with correlated features [86], and this is the case in our model (e.g. we have several features based on the MSD), we decided to check the gain importance as well [87]. In this approach, the importance of a feature represents the relative contribution of that feature to the model, calculated by taking each feature's contribution for each tree in the model. A higher value of this metric, when compared to other features, implies that it is more important for generating predictions. The top 20 features according to gain are listed in table 6. Although the moving average window and D'Agostino–Pearson characteristics are still very important, the anomalous exponent α and the MSD ratio now get the highest scores. But again, many of the most important features were added in the extended version of the model. Thus it seems we made some reasonable choices to improve the classification accuracy.
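In xgboost, gain importances can be read directly from the fitted booster, e.g.:

```python
# clf is the fitted XGBClassifier from the earlier sketch.
gain = clf.get_booster().get_score(importance_type="gain")
for name, score in sorted(gain.items(), key=lambda kv: -kv[1])[:20]:
    print(name, round(score, 2))
```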

Table 6. Gain feature importances for the XGB classifier.

Model characteristic | Feature importance
alpha | 177.31
mean_squared_displacement_ratio | 135.70
mw_x_mean10 | 92.59
mw_y_mean10 | 60.82
dagostino_y | 60.21
max_ts | 48.25
max_std_x | 39.82
dagostino_x | 38.16
max_std_y | 27.55
M | 24.56
fractal_dimension | 23.44
p-variation | 20.20
mean_gaussianity | 13.93
mw_y_std10 | 13.27
p_var_1 | 13.02
mw_x_std10 | 12.84
max_std_change_x | 11.01
L | 10.73
max_std_change_y | 9.88
mw_x_std20 | 9.75

Last but not least, to further elaborate on this issue, we will employ another well-known technique, the SHAP values, based on feature attribution, i.e. the impact of having a certain value of a given feature in comparison to the prediction we would make if that feature took some baseline value [88, 89]. Compared to the previous methods, SHAP values may be used to explain individual predictions. Moreover, they offer a more consistent approach to assessing the overall contribution of features to the final outcome of the model.
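For tree ensembles, the shap package provides an efficient explainer; a minimal sketch, reusing clf, X_test and feature_names from the previous examples, is:

```python
import shap

explainer = shap.TreeExplainer(clf)           # efficient for tree ensembles
# One array of attributions per diffusion class (a list or a 3D array,
# depending on the shap version).
shap_values = explainer.shap_values(X_test)
# Beeswarm summary for a single class, analogous to figures 5 and 6:
shap.summary_plot(shap_values[0], X_test, feature_names=feature_names)
```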

In table 7, SHAP values for the most important features in different diffusion models are shown, in the case of the extended model. It should be noted that for every single diffusion model, the additional features turned out to be the most important ones, in agreement with the previous methods. In other words, the additional features prove to be an excellent choice. This also shows how critical the choice of features can be for a classifier (see reference [52] for further details).

Table 7. SHAP values for most important features for different diffusion models in the case of the extended XGB model.

ATTM
Model characteristic | SHAP value
M | 0.13
max_std_x | 0.08
max_std_y | 0.08
dagostino_y | 0.07
mw_x_mean10 | 0.06
mw_y_mean10 | 0.06
mean_gaussianity | 0.06
dagostino_x | 0.06
p_var_1 | 0.05
alpha | 0.05

FBM
Model characteristic | SHAP value
M | 0.07
alpha | 0.06
dagostino_y | 0.06
dagostino_x | 0.06
max_std_x | 0.05
max_std_y | 0.05
max_std_change_y | 0.03
mean_gaussianity | 0.03
p_var_1 | 0.03
vac_lag_1 | 0.03

SBM
Model characteristic | SHAP value
M | 0.20
dagostino_y | 0.05
dagostino_x | 0.04
alpha | 0.03
max_std_y | 0.03
max_std_x | 0.03
mw_y_mean10 | 0.02
ksstat_chi2 | 0.02
vac_lag_1 | 0.02
mean_gaussianity | 0.02

CTRW
Model characteristic | SHAP value
mw_x_mean10 | 0.07
mw_y_mean10 | 0.07
fractal_dimension | 0.04
dagostino_x | 0.03
ksstat_chi2 | 0.02
mw_x_mean20 | 0.02
mw_y_mean20 | 0.02
dagostino_y | 0.02
mean_gaussianity | 0.02
p_var_1 | 0.01

LW
Model characteristic | SHAP value
max_std_x | 0.05
max_std_y | 0.05
dagostino_y | 0.02
p_var_1 | 0.02
dagostino_x | 0.02
alpha | 0.02
vac_lag_2 | 0.01
max_std_change_y | 0.01
max_std_change_x | 0.01
mw_y_mean10 | 0.01

The Moses exponent M, which relates to non-stationary increments, turned out to be the essential attribute for distinguishing ATTM and SBM from the rest of models; it seems to be quite important for FBM as well. However, in the latter case, the anomalous exponent is the most meaningful feature, followed by the D'Agostino–Pearson test statistic κ2. Characteristics based on the average MW have the highest importance for CTRW. And finally, the maximum standard deviation feature is the top one for LW.

In figures 5 and 6, we may check how each of the 20 most relevant features impacts the decisions of the classifier for each diffusion model. A dot in the plot corresponds to a single instance of the explanation given by the classifier. The characteristics on the Y axis are sorted by their mean importance. The SHAP values on the horizontal axis reflect the positive or negative influence of each factor on the model prediction. A positive SHAP value increases the likelihood that the sample will be classified as a specific model, whereas a negative SHAP value indicates that the observation more likely belongs to one of the other classes. Larger values of SHAP (positive or negative) indicate a greater impact on the final prediction. Additionally, the colormap represents the feature values, red being high and blue being low (the color values are relative). The dependent variable Y is positively correlated with high SHAP values on the positive side of the X axis. In other words, those values have a beneficial effect on the prediction. For instance, in the case of CTRW, the larger the value of mw_y_mean10, the higher the SHAP attribution, and thus the likelihood of classification as CTRW increases.

Figure 5. SHAP values for the 20 most relevant features in the case of CTRW and LW. Dots represent the value for a single explanation (trajectory). The characteristics on the vertical axis are sorted by their mean importance. The SHAP values on the horizontal axis reflect the positive or negative influence of each factor on the model prediction. A positive SHAP value improves the likelihood that the sample will be classified as a specific model. A longer tail indicates a greater impact on the prediction. The dependent variable Y is positively correlated with high SHAP values on the positive side of the X axis.
Figure 6. SHAP values for the 20 most relevant features in the case of ATTM, FBM, and SBM. See caption of figure 5 for further details.

As can be seen, the SHAP values for CTRW and LW indicate that our classifier is more likely to distinguish these classes from the rest (some features have a long-tailed distribution of values and the colors indicating positive and negative impact are well separated from each other). In contrast, the ATTM, FBM and SBM models reveal a high concentration of samples near 0 for many features. Moreover, the colors are mixed with each other, indicating that there is no unique relationship between the SHAP value of a feature and its impact on the prediction. In this case, only a combination of all features can give us a reasonable outcome. In other words, it would hardly be possible to build a statistical test operating on a single feature (among the ones present in our set) to distinguish these three models.

5. Conclusion

In this paper, we have introduced two sets of attributes that may be used for the classification of SPT data with the help of feature-based machine learning methods. The first of these sets was proposed as our contribution to task 2 of the AnDi challenge [58, 59]. The second one is the result of our attempt to improve the performance of the classifier presented in the challenge.

The XGB was used as the classification model. It is an ensemble method, which combines multiple decision trees to obtain better predictive performance. Although the deep-learning approach constitutes the state-of-the-art technology for data classification in many domains, we deliberately decided to pick one of the traditional machine learning algorithms due to its superior interpretability.

Our original method turned out to be inferior to the top AnDi challenge teams, but still offered a reasonable performance of 73%. Moreover, further elaboration on the feature set after the challenge allowed us to significantly improve its accuracy by adding some new characteristics and to achieve performance similar to the winning teams (83% in 2D). Although the final accuracy is already quite good, there is still room for improvement, in particular in the case of short trajectories. Thus, more research into meaningful characteristics is encouraged.

Our results show how crucial the choice of features is for good classification performance. Moreover, we were able to identify the most important features for all diffusion types. This not only contributes to our overall understanding of the decisions made by the classifier. The analysis of feature importance may also be helpful in the selection of meaningful features that can then be used to build simpler classifiers trained on subsets of the attributes.

Acknowledgments

The work of PK, HL-O, and JS was supported by NCN-DFG Beethoven Grant No. 2016/23/G/ST1/04083. Calculations have been partially carried out using resources provided by Wrocław Centre for Networking and Supercomputing (http://wcss.pl).

Data availability statement

The data generated and/or analysed during the current study are not publicly available for legal/ethical reasons but are available from the corresponding author on reasonable request.
