A review of classification algorithms for EEG-based brain–computer interfaces: a 10 year update

F Lotte; L Bougrain; A Cichocki; M Clerc; M Congedo; A Rakotomamonjy; F Yger

doi:10.1088/1741-2552/aab2f2

1. Introduction

A brain–computer interface (BCI) can be defined as a system that translates the brain activity patterns of a user into messages or commands for an interactive application, this activity being measured and processed by the system [44, 139, 229]. A BCI user's brain activity is typically measured using electroencephalography (EEG). For instance, a BCI can enable a user to move a cursor to the left or to the right of a computer screen by imagining left or right hand movements, respectively [230]. As they make computer control possible without any physical activity, EEG-based BCIs promise to revolutionize many application areas, notably to enable severely motor-impaired users to control assistive technologies, e.g. text input systems or wheelchairs [181], as rehabilitation devices for stroke patients [8], as new gaming input devices [52], or to design adaptive human–computer interfaces that can react to the user's mental states [237], to name a few [45, 216].

In order to use a BCI, two phases are generally required: (1) an offline training phase during which the system is calibrated and (2) the operational online phase in which the system can recognize brain activity patterns and translate them into commands for a computer [136]. An online BCI system is a closed-loop, starting with the user producing a specific EEG pattern (e.g. using motor imagery) and these EEG signals being measured. Then, EEG signals are typically pre-processed using various spatial and spectral filters [23], and features are extracted from these signals in order to represent them in a compact form [140]. Finally, these EEG features are classified [141] before being translated into a command for an application [45] and before feedback is provided to users to inform them whether a specific mental command was recognized or not [170].

Although much effort is under way towards calibration-free modes of operation, an offline calibration is currently used and is necessary in most BCIs to obtain a reliable system. In this stage, the classification algorithm is calibrated and the optimal features from multiple EEG channels are selected. For this calibration, a training data set needs to be pre-recorded from the user. EEG signals are highly user-specific, and as such, most current BCI systems are calibrated specifically for each user. This training data set contains EEG signals recorded while the user performed each mental task of interest several times, according to given instructions.

There are various key elements in the BCI closed-loop, one being the classification algorithms, a.k.a. classifiers, used to recognize the users' EEG patterns based on EEG features. There was, and still is, a large diversity of classifier types that are used and have been explored to design BCIs, as presented in our 2007 review of classifiers for EEG-based BCIs [141]. Now, approximately ten years after this initial review was published, many new algorithms have been designed and explored in order to classify EEG signals in BCI, and BCIs are more popular than ever. We therefore believe that the time is ripe to update this review of EEG classifiers. Consequently, in this paper, we survey the literature on BCI and machine learning from 2007 to 2017 in order to identify which new EEG classification algorithms have been investigated to design BCI, and which appear to be the most efficient. Note that we also include in the present review machine learning methods for EEG feature extraction, notably to optimize spatial filters, which have become a key component of BCI classification approaches. We synthesize these readings in order to present these algorithms, and to report how they were used for BCIs and what the outcomes were. We also identify their pros and cons in order to provide guidelines regarding how and when to use a specific classification method, and propose some challenges that must be solved to enable further progress in EEG signal classification.

This paper is organized as follows. Section 2 briefly presents the typically used EEG feature extraction and selection techniques, as these features are usually the input to classifiers. It also summarizes the classifier performance evaluation metrics. Then, section 3 provides a summary of the classifiers that were used for EEG-based BCIs up to 2007, many of which are still in use today, as well as the challenges faced by current EEG classification methods. Section 4 constitutes the core of the paper, as it reviews the classification algorithms for BCI that have been explored since 2007 to address these various challenges. These algorithms are discussed in section 5, where we also propose guidelines on how and when to use them, and identify some remaining challenges. Finally, section 6 concludes the paper.

2. Feature extraction and selection, and performance measures in brief

The present paper is dedicated to classification methods for BCI. However, most pattern recognition/machine learning pipelines, and BCIs are no exception, not only use a classifier, but also apply feature extraction/selection techniques to represent EEG signals in a compact and relevant manner. In particular for BCI, EEG signals are typically filtered both in the time domain (band-pass filter) and in the spatial domain (spatial filter) before features are extracted from the resulting signals. The best subsets of features are then identified using feature selection algorithms, and these features are used to train a classifier. This process is illustrated in figure 1. In this section, we briefly discuss which features are typically used in BCI, how to select the most relevant features amongst these and how to evaluate the resulting pattern recognition pipeline.

**Figure 1.** Typical classification process in EEG-based BCI systems. The oblique arrow denotes algorithms that can be or have to be optimized from data. A training phase is typically necessary to identify the best filters and features and to train the classifier. The resulting filters, features and classifier are then used online to operate the BCI.

2.1. Feature extraction

While there are many ways in which EEG signals can be represented (e.g. [16, 136, 155]), the two most common types of features used to represent EEG signals are frequency band power features and time point features.

Band power features represent the power (energy) of EEG signals for a given frequency band in a given channel, averaged over a given time window (typically 1 second for many BCI paradigms). Band power features can be computed in various ways [28, 87], and are extensively used for BCIs exploiting oscillatory activity, i.e. changes in EEG rhythm amplitudes. As such, band power features are the gold standard features for BCI based on motor and mental imagery, for many passive BCI aiming at decoding mental states such as mental workload or emotions, or for steady state visual evoked potential (SSVEP)-based BCIs.
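As a purely illustrative sketch (one of the many possible ways of computing band power, and not taken from any of the reviewed studies), the following Python code estimates band power per channel from a periodogram. The function name `band_power`, the 250 Hz sampling rate, the band limits and the synthetic signals are all assumptions made for the example.

```python
import numpy as np

def band_power(epoch, fs, band):
    """Mean power per channel within `band` (in Hz) for one epoch.

    epoch: array of shape (n_channels, n_samples); fs: sampling rate in Hz.
    """
    n_samples = epoch.shape[1]
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    # Periodogram: squared FFT magnitude, normalised by the window length.
    psd = np.abs(np.fft.rfft(epoch, axis=1)) ** 2 / n_samples
    mask = (freqs >= band[0]) & (freqs <= band[1])
    return psd[:, mask].mean(axis=1)

# Synthetic 1 s epoch (assumed 250 Hz): channel 0 carries a 10 Hz rhythm,
# channel 1 a 20 Hz rhythm, both with a little additive noise.
fs = 250
t = np.arange(fs) / fs
rng = np.random.default_rng(0)
epoch = np.vstack([np.sin(2 * np.pi * 10 * t), np.sin(2 * np.pi * 20 * t)])
epoch = epoch + 0.1 * rng.standard_normal(epoch.shape)

alpha = band_power(epoch, fs, (8, 12))    # mu/alpha band
beta = band_power(epoch, fs, (16, 24))    # beta band
features = np.log(np.concatenate([alpha, beta]))  # log band power feature vector
```

Taking the logarithm is a common choice to make the feature distribution closer to Gaussian; it is optional here.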

Time point features are a concatenation of EEG samples from all channels. Typically, such features are extracted after some pre-processing, notably band-pass or low-pass filtering and down-sampling. They are the typical features used to classify Event Related Potentials (ERP), which are temporal variations in EEG signals' amplitudes time-locked to a given event/stimulus [22, 136]. These are the features used in most P300-based BCI.
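The following sketch illustrates how such a feature vector could be built. It assumes a crude low-pass filter implemented by block averaging, which also performs the down-sampling; an actual pipeline would typically use a properly designed filter. The function name `time_point_features`, the decimation factor and the synthetic epoch are hypothetical.

```python
import numpy as np

def time_point_features(epoch, decim=8):
    """Block-average (crude low-pass + down-sampling), then concatenate channels.

    epoch: array of shape (n_channels, n_samples), time-locked to the stimulus.
    """
    n_channels, n_samples = epoch.shape
    n_keep = (n_samples // decim) * decim           # drop the incomplete last block
    blocks = epoch[:, :n_keep].reshape(n_channels, -1, decim)
    smoothed = blocks.mean(axis=2)                  # one value per block of `decim` samples
    return smoothed.ravel()                         # concatenation over all channels

rng = np.random.default_rng(0)
epoch = rng.standard_normal((8, 256))               # synthetic stand-in for one ERP epoch
features = time_point_features(epoch, decim=8)      # 8 channels x 32 points
```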

Both types of features benefit from being extracted after spatial filtering [22, 136, 185, 188]. Spatial filtering consists of combining the original sensor signals, usually linearly, which can result in a signal with a higher signal-to-noise ratio than that of individual sensors. Spatial filtering can be data independent, e.g. based on physical considerations regarding how EEG signals travel through the skin and skull, leading to spatial filters such as the well-known Laplacian filter [159] or inverse solution based spatial filtering [18, 101, 124, 173]. Spatial filters can also be obtained in a data-driven and unsupervised manner with methods such as principal component analysis (PCA) or independent component analysis (ICA) [98]. Finally, spatial filters can be obtained in a data-driven manner, with supervised learning, which is currently one of the most popular approaches. Supervised spatial filters include the well-known common spatial patterns (CSP) [23, 185], dedicated to band-power features and oscillatory activity BCI, and spatial filters such as xDAWN [188] or Fisher spatial filters [92] for ERP classification based on time point features. Owing to the good classification performances obtained by such supervised spatial filters in practice, many variants of such algorithms have been developed that are more robust to noise or non-stationary signals, using regularization approaches, robust data averaging, and/or new divergence measures (e.g. [143, 187, 194, 211, 233]). Similarly, extensions of these approaches have been proposed to optimize spectral and spatial filters simultaneously (e.g. the popular filter bank CSP (FBCSP) method [7] and others [61, 88, 161]). Finally, some approaches have combined both physically-driven spatial filters based on inverse models with data-driven spatial filters (e.g. [49, 148]).
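To make the idea of a supervised spatial filter concrete, here is a minimal sketch of two-class CSP, computed by whitening the composite covariance matrix and then diagonalising one class covariance. It is a textbook-style formulation rather than the exact implementation of any cited work; the function names, the trace normalisation and the synthetic data are assumptions.

```python
import numpy as np

def csp_filters(epochs_a, epochs_b, n_pairs=1):
    """Two-class CSP. epochs_*: arrays (n_trials, n_channels, n_samples).

    Returns W of shape (2 * n_pairs, n_channels); each row is a spatial filter.
    """
    def mean_cov(epochs):
        return np.mean([x @ x.T / np.trace(x @ x.T) for x in epochs], axis=0)

    cov_a, cov_b = mean_cov(epochs_a), mean_cov(epochs_b)
    # Whitening matrix P such that P (cov_a + cov_b) P^T = I.
    evals, evecs = np.linalg.eigh(cov_a + cov_b)
    whitener = (evecs / np.sqrt(evals)).T
    # Diagonalise class A in the whitened space (eigenvalues in ascending order).
    _, rot = np.linalg.eigh(whitener @ cov_a @ whitener.T)
    filters = rot.T @ whitener
    # Keep the filters with the most extreme variance ratios between classes.
    picks = np.r_[np.arange(n_pairs), np.arange(-n_pairs, 0)]
    return filters[picks]

def log_variance(W, epoch):
    """Normalised log-variance of the spatially filtered signals."""
    var = np.var(W @ epoch, axis=1)
    return np.log(var / var.sum())

# Synthetic data: class A has more power on channel 0, class B on channel 2.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 3, 200)); A[:, 0, :] *= 3
B = rng.standard_normal((20, 3, 200)); B[:, 2, :] *= 3
W = csp_filters(A, B)
fa, fb = log_variance(W, A[0]), log_variance(W, B[0])
```

With this ordering, the first filter maximises variance for class B and the last one for class A, so the two log-variance features move in opposite directions for the two classes.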

While band power or time point features extracted after spatial filtering are by far the most common features used in current EEG-based BCIs, it should be mentioned that other feature types have been explored and used. Firstly, an increasingly used type is connectivity features. Such features measure the correlation or synchronization between signals from different sensors and/or frequency bands. This can be measured using features such as spectral coherence, phase locking values or directed transfer functions, among many others [31, 79, 110, 167, 225, 240]. Researchers have also explored various EEG signal complexity measures or higher order statistics as features of EEG signals (e.g. [11, 29, 135, 248]). Finally, rather than using vectors of features, recent research has also explored how to represent EEG signals by covariance matrices or by tensors (i.e. arrays and multi-way arrays, with two or more dimensions), and how to classify these matrices or tensors directly [38, 47, 232]. Such approaches are discussed in section 4.2. It should be mentioned that when using matrix or tensor decompositions, the resulting features are linear combinations of various sensors' data, time points or frequencies (among others). As such they may not have an obvious physical/physiological interpretation, but nonetheless prove useful for BCI design.

Finally, it is interesting to note that several BCI studies have reported that combining various types of features, e.g. time points with band powers or band powers with connectivity features, generally leads to higher classification accuracies as compared to using a single feature type (e.g. [29, 60, 70, 93, 166, 191]). Combining multiple feature types typically increases dimensionality; hence it requires the selection of the most relevant features to avoid the curse-of-dimensionality. Methods to reduce dimensionality are described in the following section.

2.2. Feature selection

A feature selection step can be applied after the feature extraction step to select a subset of features with various potential benefits [82]. Firstly, among the various features that one may extract from EEG signals, some may be redundant or may not be related to the mental states targeted by the BCI. Secondly, the number of parameters that the classifier has to optimize is positively correlated with the number of features. Reducing the number of features thus leads to fewer parameters to be optimized by the classifier. It also reduces possible overtraining effects and can thus improve performance, especially if the number of training samples is small. Thirdly, from a knowledge extraction point of view, if only a few features are selected and/or ranked, it is easier to observe which features are actually related to the targeted mental states. Fourthly, a model with fewer features and consequently fewer parameters can produce faster predictions for a new sample, as it should be computationally more efficient. Fifthly, collection and storage of data will be reduced. Three feature selection approaches have been identified [106]: the filter, wrapper and embedded approaches. Many alternative methods have been proposed for each approach.

Filter methods rely on measures of relationship between each feature and the target class, independently of the classifier to be used. The coefficient of determination, which is the square of the estimation of the Pearson correlation coefficient, can be used as a feature ranking criterion [85]. The coefficient of determination can also be used for a two-class problem, labelling classes as −1 or +1. The correlation coefficient can only detect linear dependencies between features and classes though. To exploit non-linear relationships, a simple solution is to apply non-linear pre-processing, such as taking the square or the log of the features. Ranking criteria based on information theory can also be used e.g. the mutual information between each feature and the target variable [82, 180]. Many filter feature selection approaches require estimations of the probability densities and the joint density of the feature and class label from the data. One solution is to discretize the features and class labels. Another solution is to approximate their densities with a non-parametric method such as Parzen windows [179]. If the densities are estimated by a normal distribution, the result obtained by the mutual information will be similar to the one obtained by the correlation coefficient. Filter approaches have a linear complexity with respect to the number of features. However, this may lead to a selection of redundant features [106].
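A filter-type ranking based on the coefficient of determination can be sketched as follows, with classes labelled −1 and +1 as described above. The function name `r2_ranking` and the synthetic data are assumptions; the sketch also assumes that no feature has zero variance.

```python
import numpy as np

def r2_ranking(X, y):
    """Rank features by squared Pearson correlation with labels coded -1/+1.

    X: (n_trials, n_features); y: (n_trials,). Returns (ranking, r2 scores),
    where the ranking goes from the most to the least relevant feature.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    r = (Xc * yc[:, None]).sum(axis=0) / denom
    r2 = r ** 2
    return np.argsort(r2)[::-1], r2

# Synthetic example: only feature 2 depends on the class.
rng = np.random.default_rng(0)
y = np.repeat([-1.0, 1.0], 50)
X = rng.standard_normal((100, 4))
X[:, 2] += 2.0 * y
order, r2 = r2_ranking(X, y)
```

Each feature is scored independently of the others, which is why the cost is linear in the number of features and why redundant features can be selected together.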

Wrapper and embedded approaches solve this problem at the cost of a longer computation time. These approaches use a classifier to obtain a subset of features. Wrapper methods select a subset of features, present it as input to a classifier for training, observe the resulting performance and stop the search according to a stopping criterion or propose a new subset if the criterion is not satisfied. Embedded methods integrate the feature selection and the evaluation in a unique process, e.g. in a decision tree [27, 184] or a multilayer perceptron with optimal cell damage [37].
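The wrapper loop described above can be sketched as a greedy forward selection. This is only one possible search strategy; the nearest-class-mean classifier is a deliberately simple stand-in for whichever classifier is actually used, and all names and data below are hypothetical.

```python
import numpy as np

def nearest_mean_accuracy(X_train, y_train, X_val, y_val):
    """Validation accuracy of a nearest-class-mean classifier (stand-in classifier)."""
    classes = np.unique(y_train)
    means = np.array([X_train[y_train == c].mean(axis=0) for c in classes])
    dists = ((X_val[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return np.mean(classes[dists.argmin(axis=1)] == y_val)

def forward_selection(X_train, y_train, X_val, y_val, max_features=3):
    """Greedy wrapper: repeatedly add the feature that most improves accuracy."""
    n_features = X_train.shape[1]
    selected, best_acc = [], 0.0
    while len(selected) < min(max_features, n_features):
        candidates = [j for j in range(n_features) if j not in selected]
        scores = [nearest_mean_accuracy(X_train[:, selected + [j]], y_train,
                                        X_val[:, selected + [j]], y_val)
                  for j in candidates]
        if max(scores) <= best_acc:       # stopping criterion: no improvement
            break
        best_acc = max(scores)
        selected.append(candidates[int(np.argmax(scores))])
    return selected, best_acc

# Synthetic example: only feature 3 carries class information.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 100)
X = rng.standard_normal((200, 5))
X[:, 3] += 3.0 * y
idx = rng.permutation(200)
train, val = idx[:100], idx[100:]
selected, acc = forward_selection(X[train], y[train], X[val], y[val])
```

Every candidate subset requires training and evaluating the classifier, which illustrates why wrapper methods are more costly than filter methods.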

Feature selection has provided important improvements in BCI, e.g. the stepwise linear discriminant analysis (embedded method) for P300-BCI [111] and frequency bands selection for motor imagery using maximal mutual information (filter method) [7]. Let us also mention the support vector machine for channel selection [115], linear regressor for knowledge extraction [123], genetic algorithms for spectral feature selection [50] and P300-based feature selection [201], or evolutionary algorithms for feature selection based on multiresolution analysis [176] (all being wrapper methods). Indeed, metaheuristic techniques (also including ant colony, swarm search, tabu search and simulated annealing) [152] are becoming more and more frequently used for feature selection in BCI [174] in order to avoid the curse-of-dimensionality.

Other popular methods used in EEG-based BCIs notably include filter methods such as maximum relevance minimum redundancy (mRMR) feature selection [166, 180] or R² feature selection [169, 217]. It should be mentioned that five feature selection methods, namely information gain ranking, correlation-based feature selection, Relief (an instance-based feature ranking method for multiclass problems), consistency-based feature selection and 1R Ranking (one-rule classification) have been evaluated on the BCI competition III data sets [107]. Amongst ten classifiers, the top three feature selection methods were correlation-based feature selection, information gain and 1R ranking, respectively.

2.3. Performance measures

To evaluate BCI performance, one must bear in mind that different components of the BCI loop are at stake [212]. Regarding the classifier alone, the most basic performance measure is the classification accuracy. This is valid only if the classes are balanced [66], i.e. with the same number of samples per class and if the classifier is unbiased, i.e. it has the same performance for each class [199]. If these conditions are not met, the Kappa metric or the confusion matrix are more informative performance measures [66]. The sensitivity-specificity pair, or precision, can be computed from the confusion matrix. When the classification depends on a continuous parameter (e.g. a threshold), the receiver operating characteristic (ROC) curve, and the area under the curve (AUC) are often used.
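The difference between accuracy and the Kappa metric on unbalanced classes can be illustrated with a short sketch using the standard definition of Cohen's kappa. The function names and the toy data are assumptions, and the sketch does not handle the degenerate case where chance agreement equals 1.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

def accuracy_and_kappa(cm):
    """Classification accuracy and Cohen's kappa from a confusion matrix."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    p_observed = sum(cm[i][i] for i in range(n)) / total
    # Chance agreement: product of row and column marginals, summed over classes.
    p_chance = sum(sum(cm[i]) * sum(row[i] for row in cm) for i in range(n)) / total ** 2
    return p_observed, (p_observed - p_chance) / (1 - p_chance)

# Unbalanced toy example: 90 trials of class 0 and 10 of class 1, with a
# classifier that always answers class 0.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100
cm = confusion_matrix(y_true, y_pred, 2)
acc, kappa = accuracy_and_kappa(cm)
```

Here the accuracy is high although the classifier has learned nothing, whereas kappa is zero because the agreement does not exceed chance level.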

Classifier performance is generally computed offline on pre-recorded data, using a hold-out strategy: some datasets are set aside to be used for the evaluation, and are not part of the training dataset. However, some authors also report cross-validation measures estimated on training data, which may over-rate the performance.

The contribution of classifier performance to overall BCI performance strongly depends on the orchestration of the BCI subcomponents. This orchestration is highly variable given the variety of BCI systems (co-adaptive, hybrid, passive, self- or system-paced). The reader is referred to [212] for a comprehensive review of evaluation strategies in such BCI contexts.

3. Past methods and current challenges

3.1. A brief overview of methods used ten years ago

In our original review of classification algorithms for EEG-based BCIs published ten years ago, we identified five main families of classifiers that had been explored: linear classifiers, neural networks, non-linear Bayesian classifiers, nearest neighbour classifiers and classifier combinations [141].

Linear classifiers gather discriminant classifiers that use linear decision boundaries between the feature vectors of each class. They include linear discriminant analysis (LDA), regularized LDA and support vector machines (SVMs). Both LDA and SVM were, and still are, the most popular types of classifiers for EEG-based BCIs, particularly for online and real-time BCIs. The previous review highlighted that in terms of performances, SVM often outperformed other classifiers.

Neural networks (NN) are assemblies of artificial neurons, arranged in layers, which can be used to approximate any non-linear decision boundary. The most common type of NN used for BCI at that time was the multi-layer perceptron (MLP), typically employing only one or two hidden layers. Other NN types were explored more marginally, such as the Gaussian classifier NN or learning vector quantization (LVQ) NN.

Non-linear Bayesian classifiers model the probability distributions of each class and use Bayes' rule to select the class to assign to the current feature vector. Such classifiers notably include Bayes quadratic classifiers and hidden Markov models (HMMs).

Nearest neighbour classifiers assign a class to the current feature vector according to its nearest neighbours. Such neighbours could be training feature vectors or class prototypes. Such classifiers include the k-nearest neighbour (kNN) algorithm or Mahalanobis distance classifiers.

Finally, classifier combinations are algorithms combining multiple classifiers, either by combining their outputs and/or by training them in ways that maximize their complementarity. Classifier combinations used for BCI at the time included boosting, voting or stacking combination algorithms. Classifier combinations appeared to be amongst the best performing classifiers for EEG-based BCIs, at least in offline evaluations.

3.2. Challenges faced by current EEG signal classification methods

Ten years ago, most classifiers explored for BCI were rather standard classifiers used in multiple machine learning problems. Since then, research efforts have focused on identifying and designing classification methods dedicated to the specificities of EEG-based BCIs. In particular, the main challenges faced by classification methods for BCI are the low signal-to-noise ratio of EEG signals [172, 228], their non-stationarity over time, within or between users, with same-user EEG signals varying between or even within runs [56, 80, 109, 145, 164, 202], the limited amount of training data that is generally available to calibrate the classifiers [108, 137], and the overall low reliability and performance of current BCIs [109, 138, 139, 229].

Therefore, most of the algorithms studied these past ten years aimed at addressing one or more of these challenges. More precisely, adaptive classifiers whose parameters are incrementally updated online were developed to deal with EEG non-stationarity in order to track changes in EEG properties over time. Adaptive classifiers can also be used to deal with limited training data by learning online, thus requiring fewer offline training data. Transfer learning techniques aim at transferring features or classifiers from one domain, e.g. BCI subjects or sessions, to another domain, e.g. other subjects or other sessions from the same subject. As such they also aim at addressing within or between-subjects non-stationarity and limited training data by complementing the few training data available with data transferred from other domains. Finally, in order to compensate for the low EEG signal-to-noise ratio and the poor reliability of current BCIs, new methods were explored to process and classify signals in a single step by merging feature extraction, feature selection and classification. This was achieved by using matrix (notably Riemannian methods) and tensor classifiers as well as deep learning. Additional methods explored were targeted specifically at learning from limited amounts of data and at dealing with multiple class problems. We describe these new families of methods in the following.

4. New EEG classification methods since 2007

4.1. Adaptive classifiers

4.1.1. Principles.

Adaptive classifiers are classifiers whose parameters, e.g. the weights attributed to each feature in a linear discriminant hyperplane, are incrementally re-estimated and updated over time as new EEG data become available [200, 202]. This enables the classifier to track possibly changing feature distributions, and thus to remain effective even with non-stationary signals such as EEG. Adaptive classifiers for BCI were first proposed in the mid-2000s, e.g. in [30, 72, 163, 202, 209], and were shown to be promising in offline analysis. Since then, more advanced adaptation techniques have been proposed and tested, including online experiments.
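As an illustration of this principle, the following sketch shows a two-class LDA whose class means are re-estimated with an exponentially weighted update each time a labelled trial arrives. It is a simplified example rather than the method of any specific cited study: the shared covariance is kept fixed, and the class name, the update coefficient `eta` and the toy data are assumptions.

```python
import numpy as np

class AdaptiveLDA:
    """Two-class LDA whose class means are updated after each labelled trial."""

    def __init__(self, mean0, mean1, cov, eta=0.05):
        self.means = [np.asarray(mean0, float), np.asarray(mean1, float)]
        self.cov_inv = np.linalg.inv(cov)   # shared covariance, kept fixed here
        self.eta = eta                      # update coefficient (assumed value)

    def predict(self, x):
        # The hyperplane is recomputed from the current class means.
        w = self.cov_inv @ (self.means[1] - self.means[0])
        b = -w @ (self.means[0] + self.means[1]) / 2
        return int(w @ x + b > 0)

    def update(self, x, label):
        # Exponentially weighted re-estimation of the mean of the observed class.
        self.means[label] = (1 - self.eta) * self.means[label] + self.eta * x

clf = AdaptiveLDA(mean0=[-1, 0], mean1=[1, 0], cov=np.eye(2), eta=0.5)
# The feature distribution drifts by +3 along the first dimension.
for _ in range(10):
    clf.update(np.array([2.0, 0.0]), 0)
    clf.update(np.array([4.0, 0.0]), 1)
```

After the updates, the decision boundary has followed the drift, whereas the initial classifier would assign both drifted trials to class 1.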

Adaptive classifiers can employ both supervised and unsupervised adaptation, i.e. with or without knowledge of the true class labels of the incoming data, respectively. With supervised adaptation, the true class labels of the incoming EEG signals are known and the classifier is retrained on the available training data augmented with these new, labelled incoming data, or is updated based on this new data only [200, 202]. Supervised BCI adaptation requires guided user training, for which the users' commands are imposed and thus the corresponding EEG class labels are known. Supervised adaptation is not possible with free BCI use, as the true label of the incoming EEG data is unknown. With unsupervised adaptation, the label of the incoming EEG data is unknown. As such, unsupervised adaptation is based on an estimation of the data class labels for retraining/updating, as discussed in [104], or is based on class-unspecific adaptation, e.g. the general all-classes EEG data mean [24, 219] or a covariance matrix [238] is updated in the classifier model. A third type of adaptation, in between supervised and unsupervised methods, has also been explored: semi-supervised adaptation [121, 122]. Semi-supervised adaptation consists of using both initial labelled data and incoming unlabelled data to adapt the classifier. For BCI, semi-supervised adaptation is typically performed by (1) initially training a supervised classifier on available labelled training data, then (2) by estimating the labels of incoming unlabelled data with this classifier, and (3) by adapting/retraining the classifier using these initially unlabelled data assigned to their estimated labels combined with the known available labelled training data. This process is repeated as new batches of unlabelled incoming EEG data become available.
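The three steps of semi-supervised adaptation can be sketched as the following loop. For brevity, a nearest-class-mean classifier is used here as an assumed stand-in for the actual classifier; function names and the one-dimensional toy data are hypothetical.

```python
import numpy as np

def fit_means(X, y):
    """Train the stand-in classifier: one mean per class (labels 0 and 1)."""
    return np.array([X[y == c].mean(axis=0) for c in (0, 1)])

def predict(means, X):
    """Assign each trial to the class with the closest mean."""
    d = ((X[:, None, :] - means[None]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def semi_supervised_adaptation(X_lab, y_lab, batches):
    means = fit_means(X_lab, y_lab)                 # (1) supervised training
    X_all, y_all = X_lab, y_lab
    for X_new in batches:                           # repeated for each new batch
        y_est = predict(means, X_new)               # (2) estimate labels
        X_all = np.vstack([X_all, X_new])           # (3) retrain on labelled data
        y_all = np.concatenate([y_all, y_est])      #     plus pseudo-labelled data
        means = fit_means(X_all, y_all)
    return means

X_lab = np.array([[-1.0], [-1.2], [1.0], [1.2]])
y_lab = np.array([0, 0, 1, 1])
batches = [np.array([[-0.8], [1.4], [1.6]])]
means = semi_supervised_adaptation(X_lab, y_lab, batches)
```

Because estimated labels are treated as if they were true, labelling errors made in step (2) propagate into the retrained classifier.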

4.1.2. State-of-the-art.

So far, the majority of the work on adaptive classifiers for BCI has been based on supervised adaptation. Multiple adaptive classifiers were explored offline, such as LDA or quadratic discriminant analysis (QDA) [200] for motor imagery-based BCI. An adaptive LDA was also proposed based on Kalman filtering to track the distribution of each class [96]. In order to deal with possibly imperfect labels in supervised adaptation, [236] proposed and evaluated offline an adaptive Bayesian classifier based on sequential Monte Carlo sampling that explicitly models uncertainty in the observed labels. For ERP-based BCI, [227] explored an offline adaptive support vector machine (SVM), adaptive LDA, a stochastic gradient-based adaptive linear classifier, and online passive-aggressive (PA) algorithms. Interestingly, McFarland and colleagues demonstrated in offline analysis of EEG data over multiple sessions that continuously retraining the weights of linear classifiers in a supervised manner improved the performance of sensori-motor rhythms (SMR)-based BCI, but not of the P300-based BCI speller [160]. However, results presented in [197] suggested that continuous adaptation was beneficial for the asynchronous P300-BCI speller, and [227] suggested the same for passive BCI based on the P300.

Online, still using supervised adaptation, both adaptive LDA and QDA have been explored successfully in [222]. In [86], an adaptive probabilistic neural network was also used for online adaptation with a motor imagery-BCI. Such a classifier models the feature distributions of each class in non-parametric fashion, and updates them as new trials become available. Classifier ensembles were also explored to create adaptive classifiers. In [119], a dynamic ensemble of five SVM classifiers was created by training a new SVM for each batch of new incoming labelled EEG trials, adding it to the ensemble and removing the oldest SVM. Classification was performed using a weighted sum of each SVM output. This approach was shown online to be superior to a static classifier.
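The sliding ensemble mechanism described for [119] can be sketched as below. This is not the authors' implementation: a difference-of-means linear scorer stands in for the SVM, and the recency-based weights are an assumption, since the weighting scheme is not detailed here. All names are hypothetical.

```python
from collections import deque
import numpy as np

def train_linear(X, y):
    """Stand-in for an SVM: difference-of-means linear scorer (labels 0/1)."""
    m0, m1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    w = m1 - m0
    return w, -w @ (m0 + m1) / 2

class DynamicEnsemble:
    def __init__(self, size=5):
        # Appending to a full deque silently discards the oldest member.
        self.members = deque(maxlen=size)

    def add_batch(self, X, y):
        """Train a new member on the latest labelled batch and add it."""
        self.members.append(train_linear(X, y))

    def predict(self, x):
        # Assumed weighting: newer members count more (weights 1, 2, ..., n).
        weights = np.arange(1, len(self.members) + 1)
        scores = np.array([w @ x + b for w, b in self.members])
        return int(weights @ scores > 0)

ensemble = DynamicEnsemble(size=5)
for k in range(7):                                   # class means drift with k
    X = np.array([[k - 1.0], [k + 1.0]])
    ensemble.add_batch(X, np.array([0, 1]))
```

After seven batches only the five most recent members remain, so the ensemble decision reflects the recent data distribution.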

Regarding supervised adaptation, it should be mentioned that adaptive spatial filters were also proposed, notably several variants of adaptive CSP [204, 247], but also adaptive xDAWN [227].

Unsupervised adaptation of classifiers is obviously much more difficult, as the class labels, hence the class-specific variability, are unknown. Thus, unsupervised methods have been proposed to estimate the class labels of new incoming samples before adapting the classifier based on this estimation. This technique was explored offline in [24] and [129], and online in [83] for an LDA classifier and Gaussian mixture model (GMM) estimation of the incoming class labels, with motor imagery data. Offline, Fuzzy C-means (FCM) was also explored instead of GMM to track the class means and covariance for an LDA classifier [130]. Similarly, a non-linear Bayesian classifier was adapted using either unsupervised or semi-supervised learning (i.e. only some of the incoming trials were labelled) using extended Kalman filtering to track the changes in the class distribution parameters with auto-regressive (AR) features [149]. Another simple unsupervised adaptation of the LDA classifier for motor imagery data was proposed and evaluated for both offline and online data [219]. The idea was to not incrementally adapt all of the LDA parameters, but only its bias, which can be estimated without knowing the class labels if we know that the data is balanced, i.e. with the same number of trials per class on average. This approach was extended to the multiclass LDA case, and evaluated in an offline scenario in [132].
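The bias-only adaptation idea can be sketched as follows, under the assumption that the classes are balanced so that the pooled mean of all trials lies halfway between the two class means. The exponentially weighted update, the class name and the value of `eta` are illustrative choices, not necessarily those of [219].

```python
import numpy as np

class BiasAdaptiveLDA:
    """LDA with fixed weights whose bias follows the pooled (all-classes) mean."""

    def __init__(self, w, global_mean, eta=0.05):
        self.w = np.asarray(w, float)               # fixed after calibration
        self.mu = np.asarray(global_mean, float)    # mean of all classes pooled
        self.eta = eta                              # update coefficient (assumed)

    def decision(self, x):
        # Bias b = -w^T mu, valid when classes are balanced.
        return self.w @ x - self.w @ self.mu

    def adapt(self, x):
        # No label needed: only the pooled mean is tracked.
        self.mu = (1 - self.eta) * self.mu + self.eta * np.asarray(x, float)

clf = BiasAdaptiveLDA(w=[1.0, 0.0], global_mean=[0.0, 0.0], eta=0.1)
before = clf.decision(np.array([1.0, 0.0]))   # drifted class-0 trial, wrongly positive
for _ in range(100):                          # balanced stream, shifted by +2
    clf.adapt([1.0, 0.0])
    clf.adapt([3.0, 0.0])
after = clf.decision(np.array([1.0, 0.0]))
```

Only a shift of the feature distribution common to both classes can be corrected this way; changes in the direction of the discriminant hyperplane are not tracked.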

Adaptation can be performed according to reinforcement signals (RS), indicating whether a trial was erroneously classified by the BCI. Such reinforcement signals can be deduced from error-related potentials (ErrP), potentials appearing following a perceived error which may have been committed by either the user or the machine [68]. In [133], an incremental logistic regression classifier was proposed, which was updated along the error gradient when a trial was judged to be misclassified according to the detection of an ErrP. The strength of the classifier update was also proportional to the probability of this ErrP. A Gaussian probabilistic classifier incorporating an RS was later proposed in [131], in which the update rules of the mean and covariance of each class depend on the probability of the RS. This classifier could thus incorporate a supervised, unsupervised or semi-supervised adaptation mode, according to whether the probability of the RS is always certain, i.e. either 0 or 1 (supervised case), uniform, i.e. uninformative (unsupervised case) or continuous with some uncertainty (partially supervised case). Using simulated supervised RS, this method was shown to be superior to static LDA and the other supervised and unsupervised adaptive LDA discussed above [131]. Evaluations with real-world data remain to be performed. Also using ErrP in offline simulations of an adaptive movement-related potential (MRP)-BCI, [9] augmented the training set with incoming trials, but only with those that were classified correctly, as determined by the absence of an ErrP following feedback to the user. They also removed the oldest trials from the training set as new trials became available. Then, the parameters of the classifier, an incremental SVM, were updated based on the updated training set. ErrP-based classifier adaptation was explored online for code-modulated visual evoked potential (c-VEP) classification in [206]. In this work, the label of the incoming trial was estimated as the one decided by the classifier if no ErrP was detected, the opposite label otherwise (for binary classification). Then, this newly labelled trial was added to the training set, and the classifier and spatial filter, a one-class SVM and canonical correlation analysis (CCA), respectively, were retrained on the new data. Finally, [239] demonstrated that classifier adaptation based on RS could also be performed using classifier confidence, and that such adaptation was beneficial to P300-BCI.
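The ErrP-based labelling rule for binary classification can be written in a few lines. The sketch below only illustrates the rule itself; the trial identifiers are placeholders, the ErrP detector is assumed to be given, and the retraining step is left out.

```python
def estimate_label(predicted_label, errp_detected):
    """Binary case: keep the classifier decision unless an ErrP followed feedback."""
    return 1 - predicted_label if errp_detected else predicted_label

# Each entry: (trial, label). The trial names are hypothetical placeholders
# for feature vectors.
training_set = [("trial_a", 0), ("trial_b", 1)]

# Incoming trials: (trial, classifier decision, ErrP detected after feedback).
incoming = [("trial_c", 1, False), ("trial_d", 0, True)]
for trial, predicted, errp in incoming:
    training_set.append((trial, estimate_label(predicted, errp)))
# The classifier (and spatial filter) would then be retrained on `training_set`.
```

The quality of the estimated labels is bounded by the accuracy of the ErrP detector, since a missed or falsely detected ErrP yields a wrong label.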

For ERP-based BCI, semi-supervised adaptation was explored with SVM and enabled the calibration of a P300-speller with less data as compared to a fixed, non-adaptive classifier [122, 151]. This method was later tested and validated online in [81]. For P300-BCI, a co-training semi-supervised adaptation was performed in [178]. In this work, two classifiers were used: a Bayesian LDA and a standard LDA. Each was initially trained on labelled training data, and then used to estimate the labels of unlabelled incoming data. The latter were labelled with their estimated class label and used as additional training data to retrain the other classifier, hence the co-training. This semi-supervised approach was shown offline to lead to higher bit-rates than a fully supervised method, which requires more supervised training data. On the other hand, offline semi-supervised adaptation with an LDA as classifier failed on mental imagery data, probably owing to the poor robustness of the LDA to mislabelling [137]. Finally, both for offline and online data, [104, 105] proposed a probabilistic method to adaptively estimate the parameters of a linear classifier in P300-based spellers, which led to a drastic reduction in calibration time, essentially removing the need for the initial calibration. This method exploited the specific structure of the P300-speller, and notably the frequency of samples from each class at each time, to estimate the probability of the most likely class label. In a related work, [78] proposed a generic method to adaptively estimate the parameters of the classifier without knowing the true class labels by exploiting any structure that the application may have. Semi-supervised adaptation was also used offline for multi-class motor imagery with a kernel discriminant analysis (KDA) classifier in [171]. This method has shown its superiority over non-adaptive methods, as well as over adaptive unsupervised LDA methods.

Vidaurre et al also explored co-adaptive training, where both the machine and the user are continuously learning, by using adaptive features and an adaptive LDA classifier [220, 221]. This enabled some users who were initially unable to control the BCI to achieve better than chance classification performances. This work was later refined in [64] by using a simpler but fully adaptive setup with auto-calibration, which proved to be effective both for healthy users and for users with disabilities [63]. Co-adaptive training, using adaptive CSP patches, proved to be even more efficient [196].

Adaptive classification approaches used in BCI are summarized in tables 1 and 2, for supervised and unsupervised methods, respectively.

Table 1. Summary of adaptive supervised classification methods explored offline.

EEG pattern	Features	Classifier	References
Motor imagery	Band power	Adaptive LDA/QDA	[200]
Motor imagery	Fractal dimension	Adaptive LDA	[96]
Motor imagery	Band power	Adaptive LDA/QDA	[222]
Motor imagery	Band power	Adaptive probabilistic NN	[86]
Motor imagery	CSP	Dynamic SVM ensemble	[119]
Motor imagery	Adaptive CSP	SVM	[204, 247]
Motor execution	AR parameters	Adaptive Gaussian classifier	[236]
P300	Time points with adaptive xDAWN	Adaptive LDA/SVM, online PA classifier	[227]

Table 2. Summary of adaptive unsupervised classification methods explored.

EEG pattern	Features	Classifier	References
Motor imagery	Band power	Adaptive LDA with GMM	[24, 83, 129]
Motor imagery	Band power	Adaptive LDA with FCM	[130]
Motor execution	AR parameters	Adaptive Gaussian classifier	[149]
Motor imagery	Band power	Adaptive LDA	[132, 219]
Motor imagery	Band power	Adaptive Gaussian classifier	[131]
Motor imagery	Band power	Semi-supervised CSP+LDA	[137]
Motor imagery	Adaptive band power	Adaptive LDA	[63, 64, 220, 221]
Motor imagery	Adaptive CSP patches	Adaptive LDA	[196]
Covert attention	Band power	Incremental logistic regression	[133]
MRP	Band power	Incremental SVM	[9]
c-VEP	CCA	Adaptive one-class SVM	[206]
P300	Time points	SWLDA	[239]
P300	Time points	Semi-supervised SVM	[81, 122, 151]
P300	Time points	Co-training LDA	[178]
P300	Time points	Unsupervised linear classifier	[104, 105]
ErrP	Time points	Unsupervised linear classifier	[78]

4.1.3. Pros and cons.

Adaptive classifiers were repeatedly shown to be superior to non-adaptive ones for multiple types of BCI, notably motor-imagery BCI, but also for some ERP-based BCI. To the best of our knowledge, adaptive classifiers have not been explored for SSVEP-BCI. Naturally, supervised adaptation is the most efficient type of adaptation, as it has access to the real labels. Nonetheless, unsupervised adaptation has been shown to be superior to static classifiers in multiple studies [24, 130, 132, 149, 219]. It can also be used to shorten or even remove the need for calibration [78, 81, 105, 122, 151]. There is a need for more robust unsupervised adaptation methods, as the majority of actual BCI applications do not provide labels, and thus can only rely on unsupervised methods.

For unsupervised adaptation, reward signals, and notably ErrP, have been exploited in multiple papers (e.g. [9, 206, 239]). Note, however, that ErrP decoding from EEG signals may be a difficult task. Indeed, [157] demonstrated that the decoding accuracy of ErrP was positively correlated with the P300 decoding accuracy. This means that people who make errors in the initial BCI task (here a P300), for whom error correction and ErrP-based adaptation would be the most useful, have a lesser chance that the ErrP will be correctly decoded. There is thus a need to identify robust reward signals.

Only a few of the proposed methods were actually used online. For unsupervised methods, a simple and effective one that demonstrated its value online in several studies is adaptive LDA, proposed by Vidaurre et al [219]. This and other methods that are based on incremental adaptation (i.e. updating the algorithm's parameters rather than fully re-optimizing them) generally have a computational complexity that is low enough to be used online. Adaptive methods that require fully retraining the classifier with new incoming data generally have a much higher computational complexity (e.g. regularly retraining an SVM from scratch in real-time requires a lot of computing power), which might prevent them from being actually used online.
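To make the distinction between incremental adaptation and full retraining concrete, the following minimal Python sketch updates only the global mean (i.e. the bias) of a two-class LDA with an exponential moving average, at a cost that is linear in the number of features for each trial. It illustrates the general idea of incremental unsupervised adaptation and is not the exact algorithm of [219]; the class name `IncrementalLDA`, the update rate `eta` and the simulated feature shift are assumptions made for illustration.

```python
import numpy as np

class IncrementalLDA:
    """Two-class LDA whose bias is adapted from unlabelled trials (illustrative)."""

    def __init__(self, X, y, eta=0.05):
        # Calibration: class means and pooled covariance from labelled data.
        m0, m1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
        Xc = np.vstack([X[y == 0] - m0, X[y == 1] - m1])
        cov = Xc.T @ Xc / (len(X) - 2)
        self.w = np.linalg.solve(cov, m1 - m0)  # discriminant direction
        self.mu = 0.5 * (m0 + m1)               # global mean (decision threshold)
        self.eta = eta                          # hypothetical update rate

    def predict(self, x):
        return int(self.w @ (x - self.mu) > 0)

    def adapt(self, x):
        # Label-free update: exponential moving average of the global mean.
        # No retraining is needed; only the threshold follows the data.
        self.mu = (1 - self.eta) * self.mu + self.eta * x


rng = np.random.default_rng(0)

def sample(n, shift):
    # Simulated 2D features (not EEG): class means at -1 and +1, plus a shift.
    y = rng.integers(0, 2, n)
    X = rng.normal(size=(n, 2)) + np.where(y[:, None] == 1, 1.0, -1.0) + shift
    return X, y

X_cal, y_cal = sample(200, shift=0.0)   # calibration session
X_on, y_on = sample(400, shift=3.0)     # later session with shifted features

static = IncrementalLDA(X_cal, y_cal)
adaptive = IncrementalLDA(X_cal, y_cal)
hits_static = hits_adaptive = 0
for x, y in zip(X_on, y_on):
    hits_static += int(static.predict(x) == y)
    hits_adaptive += int(adaptive.predict(x) == y)
    adaptive.adapt(x)                   # update after the prediction is made

acc_static = hits_static / len(y_on)
acc_adaptive = hits_adaptive / len(y_on)
print(f"static: {acc_static:.2f}, adaptive: {acc_adaptive:.2f}")
```

In this simulation the shift pushes almost all trials to one side of the static decision boundary, whereas the adapted threshold follows the data. With real EEG, the update rate would have to be chosen with care, since an ill-chosen rate may track noise or lag behind genuine changes.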

However, more online studies are clearly necessary to determine how adaptation should be performed in practice, with a user in the loop. This is particularly important for mental imagery BCI in which human-learning is involved [147, 170]. Indeed, because the user is adapting to the BCI by learning how to perform mental imagery tasks so that they are recognized by the classifier, adaptation may not always help and may even be confusing to the user, as it may lead to continuously-changing feedback. Both machine and human learning may not necessarily converge to a suitable and stable solution. A recent theoretical model of this two-learner problem was proposed in [168], and indicated that adaptation that is either too fast or too slow can actually be detrimental to user learning. There is thus a need to design adaptive classifiers that ensure and favour human learning.

4.2. Classifying EEG matrices and tensors

4.2.1. Riemannian geometry-based classification.

Principles.

The introduction of Riemannian geometry in the field of BCI has challenged some of the conventions adopted in the classic classification approaches; instead of estimating spatial filters and/or selecting features, the idea of a Riemannian geometry classifier (RGC) is to map the data directly onto a geometrical space equipped with a suitable metric. In such a space, data can be easily manipulated for several purposes, such as averaging, smoothing, interpolating, extrapolating and classifying. For example, in the case of EEG data, mapping entails computing some form of covariance matrix of the data. The principle of this mapping is based on the assumption that the power and the spatial distribution of EEG sources can be considered fixed for a given mental state and such information can be coded by a covariance matrix. Riemannian geometry studies smooth curved spaces that can be locally and linearly approximated. The curved space is named a manifold and its linear approximation at each point is the tangent space. In a Riemannian manifold the tangent space is equipped with an inner product (metric) smoothly varying from point to point. This results in a non-Euclidean notion of distance between any two points (e.g. each point may be a trial) and a consequent notion of centre of mass of any number of points (figure 2). Therefore, instead of using the Euclidean distance, called the extrinsic distance, an intrinsic distance is used, which is adapted to the geometry of the manifold, and thus to the manner in which the data have been mapped [47, 232].

**Figure 2.** Schematic representation of a Riemannian manifold. EEG trials are represented by points. Left: Representation of the tangent space at point $\mathbf{G}$. The shortest path on the manifold connecting two points $\mathbf{C}_1$ and $\mathbf{C}_2$ is named the geodesic and its length is the Riemannian distance between them. Curves on the manifold through a point are mapped on the tangent space as straight lines (local approximation). Right: $\mathbf{G}$ represents the centre of mass (mean) of points $\mathbf{C}_1$, $\mathbf{C}_2$, $\mathbf{C}_3$ and $\mathbf{C}_4$. It is defined as the point minimizing the sum of the squared distances between itself and the four points. The centre of mass is often used in RGCs as a representative for a given class.

Amongst the most common matrix manifolds used for BCI applications, we encountered the manifold of Hermitian or symmetric positive definite (SPD) matrices [19] when dealing with covariance matrices estimated from EEG trials, and the Stiefel and Grassmann manifolds [62] when dealing with subspaces or orthogonal matrices. Several machine learning problems can be readily extended to those manifolds by taking advantage of their geometrical constraints (i.e. learning on manifold). Furthermore, optimization problems can be formulated specifically on such spaces, which is leading to several new optimization methods and to the solution of new problems [2]. Although related, manifold learning, which consists of empirically attempting to locate the non-linear subspace in which a dataset is defined, is different in concept and will not be covered in this paper. To illustrate these notions, consider the case of SPD matrices. The square of the intrinsic distance between two SPD matrices $\mathbf{C}_1$ and $\mathbf{C}_2$ has a closed-form expression given by

$$\delta^2\left(\mathbf{C}_1,\mathbf{C}_2\right)=\sum_n \log^2\lambda_n\left(\mathbf{C}_1^{-1}\mathbf{C}_2\right), \tag{1}$$

where $\lambda_n(\mathbf{M})$ denotes the $n$th eigenvalue of matrix $\mathbf{M}$. For SPD matrices $\mathbf{C}_1$ and $\mathbf{C}_2$, this distance is non-negative, symmetric and equal to zero if and only if $\mathbf{C}_1=\mathbf{C}_2$. Interestingly, when $\mathbf{C}_1$ and $\mathbf{C}_2$ are the means of two classes, the eigenvectors of matrix $\mathbf{C}_1^{-1}\mathbf{C}_2$ are used to define CSP filters, while its eigenvalues are used for computing their Riemannian distance [47]. Using the distance in equation (1), the centre of mass $\mathbf{G}$ of a set $\left\{\mathbf{C}_1, \ldots, \mathbf{C}_K\right\}$ of $K$ SPD matrices (figure 3), also called the geometric mean, is the unique solution to the following optimization problem

$$\arg\min_{\mathbf{G}} \sum_k \delta^2\left(\mathbf{C}_k,\mathbf{G}\right). \tag{2}$$
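As an illustration, equation (1) can be evaluated in a few lines of Python with NumPy. The sketch below is a minimal example that assumes both inputs are well-conditioned SPD matrices; the function name `riemann_distance` and the simulated trials are hypothetical.

```python
import numpy as np

def riemann_distance(C1, C2):
    """Riemannian distance of equation (1) between two SPD matrices."""
    # C1^{-1/2} C2 C1^{-1/2} has the same eigenvalues as C1^{-1} C2 but is
    # symmetric, so eigvalsh returns real, positive eigenvalues.
    w, V = np.linalg.eigh(C1)
    C1_isqrt = (V / np.sqrt(w)) @ V.T
    lam = np.linalg.eigvalsh(C1_isqrt @ C2 @ C1_isqrt)
    return np.sqrt(np.sum(np.log(lam) ** 2))

# Sanity check: the eigenvalues of C1^{-1} C2 are e and e^2, so the squared
# distance is 1^2 + 2^2 = 5.
C1 = np.eye(2)
C2 = np.diag([np.e, np.e ** 2])
d = riemann_distance(C1, C2)

# Covariance matrices of two simulated 8-channel trials (random data, not EEG).
rng = np.random.default_rng(0)
trial_a = rng.normal(size=(8, 500))
trial_b = rng.normal(size=(8, 500)) * np.linspace(1, 2, 8)[:, None]
Ca = trial_a @ trial_a.T / 500
Cb = trial_b @ trial_b.T / 500
d_ab, d_ba = riemann_distance(Ca, Cb), riemann_distance(Cb, Ca)
print(d, d_ab, d_ba)
```

Working with the symmetric whitened matrix rather than with $\mathbf{C}_1^{-1}\mathbf{C}_2$ directly is a design choice that avoids complex-valued round-off in the eigenvalues.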

**Figure 3.** Schematic of the Riemannian minimum distance to mean (RMDM) classifier for a two-class problem. From training data a centre of mass for each class is computed ($\mathbf{G}_1$ and $\mathbf{G}_2$). An unlabelled trial (question mark) is then assigned to the class whose centre of mass is the closest, $\mathbf{G}_1$ in this example. The RMDM works in the same manner for any dimension of the data, any number of classes and any BCI paradigm. It does not require any spatial filtering or feature selection, nor any parameter tuning (see text).

As discussed thoroughly in [47], this definition is analogous to the definition of the arithmetic mean $\frac{1}{K}\sum_k \mathbf{C}_k$, which is the solution of the optimization problem (2) when the Euclidean distance is used instead of the Riemannian one. In contrast to the arithmetic mean, the geometric mean does not, in general, have a closed-form solution. A fast and robust iterative algorithm for computing the geometric mean has been presented in [48]. The simplest RGC methods allow immediate classification of trials (mapped via some form of covariance matrix) by simple nearest neighbour methods, using exclusively the notion of Riemannian distance (equation (1)), and possibly the notion of geometric mean (equation (2)). For instance, the Riemannian minimum distance to mean (RMDM) classifier [13, 15] computes a geometric mean for each class using training data and then assigns an unlabelled trial to the class corresponding to the closest mean (figure 3). Another class of RGCs consists of methods that project the data points onto a tangent space and then classify them there using standard classifiers such as LDA, SVM, logistic regression, etc [13, 14]. These methods take advantage of both the Riemannian geometry and the possibility of executing complex decision functions using dedicated classifiers. An alternative approach is to project the data in the tangent space, filter the data there (for example by LDA), and map the data back onto the manifold to finally carry out the RMDM.
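The RMDM procedure described above can be sketched as follows. The geometric mean is obtained here with a basic fixed-point iteration in the tangent space (a common gradient-descent scheme, not necessarily the algorithm of [48]); all function names, the iteration limits and the simulated covariance matrices are assumptions made for illustration.

```python
import numpy as np

def _fun(C, f):
    """Apply a scalar function to the eigenvalues of a symmetric matrix."""
    w, V = np.linalg.eigh(C)
    return (V * f(w)) @ V.T

def _isqrt(w):
    return w ** -0.5

def distance(C1, C2):
    """Riemannian distance of equation (1)."""
    W = _fun(C1, _isqrt)
    lam = np.linalg.eigvalsh(W @ C2 @ W)
    return np.sqrt(np.sum(np.log(lam) ** 2))

def geometric_mean(Cs, n_iter=50, tol=1e-10):
    """Iterative solution of problem (2) for a stack of SPD matrices."""
    G = np.mean(Cs, axis=0)                    # start from the arithmetic mean
    for _ in range(n_iter):
        S, W = _fun(G, np.sqrt), _fun(G, _isqrt)
        # Average of the whitened trials in the tangent space at G.
        T = np.mean([_fun(W @ C @ W, np.log) for C in Cs], axis=0)
        G = S @ _fun(T, np.exp) @ S            # move G along the manifold
        if np.linalg.norm(T) < tol:            # zero mean tangent vector: done
            break
    return G

def fit_rmdm(covs, labels):
    """One geometric mean per class."""
    return {c: geometric_mean(covs[labels == c]) for c in np.unique(labels)}

def predict_rmdm(means, C):
    """Assign C to the class with the closest mean."""
    return min(means, key=lambda c: distance(means[c], C))

# Check on commuting matrices: the geometric mean of diag(1, 4) and
# diag(4, 1) is diag(2, 2), whereas the arithmetic mean is diag(2.5, 2.5).
G_check = geometric_mean(np.array([np.diag([1.0, 4.0]), np.diag([4.0, 1.0])]))

# Simulated two-class problem (random data, not EEG): each class has a
# larger amplitude on a different channel.
rng = np.random.default_rng(0)

def simulate(label, n_trials, n_ch=4, n_samp=200):
    scale = np.ones(n_ch)
    scale[label] = 2.0
    X = rng.normal(size=(n_trials, n_ch, n_samp)) * scale[:, None]
    return X @ X.transpose(0, 2, 1) / n_samp   # one covariance per trial

covs = np.concatenate([simulate(0, 40), simulate(1, 40)])
labels = np.repeat([0, 1], 40)
train = np.arange(80) % 2 == 0                 # even trials train, odd trials test
means = fit_rmdm(covs[train], labels[train])
pred = np.array([predict_rmdm(means, C) for C in covs[~train]])
accuracy = np.mean(pred == labels[~train])
print(f"test accuracy: {accuracy:.2f}")
```

Note that the sketch contains no spatial filter, no feature selection step and no hyper-parameter to be tuned by cross-validation; the iteration limits only control the precision of the mean.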

State-of-the-art.

As described above, Riemannian classifiers either operate directly on the manifold (e.g. the RMDM) or by the projection of the data in the tangent space. Simple RGCs on the manifold have been shown to be competitive as compared to previous state-of-the-art classifiers used in BCI as long as the number of electrodes is not very large, providing better robustness to noise and better generalization capabilities, both on healthy users [13, 46, 100] and clinical populations [158]. RGCs based on tangent space projection clearly outperformed the other state-of-the-art methods in terms of accuracy [13, 14], as demonstrated by the first place they have been awarded in five recent international BCI predictive modelling data competitions, as reported in [47]. For a comprehensive review of the Riemannian approaches in BCI, the reader can refer to [47, 232]. The various approaches using Riemannian geometry classifiers for EEG-based BCIs are summarized in table 3.

Table 3. Summary of Riemannian geometry classifiers for EEG-based BCI.

EEG pattern	Features	Classifier	References
Motor imagery	Band-pass covariance	RMDM	[13, 46]
Motor imagery	Band-pass covariance	Tangent space + LDA	[13, 231]
Motor imagery	Band-pass covariance	SVM Riemannian Kernel	[14]
P300	Special covariance	RMDM	[46]
P300	Special covariance	RMDM	[15]
P300	Special covariance	RMDM	[158]
SSVEP	Band-pass covariance	RMDM	[34, 100]

Pros and cons.

As highlighted in [232], the processing procedure of Riemannian approaches such as RMDM is simpler and involves fewer stages than that of more classic approaches. Also, Riemannian classifiers apply equally well to all BCI paradigms (e.g. BCIs based on mental imagery, ERPs and SSVEP); only the manner in which data points are mapped in the SPD manifold differs (see [47] for details). Furthermore, in contrast to most classification methods, the RMDM approach is parameter-free, that is, it does not require any parameter tuning, for example by cross-validation. Hence, Riemannian geometry provides new tools for building simple, more robust and accurate prediction models.

Several reasons have been proposed to advocate the use of the Riemannian geometry. Due to its logarithmic nature the Riemannian distance is robust to extreme values, that is, noise. Also, the intrinsic Riemannian distance for SPD matrices is invariant both to matrix inversion and to any linear invertible transformation of the data, e.g. any mixing applied to the EEG sources does not change the distances among the observed covariance matrices. These properties in part explain why Riemannian classification methods provide a good generalization capability [224, 238], which enabled researchers to set up calibration-free adaptive ERP-BCIs using simple subject-to-subject and session-to-session transfer learning strategies [6].
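The two invariance properties mentioned above can be checked numerically. The following sketch is only an illustration on random SPD matrices; the function name `riemann_distance`, the matrix sizes and the particular mixing matrix `A` are assumptions.

```python
import numpy as np

def riemann_distance(C1, C2):
    """Riemannian distance of equation (1) between two SPD matrices."""
    w, V = np.linalg.eigh(C1)
    C1_isqrt = (V / np.sqrt(w)) @ V.T
    lam = np.linalg.eigvalsh(C1_isqrt @ C2 @ C1_isqrt)
    return np.sqrt(np.sum(np.log(lam) ** 2))

rng = np.random.default_rng(1)
n = 4

def random_spd():
    # Sample covariance of random data (not EEG), SPD with probability one.
    X = rng.normal(size=(n, 100))
    return X @ X.T / 100

C1, C2 = random_spd(), random_spd()

# An arbitrary invertible mixing matrix: upper triangular, diagonal of 2.
A = np.triu(np.ones((n, n))) + np.eye(n)

d = riemann_distance(C1, C2)
d_inv = riemann_distance(np.linalg.inv(C1), np.linalg.inv(C2))   # inversion
d_mix = riemann_distance(A @ C1 @ A.T, A @ C2 @ A.T)             # mixing

# For comparison, the Euclidean (Frobenius) distance changes under mixing.
e = np.linalg.norm(C1 - C2)
e_mix = np.linalg.norm(A @ C1 @ A.T - A @ C2 @ A.T)
print(d, d_inv, d_mix, e, e_mix)
```

The three Riemannian distances agree up to round-off, whereas the Euclidean distance depends on the mixing matrix.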

Interestingly, as illustrated in [94], it is possible to not only interpolate along geodesics (figure 2) on the SPD manifolds, but also to extrapolate (e.g. forecast) without leaving the manifold and respecting the geometrical constraints. For example, in [99] interpolation has been used for data augmentation by generating artificial covariance matrices along geodesics but extrapolation could also have been used. Often, the Riemannian interpolation is more relevant than its Euclidean counterpart as it does not suffer from the so-called swelling effect [232]. This effect describes the fact that a Euclidean interpolation between two SPD matrices does not treat the determinant of the matrix as it should (i.e. the determinant of the Euclidean interpolation can exceed the determinant of the interpolated matrices). In the spirit of [231], the determinant of a covariance matrix can be considered as the volume of the polytope described by the columns of the matrix. Thus, a distance that is immune to the swelling effect will respect the shape of the polytope along geodesics.
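A small worked example may help to visualize the swelling effect and the behaviour of extrapolation. The sketch below uses the standard geodesic of the affine-invariant metric, $\mathbf{C}_1^{1/2}\left(\mathbf{C}_1^{-1/2}\mathbf{C}_2\mathbf{C}_1^{-1/2}\right)^t\mathbf{C}_1^{1/2}$; the function names and the two diagonal matrices are chosen for illustration only.

```python
import numpy as np

def _fun(C, f):
    """Apply a scalar function to the eigenvalues of a symmetric matrix."""
    w, V = np.linalg.eigh(C)
    return (V * f(w)) @ V.T

def geodesic(C1, C2, t):
    """Point at position t on the geodesic from C1 (t=0) to C2 (t=1)."""
    S, W = _fun(C1, np.sqrt), _fun(C1, lambda w: w ** -0.5)
    return S @ _fun(W @ C2 @ W, lambda w: w ** t) @ S

C1 = np.diag([4.0, 1.0])    # determinant 4
C2 = np.diag([1.0, 4.0])    # determinant 4

# Interpolation (t = 0.5): the Euclidean midpoint is diag(2.5, 2.5), whose
# determinant 6.25 exceeds that of both end points (swelling). The geodesic
# midpoint is diag(2, 2), whose determinant is 4.
det_euclid = np.linalg.det(0.5 * C1 + 0.5 * C2)
det_riemann = np.linalg.det(geodesic(C1, C2, 0.5))

# Extrapolation (t = 1.5): the Euclidean line gives diag(-0.5, 5.5), which
# is no longer positive definite; the geodesic gives diag(0.5, 8).
E = (1 - 1.5) * C1 + 1.5 * C2
R = geodesic(C1, C2, 1.5)
print(det_euclid, det_riemann)
print(np.linalg.eigvalsh(E), np.linalg.eigvalsh(R))
```

Along the geodesic the determinant equals $\det(\mathbf{C}_1)^{1-t}\det(\mathbf{C}_2)^{t}$, which explains why it stays at 4 in this example.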

As equation (1) indicates, computing the Riemannian distance between two SPD matrices involves adding squared logarithms, which may cause numerical problems; the smallest eigenvalues of matrix $\mathbf{C}_1^{-1}\mathbf{C}_2$ tend towards zero as the number of electrodes increases and/or the window size for estimating $\mathbf{C}_1$ and $\mathbf{C}_2$ decreases, making the logarithm operation ill-conditioned and numerically unstable. Further, note that the larger the dimensions, the more the distance is prone to noise. Moreover, Riemannian approaches usually have high computational complexities (e.g. growing cubically with the number of electrodes for computing both the geometric mean and the Riemannian distance). For these reasons, when the number of electrodes is large with respect to the window size, it is advocated to reduce the dimensions of the input matrices. Classical unsupervised methods such as PCA or supervised methods such as CSP can be used for this purpose. Recently, Riemannian-inspired dimensionality reduction methods have been investigated as well [94, 95, 189].
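The effect of the window size on the eigenvalues entering equation (1) can be illustrated with a short simulation. In the sketch below, both covariance matrices are estimated from white noise with identical statistics, so that all eigenvalues of $\mathbf{C}_1^{-1}\mathbf{C}_2$ would equal one with infinite data; the number of channels, the window sizes and the function names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_ch = 16                                  # hypothetical number of electrodes

def sample_cov(n_samp):
    # Sample covariance of white noise (not EEG); the true covariance is I.
    X = rng.normal(size=(n_ch, n_samp))
    return X @ X.T / n_samp

def relative_eigenvalues(C1, C2):
    """Eigenvalues of C1^{-1} C2, computed from a symmetric whitened matrix."""
    w, V = np.linalg.eigh(C1)
    C1_isqrt = (V / np.sqrt(w)) @ V.T
    return np.linalg.eigvalsh(C1_isqrt @ C2 @ C1_isqrt)

lam_long = relative_eigenvalues(sample_cov(2000), sample_cov(2000))
lam_short = relative_eigenvalues(sample_cov(24), sample_cov(24))

# Ratio of largest to smallest eigenvalue: close to 1 means well-conditioned.
spread_long = lam_long.max() / lam_long.min()
spread_short = lam_short.max() / lam_short.min()
print(f"2000 samples: min eigenvalue {lam_long.min():.3f}, spread {spread_long:.1f}")
print(f"  24 samples: min eigenvalue {lam_short.min():.3f}, spread {spread_short:.1f}")
```

With the short window the smallest eigenvalue moves towards zero and the spread grows, although the underlying statistics are identical, which is the situation in which reducing the dimension of the input matrices is advocated.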

Interestingly, some approaches have tried to bridge the gap between Riemannian approaches and more classical paradigms by incorporating some Riemannian geometry in approaches such as CSP [12, 233]. CSP was the previous gold standard and is based on a different paradigm than Riemannian geometry. Combining the best of these two paradigms is expected to improve robustness while compressing the information.

4.2.2. Other matrix classifiers.