Robust regression for large-scale neuroimaging studies☆
Introduction
Population-level inference or population comparison based on neuroimaging data most often rely on a mass univariate linear model, in which voxel-based brain measurements are modeled as a linear function of the variables of interest (e.g. age or sex) forming the so-called design matrix for a given group of subjects. Depending on the imaging modality, these measurements reflect tissue density or volume, neural activity (as measured by the BOLD signal) or (probabilistic) white-matter tract orientation through diffusion MRI. This mass-univariate framework is weakened by some well-known issues, such as i) the large number of statistical tests performed, which entails strong corrections for multiple comparisons to control for type I errors (Bennett et al., 2009); ii) the presence of correlations in the signal, that break the independence assumption (Friston et al., 1995); and iii) the presence of undesired effects or artifacts that substantially degrade the image quality, at a local or global spatial scale (Erasmus et al., 2004). Finally, inter-individual variability in brain anatomy, cognitive function and functional organization potentially results in mismatches in the image registration that degrade the sensitivity of statistical inference procedures.
Neuroimaging group analyses aim at detecting the effect of a variable of interest by assessing the significance of its correlation with brain images. Many data processing and statistical analysis methods have been proposed in the literature to perform neuroimaging group analyses. These deal with the three main issues mentioned above: local averages within regions of interest (Flandin et al., 2002, Nieto-Castanon et al., 2003, Thirion et al., 2006) and feature selection (Hansen et al., 1999, Thirion and Faugeras, 2003, Spetsieris et al., 2009) are used to reduce the data dimension and the dependence between descriptors; prior smoothing of the images reduces registration mismatches (Worsley et al., 1996) and can be accounted for in standard multiple comparison corrections (Worsley et al., 1992); introducing noise regressors into the model aims at improving the sensitivity of the analyses (Lund et al., 2006); cluster-size analysis (Roland et al., 1993, Friston et al., 1993, Poline and Mazoyer, 1993), Threshold-Free Cluster Enhancement (TFCE) (Smith and Nichols, 2009, Salimi-Khorshidi et al., 2011) and Randomized Parcellation Based Inference (RPBI) (Da Mota et al., 2013) are state-of-the-art methods that combine several of the above-mentioned concepts to improve the statistical sensitivity of the analyses. For a more complete review, see Da Mota et al., 2013, Moorhead et al., 2005; and Petersson et al. (1999). All these methods rely on a set of assumptions about the statistical structure of the data (e.g. Gaussian-distributed data, “smooth-enough” images (Hayasaka et al., 2004), descriptors (in-)dependence), which are difficult to check in practice. Even though some tools have been designed to check whether the data exhibit artifacts, such as Luo and Nichols (2003), no guarantee is given that the images output by standard pre-processing pipelines will conform to these assumptions. In particular, most of the methods fit a linear model to the data with ordinary least squares (OLS) regression, a procedure that is optimal only if the noise is Gaussian-distributed with a given variance across samples (i.e. across individuals). Note that, by contrast, the variance can vary arbitrarily across voxels.
Departure from normality has stronger effects in small sample settings than in large sample settings, where the central limit theorem leads to Gaussian errors on the estimated parameters. On the other hand, violation from standard hypotheses about the statistical structure of the data cannot be easily detected when 10 to 20 subjects are included in a neuroimaging experiment, while significant departure may be observable when larger groups of subjects are considered. Consequently, we can expect a much better model of the data, and some gains in sensitivity or specificity if we use a model that relaxes standard, simplistic assumptions such as Gaussian-distributed data, or homoscedastic noise. The need for such improved techniques becomes more apparent as more large-scale neuroimaging cohorts are now emerging (ADNI (Jack et al., 2008), IMAGEN (Schumann et al., 2010), Human Connectome (Van Essen et al., 2012) cohorts, Saguenay Youth Study (Pausova et al., 2007)). Using the simplest analysis scheme, i.e. the massively univariate voxel-wise inference, Wager et al. (2005) suggested to replace standard ordinary least squares regression by robust regression (Huber regression (Huber, 2005)), which has the advantage of i) relying on weak structural assumptions (symmetric, unimodal data) and ii) being robust to outliers. Wager and colleagues' work successfully showed sensitivity improvements for both inter- and intra-subject analyses, as well as better results stability in the presence of outliers. But this work was limited to the consideration of small groups of subjects (< 20) and only the outlier-resistant property of the method seems to have been considered by the community (Poldrack, 2007, Ochsner et al., 2009, McRae et al., 2010, Kober et al., 2010, Atlas et al., 2010).
Many robust regression settings have been proposed in the statistical literature to perform accurate detection in the context of non normally-distributed data. least absolute deviation (LAD) regression (or 1 regression) (Dodge, 1987) minimizes the sum of the absolute value of the model residuals. It is hard to compute in practice and the solution of the associated optimization problem may not be unique (Huber, 2005). The repeated median algorithm (Siegel, 1982) is a regression algorithm that targets a high level of outlier resistance, namely up to 50% of contamination (a property known as high-breakdown point). It is computationally expensive and the resulting estimate is not affine equivariant, because it is sensitive to a rotation of the data. The least median of squares (LMS) (Hampel, 1975) and least trimmed squares (LTS) (Rousseeuw, 1984) estimates also have a high breakdown point but can only be computed with algorithms for which there is no known global optimum. The efficiency of these methods for uncontaminated samples is generally poor. This can be easily understood if one conceptualizes robust methods as methods that reject the input samples that are most dissimilar to the others: the straightforward consequence is that the number of samples used in the estimation is smaller, resulting in more variable estimates and power loss. A compromise thus needs to be found between the amount of robustness to achieve and the estimation accuracy when there are no or few outliers. Moreover, all of the above-mentioned regression estimators are difficult to apply in the context of neuroimaging problems, where thousands of correlated variables are involved. Huber's regression is the most convenient regression criterion to date. It offers a good compromise between interpretability, computational cost and robustness.
Here, we extend the work of Wager (Wager et al., 2005) by revisiting the testing framework based on Huber's robust regression and analyzing its impact on the results of large fMRI cohorts analyses, in particular in an imaging genetics study. Specifically, we show that Huber regression has some key advantages over two alternative robust regression schemes, least trimmed squares (LTS) and support vector regression (SVR) with the ϵ-insensitive loss, which is that it can be associated with an analytical test that controls type I error rate accurately; it is also ten times faster than LTS and, unlike LTS, it is guaranteed to converge to an optimal solution.
We first check the theoretical and empirical validity of robust regression in this context, and obtain additional detections on real data.
The impact of this approach on results for actual datasets is found to be important. This means that, while quality-check or data cleaning procedures are essential, the amount of data and their complexity make it impossible to set up optimal inclusion/exclusion criteria, so that robust inference remains necessary even after standard screening. Finally, we combine robust regression with state-of-the-art analysis methods to increase the analysis sensitivity. To this end we developed a fast algorithm for solving the robust regression problem, as well as a fast testing procedure. State-of-the-art analysis methods are indeed often used with permutation testing as they involve complex statistics. It is therefore crucial to reduce their computation cost.
The outline of the paper is the following. In Statistical inference for neuroimaging group analysis section we detail the neuroimaging group analysis setting. We introduce Randomized Parcellations Based Inference (RPBI), as this method will be used in combination with robust regression later on. We introduce robust regression, emphasize the associated test statistic and detail efficient procedures to perform a robust fit and the ensuing statistical tests in Robust regression section. In Robust analysis of simulated and real datasets section we introduce our simulations and real data experiments; the corresponding results are described in Results section.
Section snippets
Univariate analysis
We consider the linear model:
where y is a set of n samples of an imaging feature (typically the signal value in a given voxel measured in n subjects), X is an n × p matrix of p variates describing the same n samples, and ϵ represents noise. Some columns of X correspond to variates of no interest, while other columns are explanatory variates of interest, of which one wants to assess and test the influence on the data y. The purpose of linear regression is to estimate the unknown
Robust regression
While is the maximum likelihood estimate of β for Gaussian-distributed noise (ϵ), this assumption does not hold for neuroimaging data, as shown in Fig. 1, Fig. 2. Corrupted data and observations that deviate from the population-representative pattern potentially introduce bias in the parameter estimates. We call these unusual observations outliers, and the proportion of outliers in a dataset is called the contamination. Formally, outliers have large residual values that give them
Simulations
We first conduct an empirical validation of the robust regression testing procedure, and compare it with standard OLS. We use the following model to generate n observations {y1, …, yn}:
where X is a random (n × r) design matrix, β is the (r × 1) vector of the model coefficients, models a Gaussian noise, aq is a n-dimensional vector with coordinates drawn from a Bernoulli distribution (1 − q), α > 1 is a scalar, and ∘ is the element-wise multiplication (Hadamard product). q is
Type I error control
The control of type I error obtained with the testing procedures associated with OLS and RLM is shown in Fig. 4. This shows that the analytical test is exact for thresholds up to p < 10− 5 uncorrected. Given the number of tests (106) performed in the simulation, we cannot assert that the control is accurate for lower thresholds.4 This result holds when the amount of
Discussion
The current study shows that Huber's robust regression (referred to as RLM in this paper) yields more sensitive findings than ordinary least squares regression in the context of neuroimaging studies. After ensuring that the testing procedure associated with robust regression comes with a good control of the type I errors, we showed that resistance to outliers yields a better stability and, in turn, increases the number of detected regions. These results are confirmed on real neuroimaging data.
Acknowledgment
This work was supported by Digiteo grants (HiDiNim project No. 2010-42D and ICoGeN project), the ANR grant ANR-10-BLAN-0128, and the Microsoft Inria joint center grant A-brain.
This work was supported by the European Union-funded FP6 Integrated Project IMAGEN (Reinforcement-related behavior in normal brain function and psychopathology) (LSHM-CT-2007-037286), the FP7 projects IMAGEMEND and MATRICS and the Innovative Medicine Initiative Project EU-AIMS (115300-2), Medical Research Council
References (61)
- et al.
Neural correlates of interspecies perspective taking in the post-mortem Atlantic salmon: an argument for multiple comparisons correction
NeuroImage
(2009) - et al.
Analysis of fMRI time-series revisited
Neuroimage
(1995) - et al.
Detecting outliers in high-dimensional neuroimaging datasets with robust covariance estimators
Med. Image Anal.
(2012) - et al.
Generalizable patterns in neuroimaging: How many principal components?
NeuroImage
(1999) - et al.
Nonstationary cluster-size inference with random field and permutation methods
Neuroimage
(2004) - et al.
Non-white noise in fMRI: does modelling have an impact?
NeuroImage
(2006) - et al.
Diagnosis and exploration of massively univariate neuroimaging models
Neuroimage
(2003) - et al.
Empirical comparison of maximal voxel and non-isotropic adjusted cluster extent results in a voxel-based morphometry study of comorbid learning disability with schizophrenia
Neuroimage
(2005) - et al.
Simple group fMRI modeling and inference
Neuroimage
(2009) - et al.
Region of interest based analysis of functional imaging data
Neuroimage
(2003)
Mixed-effect statistics for group analysis in fMRI: a nonparametric maximum likelihood approach
Neuroimage
Adjusting the effect of nonstationarity in cluster-based and TFCE inference
Neuroimage
Threshold-free cluster enhancement: addressing problems of smoothing, threshold dependence and localisation in cluster inference
Neuroimage
Differential diagnosis of parkinsonian syndromes using pca-based functional imaging features
Neuroimage
Dynamical components analysis of fMRI data through kernel PCA
NeuroImage
Very large fMRI study using the IMAGEN database: sensitivity-specificity and population effect modeling in relation to the underlying anatomy
Neuroimage
The human connectome project: a data acquisition perspective
Neuroimage
Increased sensitivity in neuroimaging analyses using robust regression
NeuroImage
Cluster-extent based thresholding in fmri analyses: pitfalls and recommendations
Neuroimage
Permutation tests for linear models
Aust. N. Z. J. Stat.
Brain mediators of predictive cue effects on perceived pain
J. Neurosci.
Solution of Robust Linear Regression Problems by Preconditioned Conjugate Gradient Methods
Genetic sensitivity to the environment: the case of the serotonin transporter gene and its implications for studying complex diseases and traits
Am. J. Psychiatry
Robust regression: a weighted least squares approach
Commun. Stat. Theory Methods
Support-vector networks
Mach. Learn.
A class of high-breakdown scale estimators based on subranges
Commun. Stat. Theory Methods
Randomized parcellation based inference
Neuroimage
Statistical data analysis based on the L1-norm and related methods
Support vector regression machines
Advances in Neural Information Processing Systems
A short overview of MRI artefacts: review article
SA J. Radiol.
Cited by (19)
Robust Gaussian process regression based on bias trimming
2024, Knowledge-Based SystemsPrenatal exposure to phthalates and peripheral blood and buccal epithelial DNA methylation in infants: An epigenome-wide association study
2022, Environment InternationalCitation Excerpt :Specifically, due to the presence of outliers in the distributions of data (Figure A.2), robust multivariable linear regression with Huber M−estimation (using rlm from the MASS package) (Lourenço et al., 2011; Venables and Ripley, 2002) was used as it is robust to violations of statistical assumptions (e.g., normality) and robust to outliers. Using observed values and robust statistical modeling techniques such as robust linear models is recommended as transforming data can provide less trustworthy estimates of effect and lead to misinterpretations (Feng et al., 2013; Fritsch et al., 2015). Robust linear models were run separately in each tissue to perform association tests between the dependent variable, β-value at each variable CpG (n = 93,000 in blood, n = 153,000 in BECs), and the independent variable, prenatal phthalate exposure (ΣHMWPs and ΣLMWPs, respectively).
Pain Neuroimaging in Humans: A Primer for Beginners and Non-Imagers
2018, Journal of PainCitation Excerpt :Results should be visualized (ie, scatter plots should be inspected to determine whether effects can be attributed to a small number of individuals). Alternatively, automated statistical techniques such as robust regression2,21,89,221 can reduce the contribution of outliers. The group-level analyses outlined previously, when conducted on each voxel throughout the entire brain, require tens of thousands of statistical tests (ie, multiple comparisons).
Blunted neural response to implicit negative facial affect in anorexia nervosa
2017, Biological PsychologyReprint of ‘A Comprehensive review of group level model performance in the presence of heteroscedasticity: Can a single model control Type I errors in the presence of outliers?’
2017, NeuroImageCitation Excerpt :In the multivariate case, no method properly controlled Type I error rates. A slightly different approach to robust regression was presented in Fritsch et al. (2015), where a Randomized Parcellation Based Inference (Da Mota et al., 2014) is combined with robust regression. Huber's loss function was used and the sample size was much larger than in Wager et al. (2005), typically 400–1000 subjects.
A comprehensive review of group level model performance in the presence of heteroscedasticity: Can a single model control Type I errors in the presence of outliers?
2017, NeuroImageCitation Excerpt :Due to the spatial smoothness of fMRI data, it is unlikely that an outlier will exist only in isolated voxels, but in sets of voxels. For example, in Fritsch et al. (2015), an influential outlier that caused a false positive linear relationship when using OLS was consistent across many voxels. In many small sample studies, it is often tempting to use Kendall's Tau and these results show that in most cases the Type I error rate is fairly well controlled, but often with a large decrease in power.
- ☆
Authors 5–17 are listed in alphabetical order.
- 1
URL: www.imagen-europe.com.