
1 Introduction

Major depressive disorder (MDD) affects over 350 million people worldwide [1]; it takes an immense personal toll on patients and their families and places a vast economic burden on society. MDD involves a wide spectrum of symptoms, varying risk factors, and varying response to treatment [2]. Unfortunately, early diagnosis of MDD is challenging and is based on behavioral criteria; consistent structural and functional brain abnormalities in MDD are only beginning to be understood. Neuroimaging of large cohorts can identify characteristic correlates of depression, and may also help to detect modulatory effects of interventions and of environmental and genetic risk factors. Recent advances in brain imaging, such as magnetic resonance imaging (MRI) and its variants, allow researchers to investigate brain abnormalities, to identify statistical factors that influence them, and to examine how they relate to diagnosis and outcomes [12]. Researchers have reported structural and functional brain alterations in MDD using different MRI modalities. Recently, the ENIGMA-MDD Working Group found that adults with MDD have thinner cortical gray matter in the orbitofrontal cortices, insula, anterior/posterior cingulate, and temporal lobes compared to healthy adults without a diagnosis of MDD [3]. A subcortical study – the largest to date – showed that MDD patients tend to have smaller hippocampal volumes than controls [4]. Diffusion tensor imaging (DTI) [5] reveals, on average, lower fractional anisotropy in the frontal lobe and right occipital lobe of MDD patients. MDD patients may also show aberrant functional connectivity in the default mode network (DMN) and other task-related functional brain networks [6].

Even so, classification of MDD is still challenging. There are three major barriers: first, though significant group differences have been found, the previously identified brain regions or brain measures are not always consistent markers for MDD classification [7]; second, besides T1-weighted imaging, other modalities including DTI and functional magnetic resonance imaging (fMRI) are not commonly acquired in a clinical setting; third, it is not always easy for collaborating medical centers to perform an integrated data analysis, both because data privacy regulations limit the exchange of individual raw data and because thousands of images impose large transfer times and storage requirements. As biobanks grow, we need an efficient platform to integrate predictive information from multiple centers; as the available datasets increase, this effort should increase the statistical power to identify predictors of disease diagnosis and future outcomes, beyond what each site could identify on its own.

In this study, we introduce a multi-site weighted LASSO (MSW-LASSO) model that boosts classification performance for each participating site by integrating their knowledge for feature selection and their classification results. As shown in Fig. 1, our proposed framework has the following characteristics: (1) each site retains its own data and performs weighted LASSO regression locally for feature selection; (2) only the selected brain measures and the classification results are shared with the other sites; (3) information on the selected brain measures and the corresponding classification results is integrated to generate a unified weight vector across features, which is then sent to each site and applied in the weighted LASSO of the next iteration; (4) if the new weight vector leads to a new set of brain measures and better classification performance, the new set of brain measures is shared with the other sites; otherwise, it is discarded and the previous set is retained.

Fig. 1. Overview of our proposed framework.

2 Methods

2.1 Data and Demographics

For this study, we used data from five sites across the world. The total number of participants is 557; all were older than 21 years. Demographic information for each site’s participants is summarized in Table 1.

Table 1. Demographics for the five sites participating in the current study.

2.2 Data Preprocessing

As in most common clinical settings, only T1-weighted MRI brain scans were acquired at each site; quality control and analyses were performed locally. Sixty-eight (34 left/34 right) cortical gray matter regions, seven subcortical gray matter regions, and the lateral ventricles were segmented with FreeSurfer [8]. Detailed image acquisition, pre-processing, brain segmentation, and quality control methods may be found in [3, 9]. Brain measures comprised cortical thickness and surface area for the cortical regions, and volume for the subcortical regions and lateral ventricles. In total, 152 brain measures were considered in this study.

2.3 Algorithm Overview

To better illustrate the algorithms (Tables 2 and 3), we define the following notation:

Table 2. Main steps of Algorithm 1.
Table 3. Main steps of Algorithm 2.

1. \( D_{i} \): the local data of Site-i;
2. \( F_{i} \): the selected brain measures (features) of Site-i;
3. \( A_{i} \): the classification performance of Site-i;
4. W: the weight vector;
5. w-LASSO(W, \( D_{i} \)): perform weighted LASSO on \( D_{i} \) with the weight vector W;
6. SVM(\( F_{i} \), \( D_{i} \)): perform an SVM classification on \( D_{i} \) using the feature set \( F_{i} \).

The procedure has two parts: steps that run at each site, and an integration server. At first, the integration server initializes a weight vector of all ones and sends it to all sites. Each site uses this weight vector to conduct weighted LASSO (Sect. 2.6) on its own data locally. If the newly selected features yield better classification performance, the site sends the new features and the corresponding classification result to the integration server; if there is no improvement in classification accuracy, it sends the previous ones. After the integration server receives the updates from all sites, it generates a new weight vector from the different feature sets and their classification performance; the detailed strategy is discussed in Sect. 2.5.
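This loop can be summarized in a short schematic sketch (a minimal illustration, not the study's implementation: the site objects and their `w_lasso`, `svm`, `best_accuracy`, `best_features`, and `n_subjects` members are hypothetical stand-ins for the site-local steps; note that only feature indices and accuracies cross site boundaries, never raw data):

```python
import numpy as np

def integration_round(sites, weights, n_features=152):
    """One iteration: each site updates locally; the server re-weights features."""
    updates = []
    for site in sites:
        features = site.w_lasso(weights)    # w-LASSO(W, D_i), runs on local data
        accuracy = site.svm(features)       # SVM(F_i, D_i), runs on local data
        if accuracy > site.best_accuracy:   # accept new features only on improvement
            site.best_features, site.best_accuracy = features, accuracy
        updates.append((site.best_features, site.best_accuracy, site.n_subjects))
    # Server side: aggregate the shared results into a new weight vector (Sect. 2.5).
    total = sum(n for _, _, n in updates)
    new_w = np.zeros(n_features)
    for feats, acc, n in updates:
        new_w[list(feats)] += acc * n / total   # Eqs. (2)-(3)
    return new_w / len(sites)

# weights = np.ones(152)                  # the first round reduces to ordinary LASSO
# while any site still improves:
#     weights = integration_round(sites, weights)
```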

2.4 Ordinary LASSO and Weighted LASSO

LASSO [10] is a shrinkage method for linear regression. The ordinary LASSO is defined as:

$$ \widehat{\beta}_{\text{LASSO}} = \arg\min_{\beta} \left\| y - \sum\nolimits_{i=1}^{n} x_{i} \beta_{i} \right\|^{2} + \lambda \sum\nolimits_{i=1}^{n} \left| \beta_{i} \right| $$
(1)

Here y is the vector of observations and the \( x_{i} \) are the predictors; λ is known as the sparsity parameter. The objective minimizes the sum of squared errors while penalizing the sum of the absolute values of the coefficients β. As LASSO regression forces many coefficients to be exactly zero, it is widely used for variable selection [11].
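For concreteness, here is a minimal feature-selection example on synthetic data (assuming scikit-learn; its `alpha` plays the role of λ in Eq. (1), up to scikit-learn's constant rescaling of the squared-error term):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 152))        # 100 subjects x 152 brain measures
y = rng.integers(0, 2, 100).astype(float)  # label: 0 = control, 1 = MDD

model = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(model.coef_)     # indices of nonzero coefficients
print(f"{selected.size} of 152 features selected")
```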

However, the classical LASSO shrinkage procedure can be biased when estimating large coefficients [12]. To alleviate this, the adaptive LASSO [12] assigns each predictor its own penalty parameter, and can thus avoid penalizing large coefficients more heavily than small ones. Similarly, the motivation of the multi-site weighted LASSO (MSW-LASSO) is to penalize different predictors (brain measures) by assigning them different weights, according to their classification performance across all sites. The generation of the per-feature weights and the MSW-LASSO model are discussed in Sects. 2.5 and 2.6.

2.5 Generation of a Multi-site Weight

In Algorithm 1, after the integration server receives the information on selected features (brain measures) and the corresponding classification performance of each site, it generates a new weight for each feature. The new weight for the \( f^{th} \) feature is:

$$ W_{f} = \sum\nolimits_{s = 1}^{m} \varPsi_{s,f} A_{s} P_{s} / m $$
(2)
$$ \varPsi_{s,f} = \begin{cases} 1, & \text{if the } f^{th} \text{ feature was selected at site-}s \\ 0, & \text{otherwise} \end{cases} $$
(3)

Here m is the number of sites, \( A_{s} \) is the classification accuracy of site-s, and \( P_{s} \) is the proportion of participants at site-s relative to the total number of participants across all sites. Equation (3) penalizes features that only “survived” at a small number of sites; conversely, if a specific feature was selected by all sites, meaning all sites agree that it is important, it receives a larger weight. Equation (2) accounts for both classification performance and sample proportion: if a site achieves very high classification accuracy but has a relatively small sample size compared to other sites, its selected features are only conservatively “recommended” to the other sites. In general, a feature selected by more sites and associated with higher classification accuracy receives a larger weight.
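Equations (2)-(3) translate directly into code; a minimal sketch (the function and argument names are ours, with each site's feature set given as a set of column indices):

```python
import numpy as np

def generate_weights(site_features, site_accuracies, site_sizes, n_features=152):
    """W_f = sum_s Psi_{s,f} * A_s * P_s / m, per Eqs. (2)-(3)."""
    m = len(site_features)
    total = sum(site_sizes)
    W = np.zeros(n_features)
    for feats, acc, n in zip(site_features, site_accuracies, site_sizes):
        p = n / total            # P_s: proportion of all participants at site s
        for f in feats:          # Psi_{s,f} = 1 exactly for the selected features
            W[f] += acc * p
    return W / m

# A feature selected at every site, at high accuracy, accumulates a large weight:
# generate_weights([{0, 3}, {0, 7}], [0.80, 0.70], [120, 90])
```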

2.6 Multi-site Weighted LASSO

In this section, we define the multi-site weighted LASSO (MSW-LASSO) model:

$$ \widehat{\beta}_{\text{MSW-LASSO}} = \arg\min_{\beta} \left\| y - \sum\nolimits_{i=1}^{n} x_{i} \beta_{i} \right\|^{2} + \lambda \sum\nolimits_{i=1}^{n} \left( 1 - \sum\nolimits_{s=1}^{m} \varPsi_{s,i} A_{s} P_{s} / m \right) \left| \beta_{i} \right| $$
(4)

Here \( x_{i} \) represents the MRI measures after controlling for the effects of age, sex, and intracranial volume (ICV), which is handled locally within each site; y is the label indicating MDD patient or control, and n is the number of brain measures (152 in this study). In our MSW-LASSO model, a larger weight for a feature implies higher classification performance and/or recognition by multiple sites. Such a feature is penalized less and has a greater chance of being selected by the sites that did not select it in the previous iteration.
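A per-feature weighted LASSO such as Eq. (4) can be reduced to an ordinary LASSO by rescaling each column by its penalty factor; below is a minimal sketch, assuming scikit-learn (whose `alpha` matches λ only up to a constant, since it rescales the squared-error term by the sample size):

```python
import numpy as np
from sklearn.linear_model import Lasso

def msw_lasso(X, y, W, lam):
    """Solve Eq. (4): X holds the residualized measures, W the weights of Eq. (2)."""
    v = np.clip(1.0 - W, 1e-8, None)        # per-feature penalty factors of Eq. (4);
                                            # the clip guards against nonpositive values
    model = Lasso(alpha=lam).fit(X / v, y)  # ordinary LASSO on the rescaled columns
    beta = model.coef_ / v                  # map coefficients back to original scale
    return np.flatnonzero(beta), beta       # selected feature indices and coefficients
```

The rescaling is exact: fitting the ordinary LASSO to \( x_{i}/v_{i} \) and dividing the resulting coefficients by \( v_{i} \) recovers the minimizer of the weighted objective.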

3 Results

3.1 Classification Improvements Through the MSW-LASSO Model

In this study, we applied Algorithms 1 and 2 to data from five sites across the world. In the first iteration, the integration server initialized a weight vector of all ones and sent it to all sites, so the five sites conducted regular LASSO regression in the first round. After a small set of features was selected within each site, using a strategy similar to that in [9], each site performed classification locally using a support vector machine (SVM) and shared the best classification accuracy, as well as the set of selected features, with the integration server. The integration server then generated the new weight vector according to Eq. (2) and sent it back to all sites. From the second iteration onward, each site performed MSW-LASSO until none of them showed further improvement in classification. In total, the five sites ran MSW-LASSO for six iterations; the classification performance for each round is summarized in Fig. 2(a-e).

Fig. 2. Applying MSW-LASSO to the data from five sites (a-e). Each subfigure shows the classification accuracy (ACC), specificity (SPE) and sensitivity (SEN) at each iteration. (f) shows the improvement in classification accuracy at each site after performing MSW-LASSO.

Though the Stanford and Berlin sites showed no further improvement after the second iteration, classification performance at the BRCDECC and Dublin sites continued improving until the sixth iteration; hence our MSW-LASSO terminated at the sixth round. Figure 2f shows the improvement in classification accuracy for all five sites; the average improvement is 4.9%. The sparsity level of the LASSO was set to 16%, meaning that about 16% of the 152 features tend to be selected in the LASSO process; Sect. 3.3 examines the reproducibility of the results at different sparsity levels. When conducting the SVM classification, the same kernel (RBF) was used, we performed a grid search over the candidate parameters, and only the best classification results were adopted.
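A grid search of this kind might look as follows (a sketch assuming scikit-learn; the `C` and `gamma` grids are illustrative placeholders, not the values used in the study):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, "scale"]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
# search.fit(X[:, selected], y)       # classify using only the selected features
# best_accuracy = search.best_score_  # the accuracy shared with the server
```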

3.2 Analysis of MSW-LASSO Features

In the MSW-LASSO process, a new set of features is accepted only if it yields an improvement in classification; otherwise, the prior set of features is preserved. Accepted features are also “recommended” to other sites through increases in their corresponding weights. Figure 3 displays the changes in the selected features across the six iterations, together with the top five features selected by the majority of sites.

Fig. 3. (a) Number of selected features across the six iterations. (b-f) The top five most consistently selected features across sites; within each subfigure, the top shows the location of the corresponding feature and the bottom indicates how many sites selected it through the MSW-LASSO process. (b-c) are cortical thickness measures and (d-f) are surface area measures.

At the first iteration, 88 features were selected across the five sites. This number decreased over the MSW-LASSO iterations: only 73 features were preserved after six iterations, yet the average classification accuracy increased by 4.9%. Moreover, a feature initially selected by the majority of sites tends to remain selected over multiple iterations (Fig. 3d-e), while “promising” features accepted by fewer sites at first may be incorporated by more sites as the iterations progress (Fig. 3b-c, f).

3.3 Reproducibility of the MSW-LASSO

For LASSO-related problems, there is no closed-form rule for choosing the sparsity level; the choice is highly data dependent. To validate our MSW-LASSO model, we therefore repeated Algorithms 1 and 2 at different sparsity levels, which preserves different proportions of the features. The reproducibility of our proposed MSW-LASSO is summarized in Table 4.

Table 4. Reproducibility results at different sparsity levels. The “selected features” column gives the percentage of features preserved during the LASSO procedure; the remaining columns give the average improvement in accuracy, sensitivity, and specificity at each sparsity level.
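To illustrate the data dependence, the fraction of retained features can be swept by varying the penalty strength on synthetic data (a minimal sketch assuming scikit-learn; the study's actual sparsity levels appear in Table 4):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 152))   # synthetic stand-in for the brain measures
y = X[:, :10] @ rng.standard_normal(10) + 0.1 * rng.standard_normal(100)

for alpha in (0.01, 0.05, 0.1, 0.5):  # larger alpha -> sparser solution
    kept = np.count_nonzero(Lasso(alpha=alpha).fit(X, y).coef_)
    print(f"alpha={alpha}: {kept / 152:.0%} of features retained")
```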

4 Conclusion and Discussion

Here we proposed a novel multi-site weighted LASSO model that heuristically improves classification performance across multiple sites. By sharing with other sites the knowledge of features that may help to improve classification accuracy, each site has repeated opportunities to reconsider its own set of selected features and to increase its accuracy at each iteration. In this study, the average improvement in classification accuracy across the five sites was 4.9%. This work offers a proof of concept for distributed machine learning that may be scaled up to other disorders, modalities, and feature sets.