
1 Introduction

Focal cortical dysplasia (FCD), a malformation of cortical development, is a frequent cause of drug-resistant epilepsy. This surgically-amenable lesion is characterized on histology by altered cortical laminar structure and cytological anomalies together with gliosis and demyelination, which may extend into the underlying white matter [1]. On MRI, FCD typically presents with cortical thickening, hyperintensity, and blurring of the gray-white matter interface. These changes may be visible to the naked eye on T1- and T2-weighted MRI, or subtle and easily overlooked [2].

Over the last decade, a number of automated detection algorithms have been developed [3]. Contemporary FCD detection methods rely on surface-based approaches [4,5,6,7], which effectively model sulco-gyral morphology. While these methods have shown promise, they have mainly served as proofs of principle, applied to lesions previously visible on MRI and rarely validated histologically. Despite advances in MRI analytics, current algorithms still fail to detect subtle FCD [2]. Importantly, since training and validation have been performed on data from the same center and scanner, generalizability to independent cohorts remains unclear. Finally, arduous pre-processing and the need for specialized expertise preclude broader integration into clinical workflows.

Conventional machine-learning systems require careful engineering and considerable domain knowledge to design features from which the classifier can learn patterns. Conversely, convolutional neural networks (CNNs), a class of deep neural networks, have the capacity to extract a hierarchy of increasingly complex features from the data [8]. In biomedical imaging, CNNs have gained popularity in brain tissue classification, and segmentation of brain tumors and multiple sclerosis plaques (see Litjens et al. [9] for review). To the best of our knowledge, no study has deployed CNNs to detect cortical brain malformations.

Exploiting the complementary diagnostic power of T1- and T2-weighted contrasts, we propose a novel algorithm that requires minimal data pre-processing and harnesses the feature-learning proficiency of CNNs to distinguish FCD from healthy tissue directly from MRI voxels. Our algorithm was trained and cross-validated on data from a single site (S1) and further tested on independent data from S1 and six additional sites worldwide (S2–S7), for a total of 107 individuals. Furthermore, it was compared against a benchmark surface-based algorithm, making this study the first deep-learning approach for FCD detection with multicentric validation.

2 Methods

2.1 MRI Acquisition

At S1, multimodal MRI was acquired on a 3T Siemens TimTrio using a 32-channel head coil, including: 3D T1-weighted MPRAGE (T1w; TR = 2300 ms, TE = 2.98 ms, flip angle = 9°, FOV = 256 mm2, voxel size = 1 × 1 × 1 mm3), and T2-weighted 3D fluid-attenuated inversion recovery (FLAIR; TR = 5000 ms, TE = 389 ms, flip angle = 120°, FOV = 230 mm2, 0.9 × 0.9 × 0.9 mm3).

2.2 Image Pre-processing

For all datasets, T1w and FLAIR images underwent intensity non-uniformity correction [10] and normalization. T1w images were then linearly registered (affine, 9 degrees of freedom) to the age-appropriate MNI152 symmetric template (1 × 1 × 1 mm3) stratified across seven age-groups [0–4.5, 4.5–8.5, 7–11, 7.5–13.5, 10–14, 13–18.5, 18.5–43 years old] [11]. Age-appropriate templates minimize the interpolation effects of linear registration, thereby limiting blurring effects that may mimic lesional tissue and manifest as false positives. FLAIR images were linearly mapped to T1w images in MNI space. Skull-stripping was performed to exclude non-brain tissue.
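For concreteness, the per-image steps (bias-field correction and intensity normalization) can be sketched as below; this is a minimal illustration using SimpleITK's N4 filter as a stand-in for the non-uniformity correction cited above, with placeholder file names. Registration to the MNI152 template and skull-stripping are typically delegated to dedicated tools and are omitted here.

```python
import SimpleITK as sitk

def correct_and_normalize(path_in, path_out):
    """Bias-field correction followed by z-score intensity normalization (sketch)."""
    img = sitk.ReadImage(path_in, sitk.sitkFloat32)

    # Rough foreground mask, then N4 as a stand-in for the non-uniformity correction
    mask = sitk.OtsuThreshold(img, 0, 1)
    corrected = sitk.N4BiasFieldCorrection(img, mask)

    # Z-score normalization within the foreground
    arr = sitk.GetArrayFromImage(corrected)
    fg = sitk.GetArrayFromImage(mask) > 0
    arr = (arr - arr[fg].mean()) / (arr[fg].std() + 1e-8)

    out = sitk.GetImageFromArray(arr)
    out.CopyInformation(corrected)
    sitk.WriteImage(out, path_out)

# correct_and_normalize("t1w.nii.gz", "t1w_norm.nii.gz")
# correct_and_normalize("flair.nii.gz", "flair_norm.nii.gz")
```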

2.3 Patch-Based Input Sampling

Balanced Inputs Based on 3D Volumetric Images.

Data imbalance is a challenging issue in FCD lesion detection, where healthy voxels vastly outnumber pathological voxels (<1% of total voxels). To prevent biasing the classifier towards healthy voxels, we constructed a patch-based dataset by randomly under-sampling the healthy voxels so that the feature set contained an equal number of examples from both classes. To this end, we sub-sampled multi-contrast 3D patches from the co-registered T1w and FLAIR images, with each input modality representing a channel. The data were normalized within each input modality to zero mean and unit variance. For each normalized training image, we computed 3D patches (16 × 16 × 16) centered on the voxel of interest. The set of all computed patches was aggregated as
P = {n × 2 × 16 × 16 × 16}, where n and 2 denote the number of training patches and the number of input MRI modalities, respectively.
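A minimal NumPy sketch of this balanced sampling is shown below, assuming co-registered t1w and flair arrays and a binary lesion_mask in the same space; the helper names are illustrative and boundary handling is deliberately simplified.

```python
import numpy as np

def extract_patch(vol, center, size=16):
    """Extract a size^3 patch centered on a voxel (no explicit boundary handling)."""
    h = size // 2
    z, y, x = center
    return vol[z - h:z + h, y - h:y + h, x - h:x + h]

def build_balanced_patches(t1w, flair, lesion_mask, seed=0):
    """Return P of shape (n, 2, 16, 16, 16) with equal lesional and healthy patches."""
    rng = np.random.default_rng(seed)
    lesional = np.argwhere(lesion_mask > 0)
    healthy = np.argwhere(lesion_mask == 0)
    # Under-sample healthy voxels to match the number of lesional voxels
    healthy = healthy[rng.choice(len(healthy), size=len(lesional), replace=False)]

    patches, labels = [], []
    for label, centers in ((1, lesional), (0, healthy)):
        for c in centers:
            p = np.stack([extract_patch(t1w, c), extract_patch(flair, c)])  # 2 channels
            if p.shape == (2, 16, 16, 16):       # skip patches clipped at the volume border
                patches.append(p)
                labels.append(label)
    return np.asarray(patches, dtype=np.float32), np.asarray(labels, dtype=np.int64)
```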

Sampling Heuristics.

At the subject level (1.7 million patches × 32 KBytes/patch = 26.3 GB), training on all patches is too memory-intensive to complete within a reasonable timeframe. To circumvent this issue, we sampled only hyperintense voxels on the FLAIR contrast by thresholding the subject-level z-normalized images and discarding the bottom 10th percentile of intensities. This thresholding yielded a crude gray matter mask that also covered hyperintense white matter. The approach is also biologically meaningful, as FCD lesions are primarily located in the gray matter [12]; moreover, both their gray matter and white matter components are consistently hyperintense on FLAIR [13].
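The heuristic amounts to a simple intensity cut-off; a NumPy sketch is given below, assuming a flair volume and a binary brain_mask (both names are placeholders).

```python
import numpy as np

def flair_sampling_mask(flair, brain_mask):
    """Keep only FLAIR voxels above the 10th intensity percentile within the brain."""
    vals = flair[brain_mask > 0]
    z = (flair - vals.mean()) / (vals.std() + 1e-8)   # subject-level z-normalization
    thresh = np.percentile(z[brain_mask > 0], 10)     # bottom 10th percentile cut-off
    return (z > thresh) & (brain_mask > 0)            # crude gray matter / hyperintense mask
```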

2.4 Network Architecture and Design

A typical convolutional neural network (CNN) consists of three stages: convolution, non-linearity, and pooling. Here, we designed two identical CNNs whose weights are optimized independently. This two-phase cascaded training procedure has been shown to allow efficient training in both CNN [14, 15] and conventional machine-learning [4, 6] paradigms when the distribution of labels is unbalanced. CNN1 was trained to maximize the detection of putative lesional voxels, while CNN2 reduced the number of misclassified voxels (i.e., removing false positives while maintaining optimal sensitivity). Each fully convolutional network was composed of three stacks of convolutional and max-pooling layers with 48, 96 and 2 filters, respectively. The rectified linear unit (ReLU) non-linearity was applied after the first two of the three convolutional layers. A softmax non-linearity after the final convolution normalized the output of the kernel convolutions into a binomial distribution over the healthy and lesional labels. See Fig. 1 for network parameters.
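A sketch of one cascade stage is shown below in PyTorch for illustration; the framework choice, kernel sizes, and pooling windows are assumptions, since only the filter counts, non-linearities, and the three convolution/max-pooling stacks are specified here (Fig. 1 gives the exact parameters).

```python
import torch
import torch.nn as nn

class CNNx(nn.Module):
    """One cascade stage: three conv + max-pooling stacks with 48, 96, and 2 filters.

    Kernel and pooling sizes are assumptions; the input is a 2-channel
    (T1w + FLAIR) 16 x 16 x 16 patch.
    """
    def __init__(self, in_channels=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(in_channels, 48, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(2),                      # 16^3 -> 8^3
            nn.Conv3d(48, 96, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(2),                      # 8^3 -> 4^3
            nn.Conv3d(96, 2, kernel_size=3, padding=1),
            nn.MaxPool3d(4),                      # 4^3 -> 1^3, one activation per label
        )

    def forward(self, x):                         # x: (batch, 2, 16, 16, 16)
        logits = self.features(x).flatten(1)      # (batch, 2)
        return torch.softmax(logits, dim=1)       # lesional vs. non-lesional probabilities

# Example: probs = CNNx()(torch.randn(8, 2, 16, 16, 16))  # -> shape (8, 2)
```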

Fig. 1.
figure 1

Top panel: Convolutional network architecture (CNNx) for two-label (lesional vs. non-lesional) classification. Bottom panel: Training and testing schema using two-stage CNNx cascade (CNN1/CNN2).

2.5 Classification Paradigm

Training algorithm.

We used a validation set (75/25 training data split) to optimize the CNN weights. The training set was used to adjust the weights of the neural network, while the validation set measured the performance of the trained CNN after each epoch; training continued until the validation error plateaued. The model was randomly initialized, and the network parameters were learned iteratively via the adaptive learning rate method (AdaDelta) by minimizing the binary cross-entropy loss, defined as:

\( \mathrm{crossentropy}(p, q) = -\big( p \cdot \log q + (1 - p) \cdot \log (1 - q) \big) \), where p is the true (label) distribution and q is the model (predicted) distribution.

Regularization strategies, namely batch normalization (BN) and dropout, were implemented to prevent overfitting to the training data. BN was applied after the first two of the three convolutional layers, and dropout (p = 0.4) was applied before the last layer, randomly deactivating 40% of the units (network connections) at each iteration.
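The training procedure can be sketched as follows, again in PyTorch and assuming the CNNx model and standard (patch, label) data loaders; batch normalization and dropout belong to the model definition (after the first two convolutions and before the last layer, respectively) and are therefore not visible in this loop. The patience value and epoch budget are illustrative.

```python
import torch
import torch.nn.functional as F

def train_cnn(model, train_loader, val_loader, max_epochs=100, patience=5):
    """AdaDelta + binary cross-entropy, stopping when the validation loss plateaus."""
    opt = torch.optim.Adadelta(model.parameters())
    best_val, stall = float("inf"), 0

    for epoch in range(max_epochs):
        model.train()
        for patches, labels in train_loader:
            probs = model(patches)[:, 1]                    # P(lesional)
            loss = F.binary_cross_entropy(probs, labels.float())
            opt.zero_grad()
            loss.backward()
            opt.step()

        # Measure validation loss after each epoch (early stopping on plateau)
        model.eval()
        with torch.no_grad():
            val = sum(F.binary_cross_entropy(model(p)[:, 1], l.float()).item()
                      for p, l in val_loader) / max(len(val_loader), 1)
        if val < best_val - 1e-4:
            best_val, stall = val, 0
        else:
            stall += 1
            if stall >= patience:                           # validation error has plateaued
                break
    return model
```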

Inference/Testing Algorithm.

The proposed pipeline was trained on the S1 cohort of 40 consecutive patients with histologically-confirmed FCD lesions. The trained model cascade then produced probabilistic predictions on unseen datasets acquired at sites S1–S7. For each test subject, the input images were first partitioned into patches, with voxel sampling limited to the FLAIR mask (intra-subject z-score > 0.1). The patch dataset was evaluated by CNN1, which discards improbable lesion candidates. The remaining voxels (probability > 10%) were re-evaluated by CNN2 to obtain the final probabilistic lesion mask. Since the cost of misclassifying a lesion as healthy tissue is severe, we applied a conservative threshold (>10%) to the probabilistic prediction masks. A simple post-processing routine involving successive morphological erosion, dilation, and extraction of connected components (>75 voxels) was executed to remove flat blobs and noise. The final segmentation masks were compared to manual expert annotations of the lesions.
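A minimal SciPy sketch of this thresholding and post-processing step is given below; the structuring elements used for erosion and dilation are assumptions.

```python
import numpy as np
from scipy import ndimage

def postprocess(prob_map, prob_thresh=0.10, min_cluster=75):
    """Threshold the CNN2 probability map and clean it up.

    Successive erosion/dilation removes flat blobs and noise; connected
    components smaller than `min_cluster` voxels are discarded.
    """
    mask = prob_map > prob_thresh                      # conservative >10% threshold
    mask = ndimage.binary_erosion(mask)
    mask = ndimage.binary_dilation(mask)
    labeled, n = ndimage.label(mask)                   # 3D connected components
    sizes = ndimage.sum(mask, labeled, index=range(1, n + 1))
    keep = [i + 1 for i, s in enumerate(sizes) if s > min_cluster]
    return np.isin(labeled, keep)                      # final binary lesion mask
```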

3 Experiment and Results

3.1 Subjects

We studied retrospective cohorts of patients with FCD lesions histologically confirmed after surgery, from seven tertiary epilepsy centers worldwide (n = 107). The presurgical workup included neurologic examination, assessment of seizure history, neuroimaging, and video-EEG telemetry. Since routine MRI was initially reported as unremarkable in 56 patients (52%), the location of the seizure focus in these cases was established using intracranially-implanted electrodes; in these patients, retrospective inspection revealed a subtle FCD in the seizure-onset region.

Training Cohort.

The primary site (S1) comprised 40 patients (20 males, 35 adults; mean ± SD age = 27 ± 9 years).

Independent Testing Cohorts.

Independent test cohorts comprised 67 patients with histologically-confirmed FCD (37 adults and 30 children; mean ± SD age = 33 ± 11 years and 9 ± 6 years, respectively) from a held-out S1 subset and six additional sites with different scanners and field strengths (1.5T, 3T). The control group consisted of 38 healthy individuals (age = 30 ± 7 years) and 63 disease controls with temporal lobe epilepsy (TLE) and histologically-verified hippocampal sclerosis (age = 31 ± 8 years), matched for age and sex to the S1 cohort.

3.2 Performance Evaluation

Evaluation of Classification for S1.

Two experts independently segmented the 40 lesions on co-registered T1w and FLAIR images. The inter-rater Dice agreement index, D = 2|M1 ∩ M2| / (|M1| + |M2|), where M1 and M2 are the two expert labels and M1 ∩ M2 their intersection, was 0.91 ± 0.11. The union of the two ground-truth labels served to train the classifier. The classifier was trained using 5-fold cross-validation repeated 20 times. Sensitivity was the proportion of patients in whom a detected cluster co-localized with the lesion label. Specificity was determined with respect to healthy controls (i.e., the proportion of controls in whom no FCD lesion cluster was falsely identified) and disease controls with TLE. We also report the number of clusters detected in patients remote from the lesion label (i.e., false positives).
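For reference, the Dice index can be computed directly on the binary label volumes; the short NumPy sketch below also notes how the union of the two labels (logical OR) would form the training ground truth.

```python
import numpy as np

def dice(m1, m2):
    """Dice agreement index between two binary labels: 2|M1 ∩ M2| / (|M1| + |M2|)."""
    m1, m2 = m1.astype(bool), m2.astype(bool)
    denom = m1.sum() + m2.sum()
    return 2.0 * np.logical_and(m1, m2).sum() / denom if denom else 1.0

# Ground truth used for training would be the union of both raters' labels:
# ground_truth = np.logical_or(m1, m2)
```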

Evaluation of Classifier Generalizability.

The sensitivity of the classifier trained on S1 was tested on a held-out dataset of eight FCD patients from S1 and 59 independent FCD datasets from S2–S7. For unbiased cross-site reporting of results, blinded to clinical information, the prediction maps (in stereotaxic space) were sent back to the respective sites to confirm or dispute detection of the lesion.

Comparison with a Benchmark Surface-Based Classifier.

We analysed the S1 dataset using a previously published method [6] based on an ensemble of RUSBoosted decision trees across two classification stages, which uses a total of 30 intensity and morphology features calculated on multimodal T1-weighted and FLAIR images. The classifiers were trained using 5-fold cross validation averaged across 10 iterations.

3.3 Results

The 5-fold cross-validation of the CNNs resulted in a sensitivity of 87 ± 4%, with an average of 35/40 lesions detected. In these cases, 2 ± 1 extra-lesional clusters were also detected. Specificity was 95% in healthy controls (3 ± 1 clusters in 2/38) and 90% in TLE (1 ± 0 cluster in 7/63).

For cross-dataset classification at the seven sites, overall sensitivity was 91% (61/67 lesions detected), with 3 ± 2 extra-lesional clusters observed in 47/67 cases. Per-site sensitivity for S1–S7 was 100% (8/8 lesions detected, 2 ± 2 extra-lesional clusters), 86% (17/19, 4 ± 2), 89% (8/9, 2 ± 1), 75% (6/8, 2 ± 1), 100% (5/5, 5 ± 2), 91% (10/11, 2 ± 3), and 100% (7/7, 2 ± 2), respectively. When patients were stratified by age, sensitivity in children (2–18.5 years old) was 90% (27/30 FCD detected, 4 ± 3 extra-lesional clusters), while in adults (>19 years old) it was 92% (34/37, 3 ± 2). Figure 2 shows test case examples.

Fig. 2.
figure 2

Classification results using the cascaded CNNx trained on 40 FCD patients at site S1 (Siemens TrioTim 3T) to demonstrate generalizability for lesion detection along three axes of heterogeneity: scanner type, field strength (top labels), and age (bottom labels). The seven cases obtained using different scanners at six sites (excluding S1) are shown. The top row indicates the strength of prediction overlaid on the FLAIR, while the second/third rows show the corresponding FLAIR and T1w, respectively. The bottom labels are read as site-patient-ID/age/gender. MRI-negative cases are identified with .

Training and testing a surface-based classifier on the S1 dataset yielded lower performance, with a sensitivity of 83 ± 2% (33/40 lesions detected) and 4 ± 5 extra-lesional clusters. Specificity was 92% in healthy controls (1 ± 0 cluster in 3/38).

4 Discussion

We present the first deep-learning method to segment FCD, with multicentric validation. Operating on routine multi-contrast MRI in voxel space, our algorithm provides the highest performance to date. Furthermore, we demonstrated the generalizability of a model trained on a single-site dataset by showing robust performance across independent cohorts from centres worldwide, differing in age, scanner hardware, and sequence parameters. Notably, >50% of the lesions had been missed by conventional radiological inspection.

Operating at two consecutive levels, our classifier achieved both high sensitivity and specificity. The number of false-positive findings in healthy and disease controls was rather modest. Even though our algorithm was trained on an adult dataset, its performance was equally good in children. With respect to the latter, the use of age-appropriate templates accounting for developmental trajectories (i.e., age-varying tissue contrast, white matter myelination, and cortical maturation) likely contributed to the excellent performance by limiting the interpolation effects that would have occurred during registration to an adult template. Moreover, the overall high performance across cohorts strongly suggests that the network learns and optimizes parameters specific to FCD pathology, a point supported by histological confirmation in all cases.

Compared to a state-of-the-art surface-based classifier, both sensitivity and specificity were higher with the current algorithm. Applying a surface-based approach to S2–S7 would have been challenging due to the large variability in image quality, which would require site-specific fine-tuning of algorithm parameters; a comprehensive comparison is part of future work. In addition, surface-based methods demand a considerable time investment to manually correct brain tissue segmentation and surface extraction errors, which may have negative downstream effects on the fidelity of the extracted features; the current approach is therefore both more time-effective and superior in performance.

In conclusion, easy implementation, minimal pre-processing, significant performance gains and inference time of <6 minutes/case make this classifier an ideal platform for large-scale clinical use, particularly in “MRI-negative” FCD.