Introduction

Radiomics is revolutionizing medical image assessment and interpretation, moving from a subjective evaluation to a quantifiable -omics image assessment method [1, 2]. Multiple studies have shown that radiomics provides meaningful information about cancer and correlates with histological and molecular tumor phenotypes, creating opportunities to develop novel predictive and prognostic biomarkers for cancer [3, 4]. The maximum benefit for cancer patients has been shown when tailoring treatments to specific cancer characteristics [5]. Thus, radiomics can play a key role in improving personalized medicine. However, radiomics features are influenced by the image-acquisition technique and the reconstruction parameters [6,7,8,9]. Studies performed at a single institution usually do not account for this source of variability, and then, the results entail low scalability of the signatures for multi-centric applications.

To achieve meaningful generalizable radiomics-based tools, large-scale studies are necessary [10]. These require multicenter data collection, which implies scans acquired with different protocols, particularly when including retrospective data. Different strategies have been followed to minimize the effects of radiomics variability. Aerts et al considered radiomics variability as a feature selection tool by using test–retest analysis, eliminating radiomics features with high variability based on their cohort results [5, 11]. Sun et al introduced the image-acquisition parameters as a confounding variable into the model [3], and Choe et al explored convolutional neural networks-based kernel conversion for reducing radiomics variability [12].

There is an unmet need to establish robust pre- or post-image-acquisition methods for radiomics data harmonization. In this study, we explore the main image-acquisition factors that generate radiomics variability. These variability-causing factors are called “batch effects” [13]. Different batch effect correction techniques have been developed, allowing for genomics and proteomics data harmonization [14]. These batch effect correction techniques aim to remove the variance of the signal caused by the variability between batches to improve the biological signal. Alter et al defined the singular value decomposition (SVD)-based batch effect removal, where the principal components associated with the batch variability are filtered from the data and the matrix is reconstructed without these factors [15]. Johnson et al implemented the ComBat algorithm for batch correction based on an empirical Bayes approach to standardize the means and variances across batches to reduce the batch effect error [16, 17]. However, there is little evidence of the application of these methods towards reducing radiomics variability [18].

In this study, we aim to describe the image-acquisition-based sources of CT-radiomics variability. We also explore the role of image resampling and batch effect as post-image-acquisition correction methods for reducing radiomics variability, thereby improving the classification accuracy of radiomics in phantom and clinical applications.

Materials and methods

Multiple images of a phantom were acquired with different CT-acquisition protocols to explore the radiomics variability and identify the sources of variability according to the CT-acquisition parameters. Then, image post-processing and batch correction methods were implemented to reduce the radiomics variability in phantom and clinical applications. Finally, we explored the improvement of radiomics classification performance by reducing the radiomics variability (Fig. 1).

Fig. 1
figure 1

Methodology flowchart

The retrospective clinical study was approved by the institutional review board. Informed consent for the computational analysis of the CT images was waived.

Phantom image acquisition

The Gammex Model 467 Tissue Characterization Phantom (Gammex RMI) was used to describe intra-scanner variability for different acquisition parameters. This phantom includes a matrix with 13 rods of 33 cm diameter with different density materials simulating human tissues (Supplementary Material 1).

Phantom CT scans were acquired in a 16-channel Philips CT scanner by fixing all the acquisition parameters except the one tested. The tested acquisition parameters included voltage, current, slice thickness, and voxel size (accounting for the slice spacing and the pixel size) with a total of 25 different acquisition protocols (Supplementary Material 2). The minimum and maximum values of the acquisition parameters (voltage, slice thickness, slice spacing, and pixel size) were reconstructed with all the available Philips-specific reconstruction kernels (A, B, C, D, and E) to study the kernel variability with different acquisition parameter sets. The rest was reconstructed with kernel “A,” leading to 53 different protocols (Supplementary Material 3).

Clinical image acquisition

The clinical study included 43 patients (mean [range] age 66.41 [41–77] years; 46.51% [20/43] female, 53.49% [23/43] male) with 97 liver metastases (mean [range] lesions per patient 2.26 [1,2,3,4,5,6,7]) from colorectal adenocarcinoma (53.61% [52/97]) and clear cell renal carcinoma (46.39% [45/97]) [19].

Contrast-enhanced CT scans were collected retrospectively and acquired at Vall d’Hebron University Hospital and Bellvitge University Hospital between November 2013 and September 2019 with two specific acquisition protocols from General Electric and Siemens CT scanners. Additionally, all CT scans from the open-access database The Cancer Genome Atlas Kidney Renal Clear Cell Carcinoma (TCGA-KIRC) [20] were acquired with a General Electric scanner. Only patients with liver metastases from this open-access database were included in the analysis. Detailed information of CT-scan acquisition and reconstruction protocols for each hospital and the TCGA-KIRC database are defined in Supplementary Material 4.

Image processing and radiomics features extraction

In the phantom application, the 13 rods were delineated (including the entire rod) with a semi-automatic contouring function from 3DSlicer v4.8.1 [21], obtaining one volume of interest (VOI) per rod. The same VOI per rod was used to extract the radiomics features from the phantom in all the studied CT-acquisition protocols (Fig. 2a, b). Image registration was not needed, given that the scans were acquired with the same starting and ending position, and the phantom’s position did not change between scans.

Fig. 2
figure 2

Axial CT of the Gammex 467 Tissue Characterization Phantom showing the thirteen tissue and water materials (a) with the segmented volumes of interest (VOI) for the different rod materials (b). Axial enhanced CT of the abdomen showing the target liver metastases (green and red masks) of a patient with clear cell renal carcinoma (c) and a patient with colorectal adenocarcinoma (d)

In the clinical application, all well-defined liver metastases were included in the analyses (Fig. 2c, d). Small metastases (i.e., largest diameter < 1 cm) or with artifacts were excluded. Lesions were delineated with the 3DSlicer v4.8.1 semi-automatic contouring function [21] supervised by a radiologist physician with 10-year experience in oncological imaging.

In the phantom application, radiomics data from images without voxel resampling were extracted to study the impact of image resampling on radiomics data. For batch correction analysis, the images and masks were resampled to isometric voxels of 1 × 1 × 1 mm3 using spline interpolation and nearest-neighbor interpolation, respectively. Image values were discretized to a bin size of 50 HU; afterwards, the CT-radiomics features from the VOIs were extracted. The radiomics features, including first-order and texture analyses, were derived using an in-house program based on the Pyradiomics package for Python [22]. For texture feature extraction, five gray-level matrices (gray-level co-occurrence matrix [GLCM], gray-level dependence matrix [GLDM], gray-level run length matrix [GLRLM], gray-level size zone matrix [GLSZM], and neighboring gray-tone different matrix [NGTDM]) were calculated in three dimensions. Ninety-three radiomics variables were obtained for each VOI, including variables from the first-order histogram and the five gray-level matrices (Supplementary Material 5).

Sources of variability identification

Relevant sources of variability were identified by the intraclass correlation coefficient (ICC) of all the radiomics features between different acquisition parameters accounting for pixel size, reconstruction slice spacing (interpolated from the raw CT-image data without voxel resampling), acquisition slice thickness, convolution kernel, current, and voltage. The CT-acquisition variables that presented less than 80% of robust radiomics features (i.e., less than 80% of the features with ICC > 0.8 [23]) were defined as relevant sources of variability (batches) for further correction.

Techniques for radiomics variability correction

Image resampling

To correct variability from parameters related to voxel size, radiomics data were extracted from images resampled to isometric voxels of 1 × 1 × 1 mm3. Acquisition voxel size variability was analyzed separately by pixel size and slice spacing. Radiomics data of all the phantom materials with different acquisition pixel sizes (0.35 × 0.35, 0.78 × 0.78, and 1 × 1 mm2) and slice spacings (1, 1.25, 2, 2.5, and 5 mm) were included while the rest of parameters remained fixed.

To assess the effect of resampling data on variability correction, the ICCs between groups of different acquisition pixel sizes and slice spacings were computed before and after resampling. Principal component analysis (PCA) was implemented to qualitatively show the reduction of variability on radiomics data variance caused by resampling the acquisition voxel size.

Batch effect removal

To correct variability sources related to image acquisition and reconstruction, two methods of batch correction were applied: singular value decomposition-based (SVD-based) correction [15] and ComBat correction [16]. ComBat correction was applied using the SVA package from R version 3.6.1. [17]. For SVD-based correction, principal components (PC) with higher correlation with batches (i.e., convolution kernel and slice thickness defined as per the ICC analysis) were removed from the PCA space, and the matrix was reconstructed back to the feature space (Supplementary Material 6). ComBat correction with parametric adjustments was applied three times considering the sources of variability as batches (i.e., convolution kernel and slice thickness and the slice thickness–convolution kernel combination).

To evaluate data correction, the ICCs between groups of acquisition parameters were assessed before and after the application of two different batch corrections. Principal component analysis (PCA) was implemented to qualitatively show the reduction of variability on radiomics data variance caused by batch correction techniques.

Two-proportion z test was applied to compare the percentage of robust features before and after correction. The p value threshold for significance was established at 0.05. Adjustment for multiple testing was performed by controlling the false discovery rate at 0.05 according to the Benjamini and Hochberg method.

Validation analysis

Unsupervised K-means clustering purity was used to evaluate the improvement in data classification after batch variability correction with SVD and ComBat. To measure clustering purity, each cluster is assigned to the most frequent class in the cluster. Then, the accuracy is measured by counting the number of correctly classified data in the assigned class. The performance of the unsupervised clustering before and after the implementation of batch correction was analyzed in a phantom and a clinical application. The clustering performance was also analyzed in non-resampled data.

Phantom application

Two similar phantom materials (liver and brain) were included. Batch effect correction by ComBat was applied three times considering different sources of variability as batches: convolution kernel (five batches: A, B, C, D, E), slice thickness (three batches: 2, 3, 5 mm), and the combination of convolution kernel with slice thickness (15 batches; all possible combinations of convolutional kernel and slice thickness).

Clinical application

The clinical application aimed to analyze the performance of clustering different primary tumor types (colorectal versus renal) based on liver metastasis radiomics data. Batch effect correction by ComBat was applied three times considering different sources of variance as batches: manufacturer-dependent convolution kernel (two batches: General Electric and Siemens), slice thickness (four batches: 1.25, 2, 2.5, 5 mm), and the combination of convolution kernel with slice thickness (eight batches; all possible combinations of manufacturers and slice thickness).

The K-means clustering was computed 1000 times, and the highest purity of the clustering appearing on more than 20% of the iterations was chosen for the comparison between the initial data and the data after different batch correction techniques (SVD-based, ComBat) [24].

Results

Population (phantom and clinical applications)

In the phantom application, a total of 53 different CT scans of the 13 phantom materials were acquired in a Big Bore 16 CT scanner (Philips) with different acquisition parameters and reconstruction kernels.

The clinical population included 97 liver metastases from 43 patients. CT scans were retrospectively collected from Vall d’Hebron University Hospital (26/43) and Bellvitge University Hospital (12/43). In addition, five cases from the open-access database TCGA-KIRC were also included. CT images were acquired in CT scanners from two manufacturers: 60.46% (26/43) from Sensation 64 CT scanner (Siemens) and 39.53% (17/43) from Light Speed Pro 16 CT scanner (General Electric) (Table 1).

Table 1 Population description by tumor type (clinical application)

Defined sources of variability

For the phantom data, including all materials, ICCs between the different batches were assessed. Pixel size, slice spacing, slice thickness, convolution kernel, and voltage presented a low percentage of robust radiomics features (i.e., less than 80% of the radiomics features with ICC > 0.8) in at least one of the combinations from the ranging CT-acquisition parameters (Fig. 3).

Fig. 3
figure 3

Intraclass correlation coefficients (ICCs) of the radiomics features of first order and texture matrices (gray-level co-occurrence matrix [GLCM], gray-level dependence matrix [GLDM], gray-level run length matrix [GLRLM], gray-level size zone matrix [GLSZM], neighboring gray-tone different matrix [NGTDM]) between extreme CT-acquisition parameters in the phantom application

Voxel size was defined as a relevant source of variability with a percentage of robust features ranging from 48.40 to 78.49% for pixel size and from 43.01 to 86.02% for slice spacing.

According to the slice thickness, the percentage of robust features ranged from 75.25% (when 2 mm and 5 mm were compared) to 88.17% (when 2 mm and 3 mm were compared). The percentage of robust radiomics features according to the convolution kernel ranged between 55.92% (when A and D were compared) and 97.85% (when A and B were compared).

The acquisition voltage of 90 kV showed the highest variability on radiomics data (65.59% of reproducible features). The standard voltage in clinical protocol range (i.e., 120–140 kV) showed higher radiomics robustness (81.72% of reproducible features). The percentages of robust features are described in Supplementary Material 7.

Therefore, voxel size, slice thickness, and convolution kernel were defined as the sources of variability with the highest impact on radiomics data reproducibility; voxel size was corrected by resampling, whereas slice thickness and convolution kernel were considered for batch correction.

Evaluation of processing effects on radiomics data correction

Image resampling

Voxel size resampling increased the mean percentage of robust features from 59.50 to 89.25% for pixel size and from 71.62 to 82.58% for slice spacing (Fig. 4). The percentage of robust radiomics features (ICC > 0.8) before and after resampling data to isometric voxels of 1 × 1 × 1 mm3 are defined in Table 2.

Fig. 4
figure 4

Principal component analysis (PCA) before and after resampling to 1 × 1 × 1 mm3 voxels of CT images acquired with different pixel sizes (a) and slice spacings (b). PC4 (explaining 7.14% of the radiomics data variance) is associated with the different acquisition pixel sizes before resampling. PC2 (18.42%) is associated with the distribution of the different acquisition pixel heights. After resampling, the acquisition voxel size (accounting for pixel size and slice spacing) is not associated with the variance explained by the PCA

Table 2 Effect of voxel resampling to 1 × 1 × 1 mm3 on the percentage of robust features when comparing radiomics data from phantom CT scans with different voxel sizes. The percentage of robust radiomics features was compared before and after correction using two-proportion z test (p value < 0.05 in italics)

Batch effect removal

Batch effect removal was implemented in all phantom materials based on the previously defined sources of variability (slice thickness and convolution kernel). For SVD-based batch correction, from the principal component analysis (PCA), PC2 significantly associated with convolution kernel (p < 0.001), and PC3 and PC4 significantly associated with slice thickness (p < 0.001); these principal components were removed at the transformed space, and the data matrix was back reconstructed to the feature space. The SVD-based correction technique increased the mean number of robust features (Table 3). Importantly, when analyzing images with different slice thicknesses, the mean number of robust radiomics features increased: from 82.79% without correction to 92.83% with SVD-based batch correction. When analyzing images restructured with different convolutional kernels, the mean percentage of robust radiomics features increased: from 78.45% without correction to 85.25% with SVD-based correction (Table 3, Supplementary Material 8).

Table 3 Percentage of robust radiomics features (ICC > 0.8) without batch correction and after batch correction with singular value decomposition (SVD) and ComBat methods. The percentage of robust radiomics features was compared before and after correction using two-proportion z test (p value < 0.05 in italics)

The ComBat correction technique increased the mean percentage of robust features to 95.34% for slice thickness and to 89.55% for convolution kernel when considering as batches the convolution kernel–slice thickness combination (Table 3, Supplementary Material 8).

Improvement in K-means clustering performance

To test the classification performance based on radiomics features and the potential improvement by reducing radiomics variability by batch correction, a classification of similar density tissues was performed.

Phantom application

The phantom application included liver and brain phantom tissues. SVD-based correction was applied, removing the PC2 (24.84%) and PC3 (13.53%), which associated significantly with slice thickness (p < 0.001), and PC5 (3.21%) that associated significantly with convolution kernel (p < 0.001). ComBat was applied three times for batch correction, defined as convolution kernel, slice thickness, and convolution kernel–slice thickness combination. The K-means purity for tissue classification after SVD and ComBat batch correction is described in Table 4. Importantly, ComBat correction considering convolution kernel–slice thickness combination as batch effects showed the highest clustering purity (85.85%) (Fig. 5).

Table 4 K-means purity for phantom tissue (brain vs liver) and tumor type (colorectal carcinoma vs clear cell renal carcinoma) classification before and after batch correction
Fig. 5
figure 5

Principal component analysis (PCA) of the brain and liver material radiomics distribution before and after convolution kernel–slice thickness ComBat correction. The distance between the radiomics data of the brain and liver materials from CT scans with different acquisitions protocols increases after applying batch correction (i.e., the radiomics distribution better reflects differences between materials and not due to the CT-acquisition parameters)

In the phantom study, the clustering performance from resampled images did not show improvement from non-resampled image data clustering (Supplementary Material 9).

Clinical application

In the clinical application of tumor type classification based on liver metastasis, the SVD was applied removing the PC1 (34.21%) that associated significantly with slice thickness and convolution kernel (p < 0.001). Batch correction by ComBat was applied three times, considering as batches the convolution kernel, slice thickness, and convolution kernel–slice thickness combination.

The K-means improvement for tissue classification after SVD and ComBat batches correction are described in Table 4. Importantly, ComBat correction for the convolution kernel showed the best performance for primary tumor type classification based on radiomics data from liver metastasis (purity = 73.20%) (Fig. 6).

Fig. 6
figure 6

Principal component analysis (PCA) of the liver metastasis radiomics distribution from CT scans of patients with colorectal adenocarcinoma and clear cell renal carcinoma. PCA before and after convolution kernel ComBat correction. The distribution of the groups of patients with different tumor types differs more after batch correction. The first component (PC1 [%]) of data variance can differentiate better between groups (colorectal versus renal) after correction

In the clinical application, the clustering performance from resampled images did not show improvement from non-resampled image data clustering (Supplementary Material 9).

Discussion

The capacity to extract a large amount of valuable quantitative data from medical images, such as CT, is revolutionizing the way medical scans can be evaluated. However, the development of reliable imaging biomarkers requires robust CT-based radiomics data. In this study, we defined the main sources of CT-radiomics variability based on a comprehensive phantom study with multiple CT-acquisition protocols. We also evaluated the influence of image resampling and the effect of radiomics data normalization by means of batch effect correction to reduce the variability and improve the tissue-classification capacity of radiomics in a phantom and clinical application.

We have shown that voxel size, convolution kernel, and slice thickness are relevant sources of variability. The voxel size has the highest impact on radiomics variability, particularly on texture features. Convolution kernel also affects first-order features, which are overall the most robust features regardless of the CT-acquisition protocol. Similarly to Berenguer et al [8], we also found that the radiomics features presented more variability when evaluated in CT scans acquired with low voltage values (90 kV). Importantly, the radiomics features were more robust when the voltage was within the range applied in standard clinical practice (i.e., 120–140 kV).

The pixel size varies in each scan and for each patient due to the changing field of view, limiting the possibility to pre-define this parameter. This study shows that image processing techniques regarding voxel resampling reduce the variability caused by the acquisition voxel size. A possible explanation for this variability decrease could be the resolution homogenization and the smoothness in gray-level transitions in the resampling direction. Therefore, in the z-direction, despite that we cannot restore the missing information in a large voxel size (e.g., 5 mm), when resampling, the texture analysis considers the changes in gray levels inside the original voxel size and avoids abrupt gray-level changes to make them more comparable to the x- and y-directions.

In order to correct the variability caused by the reconstruction kernel and acquisition slice thickness, SVD-based and ComBat batch correction techniques were applied to radiomics data considering both image parameters as batches. In line with Orlhac et al [18], we demonstrate that after applying ComBat, the distribution of the data (using PCA) was modified to differentiate the phantom materials based on radiomics features. In our study, we also show that the reproducibility improves by means of ICCs. However, we aimed to study not only the variability correction by ComBat but also how the tissue-classification performance of radiomics improves after this variability correction. We have shown that ComBat correction for kernel and for kernel–slice thickness combinations in both the phantom and clinical applications outperforms the classification accuracy of radiomics data. The SVD-based correction improved the reproducibility of the radiomics features, although this could have suffered from overcorrection, leading to a loss of biological meaning and decreasing the tissue-clustering accuracy of radiomics.

Deep learning techniques have been developed to reduce radiomics variability by reconstructing images to the same convolution kernels [12]. However, the need for large data sets and the wide variety of intra- and inter-manufacturer reconstruction kernels limits the application of these techniques. ComBat correction can be applied in smaller datasets due to the non-parametric adjusting methods used to correct data variance associated to a particular factor.

The results of our study are promising, but we acknowledge some limitations. First, we implemented a variability correction method in both phantom and clinical applications. The phantom was used to assess the intra-scanner variability from one CT vendor while the clinical application analyzed the inter-scanner variability for ComBat and SVD-based correction. Further studies with intra- and inter-manufacturer CT scans could be performed to extend the application of batch correction methods. Moreover, there are several reconstruction kernels along manufacturers that could be considered comparable, as proposed by Mackin et al [25]. This would reduce the inter-manufacturer variability and would facilitate the definition of batches based on inter-manufacturer similar kernels for large-scale multicenter studies. Second, the radiomics variability correction was clinically tested as a method to improve the tumor type classification. Although this highlights the impact of post-acquisition CT-radiomics normalization by means of batch correction, further applications need to be tested and validated in larger populations. Finally, the described normalization methods have been tested in CT images; it is of interest to test these in multi-image modalities including MRI.

In conclusion, the main sources of CT-radiomics variability are slice thickness and reconstruction kernels. The application of image post-processing and the ComBat correction method minimizes radiomics data variability regardless of the differences in the CT-image-acquisition protocols. These methods are easy to apply to expand the potential of radiomics implementation in new retrospective and prospective multicenter large-scale studies where the variability of the acquisition protocols and scanners is the major limitation.