Introduction

TEM tomography in materials science has become the de facto technique that enables valuable high spatial resolution information on the structure of the materials in 3D1,2,3,4,5. Despite the progress in developing novel methods for acquiring and aligning the tomography tilt series and a broad spectrum of reconstruction algorithms6,7,8,9, semantic segmentation of the large 3D data sets remains a significant bottleneck in 3D analysis. Manual segmentation is a very time-consuming task, relying heavily on the handcrafting skills and expertise of a human operator. A reliable, reproducible, and fully automated segmentation method is in high demand for scaling the 3D data analysis and collecting statistically meaningful information where a single reconstructed volume segmentation involves hundreds of images or more.

Recent advances in deep learning methods have revolutionized the field of computer vision10,11,12,13, and the evolution of these methods has enabled the automatic semantic segmentation of large data sets; otherwise, manual analysis is unfeasible14,15,16. Deep learning pixel-wise classifiers have been successfully applied to many semantic segmentation tasks where complex structures are not easily mapped by simple intensity differences, and boundaries between the image features are not apparent due to the variations in contrast gradients17.

In deep learning, fully convolutional neural networks (FCNs) hierarchically recognize complex features directly from the training data without the additional feature engineering. More recently, FCNs, inspired by large and deep networks, are efficiently trained end-to-end by supervised learning and pixel-to-pixel probabilities computed successfully, thanks to the many advancements in parallel computing18. It has been shown that segmentation results using FCNs can indeed reach a human-level performance, and even on some occasions, exceed that without the post-tuning of the results19,20,21. Today, it is possible to generalize these deep networks with a limited amount of ground-truth data by implementing data augmentation and regularization techniques while mitigating the problem of variance22.

In this paper, we explore the capability of FCNs in the semantic segmentation of a γ-Alumina (Al2O3)/Pt catalytic material. Historically, γ-Alumina has been one of the most used catalytic support materials for noble metals and oxide catalysts employed for reduction, oxidation, and reforming reactions in automotive exhaust control and petroleum refining processes23. γ-Alumina possesses a complex crystalline structure; despite its broad application space, the origin of the catalytic behavior remains actively studied24,25. There is a considerable debate on the role of surfaces of γ-Alumina responsible for both catalytic properties and anchoring of the noble metallic NPs. In addition to the structural complexity and small crystallite sizes, the support material γ-Alumina consists of a dense network of matrix pores, and the degree to which these pores are connected to outside surfaces is of great interest24,25,26,27.

In catalytic materials, the sparse distribution of the noble metallic NPs, over the background and oxide support material introduces an unbalanced representation of the data in TEM images. The extent of the class imbalance problem between the foreground and background of the images has been extensively studied in deep learning-based semantic segmentation approaches28,29,30. It has been shown that the choice of loss function significantly impacts the performance of a semantic segmentation model31. Many recent state-of-the-art applications of FCNs focus on the implementation of weighting strategies coupled with distribution-based or differentiable region-based loss functions for the optimization of the models31,32,33. Even though weighting strategies at the loss function level control the class imbalance, the problem of loss becoming overwhelmed by the number of easy examples during inference remains a challenge in complex multi-class situations34. Moreover, there are limited applications of these strategies in the semantic segmentation of materials science samples, particularly segmentation of the 3D electron tomography reconstructions15,21,35,36. This study presents the first application of deep learning-based multi-class semantic segmentation of large and unbalanced data of 3D tomograms as obtained from HAADF STEM.

Results and discussion

We present a U-Net-based FCN architecture and weighted focal loss as a loss function to overcome the class imbalance problem34,37. The weighted focal loss is a distribution-based loss function, and weighting of the unbalanced data occurs at the loss function level in contrast to data preprocessing strategies. In addition to the weighting, focal loss applies a modulation term to the standard cross-entropy loss and dynamically scales the confidence of the correctly classified examples34. In our experiments, the U-Net architecture equipped with the weighted focal loss facilitated a comprehensive 3D representation of the catalytic material and provided a clear insight regarding the long-standing debate on the characteristics of γ-Alumina surfaces and their relation to the catalytic NPs. We discuss the accuracy of our segmentation results by assessing commonly used semantic segmentation metrics on the overall overlap and boundary match between the ground-truth and predicted segmentations.

To further test our model’s robustness and validity, the best-performing model was deployed on the automatic semantic segmentation of a large data set of reconstructed images. We believe strongly that deep learning-based semantic segmentation methods have immense potential in 3D data analysis and will usher in a new era in materials design and discovery.

Segmentation architecture and inference

For FCNs experiments, we exploit an adopted version of the U-Net architecture, and a schematic of the architecture is shown in Fig. 1. Our network consists of two learning paths, a down-sampling (contraction) path and an up-sampling (expansion) path. There are six convolutional steps in the down-sampling path and five in the up-sampling path. In the down-sampling path, each step has two convolutional layers with a filter size of 3 × 3. The size of the feature maps is halved by the pooling layers following each step. In the up-sampling path, each step starts with a convolutional transpose layer with a filter size of 2 × 2 and a stride of 2, followed by two convolutional layers with a filter size of 3 × 3. The size of the feature maps is doubled, and the number of feature maps is halved at the end of each convolutional step in the up-sampling path.

Figure 1
figure 1

A schematic of the U-Net architecture. The number of feature maps is indicated on the top of each box, and the dimensions of the feature maps are on the bottom left corner. Orange boxes show the contraction path, and black boxes show the expansion path. The gray arrows and boxes indicate the concatenation path at each convolutional step.

In our version of the U-Net, we used the ‘same’ padding in the convolution layers followed by an average pooling layer for down-sampling. Using the ‘same’ padding resulted in the output of the convolution layers being the same size as the input layers. The pooling operation plays a vital role in the flow of information through the convolutional layers and defines the model’s sensitivity to details. In our architecture, average pooling is preferred over max pooling for down-sampling to reduce the spatial information loss at the feature boundaries and prevent excessive pixel saturation.

Concatenation paths (skip connections) give the network a well-known ‘U’ shape pattern and link the high spatial information from down-sampling convolutional layers to the up-sampling convolutional layers. The network hierarchically learns the contextual information and fine details in the predicted images. We did not see a significant performance improvement in our model with the addition of dropout layers; instead, we employed a comprehensive data augmentation strategy for regularization. We used a rectified linear unit (ReLU) activation function for hidden layers and softmax for the output convolutional layer with final feature maps of 3.

As described previously, we investigate the complex bulk and surface structure of the γ-Alumina/Pt catalysts. Our segmentation model aims to distinguish the two-phase microstructure of the γ-Alumina and Pt NPs, as well as the pores. A correct classification of the surfaces, defined by the γ-Alumina—background and γ-Alumina—Pt—background boundaries, is mainly of high interest. In the tomography reconstructions, γ-Alumina and the background constitute the most significant fraction of the reconstructions, compared with the sparsely distributed nanoscale Pt particles. When there is an imbalance in the data representation, the learning algorithm can be biased towards the dominant class, represented densely by the higher number of pixels. In order to address this problem, we used weighted focal loss instead of the standard cross-entropy to minimize the loss34,38. Weighted focal loss is a differentiable modification of the cross-entropy loss term and addresses the class imbalance problem in two ways. Firstly, it shifts the focus from easy-to-classify pixels towards hard-to-classify pixels by extending the range in which each pixel receives loss. This is achieved by scaling the cross-entropy loss by a focusing parameter and preventing the loss from being overwhelmed by the easy pixels. Secondly, the role of the weighted focal loss is to provide a balancing action on the class imbalance problem by adjusting the contribution of each class to the loss function by a weighting factor. Categorical focal loss (LFL) is defined as the following in a multi-class problem:

$$\begin{array}{*{20}c} {L_{FL} (y,\;p) = - \frac{1}{N}\mathop \sum \limits_{i = 1}^{N} \mathop \sum \limits_{c = 1}^{C} \alpha_{c} y_{i,c} (1 - p_{i,c} )^{\gamma } \log p_{i,c} } \\ \end{array}$$
(1)

where yi,c and pi,c are the ground-truth and prediction probabilities of class c at pixel location i. Parameters C and N are the number of classes and pixels, respectively. \(\alpha_{c}\) is the weighting factor for class \(c\) and γ is the focusing parameter. Both focusing parameter and weighting factor are tunable hyperparameters. In our experiments, \(\alpha_{c}\) values were approximated on the density of representation of each class at the range from 0 to 1, and γ was set to 1.

Evaluation metrics

Evaluation metrics play an essential role in proving the network’s performance and thus establishing the model for automatic semantic segmentation. In this work, predictions were assessed using four commonly used semantic segmentation metrics: Dice similarity coefficient (DSC), recall, precision, and Hausdorff distance (HD)39. DSC, recall, and precision scores are similarly extracted from the confusion matrix and defined as:

$$\begin{array}{*{20}c} {DSC = \frac{2TP}{{2TP + FP + FN}}} \\ \end{array}$$
(2)
$$\begin{array}{*{20}c} {Recall = \frac{TP}{{TP + FN}}} \\ \end{array}$$
(3)
$$\begin{array}{*{20}c} {Precision = \frac{TP}{{TP + FP}}} \\ \end{array}$$
(4)

where true positives (TP), false positives (FP), and false negatives (FN) represent per pixel classifications of the confusion matrix. In this study, a 3 × 3 confusion matrix was calculated from each ground-truth and predicted segmentations at pixel resolution of 0.12 nm/pixel using the validation data set. Average DSC, recall, and precision scores were reported using a mean ± 95% confidence interval.

We also evaluated our semantic segmentation results by measuring the dissimilarities specifically at the segmentation boundaries. Hausdorff distance is a boundary distance-based metric and measures the largest segmentation error in the overlap between the ground-truth and predicted segmentations40. Given two sets of points A and B, Hausdorff distance is defined as:

$$\begin{array}{*{20}c} {HD \;(A, \;B) = \max (hd(A, \;B), \;hd(B, \;A))} \\ \end{array}$$
(5)

where \(hd(A, \;B)\) and \(hd(B, \;A)\) are directed Hausdorff distances:

$$\begin{array}{*{20}c} {hd(A,\;B) = \mathop {\max }\limits_{a \in A} \mathop {\text{min }}\limits_{b \in B} \left\| {a - b} \right\|} \\ \end{array}$$
(6)
$$\begin{array}{*{20}c} {hd(B,\;A) = \mathop {\max }\limits_{b \in B} \mathop {\text{min }}\limits_{a \in A} \left\| {a - b} \right\|} \\ \end{array}$$
(7)

Functions \(hd(A, \;B)\) and \(hd(B, \;A)\) measure the distances between two points in A and B, which are farthest from any nearest neighbors, and \(HD (A, \;B)\) gives the largest of these distances. \(\left\| {a - b} \right\|\) is the Euclidean norm between the points in \(A\) and \(B\). A well-documented behavior of the Hausdorff distance is its sensitivity to outliers and noise41; thus, we report robust HD (RHD) values considering the percentile of the largest segmentation errors and as well as the maximum HD42,43. We aim to down-weight the impact of outliers and noise on the HD metric by measuring the RHD values.

Classification of the 3D reconstructions

Segmentation of the HAADF STEM tomography reconstructions is a challenging task due to information loss from the insufficient number of projections (i.e., missing wedge artifacts) and variations in the contrast and size of the features in tilt images. A series of representative orthogonal slices (orthoslices) taken from the reconstructed 3D volume of an isolated γ-Alumina/Pt particle is shown in Fig. 2a–c. In the orthoslices, the 3D microstructure of the particle is sectioned perpendicular and parallel to the broad surfaces of γ-Alumina. At first glance, we notice the significant contrast difference between the oxide γ-Alumina and metallic Pt NPs, where Pt NPs appear much brighter than both γ-Alumina substrate and background in the reconstructions. Pt NPs mainly exhibit round shapes, and their size distribution is in the 1–4 nm range, while γ-Alumina particles have a thin plate-like structure and a roughly rhombus shape. Some of the very small Pt NPs show an elongated shape because of the well-documented missing wedge artifact in TEM tilt tomography2, as seen in Fig. 2c.

Figure 2
figure 2

Orthogonal slices are taken from the 3D HAADF STEM reconstructions of an isolated γ-Alumina/Pt particle. (a) A representative section parallel to the broad surface of γ-Alumina, axial slice parallel to XY- plane, (b,c) are axial slices parallel to YZ and XZ-planes, respectively. The large particle with dimmer contrast is γ-Alumina, marked with a gray arrow. Smaller round particles with brighter contrast are Pt NPs, and a few examples are marked with white arrows. Pores are marked with yellow arrows.

In γ-Alumina, the crystallographic shape of the particle is defined by the {110} and {111} type surfaces, and the main ‘broad’ surface is {110} orientation, and side surfaces are {111}. Furthermore, synthesized γ-Alumina particle shows small ledges and facets on the surfaces and a network of high-density pores inside the matrix.

To evaluate the performance of our model in the segmentation task, we compared the results obtained from the validation data. Figure 3a,b, visualize results of exemplar qualitative segmentations from γ-Alumina and Pt NPs, respectively. The images represent two 512 × 512 pixels patches from the validation data. For each patch of the validation data, ground-truth and predicted segmentations of γ-Alumina and Pt NPs are compared separately in the binary images. The differences in the segmentations are shown more explicitly in the false negative and false positive maps highlighting the discrepancies in the overlap of each class.

Figure 3
figure 3

A set of ground truth and predicted segmentations of γ-Alumina and Pt NPs from the validation data and corresponding false negative and false positive maps show the discrepancies in the overlap of each class. The images in (a) and (b) represent two 512 × 512 pixels size patches from the validation data set; for illustrative purpose, each set of images demarcated with a black frame. Upper rows are segmentation results from the γ-Alumina particle and lower rows are from the Pt NPs of the same patch.

When considering the false positive maps of the γ-Alumina segmentations, we notice that γ-Alumina boundaries are relatively shifted towards the pores inside the γ-Alumina matrix. In contrast to γ-Alumina/pore boundaries, most misclassifications near the surfaces are discontinuous. False negative and false positive regions are extended only a few pixels wide towards either γ-Alumina or background. Moreover, the boundaries in the predicted segmentations, both on the surfaces and along the γ-Alumina/pore boundaries, are smoother than the ground-truth segmentations.

We further discuss these observations in the context of the evaluation metrics, and the results are shown in Table 1. The overlap performance between the ground-truth and predicted segmentations was evaluated using the validation data set consisting of 66 isolated Pt NPs with an average diameter of 16 pixels, and 70 γ-Alumina matrix pores with an average diameter of 37 pixels. As expected, shifting of the γ-Alumina boundaries, mainly towards the pores, is reflected in the evaluation results such that the average precision score is lower than the average recall score for γ-Alumina. The average precision and recall scores are 0.95 ± 0.008 and 0.97 ± 0.004 (mean ± 95% confidence interval), respectively. In contrast to γ-Alumina, we observed fewer false positives in the segmentations of Pt NPs. Still, there are some missing Pt NPs in the predictions and corresponding misclassified regions seen in the comparison of the Pt NPs segmentations. Compared with γ-Alumina, the average precision score of Pt NPs is higher than the recall score. The average precision and recall scores are 0.92 ± 0.03 and 0.78 ± 0.04, respectively, for the segmentation of Pt NPs.

Table 1 Evaluation results from the validation data set. All the values are in the form of mean ± 95% confidence interval.

One explanation for the extension of the γ-Alumina matrix towards the pores inside may be the contrast modulations along the diffuse boundaries between the γ-Alumina and pores. Uncertainty in the contrast of these boundaries can potentially fuel ambiguity in the manual annotations of the ground-truth segmentations. Still, a precise annotation of these low contrast boundaries can be a challenging task even for a human expert.

Another approach would be generating more annotated data to elevate the model’s performance. However, this would be computationally costly and would require additional manual annotations. It is also worth mentioning that it is inevitable to have false negatives and false positives during inference. Ideally, we would target a trade-off between precision and recall. Nevertheless, a comparison of the ground-truth segmentations with the predicted segmentations shows a strong correlation, especially in the complex surface structure of the γ-Alumina and the appearance of the Pt NPs. The overall similarities in the size and shape of the Pt NPs between the ground-truth and predicted segmentations also suggest that the model has convincingly managed the class imbalance problem without a significant underestimation. The overall segmentation performance measured by the DSC score for each class is 0.99 ± 0.002 for background/pores, 0.96 ± 0.003 for γ-Alumina, and 0.84 ± 0.03 for Pt NPs. Here, we report the background DSC score since the pores inside γ-Alumina are associated with the background class.

To investigate our segmentation results further, we conducted measurements on the degree of boundary match using the HD metric. This is of particular interest for detecting the model’s performance correctly learning the boundaries in each class. The HD metric is a powerful tool in measuring the largest segmentation error in the overlap between the two segmentations, while the DSC score is an overlap metric affected by the segmentation performance over the entire image. Figure 4 shows the trend in the average HD of γ-Alumina and Pt NPs segmentations at the various percentiles of robust HD (RHD) values. Measurement of the RHD provides a unique opportunity to understand the contributions of the outliers and noise to the model performance while guiding the degree of boundary match.

Figure 4
figure 4

A plot of robust Hausdorff distance (RHD) vs. percentile of the largest segmentation error. Data points are in the form of mean ± standard error.

As seen in Fig. 4, RHD values fall sharply from a maximum HD of 5.45 ± 1.78 nm (mean ± standard error) for γ-Alumina and 3.70 ± 1.84 nm for Pt NPs. At RHD95, HD between the ground-truth and predicted segmentations decreases to 1.45 ± 0.68 nm for γ-Alumina and 3.48 ± 1.80 nm for Pt NPs, and at RHD90, the largest segmentation errors are less than 2 nm (0.59 ± 0.05 nm for γ-Alumina and 1.81 ± 1.05 nm for Pt NPs). The variations in the RHD values show that the largest segmentation error for the Pt NPs trend higher than for the γ-Alumina, which is consistent with the overall lower evaluation scores observed for the Pt NPs, particularly the lower average recall score. Yet, this analysis suggests that our model learned reasonably well to classify the pixels at the boundaries, especially those of the γ-Alumina in addition to the matrix regions.

Based on the evaluation results, we deployed the best-performing model for the automatic semantic segmentation of a large volume of 3D reconstructions. Figure 5 compares the example reconstructions extracted at various locations along the 3D volume of the γ-Alumina/Pt particle and corresponding predicted segmentations by the model. The model has not seen these representative reconstruction slices during the training and validation steps. In the predicted segmentation, the pixels classified as γ-Alumina are denoted in gray, Pt in white, and background/pores in black. Overall, there is a good correspondence between the reconstruction slices and predicted segmentations. The location, shape, and size of the Pt NPs correlate with the test images and the texture of the pores inside the γ-Alumina matrix. One striking observation is that the catalytic Pt NPs are associated with the crystallographic modulations on the broad {110} surfaces of the γ-Alumina particle. Most of the Pt NPs are found at the apex of the two {111} type facets, as visualized explicitly in the zoomed-in images.

Figure 5
figure 5

Comparison of the reconstruction slices (3D reconstructions) and predicted segmentations of γ-Alumina catalytic support material and Pt NPs. Reconstruction slices were extracted at various locations along the 3D volume.

Our model aims to accurately segment the complex surface and bulk microstructures of the γ-Alumina particle and Pt NPs with a limited amount of annotated ground-truth data. The results presented establish that the U-Net model with a weighted focal loss provides a stable model for the multi-class semantic segmentation of a large data set of 3D reconstructions in a severe class imbalance situation. Segmentation results establish a basis for the quantification of critical microstructural parameters, including 1) quantification of external surfaces in terms of their general area and proportion of individual facets, 2) quantification of volume fraction of pores, and their surface area, 3) quantification of Pt particles and their attachment to Al2O3. While this topic will be the focus of our future work, an essential qualitative assessment of the γ-Alumina surfaces and geometry of the Pt NPs can be obtained by transforming the stack of predicted segmentations into 3D visualizations. Figure 6a–d shows 3D volume visualization of γ-Alumina and Pt NPs from different viewpoints, and surface contour maps in light gray represent γ-Alumina and red Pt NPs. 3D volume visualizations show that {110} surfaces of γ-Alumina are not atomically flat; instead, they form a series of periodically repeating structural facets. These facets are mostly terminated towards the center of γ-Alumina, and Pt NPs are anchored along with the {111} type facets rather than randomly distributed on the surfaces. Surprisingly, matrix pores are aligned along the direction of the surface facets, as seen in Fig. 6d.

Figure 6
figure 6

3D visualization of γ-Alumina/Pt catalytic particle from different viewpoints. Pt NPs are colored in red, while γ-Alumina support material is colored in light gray. (ac) Showing periodic facets on the {110} type broad surfaces of γ-Alumina along with Pt NPs. (d) Showing the pores inside the γ-Alumina matrix in a 3D transparent view. 3D visualizations were generated using open-source Paraview47 software version 5.9.1 (http://www.paraview.org).

We have demonstrated the effectiveness of a deep learning model in multi-class semantic segmentation of large and unbalanced data. This work was in many respects exploratory as a proof of concept, with a focus mainly on model performance. The current model is naturally limited in its applicability. Alumina-based catalysts with various morphologies and particle contrasts will require additional adaptations to the current model. We intend to continue to utilize this U-Net model in a transfer learning environment, incorporating the broad morphological and size variation of Alumina-based catalysts and eventually with general catalyst systems to achieve wider applicability. With recent advances in automatic data collection, deep learning-assisted semantic segmentation is genuinely expected to broaden the field of STEM tomography for routine quantitative measurement of catalysts on a statistically relevant scale, which is not possible currently.

Methods

3D tomography data acquisition and visualization

Our main goal is to assess the effectiveness of a deep learning-based approach in semantic segmentation of the 3D HAADF STEM tomography reconstructions while achieving a full 3D view of the γ-Alumina/Pt catalytic material. For this purpose, we conducted HAADF STEM tomography experiments on a well-isolated γ-Alumina/Pt catalytic particle. TEM samples were prepared by dropping a solution containing well-dispersed NPs on a lacey carbon film. A detailed description of the synthesis of γ-Alumina/Pt material was reported in an earlier paper24. A probe aberration-corrected 300 kV Thermo Fisher Scientific Titan S/TEM microscope was used to acquire the HAADF STEM tilt series. HAADF STEM images were acquired at the detector inner collection angle of 40 mrad, beam current of 20 pA with a 0.1 nm probe size, an accelerating voltage of 200 kV. To extend the depth of focus of the electron beam during STEM tomography acquisitions, the convergence angle of the illumination system was adjusted to 10 mrad using the three-condenser lens optics of the microscope. Tomography tilt series consists of 69 HAADF STEM images acquired at the tilt range of ± 68° and tilt increment of 2°. Post-processing of the tilt series was conducted using open-source resources; a Python programming script based on the tomviz software used for the image shift and tilt alignments44, and TomoPy and ASTRA Toolbox Python libraries for the maximum likelihood expectation maximization (MLEM) reconstructions45,46. Paraview software was employed to generate 3D visualizations from the fully segmented 3D reconstructions47.

Training and optimization

For optimization, we used a mini-batch gradient descent with a batch size of 2 and Adam optimizer at a learning rate of 0.0005. We used default parameters from the Tensorflow deep learning framework for the first and second moments of gradient averaging and updating48. All the weights were initialized by “He normal” kernel initialization49, and all the biases were initialized at 0. A total of 30 ground-truth images were selected from the 3D reconstructions, and corresponding ground-truth segmentations were manually annotated for the training and validation steps. A class label representing background/pores, γ-Alumina and Pt NPs, were assigned to each pixel in the ground-truth segmentations. Due to the large image size and limited GPU memory availability, 1024 × 512 pixels (0.12 nm/pixel) ground-truth images and segmentations were divided into 512 × 512 pixels patches. This data was then randomly split into 75% training and 25% validation data sets. The average pixel density of each class in the patches is 77.7% for background/pores, 21.8% for γ-Alumina, and 0.5% for Pt NPs. Data augmentation is crucial to teach the network a robust invariance to input data and generalize the model. We used rotation, vertical and horizontal flip, zoom, and shear transformations during training to generate a diverse range of images representing variations in the location and shape of the features. The best-performing model was selected based on the evaluation performance and applied to the automatic segmentation of a stack of 702 3D reconstructions. We employed a smooth blending approach to form final predictions where 512 × 512 pixels size segmented patches were smoothly merged into 1024 × 512 pixels size final predictions using spline interpolations between the overlapping patches50.