
Digital refocusing based on deep learning in optical coherence tomography

Open Access

Abstract

In this paper, we present a deep learning-based digital refocusing approach to extend the depth of focus of optical coherence tomography (OCT). We built pixel-level registered pairs of en face low-resolution (LR) and high-resolution (HR) OCT images from experimental data and introduced the receptive field block into a generative adversarial network to learn the complex mapping between LR-HR image pairs. Results on phantom and biological samples demonstrate that the lateral resolution of OCT images is clearly improved over a large imaging depth. We believe deep learning methods have broad prospects in optimizing OCT imaging.

© 2022 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Because it is non-destructive and non-invasive, optical coherence tomography (OCT) has been successfully used in ophthalmology [1,2], cardiology [3], gastroenterology [4], and dermatology [5,6] since it was proposed in 1991 [7]. As with any medical imaging technology, higher resolution is continually pursued in order to visualize the cellular and subcellular features of biological samples. OCT is a three-dimensional (3D) imaging technology whose axial and lateral resolutions are determined by different factors, which makes it possible to improve them separately. To date, several hardware methods [8-10] have been proposed to improve the axial resolution to the micron level.

Unlike the axial resolution, the lateral resolution is determined by the diffraction-limited spot size of the focused sample beam. Generally, a high lateral resolution causes the spot size to increase sharply beyond the focal plane, sacrificing depth of focus (DOF) and rapidly degrading the lateral resolution at depths away from the focal plane. Improving DOF and lateral resolution simultaneously has long been a dilemma.

Typical hardware methods use a binary phase spatial filter [11] or an axicon [12] to generate a Bessel beam that extends the DOF. In 2017, E. Bo et al. [13] used multiple aperture synthesis to extend the DOF, and H. Pahlevaninezhad et al. [14] introduced a nano-optic endoscope for image super-resolution (SR) with a large imaging DOF. However, hardware methods usually require complicated imaging systems. Beyond these methods, the multi-frame super-resolution technique has also been used to improve the lateral resolution [15], but it requires long imaging times.

Compared to hardware-based methods, signal-processing methods provide alternative and relatively inexpensive solutions. For example, interferometric synthetic aperture microscopy (ISAM) [16] was proposed to perform high-resolution (HR) imaging over a large depth range, and a finite impulse response (FIR) filter [17] was used to improve the lateral resolution of OCT, but both require phase stability during image acquisition.

Recently, with the rise of deep learning, neural networks have attracted considerable attention for improving the resolution [18,19] and extending the DOF [20,21] of microscopy, and have been explored to enhance the axial resolution of OCT [22,23]. Y. Huang et al. applied deep learning to the denoising and super-resolution of OCT images, obtaining LR images by data down-sampling [24]. K. Liang et al. enhanced the resolution and recovered realistic speckle of OCT images [25]. Datasets with a correct low-resolution (LR) to HR mapping are critical for training appropriate neural network models and producing realistic SR OCT images. Intrinsically co-registered pairs of low- and high-resolution data can be produced by windowing [22,23,25] or averaging [25] the interference spectra. However, such simulated data cannot fully and correctly express the characteristics of OCT images, especially en face OCT images, whose lateral resolution varies with depth and whose texture and noise distributions are difficult to simulate.

In this paper, we designed a framework for the collection and pixel-level registration of en face LR and HR OCT image pairs. Based on the experimental data, a generative adversarial network (GAN) was trained with the finely registered image pairs, and the lateral resolution of defocused images was improved by the trained network. The experimental results on phantom and biological samples demonstrate that deep learning-based digital refocusing has great potential for extending the DOF of OCT images. The present approach requires no hardware modifications to the OCT system and can be easily applied to various systems at low cost.

2. Methods and materials

2.1 Image collection and registration

As the defocus distance increases, the lateral resolution gradually deteriorates. We collected pairs of en face LR images with different resolutions and HR images at the same position of the sample; the collection scheme is shown in Fig. 1. Figure 1(a1) is used to collect HR en face images, where the incident beam from the objective lens is focused in the sample. As indicated by the dashed line in Fig. 1(a1), en face images collected within the DOF are regarded as HR images. As shown in Figs. 1(b1) - (e1), by gradually raising the sample stage, the incident beam is focused at different depths, so the same position of the sample (dashed line) is imaged with different resolutions; these en face images are regarded as the defocused LR images. A set of example images is shown in Figs. 1(a2) - (e2), where Fig. 1(a2) is the focused HR image and Figs. 1(b2) - (e2) are four LR images of the same position with four defocus distances. As shown in these images, the structure of the sample becomes increasingly blurred as the defocus distance increases.


Fig. 1. The collection scheme of en face images with different resolutions. (a1) is used to collect en face focused HR images, where the focused plane of the incident beam is at the dashed line. (b1) - (e1) are used to collect en face defocused LR images with different resolutions, where the incident beams are focused at four different depths of the sample, respectively. (a2) - (e2) are en face focused HR images and four LR images of the same position with four defocus distances, respectively.


Considering the possible misalignment between en face images of different resolutions caused by bio-sample movement and shrinkage during imaging at different times, registration must be performed. We developed a registration framework suitable for defocused and focused OCT en face images. The steps are shown in Fig. 2. There are three main steps: selecting image pairs, registering image pairs, and optimizing the quality of the registered HR image.


Fig. 2. The framework of image registration.


To select appropriate image pairs, pre-registration between an LR image and each of the unprocessed HR (UHR) images within the DOF was first carried out with an affine transformation, and the most appropriate image pair was then selected by calculating the correlation coefficients between the pre-registered images. The image pair with the largest correlation coefficient was taken as the best pair.
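As a rough illustration of this selection step, the Python sketch below pairs an LR image with the best-matching UHR frame; it substitutes a translation-only phase-correlation alignment for the affine pre-registration described above, and select_best_hr is a hypothetical helper, not the code used in this work.

import numpy as np
from scipy.ndimage import shift as nd_shift
from skimage.registration import phase_cross_correlation

def select_best_hr(lr_img, uhr_stack):
    """Pick the UHR frame (from within the DOF) that best matches an LR image.

    Hypothetical helper: the paper pre-registers with an affine transform;
    here a translation-only alignment via phase correlation stands in for it.
    Returns the index of the best-matching UHR frame and its aligned copy.
    """
    best_idx, best_corr, best_aligned = -1, -np.inf, None
    for i, uhr in enumerate(uhr_stack):
        # Coarse alignment (translation only in this sketch).
        dy_dx, _, _ = phase_cross_correlation(lr_img, uhr, upsample_factor=10)
        aligned = nd_shift(uhr, dy_dx, order=1, mode="nearest")
        # Correlation coefficient between the LR image and the aligned UHR frame.
        corr = np.corrcoef(lr_img.ravel(), aligned.ravel())[0, 1]
        if corr > best_corr:
            best_idx, best_corr, best_aligned = i, corr, aligned
    return best_idx, best_aligned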

The pyramid registration algorithm [19] was then used to perform fine multi-scale registration on the pre-registered image pair:
(1) Both images were divided into N×N blocks.
(2) The 2D cross-correlation map of each pair of corresponding blocks from the two input images was calculated.
(3) The displacement of each block was taken as the peak coordinates of a 2D Gaussian function fitted to its cross-correlation map.
(4) The block displacements were interpolated to the image size to form a translation map, which was then applied to the pre-registered HR image.
(5) If the highest value of the translation map (the shift error) was larger than or equal to the tolerance, steps (2) - (4) were repeated. Otherwise, the block size was compared with the set minimum; if it was not smaller than the minimum, N was increased to N+2 and steps (1) - (5) were repeated.
When these conditions were met, the fine registration was complete, and the pairs of registered HR (RHR) and LR images reached pixel-level registration.
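The sketch below illustrates one level of this block-wise registration under simplifying assumptions: sub-pixel phase correlation replaces the 2D Gaussian fit to the cross-correlation peak, and the function names are hypothetical.

import numpy as np
from scipy.ndimage import map_coordinates, zoom
from skimage.registration import phase_cross_correlation

def block_displacement_field(fixed, moving, n=4):
    """One level of the block-wise registration: estimate a per-block shift
    and upsample it to a dense translation map. Simplified sketch; the paper
    fits a 2D Gaussian to the cross-correlation peak, here sub-pixel phase
    correlation is used instead."""
    h, w = fixed.shape
    bh, bw = h // n, w // n
    dy, dx = np.zeros((n, n)), np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            fb = fixed[i*bh:(i+1)*bh, j*bw:(j+1)*bw]
            mb = moving[i*bh:(i+1)*bh, j*bw:(j+1)*bw]
            (dy[i, j], dx[i, j]), _, _ = phase_cross_correlation(fb, mb, upsample_factor=10)
    # Interpolate the block shifts to a dense pixel-wise translation map.
    dy_full = zoom(dy, (h / n, w / n), order=1)
    dx_full = zoom(dx, (h / n, w / n), order=1)
    return dy_full, dx_full

def warp_by_field(img, dy_full, dx_full):
    """Apply the dense translation map to the pre-registered HR image."""
    yy, xx = np.mgrid[0:img.shape[0], 0:img.shape[1]].astype(float)
    return map_coordinates(img, [yy - dy_full, xx - dx_full], order=1, mode="nearest")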

In addition to pixel-level registration, a dataset with favorable image quality is also essential for deep learning-based SR. It must be noted that all interpolation transformations were performed on the HR images, which may blur them and induce a loss of high-frequency information. The UHR and RHR images were therefore set up as a training dataset, and a neural network was trained to optimize the quality of the RHR images. We used the same network structure as described in Section 2.2 and only 20,000 iterations. During training, RHR images were input into the network, and their corresponding UHR images were the target images. With this small number of iterations, the network learns the feature correspondence between images but not the pixel-to-pixel matching, so the trained network can compensate for the high-frequency information lost in the interpolation-based registration without disturbing the pixel alignment. Finally, the RHR images were input into the trained network again to generate optimized RHR images, which were regarded as the ground truth (GT) images.

Finally, the LR-GT image pairs were used to train the GAN for digital refocusing. GT images in the test dataset were also used for comparison with the SR images after digital refocusing. Corresponding to LR images with different defocus distances, multiple training datasets were established by registering en face LR-HR image pairs.

2.2 Digital refocusing of en face images

A GAN has strong learning ability and can be used to learn the end-to-end mapping between LR and HR images directly. In this study, we built a GAN to carry out digital refocusing of en face images at different depths along the axial direction. A GAN is composed of a generator and a discriminator. The training process is shown in Fig. 3. GT and LR images from the training datasets are input to the GAN. As shown in Fig. 3(a), an LR image is input into the generator to produce an intermediate image, which is received by the discriminator together with the paired GT image. The discriminator then determines whether the intermediate image is as realistic as the GT image. The output loss is used to update the generator and discriminator. The weight update of the discriminator depends only on the adversarial loss, while that of the generator is driven by a combination of pixel loss, perceptual loss, and adversarial loss, whose settings are the same as in Ref. [26]. This cycle is repeated until the training is complete.
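A minimal PyTorch sketch of one training step with this loss split is given below; generator, discriminator, and perceptual (e.g., a frozen VGG feature extractor) are assumed to exist, and the loss weights are placeholders rather than the values of Ref. [26].

import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()
l1 = nn.L1Loss()

def train_step(lr_img, gt_img, generator, discriminator, perceptual, opt_g, opt_d,
               w_pix=1e-2, w_per=1.0, w_adv=5e-3):
    sr = generator(lr_img)

    # Discriminator update: adversarial loss only.
    opt_d.zero_grad()
    d_real = discriminator(gt_img)
    d_fake = discriminator(sr.detach())
    loss_d = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
    loss_d.backward()
    opt_d.step()

    # Generator update: pixel + perceptual + adversarial losses.
    opt_g.zero_grad()
    d_fake = discriminator(sr)
    loss_g = (w_pix * l1(sr, gt_img)
              + w_per * l1(perceptual(sr), perceptual(gt_img))
              + w_adv * bce(d_fake, torch.ones_like(d_fake)))
    loss_g.backward()
    opt_g.step()
    return loss_g.item(), loss_d.item()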


Fig. 3. The training process and the architecture of GAN. (a) Training process of GAN. (b) The architecture of the generator. (c) The architecture of discriminator.


The detailed structure of the GAN is shown in Figs. 3(b) and 3(c). To learn the mapping relationship between LR-HR image pairs, we used the residual-in-residual dense block (RRDB) [26], which includes residual networks and dense connections, as the basic structure of the generator. Since digital refocusing needs to process LR images of different resolutions, and inspired by the structure of the receptive field in the human visual system, we introduced the receptive field block (RFB) [27,28] into the dense block to combine multiple branches with different kernels and dilated convolution layers, enabling the network to learn multi-scale features. The RFB structure is shown in Fig. 3(b); it uses convolutional layers with small kernels (1×1, 1×3, 3×1, 1×3, and 3×3), where d denotes dilated convolution and α is the rescaling factor. The images are first input into a convolutional layer, and multi-scale features are then extracted by 12 RRDB blocks and 8 receptive-field-block-in-residual-in-residual-dense-block (RFB-RRDB) blocks. Each RFB-RRDB is composed of receptive-field-block-in-dense-block (RFB-Dense) blocks. The features are finally reconstructed into an image through two convolutional layers and a Leaky ReLU (LReLU) [29].
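The following PyTorch sketch shows the idea of a multi-branch receptive field block with dilated convolutions and a rescaling factor α; the branch layout and channel split are simplified assumptions and do not reproduce the exact RFB of Refs. [27,28].

import torch
import torch.nn as nn

class SimpleRFB(nn.Module):
    """Simplified receptive field block: parallel branches with different
    kernel sizes and dilation rates, concatenated, fused, and added back to
    the input with a rescaling factor alpha (sketch only)."""
    def __init__(self, ch, alpha=0.2):
        super().__init__()
        self.alpha = alpha
        self.branch1 = nn.Sequential(
            nn.Conv2d(ch, ch // 4, 1),
            nn.Conv2d(ch // 4, ch // 4, 3, padding=1, dilation=1))
        self.branch2 = nn.Sequential(
            nn.Conv2d(ch, ch // 4, 1),
            nn.Conv2d(ch // 4, ch // 4, (1, 3), padding=(0, 1)),
            nn.Conv2d(ch // 4, ch // 4, 3, padding=3, dilation=3))
        self.branch3 = nn.Sequential(
            nn.Conv2d(ch, ch // 4, 1),
            nn.Conv2d(ch // 4, ch // 4, (3, 1), padding=(1, 0)),
            nn.Conv2d(ch // 4, ch // 4, 3, padding=5, dilation=5))
        self.branch4 = nn.Conv2d(ch, ch // 4, 1)
        self.fuse = nn.Conv2d(ch, ch, 1)   # fuse the concatenated branches
        self.act = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, x):
        out = torch.cat([self.branch1(x), self.branch2(x),
                         self.branch3(x), self.branch4(x)], dim=1)
        # Residual connection scaled by alpha, as in RRDB-style blocks.
        return self.act(x + self.alpha * self.fuse(out))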

The discriminator is a convolutional neural network composed of eight convolutional layers, as shown in Fig. 3(c). The kernel size (k), number of feature maps (n), and stride (s) of each convolution layer are given in Fig. 3(c).
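A sketch of an eight-convolution discriminator in this style is shown below; the kernel/feature/stride values follow the common SRGAN layout and are assumptions, since the exact values are given only in Fig. 3(c).

import torch.nn as nn

def make_discriminator(in_ch=1, base=64):
    """Eight-convolution discriminator sketch in the SRGAN style; the exact
    k/n/s values of Fig. 3(c) are not reproduced here."""
    specs = [(base, 1), (base, 2), (base * 2, 1), (base * 2, 2),
             (base * 4, 1), (base * 4, 2), (base * 8, 1), (base * 8, 2)]
    layers, prev = [], in_ch
    for n, s in specs:
        layers += [nn.Conv2d(prev, n, kernel_size=3, stride=s, padding=1),
                   nn.LeakyReLU(0.2, inplace=True)]
        prev = n
    layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(prev, 1)]
    return nn.Sequential(*layers)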

2.3 Quantitative and qualitative evaluation

The SR performance of our proposed approach was quantified by three typical objective metrics, learned perceptual image patch similarity (LPIPS) [30], resolution-scaled Pearson coefficient (RSP) [19], and signal-to-noise ratio (SNR), together with a subjective evaluation, the mean opinion score (MOS).

LPIPS objectively evaluates the similarity between images through deep features extracted across different neural network architectures, and is close to human perception in visual similarity judgment. The smaller the LPIPS, the higher the similarity between the two images.

RSP quantifies artifacts in the network output images using the Fiji software plugin NanoJ-SQUIRREL [31]. The plugin provides a globally averaged RSP score to assess image quality across images of different resolutions by quantifying their correlation. The closer the RSP is to 1, the higher the similarity between the two images.

MOS numerically measures the human-judged overall quality of images. It is the average of individual human scores, usually rated on a scale of 1 (bad) to 5 (excellent). In our study, we invited 20 volunteers to rate the images.

To characterize the image quality under different defocus conditions, we selected a small field of view (FOV) containing a single microsphere to calculate the SNR. The formula is as follows,

$$SNR = 10\log_{10}\left(\frac{s-\bar{b}}{\sigma_b}\right)^{2},$$
where $s$ is the peak value of the signal calculated from a Gaussian fit to the particle, and $\bar{b}$ and $\sigma_b$ are the mean value and the standard deviation of the background, respectively.
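A possible implementation of this SNR measure is sketched below; taking the ROI border as the background region and fitting a 1D Gaussian through the brightest row are our assumptions for illustration.

import numpy as np
from scipy.optimize import curve_fit

def particle_snr(roi):
    """SNR of a single-microsphere ROI per Eq. (1): s is the peak of a Gaussian
    fit to the particle profile, b-bar and sigma_b come from the background.
    Sketch only; the background is taken as the ROI border pixels here."""
    border = np.concatenate([roi[0, :], roi[-1, :], roi[:, 0], roi[:, -1]])
    b_mean, b_std = border.mean(), border.std()

    # 1D Gaussian fit through the row containing the brightest pixel.
    row = roi[np.unravel_index(roi.argmax(), roi.shape)[0], :].astype(float)
    x = np.arange(row.size)
    gauss = lambda x, a, mu, sig, c: a * np.exp(-(x - mu) ** 2 / (2 * sig ** 2)) + c
    p, _ = curve_fit(gauss, x, row, p0=[row.max() - row.min(), row.argmax(), 2.0, row.min()])
    s = p[0] + p[3]   # fitted peak value (amplitude plus offset)

    return 10 * np.log10(((s - b_mean) / b_std) ** 2)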

Frequency extrapolation is also one of the features of SR images. Therefore, in addition to the above quantitative evaluation parameters, we also performed spatial frequency analysis on the images generated by the networks to evaluate the SR ability of the network.

2.4 Experimental system and datasets

A home-made spectral domain OCT (SD-OCT) system [32] was used to collect the HR and LR images. Its light source is a multiplexed super-luminescent diode in the near-infrared waveband (central wavelength 840 nm, bandwidth 100 nm), giving an axial resolution of ∼3.4 µm in air. Compared with the previous system [32], we replaced the scanning lens with a new one (LSM02-BB, Thorlabs), whose lateral resolution is ∼4 µm at the focal plane. The DOF of this system is ∼60 µm, calculated from $DOF = 4\lambda / [\pi (NA)^{2}]$ with $NA = D/(2f)$, where NA is the numerical aperture calculated from the sample beam diameter D and the focal length f of the scanning lens. The SD-OCT image size is 1000 pixels × 1000 pixels × 2048 pixels, corresponding to a field of view of 1.5 mm (x) × 1.5 mm (y) × 2.3 mm (z).
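As a worked example of this DOF estimate, the short script below evaluates the formula; the beam diameter D and focal length f are assumed placeholder values chosen only to reproduce a DOF of roughly 60 µm, not measured system parameters.

import numpy as np

# Worked example of DOF = 4*lambda / (pi * NA^2), with NA = D / (2f).
wavelength = 840e-9   # central wavelength (m)
D = 4.8e-3            # assumed sample-beam diameter (m), placeholder
f = 18e-3             # assumed scan-lens focal length (m), placeholder
NA = D / (2 * f)
dof = 4 * wavelength / (np.pi * NA ** 2)
print(f"NA = {NA:.3f}, DOF = {dof * 1e6:.0f} um")   # on the order of 60 um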

For each sample, we collected one C-scan within the DOF and four C-scans of the same position at different focus positions, with a spacing of about 60 µm between adjacent focus positions along the axial direction. The en face images were obtained from these C-scans. Four datasets of LR-HR image pairs were therefore established in this study, where the LR images correspond to defocus distances of 60, 120, 180, and 240 µm, respectively. Each en face image is 1000 × 1000 pixels and was divided into patches of 64 × 64 pixels to generate the dataset, after which data augmentation such as random vertical/horizontal flips was applied. The number of image patches in the training datasets and the test images of each sample at different defocus distances are given in Table 1. We found that LR images at the 60 µm defocus distance are visually close to their HR images, so we only used the other three datasets to train the networks and perform refocusing, and this dataset is not listed in Table 1.
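A simple sketch of the patch extraction and flip augmentation described above is given below; the non-overlapping stride and the flip probabilities are assumptions.

import numpy as np

def make_patches(lr_img, hr_img, patch=64, stride=64):
    """Cut a registered LR-HR en face pair (e.g. 1000x1000 px) into 64x64
    patches with random vertical/horizontal flips. Sketch only."""
    pairs = []
    h, w = lr_img.shape
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            lp = lr_img[y:y + patch, x:x + patch].copy()
            hp = hr_img[y:y + patch, x:x + patch].copy()
            if np.random.rand() < 0.5:   # random horizontal flip
                lp, hp = lp[:, ::-1], hp[:, ::-1]
            if np.random.rand() < 0.5:   # random vertical flip
                lp, hp = lp[::-1, :], hp[::-1, :]
            pairs.append((lp, hp))
    return pairs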


Table 1. Number of images in datasets and quantitative evaluation results of digital refocusing

In the pyramid registration, N was initially set to 4, the shift error tolerance to 0.2, and the minimum block size to 40 × 40 pixels. All network models were implemented in PyTorch on a server with 64 GB of RAM and an NVIDIA TITAN RTX graphics processing unit (GPU). All networks were optimized using the Adam algorithm [33]. The learning rate was dropped by step decay, the number of training iterations was set to 250,000, and α in the RFB was set to 0.2.
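The optimizer and schedule can be sketched as follows; the learning rate, milestones, and decay factor are assumptions, since the text only specifies Adam, a step-decay schedule, and 250,000 iterations, and the single convolution stands in for the full generator.

import torch
import torch.nn as nn

generator = nn.Conv2d(1, 1, 3, padding=1)   # placeholder for the RFB-RRDB generator
optimizer = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[100_000, 200_000], gamma=0.5)   # assumed step decay

for iteration in range(250_000):            # 250,000 training iterations
    optimizer.zero_grad()
    # ... forward pass, loss computation, and loss.backward() would go here ...
    optimizer.step()
    scheduler.step()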

3. Results

3.1 Digital refocusing of en face images on the phantom

A phantom of polystyrene microparticles was first used to evaluate the feasibility of digital refocusing of en face images at different depths; the results are shown in Fig. 4. The phantom was constructed by mixing epoxy glue with polystyrene microparticles with a nominal diameter of 20 µm.


Fig. 4. OCT en face images of the phantom. (a) and (b) are defocused and SR images, respectively. (c) is the GT image. (d) are intensity profiles pointed out by the red dashed lines in Figs. 4(a) - (c). Scale bar is 200 µm.


Figure 4(a) is a defocused OCT en face image of the phantom at a defocus distance of 240 µm. Figure 4(b) is the SR en face image. The corresponding GT image is shown in Fig. 4(c). Magnified boxes are shown in the right part of each image. Intensity profiles along the red dashed lines in Figs. 4(a) - (c) are shown in Fig. 4(d). As shown in the magnified blue boxes, a single polystyrene microparticle is blurred in the defocused en face image [Fig. 4(a)] and is narrowed laterally in the SR [Fig. 4(b)] and GT [Fig. 4(c)] en face images. The two polystyrene microparticles in the green magnified box appear merged in the defocused image [Fig. 4(a)] and intensity profiles [blue profile in Fig. 4(d)], but they are completely separated in the SR [Fig. 4(b)] and GT [Fig. 4(c)] en face images, indicating the improvement in lateral resolution. The intensity curves of the SR and GT images [red and green profiles in Fig. 4(d)] follow the same trends, indicating that the intensity distributions of the microspheres in the SR and GT images are consistent. The SR image at a 240 µm defocus distance has the same lateral resolution as the GT image, which shows that digital refocusing can increase the DOF by ∼8 fold. The SNR of the SR box is significantly higher than that of the LR box, demonstrating that the refocusing method also greatly increases the SNR.

3.2 Digital refocusing of en face images on biological samples

Orange pulp was further used to show the refocusing ability on its wall structure. The Richardson-Lucy deconvolution method [34] with 15 iterations was also used for comparison with the SR results, where the point spread function was modeled using collected phantom en face OCT images. The results are shown in Fig. 5. Figures 5(a1) - (c1) are the collected orange en face images at defocus distances of 120 µm, 180 µm, and 240 µm, respectively. The corresponding deconvolution outputs, the SR en face images produced by our method, and the GT images are shown in Figs. 5(a2) - (c2), Figs. 5(a3) - (c3), and Figs. 5(a4) - (c4), respectively. Zoomed-in regions of interest (ROIs) are shown in Figs. 5(a5) - (c5), revealing more details of the orange wall structure. Figures 5(a6) - (c6) are the intensity profiles along the red dashed lines in Figs. 5(a5) - (c5).
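For reference, a minimal version of this deconvolution baseline is sketched below using scikit-image's Richardson-Lucy routine; a Gaussian point spread function of assumed width stands in for the PSF modeled from the phantom images.

import numpy as np
from skimage import restoration

def refocus_by_deconvolution(defocused, sigma_px=4.0, iterations=15):
    """Richardson-Lucy baseline used for comparison. A Gaussian PSF of assumed
    width sigma_px replaces the PSF modeled from phantom microsphere images."""
    half = int(4 * sigma_px)
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    psf = np.exp(-(x ** 2 + y ** 2) / (2 * sigma_px ** 2))
    psf /= psf.sum()
    img = defocused.astype(float)
    img = (img - img.min()) / (np.ptp(img) + 1e-12)   # RL expects non-negative input
    return restoration.richardson_lucy(img, psf, num_iter=iterations)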


Fig. 5. SR results of orange pulp OCT en face images. (a1) - (c1) are the defocused en face images collected at defocus distances, 120 µm, 180 µm, and 240 µm, respectively. (a2) - (c2) are the results of the Richardson-Lucy deconvolution method. (a3) - (c3) are the SR images of our digital refocusing method based on deep learning. (a4) - (c4) are the corresponding GT images. (a5) - (c5) are zoomed-in ROIs. (a6) - (c6) are intensity profiles pointed by the red dashed lines in (a5) - (c5), respectively, and the colors of the curves correspond to the colors of the boxes. Scale bar is 200 µm.


As shown in Figs. 5(a1) - (c1), the hole-like structures of the orange pulp become fuzzy and hard to distinguish as the defocus distance increases. The deconvolution method can reduce this kind of blur, but at the cost of a lower SNR and a loss of texture detail. In contrast, the deep learning-based digital refocusing method reconstructs the orange structure under a variety of defocus conditions [Figs. 5(a3) - (c3)]. Meanwhile, as shown in Figs. 5(a6) - (c6), the intensity profiles indicate that the deep learning SR approach recovers the fine structure, demonstrating the improvement in lateral resolution and the DOF extension.

To show the refocusing ability on OCT images of biological samples, we imaged ex vivo human thyroid tissue with our home-made OCT system. The human thyroid tissue was provided by the Department of Thyroid and Neck Tumor, Tianjin Medical University Cancer Institute and Hospital, China, in accordance with the approved protocols and guidelines of the institutional review board of Tianjin Cancer Hospital. The results are shown in Fig. 6. Figure 6(a1) is the en face image collected at a 180 µm defocus distance and Fig. 6(b1) is the SR image. The GT image is shown in Fig. 6(c1). Figures 6(a2) - (c2) are magnified views of the regions framed by the blue squares.


Fig. 6. SR results of thyroid OCT en face images. (a1) is the defocused en face image collected at defocus distance of 180 µm. (b1) is the SR images of digital refocusing. (c1) is the corresponding GT. (a2) - (c2) are zoomed-in ROIs of (a1) - (c1), respectively.


The thyroid is composed of spherical follicles that selectively absorb iodine and store it in thyroglobulin, and these appear as follicle structures in OCT images [35]. The thyroid follicle structures are difficult to distinguish in Figs. 6(a1) and 6(a2) due to defocusing. The speckle distribution of the SR image [Figs. 6(b1) and 6(b2)] is similar to that of the GT [Figs. 6(c1) and 6(c2)], and the follicle structure is clearly visible, showing that the method refocused the defocused en face image and achieved DOF extension.

The SR results on OCT images of 5- to 6-day-old zebrafish larvae are shown in Fig. 7. Figure 7(a1) is the zebrafish en face image at a defocus distance of 240 µm and Fig. 7(b1) is its SR image. Figure 7(c1) is the GT image. Compared with the defocused image [Fig. 7(a1)], the internal organs of the larval zebrafish can be clearly distinguished in the GT en face image [Fig. 7(c1)]. The lateral resolution is considerably enhanced in the SR image [Fig. 7(b1)] after digital refocusing.


Fig. 7. SR results of zebrafish larvae en face OCT images. (a1) is the defocused en face images collected at defocus distance of 240 µm. (b1) is the SR images after the digital refocusing. (c1) is the GT image. (a2) - (c2) are spatial frequency distribution of (a1) - (c1), respectively. (d) is the radially-averaged intensities of the spatial frequency spectra in (a2) - (c2). Scale bar is 200 µm.


We further analyzed their spatial frequency distributions, which are shown in Figs. 7(a2) - (c2), respectively. Figure 7(d) shows the associated radially-averaged frequency intensity profiles. Compared with Fig. 7(a2), Fig. 7(b2) is more similar to Fig. 7(c2) in its frequency distribution. According to the Fourier spectrum, the GT image [the red curve in Fig. 7(d)] contains more information in the intermediate-frequency band [indicated by the gray double-arrow line] than the LR image [the blue curve in Fig. 7(d)], meaning the sample details are mainly concentrated in this band. The frequency distribution of the SR image [the green curve in Fig. 7(d)] is closer to that of the GT image, and the proportion of intermediate-frequency content is significantly higher than that of the LR image, demonstrating the frequency-extrapolation ability of deep learning.
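The radially-averaged spectrum used for curves of this kind can be computed as in the following sketch.

import numpy as np

def radial_power_spectrum(img):
    """Radially-averaged magnitude of the 2D spatial frequency spectrum,
    as plotted in Fig. 7(d). Minimal sketch."""
    spec = np.abs(np.fft.fftshift(np.fft.fft2(img)))
    h, w = spec.shape
    yy, xx = np.indices(spec.shape)
    r = np.hypot(yy - h / 2, xx - w / 2).astype(int)
    # Average the spectral magnitude over rings of constant radius.
    sums = np.bincount(r.ravel(), weights=spec.ravel())
    counts = np.bincount(r.ravel())
    return sums / np.maximum(counts, 1)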

The quantitative results of RSP, LPIPS, and MOS of the test images are listed in Table 1. Compared with the LR images, all three quantitative metrics are improved for the SR images, demonstrating that the image quality of the SR images is closer to the GT and confirming the enhancement in lateral resolution after digital refocusing. The zebrafish was embedded in epoxy resin adhesive, whose reflection signals were also collected by the OCT system, so the zebrafish en face images contain many reflection signals from this irrelevant material. To remove its influence on the quantitative evaluation, the zebrafish regions in the en face images were manually extracted.

3.3 Digital refocusing of three dimensional images

In addition to super-resolving defocused en face images, we also performed three-dimensional (3D) digital refocusing on additionally collected C-scan images that were not used in training the model. Each C-scan included en face images within the DOF and en face images at a series of continuously defocused positions. The defocused images were separated into four levels (30 µm - 90 µm, 90 µm - 150 µm, 150 µm - 210 µm, and 210 µm and above) according to their defocus distances. The images within the DOF and at 30 - 90 µm defocus distance were not processed, and the remaining images were super-resolved by the corresponding trained models to generate the digitally refocused C-scans.
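A schematic of this level-wise assembly is sketched below; the slice spacing, focus index, and model lookup are hypothetical inputs, and the mapping of defocus bins to the 120/180/240 µm models is inferred from the text.

import numpy as np

def refocus_cscan(en_face_stack, depth_um_per_slice, focus_index, models):
    """Assemble a digitally refocused C-scan: each en face slice is assigned
    to a defocus bin by its distance from the focal plane and super-resolved
    by the model trained for that bin. `models` maps bin labels to callables;
    slices within the DOF / 30-90 um band are passed through unchanged."""
    refocused = []
    for z, en_face in enumerate(en_face_stack):
        defocus = abs(z - focus_index) * depth_um_per_slice
        if defocus < 90:
            refocused.append(en_face)                  # in focus or mildly defocused
        elif defocus < 150:
            refocused.append(models["120um"](en_face))
        elif defocus < 210:
            refocused.append(models["180um"](en_face))
        else:
            refocused.append(models["240um"](en_face))
    return np.stack(refocused)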

A typical B-scan OCT image from the refocused C-scan is shown in Fig. 8. Figures 8(a) and 8(b) are two B-scan images whose focus positions, marked by the red dot-dashed lines, are 120 µm apart. Figure 8(c) is the digitally refocused result of Fig. 8(b). The sub-images at the lower part of Figs. 8(a) - (c) are magnified views of the regions framed by the blue squares. As shown in Figs. 8(a) and 8(b), the beam focus in Fig. 8(b) is deeper than in Fig. 8(a), so deeper tissue can be detected, but the hole-like features of the orange near the sample surface are blurred due to defocusing. Lateral resolutions over a large DOF are improved in the refocused image [Fig. 8(c)]. More importantly, the overall image quality of Fig. 8(c) is improved along with the optimization of the lateral resolution at different depths.


Fig. 8. B-scan OCT images of orange pulp. (a) and (b) are two B-scan images where the red dotted lines mark the incident beam focus positions. (c) is the refocused image of (b). Scale bar is 200 µm.


4. Discussions

In this paper, we built pixel-level registered pairs of en face LR-GT OCT images based on experimental data and carried out deep learning-based digital refocusing of defocused OCT images. The experimental results show that the lateral resolutions at different depths along the axial direction were greatly improved and the DOF of the OCT images was extended.

The spatial frequency spectrum can be used to analyze how the images change during the registration process. Figure 9 shows the change of the spatial frequency spectrum when registering LR and HR en face images. Figure 9(a1) is the defocused LR image, Fig. 9(b1) is the UHR image, Fig. 9(c1) is the RHR image, and Fig. 9(d1) is the GT image. Figures 9(a2) - (d2) are the spatial frequency distributions of Figs. 9(a1) - (d1), respectively. We further plot the radially-averaged intensity of each spatial frequency distribution of Figs. 9(a2) - (d2) on a log scale in Fig. 9(e). According to the frequency distribution curves of the LR and UHR images, the frequency content is mainly concentrated in the intermediate-frequency band (indicated by the gray double-arrow line), and the high-frequency part does not contain detailed sample information. The RHR image loses some high-frequency information after the interpolation transformation, without affecting the sample details. The GT and LR images are pixel-matched [Figs. 9(a1) and 9(d1)], and the GT image is closer to the UHR image in its frequency distribution, indicating that only the frequency distribution of the image is changed after passing through the network trained with fewer iterations, while the pixel correspondence between the images is preserved.


Fig. 9. Illustration of images and frequency spectra during registration. (a1) is the LR en face images. (b1) is the UHR image. (c1) is the RHR image by the registration algorithm. (d1) is the pre-trained model's optimized HR image, which is the GT image. (a2) - (d2) are spatial frequency distribution of (a1) -(d1), respectively. (e) plots the radially-averaged intensity of each one of the spatial frequency spectra shown in (a2)- (d2).


Considering the change of defocus distance along the depth direction, we constructed multiple datasets corresponding to different defocus distances for the same sample area and trained the models separately. The en face images with different defocus distances were then super-resolved by the corresponding trained specific models. In our experiments, an overall dataset was also built, containing all pairs of LR images with different defocus distances and GT images, and was used to train a single deep learning model (named the single model). The trained single model was then used to super-resolve all defocused en face images. The comparison between the multiple datasets and the overall dataset is shown in Fig. 10. Figures 10(a1) and 10(b1) are the collected thyroid en face images at defocus distances of 120 µm and 240 µm, respectively. The corresponding SR en face images produced by the specific models, the SR en face images produced by the single model, and the GT images are shown in Figs. 10(a2) and 10(b2), Figs. 10(a3) and 10(b3), and Figs. 10(a4) and 10(b4), respectively. Zoomed-in ROIs are shown in Figs. 10(a5) and 10(b5).


Fig. 10. SR results of OCT en face images. (a1) and (b1) are the defocused en face images collected at defocus distances 120 µm and 240 µm, respectively. (a2) and (b2) are the results of specific model. (a3) and (b3) are the SR images of the single model. (a4) and (b4) are the corresponding GT images. (a5) and (b5) are zoomed-in ROIs. Scale bar is 200 µm.


Compared with the LR images, the lateral resolution of each SR image is improved. As shown in Figs. 10(a2), 10(a3), and the ROIs in Fig. 10(a5), the SR images produced by the specific model and the single model have almost the same quality. When the defocus distance is 120 µm, excellent results are obtained both by training the networks on the separate datasets and on the overall dataset. Figure 10(b2) is closer to Fig. 10(b4) than Fig. 10(b3) in sample structure and image detail, which is clearly revealed in Fig. 10(b5). The results indicate that the larger the defocus distance, the more difficult it is for the single model to obtain correct SR results. This is because the overall dataset contains complicated mapping relationships between image pairs, which makes it difficult for the neural network to converge. In the future, a better network needs to be studied to learn the complex mapping relationships of en face OCT images in order to simplify the construction of the dataset.

We also conducted an ablation study to show the performance of the RFB in the RRDB block. In the ablation study, the RFB-RRDB blocks were replaced by RRDB blocks in the network, and the results are shown in Fig. 11. Figure 11(a) is the collected thyroid en face image at a defocus distance of 240 µm. Figures 11(b) and (c) are the SR images generated by the RFB-RRDB and RRDB networks, respectively. Figure 11(d) is the GT image. Both RRDB and RFB-RRDB achieve accurate OCT digital refocusing, and the ablation study indicates that introducing the RFB into the RRDB brings a measurable improvement in LPIPS with a nearly unchanged RSP. Therefore, the introduction of the RFB improves the performance of the network.


Fig. 11. SR results of OCT en face images. (a) is the defocused en face images collected at defocus distance 240 µm. (b) and (c) are the SR results of RFB-RRDB and RRDB, respectively. (d) is the corresponding GT images. Scale bar is 200 µm.


To further compare actual defocused OCT images with bicubic downsampled images, a focused zebrafish larva GT OCT en face image, its downsampled LR images with 4×, 8×, and 12× bicubic interpolation, and the actual LR image at a 240 µm defocus distance are given in Figs. 12(a) - (e1), respectively. For easy comparison, the bicubic downsampled LR images were upsampled to the size of the actual LR images. Figures 12(b2) - (e2) are the zoomed-in ROIs of Figs. 12(b1) - (e1), respectively. As shown in Fig. 12(a), rich texture details of the zebrafish larva can be seen in the GT image. The bicubic downsampled image with downscale factor 4 [Fig. 12(b1)] still retains high-frequency information similar to the GT image. Figures 12(c1) and 12(d1) are smooth due to the downsampling operation and are quite different from the defocused image in terms of texture detail. The same behavior is clearly shown in the zoomed-in ROIs [Figs. 12(c2) - (d2)], which demonstrates that bicubic downsampling cannot simulate the degradation of en face images caused by defocusing.


Fig. 12. Zebrafish OCT en face images of different resolutions. (a) is the GT image. (b1) -(d1) are downsampled LR images with downscale 4, 8 and 12, respectively. (e1) is the defocused en face images collected at defocus distances 240 µm. (b2) -(e2) show the zoomed-in ROIs of (b1) -(e1). Scale bar in each image is 200 µm.


In this paper, we have focused on building correctly mapped LR-HR image pairs and proving the feasibility of deep learning for digital refocusing. The network was trained on specific samples to show the SR effect, and we did not study cross-sample refocusing. Transferability, or generalization ability, is important for a neural network. A deep learning network behaves like a black box, and we cannot know exactly what knowledge it has learned or how its internal parameters change during training. As noted in Ref. [36], the output of a deep learning-based SR model is always highly dependent on sufficient correspondence between the training and test data, and an SR model developed for a specific sample type is highly recommended in practical applications. According to our studies [22,23], the more similar the morphological features and sparsity characteristics of different samples, the better the cross-sample SR effect. In addition, transferring learning from a previously trained network to another class of samples can accelerate the convergence of the new training process. We will further investigate the generalization of the networks by analyzing the network parameters and the characteristics of the samples. With the advancement of deep learning, we believe stronger networks will become available to optimize OCT images, and we will continue to work on this.

5. Conclusions

Based on finely registered image pairs, we achieved digital refocusing of OCT images in this paper, and the lateral resolutions at different depths along the axial direction were improved. The experimental results demonstrate that data-driven deep learning methods are effective and accurate in refocusing OCT images. We believe deep learning methods have promising applications in optimizing OCT images.

Funding

National Natural Science Foundation of China (61875092); Science and Technology Support Program of Tianjin (17YFZCSY00740); the Beijing-Tianjin-Hebei Basic Research Cooperation Special Program (19JCZDJC65300).

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the corresponding author upon reasonable request.

References

1. R. F. Spaide, J. M. Klancnik, and M. J. Cooney, “Retinal vascular layers imaged by fluorescein angiography and optical coherence tomography angiography,” JAMA Ophthalmol. 133(1), 45–50 (2015). [CrossRef]  

2. C. A. Puliafito, M. R. Hee, C. P. Lin, E. Reichel, J. S. Schuman, J. S. Duker, J. A. Izatt, E. A. Swanson, and J. G. Fujimoto, “Imaging of macular diseases with optical coherence tomography,” Ophthalmology 102(2), 217–229 (1995). [CrossRef]  

3. I.-K. Jang, B. E. Bouma, D.-H. Kang, S.-J. Park, S.-W. Park, K.-B. Seung, K.-B. Choi, M. Shishkov, K. Schlendorf, E. Pomerantsev, S. L. Houser, H. T. Aretz, and G. J. Tearney, “Visualization of coronary atherosclerotic plaques in patients using optical coherence tomography: comparison with intravascular ultrasound,” J. Am. Coll. Cardiol. 39(4), 604–609 (2002). [CrossRef]  

4. M. Schemann and M. Camilleri, “Functions and imaging of mast cell and neural axis of the gut,” Gastroenterology 144(4), 698–704.e4 (2013). [CrossRef]  

5. T. Gambichler, V. Jaedicke, and S. Terras, “Optical coherence tomography in dermatology: technical and clinical aspects,” Arch. Dermatol. Res. 303(7), 457–473 (2011). [CrossRef]  

6. M. Ulrich, L. Themstrup, N. D. Carvalho, M. Manfredi, C. Grana, S. Ciardo, R. Kästle, J. Holmes, R. Whitehead, G. B. E. Jemec, G. Pellacani, and J. Welzel, “Dynamic optical coherence tomography in dermatology,” Dermatology (Basel, Switz.) 232(3), 298–311 (2016). [CrossRef]  

7. D. Huang, E. A. Swanson, C. P. Lin, J. S. Schuman, W. G. Stinson, W. Chang, M. R. Hee, T. Flotte, K. W. Gregory, C. A. Puliafito, and J. G. Fujimoto, “Optical coherence tomography,” Science 254(5035), 1178–1181 (1991). [CrossRef]  

8. X. Yao, Y. Gan, C. C. Marboe, and C. P. Hendon, “Myocardial imaging using ultrahigh-resolution spectral domain optical coherence tomography,” J. Biomed. Opt. 21(6), 061006 (2016). [CrossRef]  

9. Y.-J. You, C. Wang, Y.-L. Lin, A. Zaytsev, P. Xue, and C.-L. Pan, “Ultrahigh-resolution optical coherence tomography at 1.3 µm central wavelength by using a supercontinuum source pumped by noise-like pulses,” Laser Phys. Lett. 13(2), 025101 (2016). [CrossRef]  

10. W. Yuan, R. Brown, W. Mitzner, L. Yarmus, and X. D. Li, “Super-achromatic monolithic microprobe for ultrahigh-resolution endoscopic optical coherence tomography at 800 nm,” Nat. Commun. 8(1), 1531 (2017). [CrossRef]  

11. J. Kim, J. Xing, H. S. Nam, J. W. Song, J. W. Kim, and H. Yoo, “Endoscopic micro-optical coherence tomography with extended depth of focus using a binary phase spatial filter,” Opt. Lett. 42(3), 379–382 (2017). [CrossRef]  

12. L. Yi, L. Sun, and W. Ding, “Multifocal spectral-domain optical coherence tomography based on Bessel beam for extended imaging depth,” J. Biomed. Opt. 22(10), 106016 (2017). [CrossRef]  

13. E. Bo, Y. Luo, S. Chen, X. Liu, N. Wang, X. Ge, X. Wang, S. Chen, S. Chen, J. Li, and L. Liu, “Depth-of-focus extension in optical coherence tomography via multiple aperture synthesis,” Optica 4(7), 701–706 (2017). [CrossRef]  

14. H. Pahlevaninezhad, M. Khorasaninejad, Y.-W. Huang, Z. Shi, L. P. Hariri, D. C. Adams, V. Ding, A. Zhu, C.-W. Qiu, F. Capasso, and M. J. Suter, “Nano-optic endoscope for high-resolution optical coherence tomography in vivo,” Nat. Photonics 12(9), 540–547 (2018). [CrossRef]  

15. K. Shen, H. Lu, S. Baig, and M. R. Wang, “Improving lateral resolution and image quality of optical coherence tomography by the multi-frame superresolution technique for 3D tissue imaging,” Biomed. Opt. Express 8(11), 4887–4918 (2017). [CrossRef]  

16. T. S. Ralston, D. L. Marks, P. S. Carney, and S. A. Boppart, “Interferometric synthetic aperture microscopy,” Nat. Phys. 3(2), 129–134 (2007). [CrossRef]  

17. A. A. Moiseev, G. V. Gelikonov, S. Y. Ksenofontov, P. A. Shilyagin, D. A. Terpelov, I. V. Kasatkina, D. A. Karashtin, A. A. Sovetsky, and V. M. Gelikonov, “Digital refocusing in optical coherence tomography using finite impulse response filters,” Laser Phys. Lett. 15(9), 095601 (2018). [CrossRef]  

18. H. Zhang, C. Fang, X. Xie, Y. Yang, W. Mei, D. Jin, and P. Fei, “High- throughput, high-resolution deep learning microscopy based on registration-free generative adversarial network,” Biomed. Opt. Express 10(3), 1044–1063 (2019). [CrossRef]  

19. H. Wang, Y. Rivenson, Y. Jin, Z. Wei, R. Gao, H. Günaydın, L. A. Bentolila, C. Kural, and A. Ozcan, “Deep learning enables cross-modality super-resolution in fluorescence microscopy,” Nat. Methods 16(1), 103–110 (2019). [CrossRef]  

20. Y. Wu, Y. Rivenson, H. Wang, Y. Luo, E. Ben-David, L. A. Bentolila, C. Pritz, and A. Ozcan, “Three-dimensional virtual refocusing of fluorescence microscopy images using deep learning,” Nat. Methods 16(12), 1323–1331 (2019). [CrossRef]  

21. L. Jin, Y. Tang, Y. Wu, J. B. Coole, M. T. Tan, X. Zhao, H. Badaoui, J. T. Robinson, M. D. Williams, A. M. Gillenwater, R. R. Richards-Kortum, and A. Veeraraghavan, “Deep learning extended depth-of-field microscope for fast and slide-free histology,” Proc. Natl. Acad. Sci. U. S. A. 117(52), 33051–33060 (2020). [CrossRef]  

22. Z. Yuan, D. Yang, H. Pan, and Y. Liang, “Axial super-resolution study for optical coherence tomography images via deep learning,” IEEE Access 8, 204941–204950 (2020). [CrossRef]  

23. H. Pan, D. Yang, Z. Yuan, and Y. Liang, “More realistic low-resolution OCT image generation approach for training deep neural networks,” OSA Continuum 3(11), 3197 (2020). [CrossRef]  

24. Y. Huang, Z. Lu, Z. Shao, M. Ran, J. Zhou, L. Fang, and Y. Zhang, “Simultaneous denoising and super-resolution of optical coherence tomography images based on generative adversarial network,” Opt. Express 27(9), 12289–12307 (2019). [CrossRef]  

25. K. Liang, X. Liu, S. Chen, J. Xie, W. Qing Lee, L. Liu, and H. K. Lee, “Resolution enhancement and realistic speckle recovery with generative adversarial modeling of micro-optical coherence tomography,” Biomed. Opt. Express 11(12), 7236–7252 (2020). [CrossRef]  

26. X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, Y. Qiao, and C. C. Loy, “ESRGAN: Enhanced super-resolution generative adversarial networks,” in European Conference on Computer Vision (ECCV) (2018), pp. 1–16.

27. S. Liu, D. Huang, and Y. Wang, “Receptive field block net for accurate and fast object detection,” in European Conference on Computer Vision (ECCV) (2018), pp. 404–419.

28. T. Shang, Q. Dai, S. Zhu, T. Yang, and Y. Guo, “Perceptual extreme super-resolution network with receptive field block,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (2020), pp. 440–441.

29. A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve neural network acoustic models,” in Proceedings of International Conference on Machine Learning (2013).

30. R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018), pp. 586–595.

31. R. F. Laine, K. L. Tosheva, N. Gustafsson, R. D. Gray, P. Almada, D. Albrecht, G. T. Risa, F. Hurtig, A.C. Lindås, B. Baum, J. Mercer, C. Leterrier, P. M. Pereira, S. Culley, and R. Henriques, “NanoJ: a high-performance open-source super-resolution microscopy toolbox,” J. Phys. D: Appl. Phys. 52(16), 163001 (2019). [CrossRef]  

32. D. Yang, M. Hu, M. Zhang, and Y. Liang, “High-resolution polarization-sensitive optical coherence tomography for zebrafish muscle imaging,” Biomed. Opt. Express 11(10), 5618–5632 (2020). [CrossRef]  

33. D. P. Kingma and J. L. Ba, “Adam: a method for stochastic optimization,” in Proceedings of International Conference for Learning Representation (2015), pp. 1–41.

34. S. A. Hojjatoleslami, M. R. N. Avanaki, and A. G. Podoleanu, “Image quality improvement in optical coherence tomography using Lucy–Richardson deconvolution algorithm,” Appl. Opt. 52(23), 5663–5670 (2013). [CrossRef]  

35. F. Hou, Y. Yu, and Y. Liang, “Automatic identification of parathyroid in optical coherence tomography images,” Lasers Surg. Med. 49(3), 305–311 (2017). [CrossRef]  

36. L. Fang, F. Monroe, S. W. Novak, L. Kirk, C. R. Schiavon, S. B. Yu, T. Zhang, M. Wu, K. Kastner, A. A. Latif, Z. Lin, A. Shaw, Y. Kubota, J. Mendenhall, Z. Zhang, G. Pekkurnaz, K. Harris, J. Howard, and U. Manor, “Deep learning-based point-scanning super-resolution imaging,” Nat. Methods 18(4), 406–416 (2021). [CrossRef]  
