1 Introduction

Obstetric ultrasound (US) is conducted as a routine screening examination between 18 and 24 weeks of gestation. US imaging of the fetal head enables clinicians to assess fetal brain development and detect growth abnormalities. This requires the careful selection of standard scan planes, such as the transventricular (TV) and transcerebellar (TC) planes, that contain key anatomical structures [6]. However, manually navigating a 2D US probe to find the correct standard plane is challenging and time-consuming even for experienced sonographers; the task is highly operator-dependent and requires a great amount of expertise. With the advent of 3D fetal US, a volume of the entire fetal brain can be acquired quickly and with little training, but the problem of locating the diagnostically required standard planes for biometric measurements remains. There is therefore a strong need for automatic methods that extract 2D standard planes from 3D volumes to improve clinical workflow efficiency.

Related work: Recently, deep learning approaches have shown success in many medical image analysis applications. Several works have applied deep learning techniques to standard plane detection in fetal US [1,2,3,7]. Baumgartner et al. [1] use a convolutional neural network (CNN) to categorise 13 fetal standard views. Chen et al. [3] adopt a CNN-based image classification approach for detecting fetal abdominal standard planes, which they later combine with a recurrent neural network (RNN) that takes temporal information into account [2]. However, these methods identify standard planes from 2D US videos rather than 3D volumes. Ryou et al. [7] detect fetal head and abdominal planes from 3D fetal US by breaking the 3D volume down into a stack of 2D slices, which are then classified as head or abdomen using a CNN.

The above works treat plane detection as an image classification problem. In contrast, we approach it by regressing the rigid transformation parameters that define the plane position and orientation. Several works use CNNs to predict transformations. Kendall et al. [5] introduce PoseNet for regressing the 6-DoF camera pose from an RGB image, with a loss function that represents rotation by quaternions. Hou et al. [4] propose SVRNet for predicting the transformation from a 2D image to 3D space, and use anchor points as a new representation for rigid transformations. These works predict an absolute transformation with respect to a known reference coordinate system in a single pass of the CNN. Our work differs in that we use an iterative approach with multiple CNN passes to predict a relative transformation with respect to the current plane coordinates, which change at each iteration. We use relative transformations because our 3D volumes are not aligned to a reference coordinate system.

Contributions: In this paper, we propose the Iterative Transformation Network (ITN), which uses a CNN to detect standard planes in 3D fetal US. The network learns a mapping between a 2D plane and the transformation required to move that plane towards the standard plane within a 3D volume. Our contributions are threefold: (1) ITN is a general deep learning framework for 2D plane detection in 3D volumes. The iterative approach regresses transformations that bring the plane closer to the standard plane, which reduces computation cost as ITN selectively samples only a few planes in the 3D volume, unlike classification-based methods that require dense sampling [1,2,3,7]. (2) We study how different transformation representations (quaternions, Euler angles, rotation matrices, anchor points) used as CNN regression outputs affect plane detection accuracy. (3) We improve ITN performance by incorporating additional classification probability outputs that serve as confidence measures for the regressed transformation parameters, yielding more accurate localisation at inference. During training, the regression and classification outputs are learned in a multi-task framework, which improves the generalisation ability of the model and prevents overfitting.

Fig. 1. (a) Overall plane detection framework using ITN. (b) Composition of transformations. Red: GT plane. Blue: arbitrary plane. Black: identity plane.

2 Method

Overall Framework: Fig. 1a presents the overall ITN framework for plane detection. Given a 3D volume V, the goal is to find the ground truth (GT) standard plane (red). Starting with a random plane initialisation (blue), the 2D image of the plane is extracted and input to a CNN which then predicts a 3D transformation \(\varDelta T\) that will move the plane to a new position closer to the GT plane. The image extracted at the new plane location is then passed to the CNN and the process is repeated until the plane reaches the GT plane.

Composition of Transformations: A transformation is defined with respect to a reference coordinate system. In Fig. 1b, we define an identity plane (black) with origin at the volume centre. T and \(T^{GT}\) are defined in the coordinate system of the identity plane and move it to the arbitrary plane (blue) and the GT plane (red) respectively. \(\varDelta T^{GT}\) is defined in the coordinate system of the arbitrary plane and moves the arbitrary plane to the GT plane. Note that our ITN predicts \(\varDelta T^{GT}\), a relative transformation from the point of view of the current plane, not of the identity plane. We compute these transformations from each other using \(T^{GT}=T\oplus \varDelta T^{GT}\) and \(\varDelta T^{GT}=T^{GT}\ominus T\), where \(\oplus \) and \(\ominus \) are the composition and inverse composition operators respectively. The computations defined by the operators depend on the choice of transformation representation.
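For concreteness, the sketch below realises \(\oplus \) and \(\ominus \) for transformations stored as \(4\times 4\) homogeneous matrices; this matrix representation is one possible choice for illustration, not mandated by ITN.

```python
# A minimal sketch of the composition operators, assuming transformations
# are stored as 4x4 homogeneous matrices (one possible representation;
# the actual computation is representation-dependent).
import numpy as np

def compose(T, dT):
    """T_GT = T (+) dT: apply dT, which is expressed in the coordinate
    system of the plane defined by T, so it composes on the right."""
    return T @ dT

def inverse_compose(T_GT, T):
    """dT = T_GT (-) T: the relative transformation, in the coordinates
    of the plane at T, that moves that plane onto the GT plane."""
    return np.linalg.inv(T) @ T_GT
```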

Network Training: During training, an arbitrary plane is randomly sampled from a volume V by applying a random transformation T to the identity plane. The corresponding 2D plane image X is then extracted. We define \(X=I(V, T, s)\), where \(I(\cdot )\) is the plane extraction function and s is the side length of the square plane. We sample T such that the plane centre falls within the middle 60% of V and the plane rotation is within \(\pm 45^{\circ }\) about each coordinate axis. This avoids sampling planes at the edges of the volume, where there is no informative image data because these regions fall outside the US imaging cone.
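The plane extraction function \(I(V,T,s)\) can be realised by trilinear resampling of V on a transformed pixel grid, as in the sketch below. The conventions assumed here (identity plane spanning the local x-y axes, 1-voxel pixel spacing, translation in T expressed relative to the volume centre) are illustrative choices, not details given in the text.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def extract_plane(volume, T, s):
    """I(V, T, s): sample an s x s image on the plane obtained by applying
    the 4x4 rigid transform T to the identity plane at the volume centre."""
    centre = (np.array(volume.shape) - 1) / 2.0
    # In-plane pixel grid in the identity plane's local x-y coordinates.
    u = np.arange(s) - (s - 1) / 2.0
    xx, yy = np.meshgrid(u, u, indexing='ij')
    pts = np.stack([xx, yy, np.zeros_like(xx), np.ones_like(xx)], axis=0)
    # Map local plane coordinates into absolute voxel coordinates.
    world = (T @ pts.reshape(4, -1))[:3] + centre[:, None]
    # Trilinear interpolation; voxels outside the volume read as zero.
    return map_coordinates(volume, world, order=1, mode='constant').reshape(s, s)
```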

Algorithm 1. Plane detection at inference.
Table 1. Representations of rigid transformations and their loss functions.

A training sample is represented by \((X, \varDelta T^{GT})\), and the training loss function can be formulated as the squared L2 norm of the error between the GT and predicted transformation parameters: \(L=\left\| \varDelta T^{GT}-\varDelta T \right\|_2^2\).

Network Inference: Algorithm 1 summarises the steps taken during network inference to detect a plane. The iterative approach gives rough estimates of the plane in the first few iterations and subsequently makes smaller, more accurate refinements. This coarse-to-fine adjustment improves accuracy and makes the result less sensitive to the initialisation. To further improve accuracy and convergence, we repeat Algorithm 1 with 5 random plane initialisations per volume and average their final transformations \(T_N\) after N iterations.
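The following sketch paraphrases this procedure; `random_init` and `average_transforms` are assumed helpers, while `extract_plane` and `compose` follow the earlier sketches.

```python
# A sketch of the inference procedure (our paraphrase of Algorithm 1).
# Averaging rigid transforms needs care in practice, e.g. quaternion
# averaging for the rotation part.
def detect_plane(volume, predict_delta, s=225, N=10, n_init=5):
    finals = []
    for _ in range(n_init):
        T = random_init(volume)              # random starting plane
        for _ in range(N):
            X = extract_plane(volume, T, s)  # image at the current plane
            dT = predict_delta(X)            # CNN predicts relative move
            T = compose(T, dT)               # step towards the GT plane
        finals.append(T)
    return average_transforms(finals)
```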

Transformation Representations: In ITN, the plane transformation \(\varDelta T\) is rigid, comprising only translation and rotation. We explore the effect of using different transformation representations as the CNN regression outputs (Table 1), since there are few comparative studies investigating this for deep networks. The first three representations explicitly separate translation and rotation, representing rotation by quaternions, Euler angles and a rotation matrix respectively. \(\alpha \) and \(\beta \) are the weightings given to the translation and rotation losses. Anchor points [4], in contrast, are defined as the coordinates of three fixed points on the plane (we use the centre and the bottom-left and bottom-right corners); together these points uniquely represent any translation and rotation in 3D space. During inference, the predicted values of certain representations need to be constrained to give a valid rotation: quaternions need to be normalised to unit quaternions, rotation matrices need to be orthogonalised, and anchor points need to be converted to valid rotation matrices as described in [4].
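For example, the standard projections back to valid rotations (unit-quaternion normalisation, and nearest-rotation-matrix orthogonalisation via SVD) can be written as:

```python
import numpy as np

def normalise_quaternion(q):
    """Project a slightly-off quaternion back to a valid unit quaternion."""
    return q / np.linalg.norm(q)

def orthogonalise(R):
    """Project a predicted 3x3 matrix to the nearest rotation matrix
    (in the Frobenius sense) via SVD, enforcing det = +1."""
    U, _, Vt = np.linalg.svd(R)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    return U @ D @ Vt
```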

Algorithm 2. Computing \(\varDelta T\) from the CNN outputs \(\varvec{t}\), \(\varvec{q}\), \(\varvec{P}\) and \(\varvec{Q}\).

Classification Probability as Confidence Measure: We further extend ITN by incorporating classification probabilities as confidence measures for the regressed translation and rotation values. The method can be applied to any transformation representation, but we use quaternions since they yield the best results. In addition to the regression outputs \(\varvec{t}\) and \(\varvec{q}\), the CNN also predicts two classification probability outputs, \(\varvec{P}\) and \(\varvec{Q}\), for translation and rotation respectively. We divide translation into 6 discrete classification categories: positive and negative translation along each coordinate axis. Denoting the translation classification label by c, we have \(c \in \{c_1^+, c_1^-, c_2^+, c_2^-, c_3^+, c_3^-\}\), where \(c_1^+\) is the category representing translation along the positive x-axis. \(\varvec{P}\) is then a 6-element vector giving the probability of translation along each axis direction. Similarly, we divide rotation into 6 categories: clockwise and counter-clockwise rotation about each coordinate axis. Denoting the rotation classification label by k, we have \(k \in \{k_1^+, k_1^-, k_2^+, k_2^-, k_3^+, k_3^-\}\), where \(k_1^+\) is the category representing clockwise rotation about the x-axis. \(\varvec{Q}\) is then a 6-element vector giving the probability of rotation about each axis.

A training sample is represented by \((X, \varvec{t}^{GT}, \varvec{q}^{GT}, {c^{GT}}, {k^{GT}})\). \({c^{GT}}\) gives the coordinate axis along which the current plane centre has the furthest absolute distance from the GT plane centre. Similarly, \({k^{GT}}\) gives the coordinate axis about which the current plane will rotate the most to reach the GT plane. Appendix A derives the computations of \({c^{GT}}\) and \({k^{GT}}\) during training. The overall training loss function can then be written as:

$$\begin{aligned} L = \alpha \left\| \varvec{t}^{GT} - \varvec{t} \right\|_2^2 + \beta \left\| \varvec{q}^{GT} - \frac{\varvec{q}}{\left\| \varvec{q} \right\|} \right\|_2^2 - \gamma \log P_{c^{GT}} - \delta \log Q_{k^{GT}} \end{aligned}$$
(1)

The first and second terms are the squared L2 losses for translation and rotation regression, while the third and fourth terms are the cross-entropy losses for translation and rotation classification. \(\alpha \), \(\beta \), \(\gamma \) and \(\delta \) are the weights given to the losses.
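The sketch below assembles Eq. (1) in TensorFlow under one plausible reading of the labels: \(c^{GT}\) and \(k^{GT}\) are taken as the signed argmax over translation components and Euler angles respectively. The exact derivation is in Appendix A, which is not reproduced here, so this label construction is an assumption.

```python
import numpy as np
import tensorflow as tf

def translation_label(t_gt):
    """c_GT: axis with the largest |translation|, signed (6 classes)."""
    i = int(np.argmax(np.abs(t_gt)))
    return 2 * i + (0 if t_gt[i] >= 0 else 1)

def rotation_label(euler_gt):
    """k_GT: axis with the largest |rotation angle|, signed (6 classes).
    euler_gt is the Euler decomposition of the GT relative rotation."""
    i = int(np.argmax(np.abs(euler_gt)))
    return 2 * i + (0 if euler_gt[i] >= 0 else 1)

def itn_loss(t_gt, t, q_gt, q, c_gt, P_logits, k_gt, Q_logits,
             alpha=1.0, beta=1.0, gamma=1.0, delta=1.0):
    """Eq. (1): regression L2 terms plus classification cross-entropies."""
    q_unit = q / tf.norm(q, axis=-1, keepdims=True)
    L_t = tf.reduce_sum(tf.square(t_gt - t), axis=-1)
    L_q = tf.reduce_sum(tf.square(q_gt - q_unit), axis=-1)
    L_c = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=c_gt, logits=P_logits)
    L_k = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=k_gt, logits=Q_logits)
    return tf.reduce_mean(alpha * L_t + beta * L_q + gamma * L_c + delta * L_k)
```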

During inference, the CNN outputs \(\varvec{t}\), \(\varvec{q}\), \(\varvec{P}\) and \(\varvec{Q}\) are combined to compute the relative transformation \(\varDelta T\) (Algorithm 2). For translation, each component of the regressed translation \(\varvec{t}\) is weighted by the corresponding probabilities in \(\varvec{P}\). For rotation, we rotate the plane only about the most confident rotation axis as predicted by \(\varvec{Q}\). To determine the magnitude of that rotation, the regressed quaternion \(\varvec{q}\) is decomposed into Euler angles using the appropriate convention, from which the angle about the most confident axis is read off (an Euler angle convention of 'xyz' means a rotation about the x-axis first, followed by the y-axis and finally the z-axis). Hence, \(\varvec{P}\) and \(\varvec{Q}\) act as confidence weightings for \(\varvec{t}\) and \(\varvec{q}\), allowing the plane to translate and rotate to a greater extent along the more confident axes.
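A sketch of this update follows. How \(\varvec{P}\) weights each translation component and which Euler convention is "appropriate" are not fully specified here, so summing the two direction probabilities per axis and placing the chosen axis first in the decomposition are our assumptions.

```python
import numpy as np
from scipy.spatial.transform import Rotation

AXES = 'xyz'

def confident_update(t, q, P, Q):
    """Combine regression outputs (t, q) with confidences (P, Q)."""
    # Per-axis translation confidence: probability mass of both directions
    # (assumed reading of "weighted by the corresponding probabilities").
    w = P.reshape(3, 2).sum(axis=1)
    t_weighted = w * t
    # Most confident rotation axis.
    j = int(np.argmax(Q)) // 2
    # Decompose the quaternion (scipy uses [x, y, z, w] ordering) so the
    # chosen axis comes first, then keep only that angle.
    order = AXES[j] + AXES.replace(AXES[j], '')
    angles = Rotation.from_quat(q).as_euler(order)
    R = Rotation.from_euler(AXES[j], angles[0]).as_matrix()
    return t_weighted, R
```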

Network Architecture: ITN utilises a multi-task learning framework for predictions of multiple outputs. The architecture differs according to the number of outputs that the CNN predicts. All our networks comprise 5 convolution layers, each followed by a max-pooling layer. These layers contain shared features for all outputs. After the 5th pooling layer, the network branches into fully-connected layers to learn the specific features for each output. Details of all network architectures are described in Appendix B.
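A Keras sketch of this shared-trunk, multi-head layout for the four-output model is given below; the filter counts and dense-layer sizes are placeholders, with the actual values given in Appendix B.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_itn(s=225):
    inp = layers.Input(shape=(s, s, 1))
    x = inp
    for f in (32, 64, 128, 256, 256):              # 5 shared conv blocks
        x = layers.Conv2D(f, 3, padding='same', activation='relu')(x)
        x = layers.MaxPooling2D(2)(x)
    x = layers.Flatten()(x)

    def head(units, activation=None):              # per-output branch
        h = layers.Dense(256, activation='relu')(x)
        return layers.Dense(units, activation=activation)(h)

    t = head(3)                                     # translation
    q = head(4)                                     # quaternion
    P = head(6, 'softmax')                          # translation confidence
    Q = head(6, 'softmax')                          # rotation confidence
    return tf.keras.Model(inp, [t, q, P, Q])
```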

3 Experiments and Results

Data and Experiments: ITN is evaluated on 3D US volumes of the fetal brain from 72 subjects. For each volume, the TV and TC standard planes are manually selected by a clinical expert. 70% of the dataset is randomly selected for training and the remaining 30% for testing. All volumes are resampled to be isotropic, with mean dimensions of \(324 \times 207 \times 279\) voxels. ITN is implemented in TensorFlow and runs on a machine with an Intel Xeon CPU E5-1630 at 3.70 GHz and one NVIDIA Titan Xp 12 GB GPU. We set plane size s=225, N=10 and \(\alpha \)=\(\beta \)=\(\gamma \)=\(\delta \)=1. During training, we use a batch size of 64. Weights are initialised randomly from a distribution with zero mean and a standard deviation of 0.1. Optimisation is carried out for 100,000 iterations using the Adam algorithm with learning rate 0.001, \(\beta _1\)=0.9 and \(\beta _2\)=0.999. The predicted plane is evaluated against the GT using the distance between the plane centres (\(\delta x\)) and the rotation angle between the planes (\(\delta \theta \)). Image similarity between the planes is also measured using peak signal-to-noise ratio (PSNR) and structural similarity (SSIM).
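For reference, the geometric metrics \(\delta x\) and \(\delta \theta \) can be computed from \(4\times 4\) plane transforms as in the sketch below, taking \(\delta \theta \) as the geodesic rotation angle between the plane orientations (our reading of "rotation angle between the planes").

```python
import numpy as np

def plane_errors(T_pred, T_gt):
    """Centre distance (dx) and rotation angle (dtheta, in degrees)."""
    dx = np.linalg.norm(T_pred[:3, 3] - T_gt[:3, 3])      # centre distance
    R_rel = T_pred[:3, :3].T @ T_gt[:3, :3]               # relative rotation
    cos = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    dtheta = np.degrees(np.arccos(cos))
    return dx, dtheta
```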

Table 2. Evaluation of ITN with different transformation representations for standard plane detection. Results presented as (Mean ± Standard Deviation).
Table 3. Evaluation of ITN with/without confidence probability for standard plane detection. Results presented as (Mean ± Standard Deviation).

Results: Table 2 compares the plane detection results when different transformation representations are used by ITN. In general, there is little difference in translation error: all representations encode translation in the same way, using the three Cartesian axes, except for anchor points, which encode it implicitly and show slightly greater translation error. The rotation errors on the TC plane suggest that quaternions are a good representation. Rotation matrices and anchor points over-parameterise rotation, and the additional degrees of freedom can make network learning more difficult. Since these parameters are unconstrained, it is also harder to convert them back into valid rotations during inference. Quaternions have fewer parameters, and a slightly-off quaternion can easily be normalised to give a valid rotation. Compared to Euler angles, quaternions also avoid the problem of gimbal lock. For the TV plane, there is little difference in rotation error. This is because sonographers use the TV plane as a visual reference when acquiring 3D volumes, so the TV plane lies roughly in the central plane of the volume with low rotation variance, making the choice of rotation representation less important.

Table 3 compares the performance of ITN with and without the classification probability outputs. Given a baseline model (M1) that has only the regression outputs \(\varvec{t}\) and \(\varvec{q}\), adding the classification probabilities \(\varvec{P}\) and \(\varvec{Q}\) improves translation and rotation accuracy respectively (M2-M4). The classification probabilities act as confidence weights for the regression outputs, improving plane detection accuracy. Furthermore, the classification and regression outputs are trained in a multi-task fashion, which allows feature sharing and enables more generic features to be learned, thus preventing model overfitting. M1-M4 use a single plane image as CNN input. We further improve the results by instead using three orthogonal plane images, which provide more information about the 3D volume (M4+). M4 and M4+ take 0.46 s and 1.35 s respectively to predict one plane per volume. The supplementary material provides videos showing the update of a randomly initialised plane and its extracted image through 10 inference iterations.

Figure 2 shows a visual comparison between the GT planes and the planes predicted by M4. To evaluate the clinical relevance of the predicted planes, a clinical expert manually measures the head circumference (HC) on both the predicted and GT planes and finds the standard deviation of the measurement error to be 1.05 mm (TV) and 1.25 mm (TC). This is comparable to the intraobserver variability of 2.65 mm reported for HC measurements on the TC plane [8]. Thus, accurate biometrics can be extracted from our predicted planes.

Fig. 2. Visualisation of GT planes and planes predicted by M4.

4 Conclusion

We presented ITN, a new approach for standard plane detection in 3D fetal US that uses a CNN to regress rigid transformations iteratively. We compared different transformation representations and showed quaternions to be a good representation for iterative pose estimation. Additional classification probabilities, learned via multi-task learning, act as confidence weights for the regressed transformation parameters and improve plane detection accuracy. As future work, we are evaluating ITN on other plane detection tasks (e.g. view plane selection in cardiac MRI). It is also worthwhile to explore new transformation representations and to extend ITN to the simultaneous detection of multiple planes.