
1 Introduction

Cervical cancer is the second most common cancer among women, with more than half a million new cases reported every year [1]. However, systematic screening for cervical cancer using the Papanicolaou test (PAP test) can reduce the mortality rate by 70% or more [6]. The PAP test consists of a cytologist scanning a slide of a vaginal smear, typically at 400x magnification. At this magnification, the cytologist has to examine thousands of fields of view, raising the possibility of fatigue and thereby restricting the number of samples observed to 70 per day [5]. In light of these challenges, automation of cervical cancer screening has the potential to significantly improve healthcare.

In this work we propose a new deep learning algorithm for the classification of cervical cancer cells. The algorithm is intended to be usable in a health centre with limited computing resources; hence it is designed to be extremely fast and lightweight. The algorithm surpasses state-of-the-art performance on the Herlev dataset while being robust to segmentation errors. We also conducted experiments using our AIndra dataset. This new cervical cell dataset contains annotations for nuclear boundaries and labels by multiple expert annotators. The dataset enables novel analysis and interpretation of classification, segmentation and detection algorithms.

The combination of algorithm and dataset enables a unique evaluation strategy. By taking into account the inter-observer variability, we are able to unearth insights into the data that are not evident from standard evaluation. To the best of our knowledge, no prior work has demonstrated the effect of inter-observer variability for PAP smear images.

2 Related Works

2.1 Performance Measures

Common measures of performance like accuracy, precision and recall presuppose the existence of a unique ground truth. This assumes that the disagreement between two observers on a classification label is quite small. In many medical problems, this assumption does not hold true. As an example, [17] found that only 35% of their PAP-test samples had unanimous agreement between pathologists. Hence, it is essential to include inter-observer variability when analysing algorithms.

A common strategy to deal with inter-observer variance is to remove samples where observers disagree, but this will reduce the difficulty of the problem by removing ambiguous samples. Another approach is to take a majority vote with an odd number of observers. This method forces a label on samples which are fundamentally ambiguous. Consequently, training/evaluation with this data penalises the algorithm on samples where pathologists were indecisive. The consensus in the medical community for dealing with this challenge is to use measures like percentage agreement between the observers [10] or Cohen’s kappa coefficient (\(\kappa \)) [4, 10], which discounts observer agreement due to random chance. We refer to the percentage agreement between the observers using the symbol \(\varTheta \) in the following sections.
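As a minimal illustration of these two measures, the sketch below computes \(\varTheta \) and \(\kappa \) for a pair of annotators using scikit-learn; the helper function and the example label lists are hypothetical and only meant to show the calculation.

```python
# Minimal sketch: percentage agreement (Theta) and Cohen's kappa for two annotators.
from sklearn.metrics import cohen_kappa_score

def percentage_agreement(labels_a, labels_b):
    """Theta: fraction of samples on which the two annotators assign the same label."""
    assert len(labels_a) == len(labels_b)
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

# Hypothetical example labels (0 = normal, 1 = abnormal)
annotator_1 = [0, 1, 1, 0, 1, 0, 0, 1]
annotator_2 = [0, 1, 0, 0, 1, 0, 1, 1]

theta = percentage_agreement(annotator_1, annotator_2)   # raw agreement
kappa = cohen_kappa_score(annotator_1, annotator_2)      # chance-corrected agreement
print(f"Theta = {theta:.2%}, kappa = {kappa:.2f}")
```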

2.2 Datasets

The most popular dataset for evaluating cervical cancer cell classification is the Herlev dataset [12]. It consists of 917 high-quality images of single cells in seven classes. During data collection, cells were labelled by two cyto-technicians, and cells that were labelled differently were discarded. Consequently the dataset has artificially reduced difficulty, as discussed in Sect. 2.1. Though there are other datasets like HEMLBC [20], none of them provides annotations from multiple pathologists, ruling out inter-observer variability analysis.

2.3 Algorithms for Cervical Cancer Cell Classification

During the past decade, extensive research has been devoted to the accurate classification of cells for automating the PAP test. Most methods looked at the classification of single cells into various stages of carcinoma [3, 13]. These methods in turn relied on accurate segmentation of the cell or nucleus. However, state-of-the-art segmentation algorithms [3, 8, 16, 18] do not provide the needed segmentation accuracy. As an illustrative example, the best segmentation algorithm achieves a ZSI score of 0.92 on the Herlev dataset [19]. When this segmentation performance is taken into account, classification accuracy drops [20].

An interesting approach that does not rely on accurate segmentation is the classification of image patches [11, 14]. These patches consist of a cell and its immediate neighbourhood. The recent work [20] used this patch-based method. However, their evaluation does not involve challenges like clumping, staining variation, overlapping cells, etc. The algorithm proposed by [20] is also computationally expensive, taking around 3.5 seconds per input. Given that there are typically up to 300,000 cells in a single slide [12], the algorithm will not be usable on a clinical device.

3 Our Contributions

In this work, we propose a new deep learning algorithm for classification of cervical cancer cells. We also illustrate a unique evaluation strategy.

Salient Aspects of the Proposed Algorithm

  • Not reliant on accurate segmentation of nucleus or cytoplasm

  • Extremely fast and lightweight

  • Surpasses state of the art performance on Herlev dataset

Salient Aspects of Evaluation Method

  • Accounts for inter-observer variability using our AIndra dataset

  • Includes common evaluation strategy as a subset

  • Brings out latent information in the dataset

3.1 AIndra Dataset

Fig. 1. Sample images containing epithelial cells that exemplify challenges in the dataset with respect to image quality, cell distribution and blurring

This dataset consists of 140 images of conventional PAP smears with sizes varying from 640\(\,\times \,\)480 to 1280\(\,\times \,\)720 pixels, containing a total of 1201 cells. Each image contains multiple epithelial cells along with granulocytes such as neutrophils. The images also exhibit clumping, defocus blur, staining variation, etc. A few sample images from the dataset are shown in Fig. 1. Each epithelial cell in the dataset has its nuclear boundary marked and is classified by a cyto-pathologist (annotator-1) and a cyto-technician (annotator-2) according to the Bethesda system [15]. Unlike other datasets, we retain both labels. Among the annotated nuclei, we have a \(\varTheta \) of 76.55% and a \(\kappa \) of 0.61.

This dataset is the first cervical cancer cell dataset with multiple expert annotations, enabling inter-observer variability analysis. Since the dataset contains images with multiple cells, it can also be used for benchmarking detection and segmentation of nuclei. The dataset further enables the use of features that are external to epithelial cells, such as the presence of neutrophils.

3.2 DeepCerv: Network Architecture

Our network design is guided by the twin goals of accuracy and speed. Hence the network takes in raw RGB pixel values without any data preprocessing. The design follows the observation that neural networks for medical image analysis display adequate performance at low depth, an observation validated by popular networks in the literature such as [21]. The network consists of the initial three layers of AlexNet [7], a batch normalization layer to reduce overfitting, followed by a fully connected layer, as depicted in Table 1.

Table 1. Network architecture

The network is designed to process image patches of size (99, 99, 3) because, at 40X magnification, the complete cell information is captured within a 99\(\,\times \,\)99 field of view. It classifies cells into two classes: normal and abnormal. The abnormal class consists of the abnormal classes in the Bethesda system, and the normal class captures the rest.
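For concreteness, the following Keras sketch shows one way such a network could be assembled. The exact AlexNet filter sizes, strides and pooling placement are assumptions on our part (the text only states that the initial three layers of AlexNet are used), so this is an illustrative sketch rather than the released configuration.

```python
# A minimal Keras sketch of a DeepCerv-style network; layer hyper-parameters are assumed.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_deepcerv(input_shape=(99, 99, 3), num_classes=2):
    model = models.Sequential([
        layers.Input(shape=input_shape),
        # First three AlexNet-style convolutional layers (assumed standard settings)
        layers.Conv2D(96, 11, strides=4, activation='relu', padding='same'),
        layers.MaxPooling2D(3, strides=2),
        layers.Conv2D(256, 5, activation='relu', padding='same'),
        layers.MaxPooling2D(3, strides=2),
        layers.Conv2D(384, 3, activation='relu', padding='same'),
        # Batch normalization to reduce overfitting, as described in the text
        layers.BatchNormalization(),
        layers.Flatten(),
        # Final fully connected layer producing normal/abnormal scores
        layers.Dense(num_classes, activation='softmax'),
    ])
    return model

model = build_deepcerv()
model.summary()
```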

Model Size and Inference Time:- Our implementation of the network in TensorFlow is only 6 MB in size without any optimizations such as weight quantization. The network is extremely fast, processing an input in 1.7 ms on an Nvidia GeForce GTX 930M GPU with only 5–10% utilisation. With an estimated count of 300,000 cells per slide [12], our network takes 8.5 min per slide as compared to 12 days for [20]. This comparison does not even account for the low-end GPU we use relative to the TITAN Z used in [20]. The small size of the network, coupled with its extremely fast performance, enables future applications for cervical cancer screening on mobile devices.
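These per-slide figures follow directly from the per-input timings: \(300{,}000 \times 1.7\,\text{ms} = 510\,\text{s} \approx 8.5\) min for our network, versus \(300{,}000 \times 3.5\,\text{s} = 1{,}050{,}000\,\text{s} \approx 12\) days for [20].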

3.3 Nucleus Detection for Generating DeepCerv Input

DeepCerv expects its input to be single-cell images, but all the images in the AIndra dataset contain multiple, even overlapping, cells. Hence we pass these images through a cell detection algorithm to detect cell regions. The algorithm applies contrast-limited adaptive histogram equalisation (CLAHE) followed by thresholding on the AIndra dataset images to obtain a binary image. Connected components from this binary image are analysed, and those with less than 20% overlap with any ground-truth epithelial cell are rejected. The remaining connected components are considered to be potential nuclei. Though the algorithm is simple, it achieves reasonable performance, as given in Table 2. We denote this method by the label SEED.
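A minimal OpenCV sketch of this detection stage is given below; the CLAHE parameters and the choice of Otsu thresholding are illustrative assumptions rather than the exact settings used for SEED, and the ground-truth overlap filter is only noted in a comment since it requires the annotations.

```python
# Minimal sketch of SEED-style nucleus candidate detection: CLAHE, thresholding,
# and connected-component analysis. Parameter values are assumptions.
import cv2

def detect_nucleus_candidates(bgr_image):
    """Return bounding boxes (x, y, w, h) of connected components that may contain nuclei."""
    gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)
    # Contrast-limited adaptive histogram equalisation (CLAHE)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    equalised = clahe.apply(gray)
    # Thresholding (Otsu) to obtain a binary image; nuclei are darker than background
    _, binary = cv2.threshold(equalised, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # Connected-component analysis on the binary image
    num_labels, _, stats, _ = cv2.connectedComponentsWithStats(binary)
    boxes = []
    for label in range(1, num_labels):  # label 0 is the background
        x, y, w, h, area = stats[label]
        # In the full pipeline, components with < 20% overlap with any
        # ground-truth epithelial cell are rejected at this point.
        boxes.append((x, y, w, h))
    return boxes
```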

Table 2. Localization performance of detection algorithm for SEED

We also use the annotated ground truth to generate input patches. These patches are free of segmentation errors and hence serve as a benchmark of DeepCerv performance. This is similar to the strategies employed in other segmentation-free algorithms like [20]. To generate the data we crop a patch of fixed size around the centroid of the ground-truth nucleus. We use the label GND to refer to this strategy in the following sections.
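A minimal sketch of the GND patch generation is shown below, assuming a 99\(\,\times \,\)99 patch centred on the ground-truth nucleus centroid with reflective padding at image borders; the padding choice is an assumption for illustration.

```python
# Minimal sketch: crop a fixed-size patch around a ground-truth nucleus centroid.
import numpy as np

def crop_patch(image, centroid, size=99):
    """Crop a size x size patch around a (row, col) centroid, padding at image borders."""
    half = size // 2
    padded = np.pad(image, ((half, half), (half, half), (0, 0)), mode='reflect')
    r, c = int(centroid[0]) + half, int(centroid[1]) + half
    return padded[r - half:r + half + 1, c - half:c + half + 1]
```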

4 Experiments and Results

DeepCerv is evaluated on the Herlev dataset and on the AIndra dataset. In line with other work in the literature, we report accuracy on the Herlev dataset. The strong performance of features estimated by the first few layers of the network may be explained by their capture of low-level properties of the cell image, such as texture and smoothness of the cell boundary; these properties are decisive for identifying abnormal cells and are also consistent with the Bethesda system. On the proposed dataset we use inter-observer agreement and Cohen’s kappa, as per the discussion in Sect. 2.1.

Experiment Setup:- We used a stochastic gradient descent (SGD) optimiser to train the network described in Sect. 3.2, with the following parameter setup: learning rate = 0.0001, decay = 1\(e^{-6}\), momentum = 0.9. The network performance is further improved by using data augmentation methods such as image rotation, width/height shift and horizontal flip.
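In Keras terms, this setup corresponds roughly to the following sketch; the batch size, number of epochs, and augmentation ranges are not specified above and are therefore assumptions.

```python
# Minimal training-setup sketch: SGD with the stated hyper-parameters plus augmentation.
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.preprocessing.image import ImageDataGenerator

model = build_deepcerv()  # from the architecture sketch in Sect. 3.2
optimizer = SGD(learning_rate=1e-4, momentum=0.9)  # add decay=1e-6 on Keras versions that support it
model.compile(optimizer=optimizer,
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Data augmentation: rotation, width/height shift and horizontal flip (ranges assumed)
augmenter = ImageDataGenerator(rotation_range=20,
                               width_shift_range=0.1,
                               height_shift_range=0.1,
                               horizontal_flip=True)

# Example usage (x_train, y_train, x_val, y_val assumed to be prepared patch arrays):
# model.fit(augmenter.flow(x_train, y_train, batch_size=32),
#           validation_data=(x_val, y_val), epochs=50)
```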

4.1 Experiments on Herlev Dataset

We performed 5-fold cross-validation on this dataset and the results are given in Table 3. It is to be noted that no preprocessing of any sort is involved apart from resizing. The seven classes in the dataset were converted to two classes by combining all abnormal classes in Herlev into one and all the normal classes into another. The table clearly shows that we achieve state-of-the-art performance on the Herlev dataset.
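A minimal sketch of this protocol is given below; the mapping of Herlev class indices to normal/abnormal assumes the three normal classes precede the four abnormal ones, and the training step is left as a placeholder.

```python
# Minimal sketch: 7-to-2 class mapping and 5-fold cross-validation on Herlev-style labels.
import numpy as np
from sklearn.model_selection import StratifiedKFold

def to_binary(herlev_labels):
    # Assumed ordering: indices 0-2 are the normal classes, 3-6 the abnormal ones.
    return np.array([0 if y < 3 else 1 for y in herlev_labels])

def cross_validate(images, herlev_labels, build_model_fn, n_splits=5):
    labels = to_binary(herlev_labels)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    accuracies = []
    for train_idx, test_idx in skf.split(images, labels):
        model = build_model_fn()
        # ... compile and fit on images[train_idx], labels[train_idx],
        #     then evaluate on images[test_idx], labels[test_idx] ...
    return accuracies
```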

Table 3. Performance on Herlev dataset

4.2 Experiments on the AIndra Dataset

Experiments on Data Where Annotators Agree.

The discussion in Sect. 2.1 brought forward the impact of various strategies for dealing with inter-observer variance. Since the strategy of discarding samples on which annotators do not agree is prevalent in the literature, we explore the impact of such a strategy. Hence in this experiment we use the cells on which the annotators agree, hereafter referred to as the common data. We perform 5-fold cross-validation on the common data using DeepCerv. From the results given in Table 4 we can see that the percentage agreement and \(\kappa \) between the algorithm and the common data (\(80.53\%\), 0.57) are close to those between the two annotators (\(76.55\%\), 0.61). When the same experiment is repeated on SEED, the results do not show a significant change, thereby validating the robustness of DeepCerv to segmentation errors.

Table 4. Performance on data where annotators agree

Performance on Data from Individual Annotators.

The strategy of training and testing on common data illustrates the performance of DeepCerv on comparatively error-free data. However, as this strategy reduces the problem complexity, the results do not reflect the performance on a practical problem. A better estimate is obtained by seeing how DeepCerv performs when the data contains cells that are ambiguous and labels that have randomness associated with them. Consequently, in this experiment we generate data using all cells annotated by each individual annotator. Similar to the earlier section, we perform 5-fold cross-validation on this data using DeepCerv. The results of the experiment are given in Table 5.

Table 5. Performance on data from individual annotators

The results in Table 5 show observations in line with those on the common data. While the algorithm is able to achieve high percentage agreement with both annotators, the performance on annotator-1 exceeds that on annotator-2. Surprisingly, the algorithm achieves better agreement with annotator-1 than with the common data in the previous section. These observations may indicate lower annotation consistency of annotator-2 in comparison to annotator-1. Interestingly, this aligns with the annotator profiles given in Sect. 3.1 and acts as a validation of the algorithm.

5 Conclusions

In this work, we proposed a new deep learning algorithm for the classification of cervical cells. The algorithm surpasses state-of-the-art performance on the Herlev dataset while being extremely fast in comparison to similar work on cervical cancer cell classification in the literature. By virtue of its high accuracy and speed, the algorithm has the potential to enable automated cervical cancer screening on low-power devices, while the AIndra dataset allows novel analysis that is much closer to real-world applications. Through the combination of algorithm and dataset, we provide analysis that brings forward the importance of considering inter-observer reliability in the context of medical problems and the insights it can provide about the data.