Abstract
Automated classification of cervical cancer cells has the potential to reduce the high mortality rates due to cervical cancer in developing countries. However, traditional algorithms depend on accurate segmentation of cells, which is itself an open problem. These algorithms are also often evaluated without considering the large inter-observer variability in ground-truth labels. We propose a new deep learning algorithm that does not depend on accurate segmentation, instead directly classifying image patches containing cells. We evaluate the proposed algorithm on the popular Herlev dataset and show that it achieves state-of-the-art accuracy while being extremely fast. The experimental results are also demonstrated on the AIndra dataset collected by us, which captures inter-observer variability.
O. U. Nirmal Jith and K. K. Harinarayanan—Equal contribution.
Keywords
- Cervical cancer
- Cell classification
- PAP-test
- Deep learning
- Dataset
- Inter-observer reliability
- Cohen’s kappa
1 Introduction
Cervical cancer is the second most common cancer among women, with more than half a million new cases reported every year [1]. However, systematic screening for cervical cancer using the Papanicolaou test (PAP-test) can reduce the mortality rate by 70% or more [6]. A PAP-test consists of a cytologist scanning a slide of vaginal smear, typically at 400x magnification. At this magnification, the cytologist has to examine thousands of fields of view, raising the possibility of fatigue and thereby restricting the number of samples observed to 70 per day [5]. In light of these challenges, automation of cervical cancer screening has the potential to significantly improve healthcare.
In this work we propose a new deep learning algorithm for the classification of cervical cancer cells. The algorithm is intended to be usable in a health centre with limited computing resources, and hence is designed to be extremely fast and lightweight. It surpasses state-of-the-art performance on the Herlev dataset while being robust to segmentation errors. We also conducted experiments using our AIndra dataset. This new cervical cell dataset contains nuclear-boundary annotations and labels from multiple expert annotators, enabling novel analysis and interpretation of classification, segmentation and detection algorithms.
The combination of algorithm and dataset enables a unique evaluation strategy. By taking inter-observer variability into account, we are able to unearth interesting insights into the data that are not evident during standard evaluation. To the best of our knowledge, no work has been reported that demonstrates the effect of inter-observer variability for PAP-smear images.
2 Related Works
2.1 Performance Measures
Common measures of performance like accuracy, precision and recall presuppose the existence of a unique ground truth. This assumes that the disagreement between two observers on a classification label is quite small. In many medical problems, this assumption does not hold. As an example, [17] found that only 35% of their PAP-test samples had unanimous agreement between pathologists. Hence, it is essential to include inter-observer variability when analysing algorithms.
A common strategy to deal with inter-observer variance is to remove samples where observers disagree, but this reduces the difficulty of the problem by removing ambiguous samples. Another approach is to take a majority vote with an odd number of observers. This method forces a label on samples that are fundamentally ambiguous; consequently, training or evaluation with such data penalises an algorithm on samples where the pathologists themselves were indecisive. The consensus in the medical community is to use measures like the percentage agreement between observers [10] or Cohen’s kappa coefficient (\(\kappa \)) [4, 10], which discounts observer agreement due to random chance. We refer to the percentage agreement between observers using the symbol \(\varTheta \) in the following sections.
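Both measures are straightforward to compute from a pair of label sequences. The sketch below shows the standard formulas on hypothetical binary labels (the variable names and toy data are ours, not from the paper):

```python
from collections import Counter

def percentage_agreement(a, b):
    """Fraction of samples on which the two annotators agree (Theta)."""
    assert len(a) == len(b)
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement [4]."""
    n = len(a)
    p_o = percentage_agreement(a, b)
    ca, cb = Counter(a), Counter(b)
    # Expected agreement if both annotators labelled at random with
    # their observed marginal class frequencies.
    p_e = sum(ca[k] * cb[k] for k in set(a) | set(b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical labels (0 = normal, 1 = abnormal) from two observers.
ann1 = [0, 0, 1, 1, 0, 1, 0, 0]
ann2 = [0, 1, 1, 1, 0, 0, 0, 0]
print(percentage_agreement(ann1, ann2))  # 0.75
print(cohens_kappa(ann1, ann2))          # lower than 0.75, after chance correction
```

Note how \(\kappa\) is noticeably lower than raw agreement whenever one class dominates, which is exactly why it is preferred for imbalanced medical data.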
2.2 Datasets
The most popular dataset for evaluating cervical cancer cell classification is the Herlev dataset [12]. It consists of 917 high-quality images of single cells in seven classes. During data collection, cells were labelled by two cyto-technicians, and cells that were labelled differently were discarded. Consequently, the dataset has artificially reduced difficulty, as discussed in Sect. 2.1. Though there are other datasets like HEMLBC [20], none of them provide annotations from multiple pathologists, ruling out inter-observer variability analysis.
2.3 Algorithms for Cervical Cancer Cell Classification
During the past decade, extensive research has been devoted to the accurate classification of cells for automating the PAP-test. Most of the methods looked at classification of single cells into various stages of carcinoma [3, 13]. These methods in turn relied on accurate segmentation of the cell or nucleus. However, state-of-the-art segmentation algorithms [3, 8, 16, 18] do not provide the needed accuracy. As an illustrative example, the best segmentation algorithm achieves a ZSI score of 0.92 on the Herlev dataset [19]. When this performance is taken into account, classification accuracy drops [20].
An interesting approach that does not rely on accurate segmentation is the classification of image patches [11, 14]. These patches consist of a cell and its immediate neighbourhood. The recent work [20] used this patch-based method; however, their evaluation does not involve challenges like clumping, staining variation, overlapping cells etc. The algorithm proposed by [20] is also computationally expensive, taking around 3.5 s per input. Given that there are typically up to 300,000 cells in a single slide [12], the algorithm would not be usable on a clinical device.
3 Our Contributions
In this work, we propose a new deep learning algorithm for classification of cervical cancer cells. We also illustrate a unique evaluation strategy.
Salient Aspects of the Proposed Algorithm
- Not reliant on accurate segmentation of nucleus or cytoplasm
- Extremely fast and lightweight
- Surpasses state-of-the-art performance on the Herlev dataset
Salient Aspects of Evaluation Method
- Accounts for inter-observer variability using our AIndra dataset
- Includes the common evaluation strategy as a subset
- Brings out latent information in the dataset
3.1 AIndra Dataset
This dataset consists of 140 images of conventional PAP smears, with sizes varying from 640\(\,\times \,\)480 to 1280\(\,\times \,\)720 pixels and a total of 1201 cells. Each image contains multiple epithelial cells along with granulocytes like neutrophils. The images also exhibit clumping, defocus blur, staining variation etc. A few sample images from the dataset are shown in Fig. 1. Each epithelial cell in the dataset has its nuclear boundary marked and is classified by a cyto-pathologist (annotator-1) and a cyto-technician (annotator-2) according to the Bethesda system [15]. Unlike other datasets, we retain both labels. Among the annotated nuclei, we have a \(\varTheta \) of 76.55% and a \(\kappa \) of 0.61.
This dataset is the first cervical cancer cell dataset with multiple expert annotations, enabling inter-observer variability analysis. Since the dataset contains images with multiple cells, it can be used for benchmarking detection and segmentation of nuclei. The dataset also enables the use of features that are external to epithelial cells, like the presence of neutrophils (Footnote 1).
3.2 DeepCerv: Network Architecture
Our network design is guided by the twin goals of accuracy and speed. Hence the network takes in raw RGB pixel values without any data preprocessing. The design follows the observation that neural networks for medical image analysis display adequate performance at low depth, as validated by popular networks in the literature such as [21]. The network consists of the initial three layers of AlexNet [7] and a batch normalization layer to reduce overfitting, followed by a fully connected layer, as depicted in Table 1.
The network is designed to process image patches of size (99, 99, 3) because, at 40X resolution, complete cell information is captured in a 99\(\,\times \,\)99 field of view. It classifies cells into two classes: normal and abnormal. The abnormal class consists of the abnormal classes in the Bethesda system, and the normal class captures the rest.
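The shrinking of the 99\(\,\times \,\)99 patch through the convolutional front end can be traced with simple shape bookkeeping. The kernel/stride/padding values below follow the first three convolutional blocks of the original AlexNet [7] and are an illustrative assumption, not the exact DeepCerv hyper-parameters from Table 1:

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Spatial output size of a convolution/pooling layer
    (floor division, matching common framework defaults)."""
    return (size + 2 * pad - kernel) // stride + 1

# Trace a 99x99 input patch through an AlexNet-style front end.
size = 99
size = conv_out(size, kernel=11, stride=4)   # conv1: 11x11, stride 4
size = conv_out(size, kernel=3, stride=2)    # max-pool 3x3, stride 2
size = conv_out(size, kernel=5, pad=2)       # conv2: 5x5, padded
size = conv_out(size, kernel=3, stride=2)    # max-pool 3x3, stride 2
size = conv_out(size, kernel=3, pad=1)       # conv3: 3x3, padded
print(size)  # spatial size entering the fully connected layer
```

Under these assumed hyper-parameters, a 99\(\,\times \,\)99 patch reduces to a small 5\(\,\times \,\)5 spatial map, which keeps the fully connected layer (and the whole model) compact.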
Model Size and Inference Time:- Our TensorFlow implementation of the network is only 6 MB in size without any optimizations like weight quantization. The network is extremely fast, processing an input in 1.7 ms on an Nvidia GeForce GTX 930M GPU at only 5–10% utilisation. With an estimated count of 300,000 cells per slide [12], our network takes 8.5 min per slide as compared to 12 days for [20]. This comparison makes no allowance for the low-end GPU we use relative to the TITAN Z used in [20]. The small size of the network coupled with its extremely fast performance enables future applications of cervical cancer screening on mobile devices.
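The per-slide figures follow directly from the per-input timings and the cell-count estimate of [12]; the short calculation below checks the arithmetic:

```python
cells_per_slide = 300_000   # estimated cells in one slide [12]
deepcerv_ms = 1.7           # DeepCerv inference time per input (ms)
deeppap_s = 3.5             # per-input time reported for [20] (s)

deepcerv_minutes = cells_per_slide * deepcerv_ms / 1000 / 60
deeppap_days = cells_per_slide * deeppap_s / 86400
print(deepcerv_minutes, deeppap_days)  # 8.5 minutes vs roughly 12 days
```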
3.3 Nucleus Detection for Generating DeepCerv Input
DeepCerv expects single-cell images as input, but the images in the AIndra dataset contain multiple, even overlapping, cells. Hence we pass these images through a cell detection algorithm to detect cell regions. The algorithm applies contrast-limited adaptive histogram equalisation (CLAHE) followed by thresholding on the AIndra dataset images to obtain a binary image. Connected components from this binary image are analysed, and those with less than 20% overlap with any ground-truth epithelial cell are rejected. The remaining connected components are considered potential nuclei. Though the algorithm is simple, it achieves reasonable performance, as given in Table 2. We denote this method by the label SEED.
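Assuming the CLAHE-and-threshold step has already produced a labelled component map, the 20% overlap rule can be sketched as below (`filter_candidates` is a hypothetical helper, not from the paper; real pipelines would obtain the labelling from e.g. an OpenCV connected-components pass):

```python
import numpy as np

def filter_candidates(component_labels, gt_mask, min_overlap=0.20):
    """Keep connected-component labels whose area overlaps the ground-truth
    epithelial-cell mask by at least min_overlap (the 20% rule above)."""
    kept = []
    for lbl in np.unique(component_labels):
        if lbl == 0:  # 0 = background
            continue
        comp = component_labels == lbl
        overlap = np.logical_and(comp, gt_mask).sum() / comp.sum()
        if overlap >= min_overlap:
            kept.append(int(lbl))
    return kept

# Toy example: component 1 overlaps the ground truth, component 2 does not.
labels = np.array([[1, 1, 0, 2],
                   [1, 1, 0, 2],
                   [0, 0, 0, 2]])
gt = np.array([[1, 1, 0, 0],
               [1, 0, 0, 0],
               [0, 0, 0, 0]], dtype=bool)
print(filter_candidates(labels, gt))  # [1]
```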
We also use the annotated ground truth to generate input patches. These patches are free of segmentation errors and hence serve as a benchmark of DeepCerv performance. This is similar to the strategies employed in other segmentation-free algorithms like [20]. To generate the data we crop a patch of fixed size around the centroid of the ground truth. We use the label GND to refer to this strategy in the following sections.
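A minimal sketch of the GND cropping, assuming the 99\(\,\times \,\)99 patch size of Sect. 3.2 and clamping at image borders (`crop_patch` and the border handling are our illustrative choices, not necessarily the authors' exact implementation):

```python
import numpy as np

def crop_patch(image, centroid, size=99):
    """Crop a size x size patch centred on a ground-truth nucleus centroid,
    shifting the window inward when the centroid is near the border."""
    h, w = image.shape[:2]
    cy, cx = centroid
    y0 = min(max(cy - size // 2, 0), h - size)
    x0 = min(max(cx - size // 2, 0), w - size)
    return image[y0:y0 + size, x0:x0 + size]

# A nucleus near the top-right corner still yields a full-size patch.
img = np.zeros((480, 640, 3), dtype=np.uint8)
patch = crop_patch(img, centroid=(10, 630))
print(patch.shape)  # (99, 99, 3)
```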
4 Experiments and Results
DeepCerv is evaluated on the Herlev dataset and on the AIndra dataset. In line with other work in the literature, we report accuracy on the Herlev dataset. The better performance of features estimated using the first few layers of the network may be because they capture low-level characteristics of the cell image, such as texture and the smoothness of the cell boundary, which are decisive features for abnormal cells and are also supported by the Bethesda system. On the proposed dataset we use inter-observer agreement and Cohen’s kappa, as per the discussion in Sect. 2.1.
Experiment Setup:- We used the stochastic gradient descent (SGD) optimiser for training the network described in Sect. 3.2, with learning rate = 0.0001, decay = \(10^{-6}\) and momentum = 0.9. The network performance is improved by using data augmentation methods like image rotation, width/height shift and horizontal flip.
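For concreteness, the update rule implied by this setup can be sketched in numpy on a toy quadratic loss. The time-based decay schedule below follows the Keras convention for the `decay` parameter; it is an assumption about the training setup, not code from the paper:

```python
import numpy as np

# SGD with momentum and learning-rate decay, using the hyper-parameters
# from the experiment setup: lr = 1e-4, decay = 1e-6, momentum = 0.9.
lr0, decay, momentum = 1e-4, 1e-6, 0.9

w = np.array([1.0, -2.0])     # toy weights; loss is ||w||^2
v = np.zeros_like(w)          # momentum buffer
for step in range(100):
    grad = 2 * w                       # gradient of ||w||^2
    lr = lr0 / (1 + decay * step)      # time-based decay schedule
    v = momentum * v - lr * grad       # momentum update
    w = w + v
print(np.linalg.norm(w))               # the weights shrink toward the optimum
```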
4.1 Experiments on Herlev Dataset
We performed 5-fold cross-validation on this dataset; the results are in Table 3. Note that no preprocessing of any sort is involved apart from resizing. The seven classes in the dataset were converted to two by combining all abnormal classes in Herlev into one and all normal classes into another. The table clearly shows that we achieve state-of-the-art performance on the Herlev dataset.
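The binarisation and the stratified fold assignment can be sketched as follows. The class names are placeholders for the seven Herlev categories, and `stratified_kfold` is a minimal illustrative implementation (in practice one might use scikit-learn's `StratifiedKFold`):

```python
import random

def stratified_kfold(labels, k=5, seed=0):
    """Assign each sample a fold index so every class is spread
    evenly across the k folds (minimal 5-fold CV sketch)."""
    rng = random.Random(seed)
    folds = [0] * len(labels)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for j, i in enumerate(idxs):
            folds[i] = j % k
    return folds

# Binarise: abnormal Herlev classes -> 1, normal classes -> 0.
ABNORMAL = {"light_dysplastic", "moderate_dysplastic",
            "severe_dysplastic", "carcinoma_in_situ"}
cells = ["normal_columnar", "light_dysplastic", "severe_dysplastic",
         "normal_superficiel", "carcinoma_in_situ", "normal_intermediate",
         "moderate_dysplastic"] * 10
binary = [1 if c in ABNORMAL else 0 for c in cells]
folds = stratified_kfold(binary, k=5)
print(len(set(folds)))  # samples span all 5 folds
```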
4.2 Experiments on the AIndra Dataset
Experiments on Data Where Annotators Agree.
The discussion in Sect. 2.1 brought forward the impact of various strategies for dealing with inter-observer variance. Since the strategy of discarding samples where annotators do not agree is prevalent in the literature, we explore its impact here. Hence, in this experiment we use the cells on which the annotators agree, hereafter referred to as common data. We perform 5-fold cross-validation on the common data using DeepCerv. From the results given in Table 4 we can see that the percentage agreement and \(\kappa \) between the algorithm and the common data (\(80.53\%\), 0.57) are close to those between the two annotators (\(76.55\%\), 0.61). When the same experiment is repeated on SEED, the results do not change significantly, validating the robustness of DeepCerv to segmentation errors.
Performance on Data from Individual Annotators. The strategy of training and testing on common data illustrates the performance of DeepCerv on comparatively error-free data. However, as this strategy reduces the problem complexity, the results do not reflect performance on a practical problem. A better estimate is to see how DeepCerv performs when the data contains ambiguous cells and the labels have randomness associated with them. Consequently, in this experiment we generate data using all cells annotated by each individual annotator. As in the earlier section, we perform 5-fold cross-validation on this data using DeepCerv. The results of the experiment are given in Table 5.
The results in Table 5 show observations in line with those for the common data. While the algorithm achieves high percentage agreement with both annotators, the performance on annotator-1 exceeds that on annotator-2. Surprisingly, the algorithm achieves better agreement with annotator-1 than with the common data in the previous section. These observations may indicate lower annotation consistency for annotator-2 in comparison to annotator-1. Interestingly, this aligns with the annotator profiles given in Sect. 3.1 and acts as a validation of the algorithm.
5 Conclusions
In this work, we proposed a new deep learning algorithm for the classification of cervical cells. The algorithm surpasses state-of-the-art performance on the Herlev dataset while being extremely fast in comparison to similar work on cervical cancer cell classification in the literature. By virtue of its high accuracy and speed, the algorithm has the potential to enable automated cervical cancer screening on low-power devices, while the AIndra dataset allows novel analysis that is much closer to real-world applications. Through the combination of algorithm and dataset, we provide novel analysis that brings forward the importance of considering inter-observer reliability in the context of medical problems and the insights it can provide on data.
Notes
1. Presence of neutrophils signifies inflammation in that region.
References
Bengtsson, E., Malm, P.: Screening for Cervical Cancer Using Automated Analysis of PAP-Smears. Computational and Mathematical Methods in Medicine 2014, 1–12 (2014)
Bora, K., Chowdhury, M., Mahanta, L.B., Kundu, M.K., Das, A.K.: Automated classification of Pap smear images to detect cervical dysplasia. Computer Methods and Programs in Biomedicine 138, 31–47 (2017)
Chankong, T., Theera-Umpon, N., Auephanwiriyakul, S.: Automatic cervical cell segmentation and classification in Pap smears. Computer Methods and Programs in Biomedicine 113(2), 539–556 (2014)
Cohen, J.: A Coefficient of Agreement for Nominal Scales. Educational and Psychological Measurement 20(1), 37–46 (1960)
Elsheikh, T.M., Austin, R.M., Chhieng, D.F., Miller, F.S., Moriarty, A.T., Renshaw, A.A.: American Society of Cytopathology workload recommendations for automated Pap test screening: Developed by the productivity and quality assurance in the era of automated screening task force. Diagnostic Cytopathology 41(2), 174–178 (2013)
Kitchener, H.C., Castle, P.E., Cox, J.T.: Chapter 7: achievements and limitations of cervical cytology screening. Vaccine 24, S63–S70 (2006)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems. pp. 1097–1105 (2012)
Lu, Z., Carneiro, G., Bradley, A.P., Ushizima, D., Nosrati, M.S., Bianchi, A.G.C., Carneiro, C.M., Hamarneh, G.: Evaluation of Three Algorithms for the Segmentation of Overlapping Cervical Cells. IEEE Journal of Biomedical and Health Informatics 21(2), 441–450 (2017)
Marinakis, Y., Dounias, G., Jantzen, J.: Pap smear diagnosis using a hybrid intelligent scheme focusing on genetic algorithm based feature selection and nearest neighbor classification. Computers in Biology and Medicine 39(1), 69–78 (2009)
McHugh, M.L.: Interrater reliability: The kappa statistic. Biochemia Medica 22(3), 276–282 (2012)
Nanni, L., Lumini, A., Brahnam, S.: Local binary patterns variants as texture descriptors for medical image analysis. Artificial Intelligence in Medicine 49(2), 117–125 (2010)
Norup, J.: Classification of Pap-Smear Data by Transduction Neuro-Fuzzy Methods. Master's thesis, Technical University of Denmark, DTU, DK-2800 Kgs. Lyngby, Denmark (2005)
Phoulady, H.A., Zhou, M., Goldgof, D.B., Hall, L.O., Mouton, P.R.: Automatic quantification and classification of cervical cancer via adaptive nucleus shape modeling. In: Image Processing (ICIP), 2016 IEEE International Conference On. pp. 2658–2662. IEEE (2016)
Sokouti, B., Haghipour, S., Tabrizi, A.D.: A framework for diagnosing cervical cancer disease based on feedforward MLP neural network and ThinPrep histopathological cell image features. Neural Computing and Applications 24(1), 221–232 (2014)
Solomon, D.: The 2001 Bethesda System: Terminology for Reporting Results of Cervical Cytology. JAMA 287(16), 2114 (2002)
Song, Y., Zhang, L., Chen, S., Ni, D., Lei, B., Wang, T.: Accurate Segmentation of Cervical Cytoplasm and Nuclei Based on Multiscale Convolutional Network and Graph Partitioning. IEEE transactions on bio-medical engineering 62(10), 2421–2433 (2015)
Young, N.A., Naryshkin, S., Atkinson, B.F., Ehya, H., Gupta, P.K., Kline, T.S., Luff, R.D.: Interobserver variability of cervical smears with squamous-cell abnormalities: A Philadelphia study. Diagnostic Cytopathology 11(4), 352–357 (1994)
Zhang, L., Kong, H., Liu, S., Wang, T., Chen, S., Sonka, M.: Graph-based segmentation of abnormal nuclei in cervical cytology. Computerized Medical Imaging and Graphics 56, 38–48 (2017)
Zhang, L., Kong, H., Ting Chin, C., Liu, S., Fan, X., Wang, T., Chen, S.: Automation-assisted cervical cancer screening in manual liquid-based cytology with hematoxylin and eosin staining. Cytometry. Part A: The Journal of the International Society for Analytical Cytology 85(3), 214–230 (Mar 2014)
Zhang, L., Lu, L., Nogues, I., Summers, R.M., Liu, S., Yao, J.: DeepPap: Deep Convolutional Networks for Cervical Cell Classification. IEEE Journal of Biomedical and Health Informatics 21(6), 1633–1643 (2017)
Zhu, X., Yao, J., Zhu, F., Huang, J.: Wsisa: Making survival prediction from whole slide histopathological images. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 7234–7242 (2017)
© 2018 Springer Nature Switzerland AG
Cite this paper
Nirmal Jith, O.U., Harinarayanan, K.K., Gautam, S., Bhavsar, A., Sao, A.K. (2018). DeepCerv: Deep Neural Network for Segmentation Free Robust Cervical Cell Classification. In: Stoyanov, D., et al. Computational Pathology and Ophthalmic Medical Image Analysis. OMIA COMPAY 2018 2018. Lecture Notes in Computer Science(), vol 11039. Springer, Cham. https://doi.org/10.1007/978-3-030-00949-6_11
Print ISBN: 978-3-030-00948-9
Online ISBN: 978-3-030-00949-6