Learning to detect chest radiographs containing pulmonary lesions using visual attention networks
Introduction
Lung cancer is the most common cancer worldwide and the second most common cancer in Europe and the USA (Ferlay et al., 2013; American Cancer Society). Due to delays in diagnosis, it is typically discovered at an advanced stage with a very low survival rate (Cancer Research UK, 2014). The chest radiograph is the most commonly performed radiological investigation in the initial assessment of suspected lung cancer because it is inexpensive and delivers a low radiation dose. On a chest radiograph, a nodule is defined as a rounded opacity ≤ 3 cm, which can be well- or poorly marginated; lesions ≥ 3 cm do not typically pose a diagnostic challenge (Hansell et al., 2008). However, detecting small pulmonary nodules on plain film is challenging despite its high spatial resolution, because an x-ray is a single projection of the entire 3D thorax volume. The planar nature of radiograph acquisition means that thoracic structures are superimposed: the heart, diaphragm, and mediastinum may obscure a large portion of the lungs. Patients may also have several co-existing pathologies visible on each radiograph. Many benign lesions can mimic pathology due to composite shadowing and, furthermore, a nodule can be very small or have ill-defined margins. Studies have shown that in up to 40% of new lung cancer diagnoses, the lesion was present on a previous plain film but was missed, being picked up only in hindsight (Forrest and Friedman, 1981; Quekel et al., 1999).
Computer-aided detection (CAD) systems using machine learning techniques can facilitate the automated detection of lung nodules and provide a cost-effective second-opinion reporting mechanism. The reported performance of these CAD systems varies substantially depending on the size and nature of the samples; for instance, sensitivity rates reported in the literature for lesions larger than 5 mm vary considerably between studies (Moore et al., 2011; Szucs-Farkas et al., 2013). Currently, state-of-the-art results for automated object detection in images are obtained by deep convolutional neural networks (DCNNs). During training, these methods require a large number of manually annotated images in which the contours of each object are identified or, at the very least, a bounding box indicates their location within the image. The large majority of these methods use regression models to predict the coordinates of the bounding boxes (Erhan et al., 2014; Szegedy et al., 2013) or, alternatively, make use of sliding windows (Ren et al., 2015; Sermanet et al., 2014). Most documented studies rely on large datasets of natural images (Everingham et al., 2010; Lin et al., 2014) in which the objects to be detected are typically well-defined and prominent within the context of the entire image. The applicability of these technologies in radiology has not been fully explored, partly due to the paucity of large databases of annotated medical images.
In recent years, the increasing availability of digital archiving and reporting systems, such as PACS (Picture Archiving and Communication System) and RIS (Radiology Information System), has laid the technological foundations needed to make large volumes of clinical data and images available from hospital archives (Cho et al., 2015; Cornegruta et al., 2016). In this study, our aim is to leverage a large number of radiological exams extracted from a hospital’s data archives to explore the feasibility of deep learning for lung nodule detection. In particular, we assess the performance of a statistical classifier that discriminates between chest radiographs containing regions indicating the presence of a pulmonary lesion and those that do not. Our first hypothesis is that, with a sufficiently large training database, a classifier based on deep convolutional networks can be trained to accomplish this task using only weak image labels. To address this hypothesis, we collected over 700,000 historical chest radiographs from two large teaching hospitals in London (UK). A natural language processing (NLP) system was developed to parse all free-text radiological reports and identify the exams containing pulmonary lesions. This is a challenging learning task, as a proportion of the automatically-extracted labels in the training dataset is expected to be erroneous or incomplete due to reporting errors or omissions (Brady, 2017), inter-reader variability (Elmore et al., 1994; Elmore et al., 2015) and potential NLP failures. The performance of the resulting image classifier was assessed using a manually curated, independent dataset of over 6,000 exams.
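To illustrate the idea of deriving weak labels from free-text reports, the following is a minimal, hypothetical sketch based on keyword matching with a crude negation check. It is not the system used in the study (which employs a full NLP pipeline); the term lists and regexes here are illustrative assumptions.

```python
import re

# Hypothetical term lists for illustration only; the study's NLP system is
# far more sophisticated (e.g. handling hedging, anatomy, and negation scope).
LESION_TERMS = re.compile(r"\b(nodule|mass|lesion|opacity)\b", re.IGNORECASE)
NEGATED_MENTION = re.compile(
    r"\b(no|without|absence of)\b[^.]*?\b(nodule|mass|lesion|opacity)\b",
    re.IGNORECASE,
)

def weak_label(report: str) -> int:
    """Return 1 if the report contains a non-negated lesion mention, else 0."""
    if not LESION_TERMS.search(report):
        return 0
    # Check sentence by sentence so a negation cue only cancels mentions
    # within the same sentence.
    positives = 0
    for sentence in report.split("."):
        if LESION_TERMS.search(sentence) and not NEGATED_MENTION.search(sentence):
            positives += 1
    return 1 if positives > 0 else 0
```

A rule-based sketch like this makes the label-noise problem discussed above concrete: negation, hedging, and omissions in the report all leak directly into the training labels.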
Our second and main hypothesis is that significant classification improvements can be obtained by augmenting the weak and potentially noisy labels with bounding boxes indicating the exact location of any lesions in a subset of the training exams. Manual annotation simply does not scale given the size of currently available historical datasets; realistically, only a fraction of all exams can be reviewed and annotated. It is therefore of interest to design a classifier that leverages both weakly labelled and annotated images. To investigate this hypothesis, approximately 9% of the radiographs presenting lesions were randomly selected and reviewed by a radiologist who manually delineated the bounding boxes. This annotation process resulted in over 3,000 lesion examples.
We present two different learning strategies to leverage both weak labels and the annotations of lesions. Our guiding principle was that, when the position of a lesion is known during training, it can be exploited to provide the network with visual feedback on the quality of the features learned by the convolutional filters. As such, both strategies introduce attention mechanisms within the classifier in order to learn improved imaging representations. Our first approach exploits a soft attention mechanism. Using weakly-labelled images, a convolutional network learns imaging features by minimising the classification error and generates saliency maps highlighting parts of an image that are likely to contain a lesion. Using the subset of annotated images, a composite loss function is employed to penalise the discrepancy between the network’s implied position of a lesion, provided by the saliency map during training, and the real position of the lesion. A large loss indicates that the network’s current representation does not accurately capture the lesion’s visual patterns, and provides an additional mechanism for self-improvement through back-propagation. The resulting architecture, a convolutional neural network with attention feedback (CONAF), features an improved localisation capability which, in turn, boosts the classification performance.
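One simple way to realise such an attention-feedback penalty is to measure how much of the network's saliency mass falls outside the annotated bounding box. The exact composite loss used by CONAF is not specified in this excerpt; the NumPy sketch below is an illustrative assumption (the function name, box convention, and normalisation are ours).

```python
import numpy as np

def attention_feedback_loss(saliency: np.ndarray, box: tuple) -> float:
    """Penalise saliency mass falling outside an annotated lesion box.

    saliency : non-negative (H, W) map produced by the network.
    box      : (row0, row1, col0, col1) lesion bounding box (half-open).

    Returns a value in [0, 1]: 0 when all saliency lies inside the box,
    1 when none of it does.
    """
    r0, r1, c0, c1 = box
    mask = np.zeros_like(saliency)
    mask[r0:r1, c0:c1] = 1.0
    total = saliency.sum()
    if total == 0:
        return 1.0  # no saliency anywhere: maximal penalty
    return float(1.0 - (saliency * mask).sum() / total)
```

Because the penalty is a differentiable function of the saliency map (the hard zero-saliency branch aside), it can be added to the classification loss and minimised jointly by back-propagation, which is the self-improvement mechanism described above.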
Our second approach implements a hard attention mechanism, specifically an extension of the Recurrent Attention Model (RAM) (Ba et al., 2014; Mnih et al., 2014; Sermanet et al., 2014; Ypsilantis and Montana, 2017). In contrast to CONAF, each image is processed in a finite number of sequential steps. At each step, only a portion of the image is used as input; each location is sampled from a probability distribution that leverages the knowledge acquired in the previous steps. The information accumulated along this path through the image culminates in the classification of the image. The classification score acts as a reward signal which, in turn, updates the probability distribution controlling the sequence of image locations to be visited. This results in more precise attention being paid to the most relevant parts of the image, i.e. the lungs. Our proposed architecture, RAMAF (Recurrent Attention Model with Attention Feedback), assigns a higher reward when the glimpses attended by the algorithm during training overlap with the correct lesion locations. This improves the rate of learning, yielding faster convergence and increased classification performance.
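The RAMAF reward shaping described above can be sketched as a classification reward plus an overlap bonus. The precise reward function is not given in this excerpt, so the following is a hypothetical illustration; the bonus weight and the rectangle-overlap test are our assumptions.

```python
def glimpse_reward(correct: bool, glimpses, lesion_box, bonus: float = 0.5) -> float:
    """Illustrative RAMAF-style episode reward: base reward for a correct
    classification, augmented when attended glimpses overlap the lesion.

    glimpses   : list of (row0, row1, col0, col1) boxes visited by the model.
    lesion_box : annotated (row0, row1, col0, col1) lesion location.
    """
    def overlaps(a, b):
        # Axis-aligned rectangles overlap iff they overlap on both axes.
        return a[0] < b[1] and b[0] < a[1] and a[2] < b[3] and b[2] < a[3]

    base = 1.0 if correct else 0.0
    if not glimpses or not correct:
        return base
    hit_fraction = sum(overlaps(g, lesion_box) for g in glimpses) / len(glimpses)
    return base + bonus * hit_fraction
```

Since the glimpse locations are sampled (not differentiable), a shaped reward like this would be fed into a policy-gradient update of the location distribution, which is how RAM-style models are typically trained.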
The article is structured as follows. In Section 2, we introduce the dataset used in our experiments and explain how the chest radiographs have been automatically labelled using a natural language processing system. The CONAF and RAMAF algorithms are presented in Sections 3.1 and 3.2, respectively. Their performance has been assessed and compared to a number of alternative architectures that use either weak labels or annotated images. In Section 4, we describe our experimental results supporting the hypothesis that leveraging a relatively small portion of manually annotated lesions, in addition to a large sample of weakly-annotated training examples, can drastically enhance the classification performance.
Section snippets
A repository of chest radiographs
For this study, we obtained a dataset consisting of 745,479 chest x-ray exams collected from the historical archives of Guy’s and St. Thomas’ NHS Foundation Trust in London from January 2005 to March 2016. For each exam, the free-text radiologist report was extracted from the RIS (Radiology Information System). For a subset of 634,781 exams, we were also able to obtain the DICOM files containing pixel data. All paediatric exams (≤ 16 years of age) were removed from the dataset resulting in a
Convolution networks with attention feedback (CONAF)
In this section we set out our proposal of an image classifier based on deep convolutional neural networks. Our aim is to detect chest radiographs that are likely to contain one or more lesions. Although the localisation of the lesions within an image is not our primary interest, this information can be extracted from a trained network to generate saliency maps, i.e. heatmaps indicating where the lesions are more likely to be located within the original x-ray. Our proposed architecture exploits
Further implementation details
In this section we provide additional implementation details. The CONAF loss function weights were chosen as the values yielding optimal performance on the validation set. Training was done using back-propagation with adadelta (Zeiler, 2012), mini-batches of 32 images and a learning rate of 0.03. During training we fed the network two types of mini-batches: one composed only of images associated with weak labels, and the other composed of images with
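The two-stream mini-batch regime can be sketched as a simple interleaving schedule. The mixing ratio and sampling policy below are illustrative assumptions, not the paper's exact schedule (which is not specified in this excerpt).

```python
import random

def mixed_batches(weak, annotated, batch_size=32, annotated_every=4):
    """Yield ("weak", batch) mini-batches from the weakly-labelled pool,
    interleaving an ("annotated", batch) mini-batch of box-annotated images
    every `annotated_every` steps. Illustrative schedule only."""
    step = 0
    for start in range(0, len(weak), batch_size):
        yield ("weak", weak[start:start + batch_size])
        step += 1
        if annotated and step % annotated_every == 0:
            # The annotated pool is small, so sample with replacement.
            yield ("annotated", random.choices(annotated, k=batch_size))
```

In training, the classification loss would be applied to both batch types, while the attention-feedback term is computed only on the annotated batches, where lesion positions are known.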
Discussion and conclusions
Whereas other imaging modalities for cancer detection (e.g. mammograms, and the breast screening programme more widely) are routinely double-read, with an associated improvement in detection sensitivity (Anderson et al., 1994), the same is not feasible for chest radiographs owing to the sheer volume of scans (40% of the 3.6 billion medical images acquired annually are chest radiographs) and a lack of resources. Machine learning systems powered by deep learning algorithms offer a mechanism to
Acknowledgments
The authors acknowledge the support from the Department of Health via the National Institute for Health Research Comprehensive Biomedical Research Centre award to Guy’s & St Thomas’ NHS Foundation Trust in partnership with King’s College London and King’s College Hospital NHS Foundation Trust; and from the King’s College London/University College London Comprehensive Cancer Imaging Centre funded by Cancer Research UK and Engineering and Physical Sciences Research Council in association with the
References (72)
- et al., The efficacy of double reading mammograms in breast screening, Clin. Radiol., 1994.
- et al., Cancer incidence and mortality patterns in Europe: estimates for 40 countries in 2012, Eur. J. Cancer, 2013.
- et al., Sensitivity and specificity of a CAD solution for lung nodule detection on chest radiograph with CTA correlation, J. Digit. Imag., 2011.
- et al., Faster R-CNN: towards real-time object detection with region proposal networks, 2015.
- American Cancer Society, 1999. Key statistics for lung cancer.
- et al., Multiple object recognition with visual attention, 2014.
- et al., Natural Language Processing with Python, 2009.
- Error and discrepancy in radiology: inevitable or avoidable?, Insights Imag., 2017.
- Bush, I., 2016. Lung nodule detection and classification.
- Cancer Research UK, 2014. Lung cancer statistics.
- Capturing top-down visual attention with feedback convolutional neural networks, in Proceedings of the IEEE International Conference on Computer Vision.
- Human pose estimation with iterative error feedback, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Extending the NegEx lexicon for multiple languages, Stud. Health Technol. Inform.
- How much data is needed to train a medical image deep learning system to achieve necessary high accuracy?
- Resolving human object recognition in space and time, Nat. Neurosci.
- Modelling radiological language with bidirectional long short-term memory networks, LOUHI.
- A deep multi-level network for saliency prediction.
- Universal Stanford dependencies: a cross-linguistic typology, LREC.
- Learning where to attend with deep architectures for image tracking.
- Diagnostic concordance among pathologists interpreting breast biopsy specimens, JAMA.
- Variability in radiologists’ interpretations of mammograms, N. Engl. J. Med.
- Scalable object detection using deep neural networks, in Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition.
- The PASCAL Visual Object Classes (VOC) challenge, Int. J. Comput. Vis.
- Radiologic errors in patients with lung cancer, Western J. Med.
- Who said what: modeling individual labelers improves classification.
- Fleischner Society: glossary of terms for thoracic imaging, Radiology.
- Long short-term memory, Neural Comput.
- Bottom-up and top-down reasoning with hierarchical rectified Gaussians, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Adam: a method for stochastic optimization.
- RadLex: a new method for indexing online educational materials, Radiographics.
- Learning to combine foveal glimpses with a third-order Boltzmann machine, NIPS.
- Microsoft COCO: common objects in context.
- The Stanford CoreNLP natural language processing toolkit, in Association for Computational Linguistics (ACL) System Demonstrations.
- Approximate gradient methods in policy-space optimization of Markov reward processes, Discret. Event Dyn. Syst.
- Stacked convolutional auto-encoders for hierarchical feature extraction, ICANN.
1. Present address: Warwick Manufacturing Group, University of Warwick, Coventry, UK.