
Medical Image Analysis

Volume 53, April 2019, Pages 26–38

Learning to detect chest radiographs containing pulmonary lesions using visual attention networks

https://doi.org/10.1016/j.media.2018.12.007

Highlights

  • We propose two novel neural network architectures that use visual attention mechanisms to detect pulmonary lesions in chest x-ray images.

  • The architectures are designed to learn from a large number of weakly-labelled images (class labels only) and a small number of annotated images (bounding boxes).

  • The algorithms are trained and tested using a large dataset of over 700k chest x-ray exams extracted from two hospitals’ archives and processed using a novel NLP pipeline.

  • Results are validated on 6,131 exams manually labelled by two independent radiologists.

  • Our experimental results are in line with or better than the performance currently reported in other studies.

Abstract

Machine learning approaches hold great potential for the automated detection of lung nodules on chest radiographs, but training such algorithms requires very large amounts of manually annotated radiographs, which are difficult to obtain. The increasing availability of PACS (Picture Archiving and Communication System) is laying the technological foundations needed to make large volumes of clinical data and images from hospital archives available. Binary labels indicating whether a radiograph contains a pulmonary lesion can be extracted at scale using natural language processing algorithms. In this study, we propose two novel neural networks for the detection of chest radiographs containing pulmonary lesions. Both architectures make use of a large number of weakly-labelled images combined with a smaller number of manually annotated x-rays. The annotated lesions are used during training to deliver a type of visual attention feedback informing the networks about their lesion localisation performance. The first architecture extracts saliency maps from high-level convolutional layers and compares the inferred position of a lesion against the true position when this information is available; a localisation error is then back-propagated along with the softmax classification error. The second approach consists of a recurrent attention model that learns to observe a short sequence of smaller image portions through reinforcement learning; the reward function penalises the exploration of areas within an image that are unlikely to contain nodules. Using a repository of over 430,000 historical chest radiographs, we present and discuss the proposed methods and compare their performance against related architectures that use either weakly-labelled or annotated images only.

Introduction

Lung cancer is the most common cancer worldwide and the second most common cancer in Europe and the USA (Ferlay et al., 2013; American Cancer Society). Due to delays in diagnosis, it is typically discovered at an advanced stage, with a very low survival rate (Cancer Research UK, 2014). The chest radiograph is the most commonly performed radiological investigation in the initial assessment of suspected lung cancer because it is inexpensive and delivers a low radiation dose. On a chest radiograph, a nodule is defined as a rounded opacity ≤ 3 cm, which can be well- or poorly-marginated; lesions ≥ 3 cm do not typically pose a diagnostic challenge (Hansell et al., 2008). However, detecting small pulmonary nodules on plain film is challenging, despite the high spatial resolution, because an x-ray is a single projection of the entire 3D volume of the thorax. The planar nature of radiograph acquisition means that thoracic structures are superimposed; thus, the heart, diaphragm and mediastinum may obscure a large portion of the lungs. Patients may also have several co-existing pathologies visible on each radiograph. Many benign lesions can mimic pathology due to composite shadowing; furthermore, a nodule can be very small or have ill-defined margins. Studies have shown that in up to 40% of new lung cancer diagnoses, the lesion was present on a previous plain film but was missed, only being picked up in hindsight (Forrest and Friedman, 1981; Quekel et al., 1999).

Computer-aided detection (CAD) systems using machine learning techniques can facilitate the automated detection of lung nodules and provide a cost-effective second-opinion reporting mechanism. The reported performance of these CAD systems varies substantially depending on the size and nature of the samples. For instance, sensitivity rates reported in the literature for lesions larger than 5 mm vary from 51% to 71% (Moore et al., 2011; Szucs-Farkas et al., 2013). Currently, state-of-the-art results for automated object detection in images are obtained by deep convolutional neural networks (DCNNs). During training, these methods require a large number of manually annotated images in which the contours of each object are identified or which, at the very least, have a bounding box indicating each object’s location within the image. The large majority of these methods use regression models to predict the coordinates of the bounding boxes (Erhan et al., 2014; Szegedy et al., 2013) or, alternatively, make use of sliding windows (Ren et al., 2015; Sermanet et al., 2014). Most documented studies rely on large datasets of natural images (Everingham et al., 2010; Lin et al., 2014) in which the objects to be detected are typically well-defined and sufficiently prominent within the context of the entire image. The applicability of these technologies in radiology has not been fully explored, partly due to the paucity of large databases of annotated medical images.

In recent years, the increasing availability of digital archiving and reporting systems, such as PACS (Picture Archiving and Communication System) and RIS (Radiology Information System), has been laying the technological foundations needed to make available large volumes of clinical data and images from hospital archives (Cho et al., 2015; Cornegruta et al., 2016). In this study, our aim is to leverage a large number of radiological exams extracted from a hospital’s data archives to explore the feasibility of deep learning for lung nodule detection. In particular, we assess the performance of a statistical classifier that discriminates between chest radiographs containing regions indicative of a pulmonary lesion and those that do not. Our first hypothesis is that, with a sufficiently large training database, a classifier based on deep convolutional networks can be trained to accomplish this task using only weak image labels. In order to address this hypothesis, we collected over 700,000 historical chest radiographs from two large teaching hospitals in London (UK). A natural language processing (NLP) system was developed to parse all free-text radiological reports and identify the exams containing pulmonary lesions. This is a challenging learning task, as a proportion of the automatically-extracted labels in the training dataset is expected to be erroneous or incomplete due to reporting errors or omissions (estimated to be at least 3–5% (Brady, 2017)), inter-reader variability (Elmore et al., 1994; Elmore et al., 2015) and potential NLP failures. The performance of the resulting image classifier was assessed using a manually curated, independent dataset of over 6,000 exams.
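As a concrete illustration of this labelling step, the sketch below shows a deliberately simplified, rule-based labeller in the spirit of the NegEx-style negation handling and dependency parsing tools cited in the references. It is not the pipeline used in this study; the keyword lists and the label_report function are hypothetical.

    import re

    # Hypothetical term lists; the study's pipeline used dependency parsing
    # (Stanford CoreNLP) and a NegEx-style lexicon rather than flat regexes.
    LESION_TERMS = re.compile(r"\b(nodule|mass|lesion|opacity)\b", re.IGNORECASE)
    NEGATION_CUES = re.compile(r"\b(no|without|free of)\b", re.IGNORECASE)

    def label_report(report: str) -> int:
        """Return 1 if the report likely mentions a pulmonary lesion, else 0."""
        for sentence in re.split(r"[.;\n]", report):
            match = LESION_TERMS.search(sentence)
            # A sentence votes positive only if the lesion term is not
            # preceded by a negation cue within the same sentence.
            if match and not NEGATION_CUES.search(sentence[:match.start()]):
                return 1
        return 0

    print(label_report("There is a 12 mm nodule in the right upper lobe."))  # 1
    print(label_report("No focal lesion. The lungs are clear."))             # 0

A real labeller must also handle uncertainty ("possible nodule"), anatomical qualifiers and report boilerplate, which is precisely why a dedicated NLP system was built for this study.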

Our second and main hypothesis is that significant classification improvements can be obtained by augmenting the weak and potentially noisy labels with bounding boxes indicating the exact location of any lesions in a subset of the training exams. Manual annotation simply does not scale to the size of currently available historical datasets; realistically, only a fraction of all the exams can be reviewed and annotated. It is therefore of interest to design a classifier that leverages both weakly-labelled and annotated images. To investigate this hypothesis, approximately 9% of the radiographs presenting lesions were randomly selected and reviewed by a radiologist, who manually delineated the bounding boxes. This annotation process resulted in over 3,000 lesion examples.

We present two different learning strategies that leverage both the weak labels and the annotations of lesions. Our guiding principle was that, when the position of a lesion is known during training, it can be exploited to provide the network with visual feedback informing on the quality of the features learned by the convolutional filters. As such, both strategies introduce attention mechanisms within the classifier in order to learn improved imaging representations. Our first approach exploits a soft attention mechanism. Using weakly-labelled images, a convolutional network learns imaging features by minimising the classification error and generates saliency maps highlighting the parts of an image that are likely to contain a lesion. Using the subset of annotated images, a composite loss function is employed to penalise the discrepancy between the network’s implied position of a lesion, provided by the saliency map during training, and the real position of the lesion. A large loss indicates that the network’s current representation does not accurately capture the lesion’s visual patterns, and provides an additional mechanism for self-improvement through back-propagation. The resulting architecture, a convolutional neural network with attention feedback (CONAF), features improved localisation capability, which, in turn, boosts classification performance.
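Although the exact CONAF loss is defined in Section 3.1 (not reproduced in these snippets), its structure can be sketched as follows: a softmax classification term computed for every image, plus a localisation term, active only for annotated images, that penalises disagreement between the saliency map and a binary mask rasterised from the bounding boxes. The mean-squared localisation penalty and the tensor shapes below are illustrative assumptions; only the weighting constants λ1 and λ2 correspond to values reported in the implementation details.

    import torch
    import torch.nn.functional as F

    def conaf_loss(logits, labels, saliency, bbox_mask, annotated,
                   lam1=10.0, lam2=0.1):
        """Composite classification + attention-feedback loss (sketch).

        logits:    (B, 2) class scores from the network head
        labels:    (B,) weak labels, 0 = no lesion, 1 = lesion
        saliency:  (B, H, W) saliency maps from a high-level conv layer
        bbox_mask: (B, H, W) binary masks derived from bounding boxes
        annotated: (B,) bool, True where a bounding box is available
        """
        cls_loss = F.cross_entropy(logits, labels)
        # Localisation feedback only for the annotated subset of the batch;
        # the MSE form of this term is an assumption for illustration.
        if annotated.any():
            loc_loss = F.mse_loss(saliency[annotated], bbox_mask[annotated])
        else:
            loc_loss = torch.zeros((), device=logits.device)
        return lam1 * cls_loss + lam2 * loc_loss

A large localisation term signals that the convolutional features do not yet capture the lesion’s visual pattern, and its gradient propagates back through the same filters used for classification.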

Our second approach implements a hard attention mechanism, specifically an extension of the Recurrent Attention Model (RAM) (Ba et al., 2014; Mnih et al., 2014; Sermanet et al., 2014; Ypsilantis and Montana, 2017). In contrast to CONAF, each image is processed in a finite number of sequential steps. At each step, only a portion of the image is used as input; each location is sampled from a probability distribution that leverages the knowledge acquired in the previous steps. The information accumulated along this random path through the image culminates in the classification of the image. The classification score acts as a reward signal which, in turn, updates the probability distribution controlling the sequence of image locations to be visited. This results in more precise attention being paid to the most relevant parts of the image, i.e. the lungs. Our proposed architecture, RAMAF (Recurrent Attention Model with Attention Feedback), awards a higher reward when the glimpses attended by the algorithm during training overlap with the correct lesion locations. This improves the rate of learning, yielding faster convergence and increased classification performance.
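The reward modification that distinguishes RAMAF from a standard RAM can be sketched in a few lines: the usual terminal reward for a correct classification is augmented with a bonus whenever a glimpse overlaps an annotated lesion box. The overlap test and the bonus value below are assumptions for illustration; the paper’s exact reward function is given in Section 3.2.

    def ramaf_reward(correct, glimpse_centres, glimpse_size,
                     lesion_boxes, bonus=0.1):
        """Terminal classification reward plus a per-glimpse overlap bonus."""
        reward = 1.0 if correct else 0.0
        half = glimpse_size / 2.0
        for gx, gy in glimpse_centres:
            for x0, y0, x1, y1 in lesion_boxes:
                # The square glimpse overlaps the box iff the two
                # intervals intersect on both axes.
                if (gx + half >= x0 and gx - half <= x1 and
                        gy + half >= y0 and gy - half <= y1):
                    reward += bonus
                    break  # count each glimpse at most once
        return reward

    # Correct classification; the first of two glimpses hits the lesion.
    print(ramaf_reward(True, [(100, 120), (300, 310)], 64,
                       [(90, 100, 150, 160)]))  # 1.1

Because the bonus enters the REINFORCE reward rather than a differentiable loss, it steers the location policy towards lesion-bearing regions during training without requiring annotations at test time.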

The article is structured as follows. In Section 2, we introduce the dataset used in our experiments and explain how the chest radiographs were automatically labelled using a natural language processing system. The CONAF and RAMAF algorithms are presented in Sections 3.1 and 3.2, respectively; their performance is assessed and compared against a number of alternative architectures that use either weak labels or annotated images only. In Section 4, we describe our experimental results, which support the hypothesis that leveraging a relatively small number of manually annotated lesions, in addition to a large sample of weakly-labelled training examples, can drastically enhance classification performance.

Section snippets

A repository of chest radiographs

For this study, we obtained a dataset consisting of 745,479 chest x-ray exams collected from the historical archives of Guy’s and St. Thomas’ NHS Foundation Trust in London from January 2005 to March 2016. For each exam, the free-text radiologist report was extracted from the RIS (Radiology Information System). For a subset of 634,781 exams, we were also able to obtain the DICOM files containing pixel data. All paediatric exams (≤ 16 years of age) were removed from the dataset, resulting in a

Convolution networks with attention feedback (CONAF)

In this section we set out our proposal of an image classifier based on deep convolutional neural networks. Our aim is to detect chest radiographs that are likely to contain one or more lesions. Although the localisation of the lesions within an image is not our primary interest, this information can be extracted from a trained network to generate saliency maps, i.e. heatmaps indicating where the lesions are more likely to be located within the original x-ray. Our proposed architecture exploits

Further implementation details

In this section we provide additional implementation details. The CONAF loss function was fully specified using λ1 = 10 and λ2 = 0.1, as these parameters yielded optimal performance on the validation set. Training was done using back-propagation with Adadelta (Zeiler, 2012), mini-batches of 32 images and a learning rate of 0.03. During training, we fed the network two types of mini-batches: one composed only of images associated with weak labels and the other composed of images with

Discussion and conclusions

Whereas other imaging modalities for cancer detection (e.g. mammograms and the breast screening programme more widely) are routinely double-read, with an associated improvement in detection sensitivity (Anderson et al., 1994), the same is not feasible with chest radiographs, owing to a lack of resources and the sheer volume of scans: 40% of the 3.6 billion medical images acquired annually are chest radiographs. Machine learning systems powered by deep learning algorithms offer a mechanism to

Acknowledgments

The authors acknowledge the support from the Department of Health via the National Institute for Health Research Comprehensive Biomedical Research Centre award to Guy’s & St Thomas’ NHS Foundation Trust in partnership with King’s College London and King’s College Hospital NHS Foundation Trust; and from the King’s College London/University College London Comprehensive Cancer Imaging Centre funded by Cancer Research UK and Engineering and Physical Sciences Research Council in association with the

References (72)

  • C. Cao et al.

    Capturing top-down visual attention with feedback convolutional neural networks

Proceedings of the IEEE International Conference on Computer Vision

    (2015)
  • J. Carreira et al.

    Human pose estimation with iterative error feedback

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2016)
  • W.W. Chapman et al.

Extending the NegEx lexicon for multiple languages

    Stud. Health Technol. Inform.

    (2013)
  • J. Cho et al.

    How much data is needed to train a medical image deep learning system to achieve necessary high accuracy?

    (2015)
  • R.M. Cichy et al.

    Resolving human object recognition in space and time

    Nat. Neurosci.

    (2014)
  • S. Cornegruta et al.

    Modelling radiological language with bidirectional long short-term memory networks

    LOUHI

    (2016)
  • M. Cornia et al.

A deep multi-level network for saliency prediction

    (2016)
  • M.-C. De Marneffe et al.

Universal Stanford dependencies: a cross-linguistic typology

    LREC

    (2014)
  • M. Denil et al.

Learning where to attend with deep architectures for image tracking

    (2011)
  • J.G. Elmore et al.

    Diagnostic concordance among pathologists interpreting breast biopsy specimens

    JAMA

    (2015)
  • J.G. Elmore et al.

    Variability in radiologists’ interpretations of mammograms

    N. Engl. J. Med.

    (1994)
  • D. Erhan et al.

    Scalable object detection using deep neural networks

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2014)
  • M. Everingham et al.

The PASCAL Visual Object Classes (VOC) challenge

    Int. J. Comput. Vis.

    (2010)
  • J.V. Forrest et al.

    Radiologic errors in patients with lung cancer

    Western J. Med.

    (1981)
  • M.Y. Guan et al.

Who said what: modeling individual labelers improves classification

    (2017)
  • D.M. Hansell et al.

Fleischner Society: glossary of terms for thoracic imaging

    Radiology

    (2008)
  • S. Hochreiter et al.

    Long short-term memory

    Neural Comput.

    (1997)
  • P. Hu et al.

Bottom-up and top-down reasoning with hierarchical rectified Gaussians

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2016)
  • D.P. Kingma et al.

Adam: a method for stochastic optimization

    (2014)
  • C.P. Langlotz

RadLex: a new method for indexing online educational materials

    Radiographics

    (2006)
  • H. Larochelle et al.

Learning to combine foveal glimpses with a third-order Boltzmann machine

    NIPS

    (2010)
  • T.-Y. Lin et al.

Microsoft COCO: common objects in context

    (2014)
  • C. Manning et al.

The Stanford CoreNLP natural language processing toolkit

Association for Computational Linguistics (ACL) System Demonstrations

    (2014)
  • P. Marbach et al.

Approximate gradient methods in policy-space optimization of Markov reward processes

    Discret. Event Dyn. Syst.

    (2003)
  • J. Masci et al.

    Stacked convolutional auto-encoders for hierarchical feature extraction

    ICANN

    (2011)
    ¹ Present: Warwick Manufacturing Group, University of Warwick, Coventry, UK
