research-article

Open Access

Hidden stratification causes clinically meaningful failures in machine learning for medical imaging

Authors:
Luke Oakden-Rayner, Australian Institute for Machine Learning, University of Adelaide
Jared Dunnmon, Department of Computer Science, Stanford University
Gustavo Carneiro, Australian Institute for Machine Learning, University of Adelaide
Christopher Ré, Department of Computer Science, Stanford University

CHIL '20: Proceedings of the ACM Conference on Health, Inference, and Learning, April 2020, Pages 151–159. https://doi.org/10.1145/3368555.3384468

Published: 02 April 2020

ABSTRACT

Machine learning models for medical image analysis often suffer from poor performance on important subsets of a population that are not identified during training or testing. For example, overall performance of a cancer detection model may be high, but the model may still consistently miss a rare but aggressive cancer subtype. We refer to this problem as hidden stratification, and observe that it results from incompletely describing the meaningful variation in a dataset. While hidden stratification can substantially reduce the clinical efficacy of machine learning models, its effects remain difficult to measure. In this work, we assess the utility of several possible techniques for measuring hidden stratification effects, and characterize these effects both via synthetic experiments on the CIFAR-100 benchmark dataset and on multiple real-world medical imaging datasets. Using these measurement techniques, we find evidence that hidden stratification can occur in unidentified imaging subsets with low prevalence, low label quality, subtle distinguishing features, or spurious correlates, and that it can result in relative performance differences of over 20% on clinically important subsets. Finally, we discuss the clinical implications of our findings, and suggest that evaluation of hidden stratification should be a critical component of any machine learning deployment in medical imaging.
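The effect described in the abstract can be illustrated with a minimal sketch that compares overall recall against recall on each subclass of the positive cases. This is not the authors' code: the function names, the subclass labels, and the toy counts below are all hypothetical and chosen only to show how a rare subclass can underperform while the aggregate metric stays high.

```python
from collections import defaultdict


def recall_by_subclass(y_true, y_pred, subclass):
    """Return overall recall and per-subclass recall over the positive cases."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred, sub in zip(y_true, y_pred, subclass):
        if truth != 1:
            continue  # recall only considers true positives
        totals[sub] += 1
        hits[sub] += int(pred == 1)
    overall = sum(hits.values()) / sum(totals.values())
    per_subclass = {s: hits[s] / totals[s] for s in totals}
    return overall, per_subclass


def relative_difference(overall, subset):
    """Relative drop of a subset metric with respect to the overall metric."""
    return (overall - subset) / overall


# Hypothetical toy data (not from the paper): 90 "common" positives,
# 10 "rare" positives, and 50 negatives.
y_true = [1] * 100 + [0] * 50
subclass = ["common"] * 90 + ["rare"] * 10 + ["none"] * 50
# The illustrative model finds 85 of 90 common cases but only 5 of 10 rare ones.
y_pred = [1] * 85 + [0] * 5 + [1] * 5 + [0] * 5 + [0] * 50

overall, per_subclass = recall_by_subclass(y_true, y_pred, subclass)
print(f"overall recall: {overall:.2f}")
for name, value in per_subclass.items():
    drop = relative_difference(overall, value)
    print(f"{name}: recall {value:.2f}, relative difference {drop:.1%}")
```

In this toy setting the rare subclass makes up only a tenth of the positives, so its poor recall barely moves the overall figure. That is why a subclass-level breakdown is needed to see the problem at all.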

Index Terms

Computing methodologies → Machine learning

Published in
CHIL '20: Proceedings of the ACM Conference on Health, Inference, and Learning
April 2020
265 pages
ISBN:9781450370462
DOI:10.1145/3368555
General Chair:
Marzyeh Ghassemi
University of Toronto and the Vector Institute
Copyright © 2020 Owner/Author
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike International 4.0 License.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
convolutional neural networks
hidden stratification
machine learning
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 27 of 110 submissions, 25%
Article Metrics
- Total Citations: 146
- Total Downloads: 2,861
- Downloads (Last 12 months): 672
- Downloads (Last 6 weeks): 77