  • Perspective
Learning from data with structured missingness

Abstract

Missing data are an unavoidable complication in many machine learning tasks. When data are ‘missing at random’, a range of tools and techniques exists to deal with the issue. However, as machine learning studies become more ambitious and seek to learn from ever-larger volumes of heterogeneous data, a problem is increasingly encountered in which missing values exhibit an association or structure, either explicitly or implicitly. Such ‘structured missingness’ raises a range of challenges that have not yet been systematically addressed, and presents a fundamental hindrance to machine learning at scale. Here we outline the current literature and propose a set of grand challenges in learning from data with structured missingness.
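To make the idea of missing values that 'exhibit an association or structure' concrete, the following minimal sketch builds a toy table in which two variables are always missing together and measures how strongly the missingness indicators of each pair of columns are correlated. The column names, the toy records and the helper functions (`missingness_mask`, `phi`, `co_missingness`) are hypothetical illustrations, not a method taken from the article; correlation between indicators is only one of many possible diagnostics.

```python
from itertools import combinations
from math import sqrt


def missingness_mask(rows):
    """Return a 0/1 matrix with 1 wherever a value is missing (None)."""
    return [[1 if value is None else 0 for value in row] for row in rows]


def phi(x, y):
    """Pearson correlation between two binary indicator sequences.

    Returns 0.0 when either sequence is constant, since the
    correlation is undefined in that case.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def co_missingness(rows, names):
    """Correlate the missingness indicators of every pair of columns."""
    columns = list(zip(*missingness_mask(rows)))
    return {
        (names[i], names[j]): phi(columns[i], columns[j])
        for i, j in combinations(range(len(names)), 2)
    }


# Toy records (assumed for illustration): the two "gene" columns come from
# the same hypothetical assay, so they are either both present or both absent.
names = ["age", "gene_a", "gene_b"]
rows = [
    (54, 0.2, 1.1),
    (61, None, None),
    (47, 0.7, 0.9),
    (None, None, None),
    (38, 0.4, 1.3),
    (70, None, None),
]

result = co_missingness(rows, names)
for pair, value in result.items():
    print(pair, round(value, 3))
```

In this toy table the indicators for `gene_a` and `gene_b` are identical, so their correlation is 1. Indicators that were scattered independently across columns would instead be expected to show correlations near zero, which is why a pairwise view of the mask can hint at structure before any imputation is attempted.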

Fig. 1: The data missingness life cycle.
Fig. 2: Examples of SM.

Acknowledgements

This work was sponsored by the Turing-Roche Strategic Partnership. We thank C. Matus for her talents in figure illustrations and design and V. Hellon for her expert community management.

Author information

Corresponding authors

Correspondence to Robin Mitra, Sarah F. McGough, Chris Harbron or Ben D. MacArthur.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Subho Majumdar, Girmaw Abebe Tadesse and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Mitra, R., McGough, S.F., Chakraborti, T. et al. Learning from data with structured missingness. Nat Mach Intell 5, 13–23 (2023). https://doi.org/10.1038/s42256-022-00596-z

  • DOI: https://doi.org/10.1038/s42256-022-00596-z
