A wide range of data are critical to characterizing disease outbreaks and informing public health responses1. Pathogen genomic data have become essential to identify the causative agent of an infection, and they can also help track mutations and investigate transmission networks and the geographic spread of an infectious disease2. Clinical data are useful to understand disease severity, develop clinical case definitions, evaluate pharmaceutical interventions and monitor disease outcomes3. Serological data are important to characterize immunity, antibody responses and how these may relate to clinical outcomes4. Additionally, epidemiological data ranging from aggregated case counts5 to detailed contact-tracing data have been used extensively to characterize basic parameters such as the reproduction number, key time distributions (for example, from symptom onset to hospitalization) and heterogeneity in transmission6. Metadata associated with individual epidemiological cases can be of great importance to understand early disease dynamics and critical transitions from imported to locally acquired infections7. Furthermore, disease data that contain demographic information have been used extensively to understand population-level attack rates8. All of these data inform response actions designed to mitigate the consequences of a disease event.

Although many countries and jurisdictions collect detailed data during outbreaks, these data may not be shared openly owing to ethical, legal and privacy issues, political regulations and concerns, and/or computational limitations9. Computational frameworks for the rapid ingestion of such data are not widely available either. In addition, there are no standardized data formats that facilitate the open reporting of this information while ensuring compliance with data-privacy regulations (primarily around the risk of de-anonymization). This makes international comparisons of large, detailed outbreak datasets difficult and limits how effectively inferences from such data can inform the response to disease outbreaks. As a consequence, the opportunity is missed to have a single platform where all of the data, irrespective of type and region, can be shared easily and quickly among the scientific community, which could greatly accelerate research related to a disease outbreak.

During the COVID-19 pandemic, the Global.health initiative was established to create a global infrastructure for consolidating, standardizing and sharing individual-level epidemiological data across different geographic regions. Nevertheless, challenges related to data ingestion and curation persist, and addressing them is crucial to enable rapid analysis of open data during future outbreaks.

Data standardization

There are substantial challenges when ingesting epidemiological data from across the world, mainly owing to the diversity of formats in which data are collected, summarized and disseminated. Therefore, a critical step to facilitate data sharing among different countries and communities is to define a standard format that is common across all of the different health reporting systems. Because data can come from different sources, this standard format must accommodate multiple data streams, such as those from official government sources, news outlets and social media. In addition, the format must be pathogen agnostic, meaning that it should be extensible to multiple infectious diseases in order to facilitate adoption in future outbreaks.

In the Global.health initiative, a standardized format for epidemiological data10 was agreed on by different regional working groups after consultations with a number of public health agencies, academic research groups and health policy experts. The main goal during these consultations was to identify a format that could accommodate most use cases, enable rapid decision making and reduce the time needed to clean data. This included mapping data to a common geographic reference frame. International collaborations, as well as the sharing of testing protocols and local expertise, were essential to define a standard; these collaborations must continue in the future to ensure that the community has usable standards. Finally, the standardized format was developed with extensibility in mind, although it will need to be revised periodically to make sure that the resources in place can include pathogen-specific information from other infectious diseases.
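As a concrete illustration of what such a format might look like, the sketch below represents a pathogen-agnostic, individual-level case record in Python. The field and class names (GeoReference, CaseRecord, age_range and so on) are hypothetical and do not reproduce the actual Global.health schema; they only show how demographic, clinical and geographic information could be captured in one common structure.

```python
# A minimal, hypothetical sketch of a pathogen-agnostic case record.
# Field names are illustrative and do NOT reproduce the actual
# Global.health schema.
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class GeoReference:
    """Location mapped to a common geographic reference frame."""
    country_iso3: str                  # e.g. "BRA"
    admin1: Optional[str] = None       # first-level subdivision
    admin2: Optional[str] = None       # second-level subdivision
    latitude: Optional[float] = None   # centroid of the reported unit
    longitude: Optional[float] = None


@dataclass
class CaseRecord:
    """One individual-level epidemiological case in the standard format."""
    case_id: str
    pathogen: str                        # keeps the format pathogen agnostic
    case_status: str                     # e.g. "confirmed", "suspected"
    date_confirmation: Optional[date] = None
    date_symptom_onset: Optional[date] = None
    date_hospitalization: Optional[date] = None
    age_range: Optional[str] = None      # binned to reduce re-identification risk
    sex: Optional[str] = None
    symptoms: list[str] = field(default_factory=list)
    outcome: Optional[str] = None
    location: Optional[GeoReference] = None
    source_url: Optional[str] = None     # provenance of the report
```

In this sketch, ages are binned and locations are reported at the level of administrative units rather than exact coordinates, reflecting the kind of design choice that balances analytical usefulness against data-privacy concerns.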

There are still challenges to be solved. For example, symptom ontologies vary substantially across regions, and only limited information about country-specific triaging protocols was available. In addition, observational data are likely to suffer from substantial biases in how they are collected. Better ways to characterize these different biases in the standardized format will be particularly important in future outbreaks.

Data ingestion workflow

Entering data by hand becomes infeasible beyond roughly 50,000 cases and as an outbreak grows exponentially. Therefore, automated workflows are essential to allow rapid ingestion of data. These workflows require the combined expertise of data scientists and epidemiologists working alongside each other.

A series of programmatic approaches was developed in the context of Global.health to allow the ingestion of data from structured and unstructured sources. Importantly, these approaches can be applied to datasets ranging from PDF documents to standard application programming interfaces (APIs), and can be run both manually and in more automated ways. Designing more integrated workflows that enable rapid communication and exchange between public health agencies could greatly enhance the understanding of the data-collection process and its associated limitations.
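To make this concrete, the sketch below illustrates one way a source-specific parser might map a raw CSV feed onto the standard format, with a small registry so that new sources can be added without modifying the core pipeline. The source name, column names, date format and URL are hypothetical assumptions for the example; this is not the actual Global.health ingestion code.

```python
# Illustrative ingestion sketch: each source contributes a small parser that
# maps its raw records onto the standard format (represented here as plain
# dictionaries for brevity). All source-specific details are hypothetical.
import csv
from datetime import datetime
from typing import Callable, Iterable, Iterator, Optional


def _parse_date(value: Optional[str]) -> Optional[str]:
    """Normalize dates to ISO 8601; tolerate the empty strings common in raw feeds.
    The day/month/year format is an assumption about this particular source."""
    if not value:
        return None
    return datetime.strptime(value, "%d/%m/%Y").date().isoformat()


def parse_example_ministry_csv(rows: Iterable[dict]) -> Iterator[dict]:
    """Map one hypothetical ministry-of-health CSV layout to standard fields."""
    for row in rows:
        yield {
            "case_id": row["id"],
            "pathogen": "SARS-CoV-2",
            "case_status": "confirmed" if row["lab_result"] == "positive" else "suspected",
            "date_symptom_onset": _parse_date(row.get("onset")),
            "age_range": row.get("age_group"),
            "location": {"country_iso3": "XYZ", "admin1": row.get("province")},
            "source_url": "https://example.org/daily-report",  # provenance
        }


# A registry lets new sources be added without changing the core pipeline.
PARSERS: dict[str, Callable[[Iterable[dict]], Iterator[dict]]] = {
    "example_ministry": parse_example_ministry_csv,
}


def ingest(source_name: str, path: str) -> list[dict]:
    """Read a raw file and return records in the standard format."""
    with open(path, newline="", encoding="utf-8") as fh:
        return list(PARSERS[source_name](csv.DictReader(fh)))
```

The registry pattern is one simple way to keep the mapping logic for each data stream isolated, so that curators and data scientists can add or repair a single source without touching the rest of the workflow.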

Data quality

Data quality may vary substantially across different data streams, especially early in an outbreak, when data are often less structured. Moreover, biases may only become apparent after the data have been collected and analyzed. Therefore, although researchers and decision makers need to perform their own assessments of whether specific data can support their study findings, it is imperative to have a data validation process in place during ingestion.

For future outbreaks, it will be important to implement validation workflows in which human curators review data streams on a daily basis. In addition, the data must indicate whether an entry has been verified by a curator, to better guide users. Scalability can be a challenge here: as the data grow, manual validation can become infeasible. Nevertheless, a decentralized model, in which volunteers and team members across different regions perform data validation, can help alleviate this challenge, as shown by the COVID Tracking Project. More automated workflows for validation, such as anomaly detection algorithms, can also be implemented to assist human curators.
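As one illustration of such automated assistance, the sketch below flags daily counts that deviate strongly from a trailing window so that a curator can review them. The window length and threshold are arbitrary choices made for the example and are not validation rules used in practice.

```python
# A minimal anomaly-flagging sketch to assist (not replace) human curators,
# assuming daily case counts from a single source. Window and threshold are
# illustrative only.
from statistics import mean, stdev


def flag_anomalous_days(daily_counts: list[int], window: int = 7,
                        z_threshold: float = 3.0) -> list[int]:
    """Return indices of days whose count deviates strongly from the
    trailing window, for manual review by a curator."""
    flagged = []
    for i in range(window, len(daily_counts)):
        history = daily_counts[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            continue  # flat history: cannot standardize the deviation
        z = (daily_counts[i] - mu) / sigma
        if abs(z) > z_threshold:
            flagged.append(i)
    return flagged


# Example: a sudden backlog dump on the last day is flagged for review.
counts = [120, 130, 125, 118, 140, 135, 128, 122, 131, 950]
print(flag_anomalous_days(counts))  # -> [9]
```

Flags of this kind only prioritize entries for human attention; a reporting backlog, a definition change or a genuine surge all look similar to a simple detector, so the final judgement remains with the curator.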

Even though we believe that a decentralized model will be the most effective way to build a trusted community, we acknowledge the need for stable funding and a well-organized, transparent informatics and administrative center, especially at the start of a pandemic, when data are uncoordinated and often disparate. Flexible approaches will be needed to accommodate these data, especially as every outbreak is different. Furthermore, because all infectious disease outbreaks carry the risk of becoming global, a globally comprehensive database will be helpful in guiding coordinated responses.

Data integration

Key data collected during a public health investigation for infectious disease management are not limited to demographic and clinical information. Other important data include serological data, pathogen genomic data and non-epidemiological spatial data that help characterize drivers of transmission (usually socio-economic, demographic and environmental factors). The underlying data-generation process (the resolution, the study population and so on) varies across these different types, and integrating them sensibly is a major opportunity for improving infectious disease research.

Remarkable innovations have made such data available at a global scale and at fine spatial resolution11. To support research and inference on disease dynamics that pair epidemiological data with spatial data, the common geographic reference frame defined in the standardization process will help to merge data across data types. In addition, computational platforms that make it easy to pair spatial data (for example, population density) with epidemiological data from a given geographic region will enable researchers with little experience in spatial data processing to perform complex, integrated data analyses.
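The sketch below illustrates how such a pairing might look in practice, assuming that both tables already use the common geographic reference frame (here a hypothetical shared administrative code). The column names, values and the per-100,000 attack-rate calculation are illustrative only.

```python
# Illustrative pairing of epidemiological counts with spatial covariates,
# assuming both tables share the same geographic reference frame
# (a hypothetical "admin1_code" key).
import pandas as pd

# Aggregated case counts by first-level administrative unit.
cases = pd.DataFrame({
    "admin1_code": ["XYZ-01", "XYZ-02", "XYZ-03"],
    "case_count": [1250, 430, 78],
})

# Spatial covariates on the same reference frame (e.g. from a gridded
# population product aggregated to administrative units).
covariates = pd.DataFrame({
    "admin1_code": ["XYZ-01", "XYZ-02", "XYZ-03"],
    "population": [2_100_000, 850_000, 120_000],
    "pop_density_per_km2": [310.5, 95.2, 12.8],
})

# Because both tables share the reference frame, integration is a plain join.
merged = cases.merge(covariates, on="admin1_code", how="left")
merged["attack_rate_per_100k"] = 1e5 * merged["case_count"] / merged["population"]
print(merged)
```

The point of the common reference frame is precisely that this integration reduces to a simple join; without it, researchers must first reconcile place names, administrative boundaries and spatial resolutions before any analysis can begin.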

Outlook

The collection, integration and dissemination of timely, high-resolution global epidemic data can augment public health responses and inform policy in real time. Trust will be a key component enabling rapid data sharing: the misuse of data has been detrimental to sharing and has disincentivized open collaboration. We advocate for better principles around the terms of data use and for general principles of data sharing that include guidelines preventing data from being used to reinforce existing biases or to discriminate against specific populations based on gender, age or location. Furthermore, appropriate computational infrastructure will need to be developed so that the risk of re-identification is minimized and balanced against the potential impact on the health of the wider population.

It is also crucial that all of the data and code are open source to enable rapid integration across multiple research groups and governments in the future. A truly open platform can help users overcome existing geographical, organizational and societal barriers to information access, and enable greater public health empowerment and democratization.