1. Introduction

Data management and stewardship in scientific research are critical to accelerating knowledge discovery across domains. Recently, the scientific community has promoted the Findable, Accessible, Interoperable, and Reusable (FAIR) principles to make data from research activities broadly available. The FAIR principles outline how to make data and information easy to “discover, access, interoperate, and sensibly re-use, with proper citation”. This can be achieved in part by archiving data supporting the results of scientific research in public repositories for long-term preservation and discoverability. In addition, adopting data or metadata standards and reporting formats that specify preferred file formats and variable names will improve reusability. In most cases, community engagement and consensus are requisite to adopting these standards and guidelines, and to building cohesiveness among archived datasets.

Many current standards are targeted towards observational and experimental datasets (https://fairsharing.org/standards/). In contrast, guidelines on archival of model data are limited, but have proven to be extremely useful when available. For example, researchers involved in the World Climate Research Programme (WCRP) Coupled Model Intercomparison Project (CMIP) activities built consensus on requirements for their archived data to be traceable, reproducible, and usable for scientific purposes. The global climate model data are available for broader use outside of the CMIP network via the distributed data archive, Earth System Grid Federation (ESGF) (https://esgf.llnl.gov/). The scientific objectives of each CMIP project informed the design of the data archives and the standardization of the datasets, providing, for example, detailed documentation of experimental conditions, requested variables, data reference syntax and controlled vocabulary, general structure and format of the data, and file directory system organization (https://pcmdi.llnl.gov/CMIP6/Guide/dataUsers.html). These archives enabled much of the model-based research in the Intergovernmental Panel on Climate Change assessments, such as improving estimates of the carbon cycle.

Terrestrial models (alternatively known as land models) are a broad class of Earth science numerical models that simulate land dynamics and fluxes of energy, water, carbon, and nutrients. Terrestrial models can be coupled with global Earth system models and other regional-scale models, or run ‘offline’ at site, watershed, river basin, continental, or global scales. Terrestrial modeling datasets lack guidelines for public archiving, and have a unique set of attributes that make building consensus on standardized archiving protocols challenging. First, the data are very diverse since they are used to address a broad range of questions across different scientific domains spanning climate, hydrology, biogeochemistry, and ecology. Moreover, these models can be used at vastly different spatial and temporal scales to study ecosystem processes. For example, these models can be used to investigate the drivers of the terrestrial carbon sink at global scales, as well as to understand the fate of riverine chemistry at local to watershed scales. Finally, model data can have many components, including output files of various dimensions and resolutions (e.g., final raw outputs, spin-up output files, restart files, test data files, and higher level outputs corresponding to figures); a variety of metadata files (embedded within output files such as those in NetCDF formats or external to the data files); visualization files; model code; input files (e.g., model parameters, climate forcing data, surface data); scripts for model set-up and initialization; code to calculate and assign input parameters; and code for post-processing and visualization.

The terrestrial modeling community would benefit from a set of guidelines for curating model data for long-term archival. However, there is no current community consensus on answers to several important questions related to publishing model data, including 1) what model-related data are worth archiving, 2) how much storage space is needed and what are suitable repositories to host such data, and 3) what are best practices for curating the datasets and associated files (e.g., model code and pre- and post-processing scripts). Guidelines for curating modeling data for long-term public archival would enable their reuse for purposes such as spinning up new simulations, model synthesis and intercomparisons, comparisons of model predictions with observational data, and informing experimental designs that reduce prediction uncertainty.

Terrestrial models are used in U.S. Department of Energy (DOE) research to advance a robust, predictive understanding of climate impacts on ecosystem processes such as carbon cycle changes caused by warming, vegetation dynamics, or watershed responses to disturbances such as early snowmelt and droughts. The Environmental Systems Science Data Infrastructure for a Virtual Ecosystem (ESS-DIVE) is a data repository established to serve as the long-term steward of environmental research data sponsored by the DOE. ESS-DIVE stores heterogeneous data types (e.g., hydrological, biogeochemical, ecological, climate, remote sensing) generated by observational, experimental, and modeling activities and seeks to enable data discovery and reuse by partnering with the science community. Several terrestrial model datasets from DOE research are publicly available on ESS-DIVE and other archives such as the ESGF. In this study, the ESS-DIVE team worked collaboratively with a diverse set of modeling researchers across the DOE community to determine guidelines for long-term archival of terrestrial model data in public repositories.

The main objectives of this study were to (1) synthesize current practices and recommendations across the Earth Science modeling and data repository communities for archiving model data, (2) assess requirements for public archiving, synthesis, and utilization of a diverse selection of terrestrial model data, and (3) provide pragmatic recommendations about best practices for curating scientifically useful model datasets, including those associated with scientific publications, towards enabling reproducibility of modeling workflows and data reuse for purposes such as model results intercomparison and synthesis. Below we describe our review of previous approaches to storing model data and our recommendations on archiving terrestrial model data. Although the study was designed to inform the ESS-DIVE repository policies, the guidelines are broadly applicable to other model types and data archives given the diversity of terrestrial model data considered in this study. To our knowledge, this is the first study that provides recommendations for archiving different components of model data for scientific purposes. Such guidelines are necessary as publication of model datasets is expected to grow significantly as journals and funding sources expand their requirements, and needs special consideration due to the volume and complexity of data associated with typical simulations.

2. Methods

2.1 Review of existing model data archiving guidelines

First, we researched the capabilities of existing data systems that support archiving of Earth science model data or other large datasets, including the ESGF, the National Aeronautics and Space Administration (NASA) Earth Observing System Data and Information System (EOSDIS) Distributed Active Archive Centers (DAACs) (https://earthdata.nasa.gov/eosdis/daacs), the National Center for Atmospheric Research (NCAR) Research Data Archive (RDA) (https://rda.ucar.edu/), the Earth Observing Laboratory (EOL) data archive (https://data.eol.ucar.edu/), and the National Science Foundation (NSF) Arctic Data Center (https://arcticdata.io/). We also reviewed the repositories Dryad (https://datadryad.org/), Zenodo (https://zenodo.org/), and ESS-DIVE, which accept large data files. The review considered current storage capacities and guidelines provided by these systems for contributing model-specific and other types of data.

Additionally, we reviewed existing guidelines for archiving model data from the NSF EarthCube Model Data Research Coordination Network (RCN) and the American Geophysical Union (AGU). The NSF EarthCube model data RCN (EarthCube-RCN) group has been researching and hosting workshops on best practices for geoscientific model data preservation and reproducibility (https://modeldatarcn.github.io/) and developed a rubric as a decision-support tool for researchers choosing how much of their simulation workflow output (raw outputs to post-processed outputs) to publicly archive in a FAIR-aligned data repository. We also reviewed journal-specific guidance on publishing modeling data on the AGU website (https://www.agu.org/Publish-with-AGU/Publish/Author-Resources/Data-for-Authors). The results from the review are summarized in Section 3.1.

2.2 Understanding community terrestrial model data archiving needs and perspectives

We determined needs for archiving, sharing, and utilizing archived data across a broad range of terrestrial models used in DOE research projects. For this study, we gathered input from 12 researchers who work across multiple DOE projects and institutions and use a diverse set of modeling codes to address a wide variety of science questions. We collected input using a form with a set of questions to determine 1) what types of models are currently used in DOE research, 2) what approximate data volumes and file types are generated for different simulations, 3) what components of model data are considered scientifically useful to archive, 4) how long archived model data remain useful for the scientific community, 5) how modelers currently archive their data, and 6) what features data repositories should support to enable storage and reuse of archived model data in the future (see supplementary information for the full list of questions). The questions regarding the value of archiving different model data components and the importance of different repository features used a rank measure on a five-point scale ranging from 1 (not important) to 5 (highly important).

We also conducted discussions with researchers who had published model data in five scientific publications to determine their workflows and priorities for archiving data.

2.3 Determining model data archival guidelines

We aggregated the input provided by the modelers by taking average scores for questions that had an importance rank, and by tabulating responses for the other questions. We additionally considered input from the follow-on discussions and reviewed the data and code availability statements in the 5 journal publications (Supplemental Table 1) to determine the range of data archiving practices across modelers and to determine practical challenges associated with publishing simulation data. We then drafted an initial version of the guidelines based on our review of other Earth science model archiving practices (Section 3.1) and the input from the modelers participating in this study. The guidelines were finalized with community consensus on what was practical to archive given potential uses of the data and capabilities of current repositories that accept model data.
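The aggregation of ranked responses described above can be illustrated with a minimal Python sketch. It assumes each question's responses are stored as a list of integer ranks from 1 to 5; the function name `summarize_ranks`, the component names, and the example ranks are hypothetical and are not the study's data.

```python
def summarize_ranks(responses, threshold=3):
    """Summarize five-point importance ranks for each surveyed component.

    responses: dict mapping a component name to a list of ranks (1 to 5).
    threshold: minimum rank counted as "important" (3 in this study).
    """
    summary = {}
    for component, ranks in responses.items():
        summary[component] = {
            # Average score across respondents, rounded to one decimal.
            "mean_importance": round(sum(ranks) / len(ranks), 1),
            # Number of respondents who ranked the component at or above the threshold.
            "n_important": sum(rank >= threshold for rank in ranks),
            "n_responses": len(ranks),
        }
    return summary


# Hypothetical ranks from four respondents, for illustration only.
example_responses = {
    "model_code": [5, 4, 2, 5],
    "spin_up_outputs": [2, 1, 3, 2],
}
summary = summarize_ranks(example_responses)
```

Reporting both the mean and the count at or above the threshold mirrors the two statistics quoted for each component in the results.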

3. Results

3.1 Synthesis of model data archiving capabilities and guidelines across data centers and organizations in the Earth science community

Table 1 summarizes properties of seven data centers used by the Earth science community that we reviewed in terms of their data publication storage limitations and the availability of guidelines for curating a model data publication or other archiving best practices. At the time this study was conducted, only the NSF Arctic Data Center (ADC) and NASA’s Oak Ridge National Laboratory Distributed Active Archive Center (ORNL-DAAC) for Biogeochemical Dynamics provided some guidance that could be used by data contributors to publish model-related data, code, or scripts.

Table 1

Summary of data centers, their storage limits for data publications, and the resources they offer data contributors on best practices for curating data packages, both model-specific and general.


| DATA CENTER | STORAGE LIMIT PER DATA PUBLICATION | PROVIDES DATA CONTRIBUTOR GUIDELINES: MODEL-DATA SPECIFIC? | PROVIDES DATA CONTRIBUTOR GUIDELINES: OTHER? |
|---|---|---|---|
| National Science Foundation Arctic Data Center | No limit | Yes | Yes |
| Oak Ridge National Laboratory DAAC | NA [1] | Yes | Yes |
| NASA’s Earth Observing System Data and Information System (EOSDIS) | NA [1] | NA [1] | Yes |
| U.S. DOE ESS-DIVE | 10 GB/500 GB [2] | No | Yes |
| Dryad | 300 GB [2] | No | Yes |
| Zenodo | 50 GB | No | No |
| Earth System Grid Federation (ESGF) | NA [1] | NA [1] | NA [1] |

[1] NA: Not available, i.e., no public information found.

[2] Limit on size of individual files. For ESS-DIVE, 10 GB is the default file size limit, and can be increased up to 500 GB by request. Files >500 GB are considered upon review.
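As a rough illustration of how a contributor might screen files against per-file limits like those noted for ESS-DIVE (10 GB by default, up to 500 GB by request, larger files upon review), the following Python sketch classifies file sizes. The function names, the category labels, and the use of decimal gigabytes are assumptions made for this example; repositories may define their limits differently.

```python
from pathlib import Path

GB = 10**9  # Assumption: decimal gigabytes; a repository may use 2**30 instead.


def classify_size(size_bytes, default_limit_gb=10, request_limit_gb=500):
    """Classify one file size against a default and a by-request size limit."""
    if size_bytes <= default_limit_gb * GB:
        return "within default limit"
    if size_bytes <= request_limit_gb * GB:
        return "needs limit increase request"
    return "needs repository review"


def check_directory(root, **limits):
    """Map every file under root to its size classification."""
    return {
        str(path): classify_size(path.stat().st_size, **limits)
        for path in Path(root).rglob("*")
        if path.is_file()
    }
```

Calling `check_directory` on a simulation folder (any path on the local system) gives a quick view of which files would need a limit increase before upload.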

The ADC provides guidelines on metadata associated with software (including models); files to include for models and scripts; file organization and formats; and considerations for archiving large datasets including model output data (https://arcticdata.io/submit/). The ORNL-DAAC provides guidelines for submission of model code or scripts and recommends including model code, documentation specifying the model name and version, model process representation and, as appropriate, a description of model lineage, and sample input and output (https://daac.ornl.gov/submit/). Its guidelines specify acceptable file formats, including common model output and input file formats (e.g., NetCDF, HDF5, GeoTIFF, shapefile, CSV), and suggest including files necessary “to represent a complete, and reproducible, body of work”. The ORNL-DAAC also provides general guidelines on data and file organization; file-level metadata; file formats and naming; and the types of files expected in a data publication, such as data files, supplemental files (including photos, reports, or metadata), documentation, code (if applicable), and the published paper or manuscript draft (if applicable; https://daac.ornl.gov/datamanagement/#best_practices). The ADC and ORNL-DAAC do not explicitly describe which components of model data files should be archived, such as model inputs, testing data, outputs, model code, and scripts.

Of the other data systems, the EOSDIS has standards and templates, specifies file formats (netCDF/HDF5), and provides a curation service for data publication based on the user’s service level. Dryad (https://datadryad.org/stash/best_practices) and ESS-DIVE (https://docs.ess-dive.lbl.gov/contributing-data/data-submission-guidelines) have guidelines for dataset-level metadata and submissions. ESS-DIVE also has formats for specific data types and we note that the guidelines presented here will be adopted for its model datasets in the future.

The NSF EarthCube rubric allows modelers to respond to a series of questions that assess potential uses of simulation data ranging from data production to knowledge production. A score is calculated based on their responses, indicating how much of the outputs (all data to minimal data) should be archived. The rubric considers the level of importance of eight themes in the simulation workflow: (1) data production for downstream uses (e.g., CMIP would score highly); (2) repository data accessibility; (3) simulation workflow accessibility (e.g., system requirements, code availability and ease of use); (4) post-processing workflow accessibility (e.g., system requirements, ease of use of scripts and documentation); (5) simulation data accessibility (e.g., follows community standards, ease of use with metadata and documentation); (6) research feature reproducibility; (7) cost of running simulations; and (8) cost of data repository storage and management services.

The AGU guidelines only require that the data that supports the research and visualizations presented in a journal article submission be archived in a FAIR-aligned data repository. They provide tiered options (acceptable, good, best) for citing and describing the model, configuration, and parameters within the journal article, and what to do regarding data corresponding to tables and figures, and model data output.

3.2 Diversity in terrestrial modeling data

Several terrestrial models are used in DOE research projects for standalone or coupled simulations (Table 2), but the majority of the codes used are sponsored by the DOE. The DOE models are run at different spatial (soil pore to global) and temporal scales and resolutions (Table 3). Each simulation can contain 5 to a few million files with average file sizes ranging from 100 MB to 2 TB (mean = 280 GB/file, median = 3 GB/file), and modelers currently require hundreds of megabytes to a few hundred terabytes of storage space (mean = 28 TB/modeler, median = 650 GB/modeler). While most modelers used HDF5 or netCDF file formats to save model outputs and metadata, some also used other common formats such as text, comma-separated value, or DAT files as well as formats unique to certain models (e.g., Tecplot files, XML, MESH, VTK, PY, EXO). There are numerous types of scripts used in a modeling workflow, ranging from single analyses for specific papers to scripts used every time for preparing model inputs. These scripts can similarly be in a diversity of file formats, including those produced by workflow tools such as Jupyter Notebooks (http://jupyter.org).
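The file counts, file sizes, and formats reported above are the kind of figures a modeler could gather with a short inventory script. The Python sketch below is one possible approach, not the procedure used in this study; the function name `inventory` and the demonstration files are hypothetical.

```python
import tempfile
from collections import Counter
from pathlib import Path
from statistics import mean, median


def inventory(root):
    """Count files, sizes, and file extensions under a simulation directory."""
    files = [path for path in Path(root).rglob("*") if path.is_file()]
    sizes = [path.stat().st_size for path in files]
    # Group files by lower-cased extension; files without one are labeled "(none)".
    formats = Counter(path.suffix.lower() or "(none)" for path in files)
    return {
        "n_files": len(files),
        "total_bytes": sum(sizes),
        "mean_bytes": mean(sizes) if sizes else 0,
        "median_bytes": median(sizes) if sizes else 0,
        "formats": dict(formats),
    }


# Demonstration on a throwaway directory with three small placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "run1.nc").write_bytes(b"\0" * 300)
    (Path(tmp) / "run2.nc").write_bytes(b"\0" * 100)
    (Path(tmp) / "params.csv").write_bytes(b"\0" * 50)
    report = inventory(tmp)
```

Reporting both the mean and the median is useful here because a few very large files can pull the mean far above the typical file size, as the figures quoted above suggest.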

Table 2

Summary of the standalone terrestrial models used by the 12 researchers participating in this study. Coupled models (e.g., ELM-FATES and ELM-PFLOTRAN) are not listed but were also considered in evaluating archiving needs.


| MODEL ACRONYM | MODEL NAME (ORGANIZATION) | REFERENCES | DESCRIPTION |
|---|---|---|---|
| ELM | Energy Exascale Earth System Model (E3SM) Land Model (DOE) | Golaz et al.; https://e3sm.org/ | Land model component of the E3SM Earth System Model |
| FATES | Functionally Assembled Terrestrial Ecosystem Simulator (DOE) | Koven et al.; https://github.com/NGEET/fates-release | Size and age-structured vegetation demographic model within a land surface model and can be coupled with an Earth system model |
| PFLOTRAN | Parallel Flow and Transport (DOE) | Hammond, Lichtner and Mills; https://www.pflotran.org | Parallel reactive flow and transport model for subsurface hydrobiogeochemical processes |
| ATS | Advanced Terrestrial Simulator (DOE) | Coon et al.; https://amanzi.github.io/ats/ | An integrated, distributed watershed hydrology model including surface and subsurface flow, energy transport, reactive transport, and ecohydrology. |
| CrunchFlow | N/A (DOE) | Steefel and Molins | Model for simulating multicomponent multi-dimensional reactive transport in porous media |
| MAAT | Multi-Assumption Architecture & Testbed (DOE) | Walker, Ye, et al.; https://github.com/walkeranthonyp/MAAT | Modular terrestrial ecosystem process modeling framework for building multiple models that vary in process representation/hypotheses. |
| CLM | Community Land Model (NCAR) | Lawrence et al.; https://www.cesm.ucar.edu/models/clm/ | Land model for the Community Earth System Model (CESM), a fully-coupled global climate model |
| ED2 | Ecosystem Demography Biosphere Model (NSF/NASA) | Longo et al.; https://github.com/EDmodel/ED2 | Size- and age-structured terrestrial biosphere model |
| PRMS | Precipitation Runoff Modeling System (USGS) | Markstrom et al.; https://www.usgs.gov/software/precipitation-runoff-modeling-system-prms | Deterministic process-based model developed to evaluate the impacts of climate and land use on streamflow and watershed hydrology. |
| SWAT | Soil and Water Assessment Tool (USDA/Texas A&M University) | Bieger et al.; https://swat.tamu.edu/ | Watershed to river basin-scale model used to simulate the quality and quantity of surface and ground water and predict the environmental impact of land use, land management practices, and climate change. |
| LPJ-GUESS | Lund-Potsdam-Jena General Ecosystem Simulator (Lund University) | Smith, Prentice and Sykes; https://web.nateko.lu.se/lpj-guess/ | Dynamic vegetation-terrestrial ecosystem model for regional or global studies |
| GDAY | Generic Decomposition and Yield | Comins and McMurtrie; https://github.com/mdekauwe/GDAY | Stand-scale ecosystem model that simulates carbon, nitrogen, and water dynamics. |
| SDGVM | Sheffield Dynamic Global Vegetation Model (Sheffield University) | Woodward and Lomas; https://bitbucket.org/walkeranthonyp/sdgvm/ | Terrestrial biosphere carbon cycle model for ecosystem to global scale simulations. Simple size and age structure. |
| OpenFOAM | N/A (OpenFOAM foundation) | https://openfoam.org/ | Computational fluid dynamics open source software |
| CALAND | California Natural and Working Lands Carbon and Greenhouse Gas Model (California Natural Resources Agency) | Di Vittorio and Simmonds; https://doi.org/10.5281/zenodo.3256727 | Carbon stock and flux model that simulates the effects of various management practices, land use and land cover change, wildfire, and climate change on ecosystem carbon dynamics across all California lands |

Table 3

Estimates of archiving needs for typical spatial and temporal representations of simulation data from DOE terrestrial models, which are the most commonly-used models by the researchers in this study. Note that the same models are often run at different spatial extents (e.g., site to global) and temporal duration (e.g., weeks to centuries).


DETAILS FOR TYPICAL SIMULATION [1] TO BE ARCHIVED

| MODEL | SPATIAL RESOLUTION OR REPRESENTATION | SPATIAL EXTENT | TEMPORAL RESOLUTION [2] | TEMPORAL DURATION | NO. OF FILES | MEAN FILE SIZE (GB) | TYPES OF FILE FORMATS | TOTAL ANNUAL STORAGE NEEDS (GB) |
|---|---|---|---|---|---|---|---|---|
| Multiple LSMs [3] | Point [4] | point | daily | 200 yrs | 300 | 0.1 | CSV | 50 |
| ELM | point | point | hourly, daily | 10 – 20 yrs | 20 | 0.004 | netCDF | 3 |
| ELM | 1/2° – 2° | global | monthly | 250 yrs | 2500 | 0.2 | netCDF | 15000 |
| ELM-FATES | point, ~1 km, ~1 degree | point, regional, and global modes | sub-daily, monthly | ~500 yrs | 1K – 10K | 50 | netCDF | 1000 |
| FATES | point | point | <hourly | 10 yrs | 70 | 3 | netCDF | 2000 |
| ELM-PFLOTRAN | 1 – 100 m | 100 m – 10 km | hourly/daily | 10+ yrs | 10 – 100 | 10 | HDF5, netCDF | 1000 |
| PFLOTRAN | <1 m | 5-6 km | <hourly | 30 yrs | 5 | 1000 | HDF5 | 10000 |
| ATS | 100 m – 250 m | 10 km | daily | 10 – 100 yrs | 20 | 100 | XML + HDF5, CSV | 1000 |
| ATS | <1 – 100 m | 10 m – 10 km | daily | 10 – 100 yrs | 2 | | XML + HDF5 | 1000 |
| ATS | 0.25 m | 25 m | daily | 100 yrs | 50 – 200 | | XML + HDF5 | 10 |
| CrunchFlow | <1 m | <1 km | <hourly | 30 days | 100 | 0.001 | TXT | 1 |

[1] Note that “ensembles” of simulations were not considered in this survey, except in the total annual storage needs reported.

[2] This could represent either the simulation temporal resolution, or output file temporal resolution.

[3] Here we use Land Surface Model “LSMs” to include both standard CMIP-style Earth System Models (e.g. ELM) and more complex vegetation phenology models (e.g. FATES).

[4] Note that “point” is used to indicate a single vertical column of cells or otherwise a single location in horizontal space.

3.3 Perspectives on best practices for preservation and reuse

There was broad consensus amongst the modelers participating in this study that model input files, metadata, and scripts used in the workflow or analysis should be archived for the data to be usable and traceable (Figure 1a). Many of the modelers considered it useful, as defined by an importance rank of 3 or higher (somewhat important to very important), to archive the entire workflow including model code (10 out of 12 modelers; mean importance = 4.3), the outputs corresponding to final simulations (8 out of 12; mean importance = 3.9), model input parameters and forcings (11 out of 12; mean importance = 4.8), and scripts for pre-processing and post-processing, model configuration, and analysis (11 out of 12; mean importance = 4.5).

Figure 1 

Perspectives from a group of 12 U.S. Department of Energy terrestrial model researchers on (a) the importance of archiving different components of model data in a public repository, (b) the period of time over which publicly archived model data remain useful, and (c) the purposes served by archiving model data in a public repository. The importance rankings for (a) and (c) are shown on a scale from 1 (not important at all) to 5 (extremely important), and represent average importance scores across the 12 researchers.

However, there were diverse opinions on the specifics of which model data files are worth preserving. If possible, modelers preferred to archive the majority of model data from final simulation runs (e.g., raw and aggregated outputs), with the exception of files already stored in a repository or public codebase separately with preexisting digital object identifiers (DOIs) or files produced from intermediate steps that are easily reproduced. However, modelers sometimes preferred to only archive high-level outputs corresponding to results presented in a journal article, because the full set of model outputs may be too large to store in most data repositories and can be reproduced with affordable computational cost. Fewer modelers ranked archiving of testing data as important (6 out of 12 modelers; mean importance = 3.9). The rationale provided was that frequently the validation datasets used to test model performance are archived elsewhere and can be referenced in the metadata of a published dataset.

Aside from the simulation files used to derive the published figures and tables in a journal article, modelers also run spin-up simulations and in some cases a small number of higher-resolution simulations than the final simulations used for publication. There was consensus that spin-up simulations are not a high priority for archiving, but that it is worthwhile to publicly archive restart files that allow a model data user to rerun a segment of a simulation in the event that they want to reanalyze the data.

Besides the data files, most modelers (11 out of 12) preferred that the specific scripts used for analysis be archived. However, if a modeler anticipates running analogous simulations many times, then the scripts and model outputs can be archived separately with DOIs, allowing the outputs to be updated over time. Ten out of 12 modelers agreed that model code should be publicly archived for various reasons, but they had different perspectives about where and for how long it should be archived given that model codes can evolve significantly over time. One consideration for storing model code in a data repository was the need for long-term preservation with citable DOIs. Alternatively, most models are currently stored in collaborative software development and sharing platforms (e.g., GitHub, Bitbucket) that interface with version control systems (VCS). Although VCS platforms were considered to be useful for versioning and for interaction on model development, releases, issue tracking, and bug fixes, there was concern that these platforms are not guaranteed to be long-term archives. An approach that some modelers used to balance the needs for long-term preservation and practical software development was to archive tagged releases of model code with a DOI by utilizing an established partnership between the GitHub software platform and the Zenodo data archive.
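To make the idea of pinning archived results to an exact code version concrete, the following Python sketch records a repository URL, a full commit hash, and an optional release DOI. It is an illustration under stated assumptions: the function names and record fields are hypothetical, the hash check assumes 40-character SHA-1 hashes (repositories using SHA-256 have longer hashes), and `current_commit` requires the `git` command-line tool to be installed.

```python
import re
import subprocess

# Assumption: the repository uses 40-character hexadecimal SHA-1 commit hashes.
FULL_SHA1 = re.compile(r"[0-9a-f]{40}")


def current_commit(repo_dir="."):
    """Return the full hash of the commit checked out in repo_dir (requires git)."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def code_reference(repo_url, commit, release_doi=None):
    """Build a small record that identifies the exact model code version used."""
    if not FULL_SHA1.fullmatch(commit):
        raise ValueError(f"expected a full 40-character commit hash, got {commit!r}")
    reference = {"repository": repo_url, "commit": commit}
    if release_doi:
        # DOI of a tagged release archived in a data repository, if one exists.
        reference["release_doi"] = release_doi
    return reference


# Hypothetical repository URL and commit hash, for illustration only.
reference = code_reference(
    "https://example.org/model.git",
    "0123456789abcdef0123456789abcdef01234567",
)
```

A record like this could be stored alongside the archived outputs so that readers can recover the source code even if the development branch has since moved on.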

The modelers also had different perspectives on how long publicly archived model data would remain useful (Figure 1b), spanning short (2-5 years; 3 out of 12 modelers), medium (5-10 years; 5 out of 12), and long (>10 years; 4 out of 12) time periods. However, they generally agreed that it was important (as indicated by an importance rank of 3 or higher) to archive data in public repositories for many purposes (Figure 1c), including sharing (11 out of 12 modelers; mean importance = 4.3), preservation (11 out of 12; mean importance = 4.5), clear documentation of the model runs (12 out of 12; mean importance = 4.5), ensuring reproducibility of workflows (8 out of 12; mean importance = 3.9), and reuse of model data (7 out of 12; mean importance = 4.2). Ultimately, all the modelers agreed that standards for archiving model data are needed to ensure its usability and were willing to learn new organizational guidelines or standardized reporting formats for model data.

Model data archived in a FAIR-aligned data repository for the 5 journal articles considered in this study include metadata and model outputs (4 out of 5 articles), followed by model inputs, testing data, model code, and a user guide or readme files (3 out of 5 articles for each component). Three of the 5 journal articles published model-related files under a single DOI, while 2 articles archived multiple datasets. Two of the researchers archived scripts and Jupyter Notebooks for generating inputs, post-processing model output, generating figures, or initiating model simulations. One researcher archived file-level metadata that defined variables and file-naming conventions for machine-readability. Three out of 5 authors made the model code available using GitHub (with or without a Commit ID referenced in article) or Zenodo. Most researchers referenced the storage location of the model data and code in the Data and Code availability section(s) of the paper (Supplemental Table 1).

We recommend the following guidelines for organizing model-related files for simulation workflows. Components of archived model data should include metadata, data files, and optionally user guides, which are described in further detail below. We also provide a decision tree to determine whether to group components into one data publication or split into multiple datasets (Figure 2). This decision tree helps address the challenges associated with choosing how much model data to save and other considerations in publishing model-related files, such as varying authorship for different model data components and repository storage limitations.

Figure 2 

Decision tree for determining recommended approach for grouping model-related files for public archiving.

  1. Metadata – This refers to pertinent information about data and code archived (e.g., abstract, geographical and temporal extents), as well as description of the files being archived with links to other DOI-issued publications within the entire simulation workflow, as applicable.
  2. Required Data Files – Archived datasets should specify or include model inputs, outputs, code, and scripts depending on whether the data are published elsewhere or exceed repository dataset size limits. File names should be unique and follow an intuitive naming convention to help with discoverability. File names should only contain letters, numbers, hyphens, and underscores, should not contain spaces, and should not rely on case-sensitive file systems.
    1. Model Inputs – Input files should be included unless publicly available elsewhere, in which case a hyperlink to the specific input files (e.g., climate forcings, meshes, soil parameterizations) should be provided in the metadata and user guide. Use open formats such as comma-separated value (.csv) or NetCDF (.nc) where possible.
    2. Model Outputs – Archive all model outputs if the total size of the data files is within the repository storage limitations. This output should include the raw and post-processed data and, if associated with a scientific publication, the data that support the main findings, tables, and figures. If the size of the model output exceeds repository storage limitations, use the decision tree (Figure 2) to evaluate which data to publish. Use open formats such as comma-separated value (.csv) or NetCDF (.nc) where possible.
    3. Model Code – Include the source code(s) used to generate the results in the paper unless the code is publicly available elsewhere (e.g., GitHub or Zenodo), in which case include the specific version, hash information, or citation allowing the exact source code to be recovered. Include links to any external model codes in the metadata and user guide. If published on GitHub, provide the commit hash associated with the specific version. If available, include a reference (with DOI) to the tagged release in an established data repository.
    4. Scripts – Include run scripts if they are necessary for running the model to generate published results. Optionally also include scripts necessary for reproducing the parameters and model configuration for the simulations and input files, for post-processing model outputs to produce the results (e.g., tables and figures in a publication), and for executing the entire workflow used to generate the model results.
  3. Optional Files
    1. File-level metadata (FLMD) – Include descriptions of all the data files as one file catalog (e.g., ). Optionally also include one data dictionary for each file type within the data publication describing columns and variables.
    2. Model Testing Data – Include data files of observations from each location simulated to produce the results in the paper in an open format (e.g., CSV). If the data are publicly available in another repository, include a reference (with DOI) in the metadata and user guide.
    3. Documentation or user guide – Include a readme file (e.g., pdf) for each site-specific or large-scale simulation and provide details on the model name and version number, and required data or code dependencies. Also include a citation for the model code and licensing information if applicable.
  4. Use in publications – If publishing model results, cite and include links to the data and code publication(s) in the Data or Code Availability section. Include the citations of the dataset and code publication(s) with DOI(s) in the references section. Examples of Data or Code Availability statements from the journal articles reviewed in this study are provided in Supplemental Table 1.
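
The file naming rules in item 2 lend themselves to an automated check before submission. The following Python sketch is one possible way to do this and is not part of the guidelines themselves. The function name `check_file_names` is hypothetical, and the sketch assumes that a period is permitted only to separate a file name from its extension(s), which the guidelines do not state explicitly.

```python
import re
from collections import defaultdict

# Letters, numbers, hyphens, and underscores only.
# Assumption: periods are allowed solely to introduce file extensions.
_ALLOWED = re.compile(r"[A-Za-z0-9_-]+(\.[A-Za-z0-9]+)*")


def check_file_names(names):
    """Return a dict mapping each problematic file name to a list of reasons."""
    problems = defaultdict(list)
    by_lowercase = defaultdict(list)

    for name in names:
        if not _ALLOWED.fullmatch(name):
            problems[name].append(
                "contains characters other than letters, numbers, "
                "hyphens, and underscores"
            )
        by_lowercase[name.lower()].append(name)

    # Names that collide once case is ignored would clash on
    # case-insensitive file systems, so flag every member of the group.
    for group in by_lowercase.values():
        if len(group) > 1:
            for name in group:
                problems[name].append("is not unique when case is ignored")

    return dict(problems)


if __name__ == "__main__":
    report = check_file_names(
        ["run_01.nc", "Run_01.nc", "soil params.csv", "forcing-2020.csv"]
    )
    for name, reasons in report.items():
        print(name, "->", "; ".join(reasons))
```

A check of this kind could be run over a dataset directory listing before upload; names that pass are not necessarily intuitive or discoverable, so the check complements rather than replaces manual review.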

Further details on these guidelines are described on the ESS-DIVE Community Space on GitHub (https://github.com/ess-dive-community/essdive-model-data-archiving-guidelines). The GitHub site also allows users of these guidelines to provide feedback and is used to track any future revisions to the guidelines ().

4. Discussion

The guidelines we propose are a first step towards improving search capabilities and discovery within model data files, and support the following scientific purposes: 1) repeat the simulations with the same models for traceability and evaluation of the main findings (e.g., data in figures and tables of scientific publication); 2) evaluate published model simulations against observations and other models to gain understanding about model discrepancies and evaluate model uncertainty; and 3) leverage the work for model intercomparison, synthesis of results for meta-analysis or model ensembles, development of new simulations (e.g., with new spatial domains or input parameters), and training. We also provide a decision flowchart as a framework for choosing how much of the model data workflow to archive, particularly when storage limitation is an issue or when flexibility is required for supporting a variety of model data archival options.

The guidelines can enable reproducibility of complex scientific workflows that include data ingestion to generate parameter files or other model inputs, running a model multiple times, and analysis of model outputs. We note that these guidelines are specifically focused on establishing provenance of the data used in simulations and enabling reproducibility of modeling workflows, which is sometimes referred to as traceability (). The guidelines are not sufficient to ensure computational (bitwise) reproducibility of model results, which is challenging because of the complexity of modeling codes and diversity of compute architectures and software libraries (; ). The ambiguity in how modelers perceive reproducibility may be one reason why it received a lower importance rank compared to other purposes for archiving model data (Figure 1c).

Although the guidelines were developed in partnership with DOE scientists, the breadth of models used in their research makes our recommendations broadly applicable to archival of data from other mechanistic process-based models. In comparison to pre-existing model data guidelines (EarthCube-RCN, NSF Arctic Data Center, ORNL-DAAC), our recommendations strike a balance between the complexity of considerations needed to properly archive the various components of model data and a need for the guidelines to be practical and useful for scientists. We have created additional user-friendly documentation using the GitBooks feature of GitHub () to enable adoption of these guidelines (https://ess-dive.gitbook.io/model-data-archiving-guidelines/).

4.2 Enabling model data intercomparison, workflow reproducibility and synthesis

Archiving model data using such guidelines can facilitate coordinated Model Intercomparison Projects (MIPs) and synthesis of data from individual simulation experiments. Data standardization is necessary for MIP efforts since the primary goal is to compare model outputs. Standards have been established and developed for Earth system model outputs, including standardized variable names, units, and other metadata, as part of intercomparison efforts such as CMIP and the Distributed Model Inter-comparison Project (DMIP) (). However, terrestrial models have typically not conformed to standards in their direct outputs. Sometimes a translation tool, such as the Climate Model Output Rewriter (CMOR; https://pcmdi.github.io/cmor-site/), can be used to translate the native model output to a standards-compliant format. Most of these toolchains are designed around large-scale modeling exercises and may not be applicable to small-scale studies, such as individual manuscripts or even niche intercomparison efforts. For example, the permafrost model intercomparison effort was a small MIP effort undertaken as part of the permafrost carbon network (), which produced a large number of manuscripts but with only a subset of the models having a standardized output. Another small MIP example that succeeded in establishing an internally consistent standardized format is the Free Air CO2 Enrichment Model-Data Synthesis (FACE-MDS), and in this instance, the format took several months to develop (; ; ). Furthermore, conflicting standards exist between MIPs with similar objectives such as the North American Carbon Program (NACP) Multi-scale synthesis and Terrestrial Model Intercomparison Project (MsTMIP; ) and the Global Carbon Project (GCP; ), which complicates efforts to converge towards a standard. The guidelines presented here are the first steps toward resolving these issues and enabling model intercomparisons. Further work is needed to develop more complex terrestrial model data standards for variable conventions, units, and other aspects relevant to specific MIP efforts.
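
To illustrate what translating native model output to a standardized format involves at its simplest, the Python sketch below renames variables and converts units using a lookup table. It is a minimal illustration only and does not use or reproduce the CMOR interface. The native variable names, the target names and units, and the function `standardize` are all hypothetical and are not taken from any MIP specification.

```python
# Hypothetical lookup table:
# native name -> (standard name, standard units, scale factor, offset)
TRANSLATION = {
    # gross primary productivity, native units assumed to be g m-2 day-1
    "GPP": ("gpp", "kg m-2 s-1", 1.0e-3 / 86400.0, 0.0),
    # soil temperature, native units assumed to be degrees Celsius
    "TSOIL": ("tsl", "K", 1.0, 273.15),
}


def standardize(native_output):
    """Rename variables and convert units according to TRANSLATION.

    native_output maps native variable names to lists of numeric values.
    Returns (standardized, unmapped), where unmapped lists the native
    variables that have no entry in the lookup table.
    """
    standardized = {}
    unmapped = []
    for name, values in native_output.items():
        if name not in TRANSLATION:
            unmapped.append(name)
            continue
        std_name, units, scale, offset = TRANSLATION[name]
        standardized[std_name] = {
            "units": units,
            "values": [value * scale + offset for value in values],
        }
    return standardized, unmapped


if __name__ == "__main__":
    converted, missing = standardize(
        {"GPP": [86.4], "TSOIL": [0.0], "LAI": [2.0]}
    )
    print(converted)
    print("No mapping defined for:", missing)
```

Reporting unmapped variables rather than silently dropping them reflects the situation described above, in which only a subset of a model's outputs may have an agreed standard name.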

Archiving model data from individual studies can also enable reproducibility of their workflows and reuse or synthesis of the data for other analyses. Individual researchers may pre-process data or parameterize and calibrate models in different ways, but the use of computational tools such as Jupyter Notebooks allows the archiving of such analyses and runtime scripts in a more transparent way for subsequent researchers to build on. For example, Koven et al. () synthesized multiple datasets on plant traits alongside other model drivers such as site-observed meteorology to run multiple instances of the FATES vegetation model and analyze its outputs. The workflow was captured in Jupyter-based scripts that were cited and archived in Zenodo with a DOI (). We note that despite the integration between Zenodo and GitHub, many projects hosted on GitHub do not take the extra step to archive their content into long-term data repositories (), and we highlight this as a key step toward long-term model data reuse and accreditation.

The use cases provided by the modelers participating in this study highlight some of the valuable outcomes of using a common methodology for curating terrestrial model data for publication (e.g., standardized output formats, variable names and units), thereby enabling synthesis of modeled and observational datasets. Coordination in the approaches used for curating the model data would also support the development of products for coupled models ().

4.3 Desired data repository features and cyberinfrastructure tools for improving model data storage and reuse

There are several cyberinfrastructure and data management challenges related to archiving model data. First, data are rapidly increasing in volume and complexity. For example, there is increasing use of ensemble model runs (e.g., ; ; ) and very high-resolution simulations (; , ), which are critical for watershed models and the global land-surface modeling community and result in very large output data volumes. Second, the data are extremely diverse across scientific domains and spatial and temporal scales. Third, there is a disconnect between model and observational data, and fragmentation between workflows attempting to integrate these data. This problem is difficult for many modeling workflows that require manual retrieval of data from multiple sources and subsequent pre-processing for use in modeling analyses.

The immediate need for many researchers is to publicly archive data associated with scientific publications to meet journal and funding requirements. Archiving big data on cloud platforms with public accessibility to analytical tools is becoming a trend and is especially important for models with terabyte to petabyte scale outputs. However, cloud storage can be prohibitively expensive and can incur recurring costs for storage, egress and access. Unfortunately, many data repositories are not designed to meet the expected annual storage needs of current large simulations (order of ~1-10 TB; Table 2) and need additional capabilities for enabling archival of datasets at this scale. First, a significant expansion of repository storage capacities is needed to support individual dataset sizes of hundreds of gigabytes to terabytes. In addition, improvements in data transfer capabilities, such as the use of programmatic web services or file transfer services (e.g., Globus; https://www.globus.org/), are needed to support large data ingestion and download. Data replication is needed for redundancy and long-term preservation, but poses a challenge with larger datasets. Data repositories also need to support versioning of the numerous files generated from model simulations over the course of a project, especially since many modelers change their archived data several times between manuscript preparation and final publication, and beyond.

A long-term need for modelers is to have a more seamless process for publication, such as a model-to-archive pipeline that would comprise various data repository resources and services that can support consistent archiving of diverse model data. For example, a support tool for assisting modelers in following the recommended guidelines would be useful, such as an interface or scripts that automate the writing and organization of the files comprising the simulation workflow components. This tool could be model-specific and assemble all the required data for publication in specified formats by extracting subsets of model simulations corresponding to specific runs, locations, variables, or figures. Such a tool could also be extended to enable more advanced querying, utilization, and synthesis of model datasets beyond the metadata. Another example is a tool that provides support for containerized images (e.g., Docker; https://www.docker.com/) containing model codes and associated data, which can make it much easier to reproduce model data and results.
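
As a rough indication of what the simplest form of such a support script might look like, the Python sketch below copies files into one folder per workflow component and writes a basic file catalog. It is an assumption-laden illustration rather than an existing ESS-DIVE tool: the function `assemble_package`, the folder names, the catalog file name `flmd.csv`, and the catalog column names are all hypothetical and do not reproduce any published FLMD reporting format.

```python
import csv
import shutil
from pathlib import Path

# Folder names loosely follow the required file categories in the
# guidelines; the exact names are an assumption made for this sketch.
COMPONENTS = ("model_inputs", "model_outputs", "model_code", "scripts")
CATALOG_FIELDS = ["file_name", "component", "size_bytes", "description"]


def assemble_package(files_by_component, package_dir):
    """Copy files into per-component folders and write a file catalog.

    files_by_component maps a component name to a list of
    (source_path, description) pairs. Returns the catalog path.
    """
    package_dir = Path(package_dir)
    rows = []
    for component in COMPONENTS:
        target = package_dir / component
        target.mkdir(parents=True, exist_ok=True)
        for source, description in files_by_component.get(component, []):
            source = Path(source)
            shutil.copy2(source, target / source.name)
            rows.append(
                {
                    "file_name": source.name,
                    "component": component,
                    "size_bytes": source.stat().st_size,
                    "description": description,
                }
            )

    catalog = package_dir / "flmd.csv"
    with catalog.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=CATALOG_FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return catalog
```

A model-specific version of this script would additionally need to subset outputs by run, location, or variable, which depends on each model's native output structure and is not attempted here.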

In an effort to improve the transparency of model-data integration and data provenance, repositories should consider mechanisms to provide links to internal and external datasets that are part of the pre- or post-processing workflows. For example, interoperability is needed between the repository or data center storing a researcher’s model data, and other systems that store data needed for generating model inputs or testing datasets. Linking datasets across repositories requires consensus on which existing metadata standards to use and how to identify the different relationships and linked data types needed to provide a comprehensive view of the model dataset. A longer-term need is a data-to-model pipeline that can enable integration of observational data available across data systems with simulation codes, which would dramatically improve the efficiency of modeling workflows. Such a pipeline could focus on supporting data formats that are typically used in model simulations (e.g., NetCDF and HDF5), including the ability to retrieve data through programmatic means and export data into these formats.

There is a pressing need for more repositories to archive the growing volumes of model datasets and design solutions to address the challenges posed by their size, diversity and interoperability requirements. Community engagement with modelers is essential to identify archival priorities and to develop practically feasible guidelines for curating standardized model data publications that follow data management best practices.

5. Conclusions

The terrestrial modeling community needs to publish standardized simulation datasets in repositories that can support large data archival, model data reuse, and integration with other data centers. In this study, we synthesize archiving needs across several terrestrial models used by U.S. DOE researchers and propose an initial set of guidelines that specify how different model data components (e.g., model inputs, outputs, scripts, metadata) should be archived. The guidelines serve different scientific purposes, including traceability of published research and reuse of data for model intercomparisons and synthesis efforts. We also provide guidance for splitting model data into multiple datasets depending on repository capabilities, authorship, and other considerations. Finally, we identify short-term and long-term repository features and software tools to assist modelers with archiving and sharing simulation data and codes, and improving their scientific workflows. These guidelines are broadly applicable beyond the models considered in this study, and are urgently needed given increasing volumes of published terrestrial model data.

Data Accessibility Statement

The data presented in this publication including the recommended guidelines are published in the ESS-DIVE repository (). Future updates to the guidelines will be managed and available through the ESS-DIVE community GitHub repository (https://github.com/ess-dive-community/essdive-model-data-archiving-guidelines).

Additional File

The additional file for this article can be found as follows:

Supplemental Information

File containing the list of questions and details of journal articles used in this study to assess model data archiving needs. DOI: https://doi.org/10.5334/dsj-2022-003.s1