Scientific workflows for computational reproducibility in the life sciences: Status, challenges and opportunities
Introduction
Novel technologies in several scientific areas have led to the generation of very large volumes of data at an unprecedented rate. This is particularly true in the life sciences, where, for instance, innovations in Next Generation Sequencing (NGS) have revolutionized genome sequencing. Current instruments can sequence 200 human genomes in one week, whereas sequencing the first human genome took 12 years [1]. Many laboratories have thus acquired NGS machines, resulting in an avalanche of data that must then be analyzed with series of tools and programs before new scientific knowledge and discoveries can emerge.
The same kind of situation occurs in completely different domains, such as plant phenotyping, which aims at understanding the complexity of interactions between plants and their environments in order to accelerate the discovery of new genes and traits, and thus optimize the use of genetic diversity across environments. Here, thousands of plants are grown in controlled environments, capturing a wealth of information and generating huge amounts of raw data that must be stored and then analyzed by very complex computational pipelines before scientific advances and discoveries can emerge.
Faced with the complexity of the analysis pipelines designed, the number of computational tools available, and the amount of data to manage, there is compelling evidence that a large majority of scientific discoveries will not stand the test of time: increasing the reproducibility of results is of paramount importance.
In recent years, many authors have drawn attention to the rise of purely computational experiments that are not reproducible [2], [3], [4], [5]. Major reproducibility issues have been highlighted in a very large number of cases: [6] showed that even when very specific tools were used, a textual description of the methodology followed was not sufficient to repeat experiments, while [7] focused on papers in top impact-factor journals and showed that the data made available by the authors were insufficient to make the experiments reproducible, despite the data publication policies recently put in place by most publishers.
Scientific communities in different domains have started to act to address this problem. Prestigious conferences (such as two major conferences from the database community, namely VLDB and SIGMOD) and journals such as PNAS, Biostatistics [8], Nature [9] and Science [10], to name only a few, encourage or require published results to be accompanied by all the information necessary to reproduce them. However, making results reproducible remains a very difficult and extremely time-consuming task for most authors.
In the meantime, considerable effort has been put into the development of scientific workflow management systems, which aim at supporting scientists in developing, running, and monitoring chains of data analysis programs. A variety of systems (e.g., [11], [12], [13]) have reached a level of maturity that allows scientists to use them for their bioinformatics experiments, including the analysis of NGS or plant phenotyping data.
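The chaining such systems automate can be sketched as a dependency-ordered sequence of analysis steps. The sketch below is purely illustrative: the step names (`quality_filter`, `align`, `count_variants`) are our toy stand-ins for real bioinformatics tools, not the API of any workflow system discussed here.

```python
# Minimal sketch of a chained analysis pipeline, in the spirit of what
# scientific workflow systems automate (step names are hypothetical).

def quality_filter(reads):
    """Keep only reads above a fixed quality threshold."""
    return [r for r in reads if r["quality"] >= 30]

def align(reads):
    """Stand-in for an alignment step: tag each read with a position."""
    return [{**r, "position": i} for i, r in enumerate(reads)]

def count_variants(alignments):
    """Stand-in for a variant-calling step: summarize the alignments."""
    return {"aligned_reads": len(alignments)}

# The "workflow" is the ordered chain of steps; a real workflow system
# additionally schedules the steps, monitors them, and logs their runs.
PIPELINE = [quality_filter, align, count_variants]

def run(data, steps=PIPELINE):
    for step in steps:
        data = step(data)
    return data

if __name__ == "__main__":
    reads = [{"quality": 35}, {"quality": 20}, {"quality": 40}]
    print(run(reads))  # {'aligned_reads': 2}
```

What distinguishes a workflow system from such an ad hoc script is precisely the machinery around the chain: explicit dependency declaration, execution monitoring, and recording of what was run.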
By capturing the exact methodology followed by scientists (in terms of experimental steps and the tools used), scientific workflows play a major role in the reproducibility of experiments. However, previous work has either introduced individual workflow systems for designing reproducible analyses (e.g., [14], [15]) without aiming to draw more general conclusions about the capability of scientific workflow systems to reproduce experiments, or has discussed computational reproducibility challenges in e-science (e.g., [16], [17]) without considering the specific case where scientific workflow systems are used to design an experiment. There is thus a need to better understand the core problems of reproducibility in the specific context of scientific workflow systems, which is the aim of this paper.
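To make "capturing the exact methodology" concrete, the sketch below records, for each step, the tool name, its parameters, and hashes of its input and output. This is our illustration of the kind of provenance trace such systems rely on, not the format of any particular system; `run_step` and `threshold` are hypothetical names.

```python
import hashlib
import json

def _digest(obj):
    """Stable hash of a JSON-serializable value."""
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()

def run_step(tool, params, data, trace):
    """Run one analysis step and append a provenance record for it."""
    result = tool(data, **params)
    trace.append({
        "tool": tool.__name__,      # which tool was used
        "params": params,           # with which parameters
        "input": _digest(data),     # on which input
        "output": _digest(result),  # producing which output
    })
    return result

def threshold(values, cutoff):
    """Hypothetical analysis step: keep values at or above a cutoff."""
    return [v for v in values if v >= cutoff]

trace = []
out = run_step(threshold, {"cutoff": 10}, [4, 12, 25], trace)

# Re-running the same step on the same data yields an identical record,
# which is the basis for checking that a re-run repeated the experiment.
trace2 = []
run_step(threshold, {"cutoff": 10}, [4, 12, 25], trace2)
assert trace == trace2
```

Comparing such traces across runs is, in essence, how one decides whether an experiment was repeated; the levels of reproducibility defined later differ in how much of this context (data, tools, environment) must match.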
In this paper, we place scientific workflows in the context of computational reproducibility in the life sciences and address the following key questions: How can we define the different levels of reproducibility that can be achieved when a workflow is used to implement an in silico experiment? What criteria make a scientific workflow system reproducibility-friendly? What do the scientific workflow systems in use in the life science community concretely offer to support reproducibility? And which open problems in computer science (in algorithmics, systems, knowledge representation, etc.) could have a major impact on reproducing experiments run with scientific workflow systems?
Accordingly, we make the following five contributions. We present three use cases from the life science domain involving in silico experiments, and elicit the concrete reproducibility issues that they raise (Section 2). We define several kinds of reproducibility that can be reached when scientific workflows are used to perform experiments (Section 3). We characterize and define the criteria that reproducibility-friendly scientific workflow systems need to cater for (Section 4). Using the framework of criteria identified, we place several representative and widely used workflow systems and companion tools within this framework (Section 5). We then discuss the challenges posed by reproducible scientific workflows in the life sciences and describe remaining research opportunities in several areas of computer science that may address them (Section 6), before closing the paper in Section 7.
Use cases
This paper starts with a set of three use cases, extracted from real projects, where scientific workflow systems are used to manage data analyses.
Levels of reproducibility
The use cases presented in the previous section exhibit different reproducibility needs. These can be better understood by examining the levels of reproducibility and reuse described in the literature. In this section we present these levels, and then define them in the specific context of scientific workflow systems.
Reproducibility-friendly criteria for scientific workflow management systems
Scientific workflow systems have very different shapes and features, making them far from equivalent in the context of reproducibility. In this section we introduce a set of criteria that play a major role in the ability of an in silico experiment to be reproduced. Specifically, we tease apart the criteria that need to be catered for when (i) specifying workflows, (ii) executing them, and (iii) packaging them together with their context and runtime environment.
Workflow system and companion tools faced with reproducibility and reuse: Status
In the first subsection, we review standards, models and tools proposed in recent years to cater for some of the workflow reproducibility needs presented in Section 4. The second subsection is dedicated to the evaluation of workflow systems against these criteria.
Challenges and opportunities
With reference to the specific problems encountered in the use cases, this section discusses the major remaining open challenges related to reproducibility and reuse of experiments implemented using scientific workflows. We clearly distinguish problems associated with social issues from computer science issues, and focus on the latter. The first subsection is dedicated to problems for which partial solutions are available (we clearly underline the remaining challenges), while the second is dedicated to open problems.
Conclusion
Reproducibility of in silico experiments analyzing life science data is recognized as a major need. As they provide a means to design and run scientific experiments, scientific workflow systems have a crucial role to play in enhancing reproducibility. In this context, the contributions of this paper are fivefold. First, we introduce a set of three use cases, highlighting reproducibility needs in real contexts. Second, we provide a terminology to describe the levels of reproducibility that can be reached when scientific workflows are used to perform experiments.
Acknowledgments
The authors acknowledge the support of GDR CNRS MaDICS, programme CPER Région Bretagne “CeSGO”, and programme Région Pays de la Loire “Connect Talent” (SyMeTRIC). We acknowledge funding by the call “Infrastructures in Biology and Health” in the framework of the French “Investments for the Future” (ANR-11-INBS-0012 and ANR-11-INBS-0013). This work was conducted in part at the IBC (Institute of Computational Biology) in Montpellier, France.
References (86)
- et al., Phenomics – technologies to relieve the phenotyping bottleneck, Trends Plant Sci. (2011)
- et al., TraitCapture: genomic and environment modelling of plant phenomic data, Curr. Opin. Plant Biol. (2014)
- et al., Common motifs in scientific workflows: An empirical analysis, Future Gener. Comput. Syst. (2014)
- et al., Using a suite of ontologies for preserving workflow-centric research objects, J. Web Sem. (2015)
- et al., Characterizing and profiling scientific workflows, Future Gener. Comput. Syst. (2013)
- et al., Domain-specific summarization of Life-Science e-experiments from provenance traces, Web Semant. Sci. Serv. Agents World Wide Web (2014)
- et al., DistillFlow: removing redundancy in scientific workflows
- et al., The data playground: An intuitive workflow specification environment, Future Gener. Comput. Syst. (2009)
- et al., Similarity assessment and efficient retrieval of semantic workflows, Inf. Syst. (2014)
- et al., Cost and accuracy aware scientific workflow retrieval based on distance measure, Inform. Sci. (2015)
- Effective and efficient similarity search in scientific workflow repositories, Future Gener. Comput. Syst.
- A decade’s perspective on DNA sequencing technology, Nature
- Toward reproducible computational research: an empirical analysis of data and code policy adoption by journals, PLoS One
- Implementing Reproducible Research
- Quantifying reproducibility in computational biology: the case of the tuberculosis drugome, PLoS One
- The economics of reproducibility in preclinical research, PLoS Biol.
- Next-generation sequencing data interpretation: enhancing reproducibility and accessibility, Nature Rev. Genet.
- Public availability of published research data in high-impact journals, PLoS One
- Reproducible research and biostatistics, Biostatistics
- Journals should drive data reproducibility, Nature
- Reproducibility in science, Sci. Signaling
- Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences, Genome Biol.
- Taverna: lessons in creating a workflow environment for the life sciences, J. Concurr. Comput.: Pract. Exp.
- OpenAlea: a visual programming and component-based software platform for plant modelling, Funct. Plant Biol.
- Managing rapidly-evolving scientific workflows, Proc. IPAW
- Use of semantic workflows to enhance transparency and reproducibility in clinical omics, Genome Med.
- Computational reproducibility: state-of-the-art, challenges, and database research opportunities
- Advances in understanding cancer genomes through second-generation sequencing, Nature Rev. Genet.
- OpenAlea: scientific workflows combining data analysis and simulation
- A survey of best practices for RNA-seq data analysis, Genome Biol.
- A review of bioinformatic pipeline frameworks, Brief. Bioinform.
- An open investigation of the reproducibility of cancer biology research, eLife
- Systematic variation improves reproducibility of animal experiments, Nat. Methods
- A proposal regarding reporting of in vitro testing results, Clin. Cancer Res.
- Drug development: Raise standards for preclinical cancer research, Nature
- Reproducibility in science: improving the standard for basic and preclinical research, Circ. Res.
- Results may vary: reproducibility, open science and all that jazz
- What does research reproducibility mean?, Sci. Transl. Med.
- Similarity search for scientific workflows, PVLDB
- Why workflows break – understanding and combating decay in Taverna workflows
Sarah Cohen-Boulakia is an Associate Professor at the Laboratoire de Recherche en Informatique at Universite Paris-Sud. She holds a Ph.D. in Computer Science and a habilitation from the Universite Paris-Sud. She has been working for fifteen years in multi-disciplinary groups involving computer scientists and biologists of various domains. She spent two years as a postdoctoral researcher at the University of Pennsylvania, USA and 18 months at the Institute of Computational Biology (IBC) of Montpellier in the Inria teams Zenith and VirtualPlants. Dr. Cohen-Boulakia’s research interests include provenance and design of scientific workflows, reproducibility of scientific experiments, and integration, querying and ranking in the context of biological and biomedical databases. She actively collaborates with major international groups in these domains, resulting in several major publications, in particular on provenance in scientific workflows. She currently co-animates with Ch. Blanchet a national working group on reproducibility of scientific experiments (GDR MaDICS).
Khalid Belhajjame is an associate professor at the University Paris-Dauphine. Before moving to Paris, he was a researcher for several years at the University of Manchester, and prior to that a Ph.D. student at the University of Grenoble. His research interests lie in the areas of information and knowledge management. He has made key contributions in the areas of pay-as-you-go data integration, e-Science, scientific workflow management, provenance tracking and exploitation, and semantic web services. He has published over 60 papers on these topics. Most of his research proposals were validated against real-world applications from the fields of astronomy, biodiversity and life sciences. He is a member of the editorial board of the MethodsX Elsevier journal, has participated in multiple European-, French- and UK-funded projects, and has been an active member of the W3C Provenance working group and the NSF-funded DataONE working group on scientific workflows and provenance.
Olivier Collin is a Senior Engineer at IRISA, head of GenOuest bioinformatics core facility, one of the major French bioinformatics platforms. His interests focus on designing methods and techniques to help biologist end-users analyze their complex biological data sets. In particular, his expertise lies in designing virtual environments for the execution of scientific workflows in the largest sense of the term, to allow reproducibility of experiments.
Jérôme Chopard is a researcher in computational biology who has worked at INRA as part of the OpenAlea group of active developers. He holds a degree from the Ecole Polytechnique and a Master’s degree in Biology of Evolution and Ecology. He worked at CIRAD, INRA and Inria and spent three years in the Center of Excellence for Climate Change Woodland and Forest Health at the University of Western Australia. He has long-term experience working in multi-disciplinary groups. His research interests include formalizing, designing and implementing biological models and processes using techniques drawn from physics and mathematics.
Christine Froidevaux is a full Professor at the Laboratoire de Recherche en Informatique at Universite Paris-Sud. Her interests include integrating and querying biological data sources, especially by means of ontologies, analyzing and guiding the design of scientific workflows, and ethics in science. Her interests in bioinformatics focus on the design and analysis of biological networks.
Alban Gaignard is a CNRS engineer who received a Ph.D. in Computer Science from the University of Nice-Sophia Antipolis in 2013. His research interests cover the fields of knowledge engineering (semantic web, linked data) and distributed systems (workflows, large-scale computing infrastructures). He has been actively involved in a large number of projects gathering researchers and engineers from various disciplines in computer science, biology and medicine.
Konrad Hinsen is a CNRS research scientist whose fields of work are theoretical biophysics and the methodology of computational science. He obtained a Ph.D. in Theoretical Physics from RWTH Aachen University in 1992. Dr. Hinsen is a member of the editorial board and co-editor of the Scientific Programming Department of Computing in Science and Engineering, published by the American Institute of Physics and the IEEE Computer Society. Dr. Hinsen has been working on computational reproducibility for several years and has co-organized two conferences on this topic.
Pierre Larmande is a staff scientist at IRD and an associate researcher in the Inria team Zenith. Since 2013, he has been leading the data integration group at the Institute of Computational Biology (IBC) of Montpellier. His main research interests are plant ontologies, data integration, the Semantic Web, metadata, knowledge management, and agronomic data management.
Yvan Le Bras is initially a marine biologist focusing on population structure, who received a Ph.D. in quantitative genetics and genomics from Rennes University. He has actively worked in integrative genomics and e-Science and has strong expertise in designing innovative Virtual Research Environments (VRE).
Frédéric Lemoine, who holds a Ph.D. in computer science from University Paris-Sud, joined Institut Pasteur in 2015 to work in the Evolutionary Bioinformatics Unit and participate in the development of new methodologies and algorithms in the field of evolution and molecular phylogeny. Dr. Lemoine spent one year in Lausanne, Switzerland as a postdoctoral fellow and five years at GenoSplice, a bioinformatics company, where he was responsible for next generation sequencing projects. He is an active user of several workflow systems, including Nextflow.
Fabien Mareuil is a research engineer at the Center of Bioinformatics, Biostatistics and Integrative Biology (C3BI) of the Institut Pasteur. He obtained a Ph.D. in structural bioinformatics in 2008. He worked for two years in the structural bioinformatics research team of Michael Nilges, Institut Pasteur, as a postdoctoral fellow. Since 2011, he has been in charge of the maintenance and deployment of the Pasteur Galaxy platform. He joined the web development group in the C3BI Hub team in 2015. He is involved in several Python development projects and in the French Galaxy Working Group of the French Institute of Bioinformatics (IFB).
Hervé Ménager is a research engineer at the Bioinformatics and Biostatistics Hub of the C3BI of the Institut Pasteur. His research interests include the development and deployment of scientific workflow systems, a subject he started working on at Arizona State University. He later joined the Institut Pasteur, where he has been one of the main designers and developers of the Mobyle scientific workflow system. He is currently in charge of the web development group in the C3BI Hub team, and is involved in the European ELIXIR infrastructure, where he contributes to the development of the bio.tools registry.
Christophe Pradal is a researcher at CIRAD (the French agricultural research and international cooperation organization working for the sustainable development of tropical and Mediterranean regions). He is a member of the VirtualPlants Inria team. His research interests include computer graphics and geometrical modeling, multiscale data structures and algorithms, component-based architectures for plant modeling, and scientific workflows. He is the project leader of the OpenAlea scientific workflow system. He has been involved in several international projects on designing methods for complex plant forms, reconstruction of plant shape, study of light interception, and design of multi-scale functional–structural plant models.
Christophe Blanchet, Ph.D. in Bioinformatics/Biochemistry, is a member of the Centre National de la Recherche Scientifique (CNRS), working at the French Institute of Bioinformatics (IFB). He has been involved in distributed computing for life sciences since 2001. From 2004 to 2010, he was a member of the European EGEE Grid infrastructure, chairing the Bioinformatics Applications Activity (NA4-Bioinformatics) during the final period. From 2005 to 2010, he took an active role in the Bioinformatics Network of Excellence EMBRACE (European Model for Bioinformatics Research and Community Education, EU-FP6). In the StratusLab (EU-FP7, 2010–12) and now in CYCLONE (H2020, 2015–17) projects, he has coordinated the definition and evaluation of bioinformatics use cases in the cloud. Christophe Blanchet now leads the e-infrastructure team of the IFB, setting up a national cloud infrastructure for life sciences in collaboration with the European ELIXIR infrastructure, and co-animates with S. Cohen-Boulakia a national working group on reproducibility of scientific experiments (GDR MaDICS).