Future Generation Computer Systems

Volume 75, October 2017, Pages 284-298

Scientific workflows for computational reproducibility in the life sciences: Status, challenges and opportunities

https://doi.org/10.1016/j.future.2017.01.012

Highlights

  • Use cases from the Life Sciences highlighting reproducibility and reuse needs.

  • Terminology to describe reproducibility levels in scientific workflows.

  • Criteria defining reproducibility-friendly workflow systems, and an evaluation of existing systems.

  • Challenges and opportunities in scientific workflow reproducibility.

Abstract

With the development of new experimental technologies, biologists face an avalanche of data that must be computationally analyzed for scientific advancements and discoveries to emerge. Given the complexity of analysis pipelines, the large number of computational tools, and the enormous amount of data to manage, there is compelling evidence that many, if not most, scientific discoveries will not stand the test of time: increasing the reproducibility of computed results is of paramount importance.

The objective of this paper is to place scientific workflows in the context of reproducibility. To do so, we define several kinds of reproducibility that can be reached when scientific workflows are used to perform experiments. We characterize and define the criteria that need to be catered for by reproducibility-friendly scientific workflow systems, and use these criteria to position several representative and widely used workflow systems and companion tools within this framework. We also discuss the remaining challenges posed by reproducible scientific workflows in the life sciences. Our study was guided by three use cases from the life science domain involving in silico experiments.

Introduction

Novel technologies in several scientific areas have led to the generation of very large volumes of data at an unprecedented rate. This is particularly true for the life sciences, where, for instance, innovations in Next Generation Sequencing (NGS) have led to a revolution in genome sequencing. Current instruments can sequence 200 human genomes in one week, whereas sequencing the first human genome took 12 years [1]. Many laboratories have thus acquired NGS machines, resulting in an avalanche of data that has to be further analyzed using a series of tools and programs for new scientific knowledge and discoveries to emerge.

The same kind of situation occurs in completely different domains, such as plant phenotyping, which aims at understanding the complexity of interactions between plants and their environments in order to accelerate the discovery of new genes and traits and thus optimize the use of genetic diversity across environments. Here, thousands of plants are grown in controlled environments, capturing a large amount of information and generating huge volumes of raw data that have to be stored and then analyzed by very complex computational analysis pipelines for scientific advancements and discoveries to emerge.

Given the complexity of the analysis pipelines designed, the number of computational tools available, and the amount of data to manage, there is compelling evidence that the large majority of scientific discoveries will not stand the test of time: increasing the reproducibility of results is of paramount importance.

In recent years, many authors have drawn attention to the rise of purely computational experiments that are not reproducible [2], [3], [4], [5]. Major reproducibility issues have been highlighted in a very large number of cases: [6] showed that, even when very specific tools were used, the textual description of the methodology followed was not sufficient to repeat experiments, while [7] focused on papers in top impact factor journals and showed that the authors made insufficient data available for experiments to be reproducible, despite the data publication policies recently put in place by most publishers.

Scientific communities in different domains have started to act in an attempt to address this problem. Prestigious conferences (such as two major conferences from the database community, namely VLDB and SIGMOD) and journals such as PNAS, Biostatistics [8], Nature [9] and Science [10], to name only a few, encourage or require published results to be accompanied by all the information necessary to reproduce them. However, making results reproducible remains a very difficult and extremely time-consuming task for most authors.

In the meantime, considerable efforts have been put into the development of scientific workflow management systems. They aim at supporting scientists in developing, running, and monitoring chains of data analysis programs. A variety of systems (e.g., [11], [12], [13]) have reached a level of maturity that allows them to be used by scientists for their bioinformatics experiments, including analysis of NGS or plant phenotyping data.
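
To make this concrete, the sketch below shows in plain Python what such systems make explicit: each step of an analysis chain names the tool it invokes and declares its inputs and outputs, so that the chain is an inspectable object rather than an ad hoc script. This is only an illustration of the idea, not the API of Galaxy, Taverna, OpenAlea or any other cited system; the tool names and file paths are hypothetical placeholders.

```python
# Illustrative sketch only (not the API of any cited workflow system):
# an analysis chain described as explicit steps with declared inputs,
# outputs and tool invocations. Tool names and paths are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Step:
    name: str        # human-readable step name
    command: list    # tool invocation, as an argument list
    inputs: list     # data consumed by the step
    outputs: list    # data produced by the step

@dataclass
class Workflow:
    steps: list = field(default_factory=list)

    def dry_run(self):
        # Print the planned invocations in order; a real workflow system
        # would execute them, schedule them from the data dependencies,
        # and record provenance for every run.
        for step in self.steps:
            print(f"[{step.name}] {' '.join(step.command)} "
                  f"({', '.join(step.inputs)} -> {', '.join(step.outputs)})")

wf = Workflow(steps=[
    Step("align", ["my_aligner", "--ref", "ref.fa", "reads.fq"],
         ["ref.fa", "reads.fq"], ["aln.sam"]),
    Step("call", ["my_variant_caller", "aln.sam"],
         ["aln.sam"], ["calls.vcf"]),
])

if __name__ == "__main__":
    wf.dry_run()
```

Making the steps explicit in this way is what allows a system to re-run, share and compare analyses, which is the basis of the reproducibility levels discussed in Section 3.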

By capturing the exact methodology followed by scientists (in terms of experimental steps and the tools used), scientific workflows play a major role in the reproducibility of experiments. However, previous work has either introduced individual workflow systems that allow reproducible analyses to be designed (e.g., [14], [15]), without aiming to draw more general conclusions about the capabilities of scientific workflow systems to reproduce experiments, or has discussed computational reproducibility challenges in e-science (e.g., [16], [17]) without considering the specific case where scientific workflow systems are used to design an experiment. There is thus a need to better understand the core problem of reproducibility in the specific context of scientific workflow systems, which is the aim of this paper.
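
As a rough illustration of this point, the sketch below records, for each executed step, the tool, its version, its parameters, and checksums of its inputs and outputs. It is not the provenance model of any particular workflow system (those typically rely on richer representations, such as the W3C PROV model); the function names and record fields were chosen for the example only.

```python
# Minimal provenance-capture sketch (illustration only, not the provenance
# model of any cited system): for each executed step, record the tool, its
# parameters, and checksums of inputs and outputs, so the exact methodology
# can later be inspected and compared across runs.
import hashlib
import json
import platform
from datetime import datetime, timezone

def checksum(path):
    """SHA-256 of a file's content, used to identify data artefacts."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def record_step(trace, name, tool, version, params, inputs, outputs):
    """Append one step's provenance record to the in-memory trace."""
    trace.append({
        "step": name,
        "tool": tool,
        "tool_version": version,
        "parameters": params,
        "inputs": {p: checksum(p) for p in inputs},
        "outputs": {p: checksum(p) for p in outputs},
        "executed_at": datetime.now(timezone.utc).isoformat(),
        "platform": platform.platform(),
    })

def save_trace(trace, path="provenance.json"):
    """Persist the trace so it can be published alongside the results."""
    with open(path, "w") as f:
        json.dump(trace, f, indent=2)

# Hypothetical usage, after the corresponding step has produced its outputs:
# trace = []
# record_step(trace, "align", "my_aligner", "1.2.0",
#             {"seed": 42}, ["reads.fq"], ["aln.sam"])
# save_trace(trace)
```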

In this paper, we place scientific workflows in the context of computational reproducibility in the life sciences to provide answers to the following key questions: How can we define the different levels of reproducibility that can be achieved when a workflow is used to implement an in silico experiment? What criteria make a scientific workflow system reproducibility-friendly? What do the scientific workflow systems in use in the life science community concretely offer to deal with reproducibility? Which open problems in computer science (in algorithmics, systems, knowledge representation, etc.) could have a major impact on the problem of reproducing experiments when scientific workflow systems are used?

Accordingly, we make the following five contributions. We present three use cases from the life science domain involving in silico experiments, and elicit the concrete reproducibility issues that they raise (Section 2). We define several kinds of reproducibility that can be reached when scientific workflows are used to perform experiments (Section 3). We characterize and define the criteria that need to be catered for by reproducibility-friendly scientific workflow systems (Section 4). Using the criteria identified, we place several representative and widely used workflow systems and companion tools within this framework (Section 5). We then discuss the challenges posed by reproducible scientific workflows in the life sciences and describe the remaining research opportunities in several areas of computer science which may address them (Section 6), before closing the paper in Section 7.

Section snippets

Use cases

This paper starts with a set of three use cases, extracted from real projects, where scientific workflow systems are used to manage data analyses.

Levels of reproducibility

The use cases presented in the previous section exhibit different reproducibility needs. These needs can be better situated by examining the levels of reproducibility and reuse described in the literature. In this section we present such levels of reproducibility, and then define them in the specific context of the use of scientific workflow systems.
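
To preview the distinction informally (the precise definitions are given in the remainder of this section of the full text), the strictest level asks whether re-running the same workflow on the same data yields bit-for-bit identical results, whereas weaker levels only require results that agree within some tolerance, for instance when floating-point or platform differences are expected. The small sketch below illustrates both kinds of check; the function names and the tolerance are hypothetical choices made for the example.

```python
# Illustration of two ways to compare the results of two workflow runs:
# a strict, bitwise check and a looser numerical-equivalence check.
# Function names and the tolerance value are hypothetical.
import hashlib

def identical(path_a, path_b):
    """Strict check: the two result files are bit-for-bit the same."""
    digest = lambda p: hashlib.sha256(open(p, "rb").read()).hexdigest()
    return digest(path_a) == digest(path_b)

def equivalent(values_a, values_b, rel_tol=1e-6):
    """Looser check: numerical results agree within a relative tolerance."""
    return all(abs(a - b) <= rel_tol * max(abs(a), abs(b), 1.0)
               for a, b in zip(values_a, values_b))
```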

Reproducible-friendly criteria for scientific workflow management systems

Scientific workflow systems have very different shapes and features, meaning that they are not equivalent in the context of reproducibility. In this section we introduce a set of criteria that play a major role in the ability of an in silico experiment to be reproducible. Specifically, we tease apart the criteria that need to be catered for when (i) specifying workflows, (ii) executing them, and (iii) packaging them, considering the context and runtime environment together with the reproducibility levels
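
As a minimal illustration of the packaging concern mentioned above, the sketch below captures a snapshot of the runtime environment (interpreter, operating system, installed packages) so that it can be archived alongside a workflow's results. This is an assumed, simplified example rather than the packaging mechanism of any cited system; real approaches range from virtual machines and containers to richer research-object bundles.

```python
# Sketch only: capture the runtime environment of a workflow run so a later
# re-execution attempt can start from the same software context.
import json
import platform
import sys
from importlib import metadata

def environment_snapshot():
    """Return interpreter, OS and installed-package information."""
    return {
        "python": sys.version,
        "platform": platform.platform(),
        "packages": {dist.metadata["Name"]: dist.version
                     for dist in metadata.distributions()},
    }

if __name__ == "__main__":
    with open("environment.json", "w") as f:
        json.dump(environment_snapshot(), f, indent=2)
```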

Workflow systems and companion tools faced with reproducibility and reuse: Status

In the first subsection, we review the standards, models and tools that have been proposed in recent years to cater for some of the workflow reproducibility needs presented in Section 4. The second subsection is dedicated to the evaluation of workflow systems against these criteria.

Challenges and opportunities

While referring to the specific problems encountered in the use cases, this section discusses the major remaining open challenges related to the reproducibility and reuse of experiments implemented using scientific workflows. We clearly distinguish problems associated with social issues from computer science issues, and focus on the latter. The first subsection is dedicated to problems for which partial solutions are available (we clearly underline which challenges remain), while the next

Conclusion

Reproducibility of in silico experiments analyzing life science data is recognized to be a major need. As they provide a means to design and run scientific experiments, scientific workflow systems have a crucial role to play in enhancing reproducibility. In this context, the contributions of this paper are five-fold. First, we introduce a set of three use cases, highlighting reproducibility needs in real contexts. Second, we provide a terminology to describe reproducibility levels when

Acknowledgments

The authors acknowledge the support of GDR CNRS MaDICS, programme CPER Région Bretagne “CeSGO”, and programme Région Pays de la Loire “Connect Talent” (SyMeTRIC). We acknowledge funding by the call “Infrastructures in Biology and Health” in the framework of the French “Investments for the Future” (ANR-11-INBS-0012 and ANR-11-INBS-0013). This work was conducted in part at the IBC (Institute of Computational Biology) in Montpellier, France.


References (86)

  • Starlinger, J., et al., Effective and efficient similarity search in scientific workflow repositories, Future Gener. Comput. Syst. (2016)
  • Mardis, E.R., A decade’s perspective on DNA sequencing technology, Nature (2011)
  • Stodden, V., et al., Toward reproducible computational research: an empirical analysis of data and code policy adoption by journals, PLoS One (2013)
  • Stodden, V., et al., Implementing Reproducible Research (2014)
  • Garijo, D., et al., Quantifying reproducibility in computational biology: the case of the tuberculosis drugome, PLoS One (2013)
  • Freedman, L.P., et al., The economics of reproducibility in preclinical research, PLoS Biol. (2015)
  • Nekrutenko, A., et al., Next-generation sequencing data interpretation: enhancing reproducibility and accessibility, Nature Rev. Genet. (2012)
  • Alsheikh-Ali, A.A., et al., Public availability of published research data in high-impact journals, PLoS One (2011)
  • Peng, R.D., Reproducible research and biostatistics, Biostatistics (2009)
  • Santori, G., Journals should drive data reproducibility, Nature (2016)
  • Yaffe, M.B., Reproducibility in science, Sci. Signaling (2015)
  • Goecks, J., et al., Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences, Genome Biol. (2010)
  • Oinn, T., et al., Taverna: lessons in creating a workflow environment for the life sciences, J. Concurr. Comput.: Pract. Exp. (2002)
  • Pradal, C., et al., OpenAlea: a visual programming and component-based software platform for plant modelling, Funct. Plant Biol. (2008)
  • Freire, J., et al., Managing rapidly-evolving scientific workflows, Proc. IPAW (2006)
  • Zheng, C.L., et al., Use of semantic workflows to enhance transparency and reproducibility in clinical omics, Genome Med. (2015)
  • Freire, J., Fuhr, N., Rauber, A., Reproducibility of Data-Oriented Experiments in e-Science, Technical Report, Dagstuhl...
  • Freire, J., et al., Computational reproducibility: state-of-the-art, challenges, and database research opportunities
  • Meyerson, M., et al., Advances in understanding cancer genomes through second-generation sequencing, Nature Rev. Genet. (2010)
  • Pradal, C., et al., OpenAlea: scientific workflows combining data analysis and simulation
  • Schurr, U., Tardieu, F., Inzé, D., Dreyer, X., Durner, J., Altmann, T., Doonan, J., Bennett, M., EMPHASIS—European...
  • Conesa, A., et al., A survey of best practices for RNA-seq data analysis, Genome Biol. (2016)
  • Leipzig, J., A review of bioinformatic pipeline frameworks, Brief. Bioinform. (2016)
  • Errington, T.M., et al., An open investigation of the reproducibility of cancer biology research, Elife (2014)
  • Richter, S.H., et al., Systematic variation improves reproducibility of animal experiments, Nat. Methods (2010)
  • Smith, M.A., et al., A proposal regarding reporting of in vitro testing results, Clin. Cancer Res. (2013)
  • Begley, C.G., et al., Drug development: Raise standards for preclinical cancer research, Nature (2012)
  • Begley, C.G., et al., Reproducibility in science: improving the standard for basic and preclinical research, Circ. Res. (2015)
  • Drummond, C., Replicability is not reproducibility: nor is it good science, 2009, unpublished note, highly...
  • Goble, C., Results may vary: reproducibility, open science and all that jazz
  • Goodman, S.N., et al., What does research reproducibility mean?, Sci. Transl. Med. (2016)
  • Starlinger, J., et al., Similarity search for scientific workflows, PVLDB (2014)
  • Zhao, J., et al., Why workflows break: understanding and combating decay in Taverna workflows

Sarah Cohen-Boulakia is an Associate Professor at the Laboratoire de Recherche en Informatique at Universite Paris-Sud. She holds a Ph.D. in Computer Science and a habilitation from Universite Paris-Sud. She has been working for fifteen years in multi-disciplinary groups involving computer scientists and biologists from various domains. She spent two years as a postdoctoral researcher at the University of Pennsylvania, USA, and 18 months at the Institute of Computational Biology (IBC) of Montpellier in the Inria teams Zenith and VirtualPlants. Dr. Cohen-Boulakia’s research interests include provenance and design of scientific workflows, reproducibility of scientific experiments, and integration, querying and ranking in the context of biological and biomedical databases. She actively collaborates with major international groups in these domains, resulting in several major publications, in particular on provenance in scientific workflows. She currently co-leads, with Ch. Blanchet, a national working group on reproducibility of scientific experiments (GDR MaDICS).

Khalid Belhajjame is an Associate Professor at the University Paris-Dauphine. Before moving to Paris, he was a researcher for several years at the University of Manchester, and prior to that a Ph.D. student at the University of Grenoble. His research interests lie in the areas of information and knowledge management. He has made key contributions in the areas of pay-as-you-go data integration, e-Science, scientific workflow management, provenance tracking and exploitation, and semantic web services. He has published over 60 papers on the aforementioned topics. Most of his research proposals were validated against real-world applications from the fields of astronomy, biodiversity and the life sciences. He is a member of the editorial board of the MethodsX Elsevier journal, has participated in multiple European-, French- and UK-funded projects, and has been an active member of the W3C Provenance working group and the NSF-funded DataONE working group on scientific workflows and provenance.

Olivier Collin is a Senior Engineer at IRISA and head of the GenOuest bioinformatics core facility, one of the major French bioinformatics platforms. His interests focus on designing methods and techniques to help biologist end-users analyze their complex biological data sets. In particular, his expertise lies in designing virtual environments for the execution of scientific workflows in the broadest sense of the term, to allow reproducibility of experiments.

Jérôme Chopard is a researcher in computational biology who has worked at INRA as part of the OpenAlea group of active developers. He holds a degree from the Ecole Polytechnique and a Master’s degree in Biology of Evolution and Ecology. He has worked at CIRAD, INRA and Inria and spent three years in the Center of Excellence for Climate Change Woodland and Forest Health at the University of Western Australia. He has long-term experience working in multi-disciplinary groups. His research interests include formalizing, designing and implementing biological models and processes using techniques drawn from physics and mathematics.

Christine Froidevaux is a full Professor at the Laboratoire de Recherche en Informatique at Universite Paris-Sud. Her interests include integrating and querying biological data sources, especially by means of ontologies, analyzing and guiding the design of scientific workflows, and ethics in science. Her interests in bioinformatics focus on the design and analysis of biological networks.

Alban Gaignard is a CNRS engineer who obtained a Ph.D. in Computer Science from the University of Nice-Sophia Antipolis in 2013. His research interests cover the fields of knowledge engineering (semantic web, linked data) and distributed systems (workflows, large-scale computing infrastructures). He has been actively involved in a large number of projects gathering researchers and engineers from various disciplines in computer science, biology and medicine.

Konrad Hinsen is a CNRS research scientist whose fields of work are theoretical biophysics and the methodology of computational science. He obtained a Ph.D. in Theoretical Physics from RWTH Aachen University in 1992. Dr. Hinsen is a member of the editorial board and co-editor of the Scientific Programming Department of Computing in Science and Engineering, published by the American Institute of Physics and the IEEE Computer Society. Dr. Hinsen has been working on computational reproducibility for several years and has co-organized two conferences on this topic.

Pierre Larmande is a staff scientist at IRD and an associate researcher in the Inria team Zenith. Since 2013, he has been leading the data integration group at the Institute of Computational Biology (IBC) of Montpellier. His main research interests are plant ontologies, data integration, the Semantic Web, metadata, knowledge management, and agronomic data management.

Yvan Le Bras originally trained as a marine biologist, focusing on population structure, and received a Ph.D. in quantitative genetics and genomics from Rennes University. He has actively worked in integrative genomics and e-Science and has strong expertise in designing innovative Virtual Research Environments (VRE).

Frédéric Lemoine, who holds a Ph.D. in computer science from University Paris-Sud, joined the Institut Pasteur in 2015 to work in the Evolutionary Bioinformatics Unit and participate in the development of new methodologies and algorithms in the field of evolution and molecular phylogeny. Dr. Lemoine spent one year in Lausanne, Switzerland, as a postdoctoral fellow and five years at GenoSplice, a bioinformatics company, where he was responsible for next-generation sequencing projects. He is an active user of several workflow systems, including Nextflow.

Fabien Mareuil is a research engineer at the Center of Bioinformatics, Biostatistics and Integrative Biology (C3BI) of the Institut Pasteur. He obtained a Ph.D. in structural bioinformatics in 2008. He worked for two years as a postdoctoral fellow in the structural bioinformatics research team of Michael Nilges at the Institut Pasteur. Since 2011, he has been in charge of the maintenance and deployment of the Pasteur Galaxy platform. He joined the web development group in the C3BI Hub team in 2015. He is involved in several Python development projects and in the French Galaxy Working Group of the French Institute of Bioinformatics (IFB).

Hervé Ménager is a research engineer at the Bioinformatics and Biostatistics Hub of the C3BI of the Institut Pasteur. His research interests include the development and deployment of scientific workflow systems, a subject he started working on at Arizona State University. He later joined the Institut Pasteur, where he has been one of the main designers and developers of the Mobyle scientific workflow system. He is currently in charge of the web development group in the C3BI Hub team, and is involved in the European ELIXIR infrastructure, where he contributes to the development of the bio.tools registry.

Christophe Pradal is a researcher at CIRAD (the French agricultural research and international cooperation organization working for the sustainable development of tropical and Mediterranean regions). He is a member of the VirtualPlants Inria team. His research interests include computer graphics and geometrical modeling, multiscale data structures and algorithms, component-based architectures for plant modeling, and scientific workflows. He is the project leader of the OpenAlea scientific workflow system. He has been involved in several international projects on designing methods for complex plant forms, reconstruction of plant shape, study of light interception, and design of multi-scale functional–structural plant models.

Christophe Blanchet, Ph.D. in Bioinformatics/Biochemistry, is a member of the Centre National de la Recherche Scientifique (CNRS), working at the French Institute of Bioinformatics (IFB). He has been involved in distributed computing for the life sciences since 2001. From 2004 to 2010, he was a member of the European EGEE Grid infrastructure, chairing the Bioinformatics Applications Activity (NA4-Bioinformatics) during the latter period. From 2005 to 2010, he took an active role in the Bioinformatics Network of Excellence EMBRACE (European Model for Bioinformatics Research and Community Education, EU-FP6). In the StratusLab (EU-FP7, 2010–12) and now the CYCLONE (H2020, 2015–17) projects, he coordinates the definition and evaluation of bioinformatics use cases in the cloud. Christophe Blanchet now leads the e-infrastructure team of the IFB, setting up a national cloud infrastructure for the life sciences in collaboration with the European ELIXIR infrastructure, and co-leads, with S. Cohen-Boulakia, a national working group on reproducibility of scientific experiments (GDR MaDICS).
