ABSTRACT
Provenance plays a major role in understanding and reusing the methods applied in a scientific experiment, as it provides a record of the inputs, the processes carried out, and the use and generation of intermediate and final results. For in-silico scientific experiments, a wide variety of scientific workflow systems (e.g., Wings, Taverna, Galaxy, Vistrails) have been created to support scientists. All of these systems produce some form of provenance about the executions of the workflows that encode scientific experiments. However, provenance is normally recorded at a very low level of detail, which makes it difficult to understand what happened during execution. In this paper we propose an approach to automatically obtain abstractions from low-level provenance data by finding common workflow fragments in workflow execution provenance and relating them to templates. We have tested our approach with a dataset of workflows published by the Wings workflow system. Our results show that these abstractions highlight the most common abstract methods used in the executions of a repository and relate different runs and workflow templates to each other.
Index Terms
- Detecting common scientific workflow fragments using templates and execution provenance