Abstract
Data management is growing in complexity as large-scale applications take advantage of the loosely coupled resources brought together by grid middleware and by abundant storage capacity. Metadata describing the data products used in and generated by these applications is essential to disambiguate the data and enable reuse. Data provenance, one kind of metadata, pertains to the derivation history of a data product starting from its original sources. In this paper we create a taxonomy of data provenance characteristics and apply it to current research efforts in e-science, focusing primarily on scientific workflow approaches. The main aspect of our taxonomy categorizes provenance systems based on why they record provenance, what they describe, how they represent and store provenance, and how they disseminate it. The survey culminates with an identification of open research problems in the field.