A survey of data provenance in e-science

Published: 01 September 2005

Abstract

Data management is growing in complexity as large-scale applications take advantage of the loosely coupled resources brought together by grid middleware and by abundant storage capacity. Metadata describing the data products used in and generated by these applications is essential to disambiguate the data and enable reuse. Data provenance, one kind of metadata, pertains to the derivation history of a data product starting from its original sources. In this paper we create a taxonomy of data provenance characteristics and apply it to current research efforts in e-science, focusing primarily on scientific workflow approaches. The main aspect of our taxonomy categorizes provenance systems based on why they record provenance, what they describe, how they represent and store provenance, and ways to disseminate it. The survey culminates with an identification of open research problems in the field.

Published in

ACM SIGMOD Record, Volume 34, Issue 3
September 2005
115 pages
ISSN: 0163-5808
DOI: 10.1145/1084805

Copyright © 2005 Authors

Publisher

Association for Computing Machinery
New York, NY, United States