DOI: 10.1145/1142473.1142534
Article

Provenance management in curated databases

Published: 27 June 2006

ABSTRACT

Curated databases in bioinformatics and other disciplines are the result of a great deal of manual annotation, correction and transfer of data from other sources. Provenance information concerning the creation, attribution, or version history of such data is crucial for assessing its integrity and scientific value. General purpose database systems provide little support for tracking provenance, especially when data moves among databases. This paper investigates general-purpose techniques for recording provenance for data that is copied among databases. We describe an approach in which we track the user's actions while browsing source databases and copying data into a curated database, in order to record the user's actions in a convenient, queryable form. We present an implementation of this technique and use it to evaluate the feasibility of database support for provenance management. Our experiments show that although the overhead of a naive approach is fairly high, it can be decreased to an acceptable level using simple optimizations.
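To make the idea of recording copy actions "in a convenient, queryable form" concrete, here is a minimal sketch of a naive provenance log that writes one record per user action. It is an illustration under stated assumptions, not the paper's implementation: the names `ProvRecord`, `CuratedStore`, `copy`, `insert`, and `came_from`, the path strings, and the record layout are all hypothetical, and the optimizations mentioned in the abstract are not shown.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ProvRecord:
    """One logged user action (hypothetical record layout)."""
    tid: int                # sequence number of the action
    op: str                 # "copy" or "insert"
    target: str             # path written in the curated database
    source: Optional[str]   # "db:path" for copies, None for manual inserts


class CuratedStore:
    """Toy curated database: a path -> value map plus a provenance log."""

    def __init__(self) -> None:
        self.data: Dict[str, str] = {}
        self.log: List[ProvRecord] = []
        self._next_tid = 1

    def _record(self, op: str, target: str, source: Optional[str]) -> None:
        # Naive strategy: append one log record for every single action.
        self.log.append(ProvRecord(self._next_tid, op, target, source))
        self._next_tid += 1

    def insert(self, target: str, value: str) -> None:
        """Manual annotation typed in by the curator."""
        self.data[target] = value
        self._record("insert", target, None)

    def copy(self, source_name: str, source_db: Dict[str, str],
             source_path: str, target: str) -> None:
        """Copy a value from a source database and log where it came from."""
        self.data[target] = source_db[source_path]
        self._record("copy", target, f"{source_name}:{source_path}")

    def came_from(self, target: str) -> Optional[str]:
        """Return the source of the most recent write to `target`, if any."""
        for rec in reversed(self.log):
            if rec.target == target:
                return rec.source
        return None


# Hypothetical usage: one copied value and one manual note.
source_db = {"/protein/1/name": "ABC1"}
store = CuratedStore()
store.copy("srcdb", source_db, "/protein/1/name", "/curated/entry7/name")
store.insert("/curated/entry7/note", "checked by hand")
print(store.came_from("/curated/entry7/name"))
```

Because every action appends a record, the log grows linearly with the number of operations, which gives some intuition for why a naive scheme can carry noticeable overhead.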

      • Published in

        SIGMOD '06: Proceedings of the 2006 ACM SIGMOD international conference on Management of data
        June 2006
        830 pages
        ISBN: 1595934340
        DOI: 10.1145/1142473

        Copyright © 2006 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        Overall Acceptance Rate: 785 of 4,003 submissions, 20%
