Article

Provenance management in curated databases

Authors:
Peter Buneman, University of Edinburgh
Adriane Chapman, University of Michigan, Ann Arbor, MI
James Cheney, University of Edinburgh

SIGMOD '06: Proceedings of the 2006 ACM SIGMOD international conference on Management of data, June 2006, Pages 539–550. https://doi.org/10.1145/1142473.1142534

Published: 27 June 2006

ABSTRACT

Curated databases in bioinformatics and other disciplines are the result of a great deal of manual annotation, correction and transfer of data from other sources. Provenance information concerning the creation, attribution, or version history of such data is crucial for assessing its integrity and scientific value. General purpose database systems provide little support for tracking provenance, especially when data moves among databases. This paper investigates general-purpose techniques for recording provenance for data that is copied among databases. We describe an approach in which we track the user's actions while browsing source databases and copying data into a curated database, in order to record the user's actions in a convenient, queryable form. We present an implementation of this technique and use it to evaluate the feasibility of database support for provenance management. Our experiments show that although the overhead of a naive approach is fairly high, it can be decreased to an acceptable level using simple optimizations.
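
As a rough illustration of what recording copy actions "in a convenient, queryable form" could look like, the following minimal Python sketch logs one record per copy and answers the question "where did this item come from?". The record layout, the class and method names, and the example paths are all assumptions made for this sketch; they are not taken from the paper's implementation, and the sketch corresponds only loosely to a naive one-record-per-action approach.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class CopyRecord:
    """One logged user action: data at `source` was copied to `target`."""
    txn: int      # hypothetical, monotonically increasing action counter
    target: str   # hypothetical path in the receiving database
    source: str   # hypothetical path the data was copied from


class ProvenanceLog:
    """Append-only log of copy actions (naive: one record per copy)."""

    def __init__(self) -> None:
        self._records: List[CopyRecord] = []
        self._txn = 0

    def record_copy(self, source: str, target: str) -> CopyRecord:
        """Log that the data at `source` was copied into `target`."""
        self._txn += 1
        rec = CopyRecord(self._txn, target, source)
        self._records.append(rec)
        return rec

    def _latest_copy_into(self, target: str, before: int) -> Optional[CopyRecord]:
        """Most recent copy into `target` logged strictly before action `before`."""
        for rec in reversed(self._records):
            if rec.txn < before and rec.target == target:
                return rec
        return None

    def trace(self, target: str) -> List[str]:
        """Follow copies backwards from `target`, nearest source first."""
        chain: List[str] = []
        rec = self._latest_copy_into(target, before=self._txn + 1)
        while rec is not None:
            chain.append(rec.source)
            # Only consider what the source looked like when it was copied.
            rec = self._latest_copy_into(rec.source, before=rec.txn)
        return chain


# Example: data moves from a source database via a staging area
# into the curated database (all paths are made up).
log = ProvenanceLog()
log.record_copy("sourcedb/entry9/name", "staging/e1/name")
log.record_copy("staging/e1/name", "curated/p1/name")
print(log.trace("curated/p1/name"))
```

Restricting each backward step to records logged before the copy being followed keeps the trace faithful to the order in which the user acted: a later overwrite of an intermediate location does not change the recorded history of data copied out of it earlier.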

Index Terms

1. Human-centered computing
  1. Visualization
    1. Visualization application domains
      1. Scientific visualization
2. Information systems
  1. Information systems applications

Published in
SIGMOD '06: Proceedings of the 2006 ACM SIGMOD international conference on Management of data
June 2006
830 pages
ISBN: 1595934340
DOI: 10.1145/1142473
General Chairs: Clement Yu (University of Illinois at Chicago), Peter Scheuermann (Northwestern University)
Program Chair: Surajit Chaudhuri (Microsoft Research)
Copyright © 2006 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher: Association for Computing Machinery, New York, NY, United States
Author Tags
curation
provenance
storage
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 785 of 4,003 submissions, 20%

Article Metrics
- Total Citations: 184
- Total Downloads: 1,784
- Downloads (Last 12 months): 28
- Downloads (Last 6 weeks): 5