Archiving scientific data

Authors:
Peter Buneman, University of Edinburgh, Edinburgh, Scotland
Sanjeev Khanna, University of Pennsylvania, Philadelphia, Pennsylvania
Keishi Tajima, Japan Advanced Institute of Science and Technology, Ishikawa, Japan
Wang-Chiew Tan, University of California, Santa Cruz, Santa Cruz, California

ACM Transactions on Database Systems, Volume 29, Issue 1, pp. 2–42
https://doi.org/10.1145/974750.974752

Published: 01 March 2004

Abstract

Archiving is important for scientific data, where it is necessary to record all past versions of a database in order to verify findings based upon a specific version. Much scientific data is held in a hierarchical format and has a key structure that provides a canonical identification for each element of the hierarchy. In this article, we exploit these properties to develop an archiving technique that is both efficient in its use of space and preserves the continuity of elements through versions of the database, something that is not provided by traditional minimum-edit-distance diff approaches. The approach also uses timestamps: all versions of the data are merged into one hierarchy, where an element appearing in multiple versions is stored only once along with a timestamp. By identifying the semantic continuity of elements and merging them into one data structure, our technique is capable of providing meaningful change descriptions. The archive also allows us to easily answer certain temporal queries, such as retrieving any specific version from the archive and finding the history of an element. This is in contrast with approaches that store a sequence of deltas, where such operations may require undoing a large number of changes or significant reasoning with the deltas. A suite of experiments also demonstrates that our archive does not incur any significant space overhead when contrasted with diff approaches. Another useful property of our approach is that we use XML format to represent hierarchical data, and the resulting archive is also in XML. Hence, XML tools can be applied directly to our archive. In particular, we apply an XML compressor to our archive, and our experiments show that our compressed archive outperforms compressed diff-based repositories in space efficiency.
We also show how we can extend our archiving tool to an external memory archiver for higher scalability and describe various index structures that can further improve the efficiency of some temporal queries on our archive.
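
The merging idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes each version is a nested Python dict whose dict keys stand in for the canonical element keys, that a given key is consistently either an interior element or a leaf, and that a timestamp is simply the set of version numbers in which an element appears. The names `ArchiveNode`, `merge`, `retrieve`, and `history` are hypothetical.

```python
class ArchiveNode:
    """One keyed element of the merged archive (illustrative only)."""

    def __init__(self):
        self.versions = set()  # timestamp: versions in which this element appears
        self.children = {}     # element key -> ArchiveNode
        self.values = {}       # leaf text -> set of versions holding that text


def merge(node, version_tree, v):
    """Merge version number v (a nested dict) into the archive rooted at node."""
    node.versions.add(v)
    for key, sub in version_tree.items():
        # The key identifies the element, so the same element is found again
        # in later versions regardless of where it sits among its siblings.
        child = node.children.setdefault(key, ArchiveNode())
        if isinstance(sub, dict):
            merge(child, sub, v)
        else:
            child.versions.add(v)
            # Unchanged leaf text is stored once; only its timestamp grows.
            child.values.setdefault(sub, set()).add(v)


def retrieve(node, v):
    """Rebuild version v by keeping only elements whose timestamp contains v."""
    out = {}
    for key, child in node.children.items():
        if v not in child.versions:
            continue
        leaf = [text for text, vs in child.values.items() if v in vs]
        out[key] = leaf[0] if leaf else retrieve(child, v)
    return out


def history(node, path):
    """Return the sorted versions in which the element at the key path exists."""
    for key in path:
        node = node.children[key]
    return sorted(node.versions)


# Two made-up versions of a tiny keyed database.
v1 = {"gene:BRCA1": {"name": "BRCA1", "seq": "ATG"},
      "gene:TP53": {"name": "TP53"}}
v2 = {"gene:BRCA1": {"name": "BRCA1", "seq": "ATGC"},
      "gene:MYC": {"name": "MYC"}}

root = ArchiveNode()
merge(root, v1, 1)
merge(root, v2, 2)
```

With this structure, `retrieve(root, 1)` rebuilds the first version in a single pass over the archive, and `history(root, ["gene:TP53"])` reads an element's lifetime directly from its timestamp, without replaying or undoing any deltas. A real archiver would likely store timestamps more compactly (for example as version intervals) and serialize the result as XML, as the abstract describes.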

Supplemental Material

Article appendix: p2-buneman-app.pdf (144.7 KB)

Published in

ACM Transactions on Database Systems, Volume 29, Issue 1
March 2004
232 pages
ISSN: 0362-5915
EISSN: 1557-4644
DOI: 10.1145/974750

Copyright © 2004 ACM

Publisher
Association for Computing Machinery
New York, NY, United States

Author Tags
Keys for XML