
Archiving scientific data

Published: 01 March 2004

Abstract

Archiving is important for scientific data, where it is necessary to record all past versions of a database in order to verify findings based upon a specific version. Much scientific data is held in a hierarchical format and has a key structure that provides a canonical identification for each element of the hierarchy. In this article, we exploit these properties to develop an archiving technique that is both efficient in its use of space and preserves the continuity of elements through versions of the database, something that is not provided by traditional minimum-edit-distance diff approaches. The approach also uses timestamps: all versions of the data are merged into one hierarchy, where an element appearing in multiple versions is stored only once along with a timestamp. By identifying the semantic continuity of elements and merging them into one data structure, our technique is capable of providing meaningful change descriptions. The archive also allows us to easily answer certain temporal queries, such as retrieving any specific version from the archive and finding the history of an element. This is in contrast with approaches that store a sequence of deltas, where such operations may require undoing a large number of changes or significant reasoning with the deltas. A suite of experiments also demonstrates that our archive does not incur any significant space overhead when compared with diff approaches. Another useful property of our approach is that we use XML to represent hierarchical data, and the resulting archive is also in XML. Hence, XML tools can be applied directly to our archive. In particular, we apply an XML compressor to our archive, and our experiments show that our compressed archive outperforms compressed diff-based repositories in space efficiency.
We also show how we can extend our archiving tool to an external memory archiver for higher scalability and describe various index structures that can further improve the efficiency of some temporal queries on our archive.
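
The merging idea described in the abstract can be illustrated with a minimal Python sketch. It is not the article's implementation: it assumes the data has already been parsed into nested dictionaries whose keys serve as the canonical element identifiers, it ignores XML entirely, and it records an explicit set of version numbers per element, which may differ from the timestamp representation the article actually uses. The names `Node`, `merge`, `retrieve`, and `history` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One archived element, stored once no matter how many versions contain it."""
    versions: set = field(default_factory=set)    # versions in which the element exists
    children: dict = field(default_factory=dict)  # key -> Node, for inner elements
    values: dict = field(default_factory=dict)    # leaf value -> set of versions


def merge(node, data, version):
    """Merge one version of keyed hierarchical data into the archive, in place."""
    node.versions.add(version)
    if isinstance(data, dict):
        # Elements are matched by key, so continuity across versions is preserved.
        for key, sub in data.items():
            merge(node.children.setdefault(key, Node()), sub, version)
    else:
        # A leaf value seen before only gains a new version stamp.
        node.values.setdefault(data, set()).add(version)


def retrieve(node, version):
    """Reconstruct the data as it stood at `version`."""
    for value, stamps in node.values.items():
        if version in stamps:
            return value
    return {key: retrieve(child, version)
            for key, child in node.children.items()
            if version in child.versions}


def history(node, path):
    """Return {version: value} for the leaf element reached by following `path`."""
    for key in path:
        node = node.children[key]
    pairs = [(v, value) for value, stamps in node.values.items() for v in stamps]
    return dict(sorted(pairs))


# Hypothetical two-version database, used only for illustration.
archive = Node()
merge(archive, {"P1": {"name": "kinase", "length": 120}}, 1)
merge(archive, {"P1": {"name": "kinase", "length": 125},
                "P2": {"name": "ligase", "length": 80}}, 2)
```

In this sketch, `retrieve(archive, 1)` rebuilds the first version by filtering on version stamps, and `history(archive, ["P1", "length"])` lists the values one element took over time; neither operation replays a chain of deltas, which is the contrast the abstract draws with diff-based repositories.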




Published in ACM Transactions on Database Systems, Volume 29, Issue 1, March 2004, 232 pages
ISSN: 0362-5915
EISSN: 1557-4644
DOI: 10.1145/974750

Copyright © 2004 ACM

Publisher: Association for Computing Machinery, New York, NY, United States
