Abstract
Archiving is important for scientific data, where it is necessary to record all past versions of a database in order to verify findings based upon a specific version. Much scientific data is held in a hierarchical format and has a key structure that provides a canonical identification for each element of the hierarchy. In this article, we exploit these properties to develop an archiving technique that is both efficient in its use of space and preserves the continuity of elements through versions of the database, something that is not provided by traditional minimum-edit-distance diff approaches. The approach also uses timestamps: all versions of the data are merged into one hierarchy in which an element appearing in multiple versions is stored only once, along with a timestamp. By identifying the semantic continuity of elements and merging them into one data structure, our technique is capable of providing meaningful change descriptions, and the archive allows us to easily answer certain temporal queries, such as retrieving any specific version from the archive and finding the history of an element. This is in contrast with approaches that store a sequence of deltas, where such operations may require undoing a large number of changes or significant reasoning with the deltas. A suite of experiments also demonstrates that our archive does not incur any significant space overhead compared with diff-based approaches. Another useful property of our approach is that we use XML to represent hierarchical data, and the resulting archive is also in XML. Hence, XML tools can be applied directly to our archive. In particular, we apply an XML compressor to our archive, and our experiments show that our compressed archive outperforms compressed diff-based repositories in space efficiency.
We also show how we can extend our archiving tool to an external memory archiver for higher scalability and describe various index structures that can further improve the efficiency of some temporal queries on our archive.
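As an illustration only, the merging idea described above can be sketched in a few lines of Python. This is not the authors' implementation: the names `Node`, `merge`, `retrieve`, and `history` are hypothetical, each version is assumed to be a nested dictionary whose dictionary keys play the role of element keys, and timestamps are kept as plain sets of version numbers rather than in any compact encoding. The sample records are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One keyed element of the merged archive (hypothetical structure)."""
    versions: set = field(default_factory=set)    # versions in which the element exists
    children: dict = field(default_factory=dict)  # element key -> child Node
    values: dict = field(default_factory=dict)    # leaf value -> versions holding that value


def merge(node, data, version):
    """Merge one version (a nested dict keyed by element key) into the archive."""
    node.versions.add(version)
    if isinstance(data, dict):
        for key, sub in data.items():
            # An element with the same key path is stored once and only
            # gains a new timestamp; unseen keys create new nodes.
            merge(node.children.setdefault(key, Node()), sub, version)
    else:
        node.values.setdefault(data, set()).add(version)


def retrieve(node, version):
    """Rebuild the requested version from the merged archive."""
    for value, versions in node.values.items():
        if version in versions:
            return value
    return {key: retrieve(child, version)
            for key, child in node.children.items()
            if version in child.versions}


def history(node, path):
    """Return {value: sorted versions} for the leaf element at a key path."""
    for key in path:
        node = node.children[key]
    return {value: sorted(vs) for value, vs in node.values.items()}


# Invented sample data: two versions of a tiny keyed database.
v1 = {"entry:1": {"name": "alpha", "site": "A1"}}
v2 = {"entry:1": {"name": "alpha", "site": "A2"},
      "entry:2": {"name": "beta"}}

archive = Node()
merge(archive, v1, 1)
merge(archive, v2, 2)
```

With this sketch, `retrieve(archive, 1)` rebuilds the first version in a single traversal, and `history(archive, ["entry:1", "site"])` lists which versions held each value of that element; neither operation replays a chain of deltas.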