Abstract
The subject of this article is differential compression, the algorithmic task of finding common strings between versions of data and using them to encode one version compactly by describing it as a set of changes from its companion. A main goal of this work is to present new differencing algorithms that (i) operate at a fine granularity (the atomic unit of change), (ii) make no assumptions about the format or alignment of input data, and (iii) in practice use linear time, use constant space, and give good compression. We present new algorithms, which do not always compress optimally but use considerably less time or space than existing algorithms. One new algorithm runs in O(n) time and O(1) space in the worst case (where each unit of space contains ⌈log n⌉ bits), as compared to algorithms that run in O(n) time and O(n) space or in O(n²) time and O(1) space. We introduce two new techniques for differential compression and apply these to give additional algorithms that improve compression and time performance. We experimentally explore the properties of our algorithms by running them on actual versioned data. Finally, we present theoretical results that limit the compression power of differencing algorithms that are restricted to making only a single pass over the data.
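To make the idea of encoding one version as a set of changes from its companion concrete, the following is a minimal sketch of a greedy differencing scheme. It is not one of the article's algorithms: it indexes the reference in a hash table, so it uses O(n) space rather than the constant space described above. The command names (`copy`, `add`), the function names, and the `SEED` match length are assumptions chosen for illustration.

```python
from typing import List, Tuple, Union

# A delta is a list of commands:
#   ("copy", offset_in_reference, length)  reuse bytes from the reference
#   ("add", literal_bytes)                 insert bytes not found in the reference
Command = Union[Tuple[str, int, int], Tuple[str, bytes]]

SEED = 4  # hypothetical minimum match length; not a value from the article


def make_delta(reference: bytes, version: bytes, seed: int = SEED) -> List[Command]:
    """Encode `version` as copy/add commands relative to `reference`."""
    # Map each seed-length substring of the reference to its first offset.
    # This table takes O(n) space (an assumption of this sketch only).
    index = {}
    for i in range(len(reference) - seed + 1):
        index.setdefault(reference[i:i + seed], i)

    delta: List[Command] = []
    literal = bytearray()  # unmatched bytes waiting to be emitted as an add
    pos = 0
    while pos < len(version):
        start = index.get(version[pos:pos + seed])
        if start is None:
            # No common string starts here; treat the byte as new data.
            literal.append(version[pos])
            pos += 1
            continue
        # Extend the match forward for as long as both strings agree.
        length = seed
        while (start + length < len(reference)
               and pos + length < len(version)
               and reference[start + length] == version[pos + length]):
            length += 1
        if literal:
            delta.append(("add", bytes(literal)))
            literal.clear()
        delta.append(("copy", start, length))
        pos += length
    if literal:
        delta.append(("add", bytes(literal)))
    return delta


def apply_delta(reference: bytes, delta: List[Command]) -> bytes:
    """Rebuild the version from the reference and a delta."""
    out = bytearray()
    for cmd in delta:
        if cmd[0] == "copy":
            _, start, length = cmd
            out += reference[start:start + length]
        else:
            out += cmd[1]
    return bytes(out)
```

Calling `apply_delta(reference, make_delta(reference, version))` should reproduce `version` exactly. The delta is compact when the two inputs share long strings, because each shared run costs one short `copy` command instead of its full contents.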