Compactly encoding unstructured inputs with differential compression

Published: 01 May 2002
Abstract

The subject of this article is differential compression, the algorithmic task of finding common strings between versions of data and using them to encode one version compactly by describing it as a set of changes from its companion. A main goal of this work is to present new differencing algorithms that (i) operate at a fine granularity (the atomic unit of change), (ii) make no assumptions about the format or alignment of input data, and (iii) in practice use linear time, use constant space, and give good compression. We present new algorithms, which do not always compress optimally but use considerably less time or space than existing algorithms. One new algorithm runs in O(n) time and O(1) space in the worst case (where each unit of space contains ⌈log n⌉ bits), as compared to algorithms that run in O(n) time and O(n) space or in O(n^2) time and O(1) space. We introduce two new techniques for differential compression and apply these to give additional algorithms that improve compression and time performance. We experimentally explore the properties of our algorithms by running them on actual versioned data. Finally, we present theoretical results that limit the compression power of differencing algorithms that are restricted to making only a single pass over the data.
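To make the task concrete, the following is a minimal Python sketch of differential compression in general, not any of the article's algorithms. It encodes a version as "copy" commands (ranges taken from the reference) and "add" commands (literal bytes), using a hash table over fixed-length seeds of the reference, so it belongs to the linear-space family that the abstract uses as a point of comparison. The names `encode_delta`, `apply_delta`, and `SEED_LEN`, as well as the command format, are assumptions made for illustration.

```python
from typing import List, Tuple, Union

# A delta is a list of commands: ("copy", offset, length) or ("add", literal_bytes).
Command = Union[Tuple[str, int, int], Tuple[str, bytes]]

SEED_LEN = 4  # hypothetical minimum match length, chosen only for illustration


def encode_delta(reference: bytes, version: bytes, seed_len: int = SEED_LEN) -> List[Command]:
    """Greedy differencing sketch: O(n) extra space for the seed index."""
    # Index the first occurrence of every fixed-length seed in the reference.
    index = {}
    for i in range(len(reference) - seed_len + 1):
        index.setdefault(reference[i:i + seed_len], i)

    delta: List[Command] = []
    literal = bytearray()  # unmatched bytes waiting to be emitted as an "add"
    pos = 0
    while pos < len(version):
        start = index.get(version[pos:pos + seed_len])
        if start is None:
            # No common string starts here; treat the byte as new data.
            literal.append(version[pos])
            pos += 1
            continue
        # Extend the match forward for as long as both inputs agree.
        length = seed_len
        while (start + length < len(reference)
               and pos + length < len(version)
               and reference[start + length] == version[pos + length]):
            length += 1
        if literal:
            delta.append(("add", bytes(literal)))
            literal.clear()
        delta.append(("copy", start, length))
        pos += length
    if literal:
        delta.append(("add", bytes(literal)))
    return delta


def apply_delta(reference: bytes, delta: List[Command]) -> bytes:
    """Rebuild the version from the reference and the delta commands."""
    out = bytearray()
    for cmd in delta:
        if cmd[0] == "copy":
            _, offset, length = cmd
            out += reference[offset:offset + length]
        else:
            out += cmd[1]
    return bytes(out)


reference = b"the quick brown fox jumps over the lazy dog"
version = b"the quick red fox jumps over the lazy dog!"
delta = encode_delta(reference, version)
restored = apply_delta(reference, delta)
```

Because the sketch works on raw bytes and matches at any offset, it assumes nothing about the format or alignment of the inputs. Its seed index, however, grows with the size of the reference, which is the cost that a constant-space algorithm avoids.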
