Abstract
Evaluating a query can involve manipulation of large volumes of temporary data. When the volume of data becomes too great, activities such as joins and sorting must use disk, and cost minimisation involves complex trade-offs. In this paper, we explore the effect of compression on the cost of external sorting. Reduction in the volume of data potentially allows costs to be reduced — through reductions in disk traffic and numbers of temporary files — but on-the-fly compression can be slow and many compression methods do not allow random access to individual records. We investigate a range of compression techniques for this problem, and develop successful methods based on common letter sequences. Our experiments show that, for a given memory limit, the overheads of compression outweigh the benefits for smaller data volumes, but for large files compression can yield substantial gains, of one-third of costs in the best case tested. Even when the data is stored uncompressed, our results show that incorporation of compression can significantly accelerate query processing.
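The general idea of compressing temporary data during an external sort can be sketched as follows. This is a minimal illustration, not the method developed in the paper: it uses Python's standard `zlib` as a stand-in for the paper's techniques based on common letter sequences, and the names `write_run`, `read_run`, `external_sort`, and `batch_size` are hypothetical. Each record is compressed separately so that it can be read back individually during the merge. With a general-purpose compressor such as `zlib`, compressing short records one at a time will often not shrink them at all, which is one reason a purpose-built scheme may be needed.

```python
import heapq
import os
import tempfile
import zlib


def write_run(records, directory):
    """Sort one in-memory batch and write it to disk as a compressed run."""
    fd, path = tempfile.mkstemp(dir=directory, suffix=".run")
    with os.fdopen(fd, "wb") as f:
        for rec in sorted(records):
            # Assumption: zlib stands in for the paper's compression scheme.
            # Each record is compressed on its own and framed by a 4-byte
            # length prefix so it can be decoded independently.
            blob = zlib.compress(rec.encode("utf-8"))
            f.write(len(blob).to_bytes(4, "big"))
            f.write(blob)
    return path


def read_run(path):
    """Yield the records of one run in sorted order, decompressing lazily."""
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if not header:
                break
            blob = f.read(int.from_bytes(header, "big"))
            yield zlib.decompress(blob).decode("utf-8")


def external_sort(records, batch_size):
    """Sort strings while holding at most batch_size input records in memory."""
    with tempfile.TemporaryDirectory() as directory:
        runs, batch = [], []
        for rec in records:
            batch.append(rec)
            if len(batch) >= batch_size:
                runs.append(write_run(batch, directory))
                batch = []
        if batch:
            runs.append(write_run(batch, directory))
        # Merge all runs in a single pass; heapq.merge reads them lazily.
        yield from heapq.merge(*(read_run(p) for p in runs))
```

A call such as `list(external_sort(words, batch_size=1000))` returns the same ordering as `sorted(words)`. Whether the compressed version is faster depends on the balance the abstract describes: the CPU time spent compressing and decompressing against the disk traffic saved.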
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Yiannis, J., Zobel, J. (2003). External Sorting with On-the-Fly Compression. In: James, A., Younas, M., Lings, B. (eds) New Horizons in Information Management. BNCOD 2003. Lecture Notes in Computer Science, vol 2712. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45073-4_10
DOI: https://doi.org/10.1007/3-540-45073-4_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-40536-8
Online ISBN: 978-3-540-45073-3
eBook Packages: Springer Book Archive