Skip to main content

External Sorting with On-the-Fly Compression

  • Conference paper
  • First Online:
New Horizons in Information Management (BNCOD 2003)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 2712))

Included in the following conference series:

Abstract

Evaluating a query can involve manipulation of large volumes of temporary data. When the volume of data becomes too great, activities such as joins and sorting must use disk, and cost minimisation involves complex trade-offs. In this paper, we explore the effect of compression on the cost of external sorting. Reduction in the volume of data potentially allows costs to be reduced — through reductions in disk traffic and numbers of temporary files — but on-the-fly compression can be slow and many compression methods do not allow random access to individual records. We investigate a range of compression techniques for this problem, and develop successful methods based on common letter sequences. Our experiments show that, for a given memory limit, the overheads of compression outweigh the benefits for smaller data volumes, but for large files compression can yield substantial gains, of one-third of costs in the best case tested. Even when the data is stored uncompressed, our results show that incorporation of compression can significantly accelerate query processing.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Zobel, J., Williams, H.E., Kimberley, S.: Trends in retrieval system performance. In Edwards, J., ed.: Proceedings of the Australasian Computer Science Conference, Canberra, Australia (2000) 241–248

    Google Scholar 

  2. Boncz, P.A., Manegold, S., Kersten, M.L.: Database architecture optimized for the new bottleneck: Memory access. In: The VLDB Journal. (1999) 54–65

    Google Scholar 

  3. Graefe, G.: Query evaluation techniques for large databases. ACM Computing Surveys 25 (1993) 152–153

    Article  Google Scholar 

  4. Chen, Z., Gehrke, J., Korn, F.: Query optimization in compressed database systems. In: Proceedings of ACM SIGMOD international conference on Management of Data, Santa Barbara, California, USA (2001) 271–282

    Google Scholar 

  5. Goldstein, J., Ramakrishnan, R., Shaft, U.: Compressing relations and indexes. In: Proceedings of the Fourteenth International Conference on Data Engineering, Orlando, Florida, USA, IEEE Computer Society (1998) 370–379

    Google Scholar 

  6. Graefe, G., Shapiro, L.: Data compression and database performance. In ACM/IEEE-CS Symposium On Applied Computing (1991) 22–27

    Google Scholar 

  7. Ng, W.K., Ravishankar, C.V.: Relational database compression using augmented vector quantization. In: Proceedings of the Eleventh International Conference on Data Engineering, Taipei, Taiwan, IEEE Computer Society (1995) 540–549

    Chapter  Google Scholar 

  8. Ray, G., Harista, J.R., Seshadri, S.: Database compression: A performance enhancement tool. In: Proceedings of the 7th International Conference on Management of Data (COMAD), Pune, India (1995)

    Google Scholar 

  9. Westman, T., Kossmann, D., Helmer, S., Moerkotte, G.: The implementation and performance of compressed databases. ACM SIGMOD Record 29 (2000)

    Google Scholar 

  10. Moffat, A., Zobel, J., Sharman, N.: Text compression for dynamic document databases. IEEE Transactions on Knowledge and Data Engineering 9 (1997) 302–313

    Article  Google Scholar 

  11. Larmore, L.L., Hirschberg, D.S.: A fast algorithm for optimal length-limited Huff-man codes. Journal of the ACM 37 (1990) 464–473

    Article  MATH  MathSciNet  Google Scholar 

  12. Witten, I.H., Moffat, A., Bell, T.C.: Managing Gigabytes: Compressing and Indexing Documents and Images. Second edn. Morgan Kaufmann, San Francisco, California (1999)

    Google Scholar 

  13. Bell, T.C., Moffat, A., Nevill-Manning, C.G., Witten, I.H., Zobel., J.: Data compression in full-text retrieval systems. Journal of the American Society for Information Science 44 (1993) 508–531

    Article  Google Scholar 

  14. Scholer, F., Williams, H.E., Yiannis, J., Zobel, J.: Compression of inverted indexes for fast query evaluation. In: Proceedings of the 25th annual international ACM SIGIR conference on research and development in information retrieval. (2002) 222–229

    Google Scholar 

  15. Williams, H.E., Zobel, J.: Compressing integers for fast file access. Computer Journal 42 (1999) 193–201

    Article  Google Scholar 

  16. Zobel, J., Moffat, A.: Adding compression to a full-text retrieval system. Software Practice and Experience 25 (1995) 891–903

    Article  Google Scholar 

  17. Roth, M., Horn, S.V.: Database compression. ACM SIGMOD Record 22 (1993) 31–39

    Article  Google Scholar 

  18. Garcia-Molina, H., Ullman, J.D., Widom, J.: Database Systems Implementation. First edn. Prentice Hall (2000)

    Google Scholar 

  19. Ramakrishnan, R., Gehrke, J.: Database Management Systems. Second edn. McGraw-Hill (2000)

    Google Scholar 

  20. Knuth, D.E.: The Art of Computer Programming, Volume 3: Sorting and Searching, Second Edition. Addison-Wesley, Massachusetts (1973)

    Google Scholar 

  21. Cannane, A., Williams, H.: A general-purpose compression scheme for large collections. ACM Transactions on Information Systems 20 (2002) 329–355

    Article  Google Scholar 

  22. Moffat, A., Turpin, A.: Compression and Coding Algorithms. First edn. Kluwer (2002)

    Google Scholar 

  23. Ramakrishna, M.V., Zobel, J.: Performance in practice of string hashing functions. In: Proceedings of the Databases Systems for Advanced Applications Symposium, Melbourne, Australia (1997) 215–223

    Google Scholar 

  24. Sinha, R., Zobel, J.: Efficient trie-based sorting of large sets of strings. In Oudshoorn, M., ed.: Proceedings of the Australasian Computer Science Conference, Adelaide, Australia (2003) 11–18

    Google Scholar 

  25. Sinha, R., Zobel, J.: Cache-conscious sorting of large sets of strings with dynamic tries. In Ladner, R., ed.: Proceedings of the ALENEX Workshop on Algorithm Engineering and Experiments, Baltimore, Maryland (2003)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Yiannis, J., Zobel, J. (2003). External Sorting with On-the-Fly Compression. In: James, A., Younas, M., Lings, B. (eds) New Horizons in Information Management. BNCOD 2003. Lecture Notes in Computer Science, vol 2712. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45073-4_10

Download citation

  • DOI: https://doi.org/10.1007/3-540-45073-4_10

  • Published:

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-40536-8

  • Online ISBN: 978-3-540-45073-3

  • eBook Packages: Springer Book Archive

Publish with us

Policies and ethics