
LH*—a scalable, distributed data structure

Published: 01 December 1996

Abstract

We present a scalable distributed data structure called LH*. LH* generalizes Linear Hashing (LH) to distributed RAM and disk files. An LH* file can be created from records with primary keys, or objects with OIDs, provided by any number of distributed and autonomous clients. It does not require a central directory, and grows gracefully, through splits of one bucket at a time, to virtually any number of servers. The number of messages per random insertion is one in general, and three in the worst case, regardless of the file size. The number of messages per key search is two in general, and four in the worst case. The file supports parallel operations, e.g., hash joins and scans. Performing a parallel operation on a file of M buckets costs at most 2M + 1 messages, and between 1 and O(log2 M) rounds of messages.
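
As a concrete illustration of the addressing behind these costs, a client computes a bucket address from its image of the file, a level i' and a split pointer n'. The following sketch is not taken from the paper; the hash family h_i(C) = C mod 2^i, the single initial bucket, and the function names are assumptions made for the example.

    # Illustrative sketch of client-side LH* addressing (Python).
    def h(i, key):
        return key % (2 ** i)            # h_i(C) = C mod 2**i, one initial bucket

    def client_address(key, i_prime, n_prime):
        # i_prime, n_prime: the client's possibly outdated image of the
        # file level and split pointer.
        a = h(i_prime, key)
        if a < n_prime:                  # in the image, bucket a was already split
            a = h(i_prime + 1, key)
        return a                         # the request is sent to server a

For example, with the image (i' = 3, n' = 2), key 42 is sent to bucket h_3(42) = 2, while key 17 falls below the split pointer and is sent to bucket h_4(17) = 1. The contacted server re-verifies the address and may forward the request, which is what bounds the message counts quoted above.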

We first describe the basic LH* scheme, where a coordinator site manages bucket splits and splits a bucket every time a collision occurs. We show that the average load factor of an LH* file is 65%–70%, regardless of file size and bucket capacity. We then enhance the scheme with load control, performed at no additional message cost. The average load factor then increases to 80%–95%. These values are about the same as for LH, but the load factor of LH* varies more.
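
For illustration, the effect of one such split on the addressing state can be sketched as follows. The sketch is a single-process simplification: buckets is a hypothetical in-memory dict standing in for the distributed servers, h is the hash family from the previous sketch, and in LH* itself the coordinator only triggers the split while the record movement happens between the two servers involved.

    # Illustrative sketch of one LH(*)-style split of bucket n at file level i.
    def h(i, key):
        return key % (2 ** i)

    def split(buckets, n, i):
        new_addr = n + 2 ** i                  # address of the newly appended bucket
        stay, move = [], []
        for key in buckets[n]:
            (move if h(i + 1, key) == new_addr else stay).append(key)
        buckets[n] = stay                      # about half the keys stay put
        buckets[new_addr] = move               # the rest go to the new bucket
        n += 1                                 # advance the split pointer
        if n == 2 ** i:                        # a full round of splits completed:
            n, i = 0, i + 1                    # the file level grows by one
        return n, i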

We next define LH* schemes without a coordinator. We show that insert and search costs are the same as for the basic scheme. The splitting cost decreases on average, but becomes more variable, as cascading splits are needed to prevent file overload. Finally, we briefly describe two variants of the splitting policy, using parallel splits and presplitting, that should enhance performance for high-performance applications.

Altogether, we show that LH* files can efficiently scale to sizes that are orders of magnitude larger than single-site files. LH* files that reside in main memory may also be much faster than single-site disk files. Finally, LH* files can be more efficient than any distributed file with a centralized directory, or than a static parallel or distributed hash file.

Reviews

Adam Drozdek

Linear hashing (LH) is a directoryless, dynamic hashing technique developed by Litwin. LH* is a generalization of LH that allows for hashing in a distributed environment. A file stored at different sites can be shared by different clients. The client's image of the file can differ from the file itself; in particular, the local pointer n', indicating the next bucket to be split, may differ from the actual pointer n. Thus, the address of a key is calculated by a client and then by a server. This may lead to forwarding the key to another server, after which the client's file image is adjusted. A search needs between two and four messages, and an insertion needs between one and three messages, not counting the messages needed to manage a split, which can be performed asynchronously. Extensive simulations indicate that, for a system using buckets of at least 250 keys, the average number of messages per search is 2.01 and the average number of messages per insert is below 1.05, that is, almost ideal. The number of addressing errors never exceeds log2(number of buckets), and less active clients, which are more prone to making addressing errors, make these errors only about 10 percent more often than others. Moreover, the average load factor is 65 to 70 percent, and when load control is used, the factor increases to between 80 and 95 percent. The only centralized component of the system is a split coordinator that manages splits and merges of buckets, but the coordinator is not necessary: the authors discuss a variant of LH* without a split coordinator, in which splits are accomplished by cascading them. Another variant concerns concurrent splits, in which a key component is a committed split pointer indicating that a split is finished and can thus be committed. The paper is well and clearly written, and it includes helpful examples and diagrams.
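
The forwarding and image-adjustment steps mentioned in the review can be sketched roughly as follows. This is a paraphrase rather than the paper's code; it assumes that every bucket a stores its own level j, that h is the hash family h_i(C) = C mod 2^i, and that the image adjustment message (IAM) carries the address and level of the forwarding server.

    # Rough sketch of the server-side address check and the client image adjustment.
    def h(i, key):
        return key % (2 ** i)

    def server_check(key, a, j):
        """Bucket a, of level j, re-verifies the address a computed by the client."""
        a1 = h(j, key)
        if a1 != a:
            a2 = h(j - 1, key)
            if a < a2 < a1:
                a1 = a2
            return ("forward", a1)       # at most two forwardings ever occur
        return ("accept", a)

    def adjust_image(a, j, i_prime, n_prime):
        """Client updates its image (i', n') from an IAM sent by bucket a of level j."""
        if j > i_prime:
            i_prime = j - 1
            n_prime = a + 1
            if n_prime >= 2 ** i_prime:
                n_prime, i_prime = 0, i_prime + 1
        return i_prime, n_prime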
