DOI: 10.1145/1274971.1274975
Article

Scalability of the Nutch search engine

Published: 17 June 2007

ABSTRACT

Nutch is an open-source search engine that is gaining popularity in the commercial world. The Nutch architecture lends itself to a wide range of parallelization techniques. Multiple backend servers can be used both to partition the corpus of search data, thus increasing the rate of queries serviced, and to increase the size of the search data while preserving the service rate. Alternatively, multiple search engines can operate in parallel, further increasing the query rate. In this paper, we analyze the performance and scalability of various configurations of Nutch. The configurations were implemented as part of the Commercial Scale Out project at IBM Research and were used to investigate the applicability of scale-out architectures in commercial environments. We conclude that Nutch is highly scalable, although the different configurations exhibit different performance characteristics.
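The two parallelization approaches named in the abstract can be illustrated with a minimal sketch. It is not Nutch code and does not use the Nutch or Lucene APIs: the `Backend` and `FrontEnd` classes, the `partition` helper, the toy term-count scoring, and the four-document corpus are all hypothetical names and data introduced here for illustration. The sketch assumes that each backend holds one partition of the corpus, that a front end sends every query to all backends and merges their partial results, and that replicated front ends are selected round-robin.

```python
import heapq
from itertools import cycle


class Backend:
    """Hypothetical backend server holding one partition of the corpus."""

    def __init__(self, docs):
        self.docs = docs  # maps doc_id -> document text

    def search(self, term, k):
        # Toy scoring (an assumption): count occurrences of the term.
        term = term.lower()
        hits = [(text.lower().split().count(term), doc_id)
                for doc_id, text in self.docs.items()]
        return heapq.nlargest(k, [h for h in hits if h[0] > 0])


class FrontEnd:
    """Hypothetical front end: query every partition, then merge the top k."""

    def __init__(self, backends):
        self.backends = backends

    def search(self, term, k=3):
        partial = [hit for b in self.backends for hit in b.search(term, k)]
        return [doc_id for _, doc_id in heapq.nlargest(k, partial)]


def partition(corpus, n):
    """Split a corpus dict into n partitions, round-robin by insertion order."""
    parts = [{} for _ in range(n)]
    for i, (doc_id, text) in enumerate(corpus.items()):
        parts[i % n][doc_id] = text
    return parts


corpus = {
    "d1": "nutch search engine",
    "d2": "lucene index search search",
    "d3": "blade server cluster",
    "d4": "search search search scale",
}

# Approach 1: partition the corpus across two backends.
backends = [Backend(p) for p in partition(corpus, 2)]

# Approach 2: run two front ends in parallel and alternate between them.
front_ends = cycle([FrontEnd(backends), FrontEnd(backends)])

results = next(front_ends).search("search", k=2)
single = FrontEnd([Backend(corpus)]).search("search", k=2)
print(results)
```

In this toy setting the partitioned configuration returns the same ranking as a single backend holding the whole corpus, because each backend returns its own top k hits before the merge. The sketch says nothing about performance, which is the subject of the paper itself.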


            Published in

            ICS '07: Proceedings of the 21st annual international conference on Supercomputing
            June 2007
            315 pages
            ISBN: 9781595937681
            DOI: 10.1145/1274971

            Copyright © 2007 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

            Publisher

            Association for Computing Machinery

            New York, NY, United States



            Acceptance Rates

            Overall Acceptance Rate: 584 of 2,055 submissions, 28%
