ABSTRACT
Nutch is an open source search engine that is gaining popularity in the commercial world. The Nutch architecture lends itself to a wide range of parallelization techniques. Multiple backend servers can be used both to partition the corpus of search data, thus increasing the rate of queries serviced, and to increase the size of the search data while preserving the service rate. Alternatively, multiple search engines can operate in parallel, further increasing the query rate. In this paper, we analyze the performance and scalability of various configurations of Nutch. The configurations were implemented as part of the Commercial Scale Out project at IBM Research and were used to investigate the applicability of scale-out architectures in commercial environments. We conclude that Nutch is highly scalable, although the configurations differ in their performance characteristics.
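The partitioned configuration described above can be illustrated with a minimal sketch. It is not Nutch code and does not use the Nutch or Lucene APIs: the toy corpus, the term-count scoring, and the names `partition`, `search_segment`, and `search` are all assumptions made for illustration. The sketch assumes a front end that sends each query to every backend segment in parallel and merges the partial top-k lists into one ranked result.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy corpus (doc id -> text); not data from the paper.
DOCS = {
    1: "nutch open source search engine",
    2: "scale out architectures in commercial environments",
    3: "search engine scalability and query rate",
    4: "partition the search corpus across backend servers",
}


def partition(docs, n):
    """Split the corpus into n disjoint segments, assigned by doc id modulo n."""
    parts = [{} for _ in range(n)]
    for doc_id, text in docs.items():
        parts[doc_id % n][doc_id] = text
    return parts


def search_segment(segment, query, k):
    """Stand-in for one backend: score each document by query-term occurrences."""
    terms = query.lower().split()
    scored = []
    for doc_id, text in segment.items():
        words = text.lower().split()
        score = sum(words.count(term) for term in terms)
        if score > 0:
            scored.append((score, doc_id))
    return heapq.nlargest(k, scored)


def search(parts, query, k=3):
    """Stand-in for a front end: query all segments in parallel, merge the top k."""
    with ThreadPoolExecutor(max_workers=len(parts)) as pool:
        partials = list(pool.map(lambda seg: search_segment(seg, query, k), parts))
    merged = heapq.nlargest(k, (hit for part in partials for hit in part))
    return [doc_id for _, doc_id in merged]


if __name__ == "__main__":
    print(search(partition(DOCS, 2), "search engine"))
```

Because each segment returns its own top k hits, the merged top k matches what a single unpartitioned index would return under this scoring. Running several independent copies of `search` over the same segments would correspond loosely to the alternative of multiple search engines operating in parallel.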