ABSTRACT
This paper describes the Scalable Hyperlink Store (SHS), a distributed in-memory "database" for storing large portions of the web graph. SHS is an enabler for research on structural properties of the web graph as well as on new link-based ranking algorithms. Previous work on specialized hyperlink databases focused on finding efficient compression algorithms for web graphs. By contrast, this work focuses on the systems issues of building such a database. Specifically, it describes how to build a hyperlink database that is fast, scalable, fault-tolerant, and incrementally updateable.