research-article

The scalable hyperlink store

Author:
Marc Najork

Microsoft Research, Mountain View, CA, USA


HT '09: Proceedings of the 20th ACM conference on Hypertext and hypermedia, June 2009, Pages 89–98, https://doi.org/10.1145/1557914.1557933

Published: 29 June 2009


ABSTRACT

This paper describes the Scalable Hyperlink Store, a distributed in-memory "database" for storing large portions of the web graph. SHS is an enabler for research on structural properties of the web graph as well as new link-based ranking algorithms. Previous work on specialized hyperlink databases focused on finding efficient compression algorithms for web graphs. By contrast, this work focuses on the systems issues of building such a database. Specifically, it describes how to build a hyperlink database that is fast, scalable, fault-tolerant, and incrementally updateable.


Index Terms

The scalable hyperlink store
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Parallel and distributed DBMSs

Published in
HT '09: Proceedings of the 20th ACM conference on Hypertext and hypermedia
June 2009
410 pages
ISBN: 9781605584867
DOI: 10.1145/1557914
General Chairs: Ciro Cattuto (Institute for Scientific Interchange Foundation, Italy), Giancarlo Ruffo (University of Torino, Italy)
Program Chair: Filippo Menczer (Indiana University, USA)
Copyright © 2009 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
hyperlink database
scalability
web graph

Acceptance Rates
Overall Acceptance Rate: 378 of 1,158 submissions, 33%