ABSTRACT
This paper presents an experimental study of the automatic correction of broken (dead) Web links, focusing in particular on links broken by the relocation of Web pages. Our first contribution is an algorithm that incorporates a comprehensive set of heuristics, some of which are novel, in a single unified framework. Our second contribution is a relatively large-scale experiment; analysis of its results revealed the characteristics of the problem of finding moved Web pages. We demonstrated empirically that the problem of searching for moved pages is different from typical information retrieval problems. First, it is impossible to identify the final destination until the page is moved, so the index-server approach is not necessarily effective. Second, there is a strong bias in where the new address is likely to be, so crawler-based solutions can be implemented effectively, avoiding the need to search the entire Web. We analyzed the experimental results in detail to show how important each heuristic is in real Web settings, and conducted statistical analyses to show that our algorithm succeeds in correctly finding new links for more than 70% of broken links at the 95% confidence level.
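The locality observation in the abstract (new addresses are strongly biased toward certain places, so a crawler need not search the entire Web) can be illustrated with a small sketch. It assumes, purely for illustration, that a crawler is seeded with the directories above the broken URL on the same site. The abstract does not say which heuristics the algorithm actually uses, so the function name `ancestor_urls` and this seeding strategy are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def ancestor_urls(broken_url: str) -> list[str]:
    """Return the directory URLs above broken_url, nearest first.

    Hypothetical helper: one plausible way to pick crawl seeds close to
    the old address. It is not taken from the paper.
    """
    parts = urlsplit(broken_url)
    # Path segments without empty strings, e.g. "/a/b/c.html" -> ["a", "b", "c.html"].
    segments = [s for s in parts.path.split("/") if s]
    seeds = []
    # Drop one trailing segment at a time until only the site root remains.
    for depth in range(len(segments) - 1, -1, -1):
        path = "/" + "/".join(segments[:depth])
        if not path.endswith("/"):
            path += "/"
        # Query string and fragment are discarded; seeds are plain directories.
        seeds.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return seeds
```

A crawler following this sketch would fetch each seed in order and look for a page resembling the lost one, stopping at the first match. How candidate pages are compared with the original is left open here.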
Index Terms
- Bringing your dead links back to life: a comprehensive approach and lessons learned