DOI: 10.1145/1557914.1557921
research-article

Bringing your dead links back to life: a comprehensive approach and lessons learned

Published: 29 June 2009

ABSTRACT

This paper presents an experimental study of the automatic correction of broken (dead) Web links, focusing in particular on links broken by the relocation of Web pages. Our first contribution is an algorithm that incorporates a comprehensive set of heuristics, some of which are novel, in a single unified framework. Our second contribution is a relatively large-scale experiment; analysis of its results revealed the characteristics of the problem of finding moved Web pages. We demonstrated empirically that the problem of searching for moved pages differs from typical information retrieval problems. First, it is impossible to identify the final destination until the page is moved, so the index-server approach is not necessarily effective. Second, the likely location of the new address is heavily biased, so crawler-based solutions can be implemented effectively without searching the entire Web. We analyzed the experimental results in detail to show how important each heuristic is in real Web settings, and conducted statistical analyses to show that our algorithm succeeds in correctly finding new links for more than 70% of broken links at the 95% confidence level.
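The closing claim of the abstract (a success rate above 70% at the 95% confidence level) can be illustrated with a one-sided lower confidence bound on a binomial proportion. The abstract does not say which statistical procedure the authors used, so the sketch below assumes a Wilson score bound purely for illustration; the function name `wilson_lower_bound` and the sample counts are hypothetical and are not results from the paper.

```python
import math


def wilson_lower_bound(successes: int, trials: int, z: float = 1.6449) -> float:
    """One-sided lower Wilson score bound for a binomial proportion.

    z = 1.6449 is the standard normal quantile for a one-sided 95% bound.
    This is an assumed procedure, not necessarily the one used in the paper.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    p_hat = successes / trials
    z2 = z * z
    denominator = 1 + z2 / trials
    centre = p_hat + z2 / (2 * trials)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / trials + z2 / (4 * trials * trials))
    # Clamp at zero to absorb floating-point rounding when successes == 0.
    return max(0.0, (centre - margin) / denominator)


if __name__ == "__main__":
    # Hypothetical counts, invented for this example only.
    fixed_links, broken_links = 80, 100
    bound = wilson_lower_bound(fixed_links, broken_links)
    print(f"lower 95% bound on success rate: {bound:.3f}")
    print("supports a 'more than 70%' claim:", bound > 0.70)
```

A claim of the form "more than 70% at the 95% confidence level" holds under this reading when the lower bound, not just the observed rate, exceeds 0.70; with a small sample the same observed rate may fail that check.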


    • Published in

      HT '09: Proceedings of the 20th ACM conference on Hypertext and hypermedia
      June 2009
      410 pages
      ISBN:9781605584867
      DOI:10.1145/1557914

      Copyright © 2009 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      Overall Acceptance Rate: 378 of 1,158 submissions, 33%
