DOI: 10.1145/3091478.3091500
research-article

Exploring Web Archives Through Temporal Anchor Texts

Published: 25 June 2017

ABSTRACT

Web archives have been instrumental in the digital preservation of the Web and offer a great opportunity to study the societal past and its evolution. These archives are massive collections, typically on the order of terabytes to petabytes. As a result, search and exploration of archives have been limited, since full-text indexing is expensive in both storage and computation. We observe that typical access methods to archives, which are navigational and temporal in nature, do not always require a full-text index. Instead, text surrogates such as anchor texts already go a long way toward meaningful solutions and can serve as reasonable entry points for exploring Web archives.

In this paper, we present a new approach to searching Web archives based on temporal link graphs and their corresponding anchor texts. Departing from traditional informational intents, we show that temporal anchor texts can be effective in answering queries beyond purely navigational intents, such as finding the most central webpages of an entity in a given time period. We propose indexing methods and a temporal retrieval model based on anchor texts. Further, we discuss several interesting search results as well as one experiment in which we demonstrate how such results can be integrated into a data processing workflow that scales up to thousands of pages. In this analysis, we were able to replicate results reported by an offline study, showing that restaurant prices in Germany indeed increased when the Euro was introduced as Europe's currency.
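The abstract does not specify the index layout or the retrieval model, so the following is only a minimal sketch of the general idea of searching by anchor text within a time period. Everything in it is an assumption made for illustration: the `Link` record, the `build_index` and `search` helpers, the year-level time granularity, and the scoring rule (counting distinct linking pages) are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """One hyperlink seen in an archived capture (hypothetical record shape)."""
    source: str  # URL of the page that contains the link
    target: str  # URL the link points to
    anchor: str  # anchor text of the link
    year: int    # capture year of the source page


def build_index(links):
    """Map each anchor term to {target: {year: set of linking source URLs}}."""
    index = defaultdict(lambda: defaultdict(lambda: defaultdict(set)))
    for link in links:
        for term in link.anchor.lower().split():
            index[term][link.target][link.year].add(link.source)
    return index


def search(index, query, start_year, end_year):
    """Rank targets by how many distinct sources linked to them with every
    query term (not necessarily in the same link) inside the given period.
    This scoring is an illustrative assumption, not the paper's model."""
    terms = query.lower().split()
    if not terms:
        return []
    scores = {}
    # Only targets reached via the first term can match all terms.
    for target in index.get(terms[0], {}):
        sources = None
        for term in terms:
            by_year = index.get(term, {}).get(target, {})
            in_period = set()
            for year, srcs in by_year.items():
                if start_year <= year <= end_year:
                    in_period |= srcs
            sources = in_period if sources is None else sources & in_period
        if sources:
            scores[target] = len(sources)
    # Highest score first; ties broken alphabetically by target URL.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Made-up sample data for illustration only.
sample_links = [
    Link("a.example/1", "restaurant.example", "Restaurant Menu", 2001),
    Link("b.example/1", "restaurant.example", "restaurant menu", 2002),
    Link("c.example/1", "other.example", "restaurant menu", 2001),
    Link("d.example/1", "restaurant.example", "restaurant menu", 2005),
]
sample_index = build_index(sample_links)
print(search(sample_index, "restaurant menu", 2001, 2002))
```

Restricting the query to a period such as 2001 to 2002 returns only pages that were linked with those terms during that period, which is the kind of time-scoped entry point the abstract describes; a real system would need a more careful notion of link validity intervals and ranking.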


Published in

WebSci '17: Proceedings of the 2017 ACM on Web Science Conference
June 2017, 438 pages
ISBN: 9781450348966
DOI: 10.1145/3091478

Copyright © 2017 ACM

Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

WebSci '17 paper acceptance rate: 30 of 85 submissions, 35%
Overall acceptance rate: 218 of 875 submissions, 25%
