ABSTRACT
Web archives have been instrumental in the digital preservation of the Web and provide a great opportunity to study the societal past and its evolution. These archives are massive collections, typically on the order of terabytes to petabytes. As a result, search and exploration of archives has been limited, because full-text indexing is both resource-intensive and computationally expensive. We observe that typical access methods to archives, which are navigational and temporal in nature, do not always require full-text indexing. Instead, text surrogates such as anchor texts already go a long way toward providing meaningful solutions and can act as reasonable entry points for exploring Web archives.
In this paper, we present a new approach to searching Web archives based on temporal link graphs and corresponding anchor texts. Departing from traditional informational intents, we show how temporal anchor texts can be effective in answering queries beyond purely navigational intents, such as finding the most central webpages of an entity in a given time period. We propose indexing methods and a temporal retrieval model based on anchor texts. Further, we discuss several interesting search results as well as one experiment in which we demonstrate how such results can be integrated into a data processing workflow to scale up to thousands of pages. In this analysis, we were able to replicate results reported by an offline study, showing that restaurant prices in Germany indeed increased when the Euro was introduced as Europe's currency.
Index Terms
- Exploring Web Archives Through Temporal Anchor Texts