ABSTRACT
Event-based collections are often started with a web search, but the search results you find on Day 1 may not be the same as those you find on Day 7. In this paper, we consider collections that originate from extracting URIs (Uniform Resource Identifiers) from Search Engine Result Pages (SERPs). Specifically, we seek to provide insight into the retrievability of URIs of news stories found on Google, and to answer two main questions. First, can one "refind" the same URI of a news story (for the same query) on Google after a given time? Second, what is the probability of finding a story on Google over a given period of time? To answer these questions, we issued seven queries to Google every day for over seven months (2017-05-25 to 2018-01-12) and collected links from the first five SERPs to generate seven collections, one for each query. The queries represent public interest stories: "healthcare bill," "manchester bombing," "london terrorism," "trump russia," "travel ban," "hurricane harvey," and "hurricane irma." We tracked each URI in all collections over time to estimate the discoverability of URIs from the first five SERPs. Our results showed that the daily average rate at which stories were replaced on the default Google SERP ranged from 0.21 to 0.54, and the weekly rate from 0.39 to 0.79, suggesting that older stories are quickly replaced by newer ones. The probability of finding the same URI of a news story one day after its initial appearance on the SERP ranged from 0.34 to 0.44. After a week, the probability of finding the same news story diminished rapidly to between 0.01 and 0.11. In addition to reporting these probabilities, we provide two predictive models for estimating the probability of finding the URI of an arbitrary news story on SERPs as a function of time. The web archiving community considers link rot and content drift important reasons for collection building. Similarly, our findings suggest that, because the URIs of news stories are difficult to retrieve from Google, collection building that originates from search engines should begin as soon as possible in order to capture the first stages of events, and should persist in order to capture the evolution of the events, because it becomes more difficult to find the same news stories with the same queries on Google as time progresses.
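As a rough illustration of the two quantities reported above, the following Python sketch computes a replacement rate between two SERP snapshots and an empirical probability of refinding a URI a fixed number of days after its first appearance. The function names, the data layout (one set of URIs per day for a single query), and the exact definitions are assumptions made for illustration; the abstract does not specify how the paper computes these values, and the URIs are toy examples.

```python
from datetime import date, timedelta


def replacement_rate(earlier: set[str], later: set[str]) -> float:
    """Fraction of URIs in the earlier snapshot that are absent from the later one.

    Assumed definition, for illustration only.
    """
    if not earlier:
        return 0.0
    return len(earlier - later) / len(earlier)


def refind_probability(snapshots: dict[date, set[str]], lag_days: int) -> float:
    """Share of URIs that are present again lag_days after their first appearance.

    URIs whose target day has no snapshot are left out of the estimate.
    """
    # Record the first day on which each URI was seen.
    first_seen: dict[str, date] = {}
    for day in sorted(snapshots):
        for uri in snapshots[day]:
            first_seen.setdefault(uri, day)

    found = 0
    total = 0
    for uri, day in first_seen.items():
        target = day + timedelta(days=lag_days)
        if target not in snapshots:
            continue
        total += 1
        if uri in snapshots[target]:
            found += 1
    return found / total if total else 0.0


# Toy data: three daily snapshots for one hypothetical query.
toy_snapshots = {
    date(2017, 5, 25): {"https://example.com/a", "https://example.com/b",
                        "https://example.com/c", "https://example.com/d"},
    date(2017, 5, 26): {"https://example.com/a", "https://example.com/b",
                        "https://example.com/e", "https://example.com/f"},
    date(2017, 5, 27): {"https://example.com/a", "https://example.com/g",
                        "https://example.com/h", "https://example.com/i"},
}

daily_rate = replacement_rate(toy_snapshots[date(2017, 5, 25)],
                              toy_snapshots[date(2017, 5, 26)])
p_one_day = refind_probability(toy_snapshots, lag_days=1)
print(daily_rate, p_one_day)
```

Sets are used here because the sketch only asks whether a URI appears anywhere in the collected SERPs on a given day; tracking rank or SERP page would need a richer structure.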