DOI: 10.1145/3197026.3197056
research-article

Scraping SERPs for Archival Seeds: It Matters When You Start

Published: 23 May 2018

ABSTRACT

Event-based collections are often started with a web search, but the search results you find on Day 1 may not be the same as those you find on Day 7. In this paper, we consider collections that originate from extracting URIs (Uniform Resource Identifiers) from Search Engine Result Pages (SERPs). Specifically, we seek to provide insight into the retrievability of URIs of news stories found on Google, and to answer two main questions. First, can one "refind" the same URI of a news story (for the same query) from Google after a given time? Second, what is the probability of finding a story on Google over a given period of time? To answer these questions, we issued seven queries to Google every day for over seven months (2017-05-25 to 2018-01-12) and collected links from the first five SERPs to generate seven collections for each query. The queries represent public interest stories: "healthcare bill," "manchester bombing," "london terrorism," "trump russia," "travel ban," "hurricane harvey," and "hurricane irma." We tracked each URI in all collections over time to estimate the discoverability of URIs from the first five SERPs. Our results showed that the daily average rate at which stories were replaced on the default Google SERP ranged from 0.21 to 0.54, and the weekly rate from 0.39 to 0.79, suggesting that older stories are quickly replaced by newer ones. The probability of finding the same URI of a news story one day after its initial appearance on the SERP ranged from 0.34 to 0.44. After a week, the probability of finding the same news story drops rapidly to between 0.01 and 0.11. In addition to reporting these probabilities, we provide two predictive models for estimating the probability of finding the URI of an arbitrary news story on SERPs as a function of time. The web archiving community considers link rot and content drift important reasons for collection building. Similarly, our findings suggest that, because URIs of news stories become harder to retrieve from Google with the same queries as time progresses, collection building that originates from search engines should begin as soon as possible in order to capture the first stages of events, and should persist in order to capture their evolution.
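
The two quantities reported above can be illustrated with a minimal Python sketch. The abstract does not give the paper's exact definitions, so the ones used here are assumptions: the replacement rate is taken as the fraction of URIs in an older snapshot that are absent from a newer one, and the refind probability as the share of URIs still present a fixed number of days after their first appearance. The function names and the toy snapshots are hypothetical and are not taken from the authors' code or data.

```python
from datetime import date, timedelta


def replacement_rate(old_uris, new_uris):
    """Fraction of URIs in the older snapshot that are missing from the newer one."""
    if not old_uris:
        return 0.0
    return len(old_uris - new_uris) / len(old_uris)


def refind_probability(snapshots, lag_days):
    """Estimate P(a URI is still on the SERPs `lag_days` after it first appeared).

    `snapshots` maps a date to the set of URIs scraped on that date.
    Returns None when no URI can be checked at the requested lag.
    """
    # Record the first day each URI was observed.
    first_seen = {}
    for day in sorted(snapshots):
        for uri in snapshots[day]:
            first_seen.setdefault(uri, day)

    found = eligible = 0
    for uri, day in first_seen.items():
        later = day + timedelta(days=lag_days)
        if later in snapshots:  # skip URIs whose lag falls outside the data
            eligible += 1
            found += uri in snapshots[later]
    return found / eligible if eligible else None


# Toy data (assumed): three daily snapshots for one hypothetical query.
snapshots = {
    date(2017, 5, 25): {"a", "b", "c", "d"},
    date(2017, 5, 26): {"a", "b", "e", "f"},
    date(2017, 5, 27): {"a", "e", "g", "h"},
}
days = sorted(snapshots)
daily_rates = [
    replacement_rate(snapshots[d0], snapshots[d1])
    for d0, d1 in zip(days, days[1:])
]
print(daily_rates)                       # [0.5, 0.5]
print(refind_probability(snapshots, 1))  # 0.5
print(refind_probability(snapshots, 2))  # 0.25
```

In this toy data half of each day's URIs are gone by the next day, and only one of the four URIs first seen on the first day survives two days, which mirrors in miniature the decay the abstract describes.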

    Published in

      JCDL '18: Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries
      May 2018
      453 pages
      ISBN:9781450351782
      DOI:10.1145/3197026

      Copyright © 2018 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      JCDL '18 Paper Acceptance Rate: 26 of 71 submissions, 37%
      Overall Acceptance Rate: 415 of 1,482 submissions, 28%
