Scraping SERPs for Archival Seeds: It Matters When You Start

Authors:
Alexander C. Nwala, Old Dominion University, Norfolk, VA, USA
Michele C. Weigle, Old Dominion University, Norfolk, VA, USA
Michael L. Nelson, Old Dominion University, Norfolk, VA, USA

JCDL '18: Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries, May 2018, Pages 263–272
https://doi.org/10.1145/3197026.3197056

Published: 23 May 2018

ABSTRACT

Event-based collections are often started with a web search, but the search results you find on Day 1 may not be the same as those you find on Day 7. In this paper, we consider collections that originate from extracting URIs (Uniform Resource Identifiers) from Search Engine Result Pages (SERPs). Specifically, we seek to provide insight about the retrievability of URIs of news stories found on Google, and to answer two main questions: first, can one "refind" the same URI of a news story (for the same query) from Google after a given time? Second, what is the probability of finding a story on Google over a given period of time? To answer these questions, we issued seven queries to Google every day for over seven months (2017-05-25 to 2018-01-12) and collected links from the first five SERPs to generate seven collections for each query. The queries represent public interest stories: "healthcare bill," "manchester bombing," "london terrorism," "trump russia," "travel ban," "hurricane harvey," and "hurricane irma." We tracked each URI in all collections over time to estimate the discoverability of URIs from the first five SERPs. Our results showed that the daily average rate at which stories were replaced on the default Google SERP ranged from 0.21 to 0.54, and the weekly rate from 0.39 to 0.79, suggesting the fast replacement of older stories by newer stories. The probability of finding the same URI of a news story one day after its initial appearance on the SERP ranged from 0.34 to 0.44. After a week, the probability of finding the same news stories diminishes rapidly to between 0.01 and 0.11. In addition to reporting these probabilities, we also provide two predictive models for estimating the probability of finding the URI of an arbitrary news story on SERPs as a function of time. The web archiving community considers link rot and content drift important reasons for collection building. Similarly, our findings suggest that, because of the difficulty of retrieving the URIs of news stories from Google, collection building that originates from search engines should begin as soon as possible in order to capture the first stages of events, and should persist in order to capture the evolution of the events, because it becomes more difficult to find the same news stories with the same queries on Google as time progresses.
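
The two quantities reported in the abstract, a story replacement rate and the probability of refinding a URI after a given number of days, can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so the definitions below are assumptions made for illustration only: the replacement rate is taken as the fraction of URIs in one snapshot that are absent from the next, and the refind probability as the share of URIs present again exactly `lag` days after their first appearance. The function names, the `snapshots` structure, and the toy URIs are hypothetical and are not taken from the paper or its repository.

```python
def replacement_rate(prev_uris, curr_uris):
    """Fraction of URIs in the earlier snapshot that are missing from the later one.

    Assumed definition for illustration; the paper may define the rate differently.
    """
    prev = set(prev_uris)
    if not prev:
        return 0.0
    return len(prev - set(curr_uris)) / len(prev)


def refind_probability(snapshots, lag):
    """Share of URIs found again exactly `lag` days after their first appearance.

    snapshots: dict mapping an integer day index to the URIs scraped that day.
    URIs whose first day + lag was not collected are left out of the estimate.
    """
    by_day = {day: set(uris) for day, uris in snapshots.items()}

    # Record the first day on which each URI was seen.
    first_seen = {}
    for day in sorted(by_day):
        for uri in by_day[day]:
            first_seen.setdefault(uri, day)

    eligible = 0
    found = 0
    for uri, day in first_seen.items():
        target = day + lag
        if target not in by_day:
            continue  # no snapshot for that day, so the URI cannot be checked
        eligible += 1
        if uri in by_day[target]:
            found += 1
    return found / eligible if eligible else 0.0


# Toy data (hypothetical): three daily snapshots for a single query.
snapshots = {
    0: ["a", "b", "c", "d"],
    1: ["a", "b", "e", "f"],
    2: ["a", "g", "h", "e"],
}
day_over_day = replacement_rate(snapshots[0], snapshots[1])
after_one_day = refind_probability(snapshots, lag=1)
after_two_days = refind_probability(snapshots, lag=2)
```

In this toy data the refind share falls as the lag grows, which mirrors the direction of the trend the abstract describes; the numbers themselves carry no meaning beyond the example.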

Index Terms

Scraping SERPs for Archival Seeds: It Matters When You Start
1. Information systems
  1. Information systems applications
    1. Digital libraries and archives


Published in
JCDL '18: Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries
May 2018
453 pages
ISBN: 9781450351782
DOI: 10.1145/3197026
General Chairs: Jiangping Chen (College of Information, UNT, USA), Marcos André Gonçalves (Brazil), Jeff M. Allen (College of Information, UNT, USA)
Program Chairs: Edward A. Fox (Virginia Tech, USA), Min-Yen Kan (National University of Singapore, Singapore), Vivien Petras (Humboldt-Universität zu Berlin, Germany)
Copyright © 2018 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
collection building
crawling
discoverability
web archiving
Qualifiers
- research-article
Acceptance Rates
JCDL '18 Paper Acceptance Rate: 26 of 71 submissions, 37%
Overall Acceptance Rate: 415 of 1,482 submissions, 28%