ABSTRACT
We describe a static, open-access news corpus built from data released by the Common Crawl Foundation, which provides free, publicly available web archives, including a continuous crawl of international news articles published in multiple languages. Our derived corpus, CC-News-En, contains 44 million English documents collected between September 2016 and March 2018. The collection is comparable in size to the number of documents typically found in a single shard of a large-scale, distributed search engine, and is four times larger than the news collections previously used in offline information retrieval experiments. To complement the corpus, 173 topics were curated using titles from Reddit threads, forming a temporally representative sample of relevant news topics over the 583-day collection window. Information needs were then generated using automatic summarization tools to produce textual and audio representations, and were used to elicit query variations from crowdworkers, with a total of 10,437 queries collected against the 173 topics. Of these, 10,089 include keystroke-level instrumentation that captures the timings of the character insertions and deletions made by the workers while typing their queries. These new resources support a wide variety of experiments, including large-scale efficiency exercises and query auto-completion synthesis, with scope for the future addition of relevance judgments to support offline effectiveness experiments and hence batch evaluation campaigns.
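As an illustration of how keystroke-level instrumentation of this kind could feed query auto-completion synthesis, the following minimal sketch replays a sequence of timed insertions and deletions and recovers the partial query visible after each keystroke. The event layout (`t_ms`, `op`, `pos`, `ch`) and the names `KeyEvent`, `replay`, and `sample` are assumptions made for this example; they are not the format in which the collection is distributed.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class KeyEvent:
    # Hypothetical record layout, assumed for illustration only.
    t_ms: int      # milliseconds since the worker started typing
    op: str        # "ins" for an insertion, "del" for a deletion
    pos: int       # character offset at which the edit applies
    ch: str = ""   # inserted character; unused for deletions


def replay(events: List[KeyEvent]) -> List[Tuple[int, str]]:
    """Return (timestamp, text) for the query box after each event."""
    buf: List[str] = []
    states: List[Tuple[int, str]] = []
    for ev in sorted(events, key=lambda e: e.t_ms):
        if ev.op == "ins":
            buf.insert(ev.pos, ev.ch)
        elif ev.op == "del":
            del buf[ev.pos]
        else:
            raise ValueError(f"unknown operation: {ev.op!r}")
        states.append((ev.t_ms, "".join(buf)))
    return states


# A worker types "nw", deletes the "w", then completes the word "news".
sample = [
    KeyEvent(0, "ins", 0, "n"),
    KeyEvent(140, "ins", 1, "w"),
    KeyEvent(420, "del", 1),
    KeyEvent(560, "ins", 1, "e"),
    KeyEvent(690, "ins", 2, "w"),
    KeyEvent(810, "ins", 3, "s"),
]

for t_ms, text in replay(sample):
    print(t_ms, text)
```

Each intermediate string is a candidate prefix at which an auto-completion system would have been consulted, and the timestamps give the gaps between keystrokes.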