research-article

CC-News-En: A Large English News Corpus

Authors:
Joel Mackenzie, The University of Melbourne, Melbourne, Australia
Rodger Benham, RMIT University, Melbourne, Australia
Matthias Petri, Amazon Alexa, Manhattan Beach, CA, USA
Johanne R. Trippas, The University of Melbourne, Melbourne, Australia
J. Shane Culpepper, RMIT University, Melbourne, Australia
Alistair Moffat, The University of Melbourne, Melbourne, Australia

CIKM '20: Proceedings of the 29th ACM International Conference on Information & Knowledge Management, October 2020, Pages 3077–3084. https://doi.org/10.1145/3340531.3412762

Published: 19 October 2020

ABSTRACT

We describe a static, open-access news corpus built from data published by the Common Crawl Foundation, which provides free, publicly available web archives, including a continuous crawl of international news articles published in multiple languages. Our derived corpus, CC-News-En, contains 44 million English documents collected between September 2016 and March 2018. The collection is comparable in size to the number of documents typically found in a single shard of a large-scale, distributed search engine, and is four times larger than the news collections previously used in offline information retrieval experiments. To complement the corpus, 173 topics were curated using titles from Reddit threads, forming a temporally representative sample of relevant news topics over the 583-day collection window. Information needs were then generated using automatic summarization tools to produce textual and audio representations, and were used to elicit query variations from crowdworkers, yielding a total of 10,437 queries across the 173 topics. Of these, 10,089 include keystroke-level instrumentation that captures the timings of character insertions and deletions made by the workers while typing their queries. These new resources support a wide variety of experiments, including large-scale efficiency exercises and query auto-completion synthesis, with scope for future addition of relevance judgments to support offline effectiveness experiments and hence batch evaluation campaigns.
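
To illustrate how keystroke-level logs of insertions and deletions could feed query auto-completion synthesis, the following Python sketch replays a list of timed edit events and recovers every intermediate text state a worker passed through. The abstract does not describe the log schema, so the `KeyEvent` record, its field names, and the sample events are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class KeyEvent:
    """One hypothetical keystroke record (not the corpus's actual schema)."""
    t_ms: int      # time since the worker started typing, in milliseconds
    op: str        # "ins" for an insertion, "del" for a deletion
    pos: int       # character offset at which the operation applies
    ch: str = ""   # inserted character; unused for deletions


def replay(events: List[KeyEvent]) -> List[Tuple[int, str]]:
    """Return (timestamp, text) after each event, applied in time order."""
    text: List[str] = []
    states: List[Tuple[int, str]] = []
    for ev in sorted(events, key=lambda e: e.t_ms):
        if ev.op == "ins":
            text.insert(ev.pos, ev.ch)
        elif ev.op == "del":
            # Ignore deletions that point outside the current text.
            if 0 <= ev.pos < len(text):
                del text[ev.pos]
        else:
            raise ValueError(f"unknown op: {ev.op!r}")
        states.append((ev.t_ms, "".join(text)))
    return states


# Made-up example: the worker types "nes", deletes the "s", then finishes "news".
events = [
    KeyEvent(0, "ins", 0, "n"),
    KeyEvent(120, "ins", 1, "e"),
    KeyEvent(250, "ins", 2, "s"),
    KeyEvent(400, "del", 2),
    KeyEvent(520, "ins", 2, "w"),
    KeyEvent(640, "ins", 3, "s"),
]
states = replay(events)
final_query = states[-1][1] if states else ""   # "news"
```

Each entry in `states` is a partial query paired with the time at which it appeared, which is the kind of prefix sequence an auto-completion experiment might replay against an index.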

Index Terms

1. Information systems
  1. Information retrieval
    1. Evaluation of retrieval results
      1. Test collections

Published in
CIKM '20: Proceedings of the 29th ACM International Conference on Information & Knowledge Management
October 2020
3619 pages
ISBN: 9781450368599
DOI: 10.1145/3340531
General Chairs:
Mathieu d'Aquin, DSI, Insight, NUI Galway, Ireland
Stefan Dietze, GESIS, Cologne, Germany; Heinrich-Heine-University Düsseldorf, Germany; L3S Research Center, Germany
Program Chairs:
Claudia Hauff, TU Delft, The Netherlands
Edward Curry, DSI, Insight, NUI Galway, Ireland
Philippe Cudre Mauroux, eXascale, University of Fribourg, Switzerland
Copyright © 2020 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
collection
corpus
crowdsourcing
news search
user query variations
Acceptance Rates
Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%