DOI: 10.1145/3340531.3412762
research-article

CC-News-En: A Large English News Corpus

Published: 19 October 2020

ABSTRACT

We describe a static, open-access news corpus using data from the Common Crawl Foundation, which provides free, publicly available web archives, including a continuous crawl of international news articles published in multiple languages. Our derived corpus, CC-News-En, contains 44 million English documents collected between September 2016 and March 2018. The collection is comparable in size to the number of documents typically found in a single shard of a large-scale, distributed search engine, and is four times larger than the news collections previously used in offline information retrieval experiments. To complement the corpus, 173 topics were curated using titles from Reddit threads, forming a temporally representative sampling of relevant news topics over the 583-day collection window. Information needs were then generated using automatic summarization tools to produce textual and audio representations, and used to elicit query variations from crowdworkers, with a total of 10,437 queries collected against the 173 topics. Of these, 10,089 include keystroke-level instrumentation that captures the timings of character insertions and deletions made by the workers while typing their queries. These new resources support a wide variety of experiments, including large-scale efficiency exercises and query auto-completion synthesis, with scope for future addition of relevance judgments to support offline effectiveness experiments and hence batch evaluation campaigns.
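
To illustrate how keystroke-level instrumentation of this kind could feed query auto-completion experiments, the following is a minimal sketch that replays a log of timed insertions and deletions to recover the partial query after each keystroke. The `KeyEvent` record, its field names, and the sample events are hypothetical; the actual log format distributed with the collection is not described here and may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class KeyEvent:
    """One hypothetical log record; this schema is an assumption, not the corpus format."""
    t_ms: int      # time since the worker started typing, in milliseconds
    op: str        # "ins" for a character insertion, "del" for a deletion
    pos: int       # character offset at which the edit applies
    ch: str = ""   # inserted character (unused for deletions)


def replay(events: List[KeyEvent]) -> List[Tuple[int, str]]:
    """Return (timestamp, query text so far) after each event, in time order."""
    buf: List[str] = []
    states: List[Tuple[int, str]] = []
    for ev in sorted(events, key=lambda e: e.t_ms):
        if ev.op == "ins":
            buf.insert(ev.pos, ev.ch)
        elif ev.op == "del":
            # Ignore deletions that point outside the current buffer.
            if 0 <= ev.pos < len(buf):
                del buf[ev.pos]
        else:
            raise ValueError(f"unknown op: {ev.op!r}")
        states.append((ev.t_ms, "".join(buf)))
    return states


# Made-up example: the worker types "nes", deletes the "s", then finishes "news".
events = [
    KeyEvent(0, "ins", 0, "n"),
    KeyEvent(120, "ins", 1, "e"),
    KeyEvent(250, "ins", 2, "s"),
    KeyEvent(400, "del", 2),
    KeyEvent(520, "ins", 2, "w"),
    KeyEvent(640, "ins", 3, "s"),
]
states = replay(events)
final_query = states[-1][1]
```

Each intermediate state is a prefix that an auto-completion system might have seen at that moment, so a replay of this kind is one plausible way to turn typed queries into a synthetic sequence of partial queries with timings.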

Supplemental Material

3340531.3412762.mp4 (mp4, 133.6 MB)

    • Published in

      CIKM '20: Proceedings of the 29th ACM International Conference on Information & Knowledge Management
      October 2020
      3619 pages
      ISBN:9781450368599
      DOI:10.1145/3340531

      Copyright © 2020 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%
