SberQuAD – Russian Reading Comprehension Dataset: Description and Analysis

Conference paper in: Experimental IR Meets Multilinguality, Multimodality, and Interaction (CLEF 2020)

Abstract

The paper presents SberQuAD, a large Russian reading comprehension (RC) dataset created analogously to the English SQuAD. SberQuAD contains about 50K question-paragraph-answer triples and is seven times larger than its closest competitor. We provide its description, a thorough analysis, and baseline experimental results. We scrutinize aspects of the dataset that can affect task performance: question/paragraph similarity, misspellings in questions, answer structure, and question types. We apply five popular RC models to SberQuAD and analyze their performance. We believe our work makes an important contribution to research in multilingual question answering.

P. Efimov—Work done as an intern at JetBrains Research.


Notes

  1. https://github.com/sberbank-ai/data-science-journey-2017.
  2. https://toloka.yandex.com.
  3. http://docs.deeppavlov.ai/en/master/features/models/squad.html.
  4. https://yandex.ru/dev/speller/ (in Russian).
  5. The multilingual BERT model is trained on the English OntoNotes corpus and transferred to Russian, see http://docs.deeppavlov.ai/en/master/features/models/ner.html.
  6. http://aot.ru.
  7. Table 5 provides data for the test set; the distribution for the training set is quite similar.
  8. https://docs.python.org/3/library/difflib.html (see the sketch after this list).
  9. https://github.com/deepmipt/ru_sentence_tokenizer.
  10. https://yandex.ru/dev/mystem/ (in Russian).
  11. Note that in the interface for crowdsourcing SQuAD questions, prompts at each screen reminded workers to formulate questions in their own words; in addition, the copy-paste functionality for the paragraph was deliberately disabled.
  12. https://github.com/buriy/spacy-ru.
  13. https://fasttext.cc/docs/en/crawl-vectors.html.
  14. https://github.com/sberbank-ai/data-science-journey-2017/tree/master/problem_B/.
  15. https://github.com/HKUST-KnowComp/R-Net.
  16. https://github.com/allenai/bi-att-flow.
  17. https://github.com/allenai/document-qa.
  18. https://github.com/facebookresearch/DrQA.
  19. http://docs.deeppavlov.ai/en/master/features/models/squad.html.
  20. Among these 14 questions, the majority are long sentences copied from the paragraph with a single word (the answer) replaced by a question word; one is an exact copy of a paragraph sentence with only a question mark appended; and one question has the answer erroneously attached after the question itself.
  21. Adverbial phrases appear to be even harder, but they are too few to draw reliable conclusions.
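
Footnote 8 above points to Python's difflib, presumably used in the question/paragraph similarity analysis mentioned in the abstract. The snippet below is only an illustrative sketch of that kind of measurement, not the authors' actual code; the example question, the paragraph sentences, and the plain whitespace tokenization are assumptions made for illustration.

```python
# Illustrative sketch (not the authors' code): token-level lexical similarity
# between a question and each paragraph sentence using difflib (footnote 8).
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return the ratio of matching tokens between two texts (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


# Hypothetical example data; a real analysis would use lemmatized Russian text.
paragraph_sentences = [
    "SberQuAD was collected on a crowdsourcing platform.",
    "Workers wrote questions and highlighted answer spans in paragraphs.",
]
question = "Where was SberQuAD collected?"

# The paragraph sentence most lexically similar to the question.
best = max(paragraph_sentences, key=lambda s: similarity(question, s))
print(best, round(similarity(question, best), 2))
```

A high ratio indicates that a question largely copies the paragraph wording, whereas a low ratio suggests more paraphrasing by the crowd worker.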


Acknowledgments

We thank Peter Romov, Vladimir Suvorov, and Ekaterina Artemova (Chernyak) for providing us with details about SberQuAD preparation. We also thank Natasha Murashkina for initial data processing. PB acknowledges support by Ural Mathematical Center under agreement No. 075-02-2020-1537/1 with the Ministry of Science and Higher Education of the Russian Federation.

Author information

Corresponding author

Correspondence to Pavel Braslavski.



Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Efimov, P., Chertok, A., Boytsov, L., Braslavski, P. (2020). SberQuAD – Russian Reading Comprehension Dataset: Description and Analysis. In: Arampatzis, A., et al. Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2020. Lecture Notes in Computer Science, vol. 12260. Springer, Cham. https://doi.org/10.1007/978-3-030-58219-7_1


  • DOI: https://doi.org/10.1007/978-3-030-58219-7_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-58218-0

  • Online ISBN: 978-3-030-58219-7

  • eBook Packages: Computer Science, Computer Science (R0)
