SberQuAD – Russian Reading Comprehension Dataset: Description and Analysis

Conference paper in: Experimental IR Meets Multilinguality, Multimodality, and Interaction (CLEF 2020)

Abstract

The paper presents SberQuAD, a large Russian reading comprehension (RC) dataset created analogously to the English SQuAD. SberQuAD contains about 50K question-paragraph-answer triples and is seven times larger than its closest competitor. We provide its description, a thorough analysis, and baseline experimental results. We scrutinize aspects of the dataset that can affect task performance: question/paragraph similarity, misspellings in questions, answer structure, and question types. We apply five popular RC models to SberQuAD and analyze their performance. We believe our work makes an important contribution to research in multilingual question answering.

P. Efimov—Work done as an intern at JetBrains Research.


Notes

  1. https://github.com/sberbank-ai/data-science-journey-2017.
  2. https://toloka.yandex.com.
  3. http://docs.deeppavlov.ai/en/master/features/models/squad.html.
  4. https://yandex.ru/dev/speller/ (in Russian).
  5. The multilingual BERT model is trained on the English OntoNotes corpus and transferred to Russian, see http://docs.deeppavlov.ai/en/master/features/models/ner.html.
  6. http://aot.ru.
  7. Table 5 provides data for the test set; the distribution for the training set is quite similar.
  8. https://docs.python.org/3/library/difflib.html (see the sketch after this list).
  9. https://github.com/deepmipt/ru_sentence_tokenizer.
  10. https://yandex.ru/dev/mystem/ (in Russian).
  11. Note that in the interface for crowdsourcing SQuAD questions, prompts at each screen reminded workers to formulate questions in their own words; in addition, the copy-paste functionality for the paragraph was deliberately disabled.
  12. https://github.com/buriy/spacy-ru.
  13. https://fasttext.cc/docs/en/crawl-vectors.html.
  14. https://github.com/sberbank-ai/data-science-journey-2017/tree/master/problem_B/.
  15. https://github.com/HKUST-KnowComp/R-Net.
  16. https://github.com/allenai/bi-att-flow.
  17. https://github.com/allenai/document-qa.
  18. https://github.com/facebookresearch/DrQA.
  19. http://docs.deeppavlov.ai/en/master/features/models/squad.html.
  20. Among these 14 questions, the majority are long sentences copied from the paragraph with a single word (the answer) replaced by a question word; one is an exact copy of a paragraph sentence with only a question mark appended; and one question has the answer erroneously attached after the question itself.
  21. Adverbial phrases appear to be even harder, but they are too few to draw reliable conclusions.
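
Footnote 8 above points to Python's difflib, presumably used in the question/paragraph similarity analysis mentioned in the abstract. The snippet below is only an illustrative sketch of that kind of measurement, not the authors' actual code; the example question, the paragraph sentences, and the plain whitespace tokenization are assumptions made for illustration.

```python
# Illustrative sketch (not the authors' code): token-level lexical similarity
# between a question and each paragraph sentence using difflib (footnote 8).
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return the ratio of matching tokens between two texts (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


# Hypothetical example data; a real analysis would use lemmatized Russian text.
paragraph_sentences = [
    "SberQuAD was collected on a crowdsourcing platform.",
    "Workers wrote questions and highlighted answer spans in paragraphs.",
]
question = "Where was SberQuAD collected?"

# The paragraph sentence most lexically similar to the question.
best = max(paragraph_sentences, key=lambda s: similarity(question, s))
print(best, round(similarity(question, best), 2))
```

A high ratio indicates that a question largely copies the paragraph wording, whereas a low ratio suggests more paraphrasing by the crowd worker.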


Acknowledgments

We thank Peter Romov, Vladimir Suvorov, and Ekaterina Artemova (Chernyak) for providing us with details about SberQuAD preparation. We also thank Natasha Murashkina for initial data processing. PB acknowledges support by Ural Mathematical Center under agreement No. 075-02-2020-1537/1 with the Ministry of Science and Higher Education of the Russian Federation.

Author information

Corresponding author

Correspondence to Pavel Braslavski.



Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Efimov, P., Chertok, A., Boytsov, L., Braslavski, P. (2020). SberQuAD – Russian Reading Comprehension Dataset: Description and Analysis. In: Arampatzis, A., et al. Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2020. Lecture Notes in Computer Science, vol. 12260. Springer, Cham. https://doi.org/10.1007/978-3-030-58219-7_1


  • DOI: https://doi.org/10.1007/978-3-030-58219-7_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-58218-0

  • Online ISBN: 978-3-030-58219-7

  • eBook Packages: Computer Science, Computer Science (R0)
