Introduction

Exam-style questions are a fundamental educational tool serving a variety of purposes. In addition to their role as an assessment instrument, questions have the potential to influence student learning. According to Thalheimer (2003), some of the benefits of using questions are: 1) offering the opportunity to practice retrieving information from memory; 2) providing learners with feedback about their misconceptions; 3) focusing learners' attention on the important learning material; 4) reinforcing learning by repeating core concepts; and 5) motivating learners to engage in learning activities (e.g. reading and discussing). Despite these benefits, manual question construction is a challenging task that requires training, experience, and resources. Several published analyses of real exam questions (mostly multiple choice questions (MCQs)) (Hansen and Dexter 1997; Tarrant et al. 2006; Hingorjo and Jaleel 2012; Rush et al. 2016) demonstrate their poor quality, which Tarrant et al. (2006) attributed to a lack of training in assessment development. The challenge is compounded by the need to replace assessment questions regularly to preserve their validity, since their value decreases or is lost after a few rounds of use (because questions are shared among test takers), and by the rise of e-learning technologies, such as massive open online courses (MOOCs) and adaptive learning, which require larger pools of questions.

Automatic question generation (AQG) techniques emerged as a solution to the challenges facing test developers in constructing a large number of good quality questions. AQG is concerned with the construction of algorithms for producing questions from knowledge sources, which can be either structured (e.g. knowledge bases (KBs)) or unstructured (e.g. text). As Alsubait (2015) discussed, research on AQG goes back to the 1970s. Nowadays, AQG is gaining further importance with the rise of MOOCs and other e-learning technologies (Qayyum and Zawacki-Richter 2018; Gaebel et al. 2014; Goldbach and Hamza-Lup 2017).

In what follows, we outline some potential benefits that one might expect from successful automatic generation of questions. AQG can reduce the cost (in terms of both money and effort) of question construction which, in turn, enables educators to spend more time on other important instructional activities. In addition to resource saving, having a large number of good-quality questions enables the enrichment of the teaching process with additional activities such as adaptive testing (Vie et al. 2017), which aims to adapt learning to student knowledge and needs, as well as drill and practice exercises (Lim et al. 2012). Finally, being able to automatically control question characteristics, such as question difficulty and cognitive level, can inform the construction of good quality tests with particular requirements.

Although the focus of this review is education, the applications of question generation (QG) are not limited to education and assessment. Questions are also generated for other purposes, such as validation of knowledge bases, development of conversational agents, and development of question answering or machine reading comprehension systems, where questions are used for training and testing.

This review extends a previous systematic review on AQG (Alsubait 2015), which covers the literature up to the end of 2014. Given the large amount of research that has been published since Alsubait's review was conducted (93 papers over a four-year period compared to 81 papers over the preceding 45-year period), an extension of Alsubait's review is reasonable at this stage. To capture the recent developments in the field, we review the literature on AQG from 2015 to early 2019. We take Alsubait's review as a starting point and extend the methodology in a number of ways (e.g. additional review questions and exclusion criteria), as will be described in the sections titled "Review Objective" and "Review Method". The contribution of this review is in providing researchers interested in the field with the following:

  1. a comprehensive summary of the recent AQG approaches;
  2. an analysis of the state of the field focusing on differences between the pre- and post-2014 periods;
  3. a summary of challenges and future directions; and
  4. an extensive reference to the relevant literature.

Summary of Previous Reviews

There have been six published reviews of the AQG literature. The reviews reported by Le et al. (2014), Kaur and Bathla (2015), Alsubait (2015), and Rakangor and Ghodasara (2015) cover the literature published up to late 2014, while those reported by Ch and Saha (2018) and Papasalouros and Chatzigiannakou (2018) cover the literature published up to late 2018. Out of these, the most comprehensive review is Alsubait's, which includes 81 papers (65 distinct studies) identified using a systematic procedure. The other reviews were selective and cover only a small subset of the AQG literature. Of particular interest, because it is a systematic review and because its timing overlaps with ours, is the review by Ch and Saha (2018). However, their review is narrower and less rigorous than ours: it focuses only on the automatic generation of MCQs from text; essential details of the review procedure, such as the search queries used for each electronic database and the resultant number of papers, are not reported; and several related studies found in other reviews on AQG are not included.

Findings of Alsubait’s Review

In this section, we concentrate on summarising the main results of Alsubait’s systematic review, due to its being the only comprehensive review. We do so by elaborating on interesting trends and speculating about the reasons for those trends, as well as highlighting limitations observed in the AQG literature.

Alsubait characterised AQG studies along the following dimensions: 1) purpose of generating questions, 2) domain, 3) knowledge sources, 4) generation method, 5) question type, 6) response format, and 7) evaluation.

The results of the review and the most prevalent categories within each dimension are summarised in Table 1. As can be seen in Table 1, generating questions for a specific domain is more prevalent than generating domain-unspecific questions. The most investigated domain is language learning (20 studies), followed by mathematics and medicine (four studies each). Note that, for these three domains, there are large standardised tests developed by professional organisations (e.g. Test of English as a Foreign Language (TOEFL), International English Language Testing System (IELTS) and Test of English for International Communication (TOEIC) for language, Scholastic Aptitude Test (SAT) for mathematics and board examinations for medicine). These tests require a continuous supply of new questions. We believe that this is one reason for the interest in generating questions for these domains. We also attribute the interest in the language learning domain to the ease of generating language questions, relative to questions belonging to other domains. Generating language questions is easier than generating other types of questions for two reasons: 1) the ease of adopting text from a variety of publicly available resources (e.g. a large number of general or specialised textual resources can be used for reading comprehension (RC)) and 2) the availability of natural language processing (NLP) tools for shallow understanding of text (e.g. part of speech (POS) tagging) with an acceptable performance, which is often sufficient for generating language questions. To illustrate, in Chen et al. (2006), the distractors accompanying grammar questions are generated by changing the verb form of the key (e.g. “write”, “written”, and “wrote” are distractors while “writing” is the key). Another plausible reason for interest in questions on medicine is the availability of NLP tools (e.g. named entity recognisers and co-reference resolvers) for processing medical text. There are also publicly available knowledge bases, such as UMLS (Bodenreider 2004) and SNOMED-CT (Donnelly 2006), that are utilised in different tasks such as text annotation and distractor generation. The other investigated domains are analytical reasoning, geometry, history, logic, programming, relational databases, and science (one study each).
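To make the verb-form strategy mentioned above concrete, the following minimal sketch (our own illustration, not Chen et al.'s implementation; the inflection table is a hard-coded assumption) produces grammar-question distractors from other inflections of the key's lemma:

```python
# Minimal sketch of verb-form distractor generation (illustrative only;
# the inflection table is a hard-coded assumption, not Chen et al.'s resource).
INFLECTIONS = {
    "write": ["write", "writes", "wrote", "written", "writing"],
}

def verb_form_distractors(key, lemma, n=3):
    """Return other inflections of the same lemma to serve as distractors."""
    forms = INFLECTIONS.get(lemma, [])
    return [form for form in forms if form != key][:n]

# Stem: "She is ____ a letter."  Key: "writing"
print(verb_form_distractors("writing", "write"))  # ['write', 'writes', 'wrote']
```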

Table 1 Results of Alsubait’s review. Categories with frequency of three or less are classified under “other”

With regard to knowledge sources, the most commonly used source for question generation is text (Table 1). A similar trend was also found by Rakangor and Ghodasara (2015). Note that 19 of the 38 text-based approaches identified by Alsubait (2015) tackle the generation of questions for the language learning domain, both free response (FR) and multiple choice (MC). Of the remaining 19 studies, only five focus on generating MCQs. To do so, they incorporate additional inputs such as WordNet (Miller et al. 1990), a thesaurus, or textual corpora. By and large, the challenge in the case of MCQs is distractor generation. Although distractors for language questions can be generated from text using simple strategies, such as selecting words with a particular POS or other syntactic properties, text alone often does not supply distractors, so external, structured knowledge sources are needed to determine what is true and what is similar. On the other hand, eight ontology-based approaches are centred on generating MCQs and only three focus on FR questions.

Simple factual wh-questions (i.e. where the answers are short facts that are explicitly mentioned in the input) and gap-fill questions (also known as fill-in-the-blank or cloze questions) are the most generated types of questions with the majority of them, 17 and 15 respectively, being generated from text. The prevalence of these questions is expected because they are common in language learning assessment. In addition, these two types require relatively little effort to construct, especially when they are not accompanied by distractors. In gap-fill questions, there are no concerns about the linguistic aspects (e.g. grammaticality) because the stem is constructed by only removing a word or a phrase from a segment of text. The stem of a wh-question is constructed by removing the answer from the sentence, selecting an appropriate wh-word, and rearranging words to form a question. Other types of questions such as mathematical word problems, Jeopardy-style questions,Footnote 1 and medical case-based questions (CBQs) require more effort in choosing the stem content and verbalisation. Another related observation we made is that the types of questions generated from ontologies are more varied than the types of questions generated from text.
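The low construction effort for gap-fill stems can be illustrated in a few lines of code. The sketch below is our own illustration, not drawn from any reviewed system; it simply blanks out a chosen answer phrase in a source sentence:

```python
# Gap-fill stem construction: replace the chosen answer phrase with a blank.
# Illustrative sketch only; not taken from a specific reviewed system.
def make_gap_fill(sentence, answer, blank="_____"):
    if answer not in sentence:
        raise ValueError("the answer must occur in the source sentence")
    stem = sentence.replace(answer, blank, 1)  # blank the first occurrence only
    return stem, answer

stem, key = make_gap_fill("Mitochondria produce most of the cell's ATP.", "ATP")
print(stem)  # Mitochondria produce most of the cell's _____.
print(key)   # ATP
```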

Limitations observed by Alsubait (2015) include the limited research on controlling the difficulty of generated questions and on generating informative feedback. Existing difficulty models are either not validated or only applicable to a specific type of question (Alsubait 2015). Regarding feedback (i.e. an explanation for the correctness/incorrectness of the answer), only three studies generate feedback along with the questions. Even then, the feedback is used to motivate students to try again or to provide extra reading material without explaining why the selected answer is correct/incorrect. Ungrammaticality is another notable problem with auto-generated questions, especially in approaches that apply syntactic transformations of sentences (Alsubait 2015). For example, 36.7% and 39.5% of questions generated in the work of Heilman and Smith (2009) were rated by reviewers as ungrammatical and nonsensical, respectively. Another limitation related to approaches to generating questions from ontologies is the use of experimental ontologies for evaluation, neglecting the value of using existing, probably large, ontologies. Various issues can arise if existing ontologies are used, which in turn provide further opportunities to enhance the quality of generated questions and the ontologies used for generation.

Review Objective

The goal of this review is to provide a comprehensive view of the AQG field since 2015. Following and extending the schema presented by Alsubait (2015) (Table 1), we have structured our review around the following four objectives and their related questions. Questions marked with an asterisk “*” are those proposed by Alsubait (2015). Questions under the first three objectives (except question 5 under OBJ3) are used to guide data extraction. The others are analytical questions to be answered based on extracted results.

OBJ1: Providing an overview of the AQG community and its activities

  1. What is the rate of publication?*
  2. What types of papers are published in the area?
  3. Where is research published?
  4. Who are the active research groups in the field?*

OBJ2: Summarising current QG approaches

  1. What is the purpose of QG?*
  2. What method is applied?*
  3. What tasks related to question generation are considered?
  4. What type of input is used?*
  5. Is it designed for a specific domain? For which domain?*
  6. What type of questions are generated?* (i.e. question format and answer format)
  7. What is the language of the questions?
  8. Does it generate feedback?*
  9. Is difficulty of questions controlled?*
  10. Does it consider verbalisation (i.e. presentation improvements)?

OBJ3: Identifying the gold-standard performance in AQG

  1. Are there any available sources or standard datasets for performance comparison?
  2. What types of evaluation are applied to QG approaches?*
  3. What properties of questions are evaluated?Footnote 2 And what metrics are used for their measurement?
  4. How does the generation approach perform?
  5. What is the gold-standard performance?

OBJ4: Tracking the evolution of AQG since Alsubait's review

  1. Has there been any progress on feedback generation?
  2. Has there been progress on generating questions with controlled difficulty?
  3. Has there been progress on enhancing the naturalness of questions (i.e. verbalisation)?

One of our motivations for pursuing these objectives is to provide members of the AQG community with a reference to facilitate decisions such as what resources to use, whom to compare to, and where to publish. As we mentioned in the Summary of Previous Reviews, Alsubait (2015) highlighted a number of concerns related to the quality of generated questions, difficulty models, and the evaluation of questions. We were motivated to know whether these concerns have been addressed. Furthermore, while reviewing some of the AQG literature, we made some observations about the simplicity of generated questions and about the reporting being insufficient and heterogeneous. We want to know whether these issues are universal across the AQG literature.

Review Method

We followed the systematic review procedure explained by Kitchenham and Charters (2007) and Boland et al. (2013).

Inclusion and Exclusion Criteria

We included studies that tackle the generation of questions for educational purposes (e.g. tutoring systems, assessment, and self-assessment) without any restriction on domains or question types. We adopted the exclusion criteria used by Alsubait (2015) (criteria 1 to 5) and added further exclusion criteria (6 to 13). A paper is excluded if:

  1. it is not in English
  2. it presents work in progress only and does not provide a sufficient description of how the questions are generated
  3. it presents a QG approach that is based mainly on a template and questions are generated by substituting template slots with numerals or with a set of randomly predefined values
  4. it focuses on question answering rather than question generation
  5. it presents an automatic mechanism to deliver assessments, rather than generating assessment questions
  6. it presents an automatic mechanism to assemble exams or to adaptively select questions from a question bank
  7. it presents an approach for predicting the difficulty of human-authored questions
  8. it presents a QG approach for purposes other than those related to education (e.g. training of question answering systems, dialogue systems)
  9. it does not include an evaluation of the generated questions
  10. it is an extension of a paper published before 2015 and no changes were made to the question generation approach
  11. it is a secondary study (i.e. literature review)
  12. it is not peer-reviewed (e.g. theses, presentations and technical reports)
  13. its full text is not available (through the University of Manchester Library website, Google or Google Scholar).

Search Strategy

Data Sources

Six data sources were used. Five were electronic databases (ERIC, ACM, IEEE, INSPEC and Science Direct), which Alsubait (2015) determined to have good coverage of the AQG literature. We also searched the International Journal of Artificial Intelligence in Education (AIED) and the proceedings of the International Conference on Artificial Intelligence in Education for 2015, 2017, and 2018 due to their AQG publication record.

We obtained additional papers by examining the reference lists of, and the citations to, AQG papers we reviewed (known as “snowballing”). The citations to a paper were identified by searching for the paper using Google Scholar, then clicking on the “cited by” option that appears under the name of the paper. We performed this for every paper on AQG, regardless of whether we had decided to include it, to ensure that we captured all the relevant papers. That is to say, even if a paper was excluded because it met some of the exclusion criteria (1-3 and 8-13), it is still possible that it refers to, or is referred to by, relevant papers.

We used the reviews reported by Ch and Saha (2018) and Papasalouros and Chatzigiannakou (2018) as a "sanity check" to evaluate the comprehensiveness of our search strategy. We extracted all the papers published between 2015 and 2018 that were included in the work of Ch and Saha (2018) and Papasalouros and Chatzigiannakou (2018) and checked whether they were captured by our results (both search results and snowballing results).

Search Queries

We used the keywords “question” and “generation” to search for relevant papers. Actual search queries used for each of the databases are provided in the Appendix under “Search Queries”. We decided on these queries after experimenting with different combinations of keywords and operators provided by each database and looking at the ratio between relevant and irrelevant results in the first few pages (sorted by relevance). To ensure that recall was not compromised, we checked whether relevant results returned using different versions of each search query were still captured by the selected version.

Screening

The search results were exported to comma-separated values (CSV) files. Two reviewers then looked independently at the titles and abstracts to decide on inclusion or exclusion. The reviewers skimmed a paper if they were not able to make a decision based on its title and abstract. Note that, at this stage, it was not always possible to assess whether a paper met exclusion criteria 2, 3, 8, 9, and 10. Because of this, the final decision was made after reading the full text, as described next.

To judge whether a paper’s purpose was related to education, we considered the title, abstract, introduction, and conclusion sections. Papers that mentioned many potential purposes for generating questions, but did not state which one was the focus, were excluded. If the paper mentioned only educational applications of QG, we assumed that its purpose was related to education, even without a clear purpose statement. Similarly, if the paper mentioned only one application, we assumed that was its focus.

Concerning evaluation, papers that evaluated the usability of a system that had a QG functionality, without evaluating the quality of generated questions, were excluded. In addition, in cases where we found multiple papers by the same author(s) reporting the same generation approach, even if some did not cover evaluation, all of the papers were included but counted as one study in our analyses.

Lastly, because the final decision on inclusion/exclusion sometimes changed after reading the full paper, agreement between the two reviewers was checked after the full paper had been read and the final decision had been made. In addition, a check was made to ensure that the inclusion/exclusion criteria were interpreted in the same way. Cases of disagreement were resolved through discussion.

Data Extraction

Guided by the questions presented in the “Review Objective” section, we designed a specific data extraction form. Two reviewers independently extracted data related to the included studies. As mentioned above, different papers that related to the same study were represented as one entry. Agreement for data extraction was checked and cases of disagreement were discussed to reach a consensus.

Papers that had at least one shared author were grouped together if one of the following criteria was met:

  • they reported on different evaluations of the same generation approach;

  • they reported on applying the same generation approach to different sources or domains;

  • one of the papers introduced an additional feature of the generation approach such as difficulty prediction or generating distractors without changing the initial generation procedure.

The extracted data were analysed using a script written in R Markdown.Footnote 3

Quality Assessment

Since one of the main objectives of this review is to identify the gold-standard performance, we were interested in the quality of the evaluation approaches. To assess this, we used the criteria presented in Table 2, which were selected from existing checklists (Downs and Black 1998; Reisch et al. 1989; Critical Appraisal Skills Programme 2018), with some criteria adapted to fit specific aspects of research on AQG. The quality assessment was conducted after reading a paper and filling in the data extraction form.

Table 2 Criteria used for quality assessment

In what follows, we describe the individual criteria (Q1-Q9, presented in Table 2) and what we considered when deciding whether a study satisfied each of them. Three responses are used when scoring the criteria: "yes", "no" and "not specified". The "not specified" response is used either when there is no information to support a judgement or when there is not enough information to distinguish between a "yes" and a "no" response.

Q1-Q4 are concerned with the quality of reporting on participant information, Q5-Q7 are concerned with the quality of reporting on the question samples, and Q8 and Q9 describe the evaluative measures used to assess the outcomes of the studies.

Q1: When a study reports the exact number of participants (e.g. experts, students, employees) used in the study, Q1 scores a "yes". Otherwise, it scores a "no". For example, the passage "20 students were recruited to participate in an exam…" would result in a "yes", whereas "a group of students were recruited to participate in an exam…" would result in a "no".

Q2: This criterion requires the reporting of demographic characteristics supporting the suitability of the participants for the task. Depending on the category of participant, relevant demographic information is required to score a "yes"; studies that do not specify such information score a "no". For example, among studies relying on expert reviews, those that include information on the teaching experience or proficiency level of the reviewers would receive a "yes", while among studies relying on mock exams, those that include information about the grade level or proficiency level of the test takers would also receive a "yes". Studies reporting only that the evaluation was conducted by reviewers, instructors, students, or co-workers, without any additional information about the suitability of the participants for the task, score a "no".

Q3: For a study to score "yes" for Q3, it must provide specific information on how participants were selected/recruited; otherwise it receives a score of "no". This includes information on whether the participants were paid for their work or were volunteers. For example, the passage "7th grade biology students were recruited from a local school." would receive a score of "no" because it is not clear whether or not they were paid for their work. However, a study that reports "Student volunteers were recruited from a local school…" or "Employees from company X were employed for n hours to take part in our study… they were rewarded for their services with Amazon vouchers worth $n" would receive a "yes".

Q4: To score "yes" for Q4, two conditions must be met: the study must 1) score "yes" for both Q2 and Q3 and 2) only use participants that are suitable for the task at hand. Studies that fail to meet the first condition score "not specified" while those that fail to meet the second condition score "no". Regarding the suitability of participants, we consider, as an example, native Chinese speakers suitable for evaluating the correctness and plausibility of options generated for Chinese gap-fill questions. As another example, we consider Amazon Mechanical Turk (AMT) co-workers unsuitable for evaluating the difficulty of domain-specific questions (e.g. mathematical questions).

Q5: When a study reports the exact number of questions used in the experimentation or evaluation stage, Q5 receives a score of "yes"; otherwise it receives a score of "no". To demonstrate, consider the following examples. A study reporting "25 of the 100 generated questions were used in our evaluation…" would receive a score of "yes". However, if a study made a claim such as "Around half of the generated questions were used…", it would receive a score of "no".

Q6: Q6a requires that the sampling strategy be not only reported (e.g. random, proportionate stratification, disproportionate stratification) but also justified to receive a "yes"; otherwise, it receives a score of "no". To demonstrate, a study that only reports "We sampled 20 questions from each template …" would receive a score of "no", since no justification for using a stratified sampling procedure is provided. However, if it were also to add "We sampled 20 questions from each template to ensure template balance in discussions about the quality of generated questions…", this would be considered a suitable justification and would warrant a score of "yes". Similarly, Q6b requires that the sample size be both reported and justified.

Q7: Our decision regarding Q7 takes into account the following: 1) the response to Q6a (i.e. a study can only score "yes" if its score for Q6a is "yes"; otherwise, the score is "not specified") and 2) the representativeness of the sample of the question population. Using random sampling is, in most cases, sufficient to score "yes" for Q7. However, if multiple types of questions are generated (e.g. different templates or different difficulty levels), stratified sampling is more appropriate in cases in which the distribution of questions is skewed.

Q8: Q8 considers whether the authors provide a description, a definition, or a mathematical formula for the evaluation measures they used, as well as a description of the coding system (if applicable). If so, the study receives a score of "yes" for Q8; otherwise it receives a score of "no".

Q9: Q9 is concerned with whether questions were evaluated by multiple reviewers and whether measures of agreement (e.g. Cohen's kappa or percentage of agreement) were reported. For example, studies reporting information similar to "all questions were double-rated and inter-rater agreement was computed…" receive a score of "yes", whereas studies reporting information similar to "Each question was rated by one reviewer…" receive a score of "no".

To assess inter-rater reliability, the quality assessment was performed independently by two reviewers (the first and second authors), both proficient in the field of AQG, on an exploratory random sample of 27 studies.Footnote 4 The percentage of agreement and Cohen's kappa were used to measure inter-rater reliability for Q1-Q9. The percentage of agreement ranged from 73% to 100%, while Cohen's kappa was above .72 for Q1-Q5, demonstrating "substantial to almost perfect agreement", and equal to 0.42 for Q9.Footnote 5
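For reference, the two reliability measures can be computed as in the sketch below (the ratings are invented for illustration and are not our quality-assessment data):

```python
# Percentage agreement and Cohen's kappa for two raters (illustrative
# ratings only; these are not the quality-assessment data reported above).
from collections import Counter

def percentage_agreement(a, b):
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    n = len(a)
    p_o = percentage_agreement(a, b)                                # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum((ca[c] / n) * (cb[c] / n) for c in set(a) | set(b))   # chance agreement
    return (p_o - p_e) / (1 - p_e)

r1 = ["yes", "yes", "no", "not specified", "yes", "no"]
r2 = ["yes", "no",  "no", "not specified", "yes", "yes"]
print(round(percentage_agreement(r1, r2), 2))  # 0.67
print(round(cohens_kappa(r1, r2), 2))          # 0.45
```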

Results and Discussion

Search and Screening Results

Searching the databases and AIED returned 2,012 papers, of which we checked 974.Footnote 7 The difference is due to ACM, which returned 1,265 results; we checked only the first 200 (sorted by relevance) because subsequent results became irrelevant. Out of the search results, 122 papers were considered relevant after looking at their titles and abstracts. After removing duplicates, 89 papers remained. This set was further reduced to 36 papers after reading the full text of the papers. Checking related work sections and reference lists identified 169 further papers (after removing duplicates). After reading their full texts, we found 46 that satisfied our inclusion criteria; 15 of these had already been captured by the initial search. Tracking citations using Google Scholar provided 204 papers (after removing duplicates). After reading their full texts, 49 were found to satisfy our inclusion criteria; 14 of these had already been captured by the initial search. The search results are outlined in Table 3. The final number of included papers was 93 (72 studies after grouping papers as described above). In total, the database search identified 36 papers while the other sources identified 57. Although the number of papers identified through other sources was large, many of them were variants of papers already included in the review.

Table 3 Sources used to obtain relevant papers and their contribution to the final results (* = after removing duplicates)

The most common reasons for excluding papers on AQG were that the purpose of the generation was not related to education or there was no evaluation. Details of papers that were excluded after reading their full text are in the Appendix under “Excluded Studies”.

Data Extraction Results

In this section, we provide our results and outline commonalities and differences with Alsubait’s results (highlighted in the “Findings of Alsubait’s Review” section). The results are presented in the same order as our research questions. The main characteristics of the reviewed literature can be found in the Appendix under “Summary of Included Studies”.

Rate of Publication

The distribution of publications by year is presented in Fig. 1. Putting this together with the results reported by Alsubait (2015), we notice a strong increase in publication starting from 2011. We also note that there were three workshops on QGFootnote 8 in 2008, 2009, and 2010, respectively, one of which was accompanied by a shared task (Rus et al. 2012). We speculate that the increase starting from 2011 occurred because the workshops on QG drew researchers' attention to the field, even though the participation rate in the shared task was low (only five groups participated). The increase also coincides with the rise of MOOCs and the launch of major MOOC providers (Udacity, Udemy, Coursera and edX, which all started up in 2012 (Baturay 2015)), providing another reason for the increasing interest in AQG. This interest grew further from 2015 onwards. In addition to these speculations, it is important to mention that QG is closely related to more mature areas such as NLP and the Semantic Web; the well-performing methods and tools these areas provide have affected both the quantity and the quality of research in QG. Note that these results relate only to question generation studies that focus on educational purposes and that there is a large volume of studies investigating question generation for other applications, as mentioned in the "Search and Screening Results" section.

Fig. 1 Publications per year

Types of Papers and Publication Venues

Of the papers published in the period covered by this review, conference papers constitute the majority (44 papers), followed by journal articles (32 papers) and workshop papers (17 papers). This is similar to the results of Alsubait (2015) with 34 conference papers, 22 journal papers, 13 workshop papers, and 12 other types of papers, including books or book chapters as well as technical reports and theses. In the Appendix, under “Publication Venues”, we list journals, conferences, and workshops that published at least two of the papers included in either of the reviews.

Research Groups

Overall, 358 researchers are working in the area (168 identified in Alsubait's review and 205 identified in this review, with 15 researchers in common). The majority of researchers have only one publication. In the Appendix, under "Active Research Groups", we present the 13 active groups, defined as those having more than two publications over the period of both reviews. Of the 174 papers identified in both reviews, 64 were published by these groups. This shows that, besides the increased activity in the study of AQG, the community is also growing.

Purpose of Question Generation

Similar to the results of Alsubait's review (Table 1), the main purpose of generating questions is to use them as assessment instruments (Table 4). Questions are also generated for other purposes, such as use in tutoring or self-assisted learning systems. Generated questions are still mostly used in experimental settings; only Zavala and Mendoza (2018) report their use in a class setting, in which the generator is used to produce quizzes for several courses and assignments for students.

Table 4 Purposes for automatically generating questions in the included studies. Note that a study can belong to more than one category

Generation Methods

Methods of generating questions have been classified in the literature (Yao et al. 2012) as follows: 1) syntax-based, 2) semantic-based, and 3) template-based. Syntax-based approaches operate on the syntax of the input (e.g. syntactic tree of text) to generate questions. Semantic-based approaches operate on a deeper level (e.g. is-a or other semantic relations). Template-based approaches use templates consisting of fixed text and some placeholders that are populated from the input. Alsubait (2015) extended this classification to include two more categories: 4) rule-based and 5) schema-based. The main characteristic of rule-based approaches, as defined by Alsubait (2015), is the use of rule-based knowledge sources to generate questions that assess understanding of the important rules of the domain. As this definition implies that these methods require a deep understanding (beyond syntactic understanding), we believe that this category falls under the semantic-based category. However, we define the rule-based approach differently, as will be seen below. Regarding the fifth category, according to Alsubait (2015), schemas are similar to templates but are more abstract. They provide a grouping of templates that represent variants of the same problem. We regard this distinction between template and schema as unclear. Therefore, we restrict our classification to the template-based category regardless of how abstract the templates are.

In what follows, we extend and re-organise the classification proposed by Yao et al. (2012) and extended by Alsubait (2015). This is due to our belief that there are two relevant dimensions that are not captured by the existing classification of different generation approaches: 1) the level of understanding of the input required by the generation approach and 2) the procedure for transforming the input into questions. We describe our new classification, characterise each category and give examples of features that we have used to place a method within these categories. Note that these categories are not mutually exclusive.

  • Level of understanding

    • Syntactic: Syntax-based approaches leverage syntactic features of the input, such as POS or parse-tree dependency relations, to guide question generation. These approaches do not require understanding of the semantics of the input in use (i.e. entities and their meaning). For example, approaches that select distractors based on their POS are classified as syntax-based.

    • Semantic: Semantic-based approaches require a deeper understanding of the input, beyond lexical and syntactic understanding. The information that these approaches use is not necessarily explicit in the input (i.e. it may require reasoning to be extracted). In most cases, this requires the use of additional knowledge sources (e.g. taxonomies, ontologies, or other such sources). As an example, approaches that use either contextual similarity or feature-based similarity to select distractors are classified as semantic-based.

  • Procedure of transformation

    • Template: Questions are generated with the use of templates. Templates define the surface structure of the questions using fixed text and placeholders that are substituted with values to generate questions. Templates also specify the features of the entities (either syntactic, semantic, or both), that can replace the placeholders.

    • Rule: Questions are generated with the use of rules. Rules often accompany approaches using text as input. Typically, approaches utilising rules annotate sentences with syntactic and/or semantic information. They then use these annotations to match the input to a pattern specified in the rules. These rules specify how to select a suitable question type (e.g. selecting suitable wh-words) and how to manipulate the input to construct questions (e.g. converting sentences into questions).

    • Statistical methods: This is where question transformation is learned from training data. For example, in Gao et al. (2018), question generation has been dealt with as a sequence-to-sequence prediction problem in which, given a segment of text (usually a sentence), the question generator forms a sequence of text representing a question (using the probabilities of co-occurrence that are learned from the training data). Training data has also been used in Kumar et al. (2015b) for predicting which word(s) in the input sentence is/are to be replaced by a gap (in gap-fill questions).

Regarding the level of understanding, 60 papers rely on semantic information and only ten approaches rely solely on syntactic information. All but three of the ten syntactic approaches (Das and Majumder 2017; Kaur and Singh 2017; Kusuma and Alhamri 2018) tackle the generation of language questions. In addition, templates are more popular than rules and statistical methods, with 27 papers reporting the use of templates, compared to 16 for rules and nine for statistical methods. Each of these three approaches has its advantages and disadvantages. In terms of cost, all three are expensive: templates and rules require manual construction, while learning from data often requires a large amount of annotated data, which is unavailable in many specific domains. Additionally, questions generated by rules and statistical methods are very similar to the input (e.g. the sentences used for generation), while templates allow the generation of questions that differ from the surface structure of the input (in word choice, for example). However, questions generated from templates are limited in terms of their linguistic diversity. Note that some papers were classified as not having a method for transforming the input into questions because they focus only on distractor generation or on gap-fill questions, for which the stem is the input statement with a word or phrase removed. Readers interested in studies that belong to a specific approach are referred to the "Summary of Included Studies" in the Appendix.
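To make the template-based procedure concrete, the following toy sketch generates questions by filling placeholders from subject–relation–answer facts (the facts and templates are invented for illustration and do not come from any reviewed system):

```python
# Toy template-based generator over (subject, relation, answer) triples.
# Facts and templates are invented; real systems also attach syntactic or
# semantic constraints to each placeholder.
TEMPLATES = {
    "capital_of": "What is the capital of {subject}?",
    "discovered_by": "Who discovered {subject}?",
}

FACTS = [
    ("France", "capital_of", "Paris"),
    ("Penicillin", "discovered_by", "Alexander Fleming"),
]

def generate(facts, templates):
    for subject, relation, answer in facts:
        template = templates.get(relation)
        if template:                       # skip relations without a template
            yield template.format(subject=subject), answer

for stem, key in generate(FACTS, TEMPLATES):
    print(stem, "->", key)
```

A rule-based generator would instead match annotated input sentences against patterns and rewrite them into questions, while a statistical generator would learn the sentence-to-question mapping from training pairs.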

Generation Tasks

Tasks involved in question generation are explained below. We grouped the tasks into the stages of preprocessing, question construction, and post-processing. For each task, we provide a brief description, mention its role in the generation process, and summarise different approaches that have been applied in the literature. The “Summary of Included Studies” in the Appendix shows which tasks have been tackled in each study.

Preprocessing

Two types of preprocessing are involved: 1) standard preprocessing and 2) QG-specific preprocessing. Standard preprocessing is common to various NLP tasks and is used to prepare the input for upcoming tasks; it involves segmentation, sentence splitting, tokenisation, POS tagging, and coreference resolution. In some cases, it also involves named entity recognition (NER) and relation extraction (RE). The aim of QG-specific preprocessing is to make or select inputs that are more suitable for generating questions. In the reviewed literature, three types of QG-specific preprocessing are employed:

  • Sentence simplification: This is employed in some text-based approaches (Liu et al. 2017; Majumder and Saha 2015; Patra and Saha 2018b). Complex sentences, usually sentences with appositions or sentences joined with conjunctions, are converted into simple sentences to ease upcoming tasks. For example, Patra and Saha (2018b) reported that Wikipedia sentences are long and contain multiple objects; simplifying these sentences facilitates triplet extraction (where triples are used later for generating questions). This task was carried out by using sentence simplification rules (Liu et al. 2017) and relying on parse-tree dependencies (Majumder and Saha 2015; Patra and Saha 2018b).

  • Sentence classification: In this task, sentences are classified into categories, which is, according to Mazidi and Tarau (2016a, 2016b), key to determining the type of question to be asked about the sentence. This classification was carried out by analysing POS and dependency labels, as in Mazidi and Tarau (2016a, 2016b), or by using a machine learning (ML) model and a set of rules, as in Basuki and Kusuma (2018). For example, in Mazidi and Tarau (2016a, 2016b), the pattern "S-V-acomp" is an adjectival complement that describes the subject and is therefore matched to the question template "Indicate properties or characteristics of S?"

  • Content selection: As the number of questions in examinations is limited, the goal of this task is to determine important content, such as sentences, parts of sentences, or concepts, about which to generate questions. In the reviewed literature, the majority approach is to generate all possible questions and leave the task of selecting important questions to exam designers. However, in some settings such as self-assessment and self-learning environments, in which questions are generated “on the fly”, leaving the selection to exam designers is not feasible.

    Content selection was of interest more for approaches that utilise text than for those that utilise structured knowledge sources. Several characterisations of important sentences, and approaches for their selection, have been proposed in the reviewed literature; we summarise them in the following paragraphs.

    Huang and He (2016) defined three characteristics for selecting sentences that are important for reading assessment and proposed metrics for their measurement: keyness (containing the key meaning of the text), completeness (spreading over different paragraphs to ensure that test-takers grasp the text fully), and independence (covering different aspects of text content). Olney et al. (2017) selected sentences that: 1) are well connected to the discourse (same as completeness) and 2) contain specific discourse relations. Other researchers have focused on selecting topically important sentences. To that end, Kumar et al. (2015b) selected sentences that contain concepts and topics from an educational textbook, while Kumar et al. (2015a) and Majumder and Saha (2015) used topic modelling to identify topics and then rank sentences based on topic distribution. Park et al. (2018) took another approach by projecting the input document and the sentences within it into the same n-dimensional vector space and then selecting sentences that are similar to the document, assuming that such sentences best express the topic or essence of the document (a minimal sketch of this strategy follows this list). Other approaches selected sentences by checking the occurrence of, or measuring the similarity to, a reference set of patterns, under the assumption that these sentences convey similar information to the sentences used to extract the patterns (Majumder and Saha 2015; Das and Majumder 2017). Others (Shah et al. 2017; Zhang and Takuma 2015) filtered out sentences that are insufficient on their own to make valid questions, such as sentences starting with discourse connectives (e.g. thus, also, so), as in Majumder and Saha (2015).

    Still other approaches to content selection are more specific and are informed by the type of question to be generated. For example, the purpose of the study reported in Susanti et al. (2015) is to generate “closest-in-meaning vocabulary questions”Footnote 9 which involve selecting a text snippet from the Internet that contains the target word, while making sure that the word has the same sense in both the input and retrieved sentences. To this end, the retrieved text was scored on the basis of metrics such as the number of query words that appear in the text.

    With regard to content selection from structured knowledge bases, only one study focuses on this task. Rocha and Zucker (2018) used DBpedia to generate questions along with external ontologies; the ontologies describe educational standards according to which DBpedia content was selected for use in question generation.
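As flagged above, document-similarity content selection can be sketched in a few lines. This is our simplification of the idea described by Park et al. (2018); the TF-IDF representation, the top-k cut-off, and the availability of scikit-learn are our own assumptions, not details of the original method:

```python
# Select sentences most similar to the whole document (a simplified sketch of
# the idea in Park et al. 2018; TF-IDF vectors and the top-k cut-off are our
# own assumptions, not the original method).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def select_sentences(sentences, top_k=2):
    document = " ".join(sentences)
    matrix = TfidfVectorizer().fit_transform([document] + sentences)
    doc_vec, sent_vecs = matrix[0], matrix[1:]
    scores = cosine_similarity(sent_vecs, doc_vec).ravel()
    ranked = sorted(zip(scores, sentences), reverse=True)
    return [sentence for _, sentence in ranked[:top_k]]  # sentences closest to the document

sentences = [
    "The cell membrane controls what enters and leaves the cell.",
    "Mitochondria produce most of the cell's ATP.",
    "The chapter ends with review questions.",
]
print(select_sentences(sentences))
```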

Question Construction

This is the main task and involves different processes based on the type of questions to be generated and their response format. Note that some studies only focus on generating partial questions (only stem or distractors). The processes involved in question construction are as follows:

  • Stem and correct answer generation: These two processes are often carried out together, using templates, rules, or statistical methods, as mentioned in the “Generation Methods” Section. Subprocesses involved are:

    • transforming assertive sentences into interrogative ones (when the input is text);

    • determination of question type (i.e. selecting suitable wh-word or template); and

    • selection of gap position (relevant to gap-fill questions).

  • Incorrect options (i.e. distractor) generation: Distractor generation is a very important task in MCQ generation since distractors influence question quality. Several strategies have been used to generate distractors. Among these are the selection of distractors based on word frequency (i.e. distractors that appear in a corpus about as often as the key) (Jiang and Lee 2017), POS (Soonklang and Muangon 2017; Susanti et al. 2015; Satria and Tokunaga 2017a, 2017b; Jiang and Lee 2017), or co-occurrence with the key (Jiang and Lee 2017). A dominant approach is the selection of distractors based on their similarity to the key (a minimal sketch of similarity-based selection follows this list), using different notions of similarity, such as syntax-based similarity (i.e. similar POS, similar letters) (Kumar et al. 2015b; Satria and Tokunaga 2017a, 2017b; Jiang and Lee 2017), feature-based similarity (Wita et al. 2018; Majumder and Saha 2015; Patra and Saha 2018a, 2018b; Alsubait et al. 2016; Leo et al. 2019), or contextual similarity (Afzal 2015; Kumar et al. 2015a, 2015b; Yaneva et al. 2018; Shah et al. 2017; Jiang and Lee 2017). Some studies (Lopetegui et al. 2015; Faizan and Lohmann 2018; Faizan et al. 2017; Kwankajornkiet et al. 2016; Susanti et al. 2015) selected distractors that are declared in a KB to be siblings of the key, which also implies some notion of similarity (siblings are assumed to be similar). Another approach that relies on structured knowledge sources is described in Seyler et al. (2017). The authors used query relaxation, whereby the queries used to generate question keys are relaxed to provide distractors that share some of the key's features. Faizan and Lohmann (2018), Faizan et al. (2017), and Stasaski and Hearst (2017) adopted a similar approach for selecting distractors. Others, including Liang et al. (2017, 2018) and Liu et al. (2018), used ML models to rank distractors based on a combination of the previous features.

    Again, some distractor selection approaches are tailored to specific types of questions. For example, for the pronoun reference questions generated in Satria and Tokunaga (2017a, 2017b), words selected as distractors must not belong to the same coreference chain, as this would make them correct answers. Another example of a domain-specific approach for distractor selection relates to gap-fill questions: Kumar et al. (2015b) ensured that distractors fit into the question sentence by calculating the probability of their occurring in the question.

  • Feedback generation: Feedback provides an explanation of the correctness or incorrectness of responses to questions, usually in reaction to user selection. As feedback generation is one of the main interests of this review, we elaborate more fully on this in the “Feedback Generation” section.

  • Controlling difficulty: This task focuses on determining how easy or difficult a question will be. We elaborate more on this in the section titled “Difficulty” .
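As noted in the distractor-generation item above, a minimal sketch of similarity-based distractor selection might look as follows. The two-dimensional "embeddings" are invented for illustration only; real systems use corpus-trained word embeddings or feature vectors derived from a knowledge base:

```python
# Select distractors by their (cosine) similarity to the key. The toy
# two-dimensional "embeddings" are invented; real systems use corpus-trained
# embeddings or KB-derived features.
import math

EMBEDDINGS = {
    "artery": (0.9, 0.1),
    "vein":   (0.85, 0.2),
    "nerve":  (0.7, 0.4),
    "piano":  (0.05, 0.95),
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def select_distractors(key, candidates, n=2):
    scored = [(cosine(EMBEDDINGS[key], EMBEDDINGS[c]), c)
              for c in candidates if c != key]
    return [c for _, c in sorted(scored, reverse=True)[:n]]  # most similar first

print(select_distractors("artery", ["vein", "nerve", "piano"]))  # ['vein', 'nerve']
```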

Post-processing

The goal of post-processing is to improve the output questions. This is usually achieved via two processes:

  • Verbalisation: This task is concerned with producing the final surface structure of the question. There is more on this in the section titled “Verbalisation”.

  • Question ranking (also referred to as question selection or question filtering): Several generators employ an "over-generate and rank" approach whereby a large number of questions are generated and then ranked or filtered in a subsequent phase. The goal of ranking is to prioritise good-quality questions; it is achieved using statistical models, as in Blšták (2018), Kwankajornkiet et al. (2016), Liu et al. (2017), and Niraula and Rus (2015).
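A minimal version of this over-generate-and-rank step is sketched below; the hand-set feature weights are our own assumption and stand in for the statistical models learned by the cited systems:

```python
# Over-generate and rank: score candidate questions with a simple linear
# model over shallow features and keep the top-ranked ones. The weights are
# invented; the cited systems learn such models from annotated data.
def score(question):
    return (2.0 * question["grammatical"]          # 1 if judged grammatical
            + 1.0 * question["answer_in_source"]   # 1 if the answer appears in the source
            - 0.1 * abs(question["length"] - 12))  # prefer stems of about 12 words

def rank(candidates, top_k=2):
    return sorted(candidates, key=score, reverse=True)[:top_k]

candidates = [
    {"stem": "What produces most of the cell's ATP?", "grammatical": 1, "answer_in_source": 1, "length": 7},
    {"stem": "Which produce of ATP the cell?", "grammatical": 0, "answer_in_source": 1, "length": 6},
    {"stem": "What does the cell membrane control?", "grammatical": 1, "answer_in_source": 1, "length": 6},
]
for question in rank(candidates):
    print(question["stem"])
```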

Input

In this section, we summarise our observations on which input formats are most popular in the literature published after 2014. One question we had in mind is whether structured sources (i.e. sources in which knowledge is organised in a way that facilitates automatic retrieval and processing) are gaining more popularity. We were also interested in the association between the input being used and the domain or question types. Specifically, are some inputs more common in specific domains? And are some inputs more suitable for specific types of questions?

As in the findings of Alsubait (Table 1), text is still the most popular type of input, with 42 studies using it. Ontologies and resource description framework (RDF) knowledge bases come second, with eight and six studies, respectively, using these. Note that these three input formats are shared between our review and Alsubait's review. Another input used by more than one study is question stems and keys, which feature in five studies that focus on generating distractors. See the Appendix "Summary of Included Studies" for the types of input used in each study.

The majority of studies reporting the use of text as the main input are centred around generating questions for language learning (18 studies) or generating simple factual questions (16 studies). Other domains investigated are medicine, history, and sport (one study each). On the other hand, among studies utilising Semantic Web technologies, only one tackles the generation of language questions and nine tackle the generation of domain-unspecific questions. Questions for biology, medicine, biomedicine, and programming have also been generated using Semantic Web technologies. Additional domains investigated in Alsubait’s review are mathematics, science, and databases (for studies using the Semantic Web). Combining both results, we see a greater variety of domains in semantic-based approaches.

Free-response questions are more prevalent among studies using text, with 21 studies focusing on this question type, 18 on multiple-choice questions, three on both free-response and multiple-choice questions, and one on verbal-response questions. Some studies employ additional resources such as WordNet (Kwankajornkiet et al. 2016; Kumar et al. 2015a) or DBpedia (Faizan and Lohmann 2018; Faizan et al. 2017; Tamura et al. 2015) to generate distractors. By contrast, MCQs are more prevalent in studies using Semantic Web technologies, with ten studies focusing on the generation of multiple-choice questions and four on free-response questions. This result is similar to that obtained by Alsubait (Table 1), with free-response questions being more popular for generation from text and multiple-choice questions more popular for generation from structured sources. We have discussed why this is the case in the "Findings of Alsubait's Review" section.

Domain, Question Types and Language

As Alsubait found previously ("Findings of Alsubait's Review" section), language learning is the most frequently investigated domain. Questions generated for language learning target reading comprehension skills, as well as knowledge of vocabulary and grammar. Research is ongoing in the domains of science (biology and physics), history, medicine, mathematics, computer science, and geometry, but only a small number of papers have been published on these domains. In the current review, no study has investigated the generation of logic and analytical reasoning questions, which were present in the studies included in Alsubait's review. Sport is the only new domain investigated in the reviewed literature. Table 5 shows the number of papers in each domain and the types of questions generated for these domains (for more details, see the Appendix, "Summary of Included Studies"). As Table 5 illustrates, gap-fill and wh-questions are again the most popular. The reader is referred to the section "Findings of Alsubait's Review" for our discussion of the reasons for the popularity of the language domain and the aforementioned question types.

Table 5 Domains for which questions are generated and types of questions in the reviewed literature

With regard to the response format of questions, both free- and selected-response questions (i.e. MC and T/F questions) are of interest. In all, 35 studies focus on generating selected-response questions, 32 on generating free-response questions, and four studies on both. These numbers are similar to the results reported in Alsubait (2015), which were 33 and 32 papers on generation of free- and selected-response questions respectively (Table 1). However, which format is more suitable for assessment is debatable. Although some studies that advocate the use of free-response argue that these questions can test a higher cognitive level,Footnote 10 most automatically generated free-response questions are simple factual questions for which the answers are short facts explicitly mentioned in the input. Thus, we believe that it is useful to generate distractors, leaving to exam designers the choice of whether to use the free-response or the multiple-choice version of the question.

Concerning language, the majority of studies focus on generating questions in English (59 studies). Questions in Chinese (5 studies), Japanese (3 studies), Indonesian (2 studies), as well as Punjabi and Thai (1 study each) have also been generated. To ascertain which languages had been investigated before, we skimmed the papers identified in Alsubait (2015) and found three studies on generating questions in languages other than English: French in Fairon (1999), Tagalog in Montenegro et al. (2012), and Chinese, in addition to English, in Wang et al. (2012). This reflects an increasing interest in generating questions in other languages, which possibly accompanies interest in NLP research on these languages. Note that there may be studies on other languages, or more studies on the languages we have identified, that we were not able to capture, because we excluded studies written in languages other than English.

Feedback Generation

Feedback generation concerns the provision of information regarding the response to a question. Feedback is important in reinforcing the benefits of questions especially in electronic environments in which interaction between instructors and students is limited. In addition to informing test takers of the correctness of their responses, feedback plays a role in correcting test takers’ errors and misconceptions and in guiding them to the knowledge they must acquire, possibly with reference to additional materials.

This aspect of questions has been neglected in both early and recent AQG literature. Among the literature that we reviewed, only one study, Leo et al. (2019), generated feedback alongside the generated questions. The feedback is a verbalisation of the axioms used to select the options; where distractors are generated, the axioms used to generate both the key and the distractors are included in the feedback.

We found another study (Das and Majumder 2017) that incorporated a procedure for generating hints using syntactic features, such as the number of words in the key, the first two letters of a one-word key, or the second word of a two-word key.

Difficulty

Difficulty is a fundamental property of questions that is approximated using different statistical measures, one of which is percentage correct (i.e. the percentage of examinees who answered a question correctly). Lack of control over difficulty leads to questions of inappropriate difficulty (i.e. inappropriately easy or difficult questions). Also, searching for a question of a specific difficulty among a huge number of generated questions is likely to be tedious for exam designers.

We structure this section around three aspects of difficulty models: 1) their generality, 2) features underlying them, and 3) evaluation of their performance.

Despite the growth in AQG, only 14 studies have dealt with difficulty. Eight of these focus on the difficulty of questions belonging to a particular domain, such as mathematical word problems (Wang and Su 2016; Khodeir et al. 2018), geometry questions (Singhal et al. 2016), vocabulary questions (Susanti et al. 2017a), reading comprehension questions (Gao et al. 2018), DFA problems (Shenoy et al. 2016), code-tracing questions (Thomas et al. 2019), and medical case-based questions (Leo et al. 2019; Kurdi et al. 2019). The remaining six focus on controlling the difficulty of non-domain-specific questions (Lin et al. 2015; Alsubait et al. 2016; Kurdi et al. 2017; Faizan and Lohmann 2018; Faizan et al. 2017; Seyler et al. 2017; Vinu and Kumar 2015a, 2017a; Vinu et al. 2016; Vinu and Kumar 2017b, 2015b).

Table 6 shows the different features proposed for controlling question difficulty in the aforementioned studies. In seven studies, RDF knowledge bases or OWL ontologies were used to derive the proposed features. We observe that only a few studies account for the contribution of both stem and options to difficulty.

Table 6 Features proposed for controlling the difficulty of generated questions

Difficulty control was validated by checking agreement between predicted difficulty and expert predictions in Vinu and Kumar (2015b), Alsubait et al. (2016), Seyler et al. (2017), Khodeir et al. (2018), and Leo et al. (2019); by checking agreement between predicted difficulty and student performance in Alsubait et al. (2016), Susanti et al. (2017a), Lin et al. (2015), Wang and Su (2016), Leo et al. (2019), and Thomas et al. (2019); by employing automatic solvers in Gao et al. (2018); or by asking experts to complete a survey after using the tool in Singhal et al. (2016). Expert reviews and mock exams are equally represented (seven studies each). We observe that the question samples used were small, with the majority containing fewer than 100 questions (Table 7).

Table 7 Types of evaluation employed for verifying difficulty models. An asterisk “*” indicates that insufficient information about the reviewers is reported

In addition to controlling difficulty, one study (Kusuma and Alhamri 2018) claims to generate questions targeting a specific Bloom level. However, no evaluation of whether the generated questions are indeed at the targeted Bloom level was conducted.

Verbalisation

We define verbalisation as any process carried out to improve the surface structure of questions (grammaticality and fluency) or to provide variations of questions (i.e. paraphrasing). The former is important since linguistic issues may affect the quality of generated questions. For example, grammatical inconsistency between the stem and incorrect options enables test takers to select the correct option with no mastery of the required knowledge. On the other hand, grammatical inconsistency between the stem and the correct option can confuse test takers who have the required knowledge and would have been likely to select the key otherwise. Providing different phrasing for the question text is also of importance, playing a role in keeping test takers engaged. It also plays a role in challenging test takers and ensuring that they have mastered the required knowledge, especially in the language learning domain. To illustrate, consider questions for reading comprehension assessment; if the questions match the text with a very slight variation, test takers are likely to be able to answer these questions by matching the surface structure without really grasping the meaning of the text.

From the literature identified in this review, only ten studies apply additional processes for verbalisation. Given that the majority of the literature focuses on gap-fill question generation, this result is expected. Aspects of verbalisation that have been considered are pronoun substitutions (i.e. replacing pronouns by their antecedents) (Huang and He 2016), selection of a suitable auxiliary verb (Mazidi and Nielsen 2015), determiner selection (Zhang and VanLehn 2016), and representation of semantic entities (Vinu and Kumar 2015b; Seyler et al. 2017) (see below for more on this). Other verbalisation processes that are mostly specific to some question types are the following: selection of singular personal pronouns (Faizan and Lohmann 2018; Faizan et al. 2017), which is relevant for Jeopardy questions; selection of adjectives for predicates (Vinu and Kumar 2017a), which is relevant for aggregation questions; and ordering sentences and reference resolution (Huang and He 2016), which is relevant for word problems.

For approaches utilising structured knowledge sources, semantic entities, which are usually represented following some convention such as camel case (e.g. anExampleOfCamelCase) or underscores as word separators, need to be rendered in a natural form. Basic processing, which includes word segmentation, the handling of camel case, underscores, spaces, and punctuation, and the conversion of the segmented phrase into a suitable morphological form (e.g. “has pet” to “having pet”), has been reported in Vinu and Kumar (2015b). Seyler et al. (2017) used Wikipedia to verbalise entities, an entity-annotated corpus to verbalise predicates, and WordNet to verbalise semantic types. The surface form of Wikipedia links was used as the verbalisation of entities. The annotated corpus was used to collect all sentences that contain mentions of the entities in a triple, combined with heuristics for filtering and scoring sentences; the phrases between the two entities were used as the verbalisation of predicates. Finally, as types correspond to WordNet synsets, the authors used a lexicon that comes with WordNet to verbalise semantic types.
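To illustrate the kind of basic processing described for Vinu and Kumar (2015b), the following is a minimal sketch of our own (not their actual implementation) that segments camel-case and underscore-separated entity names; the morphological step is hard-coded for the common “has X” pattern only and would require a proper morphological tool in practice.

    import re

    def segment_entity_name(name: str) -> str:
        """Split a camel-case or underscore-separated identifier into lowercase words."""
        spaced = name.replace("_", " ")
        # Insert a space at every lowercase/digit-to-uppercase boundary.
        spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", spaced)
        return " ".join(spaced.split()).lower()

    def to_gerund_phrase(property_name: str) -> str:
        """Very rough morphological adaptation, e.g. 'hasPet' -> 'having pet'."""
        words = segment_entity_name(property_name).split()
        if words and words[0] == "has":
            words[0] = "having"  # handles only this one common pattern
        return " ".join(words)

    print(segment_entity_name("anExampleOfCamelCase"))  # an example of camel case
    print(to_gerund_phrase("hasPet"))                   # having pet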

Only two studies (Huang and He 2016; Ai et al. 2015) have considered paraphrasing. Ai et al. (2015) employed a manually created library that includes different ways to express particular semantic relations for this purpose. For instance, “wife had a kid from husband” is expressed as “from husband, wife had a kid”. The latter is randomly chosen from among the ways to express the marriage relation as defined in the library. The other study that tackles paraphrasing is Huang and He (2016) in which words were replaced with synonyms.

Evaluation

In this section, we report on standard datasets and evaluation practices that are currently used in the field (considering how QG approaches are evaluated and which aspects of questions such evaluation focuses on). We also report on issues hindering the comparison of the performance of different approaches and the identification of the best-performing methods. Note that our focus is on the results of evaluating the whole generation approach, as indicated by the quality of the generated questions, and not on the results of evaluating a specific component of the approach (e.g. sentence selection or classification of question types). We also do not report on evaluations related to the usability of question generators (e.g. ease of use) or their efficiency (i.e. time taken to generate questions). For approaches using ontologies as the main input, we consider whether they use existing ontologies or experimental ones (i.e. ontologies created for the purpose of QG), since Alsubait (2015) raised concerns about the use of experimental ontologies in evaluations (see the “Findings of Alsubait’s Review” section). We also reflect on further issues in the design and implementation of evaluation procedures and how they can be improved.

Standard Datasets

In what follows, we outline publicly available question corpora, providing details about their content, as well as how they were developed and used in the context of QG. These corpora are grouped on the basis of the initial purpose for which they were developed. Following this, we discuss the advantages and limitations of using such datasets and call attention to some aspects to consider when developing similar datasets.

The identified corpora are developed for the following three purposes:

  • Machine reading comprehension

    • The Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al. 2016) consists of 150K questions about Wikipedia articles developed by Amazon Mechanical Turk (AMT) crowd workers. Of those, 100K questions are accompanied by paragraph-answer pairs from the same articles and 50K questions have no answer in the article. This dataset was used by Kumar et al. (2018) and Wang et al. (2018) to compare variants of the generation approach they developed, and to compare their approach with an approach from the literature. The comparison was based on the metrics BLEU-4, METEOR, and ROUGE-L, which capture the similarity between generated questions and the SQuAD questions that serve as ground truth (there is more information on these metrics in the next section). That is, questions were generated using the 100K paragraph-answer pairs as input; the generated questions were then compared with the human-authored questions based on the same paragraph-answer pairs.

    • NewsQA is another crowd-sourced dataset of about 120K question-answer pairs about CNN articles. The dataset consists of wh-questions and was used in the same way as SQuAD.

  • Training question-answering (QA) systems

    • The 30M factoid question-answer corpus (Serban et al. 2016) is a corpus of questions automatically generated from Freebase. Freebase triples (of the form subject, relationship, object) were used to generate questions whose correct answer is the object of the triple; for example, the question “What continent is bayuvi dupki in?” is generated from the triple (bayuvi dupki, contained by, europe) (a minimal template-based sketch of this kind of triple-to-question mapping is given after this list). The triples and the questions generated from them are provided in the dataset. A sample of the questions was evaluated by 63 AMT crowd workers, each of whom evaluated 44-75 examples; each question was evaluated by 3-5 crowd workers. The questions were also evaluated using automatic evaluation metrics. Song and Zhao (2016a) performed a qualitative analysis comparing the grammaticality and naturalness of questions generated by their approach with questions from this corpus (although the details of the comparison are unclear).

    • SciQ (Welbl et al. 2017) is a corpus of 13.7K science MCQs on biology, chemistry, earth science, and physics. The questions target a broad cohort, ranging from elementary to college introductory level. The corpus was created by AMT crowd workers at a cost of $10,415 and its development relied on a two-stage procedure. First, 175 crowd workers were shown paragraphs and asked to generate questions for a payment of $0.30 per question. Second, another crowd-sourcing task was conducted in which crowd workers validated the questions and provided distractors for them: a list of six candidate distractors was provided by an ML model, and the crowd workers were asked to select two distractors from the list and to provide at least one additional distractor, for a payment of $0.20. For evaluation, a third crowd-sourcing task was created in which crowd workers were shown 100 question pairs, each pair consisting of an original science exam question and a crowd-sourced question in random order, and were instructed to select the question likelier to be the real exam question. The science exam questions were identified in 55% of cases. This corpus was used by Liang et al. (2018) to develop and test a model for ranking distractors: all keys and distractors in the dataset were fed to the model to rank, and the authors assessed whether the ranked distractors were among the original distractors provided with the questions.

  • Question generation

    • The Question Generation Shared Task and Evaluation Challenge (QGSTEC) dataset (Rus et al. 2012) was created for the QG shared task. The shared task comprises two challenges: question generation from individual sentences and question generation from a paragraph. The dataset contains 90 sentences and 65 paragraphs collected from Wikipedia, OpenLearn, and Yahoo! Answers, with 180 and 390 questions generated from the sentences and paragraphs, respectively. A detailed description of the dataset, along with the results achieved by the participants, is given in Rus et al. (2012). Blšták and Rozinajová (2017, 2018) used this dataset to generate questions and compare the correctness of their questions with that achieved by the systems participating in the shared task.

    • Medical CBQ corpus (Leo et al. 2019) is a corpus of 435 case-based, auto-generated questions that follow four templates (“What is the most likely diagnosis?”, “What is the drug of choice?”, “What is the most likely clinical finding?”, and “What is the differential diagnosis?”). The questions are accompanied by experts’ ratings of appropriateness, difficulty, and actual student performance. The data was used to evaluate an ontology-based approach for generating case-based questions and predicting their difficulty.

    • MCQL is a corpus of about 7.1K MCQs crawled from the web, with an average of 2.91 distractors per question. The domains of the questions are biology, physics, and chemistry, and they target Cambridge O level and college level. The dataset was used by Liang et al. (2018) to develop and evaluate an ML model for ranking distractors.
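As a concrete illustration of the triple-to-question mapping mentioned in the description of the 30M factoid corpus, the following minimal sketch generates a question-answer pair from a (subject, relationship, object) triple using a hypothetical hand-written template per relationship; note that Serban et al. (2016) generate their questions with a learned model rather than fixed templates, so this sketch only shows the underlying idea.

    # Hypothetical relationship-to-template mapping (illustrative only; the template
    # for "contained by" is tailored to the continent example from the corpus).
    TEMPLATES = {
        "contained by": "What continent is {subject} in?",
        "place of birth": "Where was {subject} born?",
    }

    def question_from_triple(subject: str, relationship: str, obj: str):
        """Return a (question, answer) pair where the answer is the triple's object."""
        template = TEMPLATES.get(relationship)
        if template is None:
            return None  # no template available for this relationship
        return template.format(subject=subject), obj

    print(question_from_triple("bayuvi dupki", "contained by", "europe"))
    # ('What continent is bayuvi dupki in?', 'europe')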

Several datasets were used to assess the ability of question generators to produce questions similar to human-authored ones (see Table 8 for an overview). Note that the majority of these datasets were developed for purposes other than education and, as such, the educational value of their questions has not been validated. Therefore, while the use of these datasets supports the claim of being able to generate human-like questions, it does not indicate that the generated questions are good or educationally useful. Additionally, restricting the evaluation of generation approaches to the criterion of being able to generate questions similar to those in the datasets does not capture their ability to generate other good quality questions that differ in surface structure and semantics.

Table 8 Information about question corpora that are used in the reviewed literature

Some of these datasets were used to develop and evaluate ML-models for ranking distractors. However, being written by humans does not necessarily mean that these distractors are good. This is, in fact, supported by many studies on the quality of distractors in real exam questions (Sarin et al. 1998; Tarrant et al. 2009; Ware and Vik 2009). If these datasets were to be used for similar purposes, distractors would need to be filtered based on their functionality (i.e. being picked by test takers as answers to questions).

We also observe that these datasets have been used in only a small number of studies (one or two each). This is partly because many of them are relatively new. In addition, the design space for question generation is large (i.e. different inputs, question types, and domains), so each of these datasets is relevant to only a small set of question generators.

Types of Evaluation

The most common evaluation approach is expert-based evaluation (n = 21), in which experts are presented with a sample of generated questions to review. Given that expert review is also a standard procedure for selecting questions for real exams, expert rating is believed to be a good proxy for quality. However, it is important to note that expert review only provides initial evidence for the quality of questions. The questions also need to be administered to a sample of students to obtain further evidence of their quality (empirical difficulty, discrimination, and reliability), as we will see later. However, invalid questions must be filtered first, and expert review is also utilised for this purpose, whereby questions indicated by experts to be invalid (e.g. ambiguous, guessable, or not requiring domain knowledge) are filtered out. Having an appropriate question set is important to keep participants involved in question evaluation motivated and interested in solving these questions.

One of our observations on expert-based evaluation is that only in a few studies were experts required to answer the questions as part of the review. We believe this is an important step to incorporate since answering a question encourages engagement and triggers deeper thinking about what is required to answer. In addition, expert performance on questions is another indicator of question quality and difficulty. Questions answered incorrectly by experts can be ambiguous or very difficult.

Another observation on expert-based evaluation concerns the ambiguity of the instructions provided to experts. For example, in an evaluation of reading comprehension questions (Mostow et al. 2017), the authors reported different interpretations of the instructions for rating overall question quality, whereby one expert pointed out that it was not clear whether reading the preceding text was required in order to rate a question as being of good quality. Researchers have also measured question acceptability, as well as other aspects of questions, using scales with a large number of categories (up to a 9-point scale) without a clear definition of each category. Zhang (2015) found that reviewers perceive scales differently and that not all categories are used by all reviewers. We believe that these two issues are reasons for the low inter-rater agreement between experts. To improve the accuracy of the data obtained through expert review, researchers must precisely specify the criteria by which questions are to be evaluated. In addition, a pilot test should be conducted with experts to provide an opportunity to validate the instructions and to ensure that instructions and questions are easily understood and interpreted as intended by different respondents.

The second most commonly employed method of evaluation is comparing machine-generated questions (or parts of questions) to human-authored ones (n = 15), carried out either automatically or as part of an expert review. This comparison is utilised to confirm different aspects of question quality. Zhang and VanLehn (2016) evaluated their approach by counting the number of questions in common between those that were human- and machine-generated. The authors used this method under the assumption that humans are likely to ask deep questions about topics (i.e. questions of a higher cognitive level); on this ground, they claimed that an overlap means the machine was able to mimic this in-depth questioning. Other researchers have compared machine-generated questions with human-authored reference questions using metrics borrowed from the fields of text summarisation (ROUGE (Lin 2004)) and machine translation (BLEU (Papineni et al. 2002) and METEOR (Banerjee and Lavie 2005)). These metrics measure the similarity between two questions generated from the same text segment or sentence. Put simply, this is achieved by counting the n-grams in the generated question that match n-grams in the gold-standard question, with some metrics focusing on recall (i.e. how much of the reference question is captured in the generated question) and others focusing on precision (i.e. how much of the generated question is relevant). METEOR also considers stemming and synonymy matching. Wang et al. (2018) claimed that these metrics can be used as initial, inexpensive, large-scale indicators of the fluency and relevance of questions. Other researchers have investigated whether machine-generated questions are indistinguishable from human-authored questions by mixing both types and asking experts about the source of each question (Chinkina and Meurers 2017; Susanti et al. 2015; Khodeir et al. 2018). Some researchers have evaluated their approaches by investigating their ability to reproduce human-authored distractors; for example, Yaneva et al. (2018) focused only on generating distractors given a question stem and key. However, given the published evidence of the poor quality of human-generated distractors, additional checks need to be performed, such as checking the functionality of these distractors.
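To make the n-gram matching idea concrete, the sketch below computes simple n-gram precision and recall between a generated question and a human-authored reference. This is only an illustration of the principle: actual BLEU combines clipped precisions over several n-gram orders with a brevity penalty, ROUGE is a recall-oriented family of measures, and METEOR adds stemming and synonym matching.

    from collections import Counter

    def ngrams(tokens, n):
        """Multiset of n-grams of a token list."""
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def ngram_precision_recall(generated: str, reference: str, n: int = 2):
        """Precision is computed w.r.t. the generated question,
        recall w.r.t. the reference (human-authored) question."""
        gen = ngrams(generated.lower().split(), n)
        ref = ngrams(reference.lower().split(), n)
        overlap = sum((gen & ref).values())  # matching n-grams (clipped counts)
        precision = overlap / max(sum(gen.values()), 1)
        recall = overlap / max(sum(ref.values()), 1)
        return precision, recall

    print(ngram_precision_recall("what is the capital of france ?",
                                 "what is the capital city of france ?"))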

Crowd-sourcing has also been used in ten of the studies. In eight of these, crowd workers were employed to review questions, while in three they were employed to take mock tests. To assess the quality of their responses, Chinkina et al. (2017) included test questions to make sure that the crowd workers understood the task and were able to distinguish low-quality from high-quality questions. However, a process for validating the reliability of crowd workers has been neglected (or perhaps simply not reported) in most studies. Another validation step that can be added to the experimental protocol is a pilot to assess the crowd workers’ capability to review questions. This can also be achieved by adding validated questions to the list of questions to be reviewed (given the availability of a validated question set).

Similarly, students have been employed to review questions in nine studies and to take tests in a further ten. We attribute the low rate of question validation through testing with student cohorts to it being time-consuming and to the ethical issues involved in these experiments. Experimenters must ensure that these tests do not have an influence on students’ grades or motivations. For example, if multiple auto-generated questions focus on one topic, students could perceive this as an important topic and pay more attention to it while studying for upcoming exams, possibly giving less attention to other topics not covered by the experimental exam. Difficulty of such experimental exams could also affect students. If an experimental test is very easy, students could expect upcoming exams to be the same, again paying less attention when studying for them. Another possible threat is a drop in student motivation triggered by an experimental exam being too difficult.

Finally, for ontology-based approaches, similar to the findings reported in the section “Findings of Alsubait’s Review”, most ontologies used in evaluations were hand-crafted for experimental purposes and the use of real ontologies was neglected, except in Vinu and Kumar (2015b), Leo et al. (2019), and Lopetegui et al. (2015).

Quality Criteria and Metrics

Table 9 shows the criteria used for evaluating the quality of questions or their components. Some of these criteria concern the linguistic quality of questions, such as grammatical correctness, fluency, semantic ambiguity, freedom from errors, and distractor readability. Others are educationally oriented, such as educational usefulness, domain relevance, and learning outcome. There are also standard quality metrics for assessing questions, such as difficulty, discrimination, and cognitive level. Most of the criteria can be used to evaluate any type of question; only a few are applicable to a specific class of questions, such as the quality of the blank (i.e. the word or phrase removed from a segment of text) in gap-fill questions. As can be seen, human-based measures are more common than automatic scoring and statistical procedures. More details about the measurement of these criteria and the results achieved by generation approaches can be found in the Appendix, “Evaluation”.

Table 9 Evaluation metrics and number of papers that have used each metric
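For the standard statistical metrics listed in Table 9, the following is a minimal, textbook classical-test-theory sketch (not a procedure taken from any of the reviewed studies) of how item difficulty (proportion correct) and discrimination (here, the correlation between an item score and the rest-of-test score) could be computed from a 0/1 response matrix obtained in a mock exam.

    import numpy as np

    def item_statistics(responses: np.ndarray):
        """responses: examinees x items matrix of 0/1 scores.
        Returns per-item difficulty (proportion correct) and discrimination
        (correlation between each item score and the rest-of-test score)."""
        difficulty = responses.mean(axis=0)
        discrimination = []
        for i in range(responses.shape[1]):
            rest = responses.sum(axis=1) - responses[:, i]  # total score excluding item i
            discrimination.append(np.corrcoef(responses[:, i], rest)[0, 1])
        return difficulty, np.array(discrimination)

    # Toy data: six examinees, three items.
    r = np.array([[1, 1, 0],
                  [1, 0, 0],
                  [1, 1, 1],
                  [0, 0, 0],
                  [1, 1, 1],
                  [0, 1, 0]])
    print(item_statistics(r))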

Performance of Generation Approaches and Gold Standard Performance

We started this systematic review hoping to identify standard performance levels and the best generation approaches. However, a comparison between the performances of the various approaches was not possible due to heterogeneity in the measurement of quality and in the reporting of results. For example, scales consisting of different numbers of categories were used by different studies to measure the same variables. We were not able to normalise these scales because most studies reported only aggregated data, without providing the number of observations in each rating scale category. Another example of heterogeneity concerns difficulty based on examinee performance: while some studies use percentage correct, others use Rasch difficulty, without providing the raw data that would allow the other metric to be calculated. Also, essential information needed to judge the trustworthiness and generality of the results, such as sample size and selection method, was not reported in multiple studies. All of these issues preclude a statistical analysis of, and a conclusion about, the performance of generation approaches.

Quality Assessment Results

In this section, we describe and reflect on the state of experimental reporting in the reviewed literature.

Overall, the experimental reporting is unsatisfactory. Essential information that is needed to assess the strength of a study is often not reported, raising concerns about the trustworthiness and generalisability of the results. For example, the number of evaluated questions, the number of participants involved in evaluations, or both of these numbers are not mentioned in five, ten, and five studies, respectively. Information about the sampling strategy and about how sample size was determined is almost never reported (see the Appendix, “Quality Assessment”).

A description of the participants’ characteristics, whether experts, students, or crowd workers, is frequently missing (neglected by 23 studies). The minimal information that needs to be reported about experts involved in reviewing questions, in addition to their number, is their teaching and exam-construction experience. Reporting whether or not experts were paid is important for the reader to understand possible biases; however, this is not reported in 51 studies involving experiments with human subjects. Other helpful information to report is the time taken to review questions, because this would assist researchers in estimating the number of experts to recruit given a particular sample size, or the number of questions to sample given the available number of experts.

The characteristics of students involved in evaluations, such as their educational level and experience with the subject under assessment, are important for the replication of studies. In addition, this information can provide a basis for combining evidence from multiple studies; for example, we could gain stronger evidence about the effect of specific features on question difficulty by combining studies investigating the same features with different cohorts. The characteristics of the participants are also a possible explanation for differences in difficulty between studies. Similarly, it is important to report the criteria used for selecting crowd workers, such as restrictions on the countries they are from or on the number and accuracy of previous tasks in which they have participated.

Some studies neglect to report the total number of generated questions and the distribution of questions across categories (question types, difficulty levels, and question sources, when applicable), which are necessary to assess the suitability of sampling strategies. For example, without reporting the distribution of question types, making a claim based on random sampling that “70% of questions are appropriate to be used in exams” would be misleading if the distribution of question types is skewed, because the sample would not be representative of question types with a low number of questions. Similarly, if the majority of generated questions are easy, using a random sample will result in the under-representation of difficult questions, consequently precluding any conclusion about difficult questions or any comparison between easy and difficult questions.

With regard to measurement descriptions, ten studies fail to report information sufficient for replication, such as the instructions given to participants and a description of the rating scales. Another limitation concerning measurement is the lack of assessment of inter-rater reliability (not reported by 43 studies). In addition, we observed a lack of justification for experimental decisions. An example is the choice of sources from which questions were generated: particular texts or knowledge sources were selected without any discussion of whether, and of what, they were representative. We believe that the generation challenges and question quality issues that might be encountered when using different sources need to be raised and discussed.

Conclusion and Future Work

In this paper, we have conducted a comprehensive review of 93 papers addressing the automatic generation of questions for educational purposes. In what follows, we summarise our findings in relation to the review objectives.

Providing an Overview of the AQG Community and its Activities

We found that AQG is an increasingly active area pursued by a growing community. Through this review, we identified the top publication venues and the active research groups in the field, providing a connection point for researchers interested in the field.

Summarising Current QG Approaches

We found that the majority of QG systems focus on generating questions for the purpose of assessment. The template-based approach was the most common method employed in the reviewed literature. In addition to the generation of complete questions or of question components, a variety of pre- and post-processing tasks that are believed to improve question quality have been investigated. The focus has been on the generation of questions from text and for the language domain. The generation of multiple-choice and free-response questions was almost equally investigated, with a large number of studies focusing on wh-questions and gap-fill questions. We also found increased interest in generating questions in languages other than English. Although extensive research has been carried out on QG, only a small proportion of studies tackles the generation of feedback, the verbalisation of questions, or the control of question difficulty.

Identifying Gold Standard Performance in AQG

Incomparability of the performance of generation approaches is an issue we identified in the reviewed literature. This issue is due to the heterogeneity in both measurement of quality and reporting of results. We suggest below how the evaluation of questions and reporting of results can be improved to overcome this issue.

Tracking the Evolution of AQG Since Alsubait’s Review

Our results are consistent with the findings of Alsubait (2015). Based on these findings, we suggest that research in the area can be extended in the following directions (starting at the question level before moving on to the evaluation and research in closely related areas):

Improvement at the Question Level

Generating Questions with Controlled Difficulty

As mentioned earlier, there is little research on question difficulty, and what exists mostly focuses on either stem or distractor difficulty. Both the stem and the options contribute to overall difficulty and therefore need to be considered together rather than in isolation. Furthermore, controlling MCQ difficulty by varying the similarity between the key and the distractors is a common feature of multiple studies. However, similarity is only one facet of difficulty, and there are others that need to be identified and integrated into the generation process. Thus, the formulation of a theory behind an intelligent automatic question generator capable of both generating questions and accurately controlling their difficulty is at the heart of AQG research. Such control could also be used to improve the quality of generated questions by filtering out inappropriately easy or difficult ones, which is especially important given the large number of questions generated.

Enriching Question Forms and Structures

One of the main limitations of existing work is the simplicity of the generated questions, which has also been highlighted in Song and Zhao (2016b). Most generated questions consist of a few terms and target lower cognitive levels. While these questions are still useful, there is potential for improvement by exploring the generation of other, higher-order and more complex, types of questions.

Automating Template Construction

The template library is a major component of question generation systems. At present, the process of template construction is largely manual: templates are either developed through the manual analysis of a set of hand-written questions or through consultation with domain experts. While one of the main motivations for generating questions automatically is cost reduction, both of these template acquisition techniques are costly. In addition, there is no evidence that a set of templates defined by a few experts is typical of the set of questions used in assessments. We attribute part of the simplicity of current questions to the cost, in terms of both time and resources, of these template acquisition techniques.

The cost of generating questions automatically could be reduced further by automatically constructing templates. In addition, this would contribute to the development of more diverse questions.

Verbalisation

Employing natural language generation and processing techniques to present questions in natural and correct forms, and to eliminate errors that invalidate questions (such as syntactic clues), is an important step to take before questions can be used for assessment purposes beyond experimental settings.

Feedback Generation

As has been seen in both reviews, work on feedback generation is almost non-existent. Developing mechanisms for producing rich, effective feedback is one of the features that needs to be integrated into the generation process. This includes different types of feedback, such as formative, summative, interactive, and personalised feedback.

Improvement of Evaluation Methods

Using Human-Authored Questions for Evaluation

Evaluating question quality, whether by means of expert review or mock exams, is an expensive and time-consuming process. Analysing existing exam performance data is a potential means of evaluating question quality and difficulty prediction models. Translating human-authored questions into a machine-processable representation is a possible method for evaluating the ability of generation approaches to generate human-like questions. Regarding the evaluation of difficulty models, this can be done by translating questions into a machine-processable representation, computing the features of these questions, and examining their effect on difficulty. Such analysis also provides an understanding of pedagogical content knowledge (i.e. the concepts that students often find difficult and usually have misconceptions about). This knowledge can be integrated into difficulty prediction models, or used for question selection and feedback generation.

Standardisation and Development of Automatic Scoring Procedures

To ease the comparison of different generation approaches, which is currently difficult due to heterogeneity in measurement and reporting, ungrounded heterogeneity needs to be eliminated. The development of standard, well-defined scoring procedures is important for reducing heterogeneity and improving inter-rater reliability. In addition, developing automatic scoring procedures that correlate with human ratings is also important, since this would reduce both evaluation cost and heterogeneity.

Improvement of Reporting

We also emphasise the need for good experimental reporting. In general, authors should improve the reporting of their generation approaches and of their evaluations, both of which are essential for other researchers who wish to compare their approaches with existing ones. At a minimum, the data extracted in this review (refer to the questions under OBJ2 and OBJ3) should be reported in all publications on AQG. To ensure quality, journals can require authors to complete a checklist prior to peer review, which has been shown to improve reporting quality (Han et al. 2017). Alternatively, text-mining techniques can be used to assess reporting quality by targeting key information in the AQG literature, as has been proposed in Flórez-Vargas et al. (2016).

Other Areas of Improvement and Further Research

Assembling Exams from the Generated Questions

Although there is a large amount of work that needs to be done at the question level before moving to the exam level, further work on extending difficulty models, enriching question form and structure, and improving presentation constitutes steps towards this goal. Research in these directions will open new opportunities for AQG research to move towards assembling exams automatically from generated questions. One of the challenges in exam generation is the selection of a question set that is of appropriate difficulty and has good coverage of the material; it also needs to be ensured that questions do not overlap or provide clues to other questions. The AQG field could adopt ideas from the question answering field, in which question entailment has been investigated (for example, see the work of Abacha and Demner-Fushman (2016)). Finally, ordering questions in a way that increases motivation and maximises the accuracy of scores is another interesting area.
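As a simple illustration of the selection problem described above, the sketch below greedily assembles a fixed-size question set that keeps the running mean difficulty close to a target while preferring uncovered topics. It rests on simplifying assumptions of our own (each candidate question has a known difficulty in [0, 1] and covers a single topic) and omits the overlap and clue detection that real exam assembly would require.

    def assemble_exam(candidates, size, target_difficulty):
        """candidates: dicts with 'id', 'topic' and 'difficulty' keys.
        Greedy selection balancing target difficulty and topic coverage."""
        selected, covered = [], set()
        for _ in range(size):
            best, best_score = None, None
            for q in candidates:
                if q in selected:
                    continue
                mean = (sum(s["difficulty"] for s in selected) + q["difficulty"]) / (len(selected) + 1)
                # Penalise topics that are already covered.
                score = abs(mean - target_difficulty) + (0.5 if q["topic"] in covered else 0.0)
                if best_score is None or score < best_score:
                    best, best_score = q, score
            if best is None:
                break
            selected.append(best)
            covered.add(best["topic"])
        return selected

    pool = [{"id": 1, "topic": "anatomy", "difficulty": 0.3},
            {"id": 2, "topic": "anatomy", "difficulty": 0.7},
            {"id": 3, "topic": "physiology", "difficulty": 0.5},
            {"id": 4, "topic": "pharmacology", "difficulty": 0.9}]
    print([q["id"] for q in assemble_exam(pool, size=3, target_difficulty=0.5)])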

Mining Human-Authored Questions

While existing researchers claim that the questions they generate can be used for educational purposes, these claims are not generally supported. More attention needs to be given to the educational value of generated questions.

In addition to their potential use in evaluation, analyses of real, good-quality exams can help to provide insights into what questions need to be generated so that generation addresses real-life educational needs. They will also help to quantify the characteristics of real questions (e.g. the number of terms in real questions) and direct attention to what needs to be done, and where the focus should be, in order to move towards exam generation. Additionally, exam questions reflect what should be included in similar assessments, which, in turn, can inform content selection and the ranking of generated questions. For example, concepts extracted from these questions can inform the selection of existing textual or structured sources and help quantify whether or not their contents are of educational relevance.

Other potential advantages offered by the automatic mining of questions are the extraction of question templates, a major component of automatic question generators, and the improvement of natural language generation. Moreover, mapping the information contained in existing questions to an ontology permits the modification of these questions, the prediction of their difficulty, and the formation of theories about different aspects of questions, such as their quality.

Similarity Computation and Optimisation

A variety of similarity measures have been used in the context of QG to select content for questions, to select plausible distractors, and to control question difficulty (see the “Generation Tasks” section for examples). Similarity can also be employed to suggest a diverse set of generated questions (i.e. questions that do not entail the same meaning regardless of their surface structure). Improving the computation of similarity measures (i.e. their speed and accuracy) and investigating other types of similarity that might be needed for other question forms are complementary lines of work with direct implications for improving the current question generation process. Evaluating the performance of existing similarity measures against each other, and whether or not cheap similarity measures can approximate expensive ones, are further interesting research questions.
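As a simple illustration of how similarity can be used for distractor selection, the sketch below ranks candidate distractors by Jaccard overlap of their tokens with the key. The measure and the example terms are our own choices, deliberately of the cheap kind whose relationship to more expensive semantic measures the paragraph above suggests studying; it is not a method taken from a specific reviewed paper.

    def jaccard(a: str, b: str) -> float:
        """Jaccard similarity between the token sets of two strings."""
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

    def rank_distractors(key, candidates):
        """Rank candidate distractors by similarity to the key (most similar first);
        plausible distractors are often similar, but not identical, to the key."""
        scored = [(c, jaccard(key, c)) for c in candidates if c.lower() != key.lower()]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)

    print(rank_distractors("pulmonary embolism",
                           ["pulmonary fibrosis", "myocardial infarction", "pulmonary embolism"]))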

Source Acquisition and Enrichment

As we have seen in this review, structured knowledge sources have been a popular source for question generation, either by themselves or to complement texts. However, knowledge sources are not available for many domains, and those developed for purposes other than QG might not be rich enough to generate good quality questions; they therefore need to be adapted or extended before they can be used for QG. As such, investigating different approaches for building or enriching structured knowledge sources, and gaining further evidence for the feasibility of obtaining good quality knowledge sources, are crucial ingredients for their successful use in question generation.

Limitations

A limitation of this review is the underrepresentation of studies published in languages other than English. In addition, ten papers were excluded because of the unavailability of their full texts.