Introduction

The rapid growth of technology has meant that computer-based learning has increasingly integrated artificial intelligence techniques in order to develop more personalized educational systems, known as Intelligent Tutoring Systems (ITSs).

MetaTutorES (Cerezo et al., 2020a,b), a Spanish adaptation of MetaTutor (Azevedo et al., 2011), is an ITS designed to detect, model, trace, and foster students’ self-regulated learning while they learn various science topics (e.g., by modeling and scaffolding metacognitive monitoring, facilitating the use of effective learning strategies, and setting and coordinating relevant learning goals). The system uses human-like avatar technology that allows pedagogical agents to track student behavior and interact with students on that basis. Tracking students’ behavior is also a powerful research tool used to collect data on students’ cognitive, metacognitive, affective, and motivational processes deployed during learning (Azevedo et al., 2018; Greene & Azevedo, 2010; Taub et al., 2021). These different data sources can be fused and mined to reveal learning-related information such as student performance. In this regard, Educational Data Mining (EDM) and Learning Analytics (LA) can be applied to understand educational processes using information extracted from educational data, which is then used to improve the educational process and the quality of learning (Romero & Ventura, 2020).

One of the oldest and most commonly studied issues in EDM/LA is the prediction of learners’ performance. Predicting student learning achievement in ITSs using Multimodal Learning Analytics (MLA), which brings learning data from different sources together in a single analysis, remains a challenge (Blikstein & Worsley, 2016). MLA uses log files, gaze data, biosensors, interactions with videos, audio and digital documents, and any other relevant data source to measure and understand the learning process.

One important issue in MLA is how to combine, or fuse, the data extracted from various sources/modalities in order to provide a better, more comprehensive view of teaching–learning processes (Bogarín et al., 2018; Chango et al., 2021). The most common and simplest data fusion approach for combining all the data sources is to build a machine-learning classifier from the summary statistics produced from each of the data sources. An important task when fusing data is to reduce the dimensions of the variables/attributes and to identify the most fruitful feature sets. Feature selection algorithms are normally used in data fusion for classification problems in order to reduce the data dimensions and produce the best results (Jesus et al., 2016). Finally, classification ensembles have demonstrated very good results in predicting student academic performance from multimodal data sources (Adejo & Connolly, 2018).

In this paper we perform a classification task: predicting the value of a categorical/nominal attribute (the class, i.e., the student’s final knowledge status: Pass or Fail) based on other attributes (the predictive attributes from the various available data sources). We propose applying classification algorithms, feature selection algorithms, and ensembles to data gathered from a variety of sources (learning strategies from ITS logs, emotions from face recording videos, and interaction zones from eye tracking) in order to predict the students’ final performance in the ITS. In this sense, the ultimate contribution of this study is to analyze the learning process through these multimodal resources, allowing a more personalized response to each learner.

The research questions posed by this study are:

  • Question 1.- Can attribute selection and classification ensemble algorithms improve the prediction of students’ final performance from our ITS data?

  • Question 2.- How useful are the models produced and what are the best variables to help teachers understand how to predict students’ final performance in the ITS?

This paper is organized as follows. The first section covers the background of the related research area of MLA. Subsequently, we describe the proposed methodology, the data used, and how it was preprocessed. Then, we describe the experiments we performed and the results they produced. Finally, we discuss the implications, conclusions, and lines for future research.

Background

MLA aims to combine different sources of learning traces into a single analysis; it is a subfield of EDM related to multi-view and multi-relational data and data fusion. It aims to understand and optimize learning in digital environments, where the use of videos is now well established, from traditional courses to blended and online courses (Chan et al., 2020). MLA can generate distinctive insights into what happens when students create unique solution paths to problems, interact with peers, and act in both physical and digital environments. It has become increasingly broadly applied in both digital and real-world scenarios where interactions are not solely mediated through computers or digital devices (Blikstein & Worsley, 2016). In MLA, learning traces are extracted not only from log files but also from digital documents, recorded video and audio, pen strokes, position tracking devices, biosensors, and any other data sources that could be useful for understanding or measuring the learning process. Below, we describe the data sources used in the present study.

Learning strategies from ITS logs

There is empirical evidence about performance prediction through computer learning environment log data (Cerezo et al., 2016; Lerche & Kiel, 2018; Li & Tsai, 2017), including predicting performance in offline courses from logs of online behavior (Zhou et al., 2015). As computer-based learning environments, ITSs allow us to see what learning strategies users deploy while they are studying, and are part of a new trend in the measurement of learning in general, and self-regulated learning in particular (the so-called third wave), characterized by the combined use of measurement and Advanced Learning Technologies (Panadero et al., 2016; Winne & Azevedo, 2021). These performance analytics include data on the student’s performance and different learning metrics. Examples include completion time, successful or unsuccessful completion of assignments, speed of task resolution, the number of attempts or failures, and the complexity of the problem-solving process (Crescenzi-Lanna, 2020). All of these data are normally produced by the computer during the student’s interaction with the learning environment and are stored in databases or log files (Romero et al., 2008). This technology overcomes the limitations of self-report methodology, making it possible to detect, model, and trace students’ learning without interfering with student activity, because even though a huge amount of data is generated, it is processed automatically by the computer.

Interaction zones from eye tracking

Eye-tracking devices provide information that can be used to infer the student’s attention level, engagement, preference, or understanding. Eye tracking reveals what attracts immediate attention, which target elements are ignored, what order elements are noticed in, and how elements compare to one another (Cerezo et al., 2020a,b; Taub & Azevedo, 2019). In this sense, gaze data can provide very useful, accurate information for predicting student learning during interaction with ITSs (Bondareva et al., 2013), and multiple researchers have suggested that fixation durations are indicators of cognitive processing during learning (Antonietti et al., 2015).

There are different options for collecting eye-tracking data, such as saccade amplitude, direction changes, and fixations (Crescenzi-Lanna, 2020). In the current study, we are interested in analyzing fixations, particularly the number of fixations in areas that could be related to the learner’s final performance. For that purpose, we defined three Areas of Interest (AOIs) in our ITS interface: AOI1, the learning session timer; AOI2, the ITS agent/avatar; and AOI3, the supporting image/graphics content. These are areas of interest because, in terms of the interface configuration, fixations on AOI1 may denote time management or resource management strategies, while reduced or excessive fixations on AOI1 might indicate poor time management skills. Fixations on AOI2, the agent, would show that the participant is making use of the prompts and feedback provided by the agents during the learning session and has established an interaction with the agent. Fixations on AOI3 may point to participants using a strategy of coordinating information sources (text-images), associated with learning gains (Azevedo, 2009; Cerezo et al., 2020a,b).

Emotions from face recording videos

Emotions are a critical component of learning and problem solving, especially when it comes to interacting with computer-based learning environments (Harley et al., 2015), and there is a relationship between negative learning emotions and learning performance (Chen & Wang, 2011). In this context, studies from the affective computing literature suggest that facial expressions may be the best single method for accurately identifying emotional states (D’Mello & Kory, 2012). Techniques for automatic detection of emotions (Blanchard et al., 2009) are capable of isolating a learner’s mood via artificial intelligence facial recognition systems, and there are tools available that can process video data, such as the Microsoft Emotion API (2019), Face API (2019), and Affectiva (2019). Along these lines, including the learner’s emotional states may help enhance ITS quality and efficacy. Previous research has indicated that academic emotions are significantly related to students’ motivation, learning strategies, cognitive resources, self-regulation, and academic achievement (Pekrun et al., 2011).

In previous studies (Chango et al., 2020), student emotions as recognized by an API during a learning session with an ITS have been used as the sole data source for predicting the student’s final performance. The best models demonstrated a prediction accuracy of 63.82% and 0.67 AUC, figures that we aim to improve on by using more student features and variables from various multimodal data sources, together with ensembles and selection of the best attributes.

Proposal

The current study proposes a two-stage methodology for predicting students’ final performance from multimodal data (see Fig. 1).

Fig. 1

Proposed methodology for predicting students’ performance from multiple data sources

As Fig. 1 shows, the two main stages in our methodology are:

  • First stage. Collecting data from various sources: learning strategies from MetaTutorES logs, number of fixations from gaze data, and emotions from face recording videos. It also includes some pre-processing tasks (anonymization, attribute normalization and discretization, and format transformation) to generate numerical and categorical datasets.

  • Second stage. Using different data fusion approaches: merging all attributes, selecting the best attributes, and using ensembles of several white-box classification algorithms. Finally, the predictions produced by the models are compared in order to find the best model and attributes for predicting the students’ final performance.

Data

Data were collected from 40 undergraduates (mean age = 23.58; SD = 8.18; 17 men and 23 women) enrolled at a public university in the north of Spain. The undergraduates participated in the study voluntarily and learned about a complex science topic (the circulatory system) while interacting with the MetaTutorES ITS (Cerezo et al., 2020a,b), a computerized learning environment. The students in the sample were studying in a variety of different knowledge areas: education, psychology, economics, law, philosophy, nursing, telecommunication, electrical engineering, geomatics, physics, and civil navy. Most students in the sample were first-year undergraduates, but there were also second-years, third-years, and master’s students.

Gathering data

We gathered information from four ITS data sources: learning strategies from MetaTutorES logs, emotions from face videos, fixations from eye tracking, and performance from the content knowledge test. The data were produced spontaneously through interactions with the MetaTutorES ITS during a session lasting from two-and-a-half to three hours. The data collection for the study was developed and managed in line with the ethical research principles of the Declaration of Helsinki, and the protocol was approved by the research ethics committee of the Principality of Asturias and the University of Oviedo.

Learning strategies from MetaTutorES logs

Throughout each learning session, learner interaction with the ITS was logged in a file unique to each learner. The learning environment is made up of information in text, charts, and images, through which students learn about the circulatory system. The system logs each user action and interaction with the learning environment throughout the study. Each line of a log represents an event or participant action in the learning environment and contains the timestamp of the event, the triggered event, the identifier of the theoretical content that the learner is studying, and optional information related to that event.

For the present study, three variables were extracted from the log files (a parsing sketch is given after the list):

  • SummAll: the number of times the learner wrote a summary of the content they were studying, discarding events in which they did not add any new information. For example, after spending time reading the page about the role of the heart in the circulatory system, the user summarizes the reading.

  • COIStotalFreq: Coordinating Information Sources (e.g., drawings and text) is the number of times the learner enlarged the image associated with the content being studied for at least fifteen seconds. For example, the learner spends time studying the heart and opens the associated image.

  • PKAtotalFreq: Prior Knowledge Activation is the number of times that the learner, after navigating to previously unvisited content, wrote down their prior knowledge about the new content. This is a correlate of the student searching their memory for relevant prior knowledge either before beginning the task or during task performance. For example, the student opens a page and, before reading, writes everything they already know about the topic on that page.
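To make the derivation of these variables concrete, the following minimal sketch tallies the three frequencies from a single learner’s log. The tab-separated field layout, the file name, and the event names (SUMMARY, COIS_IMAGE_OPEN, PKA_WRITE) are illustrative assumptions, not the actual MetaTutorES log vocabulary.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Sketch: count the three learning-strategy events in one learner's log.
// Assumed line layout: timestamp <TAB> event <TAB> contentId <TAB> optionalInfo
// The event names below are placeholders, not the real MetaTutorES vocabulary.
public class LogStrategyCounter {

    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get("student_042.log"));
        int summAll = 0, coisTotalFreq = 0, pkaTotalFreq = 0;

        for (String line : lines) {
            String[] fields = line.split("\t");
            if (fields.length < 2) continue;               // skip malformed lines
            String event = fields[1];

            switch (event) {
                case "SUMMARY":          summAll++;       break; // non-empty summary written
                case "COIS_IMAGE_OPEN":  coisTotalFreq++; break; // image enlarged >= 15 s
                case "PKA_WRITE":        pkaTotalFreq++;  break; // prior knowledge activation
                default:                                  break; // other events are ignored here
            }
        }

        System.out.printf("SummAll=%d COIStotalFreq=%d PKAtotalFreq=%d%n",
                summAll, coisTotalFreq, pkaTotalFreq);
    }
}
```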

Emotions from face recording video

During the learning session, a video of the participant’s face was recorded using a webcam and subsequently analyzed using a desktop application. Each participant’s full session was recorded. The webcam on the computer was adjusted to the participant’s position at the beginning, and participants were asked to sit facing forward and remain as neutral as possible, although their facial expressions were expected to vary during the session. To obtain the best recording conditions, we asked participants to tie their hair back, make sure there was nothing around their neck, remove their glasses, and remove chewing gum if necessary.

The learning session videos were analyzed using the Microsoft Emotion API (2019) automatic facial recognition software. The API classifies facial expressions into eight emotion classes: anger, contempt, disgust, fear, happiness, neutral, sadness, and surprise. These emotions are understood to be cross-culturally and universally communicated with specific facial expressions (Arora et al., 2018). We developed a specific application to use the Microsoft Emotion API in local mode (see Fig. 2). Participants tended to experience all of the emotions the system detects during the session, but we were able to produce a general index for each participant giving information about the overall pattern. The analysis gave us at least one predominant emotion for each frame of the student video (1 frame per second), so there were a large number of frames for each student in every session. The confidence score (a value between 0 and 1) gives the likelihood of each emotion class.
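As an illustration of how such a general index can be computed, the sketch below averages frame-level confidences into one feature per emotion class for a participant. The CSV layout (a header row followed by one row of eight confidence values per frame) and the file name are assumptions made for the example, not the actual output format of our desktop application.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Sketch: turn frame-level emotion confidences (1 frame per second) into one
// per-participant feature vector (mean confidence per emotion class).
// Assumed CSV layout: header row with eight emotion columns, one row per frame.
public class EmotionAggregator {

    private static final String[] EMOTIONS = {
            "anger", "contempt", "disgust", "fear",
            "happiness", "neutral", "sadness", "surprise"};

    public static void main(String[] args) throws IOException {
        List<String> rows = Files.readAllLines(Paths.get("student_042_frames.csv"));
        double[] sums = new double[EMOTIONS.length];
        int frames = 0;

        for (int r = 1; r < rows.size(); r++) {              // skip the header row
            String[] values = rows.get(r).split(",");
            for (int e = 0; e < EMOTIONS.length; e++) {
                sums[e] += Double.parseDouble(values[e]);    // confidence in [0, 1]
            }
            frames++;
        }

        for (int e = 0; e < EMOTIONS.length; e++) {
            System.out.printf("%s mean confidence = %.3f%n", EMOTIONS[e], sums[e] / frames);
        }
    }
}
```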

Fig. 2

Examples of facial emotion recognition and classification (the left-hand column shows the emotion trend)

Interaction zones from eye tracking

Data from each learner were collected throughout the session using the screen-based eye tracker RED500 (https://imotions.com/hardware/smi-red500/). We used SMI’s BeGaze software to process the fixations on the learning environment AOIs. BeGaze performs the calculation automatically, identifying a fixation when a learner’s gaze remains on an AOI for at least 80 ms with a maximum dispersion of 100 px.
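For readers unfamiliar with dispersion-based fixation detection, the sketch below illustrates the underlying idea with the thresholds mentioned above (at least 80 ms within a 100 px dispersion window). It is a simplified, I-DT-style illustration, not SMI BeGaze’s actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal dispersion-threshold (I-DT style) fixation detection sketch.
// A fixation is reported when consecutive gaze samples stay within a 100 px
// dispersion window for at least 80 ms. Illustrative only; BeGaze's actual
// algorithm and parameters may differ in detail.
public class FixationDetector {

    static class Sample {
        long t; double x, y;
        Sample(long t, double x, double y) { this.t = t; this.x = x; this.y = y; }
    }

    static final double MAX_DISPERSION_PX = 100.0;
    static final long MIN_DURATION_MS = 80;

    static int countFixations(List<Sample> samples) {
        int fixations = 0;
        int start = 0;
        while (start < samples.size()) {
            int end = start;
            // Grow the window while the bounding-box dispersion stays below the threshold.
            while (end + 1 < samples.size()
                    && dispersion(samples, start, end + 1) <= MAX_DISPERSION_PX) {
                end++;
            }
            long duration = samples.get(end).t - samples.get(start).t;
            if (duration >= MIN_DURATION_MS) {
                fixations++;
                start = end + 1;   // consume the whole fixation window
            } else {
                start++;           // slide forward by one sample
            }
        }
        return fixations;
    }

    // Dispersion of samples[from..to] as (maxX - minX) + (maxY - minY).
    static double dispersion(List<Sample> s, int from, int to) {
        double minX = Double.MAX_VALUE, maxX = -Double.MAX_VALUE;
        double minY = Double.MAX_VALUE, maxY = -Double.MAX_VALUE;
        for (int i = from; i <= to; i++) {
            minX = Math.min(minX, s.get(i).x); maxX = Math.max(maxX, s.get(i).x);
            minY = Math.min(minY, s.get(i).y); maxY = Math.max(maxY, s.get(i).y);
        }
        return (maxX - minX) + (maxY - minY);
    }

    public static void main(String[] args) {
        List<Sample> demo = new ArrayList<>();
        for (long t = 0; t <= 200; t += 20) demo.add(new Sample(t, 500 + t % 3, 300));
        System.out.println("Fixations detected: " + countFixations(demo)); // prints 1
    }
}
```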

For the present study, we extracted three variables related to learner fixations on three AOIs (see Fig. 3):

  • AOI1 Learning session timer: the number of times the learner focused their attention on the area showing the time left in the learning session. Fixations here may denote time management or resource management strategies, while reduced or excessive fixations might indicate poor time management skills.

  • AOI2 ITS agent/avatar: the number of times the learner focused their attention on the area where the pedagogical agents appear. This variable may show that the participant is taking advantage of the prompts and feedback provided by the agents in response to participants’ goals, behaviors, self-evaluations, and progress. However, it must be interpreted carefully, because learners may not always need to look at an agent to process its audio prompts and feedback (Bondareva et al., 2013; Lallé et al., 2021).

  • AOI3 Images/graphics supporting content: the number of times the learner focused their attention on the area covered by the images related to the learning session contents. This variable may indicate integration of text and images, contributing to information processing (Mason et al., 2013).

Fig. 3

Map of areas of interest (AOIs) in the ITS

Final grade from test/quiz

During the session and at the end of the session, each subject was tested on the learning content, giving a final performance score between 0 and 10, with 10 being the highest performance. There was a pretest of prior knowledge of the content at the beginning of the session, and a multiple-choice posttest of domain knowledge whose score was corrected for pretest performance.

Preprocessing data

We preprocessed all of the data, which had been exported to Excel files (Romero et al., 2014). Firstly, the data were anonymized; then the input attributes were normalized/rescaled; next, the output attribute and the input attributes were discretized; and finally the format was transformed.

Anonymizing

Student anonymity and privacy were maintained, while the information in the four Excel files remained linked to the same subject using an anonymized code. We implemented a basic solution, using a randomly generated number as a user ID rather than the users’ names, and replaced the students’ names with this ID in the four Excel files.

Normalizing

We adjusted all of the input values, which used different scales, to a single common scale. This was necessary because the original values had a variety of ranges. Normalization is a data transformation where the attribute values are scaled so as to fall within a specified range, such as −1.0 to 1.0, or 0.0 to 1.0. Normalization helps to prevent attributes with large ranges from outweighing attributes with smaller ranges. In this case we rescaled/normalized all of the input attribute values to the same range [0, 1] by using the well-known Min–Max method, which is a linear transformation of the original data using the formula \(Z_i = \frac{X_i - \min(X)}{\max(X) - \min(X)}\), where \(X = (x_1, \ldots, x_n)\) and \(Z_i\) is the ith normalized value.
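A minimal sketch of this rescaling step is shown below using WEKA’s unsupervised Normalize filter, which maps every numeric attribute to [0, 1] by default. The input file name is illustrative, and this is one possible way to implement the step rather than a record of our exact procedure.

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Normalize;

// Sketch: Min-Max rescaling of all numeric input attributes to [0, 1]
// with WEKA's Normalize filter. The file name is illustrative.
public class NormalizeInputs {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("merged_numerical.csv");
        data.setClassIndex(data.numAttributes() - 1);     // last column = PASS/FAIL class

        Normalize minMax = new Normalize();               // default scale 1.0, translation 0.0
        minMax.setInputFormat(data);
        Instances rescaled = Filter.useFilter(data, minMax);

        System.out.println(rescaled.toSummaryString());
    }
}
```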

Discretizing

Discretization divides numerical data into categorical classes that are more user-friendly than precise magnitudes and ranges. It reduces the number of possible values of the continuous feature and provides a view of the data that is easier to understand. Generally, discretization smooths out the effect of noise and enables simpler models, which are less prone to overfitting. We discretized all the input attributes in order to have the same variables in both numerical and categorical formats. To do that, we used equal-width binning with the following 3 bins: LOW, MEDIUM and HIGH. Equal-width binning divides the range of possible values into N sub-ranges of the same size in which: bin_width = (max value–min value)/N.
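The sketch below illustrates equal-width binning with WEKA’s unsupervised Discretize filter using 3 bins. WEKA labels the resulting bins with their numeric ranges, so renaming them LOW/MEDIUM/HIGH would be a cosmetic follow-up step; the file name is illustrative.

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;

// Sketch: equal-width binning of all numeric input attributes into 3 bins
// with WEKA's unsupervised Discretize filter.
public class DiscretizeInputs {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("merged_numerical.csv");
        data.setClassIndex(data.numAttributes() - 1);     // last column = PASS/FAIL class

        Discretize equalWidth = new Discretize();
        equalWidth.setBins(3);                            // bin_width = (max - min) / 3
        equalWidth.setUseEqualFrequency(false);           // equal-width, not equal-frequency
        equalWidth.setInputFormat(data);
        Instances discretized = Filter.useFilter(data, equalWidth);

        System.out.println(discretized.attribute(0));     // show the three generated bins
    }
}
```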

We also discretized the output attribute or class to predict (the students’ final performance or status). We used a manual discretization with the user directly specifying cut-off points. In our case, the class had the following 2 values and cut-off points:

  • PASS: Students who scored 5 out of 10 or better in the performance tests. In our case, this was 21 out of 40 students (52.50%).

  • FAIL: Students who scored less than 5 out of 10 in the performance tests. In our case, this was 19 out of 40 students (47.50%).

Transforming

Finally, we converted the files from Excel to CSV (comma-separated values) files. CSV is a delimited text format that uses a comma to separate values; each line of the file is a data record, and each record consists of one or more fields separated by commas. We transformed each of the two versions of the four Excel files (numerical and categorical values) into CSV files because these can be directly opened and used by WEKA (Waikato Environment for Knowledge Analysis), the data mining framework we used in the experiments to predict student performance (Witten et al., 2011). WEKA provides a collection of algorithms for data analysis and predictive modeling, together with graphical user interfaces for easy access to these functions.
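A minimal sketch of opening one of the resulting CSV files with WEKA’s CSVLoader and declaring the class attribute follows; the file name is illustrative.

```java
import java.io.File;
import weka.core.Instances;
import weka.core.converters.CSVLoader;

// Sketch: load an exported CSV file into WEKA and declare the last column
// (the PASS/FAIL status) as the class attribute.
public class LoadCsvIntoWeka {
    public static void main(String[] args) throws Exception {
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("merged_numerical.csv"));   // illustrative file name
        Instances data = loader.getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        System.out.println(data.numInstances() + " students, "
                + (data.numAttributes() - 1) + " input attributes");
    }
}
```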

Experiments

We carried out three different experiments using three different approaches and six classification algorithms with the preprocessed numerical and discretized data to predict student performance in the ITS (See Fig. 4).

Fig. 4

Visual description of the experiments

We used two types of white-box classification models: rule induction algorithms and decision trees. The models produced by these algorithms (IF–THEN rules and decision trees) are simple and clear, and so are easy for humans to understand. IF–THEN classification rules provide a high-level knowledge representation that is used for decision making, and decision trees can also be converted into a set of IF–THEN classification rules. In our experiments, we selected six well-known classification algorithms integrated in the WEKA data mining tool (Witten et al., 2011): three decision tree algorithms (J48, REPTree and RandomTree) and three rule induction algorithms (JRip, NNge and PART). We executed these algorithms using k-fold cross-validation (k = 10), with Accuracy and Area under the ROC curve as evaluation metrics for classification (a sketch of this evaluation procedure follows the metric definitions below):

  • Accuracy (ACC) is the most commonly used traditional method for evaluating classification algorithms. It provides a single-number summary of performance. In our case, it is obtained by the equation \({\text{Acc}} = \frac{\text{Number of students correctly classified}}{\text{Total number of students}}\). This metric shows the percentage of correctly classified students.

  • Area under the ROC curve (AUC) measures the two-dimensional area underneath the entire Receiver Operating Characteristic (ROC) curve. The ROC curve allows us to identify potentially optimal models and discard suboptimal ones. AUC is often used when the goal of classification is to obtain a ranking, because ROC curve construction requires a ranking to be produced.
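The sketch below illustrates this evaluation procedure: 10-fold cross-validation of the white-box classifiers on one of the summary datasets, reporting accuracy and (class-weighted) AUC. The file name is illustrative, and NNge is omitted because recent WEKA releases distribute it as a separate package.

```java
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.JRip;
import weka.classifiers.rules.PART;
import weka.classifiers.trees.J48;
import weka.classifiers.trees.REPTree;
import weka.classifiers.trees.RandomTree;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Sketch: 10-fold cross-validation of the white-box classifiers, reporting
// accuracy and (weighted) area under the ROC curve.
public class CrossValidateClassifiers {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("merged_numerical.csv");
        data.setClassIndex(data.numAttributes() - 1);     // PASS/FAIL class

        Classifier[] classifiers = {
                new J48(), new REPTree(), new RandomTree(), new JRip(), new PART()};

        for (Classifier c : classifiers) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(c, data, 10, new Random(1));
            System.out.printf("%-12s Acc = %.2f%%  AUC = %.2f%n",
                    c.getClass().getSimpleName(),
                    eval.pctCorrect(), eval.weightedAreaUnderROC());
        }
    }
}
```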

Experiment 1: merging all attributes

In Experiment 1 we applied the classification algorithms to a single file with the attributes of the three different data sources merged. We created two CSV files, one numerical and one discrete/categorical. Each dataset had fifteen input attributes (in numerical or discrete format) and one output attribute or class. Finally, we executed the six classification algorithms on the two summary datasets, producing the results (% Accuracy and ROC Area) shown in Table 1.
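As a sketch of this merging step, WEKA’s Instances.mergeInstances can join the attribute sets of per-source files side by side. This assumes the files list the 40 students in the same row order and that only the last file carries the class column; the file names are illustrative.

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Sketch: merge the attributes of the three per-source files side by side.
// Assumes identical row order across files and the class column only in the last file.
public class MergeAllAttributes {
    public static void main(String[] args) throws Exception {
        Instances logs     = DataSource.read("logs_numerical.csv");
        Instances emotions = DataSource.read("emotions_numerical.csv");
        Instances gaze     = DataSource.read("gaze_numerical.csv");   // includes PASS/FAIL column

        Instances merged = Instances.mergeInstances(
                Instances.mergeInstances(logs, emotions), gaze);
        merged.setClassIndex(merged.numAttributes() - 1);

        System.out.println((merged.numAttributes() - 1) + " input attributes after merging");
    }
}
```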

Table 1 Results produced by merging all attributes

Table 1 shows that the best results (highest values) were produced by the PART (80.00% Acc) and J48 (80.00% Acc and 0.80 AUC) algorithms with numerical data. In fact, on average, most of the algorithms exhibited slightly improved performance in both measures when using numerical data.

Experiment 2: selecting the best attributes

In Experiment 2, we applied the classification algorithms to a single file with only the best attributes. Firstly, we applied attribute selection algorithms to the summary files from Experiment 1 in order to eliminate redundant or irrelevant attributes. We used the well-known CfsSubsetEval (Correlation-based Feature Selection) method provided by the WEKA tool. This method selects the features that are most strongly correlated with the class. Starting from our initial 15 input attributes, we produced two sets of optimal attributes: 2 for the numerical datasets and 5 for the discretized datasets (see Table 2).
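A minimal sketch of this attribute selection step with WEKA’s CfsSubsetEval is shown below. The paper names only the evaluator, so the BestFirst search strategy (WEKA’s usual companion for CfsSubsetEval) is an assumption, as is the file name.

```java
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Sketch: correlation-based feature subset selection (CfsSubsetEval) with a
// BestFirst search, keeping only the attributes most correlated with the class.
public class SelectBestAttributes {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("merged_numerical.csv");
        data.setClassIndex(data.numAttributes() - 1);

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new CfsSubsetEval());
        selector.setSearch(new BestFirst());
        selector.SelectAttributes(data);

        System.out.println(selector.toResultsString());        // which attributes were kept
        Instances reduced = selector.reduceDimensionality(data);
        System.out.println((reduced.numAttributes() - 1) + " input attributes remain");
    }
}
```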

Table 2 Results of the attribute selection with CfsSubsetEval

Following that, we executed the six classification algorithms with the two new summary datasets, producing the results (%Accuracy and ROC Area) shown in Table 3.

Table 3 Results obtained when selecting the best attributes

Table 3 shows that the best results (highest values) were produced by the RandomTree algorithm (82.50% Acc and 0.82 AUC). Again, on average, most of the algorithms exhibited slightly improved performance in both measures when using numerical data.

Experiment 3: using ensembles and selecting the best attributes

In Experiment 3 we applied an ensemble of classification algorithms to the best attributes from each different data source. Firstly, we selected the best attributes for each of the three different datasets, again using the well-known CfsSubsetEval attribute selection algorithm. This gave the list of attributes shown in Table 4.

Table 4 Results of attribute selection with CFSSubsetEval

Following that, we applied an ensemble, or combination, of multiple classification base models by using the well-known Vote meta-classifier (Kuncheva, 2014) provided by WEKA, which automatically combines several machine learning algorithms. Vote combines the probability distributions of these base learners. It produces better results than individual classification models if the constituent classifiers are accurate and diverse, and it has demonstrated better results than homogeneous models on standard datasets.
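The sketch below shows one way to configure such a Vote ensemble in WEKA, averaging the class-probability distributions of the white-box base learners and evaluating it with 10-fold cross-validation. The combination rule, the file name, and the omission of NNge (distributed as a separate package in recent WEKA releases) are assumptions of the example.

```java
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.Vote;
import weka.classifiers.rules.JRip;
import weka.classifiers.rules.PART;
import weka.classifiers.trees.J48;
import weka.classifiers.trees.REPTree;
import weka.classifiers.trees.RandomTree;
import weka.core.Instances;
import weka.core.SelectedTag;
import weka.core.converters.ConverterUtils.DataSource;

// Sketch: a Vote ensemble that averages the class-probability distributions of
// the white-box base learners, evaluated with 10-fold cross-validation.
public class VoteEnsemble {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("selected_numerical.csv");  // per-source best attributes
        data.setClassIndex(data.numAttributes() - 1);

        Vote vote = new Vote();
        vote.setClassifiers(new Classifier[]{
                new J48(), new REPTree(), new RandomTree(), new JRip(), new PART()});
        vote.setCombinationRule(new SelectedTag(Vote.AVERAGE_RULE, Vote.TAGS_RULES));

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(vote, data, 10, new Random(1));
        System.out.printf("Vote ensemble: Acc = %.2f%%  AUC = %.2f%n",
                eval.pctCorrect(), eval.weightedAreaUnderROC());
    }
}
```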

We executed the six classification algorithms as base or individual classification models of our Vote method with the previously described numerical and discretized datasets. Table 5 shows the results (%Accuracy and ROC Area).

Table 5 Results from using ensembles and selecting the best attributes

Table 5 shows that the best results (highest values) were produced by REPTree (87.50% Acc and 0.88 AUC). On average, most of the algorithms again exhibited slightly improved performance in both measures when using numerical data.

Discussion

Below, we address the two initial research questions by discussing the results from our three experiments.

Question 1

  • Can attribute selection and classification ensemble algorithms improve the prediction results of student final performance from our ITS data?

We used three different data fusion approaches and six white-box classification algorithms to answer this question. Table 6 shows that the average prediction performance (Average of % Accuracy and AUC) of the classification algorithms increased with each new approach.

Table 6 Average results from the three data fusion approaches

We first applied a traditional approach, merging all the attributes from the different data sources directly. This initial approach gave reasonable results (accuracy higher than 70% and AUC higher than 0.7) from numerical data. Our second approach selected the best attributes for each dataset. This was an improvement on the first approach (79% accuracy and 0.8 AUC). Finally, the third approach improved on the second and gave the best results, using ensembles and selection of the best attributes (82% accuracy and 0.87 AUC). In all the approaches, the average values were higher when using numerical data than when using discretized data.

However, we were unable to find a single best algorithm that won in all cases in our experiments. This is logical and in line with the No-Free-Lunch theorem (Wolpert, 2002), which holds that no single supervised learning algorithm can beat all others over all possible learning problems or datasets. In the first experiment, the algorithm that produced the highest prediction values was J48 (80.00% Acc and 0.80 AUC); in the second experiment it was RandomTree (82.50% Acc and 0.82 AUC); and in the third experiment, REPTree produced the highest prediction values (87.50% Acc and 0.88 AUC) when using an ensemble and selection of the best attributes from the numerical data.

Question 2

  • How useful are the models produced and what are the best variables to help teachers understand how to predict students’ final performance in the ITS?

To answer this question, we describe the meaning of the prediction model that produced the highest values of Accuracy and AUC in each of our three experiments.

In Experiment 1, the best prediction model was produced by the J48 algorithm using numerical data (see Table 7).

Table 7 J48 decision tree produced when merging all attributes

This prediction model (see Table 7) has 4 rules. The first rule shows that the students who have scores higher than 0.25 in SummAll in MetaTutorES PASS the course. The second rule shows that if students have a score lower than 0.25 in SummAll in MetaTutorES and a surprise emotion lower than 0.06, then they FAIL the course. The third rule shows that if students have a surprise emotion higher than 0.06 and a value of AOI2FixCount lower than 0.04 in the pedagogical agent zone, then they PASS the course. Finally, the remaining students are classified as FAIL.

In Experiment 2, the prediction model that produced the highest prediction values used the RandomTree algorithm with numerical data (see Table 8).

Table 8 Randomtree pruned tree produced when selecting the best attributes

This prediction model (see Table 8) consists of 7 IF–THEN rules. In all these rules, the two most frequent attributes are the use of summary strategies (SummAll) and the frequency of use of the coordination of information sources strategy (COIStotalFreq). It is also important to note that in this model the predictions of students passing or failing were not influenced by any emotions or interaction zones.

In Experiment 3, the prediction model that produced the highest prediction values used the REPTree algorithm with numerical data (see Table 9).

Table 9 RepTree decision trees produced using ensembles with selecting the best attributes

This prediction model (see Table 9) is a combination of three different models showing that the frequency of summary strategies, the proportion of fixations on AOI3 (images/graphics supporting content) over the total session, and the surprise emotion are the most important attributes in predicting whether students PASS or FAIL. Students who interact with the ITS with a value higher than 0.03 in the SummAll variable, students who have a proportion of fixations on AOI3 over the total session higher than 0.29, and students who have a surprise emotion value higher than 0.05 are predicted to PASS the course; in the other cases, they are predicted to FAIL.

These results are not surprising considering that Summarizing and Content Coordination of Information Sources are classical strategies that contribute to students taking a strategic approach (Cerezo et al., 2020a,b), and positive emotions such as surprise, enjoyment and happiness are thought to promote motivation, facilitating use of flexible learning strategies, and supporting self-regulation of learning (Pekrun et al., 2011), all of which presumably promote better performance.

Conclusions

This paper proposes the use of ensembles and attribute selection for improving the prediction of students’ performance from multimodal data in an ITS. We collected and preprocessed data from 40 university students from three different sources: learning strategies from MetaTutorES logs, emotions from face recording videos, and interaction zones from gaze data, along with marks from a performance test on the learning content. We carried out three experiments in order to answer two research questions:

  • Can attribute selection and classification ensemble algorithms improve the prediction of students’ final performance from our ITS data? Yes; the approach using ensembles and selection of the best attributes on numerical data produced the best results in terms of Accuracy and AUC values, and the REPTree classification algorithm produced the best results.

  • How useful are the models produced and what are the best variables to help teachers understand how to predict students’ final performance in the ITS? The white-box models we produced give teachers understandable explanations (IF–THEN rules) of how they arrived at their classifications of student performance. They showed that the attributes that appeared most in these rules were logs denoting use of Summarizing strategies and Coordination of Information Sources (SummAll and COIStotalFreq) from the ITS logs, paying attention to avatars and to images/graphics supporting text content (AOI2 and AOI3) from gaze data, and surprise from emotions.

The implications of the current study extend to web-based ITSs and Web-based Adaptive Educational Systems. If data is captured from different data sources, the classifier ensemble methodology proposed in this study could make better, earlier performance predictions than the single data source models that are commonly used at present.

As the next step, we intend to investigate and perform new experiments with the aim of improving our results and in order to overcome some limitations:

  • Adding different variables/attributes from the multimodal student interaction with the ITS, such as think-aloud data, self-report data, and/or physiological measures. In the context of multimodal data, classical self-report methodology remains valuable; aspects such as the achievement emotions experienced by students, students’ learning goals and approaches, self-esteem, and epistemological beliefs may help to improve the prediction results. For instance, previous studies have shown that visual metrics (e.g., fixation rate, longest fixations) are significantly influenced by students’ goals, so this could be applied to ITS design so that it adapts better to students’ learning goals (Lallé et al., 2017). We could also use EEG (electroencephalography), ECG (electrocardiography), EMG (electromyography), EDA (electrodermal activity), sitting posture, and similar signals in order to produce more accurate predictions of students’ performance.

  • Taking into account recent evidence that emotions co-occur during learning in MetaTutor (Lallé et al., 2021). This should be considered in future research, since emotions in ITSs, as in the present work, are often studied as single affective states.

  • We would also like to use additional classifier algorithms, particularly deep learning, which could perform significantly better than classic methods.

  • Using raw data and other specific data fusion techniques. We used a basic fusion method that uses summary data. However, there are other data fusion theories and methods such as Probability-based methods (PBM) and Evidence reasoning methods (EBM) that we can use with raw data. We could also use semantic (abstract) level features in order to produce intelligent data aggregation.

  • We are also aware of the limited generalizability of the results. The next step would be applying the current proposal in other learning systems such as Learning Management Systems (LMSs) or Personal Learning Environments (PLEs). This would allow us to compare results in different learning contexts and with a greater diversity of subjects.