Enhancing Automated Scoring of Math Self-Explanation Quality Using LLM-Generated Datasets: A Semi-Supervised Approach

Nakamoto, Ryosuke; Flanagan, Brendan; Yamauchi, Taisei; Dai, Yiling; Takami, Kyosuke; Ogata, Hiroaki

doi:10.3390/computers12110217

Open AccessEditor’s ChoiceArticle

Enhancing Automated Scoring of Math Self-Explanation Quality Using LLM-Generated Datasets: A Semi-Supervised Approach

by

Ryosuke Nakamoto

^1,*

,

Brendan Flanagan

^2,*

,

Taisei Yamauchi

¹,

Yiling Dai

³

,

Kyosuke Takami

^3,4 and

Hiroaki Ogata

³

¹

Graduate School of Informatics, Kyoto University, Kyoto 606-8501, Japan

²

Center for Innovative Research and Education in Data Science, Institute for Liberal Arts and Sciences, Kyoto University, Kyoto 606-8501, Japan

³

Academic Center for Computing and Media Studies, Kyoto University, Kyoto 606-8501, Japan

⁴

Education Data Science Center, National Institute for Educational Policy Research, Tokyo 100-8951, Japan

^*

Authors to whom correspondence should be addressed.

Computers 2023, 12(11), 217; https://doi.org/10.3390/computers12110217

Submission received: 21 August 2023 / Revised: 25 September 2023 / Accepted: 10 October 2023 / Published: 24 October 2023

(This article belongs to the Special Issue Recent Advances in Computer-Assisted Learning)

Download

Browse Figures

Versions Notes

Abstract

:

In the realm of mathematics education, self-explanation stands as a crucial learning mechanism, allowing learners to articulate their comprehension of intricate mathematical concepts and strategies. As digital learning platforms grow in prominence, there are mounting opportunities to collect and utilize mathematical self-explanations. However, these opportunities are met with challenges in automated evaluation. Automatic scoring of mathematical self-explanations is crucial for preprocessing tasks, including the categorization of learner responses, identification of common misconceptions, and the creation of tailored feedback and model solutions. Nevertheless, this task is hindered by the dearth of ample sample sets. Our research introduces a semi-supervised technique using the large language model (LLM), specifically its Japanese variant, to enrich datasets for the automated scoring of mathematical self-explanations. We rigorously evaluated the quality of self-explanations across five datasets, ranging from human-evaluated originals to ones devoid of original content. Our results show that combining LLM-based explanations with mathematical material significantly improves the model’s accuracy. Interestingly, there is an optimal limit to how many synthetic self-explanation data can benefit the system. Exceeding this limit does not further improve outcomes. This study thus highlights the need for careful consideration when integrating synthetic data into solutions, especially within the mathematics discipline.

Keywords:

self-explanation; automated scoring; semi-supervised learning; language learning model (LLM); data augmentation

1. Introduction

The emergence of digital learning platforms has opened a plethora of opportunities for researchers to investigate and comprehend learning behaviors through abundant system interaction data [1]. A notable area of interest among various learning facets is self-explanation, identified as a robust active learning technique. This strategy has been particularly effective in bolstering comprehension in subjects like mathematics [1,2,3]. Self-explanation can be described as a mechanism where learners articulate explanations, elucidate concepts, expand on methods, and immerse in problem-solving to enhance their grasp and absorb fresh insights [4,5].

With the proliferation of computer-driven learning platforms, self-explanation has gained renewed attention and application. Contemporary learning innovations place a premium on self-explanation by crafting intuitive interfaces, creating assessment models rooted in self-explanation behaviors, and formulating tactics to extract profound self-explanations [6,7]. Tools like the one formulated by Crippen and Earl [8] highlight the centrality of self-explanation in methodical problem-solving. Ongoing studies persistently explore the versatile applications of self-explanation in education [9,10], like the adoption of template-driven self-explanations. Such templates equip learners with pre-set frameworks, serving as built-in guides to bolster their explanation processes [1,11,12].

Furthermore, the domain of self-explanation practices reaches beyond traditional boundaries. These methods involve understanding a concept and facilitating multiple educational tools such as feedback systems, crafting practice quiz responses, and generating valuable datasets for automated evaluations [10]. Within this framework, automated assessments play a pivotal role. By analyzing and interpreting self-explanations, both educators and automated systems can delve deeper into the intricacies of a learner’s thought patterns. This knowledge equips them with the capability to tailor educational strategies to better cater to individual needs. Such insights are crucial for tasks like classifying learner responses, which provide a clear view of their comprehension levels. Additionally, they allow for the easy detection of recurring mistakes or topics that consistently stump students [9,10].

However, devising a system capable of automatically grading diverse styles of self-explanation is challenging. A major concern is that self-explanation, due to its time-intensive nature [13], poses a feasibility issue for mass data collection. Additionally, crafting a quality self-explanation requires proficiency in both the specific subject matter and general writing [14,15]. Given these challenges, amassing a vast and diverse collection of self-explanation samples is demanding. This predicament further complicates the development of systems designed to aid learning through extensive sets of self-explanation examples.

To address these challenges and enhance automated scoring of self-explanations, we propose a semi-supervised approach that leverages the LLM. While popular LLMs such as OpenAI’s GPT-3 [16] are commonly employed in English language settings, due to the nature of the target problem we specifically focus on the Japanese variant of the model developed by CyberAgent [17], based on GPT-NeoX [18]. This approach aims to explore the model’s potential for generating self-explanation sentences, which will serve as the foundation for our regression models designed to predict self-explanation scores. By incorporating the semi-supervised methodology and leveraging advanced language models, we aim to improve the accuracy and effectiveness of auto-scoring in the self-explanation learning domain. Our research is anchored by two pivotal questions:

RQ1: To what extent can the integration of self-explanations generated by the LLM Japanese model and mathematical material be used to enhance the accuracy of the predictive regression model for self-explanation scores?
RQ2: What is the optimal quantity of artificially generated pseudo-self-explanation data required to effectively improve the predictive performance of the model?

These research questions provide insights into maximizing the utility of the LLM Japanese model and refining data augmentation techniques. The core findings from our research are twofold. First, we propose a strategy for advancing automated scoring in math education by synergizing LLM-generated content and mathematical material. Second, we highlight the ideal quantity of artificial self-explanation data for peak predictive accuracy.

2. Related Work

2.1. Automated Scoring of Self-Explanations: The Imperative for Rich Data

Self-explanation, widely recognized for amplifying learning outcomes in various fields, notably mathematics, has found its stride in the digital learning environment [1,4,5]. Emblematic tools like the iSTART tutoring system have been devised to foster and elevate learners’ grasp and performance [19]. Such platforms urge students to think critically, mirroring the analytical strategies of experts. Notably, the iSTART system utilizes natural language processing (NLP) in its pioneering approach to gauge and rate self-explanations, bolstering understanding across a gamut of texts.

The endeavor to automate the scoring of self-explanation quality has seen the integration of NLP tools and cutting-edge neural network architectures [20]. Techniques like latent semantic analysis (LSA) and recurrent neural network (RNN) interfaced with machine learning underscore the capabilities of automated systems, often outshining traditional manual evaluation in both effectiveness and efficiency [14,20,21,22,23,24]. Furthermore, semi-supervised learning techniques, which capitalize on abundant unlabeled data, have exhibited the potential to refine scoring accuracy [25]. Yet, the quest for more representative samples of self-explanations, especially in languages other than English, remains a prevailing challenge.

2.2. Augmenting Mathematical Self -Explanations Using Large Language Models

Consistent data shortages and imbalances have long impeded automated classification. Techniques like the synthetic minority oversampling technique (SMOTE), adaptive synthetic (ADASYN), and its derivatives aim to counteract these with synthetic data generation [26,27,28,29], but their effectiveness wanes for scant datasets. Although a variety of synthetic data tools and research exist [30,31,32,33,34,35], their applicability is often curtailed when facing complex domain-specific data, particularly in predicting student outcomes [35,36,37]. Navigating this landscape, large language models (LLMs) have made a compelling case as tools for text data augmentation. For instance, Dai et al.’s AugLLM [38] capitalizes on ChatLLM [16] to generate supplemental text, emphasizing the rising importance of LLMs in the realm of mathematical content [39].

However, in the context of mathematical education, the accurate auto-scoring of self-explanations presents stark challenges. Crafting these self-explanations is not just labor-intensive but demands expertise in both the subject and linguistic expression. Coupled with the lack of diverse and representative self-explanation samples, particularly in languages other than English, the task becomes even more daunting.

This paper aims to navigate these challenges by adopting a semi-supervised learning technique anchored by the Japanese variant of LLM. Our objective is twofold: to generate enriched self-explanation content and to amplify the accuracy of automated grading tailored for Japanese mathematical education. The emphasis here is a meticulous alignment with the intricacies of the Japanese language, paired with the nuances of mathematical challenges. In essence, our work hopes to bridge the existing data gap, presenting a more robust and linguistically attuned auto-scoring system.

3. Problem Setting: The Learning Task

In this chapter, we primarily introduce the original human-labeled data, which serves as a foundation for the subsequent pseudo-labeling of the unlabeled samples, thereby bolstering the training process. Prior to delving into methodological details, it is pivotal to define the distinct learning task under examination, which underpins our methodological foundation.

3.1. Collecting Self-Explanations

Self-explanations from learners are gathered via online platforms, as represented in Figure 1. The scope of this approach includes diverse mathematical challenges or quizzes that require written elaboration. We utilized the LEAF platform [40], composed of BookRoll (a digital reading application), and LAViEW (a tool for learning analytics), enabling students and teachers to monitor and reflect on their educational progress. This platform, having been successfully implemented in a Japanese secondary school for several years, captures handwritten responses in vector form, portraying the precise coordinates and velocity of each pen stroke.

The learners interacted with the quiz and recorded their answers using a tablet computer, employing a stylus for handwriting. As shown in Figure 1, the handwritten answer playback and self-explanation input process require students to input an explanation sentence after completing a step of their answer during playback.

3.2. Assessment of Self-Explanation Quality

Self-explanations in our study were assessed based on three main criteria: coherence, clarity, and relevance. Specifically, ‘coherence’ gauges the logical flow of the explanation, ‘clarity’ measures its understandability, and ‘relevance’ ensures the inclusion of all pertinent knowledge concepts and procedural elements. For consistent evaluation, we adapted the rubric and scoring definitions from Nakamoto et al. [10], as depicted in Table 1 and Table 2, which are well-suited for tasks with varied solutions or strategies [41]. Instead of a detailed sentence-by-sentence breakdown, our approach evaluates explanations on a holistic, quiz-by-quiz basis, offering a comprehensive insight into the learner’s understanding of the topic.

For the evaluation process, two independent evaluators employed these rubrics to rate the collected 2205 self-explanations, scoring them on a scale ranging from 1 to 5. A quadratic weighted Cohen’s kappa coefficient [42] of 0.749 between the evaluators indicated a significant level of agreement. The subsequent analysis used the mean score derived from both evaluators, which categorized the self-explanation scores for a roughly uniform distribution. Descriptive statistics of the collected self-explanations are presented in Table 3.

In anticipation of the machine learning methodologies outlined in the subsequent chapters, our dataset was segmented into three distinct categories. The train dataset, which incorporates 1420 self-explanations, forms the fundamental basis for both training our models and for LLM data augmentation. Meanwhile, the valid dataset, comprising 355 self-explanations, is earmarked for the crucial tasks of fine-tuning our models’ parameters. It also plays a significant role in the evaluation of model accuracy and in ensuring model robustness. Lastly, the test dataset, which consists of 431 self-explanations, is designated to provide a measure of the performance of our finalized models.

3.3. The Text Regression Model Description

Inspired by the work of Wang et al. [43], we employ BERT [44] and a pre-trained BERT Japanese model [45] as the backbone for our regression models, which are intended to predict the quality scores of self-explanations. Wang et al.’s methodology of injecting rubrics into the system influenced the architecture of our model, making it specifically attuned to the grading of short responses. BERT’s deep learning model, grounded on a transformer architecture, has been recognized for surpassing most preceding models in diverse natural language processing tasks [46]. Given its robust performance and compatibility with the Japanese language, BERT is an ideal choice for our study. Our model takes as the input the preprocessed self-explanation text and the corresponding quiz title (Figure 2) and yields as the output the predicted quality score for each self-explanation.

4. The Proposed Method

4.1. Overview or Pseudo-Labeling

In this section, we delve into our proposed method, building on the foundation laid out in Section 3. Our approach, illustrated in Figure 3, seamlessly blends human-labeled mathematical text data described in Section 3 and LLM-generated data to enhance our machine-learning model. Drawing from Cascante–Bonilla’s semi-supervised framework [25], we utilize pseudo-labeling as our primary technique. For human-labeled data, we lean on mathematical self-explanations, while the LLM and mathematical content texts help in producing pseudo-labeled samples to complement them. We gathered over 1420 self-explanation samples for the training model from undergraduate math students, which were further utilized in the Japanese LLM (Step 1). Figure 3 provides a comprehensive visual overview of this pseudo-training mechanism and its integrated phases.

The pseudo-labeling technique commences with the training of an initial model using the labeled dataset (Step 2). This model then assigns labels to the unlabeled data, producing what we term ‘pseudo’ labels (Step 3). These newly formed pseudo-labels are then amalgamated with the original labeled dataset, initiating a cycle of continuous model enhancement (Steps 4–5). As the model’s predictive prowess escalates, the caliber of the pseudo-labels also elevates.

4.2. Pseudo-Labeling Training Algorithm: Dataset Categorization, Function Definitions, and Model Learning

In Figure 3, we present our methodology which combines both human and pseudo-labeled samples to form a mathematically relevant dataset. Using the pseudo-labeling training algorithm (Algorithm 1), we integrate human-labeled datasets, LLM, and math texts and functions.

The human-labeled datasets act as the primary foundation for our machine learning training, and the pseudo-labeled datasets further refine and expand it. A comprehensive breakdown and flow are described in detail in the following sections as definitions and pseudo-code. Figure 4 offers an illustrative representation of the training process, harmonizing with the methodology laid out in Figure 3.

(1): Dataset categorization:

Hereafter

D_{t y p e}

represents an unlabeled dataset pertaining to a particular

t y p e

, and

D_{t y p e}^{*}

does a labeled dataset of the same category as

D_{t y p e}

.

D_{p r o v i d e d}

is the composite dataset given to the model for training and evaluation, which includes the labeled training set

D_{t r a i n}^{*}

, test dataset

D_{t e s t}

, and the generated unlabeled sample dataset

D_{s a m p l e}

.

D_{p r o v i d e d} = D_{t r a i n}^{*} + D_{t e s t} + D_{s a m p l e}

(1)

(2): Function definitions:

M o d e l (θ, D^{*}) = M

(2a)

T e s t (M, D_{t e s t}) = D_{t e s t}^{*}

(2b)

S e l e c t (D, k) = D_{k} (where S e l e c t (D, n (D)) = D)

(2c)

$M o d e l (θ, D^{*})$ : A function that takes a set of parameters, denoted by $θ$ , and a labeled dataset $D^{*}$ to yield a learned model $M$ .
$T e s t (M, D_{t e s t})$ : A function that accepts a model $M$ and a non-labeled test dataset $D_{t e s t}$ , subsequently outputting a labeled test dataset $D_{t e s t}^{*}$ .
$S e l e c t (D, k)$ : A function that takes in a dataset $D$ and a numerical value $k$ where $0 \leq k \leq n (D)$ ( $n (D)$ refers to the total number of data points in dataset $D$ ), outputting a selected subset $D_{k} \subseteq D$ .

(3): Model learning and final test:

M_{1} = M o d e l (θ, D_{train}^{*})

(3a)

M_{t + 1} = M o d e l (θ, D_{train}^{*} + S e l e c t (T e s t (M_{t}, D_{s a m p l e}), k_{t}))

(3b)

D_{test}^{*} = T e s t (M_{T}, D_{test}) (where T is sufficient number of t)

(3c)

The model learning procedure follows an iterative process. Initially, the model

M_{1}

is trained using

θ

and the labeled training dataset

D_{train}^{*}

. In each subsequent timestep, a new model

M_{t + 1}

is developed with an updated training set, comprising the original labeled dataset

D_{train}^{*}

and a selected subset of the pseudo-labeled

D_{sample}

. After the model learning process has been iterated

T

times, the final model

M_{T}

is evaluated on the original test dataset

D_{test}

to output the pseudo-labeled test dataset

D_{test}^{*}

. This dataset, enriched with pseudo-labels, serves as a vital resource for subsequent analyses and performance evaluations.

(4): Parameter setting in our study:

Figure 4 provides a comprehensive outline of our experimental approach. Our dataset consists of both human-annotated and unlabeled samples. For the training process, we amassed 2205 self-explanation samples from student contributors. In our setting,

θ

stands for a model built using logistic regression with text representation acquired from BERT, a state-of-the-art transformer-based model renowned for its superior performance on numerous NLP tasks. The iterative process continues for 3 timesteps; in other words,

T

is set to 3. The selection size at each timestep

t

, denoted by

k_{t}

, varies as follows:

k_{1}

is equal to the total number of data points in

D_{s a m p l e}

, such as

n (D_{s a m p l e}

), whereas for the second time step,

k_{2}

could be any one of the following: 128, 256, 512, 1024, 2048, or 4096. To distinguish them, we defined the

(t + 1)

th model, which is the same as

M_{t + 1}

in Formula (3b), which was learned with selected

k_{t}

training data as

M_{t}^{k_{t}}

. The following formula represents the concrete model learning method in the study:

M_{1} = M o d e l (θ, D_{train}^{*})

(4a)

M_{2} = M o d e l (θ, D_{train}^{*} + S e l e c t (T e s t (M_{1}, D_{sample}), n (D_{sample})))

(4b)

M_{3}^{2^{i + 7}} = M o d e l (θ, D_{train}^{*} + S e l e c t (T e s t (M_{2}, D_{sample}), 2^{i + 7})) (where 0 \leq i \leq 5)

(4c)

D_{test}^{*} = T e s t (M_{3}^{2^{i + 7}}, D_{test}) (where 0 \leq i \leq 5)

(4d)

Algorithm 1: Semi-supervised learning with pseudo-labeling

Input:

Labeled dataset: $D_{train}^{*}$
Unlabeled dataset: $D_{sample}$
Test dataset: $D_{test}$
Maximum number of iterations: $T$
Confidence threshold for pseudo-labeling: $c o n f i d e n c e$
Convergence threshold for change in consecutive scores: $ε$

Output:

Trained model: $N L P m o d e l$
Final score of the model on $D_{test}$ : $s$

Procedure:
Initialize model:

N L P m o d e l \leftarrow t r a i n e d N L P m o d e l w i t h D_{train}^{*}

Initialize values:

p r e v i o u s s c o r e \leftarrow 0

i t e r a t i o n \leftarrow 0

c o n v e r g e n c e a c h i e v e d \leftarrow False

While

i t e r a t i o n

<

T

and not

c o n v e r g e n c e a c h i e v e d

do
a. Predict pseudo-labels:

p r e d i c t i o n s \leftarrow

predicted labels of

D_{sample}

using

N L P m o d e l

c o n f i d e n c e \leftarrow

threshold when labeling of

D_{sample}

using

N L P m o d e l

b. Filter high confidence predictions:

D_{sample}^{*} \leftarrow \{s a m p l e| s a m p l e

in

D_{sample}

and

c o n f i d e n c e o f t h e s a m p l e > c o n f i d e n c e}

c. Merge labeled and pseudo-labeled data:

D_{train}^{*} \leftarrow D_{train}^{*} + D_{sample}^{*}

d. Retrain the model:

N L P m o d e l \leftarrow

trained

N L P m o d e l

with

D_{train}^{*}

e. Evaluate current model performance:

c u r r e n t s c o r e

\leftarrow

Evaluation metrics when labeling

D_{test}

using

N L P m o d e l

f. Check for convergence:
If

| c u r r e n t s c o r e - p r e v i o u s s c o r e | < ε :

c o n v e r g e n c e a c h i e v e d \leftarrow True

p r e v i o u s s c o r e \leftarrow c u r r e n t s c o r e

g. Update iteration count:

i t e r a t i o n \leftarrow i t e r a t i o n + 1

End Procedure

4.3. Pseudo-Data Preparation: LLM Usage and Mathematical Material

We employed a pseudo-labeling technique to enrich our dataset, sourcing additional self-explanation data via the Japanese LLM and mathematical material.

Given alternatives such as OpenAI’s GPT-3 [16], our preference leaned towards Cyberagent’s LLM [17] due to its open-source availability and its adeptness in the Japanese language, perfectly complementing our dataset. To gather data, the LLM tackled mathematical contexts and formulated pertinent explanations. Our methodology was as follows:

Random data selection: We began our process by randomly selecting 30% from our human-labeled training dataset to capitalize on the rich diversity of student-generated self-explanations.
Keyword extraction: Ten keywords were extracted from each self-explanation, encapsulating its essence, guiding LLM to produce contextually relevant data.
LLM generation: Armed with the extracted keywords, we then proceeded to prompt the LLM [47]. Specifically, each set of 10 keywords was used as seed input, directing the LLM to generate contextually coherent pseudo-self-explanation data. The model was given a directive to ‘elaborate based on the provided keywords’, ensuring the generated content-maintained relevance to the original self-explanation context.

Approximately 19,000 entries were generated, with a random subset of 4096 used for experiments. This combination of pseudo- and human-labeled data broadened our training set, enhancing the automated scoring system’s performance without compromising quality.

We also leveraged the math quiz texts dataset, populated with standard mathematical solutions. Its rich mathematical material and contextual problem-solving methods made it invaluable for generating mathematical self-explanations.

4.4. Comparative Analysis of Original and LLM-Generated Dataset

In Table 4 and Figure 5, a detailed comparative analysis between the original and the synthetically generated datasets is elucidated. Upon examination, it becomes evident that the augmented datasets exhibit a modest augmentation in their average quality metrics relative to the foundational dataset.

Table 5 showcases a range of these samples with their respective translations in English, alongside the scores predicted by our automated system. It is evident from the samples that as the complexity of the self-explanation increases, so does the predicted score. For instance, the first sample, which delves into the relationship between solutions and coefficients, is accorded a high score of 5.00 due to its comprehensive self-explanation. Conversely, the final example, which is a brief reflection on an error and an attempt to rectify it, receives a lower score of 1.23, reflecting its succinct nature. Table 6 offers various math texts that were considered during our study. These texts, extracted from educational resources, shed light on various mathematical principles and theorems. The predicted self-explanation score gives an indication of the depth and comprehensiveness of the self-explanation provided in each math text. For instance, the math text delving into the ‘angle bisector and ratio, using Ceva’s theorem’ provides a nuanced exploration of the topic, earning it a top score of 5.00. In contrast, the examination of the sizes of the three angles of a triangle is more straightforward, reflected in its score of 2.66. The varied scores underline the diverse complexity levels present in math texts.

The inherent variability in the complexity of mathematical texts and self-explanations, as illustrated by Table 5 and Table 6, underscores the importance of our automated scoring system. It demonstrates the system’s capability to adaptively discern and quantify the nuances in self-explanation quality.

5. Experiments and Evaluations

5.1. Exploring the Influence of Self-Explanation Augmentation on Model Efficiency

We embarked on an exploration to discern the influence of self-explanation augmentation on the efficiency of an automated self-explanation scoring model across diverse datasets. We used mean absolute error (MAE) metrics [48,49] to evaluate model performance, giving insights into the extent of error deviation and the efficacy for individual items. Table 7 and Table 8 lay out the results of our experiments, contrasting performances across different dataset permutations. When we introduced augmented datasets into the mix, distinct variations in performance emerged.

Remarkably, our model, when nurtured with a blend of the ‘math’ and ‘original dataset’, consistently delivered the most desirable MAE results. This underlines its superior predictive precision in assessing self-explanation quality. Such results lend credence to the efficacy of the model when trained with this specific data amalgamation. On another note, the ‘mixed’ model—which weaves together human-graded samples, LLM-crafted pseudo-sentences, and mathematical content—also demonstrated notable improvements. This outcome underscores the model’s robustness and flexibility when fed with diverse data sources. Yet, the model named ‘only_LLM_math’, which exclusively depended on LLM-created sentences, trailed behind the foundational model in terms of performance. This observation underscores the criticality of harmonizing human-judged and machine-produced data to achieve optimal results.

5.2. Evaluating Optimal Quantity of Pseudo-Self-Explanation Data

The data from Table 9 and Table 10, as well as the visual cues from Figure 6 and Figure 7, underline the importance of calibrating the amount of added pseudo-self-explanation data. While adding more datasets often improves performance, there is a saturation point beyond which the returns diminish. This trend underscores the need for a balanced approach, optimizing the amount of data incorporated to ensure robust and generalizable results. It also emphasizes the intricate interplay between different datasets, with ‘LLM’ and ‘mixed’ data offering particularly promising outcomes.

The ‘baseline’ row signifies the MAE when the model is trained only on the original dataset, devoid of any pseudo-self-explanation data. Each subsequent row shows the MAE when the model is trained with an increasing volume of pseudo-self-explanation data, ranging from 128 to 4096 datasets.

Upon examining the ‘LLM’ model, we note an enhancement in performance when the added datasets increase from 128 to 256. Beyond this, the further addition of generated data does not significantly reduce the MAE, suggesting an optimal balance between data augmentation and model efficacy with an addition of 256 datasets. The ‘math’ model displays a similar trend, with the lowest MAE achieved when 1024 datasets are added. Beyond this point, no substantial performance enhancement is observed with extra data.

For the ‘mixed’ model, we see a consistent improvement in performance with increased data, but this plateaus beyond 2048 datasets, where the MAE slightly increases. Conversely, the ‘only_LLM_math’ model shows erratic trends. Its performance varies noticeably with the quantity of added data and consistently exceeds the baseline model’s MAE, regardless of the added data volume. This reveals potential difficulties when exclusively relying on generated pseudo-self-explanation data.

6. Discussion

6.1. Detailed Analysis of Results (RQ1)

Regarding Research Question 1, an in-depth analysis of the results displayed in Table 8 reveals several noteworthy observations regarding the influence of self-explanation augmentation on the model’s performance. In the test category, we observe an improvement in the model’s performance when transitioning from the baseline to the LLM and math models. Notably, the math model achieves the lowest MAE at 0.646, which aligns with Dai et al.’s [36] proposition that data augmentation at the semantic level improves robustness and consistency. However, the performance slightly deteriorates in the mixed model and substantially plummets in the ‘only_LLM_math’ model. This suggests that an excessive concentration of LLM-generated self-explanations could impair the model’s predictive proficiency.

A similar pattern emerges when examining individual topics within the ‘test’ category. For instance, the model delivers optimal performance for ‘quadratic equations’ with the LLM-generated model, but the performance markedly deteriorates when solely relying on LLM-generated self-explanations. The validation category follows a similar trajectory, with the LLM, math, and mixed models outshining the baseline model. Once again, the mixed model achieves the smallest error. However, the ‘only_LLM_math’ model experiences a decline in performance, further highlighting the advantages of using a diverse dataset that encompasses both human-evaluated and machine-generated explanations.

6.2. Findings and Observations (RQ2)

Regarding Research Question 2, the results presented in Table 9 and Table 10 provide valuable insights into determining the optimal quantity of generated pseudo-self-explanation data that can enhance the model’s performance. For the ‘LLM’ model, an initial improvement in model performance is observed as the number of added datasets increases from 128 to 256. Beyond this point, further augmentation of the generated data does not lead to a significant reduction in MAE, suggesting that adding 256 datasets strikes an optimal balance between data augmentation and model performance. The ‘math’ model exhibits a similar pattern, with the lowest MAE observed when 1024 datasets are added, and no significant performance improvements resulting from further data augmentation. The ‘mixed’ model, on the other hand, shows a general trend of performance enhancement with increased data augmentation, up to a threshold of 2048 datasets, beyond which the MAE slightly increases. In contrast, the ‘only_LLM_math’ model does not present a consistent trend. Its performance fluctuates significantly as the volume of added data increases, and its MAE consistently surpasses that of the baseline model, regardless of the amount of added data. This underscores the challenges of solely leveraging generated pseudo-self-explanation data for augmentation, particularly when the model might lack domain-specific expertise, echoing concerns raised by Dai et al. [36].

Our study underscores that while weaving in generated pseudo-self-explanation data offers potential advantages, there is a point of diminishing returns. Beyond a certain threshold, piling on more data does not invariably translate to superior outcomes. This stunted impact might have its roots in the constraints imposed by our evaluation dataset’s size and imbalance. The compact nature of our training and test datasets might curtail the model’s expansive learning capabilities. Intriguingly, despite the mathematical underpinnings in the LLM and math texts, the ‘only_LLM_math’ model does not match up to either the baseline or its proposed counterparts. The divergences might be attributable to the caliber of the generated self-explanations coupled with the characteristics of the datasets utilized in the training and testing phases.

In conclusion, while integrating generated pseudo-self-explanation data shows potential in enhancing model performance, our results suggest there is a limit. Oversampling methods, such as SMOTE and ADASYN, can expand small datasets and provide a broader learning spectrum for models [29]. Although our focus has been on LLM pseudo-labeling, merging this with oversampling could be a promising direction for future research, aiming to further refine our models. Our study underscores the importance of careful and context-aware adjustments when adopting data augmentation strategies in developing self-explanation auto-scoring models.

6.3. Limitations and Future Research

In our research, several limitations of our study should be highlighted.

Subject scope: Our dataset is restricted to mathematics, potentially constraining the generalizability of our findings to other subjects.
Dependency on LLM: Our methodology hinges on the LLM’s ability to generate pseudo-self-explanation data. This dependence may introduce noise and errors into our system.
Data quality and representativeness: The performance of our approach is contingent on the quality and representativeness of labeled data. Poor or biased data could compromise model efficacy.
Model performance variability: We identified noticeable disparities in our model’s performance across various mathematical categories. For instance, it predicted the ‘property of a circle’ (0.242) more accurately than ‘quadratic functions’ (0.419) within the validation datasets. These results indicate that self-explanation augmentation’s effectiveness may be influenced by the inherent complexity of a topic and the linguistic nuances present within the self-explanations.
Evaluation dataset categories and size: The evaluation dataset for some categories is comparatively small, which poses challenges in drawing definitive conclusions. It is essential to consider the ease of inference as it pertains to various mathematical concepts, including linear functions, shapes, equations, and square roots. Certain subjects may be inherently more challenging for machine training due to their linguistic or conceptual intricacies.

Author Contributions

R.N., B.F., T.Y., Y.D., K.T. and H.O. contributed to the research conceptualization and methodology. Data collection was performed by R.N. R.N. analyzed the data and wrote the manuscript. B.F., Y.D. and H.O. provided comments to improve the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partly funded by the JSPS Grant-in-Aid for Scientific Research (B) JP20H01722 and JP23H01001, (Exploratory) JP21K19824, (Early Career) JP23K17012, (A) JP23H00505, and NEDO JPNP20006.

Data Availability Statement

The data of this study are not open to the public due to participant privacy.

Acknowledgments

We used LLM for the English proofreading of this paper and this fact has been explicitly mentioned for transparency. As this study involves the use of student data, we acknowledge the importance of obtaining approval from the Institutional Review Board (IRB). We have taken the necessary steps to ensure compliance with ethical guidelines, and the study has been submitted to and approved by the IRB. Consent for using the students’ data in our research is obtained from their guardians at the beginning of each academic year. We provide detailed information about the purpose of data collection, how it will be used, and the measures taken to ensure confidentiality and privacy. The guardians have the right to decline consent or withdraw their consent at any time without any negative consequences for the students.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study, in the collection, analyses.

References

Rittle-Johnson, B.; Loehr, A.M.; Durkin, K. Promoting self-explanation to improve mathematics learning: A meta-analysis and instructional design principles. ZDM 2017, 49, 599–611. [Google Scholar] [CrossRef]
Rittle-Johnson, B. Developing Mathematics Knowledge. Child Dev. Perspect. 2017, 11, 184–190. [Google Scholar] [CrossRef]
Renkl, A. Learning from worked-examples in mathematics: Students relate procedures to principles. ZDM 2017, 49, 571–584. [Google Scholar] [CrossRef]
Chi, M.T.; Leeuw, N.D.; Chiu, M.; LaVancher, C. Eliciting Self-Explanations Improves Understanding. Cogn. Sci. 1994, 18, 439–477. [Google Scholar]
Rittle-Johnson, B. Promoting transfer: Effects of self-explanation and direct instruction. Child Dev. 2006, 77, 1–15. [Google Scholar] [CrossRef] [PubMed]
Conati, C.; VanLehn, K. Toward Computer-Based Support of Meta-Cognitive Skills: A Computational Framework to Coach Self-Explanation. Int. J. Artif. Intell. Educ. 2000, 11, 389–415. [Google Scholar]
Bisra, K.; Liu, Q.; Nesbit, J.C.; Salimi, F.; Winne, P.H. Inducing Self-Explanation: A Meta-Analysis. Educ. Psychol. Rev. 2018, 30, 703–725. [Google Scholar] [CrossRef]
Crippen, K.J.; Earl, B.L. The impact of web-based worked examples and self-explanation on performance, problem solving, and self-efficacy. Comput. Educ. 2007, 49, 809–821. [Google Scholar] [CrossRef]
Nakamoto, R.; Flanagan, B.; Takami, K.; Dai, Y.; Ogata, H. Identifying Students’ Stuck Points Using Self-Explanations and Pen Stroke Data in a Mathematics Quiz. In Proceedings of the 29th International Conference on Computers in Education, Online, 22–26 November 2021; Volume 2021, pp. 22–26. [Google Scholar]
Nakamoto, R.; Flanagan, B.; Dai, Y.; Takami, K.; Ogata, H. Unsupervised techniques for generating a standard sample self-explanation answer with knowledge components in a math quiz. Res. Pract. Technol. Enhanc. Learn. 2024, 19, 016. [Google Scholar] [CrossRef]
Berthold, K.; Eysink, T.H.; Renkl, A. Assisting self-explanation prompts are more effective than open prompts when learning with multiple representations. Instr. Sci. 2009, 37, 345–363. [Google Scholar] [CrossRef]
Berthold, K.; Renkl, A. Instructional Aids to Support a Conceptual Understanding of Multiple Representations. J. Educ. Psychol. 2009, 101, 70–87. [Google Scholar] [CrossRef]
McEldoon, K.L.; Durkin, K.L.; Rittle-Johnson, B. Is self-explanation worth the time? A comparison to additional practice. Br. J. Educ. Psychol. 2013, 83, 615–632. [Google Scholar] [CrossRef] [PubMed]
Panaite, M.; Dascalu, M.; Johnson, A.M.; Balyan, R.; Dai, J.; McNamara, D.S.; Trausan-Matu, S. Bring It on! Challenges Encountered While Building a Comprehensive Tutoring System Using ReaderBench. In Proceedings of the International Conference on Artificial Intelligence in Education, London, UK, 27–30 June 2018. [Google Scholar]
Hodds, M.; Alcock, L.; Inglis, M. Self-explanation training improves proof comprehension. J. Res. Math. Educ. 2014, 45, 62–101. [Google Scholar] [CrossRef]
CyberAgent. Open-Calm-7B [Software]. Hugging Face. 2023. Available online: https://huggingface.co/cyberagent/open-calm-7b (accessed on 1 June 2023).
Andonian, A.; Anthony, Q.; Biderman, S.; Black, S.; Gali, P.; Gao, L.; Hallahan, E.; Levy-Kramer, J.; Leahy, C.; Nestler, L.; et al. GPT-NeoX: Large Scale Autoregressive Language Modeling in PyTorch (Version 0.0.1) [Computer Software]. 2021. Available online: https://zenodo.org/record/7714278 (accessed on 1 June 2023).
Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. arXiv 2020, arXiv:2005.14165. [Google Scholar]
McNamara, D.S.; Levinstein, I.B.; Boonthum, C. iSTART: Interactive strategy training for active reading and thinking. Behavior Research Methods. Instrum. Comput. 2004, 36, 222–233. [Google Scholar] [CrossRef]
Funayama, H.; Asazuma, Y.; Matsubayashi, Y.; Mizumoto, T.; Inui, K. Reducing the Cost: Cross-Prompt Pre-finetuning for Short Answer Scoring. In Proceedings of the International Conference on Artificial Intelligence in Education, Tokyo, Japan, 3–7 July 2023. [Google Scholar]
Crossley, S.A.; Kim, M.; Allen, L.K.; McNamara, D.S. Automated Summarization Evaluation (ASE) Using Natural Language Processing Tools. In Proceedings of the International Conference on Artificial Intelligence in Education, Chicago, IL, USA, 25–29 June 2019. [Google Scholar]
Özsoy, M.G.; Alpaslan, F.N.; Çiçekli, I. Text summarization using Latent Semantic Analysis. J. Inf. Sci. 2011, 37, 405–417. [Google Scholar] [CrossRef]
León, J.A.; Olmos, R.; Escudero, I.; Cañas, J.J.; Salmerón, L. Assessing short summaries with human judgments procedure and latent semantic analysis in narrative and expository texts. Behav. Res. Methods 2006, 38, 616–627. [Google Scholar] [CrossRef]
Panaite, M.; Ruseti, S.; Dascalu, M.; Balyan, R.; McNamara, D.S.; Trausan-Matu, S. Automated Scoring of Self-explanations Using Recurrent Neural Networks. In Proceedings of the European Conference on Technology Enhanced Learning, Delft, The Netherlands, 16–19 September 2019. [Google Scholar]
Cascante-Bonilla, P.; Tan, F.; Qi, Y.; Ordonez, V. Curriculum Labeling: Revisiting Pseudo-Labeling for Semi-Supervised Learning. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020. [Google Scholar]
Chawla, N.; Bowyer, K.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-sampling Technique. arXiv 2002, arXiv:1106.1813. [Google Scholar] [CrossRef]
Han, H.; Wang, W.; Mao, B. Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning. In Proceedings of the International Conference on Intelligent Computing, Hefei, China, 23–26 August 2005. [Google Scholar]
He, H.; Bai, Y.; Garcia, E.A.; Li, S. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In Proceedings of the 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), Hong Kong, China, 1–8 June 2008; pp. 1322–1328. [Google Scholar]
Salazar, A.; Vergara, L.; Safont, G. Generative Adversarial Networks and Markov Random Fields for oversampling very small training sets. Expert Syst. Appl. 2021, 163, 113819. [Google Scholar] [CrossRef]
Rubin, D.B. Statistical disclosure limitation. J. Off. Stat. 1993, 9, 461–468. [Google Scholar]
Antulov-Fantulin, N.; Bošnjak, M.; Zlatić, V.; Grčar, M.; Šmuc, T. Synthetic Sequence Generator for Recommender Systems–Memory Biased Random Walk on a Sequence Multilayer Network. In Discovery Science. DS 2014. Lecture Notes in Computer Science; Džeroski, S., Panov, P., Kocev, D., Todorovski, L., Eds.; Springer: Cham, Switerland, 2014; Volume 8777. [Google Scholar] [CrossRef]
El Emam, K. Seven Ways to Evaluate the Utility of Synthetic Data. IEEE Secur. Priv. 2020, 18, 56–59. [Google Scholar] [CrossRef]
Ping, H.; Stoyanovich, J.; Howe, B. DataSynthesizer: Privacy-Preserving Synthetic Datasets. In Proceedings of the 29th International Conference on Scientific and Statistical Database Management, Chicago, IL, USA, 27–29 June 2017. [Google Scholar]
Dahmen, J.; Cook, D.J. SynSys: A Synthetic Data Generation System for Healthcare Applcations. Sensors 2019, 19, 1181. [Google Scholar] [CrossRef] [PubMed]
Berg, A.; Mol, S.T.; Kismihók, G.; Sclater, N. The Role of a Reference Synthetic Data Generator within the Field of Learning Analytics. J. Learn. Anal. 2016, 3, 107–128. [Google Scholar] [CrossRef]
Peña-Ayala, A. Learning analytics: A glance of evolution, status, and trends according to a proposed taxonomy. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2018, 8, e1243. [Google Scholar] [CrossRef]
Flanagan, B.; Majumdar, R.; Ogata, H. Fine Grain Synthetic Educational Data: Challenges and Limitations of Collaborative Learning Analytics. IEEE Access 2022, 10, 26230–26241. [Google Scholar] [CrossRef]
Dai, H.; Liu, Z.; Liao, W.; Huang, X.; Cao, Y.; Wu, Z.; Zhao, L.; Xu, S.; Liu, W.; Liu, N.; et al. AugGPT: Leveraging ChatGPT for Text Data Augmentation. arXiv 2023, arXiv:2302.13007. [Google Scholar]
Lightman, H.; Kosaraju, V.; Burda, Y.; Edwards, H.; Baker, B.; Lee, T.; Leike, J.; Schulman, J.; Sutskever, I.; Cobbe, K. Let’s Verify Step by Step. arXiv 2023, arXiv:2305.20050. [Google Scholar]
Flanagan, B.; Ogata, H. Learning analytics platform in higher education in Japan. Knowl. Manag. E-Learn. Int. J. 2018, 10, 469–484. [Google Scholar]
Thompson, D.R.; Senk, S.L. Using rubrics in high school mathematics courses. Math. Teach. Learn. Teach. PK–12 1998, 91, 786–793. [Google Scholar] [CrossRef]
Cohen, J. A Coefficient of Agreement for Nominal Scales. Educ. Psychol. Meas. 1960, 20, 37–46. [Google Scholar] [CrossRef]
Wang, T.; Inoue, N.; Ouchi, H.; Mizumoto, T.; Inui, K. Inject Rubrics into Short Answer Grading System. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Hong Kong, China, 3–7 November 2019. [Google Scholar]
Vaswani, A.; Shazeer, N.M.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; NIPS’17, 6000-6010. Curran Associates Inc.: Red Hook, NY, USA, 2017. ISBN 9781510860964. [Google Scholar]
Suzuki, M. Pretrained Japanese BERT Models, GitHub Repository. 2019. Available online: https://github.com/cl-tohoku/bert-japanese (accessed on 1 April 2021).
Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Association for Computational Linguistics: Minneapolis, MN, USA, 2019; Volume 1, pp. 4171–4186. [Google Scholar]
Liu, P.; Yuan, W.; Fu, J.; Jiang, Z.; Hayashi, H.; Neubig, G. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. ACM Comput. Surv. 2021, 55, 1–35. [Google Scholar] [CrossRef]
Chai, T.; Draxler, R.R. Root mean square error (RMSE) or mean absolute error (MAE)?–Arguments against avoiding RMSE in the literature. Geosci. Model Dev. 2014, 7, 1247–1250. [Google Scholar] [CrossRef]
Hodson, T.O. Root-mean-square error (RMSE) or mean absolute error (MAE): When to use them or not. Geoscientific Model Development. Geosci. Model Dev. 2022, 15, 5481–5487. [Google Scholar] [CrossRef]

Figure 1. Handwritten answer review playback and self-explanation input user interface. The self-explanation of the answer section includes the following: If triangle ABO’s area is 1, then triangle AOC’s area is 4. Given that the total area is five and straight-line OP bisects the area of triangle ABC, the joint area of quadrilateral ABPO and triangle POC is 2/5. Hence, the area ratio of triangle APO to triangle POC is 3:5, leading to a length ratio of straight-line AP to straight-line PC of 3:5.

Figure 2. Workflow for the BERT regression models.

Figure 3. The overview of pseudo-training process.

Figure 4. The detail of pseudo-training process.

Figure 5. Boxplots of self-explanation scores.

Figure 6. Test MAE plot.

Figure 7. Validation MAE plot.

Table 1. Rubrics and a sample answer of self-explanation in a quiz.

Numbers	Rubric	Sample Answer of Self-Explanations
Step 1	Be able to find the equation of a linear function from two points.	Substituting the y-coordinate of p into the equation of the line AC.
Step 2	Be able to find the equation of the line that bisects the area of a triangle.	Find the area of triangle ABC, then find the area of triangle OPC.
Step 3	Be able to represent a point on a straight-line using letters (P-coordinates).	With the line OC as the base, find the y-coordinate of p, which is the height. P’s coordinate is (t, −1/2t + 4).
Step 4	Be able to represent a point on a straight-line using letters (Q-coordinate).	Since the coordinates of P are (3,5/2), the line OP is y = ⅚x, and the coordinates of Q are (t,5/6).

Table 2. Score grading definitions.

Graded Score	Description
1 (Unacceptable)	The number of steps for which self-explanation is filled in for the steps required for the solution is minimal, and there were problematic expressions in the students’ self-explanation (e.g., mistaken patterns, boredom).
2 (Poor)	Self-explanation is mainly provided for the steps required for the solution. Still, they are more like bullet points than explanations.
3 (Fair)	Self-explanation is mainly provided for the steps required for the answer—the average self-explanation level among all respondents.
4 (Very Good)	Self-explanation is provided for most of the steps required for the answer, but there is room for improvement as an explanation (logic, expressions).
5 (Excellent)	Self-explanation is mainly provided for the steps required for the answer, and the explanation is logical and well-written.

Table 3. Descriptive statistics of graded self-explanations.

Data Type	Num of Quiz	Variations in Math Units	Total Answers	Sentence Length (Character Count)		Quality Score
Data Type	Num of Quiz	Variations in Math Units	Total Answers	Mean	SD	Mean	SD
Train	40	8	1420	67.8	56.8	2.94	1.34
Valid	37	8	355	67.3	59.3	2.92	1.31
Test	8	3	431	63.7	53.2	2.81	1.25

Table 4. Comparative metrics for original and generated datasets.

Data Type	Counts	Mean Score	Std
Original	2205	2.91	1.33
LLM	4096	3.39	1.44
Math	4096	3.87	1.77

Table 5. LLM-generated samples.

LLM-Generated Samples (Original)	LLM-Generated Samples (English Translation)	Predicted Self-Explanation Score
その後、α + β = −a−2とαβ = 2aの関係から解と係数の関係が分かる。次に、問題で言及されたαとβを用いて式を展開し整理し、右辺を0にする。さらに式を工夫して代入が可能な形にする。そして、関係式α + β = −a−2とαβ = 2aを式に代入して簡略化し、a^2 + a−6 = 0となる。これを因数分解してaの解を求めると、a = −3とa = 2が得られる。その後、a = 2を代入すると解が重解になり、解が一つしかなくなるため問題に合わない。したがって、正しい答えはa = −3である。その後、重積算や累乗などいろいろな計算が解に出てきて混乱するだろうから丁寧に説明していく。	After that, α + β = −a−2 and αβ = 2a can be understood from the relationship between the solutions and coefficients. Next, expand and organize the expressions using the α and β mentioned in the problem, and set the right side to 0. Further manipulate the equation to transform it into a form that allows for substitution. Then, substitute the relationship α + β = −a−2 and αβ = 2a into the equation and simplify it, resulting in a2 + a−6 = 0. Factorize this to find the solutions for a, which are a = −3 and a = 2. Afterward, when a = 2 is substituted, the answer becomes a repeated root, which means there is only one solution, making it inconsistent with the problem. Therefore, the correct answer is a = −3. Afterward, various calculations such as product of roots and exponentiation may appear in the solution, and it may be confusing, so I will explain it carefully.	5.00
代入してみたら、平方根の中が計算できた。αとβを求めた。αに−2を代入した。解を吟味して、答えは−3だった。その後、a = −4.7x^5 + 7として計算したが誤差があった。そこで右側の5を左側の4に変形すると正しい値が求められた（ただし小数点以下を切り捨て）。この問題は高校数学の範囲外であるが、中学・高校で習った知識で解くことができたので紹介したい。	Substituted it into the formula for solutions. Calculated the value inside the square root. Found α and β. Substituted α = −2. Examined the solution, and the answer was −3. Afterward, calculated it as a = −4.7x^5 + 7, but there was an error. So I rearranged the 5 on the right side to 4, and the correct value was obtained (with the decimal part truncated). This problem is beyond the scope of high school mathematics, but it could be solved using knowledge learned in middle and high school, so I want to introduce it.	4.08
グラフを書いた傾きがー4のy = x^2の接線を求めた。その後、接点を原点とし両辺に点a,bを代入すると「sin-cos」と出たので、「COS(1)」で求めた。これで式の意味が分かったと思う。−5は角度が変化しないからゼロでないといけないのだ！	Used the slope obtained from graphing to find the tangent of y = x^2 with slope −4. Took the point of tangency as the origin and substituted points a and b on both sides. ‘Sin-cos’ was obtained, so I calculated it using ‘COS(1)’. I think I understood the meaning of the equation now. −5 does not change the angle, so it must be nonzero!	3.02
x = 3を代入する。方程式を解き進める。解の吟味を書く。間違いに気づき、修正する。最終的な答えを書く。その後、再帰的に解く。	Substituted x = 3. Proceeded to solve the equation. Wrote the examination of the solutions. Noticed the mistake and corrected it. Wrote the final answer. Afterward, solve it recursively.	2.18
前のは間違えたため、全部消した。その後、通分してみた。	Since the previous one was incorrect, I deleted everything and then performed the common denominator. Afterwards, something like this.	1.23

Table 6. Math texts samples.

Math Texts Samples	Predicted Self-Explanation Score
Angle bisector and ratio, using Ceva’s theorem: Revised version succeed math A problem 349 Δ (Triangle), let Δ have the angle bisector of ∠ and the point where it intersects the side, and the point that divides the side in the ratio. When the line intersects at point, find the length of the side.	5.00
Using Menelaus’s theorem: Segment ratio and area ratio, revised version succeed math A problem 350 Δ, let be the point where it divides the side in the ratio, and the point where the segment is divided in the ratio, and the point where the extension of the segment intersects the side. Find the following segment ratios and area ratios Δ : Δ	4.93
Using the relationship between sides and angles: Range of values for side length in a triangle, revised version succeed math A problem 355, determine the range of values for so that a triangle with the following side lengths exists.	3.84
Using the relationship between the sizes of three sides: Proving inequalities related to segment lengths, revised version succeed math A important example 66, take point inside Δ, and join and prove that . Abbreviated.	3.13
Examining the sizes of the three angles of a triangle, revised version succeed math A important example 64, examine the sizes of the three interior angles of Δ.	2.66

Table 7. Datasets overview.

Dataset	Base_Line	LLM	Math	Mixed	Only_LLM_Math
Original (N = 1420)	●	●	●	●
LLM- generated (N = 4096)		●		●	●
Math texts (N = 4096)			●	●	●
Total Number of Data	1420	5516	5516	9612	8192

Table 8. Model performance for various datasets (MAE).

Data Type	Base_Line	LLM	Math	Mixed	Only_LLM_Math
Test	0.749	0.699	0.646	0.692	1.135
Val	0.602	0.341	0.358	0.336	1.033

Table 9. Test MAE with varying amounts of added pseudo-self-explanation sata.

Dataset	Number of Datasets Added
Dataset	128	256	512	1024	2048	4096
base_line	0.75
LLM	0.67	0.63	0.72	0.72	0.71	0.7
math	0.64	0.66	0.67	0.64	0.65	0.65
mixed	0.68	0.66	0.71	0.68	0.73	0.69
only_LLM_math	1.19	0.96	1.02	0.89	1.15	1.14

Table 10. Validation MAE with varying amounts of added pseudo-self-explanatory data.

Dataset	Number of Datasets Added
Dataset	128	256	512	1024	2048	4096
base_line	0.60
LLM	0.57	0.35	0.51	0.49	0.40	0.34
math	0.40	0.50	0.43	0.35	0.40	0.36
mixed	0.59	0.32	0.52	0.44	0.40	0.34
only_LLM_math	1.19	0.90	0.96	0.81	1.02	1.03

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Nakamoto, R.; Flanagan, B.; Yamauchi, T.; Dai, Y.; Takami, K.; Ogata, H. Enhancing Automated Scoring of Math Self-Explanation Quality Using LLM-Generated Datasets: A Semi-Supervised Approach. Computers 2023, 12, 217. https://doi.org/10.3390/computers12110217

AMA Style

Nakamoto R, Flanagan B, Yamauchi T, Dai Y, Takami K, Ogata H. Enhancing Automated Scoring of Math Self-Explanation Quality Using LLM-Generated Datasets: A Semi-Supervised Approach. Computers. 2023; 12(11):217. https://doi.org/10.3390/computers12110217

Chicago/Turabian Style

Nakamoto, Ryosuke, Brendan Flanagan, Taisei Yamauchi, Yiling Dai, Kyosuke Takami, and Hiroaki Ogata. 2023. "Enhancing Automated Scoring of Math Self-Explanation Quality Using LLM-Generated Datasets: A Semi-Supervised Approach" Computers 12, no. 11: 217. https://doi.org/10.3390/computers12110217

APA Style

Nakamoto, R., Flanagan, B., Yamauchi, T., Dai, Y., Takami, K., & Ogata, H. (2023). Enhancing Automated Scoring of Math Self-Explanation Quality Using LLM-Generated Datasets: A Semi-Supervised Approach. Computers, 12(11), 217. https://doi.org/10.3390/computers12110217

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Enhancing Automated Scoring of Math Self-Explanation Quality Using LLM-Generated Datasets: A Semi-Supervised Approach

Abstract

1. Introduction

2. Related Work

2.1. Automated Scoring of Self-Explanations: The Imperative for Rich Data

2.2. Augmenting Mathematical Self -Explanations Using Large Language Models

3. Problem Setting: The Learning Task

3.1. Collecting Self-Explanations

3.2. Assessment of Self-Explanation Quality

3.3. The Text Regression Model Description

4. The Proposed Method

4.1. Overview or Pseudo-Labeling

4.2. Pseudo-Labeling Training Algorithm: Dataset Categorization, Function Definitions, and Model Learning

4.3. Pseudo-Data Preparation: LLM Usage and Mathematical Material

4.4. Comparative Analysis of Original and LLM-Generated Dataset

5. Experiments and Evaluations

5.1. Exploring the Influence of Self-Explanation Augmentation on Model Efficiency

5.2. Evaluating Optimal Quantity of Pseudo-Self-Explanation Data

6. Discussion

6.1. Detailed Analysis of Results (RQ1)

6.2. Findings and Observations (RQ2)

6.3. Limitations and Future Research

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI