Automated Assessment of Inferences Using Pre-Trained Language Models

Abstract: Inference plays a key role in reading comprehension. However, assessing inference in reading is a complex process that relies on the judgment of trained experts. In this study, we explore objective and automated methods for assessing inference in readers’ responses using natural language processing. Specifically, classifiers were trained to detect inference from a pair of input texts and reader responses by fine-tuning three widely used pre-trained language models. The effects of the model size and pre-training strategy on the accuracy of inference classification were investigated. The highest F1 score of 0.92 was achieved via fine-tuning the robustly optimized 12-layer BERT model (RoBERTa-base). Fine-tuning the larger 24-layer model (RoBERTa-large) did not improve the classification accuracy. Error analysis provides insight into the relative difficulty of classifying inference subtypes. The proposed method demonstrates the feasibility of the automated quantification of inference during reading, and offers potential to facilitate individualized reading instruction.


Introduction

Automating the Assessment of Reading Comprehension Skills
Natural language processing (NLP) can improve productivity in many areas. This is especially true for organizations that deal with large volumes of textual data [1]. For example, NLP helps users to identify and extract key entities, facts, and relationships from text to build knowledge bases [2]. These knowledge bases become valuable resources for quick information access and decision making. In addition, in customer service, sentence classification can be used to automatically categorize customer inquiries, complaints, or feedback [3]. This enables quicker routing to the appropriate department or personnel, thereby reducing response times and improving customer satisfaction. Furthermore, sentiment analysis, a special case of sentence classification, can automatically identify positive, negative, or neutral sentiments in customer feedback or social media posts [4]. This helps organizations to quickly gauge public opinion and adjust their strategies accordingly.
Recent research has increasingly focused on using NLP to augment or even replace human expert judgments. In the legal field, NLP-powered tools are accelerating case preparation by efficiently sifting through large volumes of legal documents [5]. Similarly, in the financial sector, NLP-based systems are enhancing the ability of analysts to predict market trends by analyzing large and complex datasets [6], a task that is beyond human capabilities. In education, automated essay grading systems provide instant feedback [7], thus allowing educators to spend more time on individualized instruction.
This study focuses on automating the assessment of reading comprehension skills through the development and application of advanced NLP in education. Reading is a fundamental skill for the acquisition of knowledge [8]. Assessing reading skills is challenging because reading involves multiple cognitive processes, such as letter-sound correspondence, phonological memory, word recognition, sentence processing, and comprehension [9]. These individual skills are assessed using standardized tests [10]. The objectivity of standardized tests makes it possible to compare the reading skills of students from different backgrounds using the same criteria. However, standardized tests often focus on specific types of reading skills, potentially neglecting broader aspects of literacy, such as critical thinking and engagement with diverse background knowledge [11]. To gain a holistic understanding of a student's reading skills, the interpretation of test results and personalized feedback from human experts are essential.
Machine learning models are actively used to improve the accuracy of reading level assessments. In [12], Petersen and Ostendorf used support vector machines to combine lexical and syntactic features of a given text to assess its reading level. The corpus used for this study was Weekly Reader, which consisted of 2400 articles from an educational magazine designed for children aged 7-10 in the United States. Their model outperformed the traditional Flesch-Kincaid score [13], which is based on sentence length and the average syllable count. In [14], Boonthum-Denecke et al. used latent semantic analysis (LSA) [15] and word matching to first identify students' reading strategies and then estimate their reading comprehension scores. The correlation between predicted and actual reading scores was low (<0.5), indicating that estimating reading comprehension skills is a difficult task. In [16], Allen et al. used a linguistic measure of coherence, called Coh-Metrix [17], and LSA to assess the reading comprehension of high school students. Their predictions were compared with standardized scores from the Gates-MacGinitie reading skill test, level 10/12 [18], and the correlation between them was also low (<0.5).
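For reference, the Flesch-Kincaid baseline mentioned above depends only on average sentence length and average syllables per word. The following is a minimal sketch of the grade-level variant with its commonly cited coefficients. Whether [12] compared against exactly this variant is an assumption, and the function name and the pre-computed counts are illustrative.

```python
def flesch_kincaid_grade(num_words, num_sentences, num_syllables):
    """Flesch-Kincaid grade level from raw counts (commonly cited coefficients)."""
    if num_words <= 0 or num_sentences <= 0:
        raise ValueError("counts must be positive")
    words_per_sentence = num_words / num_sentences
    syllables_per_word = num_syllables / num_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


# Illustrative counts only: 100 words, 5 sentences, 150 syllables.
example_grade = flesch_kincaid_grade(100, 5, 150)
```

Because the score uses only two surface statistics, it cannot capture the lexical and syntactic features that the machine learning approaches above combine.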
However, existing studies using machine learning models combine known linguistic features to improve the overall accuracy of reading assessments. In contrast, this study investigates a way to use pre-trained language models to assess a specific cognitive task (inference) for reading. This would support human experts in assessing reading skills by providing tools and platforms that streamline assessment processes using NLP.

Contributions of the Study
Specifically, this study was motivated by the fact that inference has been shown to play a key role in reading comprehension, serving as a critical component in the construction of meaning from text [19][20][21]. Inference allows readers to fill in gaps in explicit textual information, facilitating the deeper understanding and integration of knowledge. Previous research emphasizes its importance not only for comprehending literal content, but also for engaging with the text at a deeper level [20], allowing for the application of prior knowledge and the anticipation of subsequent narrative developments [21]. Enhancing inferential skills could significantly improve reading comprehension outcomes, thereby advancing literacy education practices.
However, the evaluation of inferences during reading is a complex process and is subject to the judgment of human experts. Currently, readers' cognitive processes during reading are monitored by the think-aloud protocol [22][23][24]. First, readers read a given text and verbally report their thoughts (Figure 1A). Then, the readers' responses are transcribed and evaluated by multiple evaluators (Figure 1B). This traditional method relies heavily on qualitative analysis and the interpretive insights of the raters, which leads to inter-rater variability and limits the scalability of the evaluation. This variability and the high resource requirements underscore the need for more objective and automated methods for assessing inference in readers' responses [25].
Thus, this study attempts to address these challenges by using NLP to automatically evaluate reader responses. Specifically, we formulate the problem as sentence classification, where a classifier is trained to classify a pair of input texts and reader responses as inferential or non-inferential (Figure 1C). The classifier is fine-tuned from pre-trained language models that have been trained with large text corpora. These pre-trained models efficiently produce contextualized representations of a given text and are a good starting point for developing a task-specific language model [26]. Further training on smaller, more focused datasets that are relevant to the task at hand (inference classification) would allow for the automated quantification of readers' responses to a given text.
The remainder of this paper is organized as follows. Section 2 describes the data collection, human expert evaluation, and the procedure of fine-tuning pre-trained language models and their evaluation. In Section 3, we assess the accuracy of the inference classification using the proposed method, and we analyze the error patterns. Then, in Section 4, we discuss the results and their implications. In Section 5, we draw general conclusions with future research directions.

Data Collection and Evaluation by Human Experts
A dataset of 720 sentence-response pairs was collected as follows. The stimulus text in Korean consisted of 10 sentences taken from an elementary school reading textbook. The average number of words per sentence was 9.2. The think-aloud protocol was administered to 72 third- and fourth-grade students in public elementary schools in South Korea. Each participant's verbal response to each sentence was recorded and transcribed.
Of the 720 sentence-response pairs, 58 pairs were removed because the reader failed to produce any meaningful response. The remaining 662 sentence-response pairs were used for further analysis. Three evaluators individually assessed each sentence-response pair according to the nine inference types defined in a previous study [21] (summarized in Table 1). Each evaluator first identified the type of inference and then determined whether an inference was made (labeled as 1) or not (labeled as 0). The evaluators' decisions agreed on 642 sentence-response pairs (97%). For the 20 pairs (3%) on which their decisions differed, the evaluators discussed until they reached the same conclusion. Among the 662 sentence-response pairs, 438 pairs were labeled as inference (1), and 224 were labeled as no inference (0).
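The counts reported above can be checked with a few lines of arithmetic. The sketch below is illustrative only. The always-inference baseline is not part of the study; it is an added assumption included merely to give the class imbalance some context when reading F1 scores.

```python
# Counts taken from the text above.
participants = 72
sentences_per_participant = 10
total_pairs = participants * sentences_per_participant  # 720
removed_pairs = 58
analyzed_pairs = total_pairs - removed_pairs            # 662

agreed_pairs = 642
inference_pairs = 438
no_inference_pairs = 224

# Share of pairs on which the three evaluators agreed before discussion.
agreement_rate = agreed_pairs / analyzed_pairs

# Hypothetical baseline (not from the study): always predict "inference".
baseline_precision = inference_pairs / analyzed_pairs
baseline_recall = 1.0
baseline_f1 = (2 * baseline_precision * baseline_recall
               / (baseline_precision + baseline_recall))

print(f"analyzed pairs: {analyzed_pairs}")
print(f"agreement rate: {agreement_rate:.3f}")
print(f"always-inference baseline F1: {baseline_f1:.3f}")
```

Since roughly two thirds of the pairs are labeled as inference, a trivial classifier already reaches a non-negligible F1 score, which is one reason to report precision and recall alongside F1.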
Table 2 shows typical examples of sentence-response pairs in which inferences were made. The evaluators agreed that, in the first example in Table 2, the inference was made because the reader tried to make a prediction about what would come next based on what had come before. In the second example in Table 2, the reader tried to connect the background knowledge (The plants seem to be inactive in general.) when explaining the given sentence. In the third example in Table 2, the reader tried to associate the information in the given sentence with their background knowledge (The Venus flytrap is a famous example of a plant that eats insects.). In the fourth example in Table 2, the reader's response was a simple paraphrase of the given sentence, and the sentence and response were almost identical in meaning. In the last example in Table 2, there was no meaningful response from the reader.

Pre-Trained Language Models
In this study, three widely used Transformer-based pre-trained language models [27] were selected to investigate their effectiveness for classifying inference. Due to the relatively small sample size (662) when compared to the huge parameter space of Transformer-based models, we decided to use BERT [24] and its variants in order to avoid overfitting. These models, namely the BERT-base [28], RoBERTa-base [29], and RoBERTa-large [29], have become fundamental in the field of natural language understanding due to their success in capturing complex patterns and representations, achieved through extensive pre-training on large corpora. The selection of these models allows for a comprehensive analysis of their performance and adaptability in fine-tuning scenarios for inference classification. Each model brings its own set of characteristics, such as model size, masking, and training objectives, which are summarized in Table 3.

Korean versions of the three base models were pre-trained using the KLUE (Korean Language Understanding Evaluation) benchmark dataset [30]. The benchmark contains eight Korean natural language understanding tasks: topic classification, semantic textual similarity, natural language inference, named entity recognition, relation extraction, dependency parsing, machine reading comprehension, and dialogue state tracking. The corpora used for this benchmark include news headlines, Wikipedia, Wikinews, policy news, The Korea Economics Daily News, and Acrofan News for formal texts, and ParaKQC, Airbnb reviews, and the NAVER Sentiment Movie Corpus for colloquial texts. The tokenizer for the dataset is a morpheme-based sub-word tokenizer [30], which first divides an input text into morphemes using a morphological analyzer, and then tokenizes them using the byte pair encoding (BPE) technique [31]. The pre-trained models and the tokenizer are publicly available [32][33][34].

Transfer Learning and Evaluation Using k-Fold Cross-Validation
The inference classifiers were trained by fine-tuning the three pre-trained language models as follows. Each input sentence and its corresponding response were concatenated with the separator symbol ([SEP]) between them, and the combined text was provided as the input. For each base model, the last layer (classification head) was replaced with a dense layer with an output size of one, and its weights were initialized with random values. As a result, the model produces a single number for each input, and this output is used as the logit of the target class. The loss is defined as the binary cross entropy between the output of the model and the true class label as follows:

$$\mathcal{L}(y, z) = -\left[ y \log \sigma(z) + (1 - y) \log\left(1 - \sigma(z)\right) \right],$$

where y is the true class label (0 or 1), z is the model output, and σ is the sigmoid function that transforms the input logit into a probability. This loss is minimized using the Adam optimizer [35].

Hyperparameters for the training process were optimized as follows. As suggested by the authors of the pre-trained models [26], the batch size, learning rate, warm-up ratio, and weight decay were varied independently, and the highest F1 score for each model was reported. First, the batch size was set to 4, 8, 16, or 32. Larger batch sizes provide a more accurate estimate of the gradient, resulting in more stable training. However, they require more memory and processing power. Second, the learning rate was set to $10^{-5}$, $2 \times 10^{-5}$, $3 \times 10^{-5}$, or $5 \times 10^{-5}$. Too low a learning rate can lead to a long convergence time or to the model becoming stuck in a local minimum, while too high a learning rate can cause the training to be unstable and diverge. Third, the warm-up ratio was set to 0, 0.1, 0.2, or 0.6. During the warm-up iterations, the learning rate gradually increases from zero to the target learning rate. This technique helps to stabilize the fine-tuning in the early iterations. Fourth, the weight decay was set to 0, 0.01, 0.02, 0.04, or 0.08. Weight decay regularizes the model and helps to prevent overfitting by penalizing large weights. Varying the four hyperparameters independently resulted in a total of 4 × 4 × 4 × 5 = 320 configurations.
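The input format, loss, warm-up, and hyperparameter grid described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' training code: the function names are hypothetical, the input is built as a plain string rather than with the actual tokenizer, and the learning rate is held constant after warm-up because the text does not specify what happens after the warm-up phase.

```python
import itertools
import math


def build_input(sentence, response, sep="[SEP]"):
    """Concatenate a stimulus sentence and a reader response around a separator."""
    return f"{sentence} {sep} {response}"


def bce_with_logits(z, y):
    """Binary cross entropy for a single logit z and label y (0 or 1).

    Uses the numerically stable form max(z, 0) - z*y + log(1 + exp(-|z|)),
    which equals -[y*log(sigmoid(z)) + (1-y)*log(1 - sigmoid(z))].
    """
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))


def warmup_lr(step, total_steps, target_lr, warmup_ratio):
    """Linear warm-up from zero to target_lr.

    Assumption: the rate stays at target_lr after warm-up.
    """
    warmup_steps = int(total_steps * warmup_ratio)
    if warmup_steps > 0 and step < warmup_steps:
        return target_lr * step / warmup_steps
    return target_lr


# Search space as listed in the text.
GRID = {
    "batch_size": [4, 8, 16, 32],
    "learning_rate": [1e-5, 2e-5, 3e-5, 5e-5],
    "warmup_ratio": [0.0, 0.1, 0.2, 0.6],
    "weight_decay": [0.0, 0.01, 0.02, 0.04, 0.08],
}

# Every combination of the four hyperparameters.
configs = [dict(zip(GRID, values)) for values in itertools.product(*GRID.values())]
print(len(configs))
```

With a warm-up ratio of zero, `warmup_lr` returns the target rate from the first step, which corresponds to training without warm-up.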
For each base model, five-fold cross-validation was performed for each training configuration in order to rigorously assess the classification accuracy and generalization ability of the model. Specifically, the dataset was randomly divided into five distinct subsets, so that the proportions of the inference subtypes were equal across the subsets. Four of these subsets were used for training and the remaining one was used for validation in order to calculate the F1 score with the corresponding precision and recall scores. This cycle was repeated five times, with each subset serving as the validation set once. As a result, five F1 scores were collected for each training configuration. The configuration corresponding to the highest average F1 score was found for each base model. The five F1 scores of the best models fine-tuned from different base models were then compared using the paired t-test.
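The evaluation procedure above can be sketched as follows. This is an illustrative, standard-library-only sketch; the function names are hypothetical, and the way items are assigned to folds is one simple way to keep subtype proportions similar across folds, not necessarily the procedure used in the study.

```python
import math
import random
from collections import defaultdict


def stratified_folds(strata, k=5, seed=0):
    """Assign each item a fold index so that every stratum is spread evenly."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for idx, stratum in enumerate(strata):
        by_stratum[stratum].append(idx)
    fold_of = [0] * len(strata)
    offset = 0  # rotate the starting fold so overall fold sizes stay balanced
    for indices in by_stratum.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            fold_of[idx] = (offset + pos) % k
        offset += len(indices)
    return fold_of


def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def mean_and_sem(values):
    """Sample mean and standard error of the mean."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance / n)


def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic over fold-wise score differences (n - 1 degrees of freedom).

    Raises ZeroDivisionError if all differences are identical.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean, sem = mean_and_sem(diffs)
    return mean / sem


# Two-sided critical value of Student's t for 4 degrees of freedom, alpha = 0.05.
T_CRITICAL_DF4 = 2.776
```

With five folds, a difference between two models is significant at the 0.05 level when the absolute value of the paired t statistic exceeds `T_CRITICAL_DF4`.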

Accuracy of Inference Classification
Table 4 shows the hyperparameters that resulted in the best models. Different batch sizes yielded the highest F1 scores for different base models. In contrast, the same learning rate ($10^{-5}$) and warm-up ratio (0) corresponded to the highest F1 scores for all three models. Here, the best warm-up ratio of zero means that warm-up was not necessary. A non-zero weight decay was useful only for the largest model (RoBERTa-large), which suggests that regularization via weight decay was effective only for that model.

Table 5 shows the classification accuracies of the best models fine-tuned from the three pre-trained models. Both the precision and recall scores of the RoBERTa-base model were higher than those of the BERT-base model. As a result, the F1 score of the RoBERTa-base model was higher than that of the BERT-base model. The precision score of the RoBERTa-large model was lower than that of the RoBERTa-base model, but the recall scores of the two models were the same. As a result, the F1 score of the RoBERTa-large model was lower than that of the RoBERTa-base model.

Figure 2 shows the F1 scores of the inference classification by fine-tuning the three pre-trained language models. First, training the inference classifier from the BERT-base model resulted in an average F1 score of 0.87, with a standard error of the mean (SEM) of 0.01. Second, training the inference classifier from the RoBERTa-base model resulted in a higher average F1 score of 0.92, with an SEM of 0.01. The F1 score of the fine-tuned RoBERTa-base model was significantly higher than that of the fine-tuned BERT-base model (paired t-test, p < 0.05). Third, training the classifier from the RoBERTa-large model resulted in a lower average F1 score of 0.90, with a higher SEM of 0.02. However, the difference between the F1 scores of the fine-tuned RoBERTa-base and RoBERTa-large models was not statistically significant (paired t-test, p > 0.05).

A comparison of the proportions of the inference subtypes in the errors with their proportions in the entire dataset (Figure 3) shows the relative difficulty of classifying the inference subtypes. The proportions of the elaboration and bridging subtypes in the errors decreased as larger language models were used. This suggests that larger language models classify these types of inferences with a higher accuracy. In contrast, the proportions of evaluative comments and paraphrases increased as larger language models were used. This suggests that these inference subtypes are more susceptible to overfitting. Furthermore, the largest language model (RoBERTa-large) made more errors in classifying meaningless responses than the other two language models (BERT-base and RoBERTa-base). This is another indication of overfitting.
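The comparison between a subtype's share of the errors and its share of the dataset can be expressed as a simple ratio. The sketch below is illustrative: the function names are hypothetical, and the counts are invented placeholders rather than values from the study. A ratio above 1 means a subtype is over-represented among the errors and is therefore relatively hard to classify.

```python
def shares(counts):
    """Convert raw counts per subtype into proportions that sum to one."""
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


def error_overrepresentation(dataset_counts, error_counts):
    """Ratio of each subtype's share among errors to its share in the dataset."""
    dataset_share = shares(dataset_counts)
    error_share = shares(error_counts)
    return {
        name: error_share.get(name, 0.0) / dataset_share[name]
        for name in dataset_share
    }


# Placeholder counts for illustration only (not taken from the paper).
dataset_counts = {"elaboration": 50, "paraphrase": 30, "evaluative comment": 20}
error_counts = {"elaboration": 2, "paraphrase": 5, "evaluative comment": 3}

ratios = error_overrepresentation(dataset_counts, error_counts)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}")
```

Tracking this ratio per base model would make trends such as the decreasing share of elaboration errors directly comparable across models.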

Effects of Pre-Trained Language Models on the Classification Accuracy
The accuracy of inference classification is significantly influenced by the choice of the base pre-trained language model. The comparison of the F1 scores based on the BERT-base and RoBERTa-base models shows the effect of the training strategies used during the pre-training phase of a language model on the classification accuracies of the fine-tuned models. The more advanced training strategies used for the RoBERTa-base model resulted in significantly higher F1 scores for the inference classification than those of the BERT-base model of the same size. This is consistent with previous findings which indicate that models fine-tuned from RoBERTa-base outperform models fine-tuned from BERT-base on various downstream tasks [22].
Furthermore, the comparison of the F1 scores based on the RoBERTa-base and RoBERTa-large models suggests that the larger model is not necessarily better for classifying inference in the current dataset. The RoBERTa-base and RoBERTa-large models share the same Transformer architecture and training objective (masked language model with dynamic masking); the main differences between the two models are the number of layers (12 vs. 24) and the model size (110 million vs. 355 million parameters). Despite its greater representational capacity, the RoBERTa-large model did not perform significantly better than the smaller model with the same architecture. The average F1 score was even lower, and the SEM of the F1 scores was larger. The error analysis using the inference subtypes shows that the largest language model (RoBERTa-large) made more errors in classifying evaluative comments, paraphrases, and meaningless responses.
This lower accuracy of the larger model is probably due to overfitting. In Table 4, the optimal weight decay value was zero for RoBERTa-base and non-zero for RoBERTa-large. This suggests that regularization with weight decay worked for RoBERTa-large, yielding a higher F1 score than RoBERTa-large without weight decay. Without weight decay, the F1 score of RoBERTa-large would fall even further below that of RoBERTa-base. More data are needed to train the larger model, and a well-trained small model (RoBERTa-base) would be the preferred choice for inference classification with a relatively small dataset.

Insights from the Error Analysis for Automating Inference Classification
Error analysis using the inference subtypes provides further insights into automated inference classification as follows.

Elaboration was the most common subtype of inference in the dataset, and the error rates of elaboration decreased for more complex (RoBERTa-base) and larger (RoBERTa-large) models. Given enough elaboration samples for training, scaling up the model could further improve the classification accuracy of elaboration.
Paraphrase, the second most common inference subtype in the dataset, shows a different pattern. For all three models, the proportion of paraphrases in the errors was higher than their proportion in the dataset, and there was no clear order among the models. This suggests that fine-tuning the pre-trained models is not effective for paraphrase classification.
The accuracies for classifying elaboration and paraphrase seem to be in a trade-off relationship. Because paraphrase is classified as non-inferring, sentence pairs that are too close in meaning are classified as negative samples. This is the opposite of typical applications that measure the similarity between sentences and look for meaningfully related sentences. To correctly classify paraphrases as non-inferring, the classifier should reject similar sentence pairs. However, this would reduce the accuracy of classifying elaborations. Solving this problem would require new model architectures or training strategies, which are topics for future research.
For the classification of evaluative comments and meaningless responses, the error rates increased for more complex (RoBERTa-base) and larger (RoBERTa-large) models. The low accuracy in classifying evaluative comments is likely due to a lack of samples. In contrast, there were more samples of meaningless responses, but the error rates increased much more for the largest model (RoBERTa-large). These different patterns in error rates between the different subtypes suggest that the degree of overfitting varies across inference subtypes.

Conclusions
In this study, we investigated the feasibility of using language models to classify the inferences of sentence-response pairs. The proposed method achieved high F1 scores by fine-tuning Transformer-based pre-trained language models. Specifically, the highest F1 score, 0.92, was achieved by fine-tuning RoBERTa-base, which was higher than that of a model of the same size fine-tuned from BERT-base. A larger language model (RoBERTa-large) did not increase the classification accuracy. This suggests that choosing pre-trained language models of high quality and appropriate size is important.
The proposed method would allow the automated quantification of reader responses to a given text and improve the effectiveness of the think-aloud protocol. In the think-aloud protocol, reader responses are open ended and provide rich information which requires a trained expert to evaluate. This study opens the possibility of simulating trained experts in evaluating inferences, which is one of the key qualities of an effective reader.
Further work could be conducted in three ways. First, expanding the genre of the stimulus text would be a natural next step. Different genres have unique structures, styles, and conventions that would affect reading behaviors. Therefore, applying our method to different genres would test its generality. Second, classifying inference subtypes would be an interesting direction for future work. This requires a larger dataset, so that enough samples are collected for each inference subtype. Third, another future research direction is to adapt the proposed method to educational settings. For example, the automated inference classification of user responses could enrich interactive learning by tailoring content to meet individual learning needs and preferences. We are eager to further explore applications and improve learning outcomes.

Informed Consent Statement:
A comprehensive written consent form was provided to all participants and their parents. These forms outlined the objectives of the research, the extent of participation, potential benefits and risks, assurances of confidentiality, and the right of the participants to withdraw from the study at their discretion.

Figure 1. Using the think-aloud protocol, the subject's response to each sentence is collected (A). Human evaluators then assess the sentence-response pairs to determine whether an inference was drawn (B). The proposed method is used to classify a sentence-response pair as inferential or not, without having to involve human experts (C).

Figure 2. F1 scores of inference classification using different base models. Each error bar represents the standard error of the mean (SEM) measured from five-fold cross-validation. The asterisk corresponds to a statistically significant difference (paired t-test, p < 0.05), and n.s. indicates that the difference is not statistically significant (paired t-test, p > 0.05).

Figure 3 shows the proportions of the inference subtypes in the errors. The proportions of the inference subtypes in the entire dataset are shown as gray bars for reference. The proportions of the inference subtypes in inaccurate predictions are shown in red, green, and blue for the BERT-base, RoBERTa-base, and RoBERTa-large models, respectively.

Appl. Sci. 2024, 14, x FOR PEER REVIEW

Figure 3. The proportions of the inference subtypes in the entire dataset (gray) and in the errors made by fine-tuned models using different base models (red, green, and blue).

Funding:
This research received no external funding.

Institutional Review Board Statement:
This study was conducted in accordance with the Declaration of Helsinki and approved by the Institutional Review Board of Chonnam National University (1040198-170920-HR-074-02) on 13 October 2017.

Table 1. Inference types and definitions.

Table 2. Examples of sentence-response pairs in which inferences were made.

Table 3. Comparison of pre-trained models' architecture and training objectives.

Table 4. Hyperparameters for the best models.

Table 5. Accuracies of the best models.