An Interpretable and Reproducibility-Focused Evaluation Pipeline for Automatic Short-Answer Grading in Low-Resource Mathematics and Science Educational Datasets

González Maestre, Miguel Ángel; Cubero Juánez, Javier; de la Hoz Serrano, Alejandro; Melo, Lina

doi:10.3390/computers15050320

by Miguel Ángel González Maestre, Javier Cubero Juánez *, Alejandro de la Hoz Serrano * and Lina Melo

Department of Experimental Science and Mathematics Teaching Area, University of Extremadura, 06006 Badajoz, Spain

* Authors to whom correspondence should be addressed.

Computers 2026, 15(5), 320; https://doi.org/10.3390/computers15050320 (registering DOI)

Submission received: 16 March 2026 / Revised: 9 May 2026 / Accepted: 11 May 2026 / Published: 18 May 2026

(This article belongs to the Special Issue Transformative Approaches in Education: Harnessing AI, Augmented Reality, and Virtual Reality for Innovative Teaching and Learning (2nd Edition))

Featured Application

The proposed pipeline provides a practical and fully specified methodological framework for evaluating short open-ended student responses in mathematics and science education, particularly in low-resource settings. Its transparent and clearly documented design allows instructors and educational researchers to reliably compare classification models, assess evaluation stability through stratified cross-validation, and document methodological decisions in a consistent and replicable manner. The approach is especially suited to formative assessment scenarios and institutional benchmarking of automated grading systems, where interpretability, reliability, and methodological traceability are essential. By prioritizing evaluation stability and token-level interpretability over algorithmic complexity, the pipeline enables evidence-based assessment practices while remaining accessible to interdisciplinary educational stakeholders.

Abstract

Automated short-answer grading (ASAG) in educational contexts faces a fundamental trade-off between predictive performance, interpretability, and methodological transparency, particularly in data-constrained settings. While recent approaches rely on deep learning architectures, these models require large annotated datasets and offer limited transparency, restricting their applicability in authentic classroom environments. This study proposes a fully specified and interpretable machine learning pipeline for ASAG across multiple educational concepts. The approach is based on a shared TF–IDF representation and evaluates three linear classifiers (Logistic Regression, Multinomial Naïve Bayes, and Linear Support Vector Machines) under a stratified cross-validation framework adapted to small datasets. Model performance is assessed using accuracy, precision, recall, and F1-score. Statistical comparisons using the Wilcoxon signed-rank test provide exploratory evidence of statistically significant differences between classifiers, although the observed differences remain small in practical magnitude. Additionally, the methodology incorporates token-level analysis to identify discriminative lexical patterns and examine consensus across classifiers. To enhance interpretability, tokens are presented using a bilingual Spanish/English representation while preserving the original feature space. The results across ten concept-specific datasets show consistent performance across models (accuracy ≈ 0.82–0.88) and reveal stable lexical patterns associated with model predictions of correctness. The findings highlight that lightweight, interpretable models can provide consistent and reliable performance under resource-constrained educational conditions. The proposed framework contributes a stability-oriented and interpretable evaluation paradigm for ASAG, offering a practical alternative to data-intensive approaches in educational assessment. It is intended as a methodological reference protocol rather than a performance benchmark. The findings should be interpreted as evidence of within-context consistency instead of broad external generalization.

Keywords:

automatic short-answer grading; educational data mining; educational NLP; transparent machine learning; interpretable models; text classification; low-resource datasets; cross-validation; mathematics and science education; assessment analytics

1. Introduction

Automated assessment of short open-ended responses has become increasingly relevant in educational data mining and learning analytics [1]. Unlike multiple-choice questions, short answers provide richer insight into student understanding but introduce challenges for reliable and scalable evaluation.

Recent advances in Natural Language Processing have promoted deep learning models, particularly transformer-based architectures, for automated grading. While these approaches perform well in large-scale benchmarks, their applicability in classroom-scale instructional scenarios remains limited due to data requirements, computational cost, and reduced interpretability [2].

In contrast, real-world educational datasets are typically small, heterogeneous, and concept-specific [3,4,5,6,7]. Under these conditions, simpler and more transparent models are often more appropriate. However, prior work has largely prioritized predictive accuracy, with less attention to stability, reproducibility, and interpretability.

This study addresses these gaps by proposing a methodological framework centered on interpretability, stability, and reproducibility. Rather than optimizing a single model, multiple classifiers are evaluated under a shared feature representation across concept-specific datasets.

The proposed pipeline uses Term Frequency–Inverse Document Frequency (TF–IDF) features together with Logistic Regression, Multinomial Naïve Bayes, and Linear Support Vector Machines. TF–IDF provides a robust and interpretable lexical representation well suited to short educational texts.

Evaluation is conducted using stratified cross-validation with adaptive fold selection. Performance is assessed through multiple metrics (accuracy, precision, recall, and F1-score), complemented by Wilcoxon signed-rank tests for statistical comparison.

In addition to performance, the framework incorporates a lexical analysis component to enhance interpretability. Discriminative tokens are identified and compared across models to detect shared and model-specific patterns. Function words are retained, as they may encode structural signals in short definitional responses, and tokens are presented in a bilingual Spanish/English format to support interpretation.

This study analyzes ten concept-specific datasets from mathematics, science, and computing domains. Each dataset is processed independently to preserve conceptual integrity and avoid cross-concept leakage. Consequently, the results primarily reflect within-concept evaluation stability and should not be interpreted as evidence of broad cross-domain generalization without further external validation.

Unlike work focused on model innovation, this study formalizes a stability-oriented evaluation protocol for low-resource ASAG. The contribution is a fully documented methodological framework integrating (i) adaptive cross-validation, (ii) variability-aware evaluation, and (iii) cross-model lexical consensus analysis.

The main contributions are:

A transparent evaluation protocol for low-resource ASAG.
A multi-metric framework with non-parametric statistical testing.
A lexical consensus analysis for interpretable comparison.
A methodological design prioritizing transparency and stability.

By emphasizing methodological clarity, this work provides a practical alternative to data-intensive approaches and supports reliable automated assessment in authentic classroom conditions.

Therefore, the results should be interpreted as evidence of within-concept evaluation stability as opposed to generalization across educational domains, institutions, or languages. In this sense, the contribution of this work lies not in competing with state-of-the-art performance, but in providing a stable and fully documented evaluation framework that can be applied to both traditional and modern modeling approaches.

2. Related Works

2.1. Automatic Short-Answer Grading

Early Automatic Short-Answer Grading (ASAG) systems relied primarily on rule-based techniques, pattern matching, and semantic similarity measures, often constructed around handcrafted lexical resources or manually designed scoring heuristics [8]. These approaches were motivated by the need to approximate human grading criteria using interpretable linguistic signals, particularly in controlled learning scenarios where answer variability was relatively limited.

With the rise of machine learning and statistical natural language processing, classifiers such as Naïve Bayes and Support Vector Machines became standard tools for short-answer classification, particularly in settings with limited training data [3,9]. These approaches enabled data-driven modeling of lexical and structural patterns while maintaining computational efficiency and interpretability. Feature representations based on bag-of-words or TF–IDF weighting became dominant, as they provided robust performance across multiple educational datasets and domains [7].

Subsequent work introduced semantic similarity methods combining lexical overlap with syntactic or dependency-based representations. Notably, approaches leveraging semantic alignment and graph-based similarity demonstrated improvements in capturing conceptual equivalence across linguistically diverse student responses [10]. These methods highlighted a key challenge in ASAG: distinguishing linguistic variability from conceptual correctness, particularly in short responses.

More recent research has explored deep learning architectures, including recurrent neural networks, convolutional models, and attention-based architectures [2]. These models aim to learn distributed semantic representations directly from data, reducing the need for manual feature engineering. Transformer-based architectures and contextual embeddings further advanced performance on benchmark datasets, particularly when large-scale annotated corpora are available.

However, several studies suggest that the performance gains of deep neural architectures may be sensitive to dataset size, domain characteristics, and training conditions, particularly in small or heterogeneous learning corpora [3,4,5,6,7]. Educational datasets collected in authentic classroom environments are frequently small, heterogeneous, and domain-specific, limiting the effectiveness of data-intensive approaches. Consequently, classical semantic and machine learning approaches remain strong baselines in applied assessment scenarios, especially when interpretability and reproducibility are required.

Recent developments in educational NLP also emphasize the growing role of large language models (LLMs) in automated grading tasks. Emerging studies investigate prompt-based grading strategies and hybrid architectures combining similarity scoring with generative models [11,12,13,14]. While these approaches show promising zero-shot and few-shot capabilities, concerns remain regarding evaluation consistency, bias, and robustness across learning domains. Recent analyses indicate that LLM-based grading performance can vary depending on prompt design, evaluation configuration, and dataset characteristics, reinforcing the importance of stability-aware and reproducible evaluation protocols [12,13,14].

Hybrid architectures combining contextual embeddings with traditional classifiers have also been proposed as a compromise between performance and interpretability [15]. Additionally, domain-specific studies—including applications in medical education and multilingual grading contexts—highlight the increasing diversity of ASAG applications and methodological frameworks [16,17].

Overall, the literature suggests that model complexity alone does not guarantee reliable performance in learning environments. Instead, dataset characteristics, evaluation design, and interpretability considerations play a decisive role in determining the practical effectiveness of automated short-answer grading systems. While transformer-based approaches (e.g., BERT-based grading systems) often report higher performance in large-scale benchmarks, their dependency on large annotated datasets and limited interpretability restrict their applicability in resource-constrained classroom settings. In contrast, the present work prioritizes stability and transparency, offering a complementary methodological perspective rather than a direct performance-oriented alternative.

2.2. Interpretability and Transparency

Interpretability is a central concern in educational natural language processing, where model decisions must be explainable to researchers, instructors, and teachers [1,18]. Unlike many industrial machine learning applications, automated grading systems directly influence pedagogical decisions, making transparency not only a technical requirement but also an ethical and instructional necessity.

Linear classifiers provide direct access to feature weights or class-conditional probabilities, enabling token-level inspection without the need for post hoc explanation techniques. This property is particularly valuable in short-answer grading scenarios, where individual lexical elements often correspond to conceptual components of student understanding.

Recent work in interpretable machine learning highlights the importance of distinguishing between intrinsic interpretability and post hoc explanation methods [1]. While complex neural architectures may require additional explanation layers such as feature attribution or attention visualization, inherently interpretable models allow for direct inspection of the learned decision structure. In pedagogical settings, this distinction becomes especially relevant, as interpretability must remain accessible to interdisciplinary audiences, including educators without specialized machine learning expertise.

Transparency is also closely linked to fairness and bias detection in classroom scenarios. Studies addressing unintended consequences of machine learning emphasize that models trained on small or unbalanced datasets may capture spurious correlations or demographic artifacts [19]. In automated grading, such biases may lead to systematic misclassification of valid linguistic variations, particularly in multilingual or culturally diverse classroom environments.

Recent educational NLP research further highlights the interpretability challenges introduced by LLM-based grading systems [11,12,13,14]. Although these models provide strong semantic representations, their internal decision mechanisms are less directly interpretable than linear baselines, which may complicate the justification of grading outcomes in pedagogical terms [11,12,13,14]. This tension reinforces the continued relevance of interpretable baselines for methodological validation and comparative analysis.

Interpretability therefore supports not only technical validation but also instructional accountability and methodological transparency, aligning automated grading research with broader educational assessment principles [20].

2.3. Evaluation Practices and Reproducibility

Robust evaluation is essential in limited-data settings, where small dataset size and class imbalance may significantly affect performance estimates. Prior work highlights the limitations of single train–test splits and advocates for cross-validation as a standard practice for reliable performance estimation and model comparison [21,22].

Despite these recommendations, many ASAG studies continue to rely on limited evaluation protocols, often reporting single accuracy values without variability metrics. Such practices hinder reproducibility and make it difficult to compare results across datasets and methodological frameworks.

Educational data mining research increasingly emphasizes reproducibility as a core methodological requirement [21,22]. Transparent reporting of preprocessing steps, feature representations, and validation strategies is necessary to ensure that results can be replicated and meaningfully interpreted. This requirement is particularly important in pedagogical environments, where datasets are frequently private or institution-specific.

Recent work in educational NLP also stresses the importance of stability-aware evaluation. Variability across folds and concepts may provide critical insight into model robustness, yet remains underreported in many studies. Variability metrics such as standard deviation and coefficient of variation offer complementary information to aggregate accuracy measures and are especially relevant when datasets are small or heterogeneous.

The growing use of deep learning and LLM-based grading approaches further amplifies reproducibility concerns. Large models frequently involve complex training pipelines or external dependencies that may limit straightforward replication across institutional contexts. Additionally, prompt-based evaluation frameworks introduce new sources of variability related to prompt design and inference settings [12,13,14].

In response to these challenges, recent methodological literature advocates for fully scripted pipelines, transparent preprocessing workflows, and standardized evaluation frameworks to improve comparability across studies [11]. These recommendations align closely with the present work, which adopts stratified cross-validation as a core methodological principle and explicitly reports both performance and variability metrics.

By embedding reproducibility and evaluation stability directly into the experimental design, the proposed framework addresses a persistent methodological gap in applied ASAG research and contributes toward more transparent and reliable automated assessment practices.

Despite extensive research on automated grading models, comparatively less attention has been devoted to evaluation pipelines explicitly designed for small-scale educational datasets.

3. Data and Task Description

The datasets used in this study consist of short textual responses collected in an authentic learning context. The data were obtained from students enrolled in courses at the Faculty of Education of the University of Extremadura (Spain). All participants were undergraduate students in education degrees, responding as part of regular instructional activities.

The dataset comprises 2940 student responses, divided into ten concept-specific subsets of 294 entries each. Each subset is linked to a short, open-ended question targeting a specific conceptual definition. This relatively small sample size is representative of real-world learning datasets and reinforces the need for evaluation strategies specifically designed for low-resource conditions. Responses were collected in Spanish and reflect real classroom conditions, including typical variability in expression, brevity, and conceptual precision.

Each subset corresponds to a distinct foundational concept related to mathematics and science education: algorithm, computer programming, artificial intelligence, natural number, prime number, density, pH, living being, health, and global warming. Each concept is treated as an independent dataset in order to preserve conceptual specificity and to avoid cross-concept information leakage during model training and evaluation.

For each concept, student responses are annotated using a binary correctness scheme, indicating whether the response adequately captures the target definition. This annotation strategy reflects common assessment practices in education, particularly in formative and diagnostic evaluation scenarios.

All concept labels were manually translated into English on a one-by-one basis for reporting and visualization purposes, ensuring semantic fidelity between the original Spanish prompts and their English descriptors. The original Spanish responses were retained for all computational processing, preserving linguistic authenticity.

The datasets exhibit typical data-constrained characteristics, including limited sample size per concept, moderate class imbalance, and restricted lexical diversity. No external corpora, pretrained embeddings, or domain-specific lexicons are employed. This design choice ensures that the proposed pipeline remains applicable in realistic, resource-constrained classroom settings and that all reported results derive exclusively from the student data itself.

Taken together, these observations motivate the need for interpretable, stability-oriented ASAG pipelines specifically designed for limited-data classroom datasets, which constitutes the focus of the present study. Given the small sample size per concept, individual fold estimates may exhibit high variance. Therefore, all reported results are interpreted as aggregate indicators across folds and concepts, rather than as reliable estimates for any single split. A detailed summary of dataset size and class distribution for each concept is provided in Table 4 (Section 5.5).

4. Methodology

This section presents the methodological framework underlying the proposed fully documented, interpretable pipeline for short-answer grading in low-resource educational datasets. The design emphasizes transparency, reliability, and replicability, while remaining compatible with small to medium-sized datasets.

Instead of optimizing a single predictive model, the framework operationalizes the interpretability–stability–reproducibility triad introduced earlier. All methodological choices are explicitly reported to facilitate replication and critical assessment in applied educational research.

Figure 1 provides a schematic overview of the complete pipeline, from raw student responses through cross-validated evaluation and stability analysis to global lexical consensus interpretation.

4.1. Data Structure and Educational Context

The datasets analyzed in this study consist of short open-ended student responses to conceptual questions drawn from mathematics, science, and computing-related domains. Each dataset corresponds to a single concept and contains textual student answers paired with binary labels indicating correctness. This structure closely reflects typical formative and summative assessment scenarios in educational practice, where student responses are concise, concept-focused, and linguistically diverse.

The datasets are characterized by limited size and restricted lexical variability, a common situation in real-world learning environments and classroom-scale assessments. These data-constrained conditions motivate the use of robust validation strategies and interpretable models rather than data-intensive deep learning approaches, which often require substantially larger corpora to generalize reliably [3,21].

All datasets are processed independently to preserve conceptual specificity and avoid cross-concept information leakage. This design ensures that results reflect within-concept learning behavior, which is central to educational interpretation but limits direct generalization across domains.

4.2. Text Preprocessing and Feature Representation

Student responses are normalized through a minimal preprocessing pipeline that includes lowercasing, whitespace normalization, and the removal of non-alphanumeric characters, while preserving Spanish diacritics. This conservative preprocessing strategy is intentionally adopted to maintain linguistic fidelity and to avoid the unintended loss of semantically relevant information. In short responses, even minor lexical variations may encode important conceptual distinctions.
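The normalization steps listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `normalize_response` is hypothetical, and the choice to replace removed characters with a space (rather than delete them outright) is an assumption, since the text does not specify it.

```python
import re
import unicodedata

# Keep digits, ASCII letters, Spanish accented vowels, ü, ñ and whitespace.
# Everything else (punctuation, symbols) is treated as non-alphanumeric.
_NON_ALNUM = re.compile(r"[^0-9a-záéíóúüñ\s]")
_WHITESPACE = re.compile(r"\s+")


def normalize_response(text: str) -> str:
    """Lowercase, strip non-alphanumeric characters, collapse whitespace."""
    # Compose decomposed accents (e.g. 'u' + combining acute) into one character
    # so that the diacritics survive the character filter below.
    text = unicodedata.normalize("NFC", text)
    text = text.lower()
    # Assumption: removed characters become a space so adjacent words do not merge.
    text = _NON_ALNUM.sub(" ", text)
    return _WHITESPACE.sub(" ", text).strip()


example = normalize_response("  Un NÚMERO primo,  es... ¡divisible por 1 y por sí mismo! ")
```

Here `example` retains the accented characters in "número" and "sí" while punctuation and repeated spaces are dropped.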

Textual features are represented using Term Frequency–Inverse Document Frequency (TF–IDF) weighting, a well-established representation in educational text analysis and information retrieval [7]. TF–IDF balances the local importance of a term within a response against its global specificity across the dataset.

Formally, for a given term t in document d, the TF–IDF weight is defined as follows:

\mathrm{TF\text{-}IDF}(t, d) = \mathrm{tf}(t, d) \cdot \log\left(\frac{N}{\mathrm{df}(t)}\right),

where tf(t, d) is the term frequency of token t in document d, log denotes the natural logarithm, df(t) is the number of documents containing t, and N is the total number of documents in the dataset. The TF–IDF vectorizer is configured with a maximum vocabulary size of 5000 terms and an n-gram range of [1, 1] (unigrams only). Using the same configuration for every concept keeps results comparable across datasets, and the restricted feature space limits the risk of overfitting in limited-data conditions.

To prevent data leakage, the TF–IDF vectorizer is fitted exclusively on the training portion of each cross-validation fold and subsequently applied to the corresponding validation split. Consequently, N and df(t) are computed from training responses only, and no information from held-out data influences feature construction.
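As a worked illustration of the weighting formula, the following sketch computes TF–IDF weights directly from the definition, taking N and df(t) from the training documents only. The function name `tfidf_weights` and the toy sentences are hypothetical. Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2 normalization by default, so its values differ from the plain formula; the text does not state which variant was applied.

```python
import math
from collections import Counter


def tfidf_weights(train_docs, doc):
    """TF-IDF(t, d) = tf(t, d) * ln(N / df(t)), with N and df taken from train_docs."""
    n_docs = len(train_docs)
    df = Counter()
    for train_doc in train_docs:
        # Count each token once per document (document frequency).
        df.update(set(train_doc.split()))
    tf = Counter(doc.split())
    # Tokens never seen in training have no df(t) and are skipped,
    # which mirrors an out-of-vocabulary token at validation time.
    return {t: tf[t] * math.log(n_docs / df[t]) for t in tf if t in df}


# Toy placeholder documents (illustrative only, not data from the study).
train = ["un número primo", "un número natural", "la densidad es masa"]
weights = tfidf_weights(train, "un número primo primo desconocido")
```

In this toy case "primo" occurs twice in the scored response and in one of three training documents, so its weight is 2 · ln(3), whereas "un" appears in two training documents and receives the smaller weight ln(3/2).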

4.3. Cross-Validation Strategy

To ensure reliable performance estimation under low-resource conditions, all models are evaluated using stratified k-fold cross-validation. Stratification ensures that each fold preserves the original class distribution as closely as possible, a critical consideration when datasets are small or imbalanced.

The number of folds k is dynamically adjusted based on the available data, following classical recommendations for accuracy estimation in supervised learning [21,22]. Formally, the number of folds is defined as follows:

k = \min\left(5, \min_{c} n_{c}\right)

where n_{c} denotes the number of samples in class c, so that \min_{c} n_{c} is the size of the minority class. This adaptive choice prevents degenerate splits in which a fold would contain no instances of one class, a common risk in small educational datasets. Although k may vary across concepts, all metrics are computed within-concept and only aggregated at a higher level, mitigating direct comparability issues.
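The fold-selection rule can be expressed as a small helper. This is a sketch with hypothetical names (`adaptive_n_folds`, `max_folds`); the guard for fewer than two minority-class samples is an added assumption, since k-fold cross-validation needs at least two folds and the text does not say how that case is handled.

```python
from collections import Counter


def adaptive_n_folds(labels, max_folds=5):
    """Return k = min(max_folds, size of the smallest class)."""
    minority_size = min(Counter(labels).values())
    k = min(max_folds, minority_size)
    if k < 2:
        # Assumption: such a dataset cannot be cross-validated and is flagged.
        raise ValueError("Minority class has fewer than 2 samples; k-fold CV is undefined.")
    return k


# Hypothetical class counts, chosen only to show both branches of the rule.
k_balanced = adaptive_n_folds([1] * 40 + [0] * 12)
k_scarce = adaptive_n_folds([1] * 20 + [0] * 3)
```

With twelve minority-class responses the cap of five folds applies, while with only three the number of folds drops to three so that every fold can contain a minority-class instance.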

Cross-validation is applied independently to each concept dataset and each classifier (see Figure 2). Performance is reported as mean accuracy together with standard deviation across folds, thereby explicitly quantifying both central tendency and variability. As further discussed in Appendix A, this design emphasizes evaluation consistency rather than isolated peak performance, aligning with pedagogical assessment requirements where consistent behavior across cohorts and concepts is critical.

Because k depends on the minority-class size, the number of folds may vary across concept datasets. This introduces a controlled source of methodological variability, which is explicitly acknowledged when comparing variability-aware metrics across concepts. It reflects a trade-off between strict comparability and valid stratification under class constraints; valid stratification is prioritized to ensure reliable evaluation in limited-data settings.
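One way to realize this protocol with scikit-learn is sketched below, assuming the vectorizer is wrapped in a pipeline so that it is re-fitted inside every training fold. The toy responses, the shuffling seed, the use of the sample standard deviation (`ddof=1`), and the helper name `summarize` are all assumptions made for illustration; they are not taken from the study.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline

# Toy placeholder responses (illustrative only, not data from the study).
correct = [
    "un número primo solo es divisible por uno y por sí mismo",
    "número mayor que uno con exactamente dos divisores",
    "solo tiene dos divisores el uno y él mismo",
    "es divisible únicamente por uno y por sí mismo",
    "número natural que solo se divide por uno y por sí mismo",
]
incorrect = [
    "es un número impar",
    "un número muy grande",
    "número que no se puede dividir",
    "es un número que termina en uno",
    "un número que es par",
]
texts = correct + incorrect
labels = [1] * len(correct) + [0] * len(incorrect)

# The vectorizer lives inside the pipeline, so each fold fits TF-IDF on its
# training part only and held-out responses never shape the vocabulary.
pipeline = make_pipeline(
    TfidfVectorizer(max_features=5000, ngram_range=(1, 1)),
    LogisticRegression(C=1.0, class_weight="balanced"),
)

n_folds = min(5, min(labels.count(0), labels.count(1)))
cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)  # seed is an assumption
scores = cross_validate(pipeline, texts, labels, cv=cv, scoring=["accuracy", "f1"])


def summarize(values):
    """Mean, sample standard deviation and coefficient of variation across folds."""
    mean = float(np.mean(values))
    std = float(np.std(values, ddof=1))
    return {"mean": mean, "std": std, "cv": std / mean if mean > 0 else float("nan")}


summary = {metric: summarize(scores[f"test_{metric}"]) for metric in ("accuracy", "f1")}
```

Reporting the spread alongside the mean keeps fold-to-fold instability visible, which matters when each validation fold holds only a handful of responses.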

4.4. Classification Models

Within the shared methodological framework described above, three widely used and conceptually distinct linear classifiers are evaluated. These models are selected due to their interpretability, computational efficiency, and proven effectiveness in short-answer grading contexts [3,4,9]. The exclusive focus on linear models is intentional, as they provide intrinsic interpretability and allow for controlled comparison under low-resource conditions, avoiding confounding effects introduced by model complexity.

Logistic Regression (LR): Logistic Regression is a discriminative linear model that estimates the posterior probability of a response being correct. Given a feature vector x, the probability of correctness is defined as follows:

P (y = 1 | x) = \frac{1}{1 + e^{- (w^{T} x + b)}}

where w represents the learned feature weights and b is a bias term. Class weighting is applied to mitigate imbalance and ensure fair learning across labels. Importantly, the linear structure enables direct inspection of feature contributions, supporting interpretability.

Multinomial Naïve Bayes (NB): Multinomial Naïve Bayes is a generative probabilistic classifier based on conditional independence assumptions between features. Despite its simplicity, Naïve Bayes has demonstrated strong performance in text classification tasks and remains a widely used baseline in automated short-answer grading research [3,7]. Its probabilistic formulation allows for intuitive interpretation of token relevance through class-conditional likelihoods.

Linear Support Vector Machines (SVM): Linear Support Vector Machines are included as a maximum-margin classifier that emphasizes robustness in high-dimensional and sparse feature spaces. The linear formulation enables direct inspection of feature weights, while benefiting from strong theoretical guarantees in text categorization tasks [9].

Default hyperparameters from scikit-learn were used unless otherwise specified (e.g., C = 1.0 for SVM and Logistic Regression, Laplace smoothing for Naïve Bayes). Together, these models provide complementary perspectives on the same feature space while maintaining a shared level of transparency, facilitating fair comparison and consensus analysis.
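The three classifiers described above might be instantiated as follows. This is a sketch, not the authors' code: `build_models` is a hypothetical helper, `class_weight="balanced"` is an assumed reading of the class weighting mentioned for Logistic Regression, and `alpha=1.0` is the scikit-learn setting that corresponds to Laplace smoothing.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def build_models():
    """One TF-IDF + classifier pipeline per model, all sharing the same feature settings."""

    def tfidf():
        return TfidfVectorizer(max_features=5000, ngram_range=(1, 1))

    return {
        # Assumption: 'balanced' weighting; the text only says class weighting is applied.
        "LR": make_pipeline(tfidf(), LogisticRegression(C=1.0, class_weight="balanced")),
        "NB": make_pipeline(tfidf(), MultinomialNB(alpha=1.0)),  # alpha=1.0 is Laplace smoothing
        "SVM": make_pipeline(tfidf(), LinearSVC(C=1.0)),
    }


# Toy placeholder responses (illustrative only, not data from the study).
texts = [
    "un número primo es divisible solo por uno y por sí mismo",
    "número natural mayor que uno con dos divisores",
    "solo tiene dos divisores uno y él mismo",
    "es divisible únicamente por uno y por sí mismo",
    "un número que es impar",
    "es un número muy grande",
    "un número que no se puede dividir",
    "número que termina en uno",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

new_response = ["tiene solo dos divisores"]
predictions = {
    name: int(model.fit(texts, labels).predict(new_response)[0])
    for name, model in build_models().items()
}
```

Giving every model its own vectorizer with identical settings keeps the feature space shared in definition while still allowing each pipeline to be fitted fold by fold.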

4.5. Interpretability and Token-Level Analysis

Beyond predictive performance, the methodology explicitly incorporates token-level analysis to support interpretability. For each model, discriminative tokens are identified using model-specific scoring mechanisms, such as coefficient magnitudes in linear models or log-probability differences in Naïve Bayes.

These tokens are analyzed both individually and comparatively across models, enabling the identification of shared and model-specific lexical indicators of correctness. As detailed in Section 5 and Appendix A, tokens shared across classifiers are interpreted as recurrent lexical patterns associated with model agreement, whereas exclusive tokens highlight model-specific sensitivities. Shared patterns may nevertheless reflect structural regularities of short responses as opposed to direct evidence of conceptual understanding.

This interpretability-focused design aligns with recent calls for transparent and accountable automated assessment systems in education [18,19]. Instead of presenting model outputs as opaque scores, the pipeline enables researchers to inspect lexical patterns and assess their semantic and pedagogical coherence across classifiers and datasets.

Although linear models provide direct access to feature weights, the lexical analysis performed in this study should be understood as a post-training interpretative layer applied to model parameters. Its purpose is not to claim causal interpretability, but to identify stable and convergent lexical patterns across models.

These patterns should therefore be interpreted as exploratory model-consistent signals rather than direct indicators of conceptual understanding.

4.6. Evaluation Metrics and Statistical Analysis

Although accuracy is reported for comparability with prior work, F1-score is treated as the primary metric for model comparison, given its robustness under class imbalance. Accordingly, F1-score is the primary comparison metric in all summary tables, and accuracy is reported as a secondary reference measure. Effect sizes were not explicitly computed; however, the small absolute differences in mean F1-score suggest limited practical impact.

Precision, recall, and F1-score are computed for each model across cross-validation folds. These metrics offer complementary perspectives on classification performance, especially in the presence of class imbalance, where accuracy alone may provide an incomplete picture. Given the limited number of concept datasets, statistical results are interpreted conservatively, emphasizing effect consistency rather than strict significance.

All metrics are aggregated across folds and concepts, ensuring robustness and comparability across models.
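The per-fold computation and aggregation described above can be sketched for a single concept as follows. This is a hedged example, not the study's script: the toy responses, the choice of three folds, the seed value, and the pairing of TF-IDF with Logistic Regression are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline

# Hypothetical toy concept dataset (1 = correct, 0 = incorrect).
correct = ["numero divisible por uno y por si mismo",
           "solo divisible por uno y si mismo"] * 3
incorrect = ["es aquel que no tiene multiplos",
             "numero que no tiene divisores"] * 3
texts = correct + incorrect
labels = [1] * len(correct) + [0] * len(incorrect)

metrics = ["accuracy", "precision", "recall", "f1"]
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(random_state=0))
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

# One score per fold and metric; the vectorizer is refit inside each fold.
scores = cross_validate(pipeline, texts, labels, cv=cv, scoring=metrics)

# Aggregate across folds as (mean, standard deviation) per metric.
summary = {m: (scores[f"test_{m}"].mean(), scores[f"test_{m}"].std())
           for m in metrics}
```

Repeating this per concept and per classifier, then averaging the concept-level summaries, would yield global indicators of the kind described in this section.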

To further support the comparison between classifiers, non-parametric statistical testing is conducted using the Wilcoxon signed-rank test. This test is appropriate for paired comparisons of model performance across multiple datasets and does not assume normality of the underlying distributions.

Pairwise comparisons are performed using F1-score (primary) and accuracy (secondary). Where a statistical comparison is not meaningful (e.g., identical values or insufficient variability in the paired differences), results are reported conservatively and are not overinterpreted.

This multi-metric and statistically grounded evaluation framework strengthens the methodological rigor of the analysis and aligns with best practices in educational data mining and machine learning evaluation. These metrics are reported in Section 5 as aggregated global indicators of model performance. The Wilcoxon signed-rank test is applied to paired F1-scores computed at the concept level, resulting in 10 paired observations per model comparison. Given the limited number of concept-level observations (n = 10), the absence of multiple-comparison correction, and the lack of explicit effect size estimation, the statistical results should be interpreted as exploratory indicators of consistent directional differences as opposed to strong evidence of practically meaningful superiority between classifiers.
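A paired comparison of this kind can be sketched with SciPy's `wilcoxon` function. The ten concept-level F1 values below are invented for illustration, and the helper `compare_paired_f1`, including its rule of skipping the test when all paired differences are zero, is an assumption rather than the authors' procedure.

```python
import numpy as np
from scipy.stats import wilcoxon


def compare_paired_f1(f1_a, f1_b):
    """Two-sided Wilcoxon signed-rank test on paired concept-level F1-scores.

    Returns (statistic, p_value), or None when every paired difference is
    zero and the comparison is therefore not meaningful.
    """
    f1_a = np.asarray(f1_a, dtype=float)
    f1_b = np.asarray(f1_b, dtype=float)
    if np.allclose(f1_a - f1_b, 0.0):
        return None
    result = wilcoxon(f1_a, f1_b)
    return float(result.statistic), float(result.pvalue)


# Hypothetical F1-scores for two classifiers on 10 concepts (not study data).
f1_svm = np.array([0.80, 0.75, 0.90, 0.85, 0.70, 0.88, 0.79, 0.83, 0.91, 0.77])
f1_lr = f1_svm - np.linspace(0.005, 0.05, 10)  # SVM slightly higher everywhere

statistic, p_value = compare_paired_f1(f1_svm, f1_lr)
```

In this constructed example every difference has the same sign, so the test reports a small p-value even though the largest gap is only 0.05. This mirrors the point made above: with n = 10, statistical significance alone says little about practical magnitude.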

4.7. Lexical Processing and Bilingual Token Representation

The lexical analysis component of the methodology is designed to preserve as much linguistic information as possible while ensuring interpretability across languages. Preserving this information, however, introduces an interpretive limitation: highly frequent grammatical tokens may reflect response templates rather than conceptual content. This limitation is explicitly considered in the analysis.

Unlike in standard Natural Language Processing pipelines, function words (commonly referred to as stopwords) are not removed. This decision is motivated by the nature of short formative responses, where even high-frequency grammatical tokens may encode relevant stylistic or cognitive patterns associated with conceptual understanding. Furthermore, this approach is supported by prior work showing that function words can carry stylistic and structural information relevant for text classification tasks, particularly in short texts [23]. This design choice intentionally departs from standard NLP preprocessing, as removing function words was observed to reduce classification stability in preliminary pilot experiments.

Retaining these tokens allows the models to capture subtle differences in response formulation, which may be indicative of varying levels of conceptual mastery.

To facilitate interpretability for an international audience, tokens originally expressed in Spanish are automatically mapped to their English equivalents using a controlled bilingual dictionary. Each token is reported in the format “Spanish/English”, preserving the original lexical form while ensuring accessibility.

This bilingual representation supports transparent interpretation of model outputs without altering the underlying feature space used for training, thereby maintaining both linguistic fidelity and analytical clarity.
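The reporting step can be sketched as a simple dictionary lookup applied after training, so that the feature space itself is untouched. The dictionary entries, the helper name `to_bilingual`, and the fallback of returning an unmapped token unchanged are hypothetical choices for illustration.

```python
# Hypothetical excerpt of a controlled Spanish-to-English dictionary.
BILINGUAL_DICTIONARY = {
    "por": "by",
    "divisible": "divisible",
    "solo": "only",
    "que": "that",
}


def to_bilingual(token, dictionary=BILINGUAL_DICTIONARY):
    """Format a token as "Spanish/English" for reporting purposes only."""
    english = dictionary.get(token)
    # Assumption: tokens missing from the dictionary are reported unchanged.
    return f"{token}/{english}" if english is not None else token


report = [to_bilingual(t) for t in ["por", "solo", "mismo"]]
```

Because the mapping is applied only to the reported token labels, models are still trained and evaluated on the original Spanish tokens.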

4.8. Reproducibility and Implementation Considerations

All experiments are implemented using open-source Python libraries (see Appendix B) and follow a fully scripted workflow.

To enhance reproducibility and methodological transparency, the overall process is summarized in Figure 3.

This hierarchical structure highlights the separation between data handling, model evaluation, and consensus interpretation, which is critical for methodological clarity.

The full codebase is not publicly released at this stage due to ongoing development, so the complete workflow is not yet directly executable by external researchers. All experimental steps are nevertheless fully specified and documented in this article, and a public release is planned for future work. To ensure consistency of the experimental results, fixed random seeds were used across all stochastic components of the pipeline, including data splitting and model initialization.

All experiments rely exclusively on standard, widely available libraries (e.g., scikit-learn), ensuring methodological accessibility and facilitating future reproducibility.
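The role of fixed seeds in data splitting can be illustrated with scikit-learn's `StratifiedKFold`. This is a sketch under stated assumptions: the seed value `42`, the helper `make_folds`, and the five-fold setting are hypothetical and are not taken from the study.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

SEED = 42  # hypothetical value; the article does not report the seed used


def make_folds(labels, n_splits=5, seed=SEED):
    """Return reproducible stratified (train, test) index lists."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    placeholder = np.zeros((len(labels), 1))  # only the labels drive the split
    return [(train.tolist(), test.tolist())
            for train, test in cv.split(placeholder, labels)]


labels = [0] * 10 + [1] * 10
folds_first_run = make_folds(labels)
folds_second_run = make_folds(labels)
```

Passing the same `random_state` to every stochastic component (fold shuffling and any classifier that accepts a seed) is what makes two runs of the same script return identical partitions and scores.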

5. Results

Throughout the analysis, F1-score is treated as the primary comparison metric, as it provides a balanced assessment under class imbalance. Accuracy is reported as a secondary indicator. Unless otherwise stated, differences between models are considered meaningful only when they are consistent across metrics and across concept-specific datasets.

The experimental results are reported using mean classification accuracy and variability metrics computed across cross-validation folds and educational concepts. This dual reporting strategy is essential in low-resource formative scenarios, where small sample sizes and heterogeneous student responses can produce unstable estimates if variability is not explicitly considered. Throughout this section, results are therefore interpreted jointly in terms of performance and stability, with additional guidance provided in Appendix A for readers less familiar with these evaluation metrics.

5.1. Global Classification Performance

Table 1 summarizes the global classification performance across models, including accuracy, precision, recall, and F1-score, aggregated over cross-validation folds and concepts. F1-score is treated as the primary metric for model comparison and should be considered the main reference when interpreting differences across classifiers; accuracy is reported as a secondary reference. From a practical perspective, differences below approximately 0.02–0.03 in F1-score should be interpreted as negligible under the observed variability.

Across all evaluated concepts, the three classifiers (Logistic Regression, Multinomial Naïve Bayes, and Linear Support Vector Machines) exhibit broadly comparable performance. The Support Vector Machine shows slightly higher mean values, and Wilcoxon signed-rank tests indicate statistically significant differences between classifiers. However, these differences are relatively small in magnitude and should be interpreted cautiously under the given sample size and variability conditions. No correction for multiple comparisons was applied, as the analysis is exploratory and focused on effect consistency rather than strict hypothesis testing. All statistical comparisons are primarily based on F1-score, the main evaluation metric.

Importantly, none of the classifiers consistently dominates across all concepts. This finding reinforces the methodological importance of comparative evaluation frameworks over isolated peak performance metrics, particularly in educational applications where robustness and generalizability are critical.

To provide a global visual comparison, Figure 4 presents the mean accuracy together with the corresponding standard deviation for each classifier.

As discussed in Appendix A.1, overlapping error bars suggest that apparent differences in mean accuracy should be interpreted cautiously. From a formative assessment perspective, these results suggest that all three linear models offer similarly reliable baseline performance under realistic classroom-scale conditions.

Table 2 reports the results of the Wilcoxon signed-rank test for pairwise comparisons between classifiers. All pairwise differences are statistically significant (p < 0.05). However, consistent with the observed performance distributions, these differences remain small in magnitude and do not reflect a consistent dominance of any single classifier across concepts.

These statistical results should be interpreted as exploratory, given the limited number of concept-level observations (n = 10), the absence of multiple-comparison correction, and the lack of explicit effect size estimation. Accordingly, statistically significant differences should be understood as small in practical magnitude.

In summary, all pairwise comparisons are statistically significant (p < 0.05). However, these differences are not consistent in magnitude across concepts and remain small in practical terms.

5.2. Cross-Concept Performance Distributions

While aggregated accuracy provides a useful first-order comparison, it does not capture how models behave across heterogeneous educational concepts. To address this limitation, results are disaggregated at the concept level.

Figure 5 displays the distribution of accuracy scores across individual concepts for each classifier.

This figure shows moderate dispersion across concepts, confirming that concept-specific linguistic and semantic characteristics influence model behavior. Certain concepts exhibit systematically lower accuracy, likely reflecting greater lexical variability or ambiguity in acceptable student responses.

Crucially, dispersion patterns remain broadly consistent across classifiers. This suggests that observed variability is primarily driven by dataset characteristics rather than intrinsic model instability. In other words, differences between concepts appear more influential than differences between algorithms, a result that aligns with prior findings in educational NLP.

5.3. Stability and Variability Analysis

Given the limited scale of the datasets, stability across concepts constitutes a central evaluation criterion. To explicitly quantify stability, the coefficient of variation (CV) is computed for each model as the ratio between standard deviation and mean accuracy across concepts (see Appendix A.1 for interpretive guidance).

Stability is analyzed using the coefficient of variation (CV), as illustrated in Figure 6 and Figure 7.

Recent studies on prompt-based grading using large language models report competitive performance in zero-shot settings, but also highlight sensitivity to prompt formulation and evaluation setup. For example, Yoon et al. [12] show that accuracy variations of several percentage points may arise from minor prompt modifications, even on the same dataset [12,24,25]. This variability reinforces the importance of stability-oriented evaluation frameworks such as the one proposed in this study.

It is important to note that, due to the small size of individual folds (in some cases fewer than 10 samples), variability estimates at the fold level may be unstable. For this reason, stability is analyzed at the cross-concept level, where aggregated patterns provide a more reliable signal.

This analysis shows that models with similar mean accuracy can differ substantially in stability. In particular, classifiers exhibiting slightly lower accuracy may nonetheless demonstrate more consistent behavior across concepts. Such consistency is a desirable property in formative assessment contexts, where predictability across diverse student populations is often prioritized over marginal accuracy gains.

To visually explore the relationship between performance and stability, Figure 6 plots mean accuracy against CV for each classifier.

The scatter plot highlights that higher accuracy does not necessarily correspond to lower variability. This finding supports the methodological argument that stability metrics should complement accuracy when evaluating automated short-answer grading systems.

Finally, Figure 7 provides a direct comparison of CV values across classifiers.

Together, these analyses confirm that cross-validation-based stability metrics provide essential insights beyond aggregate accuracy, particularly under realistic low-resource conditions.

5.4. Lexical Consensus and Token-Level Results

Beyond performance metrics, the proposed pipeline enables token-level consensus analysis across classifiers, offering insights into shared and model-specific lexical signals. Discriminative tokens correspond to the highest-weighted lexical features identified by each classifier according to their respective model coefficients or likelihood ratios. For the purposes of this analysis, shared tokens are defined as discriminative tokens identified by two or more classifiers, whereas exclusive tokens correspond to tokens identified by only one classifier. The consensus analysis was conducted on the set of discriminative tokens extracted from the TF-IDF feature space generated during model training.
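The shared and exclusive token definitions given above can be expressed as a short sketch. The token lists are invented for illustration, and the helper name `consensus_tokens` is hypothetical; only the rule itself (shared means identified by two or more classifiers, exclusive means identified by exactly one) comes from the text.

```python
from collections import Counter


def consensus_tokens(tokens_by_model, min_models=2):
    """Split discriminative tokens into shared and model-exclusive sets."""
    # Count in how many classifiers each token appears (once per classifier).
    counts = Counter(tok for toks in tokens_by_model.values() for tok in set(toks))
    shared = {tok for tok, n in counts.items() if n >= min_models}
    exclusive = {
        model: sorted(tok for tok in set(toks) if counts[tok] == 1)
        for model, toks in tokens_by_model.items()
    }
    return shared, exclusive


# Hypothetical top tokens per classifier (not taken from Table 3).
tokens_by_model = {
    "LR": ["por", "divisible", "numero"],
    "NB": ["por", "que", "divisible"],
    "SVM": ["por", "solo"],
}
shared, exclusive = consensus_tokens(tokens_by_model)
```

Here "por" is identified by all three classifiers and "divisible" by two, so both count as shared, while each remaining token is exclusive to one model.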

Figure 8 illustrates the degree of lexical overlap among discriminative tokens identified by different models, including both shared tokens (identified by multiple classifiers) and model-specific tokens. Function words appear among discriminative tokens due to their structural role in definitional responses and are therefore intentionally retained rather than removed through conventional stopword filtering.

The presence of a substantial shared token set suggests convergent lexical patterns across classifiers, despite their distinct learning mechanisms. This convergence indicates the existence of a stable lexical core consistently identified across classifiers in correct-class prediction.

Complementarily, Figure 9 reports the number of model-specific exclusive tokens, that is, discriminative tokens identified by a single classifier.

While exclusive tokens are present, their relative proportion remains limited, reinforcing the overall pattern of cross-model agreement.

To focus on robust lexical signals, Figure 10 presents the most frequent shared tokens across classifiers, averaged over correct-class responses.

Finally, these results are consolidated in Table 3, which reports global consensus discriminative tokens along with their supporting models and mean frequencies.

This table supports descriptive interpretability by listing lexical elements that the models consistently identify in correct-class predictions, independent of classifier choice. Instead of providing direct pedagogical prescriptions, it offers an empirically grounded lexical reference that can inform subsequent analytical and interpretive work. These tokens should therefore be interpreted cautiously, as model-consistent signals rather than direct evidence of conceptual understanding.

Overall, the consensus analysis suggests that a substantial portion of the discriminative lexical signals identified by the models is shared across classifiers, while model-specific tokens remain limited. This pattern supports the consistency of the extracted lexical cues and indicates that the pipeline captures recurrent linguistic regularities instead of classifier-specific artefacts. These signals include both structural tokens (function words) and domain-specific tokens. All reported results should be interpreted as baseline stability indicators rather than task-specific performance ceilings.

The analysis should therefore be interpreted as exploratory and descriptive, rather than as a causal explanation of model behavior.

5.5. Dataset Characteristics

Table 4 (see Section 3) summarizes the size and class distribution of each concept-specific dataset. As expected in real-world classroom settings, the datasets exhibit moderate variability in sample size and class balance. These characteristics reinforce the importance of using stratified cross-validation and multi-metric evaluation, as discussed in Section 4. The average response length ranges from 7.6 to 13.5 tokens, confirming the short-answer nature of the task. Avg_length and Std_length correspond to the mean and standard deviation of response length measured in tokens. Each concept is treated as an independent dataset with 294 responses, ensuring controlled and comparable evaluation across concepts. These statistics provide an additional indicator of linguistic variability across concepts.
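The Avg_length and Std_length statistics can be reproduced in outline as follows. This sketch assumes whitespace tokenization and the population standard deviation, neither of which is specified in the article, and it uses only the two Prime Number responses quoted in Example 2 rather than a full concept dataset.

```python
import statistics


def length_stats(responses):
    """Mean and standard deviation of response length in whitespace tokens."""
    lengths = [len(response.split()) for response in responses]
    # Assumption: population standard deviation; the article does not specify.
    return statistics.mean(lengths), statistics.pstdev(lengths)


responses = [
    "Número que solo es divisible por 1 y por sí mismo.",  # 11 tokens
    "Es aquel que no tiene múltiplos.",                    # 6 tokens
]
avg_length, std_length = length_stats(responses)
```

For these two responses the average length is 8.5 tokens, which falls within the 7.6 to 13.5 range reported above.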

Representative examples of correct and incorrect student responses are provided in Examples 1 and 2 to illustrate the linguistic variability and conceptual complexity of the data.

Example 1: Artificial Intelligence.

Correct: “La inteligencia artificial es la rama de la informática que busca crear sistemas capaces de realizar tareas que normalmente requieren inteligencia humana, como aprender, razonar, reconocer o tomar decisiones.” / “Artificial intelligence is the branch of computer science that seeks to create systems capable of performing tasks that normally require human intelligence, such as learning, reasoning, recognizing, or making decisions.”

Incorrect: “Como chatGPT, es un robot que da respuestas rápidas y ayuda al proceso de aprendizaje y actividades cotidianas” / “Like ChatGPT, it is a robot that gives fast answers and helps with the learning process and daily activities.”

Example 2: Prime Number.

Correct: “Número que solo es divisible por 1 y por sí mismo.” / “Number that is only divisible by 1 and by itself.”

Incorrect: “Es aquel que no tiene múltiplos.” / “It is that which has no multiples.”

Statistical comparisons using the Wilcoxon signed-rank test indicate statistically significant differences between classifiers across concept datasets. However, no model consistently dominates across all concepts, and differences should be interpreted as modest and context-dependent.

6. Discussion

The results yield several methodological and educational insights relevant to automated short-answer grading in data-constrained settings. Rather than identifying a single optimal classifier, the analysis highlights how evaluation design, stability considerations, and interpretability jointly shape the reliability of automated assessment systems. The objective of this work is not to outperform deep learning models, but to establish a stable and interpretable evaluation baseline under realistic low-resource constraints. While transformer-based architectures such as BERT have demonstrated strong performance in large-scale NLP tasks, their reliance on large annotated datasets, computational cost, and limited interpretability make them less suitable for the limited-data, pedagogically grounded settings considered in this study. The focus on authentic classroom data prioritizes ecological validity over benchmark comparability.

First, the observed consistency across cross-validation folds and educational concepts suggests that carefully designed evaluation protocols may help mitigate the uncertainty inherent to small and heterogeneous datasets [12,22]. As shown in Section 5 and further clarified in Appendix A, the explicit incorporation of variability metrics—particularly the coefficient of variation (CV)—reveals aspects of model behavior that remain invisible when evaluation is limited to point estimates of accuracy. In this sense, consistency emerges not as a secondary diagnostic, but as a central criterion for responsible model assessment in authentic instructional environments. This finding aligns with broader recommendations in educational data mining and learning analytics, which emphasize reliability and reproducibility over maximal performance.

Second, the comparative analysis across Logistic Regression, Multinomial Naïve Bayes, and Linear Support Vector Machines demonstrates that differences between linear classifiers are relatively modest when evaluated under a unified and transparent pipeline. While small variations in mean accuracy are observed, variability-aware metrics provide a more informative basis for distinguishing practical suitability. Models with similar average performance can differ meaningfully in their sensitivity to concept-level variation, as illustrated by the Accuracy–CV trade-off analyses in Section 5.3. These results suggest that, in small-scale learning environments, preprocessing decisions, feature representations, and validation strategies may exert greater influence on outcome reliability than classifier selection itself. Such observations are consistent with prior work in automated short-answer grading, which favors simplicity, transparency, and methodological rigor over architectural complexity [3,4]. Although statistical tests detect significant pairwise differences, the practical impact of these differences remains limited, as reflected by overlapping performance distributions and similar stability patterns across classifiers.

Third, the results highlight the relationship between predictive evaluation and model transparency. By identifying discriminative tokens that are consistently selected across classifiers, the proposed pipeline provides a principled mechanism for inspecting and characterizing model behavior. Unlike post hoc explanation techniques, which may introduce additional layers of abstraction or uncertainty, consensus-based token analysis operates directly on model-internal representations. The resulting shared lexical core offers a stable, model-independent reference point that enhances trust in automated grading outcomes while remaining analytically grounded.

From an educational perspective, these findings support the responsible deployment of automated assessment systems. Transparent models combined with stability-aware evaluation reduce the risk of overfitting to idiosyncratic cohorts, concepts, or linguistic patterns. In light of growing concerns regarding AI-generated student responses and automated detection systems [24], the ability to justify grading decisions through consistent performance patterns and recurrent lexical signals becomes increasingly important for maintaining trust, accountability, and pedagogical legitimacy.

Overall, the discussion underscores that reproducibility, consistency, and interpretability should be treated not as auxiliary considerations, but as core design principles for automated short-answer grading in realistic classroom settings. By foregrounding these dimensions, the present study contributes a methodological perspective that complements performance-oriented research while remaining closely aligned with educational practice. A limitation of the present study is the use of binary correctness labels, whereas real formative assessment often involves partial credit or multi-level grading schemes. Future work could extend the proposed framework to ordinal or continuous scoring settings, enabling finer-grained evaluation of student understanding. Differences in k across datasets constitute a trade-off between statistical validity and comparability. The chosen strategy prioritizes valid stratification over uniform fold counts.

The proposed framework does not eliminate the limitations imposed by small datasets. Instead, it makes these limitations explicit through variability-aware evaluation. The goal is therefore not statistical generalization in the classical sense, but methodological robustness under constrained data conditions. Consequently, the findings should not be interpreted as evidence of external validity, but rather as a demonstration of methodological robustness under controlled limited-data conditions. Although statistical differences were detected, the observed effect magnitudes remained modest and did not translate into consistent practical superiority across concepts.

The prominence of function words among discriminative tokens introduces an interpretive limitation. While such tokens may capture structural or stylistic regularities in short responses, their prominence makes it difficult to disentangle conceptual understanding from recurring response templates. Future work should explore methods to separate structurally informative features from concept-bearing lexical content.

We therefore distinguish between structurally informative tokens and concept-bearing lexical content, and consider this a limitation of purely lexical interpretability.

The proposed evaluation protocol is validated within concept-specific datasets derived from a single instructional context. As such, the results primarily demonstrate within-concept stability rather than generalization across domains, institutions, or languages.

While this design reflects realistic classroom conditions, the findings should not be interpreted as evidence of broad external validity beyond the studied setting. Future work will extend the protocol to cross-domain and cross-linguistic scenarios in order to assess its generalizability.

This study does not include transformer-based baselines such as DistilBERT due to the limited size of the datasets. Prior work suggests that such models may underperform or exhibit high variance under limited-data conditions. Preliminary pilot experiments with transformer-based models (not reported here) showed high variance and unstable performance under the same cross-validation conditions. Future work will include controlled comparisons under the same evaluation protocol.

7. Conclusions and Future Work

The present study introduced an interpretable and stability-oriented evaluation pipeline for automatic short-answer grading in resource-constrained instructional settings. Instead of optimizing for maximal predictive performance, the proposed methodology focuses on comparative evaluation, cross-concept robustness, and transparent model behavior, aligning more closely with the practical and ethical requirements of learning assessment.

The results demonstrate that, under realistic low-resource conditions, linear classifiers combined with simple lexical representations can achieve stable and comparable performance when evaluated using appropriate validation protocols. Importantly, the integration of variability metrics and lexical consensus analysis highlights that evaluation stability and interpretability provide essential complementary insights beyond aggregate accuracy values. These dimensions enable a more nuanced understanding of model behavior, particularly in heterogeneous educational contexts where consistency across concepts is often more valuable than marginal performance gains.

Several promising directions for future work naturally emerge from this framework. First, the token-level consensus analysis opens the door to more fine-grained linguistic investigations, including the study of semantic equivalence, conceptual paraphrasing, and lexical diversity in student responses. Such analyses may help clarify how different linguistic strategies support correct conceptual understanding across domains, without conflating linguistic variation with conceptual error.

Second, while the current pipeline focuses on linear models for reasons of transparency and reliability, the methodological framework itself remains model-neutral. Future studies may explore how more expressive representations or contextualized embeddings behave under the same stability-oriented evaluation criteria. This would enable principled comparisons between transparent models and more complex approaches, without sacrificing methodological clarity or evaluation fairness.

Finally, the modular and fully documented design of the pipeline facilitates its extension to broader classroom-scale scenarios, including multilingual datasets, longitudinal student cohorts, and formative assessment contexts. By maintaining a clear separation between performance evaluation, stability analysis, and interpretability mechanisms, future work can systematically investigate each dimension while preserving transparency and replicability.

Taken together, these findings position the present contribution as a foundational methodological reference. It is intended to support subsequent analytical and interpretive studies while promoting responsible, transparent, and stable deployment of automated grading systems in formative research and practice. The proposed pipeline is intentionally designed as a methodological baseline rather than a performance benchmark, enabling transparent comparison across future educational NLP studies. This work aims to contribute toward improved methodological transparency and comparability in small-scale educational NLP evaluation. The scripted design of the pipeline, including all code and evaluation scripts, is intended to facilitate future replication, extension, and methodological comparison. The inclusion of multiple evaluation metrics and statistical testing further reinforces the proposed framework.

Author Contributions

Conceptualization, M.Á.G.M.; methodology, M.Á.G.M., A.d.l.H.S. and L.M.; software, M.Á.G.M.; validation, M.Á.G.M., J.C.J., A.d.l.H.S. and L.M.; formal analysis, M.Á.G.M.; investigation, M.Á.G.M., J.C.J., A.d.l.H.S. and L.M.; resources, M.Á.G.M., J.C.J., A.d.l.H.S. and L.M.; data curation, M.Á.G.M. and L.M.; writing—original draft preparation, M.Á.G.M.; writing—review and editing, J.C.J., A.d.l.H.S. and L.M.; visualization, M.Á.G.M., J.C.J., A.d.l.H.S. and L.M.; supervision, J.C.J., A.d.l.H.S. and L.M.; project administration, L.M.; funding acquisition, L.M. All authors have read and agreed to the published version of the manuscript.

Funding

This project has been 85% co-financed by the European Union, the European Regional Development Fund, the Regional Government of Extremadura, the Managing Authority, and the Ministry of Finance, through project GR24052.

Institutional Review Board Statement

This study was conducted in accordance with institutional ethical guidelines for educational research. According to these guidelines, formal ethical review and approval were not required because the study relied exclusively on anonymized student responses collected as part of regular instructional activities, without experimental intervention or collection of personal or sensitive information.

Informed Consent Statement

Participants were informed that anonymized formative responses could be used for research and dissemination purposes, in accordance with institutional educational and research guidelines.

Data Availability Statement

The datasets analyzed in this study are not publicly available due to ethical and institutional restrictions, as they contain student-generated learning data. Anonymized data may be available from the corresponding author upon reasonable request, subject to institutional approval and applicable data protection regulations.

AI-Assisted Writing Disclosure: The author used AI tools exclusively for language editing, translation support, stylistic refinement, and citation consistency checks. All scientific content—including research design, data collection, annotation, methodology, and interpretation—was developed entirely by the author. The author reviewed all outputs and assumes full responsibility for this manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

TF-IDF	Term Frequency–Inverse Document Frequency
NLP	Natural Language Processing
LR	Logistic Regression
NB	Multinomial Naïve Bayes
SVM	Support Vector Machine
ASAG	Automatic Short Answer Grading

Appendix A. How to Read and Interpret Results

This appendix provides guidance for readers on how to interpret the figures and tables reported in Section 5. Its purpose is to facilitate understanding of the evaluation framework and the reported metrics, particularly for readers who may be less familiar with cross-validation procedures, stability indicators, or machine learning-based evaluation in formative contexts. Rather than introducing additional results, this appendix serves as a reading companion to Section 5, clarifying how the reported figures should be interpreted and compared.

Appendix A.1. Accuracy and Variability Metrics

Tables reporting mean performance summarize the predictive behavior of each classifier across cross-validation folds and educational concepts. In the proposed evaluation framework, the F1-score is the primary metric for model comparison because of class imbalance, and accuracy is retained as a complementary indicator. The F1-score should therefore be the main reference when interpreting model comparisons throughout this manuscript.
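To illustrate why accuracy alone can mislead under class imbalance, the following minimal sketch compares accuracy with a macro-averaged F1-score on a small toy example. The labels are hypothetical, and macro averaging is an assumption made for illustration only, since the averaging scheme is not specified in this appendix.

```python
def accuracy(y_true, y_pred):
    """Proportion of responses whose predicted label matches the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def f1_for_class(y_true, y_pred, positive):
    """F1-score for a single class, treating `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # convention: F1 is 0 when the class is never correctly predicted
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1-scores (assumed averaging scheme)."""
    classes = sorted(set(y_true))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Hypothetical imbalanced data: 8 responses in class 0 and 2 in class 1.
y_true = [0] * 8 + [1] * 2
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * 10

acc = accuracy(y_true, y_pred)
macro = macro_f1(y_true, y_pred)
print(f"accuracy = {acc:.3f}, macro F1 = {macro:.3f}")
```

In this toy case the classifier never identifies the minority class, yet its accuracy remains high, whereas the macro-averaged F1-score drops sharply.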

Accuracy reflects the proportion of correctly classified student responses and provides an intuitive measure of overall performance. However, it should not be interpreted in isolation. In small or heterogeneous datasets, performance estimates may vary substantially depending on how the data are partitioned.

To address this, all performance metrics are accompanied by variability measures. Standard deviation (SD) is used as the primary indicator of variability, quantifying the absolute dispersion of metric values across folds or concepts. A low SD indicates consistent performance across different data splits, whereas a high SD suggests sensitivity to sample composition.

The coefficient of variation (CV), defined as the ratio between standard deviation and mean performance, is included as a secondary, relative measure of variability. While CV facilitates comparison across models with similar mean values, it should be interpreted cautiously and not used as a standalone criterion for model assessment.
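The two variability measures can be sketched as follows. The fold scores are hypothetical, and the use of the sample standard deviation (n − 1 denominator) is an assumption, as the appendix does not state which estimator was used.

```python
import statistics


def summarize(scores):
    """Return mean, standard deviation, and coefficient of variation of a list of scores."""
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)  # sample SD (n - 1 denominator); an assumption
    cv = sd / mean                 # CV = SD / mean, as defined in the text
    return mean, sd, cv


# Hypothetical F1-scores from five cross-validation folds.
fold_scores = [0.84, 0.88, 0.86, 0.90, 0.82]

mean, sd, cv = summarize(fold_scores)
print(f"mean = {mean:.3f}, SD = {sd:.4f}, CV = {cv:.4f}")
```

Because CV divides by the mean, it becomes unstable when mean performance is low, which is one reason to treat it as a secondary indicator.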

Figures displaying mean values with error bars should therefore be interpreted with care. Overlapping error bars suggest that the corresponding models perform comparably, and small differences in mean values should not be overinterpreted when variability is substantial. From an educational perspective, consistency across cohorts and concepts is often more relevant than marginal gains in average performance.

Appendix A.2. Cross-Concept Performance Distributions

Scatter plots and concept-wise accuracy distributions illustrate how models behave across heterogeneous concepts. Each point in these figures corresponds to a single concept, allowing readers to observe whether a classifier performs uniformly across topics or exhibits concept-dependent variability.

Concepts associated with linguistic ambiguity, diverse acceptable formulations, or limited lexical cues may naturally yield lower or more variable accuracy. Consequently, isolated low-performing points should not be interpreted as model failure but rather as indicators of concept-specific difficulty.

A compact distribution of points indicates stable performance across concepts, suggesting that the model behaves consistently despite differences in subject matter. In contrast, a widely dispersed distribution reveals sensitivity to dataset characteristics and highlights concepts where automated grading may require additional pedagogical or linguistic modeling.

These visualizations are particularly useful for identifying patterns of robustness and vulnerability that may not be evident from aggregated accuracy values alone.

Appendix A.3. Stability-Oriented Rankings and Accuracy–Variability Trade-Offs

To complement aggregate performance metrics, this study considers performance and variability jointly. Rather than defining strict rankings, these analyses provide a descriptive interpretation of how models balance predictive performance and consistency.

Scatter plots relating performance (e.g., F1-score or accuracy) and variability (SD or CV) should be interpreted as follows: Models combining high performance with low variability represent more reliable configurations. Conversely, models with similar mean performance but higher variability may be less predictable across datasets.
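As a worked illustration of this joint reading, the sketch below tabulates mean accuracy, SD, and the derived CV from the values reported in Table 1. It assumes that the SD column of Table 1 refers to accuracy. The helper name `variability_summary` is hypothetical.

```python
# Mean accuracy and SD copied from Table 1.
# Assumption: the SD column refers to accuracy.
TABLE_1 = {
    "Logistic Regression": (0.86, 0.045),
    "Naïve Bayes": (0.82, 0.052),
    "Support Vector Machine": (0.88, 0.035),
}


def variability_summary(table):
    """Attach the coefficient of variation (SD / mean) to each model's entry."""
    rows = []
    for model, (mean_acc, sd) in table.items():
        rows.append({"model": model, "mean": mean_acc, "sd": sd, "cv": sd / mean_acc})
    return rows


summary = variability_summary(TABLE_1)
for row in summary:
    print(f"{row['model']:<24} mean={row['mean']:.2f}  SD={row['sd']:.3f}  CV={row['cv']:.3f}")
```

Under this assumption, the Support Vector Machine shows both the highest mean accuracy and the lowest SD and CV in Table 1. The tabulation is descriptive and is not intended as a strict ranking.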

Given the small sample sizes typical of educational datasets, these indicators should be interpreted descriptively rather than as definitive criteria for model selection. Their primary purpose is to support cautious and transparent comparison across models.

This joint interpretation is particularly important in learning environments, where assessment tools are expected to behave consistently across diverse student populations and instructional contexts.

Appendix A.4. Lexical Overlap and Consensus Tokens

Figures and tables related to lexical overlap examine the degree of agreement among classifiers in identifying discriminative tokens. Tokens shared across models indicate lexical features recurrently associated with correct-class predictions, regardless of the underlying algorithm. These shared tokens can be interpreted as robust lexical signals associated with model predictions of correctness, which may reflect both conceptual content and structural regularities of short responses.

In contrast, exclusive tokens—those identified by only one classifier—reflect model-specific sensitivities and may arise from differences in optimization criteria or feature weighting. While such tokens may still provide useful insights, they should be interpreted with greater caution.
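The distinction between shared and exclusive tokens can be expressed with simple set operations. The sketch below is illustrative only: the per-model token lists are hypothetical (some tokens are borrowed from Table 3 for readability), and the function name `token_overlap` is not part of the original pipeline.

```python
def token_overlap(top_tokens):
    """Split discriminative tokens into those shared by all models and those
    identified by exactly one model.

    top_tokens maps a model name to an iterable of its top-ranked tokens.
    """
    sets = {model: set(tokens) for model, tokens in top_tokens.items()}
    # Shared tokens: present in every model's list.
    shared = set.intersection(*sets.values())
    # Exclusive tokens: present in one model's list and in no other.
    exclusive = {}
    for model, tokens in sets.items():
        others = set().union(*(s for m, s in sets.items() if m != model))
        exclusive[model] = tokens - others
    return shared, exclusive


# Hypothetical top-token lists for three classifiers.
top_tokens = {
    "LR": ["masa", "volumen", "relación", "acidez"],
    "NB": ["masa", "volumen", "de", "que"],
    "SVM": ["masa", "volumen", "relación", "solo"],
}

shared, exclusive = token_overlap(top_tokens)
print("shared:", sorted(shared))
for model, tokens in exclusive.items():
    print(f"exclusive to {model}:", sorted(tokens))
```

Tokens selected by two of the three models (here "relación") fall into neither group, which is why shared and exclusive counts need not sum to the total vocabulary of discriminative tokens.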

The table of global consensus discriminative tokens represents a high-confidence lexical core derived from agreement across multiple classifiers and concepts. Importantly, this table does not imply semantic causality or pedagogical sufficiency; rather, it highlights recurrent linguistic patterns that consistently appear in correct responses.

As such, it provides an empirical reference point for further interpretive or pedagogical analysis, without replacing domain-specific expertise or instructional judgment. In particular, function words should be interpreted primarily as indicators of response structure rather than direct evidence of conceptual understanding.

Appendix A.5. Statistical Comparison (Wilcoxon Signed-Rank Test)

To support model comparison, this study employs the Wilcoxon signed-rank test, a non-parametric statistical test designed for paired comparisons. This test evaluates whether the median difference between paired observations (e.g., model performance across concept datasets) differs significantly from zero.

The test is particularly appropriate in this context because it does not assume normality of the data and is robust under small sample sizes, which are characteristic of educational datasets. In this study, the Wilcoxon test is applied to paired F1-scores computed at the concept level, resulting in ten paired observations per model comparison.

A statistically significant result (p < 0.05) indicates that the observed differences between classifiers are unlikely to reflect random variation alone under the observed evaluation setting. However, statistical significance does not necessarily imply practical relevance, particularly under small sample conditions. Therefore, these results should always be interpreted jointly with effect magnitude and variability measures.
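To make the mechanics of the test concrete, the following is a minimal, self-contained sketch of an exact two-sided Wilcoxon signed-rank test for paired scores. It assumes no zero differences and no tied absolute differences, and the paired F1-scores are hypothetical. In practice a library routine such as `scipy.stats.wilcoxon` would normally be used, and this sketch is not presented as the implementation behind Table 2.

```python
def wilcoxon_exact(x, y):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Assumes no zero differences and no ties among absolute differences.
    Returns (statistic, p_value), where statistic = min(W+, W-).
    """
    diffs = [round(a - b, 10) for a, b in zip(x, y)]  # rounding limits float noise
    magnitudes = sorted(abs(d) for d in diffs)
    if 0 in magnitudes or len(set(magnitudes)) != len(magnitudes):
        raise ValueError("this sketch does not handle zero or tied differences")
    ranks = {m: i + 1 for i, m in enumerate(magnitudes)}
    w_plus = sum(ranks[abs(d)] for d in diffs if d > 0)
    w_minus = sum(ranks[abs(d)] for d in diffs if d < 0)
    stat = min(w_plus, w_minus)

    # Null distribution: each rank 1..n carries a positive sign with probability 1/2.
    # counts[s] = number of subsets of {1, ..., n} whose ranks sum to s.
    n = len(diffs)
    max_sum = n * (n + 1) // 2
    counts = [0] * (max_sum + 1)
    counts[0] = 1
    for r in range(1, n + 1):
        for s in range(max_sum, r - 1, -1):
            counts[s] += counts[s - r]
    p_value = min(1.0, 2 * sum(counts[: stat + 1]) / 2 ** n)
    return stat, p_value


# Hypothetical concept-level F1-scores for two classifiers (ten concepts).
model_a = [0.88, 0.85, 0.90, 0.87, 0.86, 0.91, 0.84, 0.89, 0.83, 0.92]
model_b = [0.87, 0.83, 0.87, 0.83, 0.81, 0.85, 0.77, 0.81, 0.74, 0.82]

stat, p_value = wilcoxon_exact(model_a, model_b)
print(f"W = {stat}, p = {p_value:.6f}")
```

When one classifier outperforms the other on all ten concepts, the statistic is 0 and the exact two-sided p-value is 2/2^10 ≈ 0.00195, the smallest value attainable with ten untied pairs. This illustrates why p-values from so few pairs should be read alongside effect magnitude.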

In the context of this study, significant differences between classifiers are observed, but their magnitude remains small. Therefore, results are interpreted conservatively, emphasizing consistency across concepts rather than isolated statistical outcomes.

Appendix B. Implementation Notes and Reproducibility Guidelines

This appendix outlines key implementation aspects of the proposed pipeline, with the aim of facilitating replication, adaptation, and transparent evaluation. The focus is not on providing exhaustive code documentation, but on clarifying the main design choices that underpin the reported results.

Appendix B.1. Software Environment and Libraries

All experiments are implemented in Python using widely adopted open-source libraries. Core dependencies include:

Pandas and NumPy for data handling and numerical computation;
Scikit-learn [26] for feature extraction, model training, and cross-validation;
Matplotlib for figure generation;
python-docx for automated generation of publication-ready tables.

The reliance on standard libraries minimizes dependency-related issues and allows researchers from diverse backgrounds to reproduce the pipeline without specialized infrastructure.
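One simple way to document the software environment alongside reported results is to record the installed versions of these dependencies. The sketch below uses only the Python standard library. The package names are the usual distribution names of the libraries listed above, and the function name `environment_report` is hypothetical.

```python
import platform
from importlib import metadata

# Distribution names of the dependencies listed in this appendix.
PACKAGES = ["pandas", "numpy", "scikit-learn", "matplotlib", "python-docx"]


def environment_report(packages=PACKAGES):
    """Map each package name to its installed version, or 'not installed'."""
    report = {}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


report = environment_report()
print("Python", platform.python_version())
for name, version in report.items():
    print(f"{name}: {version}")
```

Storing this output with each set of results makes it easier to diagnose discrepancies that arise from library updates.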

Appendix B.2. Cross-Validation and Randomness Control

To ensure experimental consistency, all sources of randomness are explicitly controlled through fixed random seeds. Stratified k-fold cross-validation is employed to preserve class proportions within each fold, which is particularly important in classroom datasets with limited size or class imbalance.

The number of folds is dynamically adjusted based on class distributions to avoid degenerate splits. This design choice prioritizes valid evaluation over rigid adherence to a fixed fold count and reflects practical constraints commonly encountered in formative data collection.

Researchers extending the pipeline are encouraged to maintain this validation strategy, as it provides a balanced compromise between robustness and feasibility in small-scale educational datasets. The use of fixed random seeds ensures that all reported results can be reproduced under the same experimental conditions.
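The validation strategy described above can be sketched in plain Python as follows. Two points are assumptions made for illustration: the rule for adjusting the number of folds (the smaller of a default of five and the size of the smallest class) and the seed value 42. The pipeline itself relies on Scikit-learn for cross-validation, where `StratifiedKFold` provides the equivalent functionality.

```python
import random
from collections import Counter, defaultdict


def choose_n_splits(labels, max_splits=5):
    """Pick a fold count that every class can populate (assumed rule)."""
    smallest_class = min(Counter(labels).values())
    n_splits = min(max_splits, smallest_class)
    if n_splits < 2:
        raise ValueError("each class needs at least two responses for cross-validation")
    return n_splits


def stratified_folds(labels, n_splits, seed=42):
    """Assign response indices to folds while preserving class proportions."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_splits)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal each class's indices round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)
    return folds


# Class distribution of the "Artificial Intelligence" concept in Table 4.
labels = [0] * 226 + [1] * 68

n_splits = choose_n_splits(labels)
folds = stratified_folds(labels, n_splits)
for i, fold in enumerate(folds, start=1):
    minority = sum(labels[j] for j in fold)
    print(f"fold {i}: {len(fold)} responses, {minority} in class 1")
```

Each fold serves once as the validation set while the remaining folds form the training set. Rerunning the function with the same seed yields identical folds.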

Appendix B.3. Common Pitfalls and Practical Recommendations

Several practical considerations should be taken into account when applying or extending the proposed framework:

Extremely skewed class distributions should be inspected carefully, as they may distort both accuracy and variability estimates.
Performance metrics should always be interpreted jointly with variability measures to avoid overestimating model reliability.

Finally, the scripted nature of the pipeline allows all figures and tables to be regenerated automatically from raw data. This reduces the risk of manual errors, ensures consistency between reported results and underlying computations, and supports transparent and reproducible research practices. Researchers applying the framework to new educational contexts are encouraged to report both performance and variability metrics in order to preserve methodological comparability across studies.
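As a simple check on class skew, the sketch below computes the minority-class share and the majority-class baseline accuracy for each concept, using the class counts reported in Table 4. The 0.30 threshold used to flag concepts is purely illustrative and is not a criterion used in this study.

```python
# Class counts (Class0, Class1) copied from Table 4.
TABLE_4 = {
    "Algorithm": (163, 131),
    "Programming": (211, 83),
    "Artificial Intelligence": (226, 68),
    "Natural Number": (220, 74),
    "Prime Number": (179, 115),
    "Density": (118, 176),
    "pH": (154, 140),
    "Living Being": (202, 92),
    "Health": (181, 113),
    "Global Warming": (210, 84),
}


def majority_baseline(class0, class1):
    """Accuracy obtained by always predicting the more frequent class."""
    return max(class0, class1) / (class0 + class1)


def minority_share(class0, class1):
    """Proportion of responses belonging to the less frequent class."""
    return min(class0, class1) / (class0 + class1)


THRESHOLD = 0.30  # hypothetical cut-off for flagging a concept for inspection

flagged = []
for concept, (c0, c1) in TABLE_4.items():
    share = minority_share(c0, c1)
    baseline = majority_baseline(c0, c1)
    if share < THRESHOLD:
        flagged.append(concept)
    print(f"{concept:<24} minority share={share:.3f}  majority baseline={baseline:.3f}")

print("flagged for inspection:", flagged)
```

Comparing a model's accuracy on a concept with its majority-class baseline indicates how much of the reported accuracy reflects learning rather than class frequency.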

References

1. Molnar, C. Interpretable Machine Learning; Leanpub: Victoria, BC, Canada, 2020.
2. Riordan, B.; Horbach, A.; Cahill, A.; Zesch, T.; Lee, C.M. Investigating neural architectures for short answer scoring. In Proceedings of the BEA Workshop 2017; Association for Computational Linguistics: Copenhagen, Denmark, 2017; pp. 159–168.
3. Burrows, S.; Gurevych, I.; Stein, B. The eras and trends of automatic short answer grading. Int. J. Artif. Intell. Educ. 2015, 25, 60–117.
4. Shourya, R.; Bhatt, H.S.; Narahari, Y. An Iterative Transfer Learning Based Ensemble Technique for Automatic Short Answer Grading. arXiv 2016, arXiv:1609.04909.
5. Liu, T.; Ding, W.; Wang, Z.; Tang, J.; Huang, G.; Liu, Z. Automatic Short Answer Grading via Multiway Attention Networks. arXiv 2019, arXiv:1909.10166.
6. Zhang, M.; Baral, S.; Heffernan, N.; Lan, A. Automatic Short Math Answer Grading via In-Context Meta-Learning. arXiv 2022, arXiv:2205.15219.
7. Manning, C.D.; Raghavan, P.; Schütze, H. Introduction to Information Retrieval; Cambridge University Press: Cambridge, UK, 2008.
8. Dikli, S. An overview of automated scoring of essays. J. Technol. Learn. Assess. 2006, 5, 1–35.
9. Joachims, T. Text categorization with support vector machines. In Proceedings of ECML; Springer: Berlin/Heidelberg, Germany, 1998.
10. Mohler, M.; Bunescu, R.; Mihalcea, R. Learning to grade short answer questions using semantic similarity measures and dependency graph alignments. In Proceedings of the ACL 2011; Association for Computational Linguistics: Portland, OR, USA, 2011; pp. 752–762.
11. Kasneci, E.; Sessler, K.; Küchemann, S.; Bannert, M.; Dementieva, D.; Fischer, F.; Gasser, U.; Groh, G.; Günnemann, S.; Hüllermeier, E.; et al. ChatGPT for Good? On Opportunities and Challenges of Large Language Models for Education. Learn. Individ. Differ. 2023, 103, 102274.
12. Yoon, S.-Y. Short Answer Grading Using One-Shot Prompting and Text Similarity Scoring Model. arXiv 2023, arXiv:2305.18638.
13. Kostiuk, Y.; Vitman, O.; Gagała, Ł.; Kiulian, A. Evaluating LLM Judgment on Latvian and Lithuanian Short Answer Matching Tasks. arXiv 2025, arXiv:2501.09164.
14. Li, X.; Zhou, Z.; Liu, Z.; Wu, Y.; Luo, W. GradingAttack: Attacking Large Language Models Towards Short Answer Grading Ability. arXiv 2026, arXiv:2602.00979.
15. Zhu, X.; Wu, H.; Zhang, L. Automatic Short-Answer Grading via BERT-Based Deep Neural Networks. IEEE Trans. Learn. Technol. 2022, 15, 364–375.
16. Grévisse, C. LLM-Based Automatic Short Answer Grading in Undergraduate Medical Education. BMC Med. Educ. 2024, 24, 1060.
17. Zheng, H.; Sun, Q.; Li, Q.; Liu, Y.; Ouyang, Y.; Cao, Q. FusionASAG: An LLM-Enhanced Automatic Short Answer Grading Model for Subjective Questions in Online Education. In Computer Science and Educational Informatization; Communications in Computer and Information Science; Zhang, K., Song, X., Obaidat, M.S., Eds.; Springer: Singapore, 2025; pp. 39–52.
18. Loukina, A.; Madnani, N.; Cahill, A. Speech- and Text-Driven Features for Automated Scoring of English Speaking Tasks. In Proceedings of the First Workshop on Speech-Centric Natural Language Processing, Copenhagen, Denmark, 7–11 September 2017; Association for Computational Linguistics (ACL): Stroudsburg, PA, USA, 2017; pp. 67–77.
19. Suresh, H.; Guttag, J. A framework for understanding unintended consequences of machine learning. Commun. ACM 2021, 64, 56–65.
20. Black, P.; Wiliam, D. Assessment and classroom learning. Assess. Educ. 1998, 5, 7–74.
21. Baker, R.S.; Inventado, P.S. Educational data mining and learning analytics. In Learning Analytics; Springer: New York, NY, USA, 2014.
22. Romero, C.; Ventura, S. Educational data mining: A review of the state of the art. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 2010, 40, 601–618.
23. Salton, G.; Buckley, C. Term-weighting approaches in automatic text retrieval. Inf. Process. Manag. 1988, 24, 513–523.
24. Stefanovič, P.; Pliuskuvienė, B.; Radvilaitė, U.; Ramanauskaitė, S. Machine learning model for ChatGPT usage detection in students’ answers to open-ended questions: Case of Lithuanian language. Educ. Inf. Technol. 2023; in press.
25. Zhao, W.; Wallace, E.; Feng, S.; Klein, D.; Singh, S. Calibrate Before Use: Improving Few-Shot Performance of Language Models. In Proceedings of the 38th International Conference on Machine Learning (ICML); PMLR (Proceedings of Machine Learning Research): New York, NY, USA, 2021; pp. 12697–12706.
26. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.

Figure 1. Pipeline Overview Scheme. From raw student responses to cross-validated evaluation, stability analysis, and lexical consensus interpretation.

Figure 2. Cross-validation scheme across five folds (k₁–k₅). Blue boxes represent training data, while yellow boxes indicate validation sets.

Figure 3. Pseudocode Structure.

Figure 4. Global comparative accuracy across classifiers (mean ± SD). Error bars represent variability across cross-validation folds, emphasizing stability alongside accuracy.

Figure 5. Accuracy distribution across educational concepts. Each point corresponds to the mean cross-validated accuracy across folds for a given concept and classifier. Overlapping points are annotated (×2) when necessary.

Figure 6. Scatter plot of mean accuracy versus coefficient of variation (CV). This figure illustrates the trade-off between predictive performance and stability across concepts.

Figure 7. Coefficient of variation in accuracy across educational concepts. Lower bars indicate more stable performance across heterogeneous datasets.

Figure 8. Lexical overlap between classifiers. Shared versus exclusive discriminative tokens across models.

Figure 9. Model-specific exclusive discriminative tokens. Exclusive tokens highlight model-specific lexical sensitivities.

Figure 10. Most frequent shared tokens across classifiers (correct class). Tokens are ranked by mean frequency across models, highlighting consensus discriminative vocabulary.

Table 1. Global classification performance across models (mean values across cross-validation folds and concepts).

Model	Accuracy	Precision	Recall	F1	SD
Logistic Regression	0.86	0.86	0.86	0.86	0.045
Naïve Bayes	0.82	0.83	0.82	0.79	0.052
Support Vector Machine	0.88	0.87	0.87	0.87	0.035

Table 2. Wilcoxon signed-rank test results for pairwise model comparisons (based on F1-score across concept datasets).

Model 1	Model 2	Statistic	p-Value	Significant (p < 0.05)
Logistic Regression	Naïve Bayes	0.0	0.0019	Yes
Logistic Regression	Support Vector Machine	3.0	0.0352	Yes
Naïve Bayes	Support Vector Machine	0.0	0.0019	Yes

Table 3. Global consensus discriminative tokens across classifiers.

Token (ES/EN)	Supporting Models	Frequency in Correct Responses	Number of Models
de/of	Logistic Regression, Naïve Bayes, Support Vector Machine	847	3
que/that	Logistic Regression, Naïve Bayes, Support Vector Machine	733	3
la/the	Logistic Regression, Naïve Bayes, Support Vector Machine	581	3
un/a-one	Logistic Regression, Naïve Bayes, Support Vector Machine	472	3
es/is	Logistic Regression, Naïve Bayes, Support Vector Machine	399	3
se/(reflexive particle)	Logistic Regression, Naïve Bayes, Support Vector Machine	204	3
para/so that	Logistic Regression, Naïve Bayes, Support Vector Machine	196	3
masa/mass	Logistic Regression, Naïve Bayes, Support Vector Machine	173	3
volumen/volume	Logistic Regression, Naïve Bayes, Support Vector Machine	170	3
por/by	Logistic Regression, Naïve Bayes, Support Vector Machine	151	3
en/in	Logistic Regression, Naïve Bayes, Support Vector Machine	149	3
entre/over	Logistic Regression, Naïve Bayes, Support Vector Machine	139	3
relación/relation	Logistic Regression, Naïve Bayes, Support Vector Machine	104	3
acidez/acidity	Logistic Regression, Naïve Bayes, Support Vector Machine	101	3
solo/only	Logistic Regression, Naïve Bayes, Support Vector Machine	95	3

Table 4. Dataset composition across educational concepts, including total number of responses and class distribution.

Concept	N	Class0	Class1	Avg_Length	Std_Length
Algorithm	294	163	131	9.0	6.0
Programming	294	211	83	10.1	5.8
Artificial Intelligence	294	226	68	13.0	7.5
Natural Number	294	220	74	8.1	4.9
Prime Number	294	179	115	11.1	4.7
Density	294	118	176	9.5	6.0
pH	294	154	140	7.6	6.5
Living Being	294	202	92	9.4	5.0
Health	294	181	113	8.9	5.4
Global Warming	294	210	84	13.5	9.0

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

González Maestre, M.Á.; Cubero Juánez, J.; de la Hoz Serrano, A.; Melo, L. An Interpretable and Reproducibility-Focused Evaluation Pipeline for Automatic Short-Answer Grading in Low-Resource Mathematics and Science Educational Datasets. Computers 2026, 15, 320. https://doi.org/10.3390/computers15050320

