Article

Automatic Scoring of Laboratory Reports Using Multi-Dimensional Feature Engineering and Ensemble Learning with Dynamic Threshold Control

School of Information Engineering, Henan University of Science and Technology, Luoyang 471023, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(8), 3649; https://doi.org/10.3390/app16083649
Submission received: 7 March 2026 / Revised: 2 April 2026 / Accepted: 3 April 2026 / Published: 8 April 2026
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

In the field of engineering, the advancement of automated scoring systems for laboratory reports has been significantly hampered by three persistent challenges: scarcity of high-quality annotated data, high domain-specific complexity, and insufficient model interpretability. To address these limitations, this study proposes an AdaBoost regression model based on multi-level feature engineering and threshold control, denoted as MFTC-ABR. This method constructs a multi-dimensional feature set using a lightweight neural network, which evaluates laboratory reports across four core dimensions: comprehension of experimental principles, completion of experimental procedures, depth of result analysis, and plagiarism detection. At the scoring algorithm level, a dynamic threshold adjustment mechanism is integrated into the AdaBoostReg ensemble learning framework. By redesigning the sample weight update rule, per-sample prediction errors are divided into three intervals: the acceptable region, the stable learning range, and the focus range. Accordingly, a differentiated weight update strategy is implemented, and a history-aware mechanism is introduced to further regulate the attention allocated to individual samples. Finally, experimental results on the power electronics laboratory report dataset show that the MFTC-ABR model achieves a mean absolute error (MAE) of 3.09 and a scoring consistency rate of 82% within a five-point error tolerance. These findings validate the effectiveness and practicality of the proposed method for automatic assessment in specialized domains with limited data availability.

1. Introduction

Engineering education is central to cultivating talent for high-end manufacturing and the emerging industrial system, and a key dimension for evaluating the quality of higher education under the Emerging Engineering Education Initiative. As a core part of engineering education, experimental teaching bridges professional theoretical knowledge and practical engineering ability. Laboratory report assessment, an indispensable tool for evaluating students' practical ability, theoretical mastery, logical reasoning, and adherence to academic norms, is a key research topic in automated scoring of open-ended questions in STEM education [1].
Different from general text writing and standardized test scoring, the core of engineering laboratory report grading lies in the multi-dimensional and systematic comprehensive evaluation of students’ engineering application capability of professional knowledge, integrity of experimental design and implementation, rigor of result analysis, and scientificity of error tracing. This imposes extremely high professional requirements on the accuracy, standard consistency, interpretability and fairness of the scoring method. However, the traditional manual grading mode of laboratory reports has long been a core pain point in the practical teaching of engineering education.
On the one hand, manual grading inherently carries high time and labor costs. Grading the laboratory reports for a single foundational professional course often requires teachers to devote dozens of hours of continuous work, crowding out energy that should be allocated to core teaching tasks such as instructional design and personalized tutoring [2]. On the other hand, long-term, high-intensity repetitive review inevitably leads to reviewer fatigue, which introduces subjective bias, fluctuating leniency, and inconsistent scoring standards into the grading results. In extreme cases, review degenerates into the invalid pattern of "valuing format over content", posing a severe challenge to the fairness and reliability of practical ability evaluation in engineering education. More importantly, manual grading makes timely feedback to students difficult. Existing studies have confirmed that timely scoring feedback is a core element for optimizing the learning process and improving autonomous learning [3], and the time lag of the traditional mode severely restricts the implementation of formative evaluation in engineering experimental teaching.
In this context, Automated Grading Systems (AGSs) driven by machine learning and natural language processing technologies have become a key technical solution to address the above pain points. This technology enables efficient, objective and standardized automatic evaluation of laboratory reports. It can not only greatly reduce the burden of repetitive work for teachers, but also ensure the consistency and fairness of grading results, while providing students with immediate scoring feedback. It has become a cutting-edge research direction in the interdisciplinary field of educational technology and artificial intelligence, and has demonstrated great application potential in STEM education [1,2].
Existing automated text-scoring studies have achieved remarkable progress in general essay evaluation and standardized short-answer assessment, but still face critical adaptation challenges when applied directly to scoring professional engineering laboratory reports. The unique characteristics of this task impose strict requirements on the accuracy, interpretability, and low-cost deployment of the scoring model in small-sample professional scenarios. Developing an automated scoring model that meets these requirements therefore has clear practical necessity and research value. Targeting the core pain points of automated scoring in the professional engineering field, this study constructs an automated scoring framework based on multi-level interpretable feature engineering and an improved AdaBoostReg algorithm. This framework enriches the theoretical research system of automatic text scoring in specialized professional scenarios, and provides new insights for research on automated assessment in STEM education [1].

2. Related Work

2.1. Development Paradigms of Automated Text-Scoring Technology

Automated Text Scoring (ATS), a core research direction in educational AI, enables automated evaluation of students' text responses. In their authoritative review of STEM automated grading systems, Tan et al. [1] divided the core technologies in this field into four categories covering the full technological evolution of ATS. Considering the task characteristics of engineering laboratory report scoring, this paper organizes ATS methods into three progressive paradigms, each with distinct technical features and inherent limitations in the target scenario.

2.1.1. Rule-Based and Semantic Similarity Matching Methods

As the earliest ATS paradigm, this method achieves scoring via manually constructed expert rules, keyword libraries, and semantic similarity matching with preset standard answers.
Chen et al. [4] realized automated scoring of objective questions in laboratory reports through keyword matching, a typical early application of this paradigm. To improve semantic capture ability, Wang et al. [5] constructed an electrical engineering professional ontology and evaluated semantic similarity via ontology network node distance; Magooda et al. [6] proposed sentence-level vectorization based on Word2Vec/GloVe embedding and TF-IDF weighted aggregation for semantic matching; Landauer et al. [7] developed an LSA-based automated essay scoring system, confirming that LSA has a stronger semantic capture capability than LDA.

This paradigm relies on alignment with a single standard answer, which cannot accommodate diverse but reasonable responses in engineering report scoring, leading to over-simplified evaluation dimensions. Moreover, it only outputs an overall similarity score, with no traceable scoring basis or targeted feedback, and entirely lacks the interpretability required for formative evaluation.

2.1.2. Traditional Machine Learning-Based Automated Scoring Methods

To break the rigid limitations of rule-based methods, research shifted to data-driven traditional machine learning models, which have long been the mainstream solution for Automatic Short-Answer Grading (ASAG) [8].
Early studies focused on surface linguistic feature extraction: Ramalingam et al. [9] confirmed that syntactic and semantic features combined with polynomial regression can build an effective nonlinear mapping for essay scoring; Weegar et al. [10] verified that fusing Sentence-BERT embeddings with traditional word frequency features improves short-answer scoring accuracy. Subsequent studies proposed refined hierarchical feature engineering: Wei et al. [11] built a multi-level language analysis framework to capture text complexity from multiple linguistic levels; Zhang et al. [12] adopted a two-stage word vector training strategy to improve domain semantic capture ability for semi-open-ended questions; relevant studies also achieved excellent performance in semi-open-ended science question scoring via N-Gram, word embedding and lightweight machine learning models [13].
At the model level, ensemble learning became the mainstream scheme for its strong nonlinear fitting ability: Bani et al. [14] combined a text similarity algorithm with an artificial neural network (ANN) to improve scoring accuracy; related studies built an integrated scoring system with a scoring engine and an adaptive fusion engine, reaching a Pearson correlation coefficient of 91.30% [15]; stacking ensemble learning and Bayesian networks were also effectively applied in small-sample scoring and dynamic assessment scenarios [16,17,18]. Notably, the latest empirical study by Ferreira Mello et al. [8] confirmed that traditional machine learning models outperform GPT-4 in overall ASAG performance even with optimized prompt engineering, verifying their irreplaceable advantages in highly professional short-answer assessment tasks.
Model performance is highly dependent on feature engineering quality, and existing studies have not built a standardized feature system aligned with engineering laboratory report scoring standards. Most feature systems focus on general linguistic features, lacking deep alignment with professional scoring dimensions, and cannot solve the oversimplification of evaluation dimensions. Meanwhile, data-driven models rely heavily on annotated data scale, and traditional ensemble algorithms are prone to overfitting and insufficient robustness in small-sample professional scenarios.

2.1.3. Deep Learning and Pre-Trained Language Model-Based Methods

The end-to-end deep learning paradigm greatly reduces models’ dependence on manual feature engineering, bringing revolutionary changes to ATS.
Early attempts adopted LSTM, GRU and CNN to capture sequential dependencies and local features in text, but required large-scale annotated data. The Transformer architecture and BERT pre-trained models have since become the mainstream base models in ATS: Ikiss et al. [19] built an efficient end-to-end regression scoring model by fine-tuning BERT and combining it with LSTM; Nie et al. [20] proposed the SBERT-BiLSTM model, combining pre-trained sentence embeddings with bidirectional LSTM for excellent essay scoring performance; Ma et al. [21] proposed the GFTB method, fusing syntactic and semantic information via a graph convolutional network and fine-tuned BERT; Uto et al. [22] developed a hybrid neural architecture integrating scoring regression and grade classification, enhancing generalization via multi-task learning. Related studies also confirmed that fusing shallow handcrafted features with deep semantic features can significantly improve scoring performance over BiLSTM and RNN baselines [23].

However, the end-to-end paradigm has inherent black-box characteristics: the feature extraction process is opaque, and scoring results can be mapped neither to explicit expert evaluation dimensions nor to a traceable scoring basis and targeted feedback, which is the core user trust issue of current AI automated scoring systems [1]. Meanwhile, fine-tuning pre-trained models still requires large-scale professional annotated data, and generalization is severely restricted in small-sample professional engineering scenarios.

2.2. Research on the Application of Large Language Models in Educational Assessment

The rapid development of large language models (LLMs) has profoundly impacted education, and domain adaptation via downstream fine-tuning has become the core technical path for LLM implementation [24]. Chen et al. [24] confirmed that fine-tuned LLMs can adapt to education scenarios, while facing challenges in data quality, computing cost and application boundaries. In educational assessment and automated scoring, LLM-related research has formed three mainstream technical directions.
The first category is zero-shot/few-shot direct scoring based on general LLMs, which realizes automated scoring via natural language prompts without large-scale annotated data or model fine-tuning, showing excellent adaptability in general text assessment scenarios [25]. For Spanish open-ended short-answer scoring, Capdehourat et al. [3] systematically tested the performance of different LLMs and prompting techniques, confirming that advanced LLMs can achieve high scoring accuracy, but their results are highly sensitive to prompt styles with inherent content bias.
The second category is domain-specific fine-tuned LLM assessment solutions, the current mainstream optimization direction for professional educational assessment, which improves models’ professional alignment and scoring stability via domain-annotated data fine-tuning [26]. Lee and Jia [26] built a complete LLM fine-tuning tutorial for educational assessment, confirming that fine-tuned LLMs can better align with assessment constructs and scoring standards; Sarzaeim et al. [27] further confirmed that domain-specific fine-tuning can inject professional knowledge into LLMs, alleviate hallucinations, and improve output reliability for professional fields including education and engineering.
The third category is assessment systems based on LLM optimization frameworks and multi-model integration, which optimize scoring accuracy and interpretability via Tree-of-Thought (ToT), multi-agent collaboration and multi-model fusion [28]. Ito and Ma [28] proposed the Ensemble ToT framework, coordinating multiple LLMs for scoring result integration via simulated debates; the Agent Laboratory framework realized full automation of the scientific research process, confirming the broad application prospects of LLM-Agent in education and scientific research [29]; relevant studies also confirmed that generative LLMs have great potential in assessment content generation, with the quality of generated medical examination questions exceeding that of human experts [30]. For engineering domain adaptation, Lu et al. [31] systematically explored the impact of various fine-tuning strategies on model performance, revealing the correlation between model scale, fine-tuning strategies and domain adaptation ability.
However, for the highly professional engineering laboratory report scoring task with strict fairness requirements and teaching feedback attributes, both general and domain-optimized LLMs still have insurmountable core bottlenecks:
High dependence on large-scale expert annotated data: Refined domain fine-tuning requires large-scale, high-quality, expert-annotated data with extremely high acquisition costs, which cannot solve the core pain point of data scarcity in professional domains [26,27,31].
Insufficient scoring consistency and rule controllability: Even after optimization, LLMs’ generative output has inherent randomness and high sensitivity to prompts [3], making it impossible to strictly follow preset scoring rules and stably replicate manual multi-dimensional evaluation logic.
Insufficient interpretability for formative evaluation: The end-to-end generative logic cannot solve the black-box problem of the scoring process, with no accurate mapping between scoring results and subdivided evaluation dimensions, nor a traceable scoring basis and targeted feedback, which is the core user trust issue [1].
No absolute performance advantage with high implementation costs: Even optimized GPT-4 has not surpassed traditional machine learning models in comprehensive ASAG performance [8]. Meanwhile, commercial LLMs have high calling costs and data compliance risks, while local deployment of open-source LLMs requires extremely high computing resources, making large-scale implementation in grassroots college teaching difficult [31].

2.3. Limitations of Existing Research and Research Entry Point of This Paper

2.3.1. Core Limitations of Existing Research

Although existing ATS research has formed a complete technical system and achieved large-scale application in general scenarios, Tan et al. [1] pointed out that STEM automated grading systems still face core challenges of insufficient interpretability, lack of user trust, and poor domain generalization. When directly applied to engineering laboratory report scoring, existing methods still have three unsolved core bottlenecks:
Scarcity of high-quality professional annotated data: Expert-annotated engineering laboratory report data has extremely high acquisition costs and limited scale, which seriously restricts the generalization ability of all data-driven models [26,31].
Severe lack of model interpretability: Most existing models, especially end-to-end deep learning frameworks and generative LLMs, have inherent black-box characteristics, which not only trigger user trust issues [1] but also cannot meet the requirements of formative evaluation in engineering education.
Over-simplification of evaluation dimensions: Most existing models lack a systematic deconstruction mechanism for the multi-dimensional scoring standards of engineering laboratory reports, and cannot stably replicate the multi-dimensional comprehensive evaluation logic of manual scoring.

2.3.2. Research Entry Point of This Paper

Based on the above limitations, combined with the empirical conclusion of Ferreira Mello et al. [8] on the performance advantages of traditional machine learning models in ASAG tasks, this study does not blindly follow the general LLM technical route. Aiming at the exclusive requirements of engineering laboratory report scoring, with the core goal of high accuracy, high interpretability, and low-cost implementation in small-sample scenarios, this study proposes an AdaBoostReg algorithm based on multi-level features and threshold control (MFTC-ABR), with core breakthroughs in three aspects:
A multi-dimensional feature engineering framework based on lightweight neural network scoring point extraction is proposed, which systematically deconstructs the multi-dimensional scoring standards of engineering laboratory reports and completes feature extraction without large-scale domain-annotated data.
A multi-level interpretable feature set highly aligned with domain-specific evaluation standards is constructed, covering four core dimensions: comprehension of experimental principles, completion of experimental procedures, depth of result analysis, and plagiarism detection, ensuring traceable and interpretable scoring results.
The traditional AdaBoostReg algorithm is improved, and a dynamic threshold control-based scoring algorithm (TC-ABR) is proposed, which optimizes the sample weight update strategy, avoids model overfitting to hard samples, and significantly improves model robustness and accuracy in small-sample professional scenarios.

3. Materials and Methods

This chapter elaborates the MFTC-ABR automated scoring model proposed in this study. First, it introduces the design principles and implementation details of the multi-dimensional feature engineering framework. Second, it expounds the core mechanisms of the TC-ABR algorithm, including multi-error fusion, dynamic threshold division, differentiated weight updating, and history-aware adjustment.
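As an illustration of the differentiated weight update strategy named above, the following sketch partitions per-sample errors into the three intervals described in the abstract (acceptable region, stable learning range, focus range) and applies history-aware smoothing. This is a minimal sketch, not the paper's exact formulation: the thresholds `t1` and `t2`, the shrink factor `beta`, and the smoothing coefficient `alpha` are illustrative assumptions.

```python
import numpy as np

def update_weights(weights, errors, t1=0.1, t2=0.5, beta=0.5,
                   history=None, alpha=0.7):
    """Three-interval, history-aware sample weight update (illustrative)."""
    errors = np.clip(errors, 0.0, 1.0)           # normalized per-sample loss
    if history is None:
        history = errors
    # History-aware error: exponential smoothing over previous rounds.
    smoothed = alpha * errors + (1 - alpha) * history

    new_w = weights.copy()
    acceptable = smoothed <= t1                  # prediction good enough
    stable = (smoothed > t1) & (smoothed <= t2)  # normal learning range
    focus = smoothed > t2                        # hard samples

    new_w[acceptable] *= beta                    # shrink attention
    new_w[stable] *= np.exp(smoothed[stable])    # boosting-style growth
    new_w[focus] *= np.exp(t2)                   # capped growth: avoid
                                                 # over-fitting hard samples
    new_w /= new_w.sum()                         # renormalize distribution
    return new_w, smoothed
```

The cap on the focus range is the key difference from vanilla AdaBoost: a persistently hard sample cannot monopolize the weight distribution.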

3.1. Overall Evaluation Framework for Laboratory Reports

This section presents an integrated feature engineering framework for the scoring of engineering laboratory reports. This framework is constructed based on three core principles: scoring-oriented deconstruction, semantic explicitness and interpretability, and the synergy between domain prior knowledge and data-driven learning.
1. Scoring-Oriented Deconstruction Principle
Through cognitive deconstruction of the expert manual scoring process, it is found that teachers’ scoring decisions are not based on holistic impressions, but on several quantifiable core evaluation dimensions. This framework aims to simulate this cognitive process by decomposing the total score into multiple semantically explicit and mutually orthogonal sub-competency dimensions.
2. Semantic Explicitness and Interpretability Principle
Each feature dimension in the framework (e.g., principle comprehension, depth of result analysis) has clear semantic connotations and can be directly mapped to teaching evaluation objectives. This ensures that the model inputs are no longer an unexplainable “black-box” vector, but traceable and feedback-enabled teaching indicators, laying a foundation for subsequent targeted guidance for students.
3. Synergy Principle of Domain Prior Knowledge and Data-Driven Learning
Under small-sample conditions, purely data-driven models are highly prone to overfitting. By injecting domain knowledge (e.g., the mandatory/optional structure of laboratory reports, common forms of plagiarism), this framework constructs features with strong inductive bias to guide the model to learn correct evaluation patterns from limited data, thereby improving the generalization ability of the model.
4. Construction and Validation of Evaluation Dimensions
First, core evaluation points involved in teachers’ scoring decisions were extracted through a systematic analysis of the syllabus and scoring rubrics for power electronics laboratory reports. Second, semi-structured interviews were conducted with three senior teachers with over ten years of teaching experience, and they stated that these four dimensions cover more than 90% of the scoring concerns. Finally, ablation experiments were performed to verify the contribution of each dimension to model performance: the removal of the “depth of result analysis” dimension caused the MAE to rise from 3.09 to 16.45, proving the indispensability of this dimension; while the “plagiarism detection” dimension showed low variance in the current dataset, its inclusion was designed based on practical teaching requirements, and its discriminative power can be activated through subsequent data augmentation. The above validation procedures collectively ensure the representativeness of the four-dimensional feature system for the core construct of “laboratory report quality”, as shown in Figure 1.

3.2. Construction of Multi-Dimensional Features

3.2.1. Comprehension of Experimental Principles and Completion of Experimental Procedures

Comprehension of experimental principles is a fundamental dimension for evaluating students’ mastery of theoretical knowledge. This section introduces two quantitative features of this dimension: core concept coverage (based on TextRank keyword extraction) and the ability to express personal insights (based on an MLP classifier). Meanwhile, this section elaborates on the quantification method for the completion degree of experimental procedures, including the recursive keyword matching and independent checking mechanism based on the experimental manual.
1. Core Concept Coverage
Keyword extraction is performed using the TextRank algorithm based on a pre-constructed domain-specific vocabulary and stop word list. The principle description section is quantified by measuring the keyword similarity between students’ reports and the experimental instruction manual.
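A minimal sketch of this step, assuming a plain co-occurrence-graph TextRank (the window size, damping factor, and iteration count below are illustrative) and measuring coverage as the fraction of manual keywords recovered among the report's top-ranked keywords. The paper's implementation additionally relies on a domain-specific vocabulary and stop word list, which are omitted here.

```python
def textrank_keywords(tokens, top_k=5, window=2, d=0.85, iters=30):
    """Rank words by PageRank over a word co-occurrence graph."""
    # Build an undirected co-occurrence graph within a sliding window.
    neighbors = {}
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            u, v = w, tokens[j]
            if u == v:
                continue
            neighbors.setdefault(u, set()).add(v)
            neighbors.setdefault(v, set()).add(u)
    # Power iteration of the TextRank score.
    score = {w: 1.0 for w in neighbors}
    for _ in range(iters):
        score = {w: (1 - d) + d * sum(score[u] / len(neighbors[u])
                                      for u in neighbors[w])
                 for w in neighbors}
    return [w for w, _ in sorted(score.items(), key=lambda kv: -kv[1])[:top_k]]

def concept_coverage(report_tokens, manual_keywords, top_k=5):
    """Fraction of the manual's keywords found among the report's top keywords."""
    found = set(textrank_keywords(report_tokens, top_k))
    return len(found & set(manual_keywords)) / len(manual_keywords)
```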
2. Personal Insight Expression
Original thinking and critical analysis reflect a higher level of comprehension of experimental principles. However, distinguishing personal insights from basic descriptive text is a nuanced and subjective task. For this reason, a Multi-Layer Perceptron (MLP) classifier is designed in this paper to learn the underlying basis of manual differentiation in this small-sample scenario. This MLP is selected over large pre-trained models such as BERT because BERT is prone to overfitting and incurs high computational costs in small-sample scenarios.
Specifically, the input of the classification model is a sentence-level text vector represented by a word-document matrix, as shown in Figure 2, where rows represent sentences and columns represent words.
Each matrix element $w$ is a weight computed via the Term Frequency-Inverse Document Frequency (TF-IDF) method, using Equations (1) and (2).
$$ TF_{word,\,Do_x} = \frac{count(word,\ Do_x)}{count(Do_x)} \tag{1} $$
$$ w = TF_{word,\,Do_x} \times \log\frac{N}{1 + DF_{word}} \tag{2} $$
Here, $N$ is the total number of documents, and $DF_{word}$ is the number of documents containing the word.
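Equations (1) and (2) transcribe directly into code; the only subtlety is that the $1 + DF_{word}$ smoothing can make the weight zero or negative for words appearing in (almost) every document, which effectively discards them.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Per-document TF-IDF weights per Equations (1) and (2).

    docs: list of tokenized documents (lists of words).
    """
    n = len(docs)
    df = Counter()                        # DF_word: documents containing word
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)                  # count(Do_x): document length
        weights.append({word: (c / total) * math.log(n / (1 + df[word]))
                        for word, c in counts.items()})
    return weights
```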
As shown in Figure 3, the architecture of the MLP classifier consists of an input layer that accepts TF-IDF vectors, two hidden layers with 512 and 128 units, respectively (both using the ReLU activation function), and a softmax output layer for classification. To mitigate overfitting, Dropout (with a rate of 0.3) is applied after each hidden layer. The model is trained using the Adam optimizer (learning rate = 0.001) with binary cross-entropy as the loss function, for a maximum of 100 epochs with an early stopping strategy implemented.
As shown in Table 1, the proposed MLP classifier significantly outperforms the BERT model in both computational efficiency and classification accuracy for this task. It achieves superior performance with only 0.26% of the parameter quantity of BERT, which verifies its applicability in small-sample scenarios. Accordingly, the proportion of sentences identified as containing personal insights is adopted as the second feature for evaluating the comprehension of experimental principles.
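A hedged sketch of such a classifier using scikit-learn, under the stated hyperparameters (two hidden layers of 512 and 128 ReLU units, Adam with learning rate 0.001, up to 100 epochs). Note two deviations: scikit-learn's `MLPClassifier` has no Dropout layer, so L2 regularization (`alpha`) stands in for the paper's Dropout(0.3), and early stopping is disabled only because the toy data below (invented for illustration) is too small to split off a validation fraction.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier

# Toy data (invented): label 1 = sentence expresses a personal insight,
# label 0 = purely descriptive text.
sentences = [
    "the output voltage was measured at 5 V",
    "the waveform was recorded with the oscilloscope",
    "the duty cycle was set to 0.5",
    "i believe the ripple could be reduced with a larger inductor",
    "in my view the discrepancy stems from probe loading",
    "i would argue the loss is mainly switching loss",
]
labels = [0, 0, 0, 1, 1, 1]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(sentences)       # sentence-level TF-IDF vectors

# Architecture per the text: hidden layers of 512 and 128 units, ReLU,
# Adam optimizer with lr = 0.001, at most 100 epochs.
clf = MLPClassifier(hidden_layer_sizes=(512, 128), activation="relu",
                    solver="adam", learning_rate_init=0.001,
                    alpha=1e-3, max_iter=100, random_state=0)
clf.fit(X, labels)

# Feature for the scoring model: proportion of sentences flagged as insight.
insight_ratio = clf.predict(X).mean()
```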
3. Completion of Experimental Procedures
Construction of Step Keyword Lexicon: First, an in-depth analysis of the experimental manual is conducted to define a set of keywords or key phrases that can characterize each experimental step (including mandatory and optional steps), such as “soldering”, “output voltage”, “DC motor”, and “inductor current”.
Experimental procedures have inherent logical sequential dependencies. For example, a power-on test can only be conducted after circuit board soldering is completed, and waveform observation must follow oscilloscope probe connection. To model these sequential dependencies, this study proposes a recursive matching mechanism. The core logic of this mechanism is: if the keywords of a specific experimental step are detected in the report, the model automatically infers that all preceding steps of this step have been completed, unless an explicit omission of the preceding steps is identified. This mechanism avoids misjudgments caused by elliptical expressions in students’ writing (e.g., omitting explicit descriptions of simple completed steps), thus improving the evaluation accuracy.
Independent matching mechanism: For optional steps, since they are usually parallel or alternative with no strict sequential dependencies, an independent matching mechanism is adopted. The completion status is determined solely by the presence of the corresponding keywords of the step itself.
Feature calculation: Three features are ultimately generated to quantify the completion degree of the experiment:
Completion degree of mandatory steps: the ratio of the number of matched keywords for mandatory steps to the total number of keywords for mandatory steps;
Completion degree of optional steps: the ratio of the number of matched keywords for optional steps to the total number of keywords for optional steps;
Number of charts: the number of charts (circuit diagrams, waveform graphs, data tables) in the report; this number reflects the completeness of experimental records.
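The two matching mechanisms above can be sketched as follows; the step keywords are hypothetical examples, and `omitted` stands for steps whose omission the report states explicitly.

```python
def mandatory_completion(report_text, ordered_steps, omitted=()):
    """Completion degree of mandatory steps with recursive inference.

    ordered_steps: keyword lists for mandatory steps, in procedure order.
    """
    # Direct keyword matching first.
    done = [any(k in report_text for k in kws) for kws in ordered_steps]
    # Recursive inference: a detected step implies that all preceding
    # steps were completed, unless an explicit omission was identified.
    last = max((i for i, d in enumerate(done) if d), default=-1)
    for i in range(last + 1):
        if i not in omitted:
            done[i] = True
    return sum(done) / len(done)

def optional_completion(report_text, optional_steps):
    """Independent matching: each optional step judged by its own keywords."""
    hits = sum(any(k in report_text for k in kws) for kws in optional_steps)
    return hits / len(optional_steps)
```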

3.2.2. Analysis of Experimental Results and Plagiarism Detection

Analysis of experimental results is a critical dimension that reflects students’ data interpretation ability and critical thinking competency. This section presents the three-level classification system (descriptive, basic analysis, and advanced analysis) and its cascade classifier architecture, and quantifies students’ performance in this dimension through three complementary aspects: coverage, depth, and exhaustiveness. Meanwhile, this section elaborates on the design of the plagiarism detection module, including the similarity calculation with the experimental manual and the cross-checking mechanism among student reports.
1. Hierarchical Classification of Analysis Levels
A three-level classification system is established to evaluate the quality of analysis:
Level 1 (descriptive): Only restates experimental phenomena without any interpretation.
Level 2 (basic analysis): Provides simple data interpretation and preliminary reasoning.
Level 3 (advanced analysis): Conducts in-depth analysis of the correlation between experimental results and theoretical principles, including critical discussion.
2. Cascade Classifier Architecture
To accurately classify the depth of experimental result analysis while avoiding overfitting in the small-sample scenario, this study constructs a cascade of MLP classifiers (Figure 4). Different from a single multi-class classifier that directly predicts analysis levels, this hierarchical architecture decomposes the complex classification task into three progressive binary sub-tasks: it first distinguishes analytical content from descriptive text, then divides the analytical content into mandatory and optional parts, and finally assigns the corresponding depth level to each analysis segment. This design reduces the complexity of each sub-task, effectively mitigates error propagation between classification levels, and improves the overall classification accuracy in the small-sample setting.
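The cascade logic can be sketched as three chained binary decisions. The stage predicates below are stand-in callables; in the paper, each stage is a trained MLP classifier operating on TF-IDF vectors.

```python
def classify_segment(seg, is_analysis, is_mandatory, is_advanced):
    """Three-stage cascade: descriptive vs. analytical, mandatory vs.
    optional, then depth level (2 or 3) for analytical segments."""
    # Stage 1: analytical content vs. descriptive text (Level 1).
    if not is_analysis(seg):
        return ("descriptive", 1)
    # Stage 2: mandatory vs. optional analysis content.
    part = "mandatory" if is_mandatory(seg) else "optional"
    # Stage 3: depth level of the analysis segment.
    level = 3 if is_advanced(seg) else 2
    return (part, level)
```

Because each stage only answers a binary question, each sub-task stays simple and misclassifications at one level do not multiply across many output classes, which is the error-propagation argument made above.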
The classification results of each stage of the cascade are shown in Table 2.
Three types of features are quantified based on the classifier output:
Coverage features: The number of covered analysis points, denoted $n_B$ for mandatory analysis and $n_O$ for optional analysis.
Depth features: The weighted sum of the identified analysis levels (Equations (3) and (4)), with higher weights $\omega = \{1, 2, 3\}$ assigned to higher-level analysis.
$$ D_B = \sum_{i=1}^{3} \omega_i \, n_{B_i} \tag{3} $$
$$ D_O = \sum_{j=1}^{3} \omega_j \, n_{O_j} \tag{4} $$
Exhaustiveness features: The number of sentences and words for each analysis level, which indicates the level of detail of the elaboration.
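The depth features of Equations (3) and (4) reduce to a weighted count; `level_counts` here is a hypothetical per-level tally produced by the cascade classifier.

```python
def depth_feature(level_counts, weights=(1, 2, 3)):
    """Weighted sum of analysis levels per Equations (3) and (4).

    level_counts[i]: number of segments classified at level i + 1;
    weights follow the paper's ω = {1, 2, 3}.
    """
    return sum(w * n for w, n in zip(weights, level_counts))
```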
3. Plagiarism Detection Algorithm
Academic integrity is a critical component of laboratory report evaluation. In this study, detection features are constructed for two types of plagiarism behaviors: plagiarism from the experimental manual and inter-student cross-plagiarism.
Plagiarism detection from the experimental manual: This method detects verbatim plagiarism based on the sentence-level character overlap ratio.
Sentence segmentation: The laboratory report $D$ is segmented into a sentence set $S_1 = \{s_1^1, s_1^2, \ldots, s_1^n\}$, and the relevant content of the experimental manual into the set $S_2 = \{s_2^1, s_2^2, \ldots, s_2^m\}$.
Overlap ratio calculation: For each sentence pair $(s_1^i, s_2^j)$, the proportion of the overlapping characters in each respective sentence is calculated as follows:
$$C = \{\, c \mid c \in s_1^i \ \text{and}\ c \in s_2^j \,\}$$
$$h_1^i = \sum_{c \in C} \mathrm{count}\!\left(c, s_1^i\right), \qquad h_2^j = \sum_{c \in C} \mathrm{count}\!\left(c, s_2^j\right)$$
$$r_1^i = \frac{h_1^i}{\mathrm{len}\!\left(s_1^i\right)}, \qquad r_2^j = \frac{h_2^j}{\mathrm{len}\!\left(s_2^j\right)}$$
Plagiarism judgment: Thresholds $\theta_1$ and $\theta_2$ are set (determined through empirical analysis of the training data). A sentence pair is judged as plagiarized if and only if $r_1^i > \theta_1$ and $r_2^j > \theta_2$ hold simultaneously.
Feature generation: The feature value is defined as the ratio of the number of sentences identified as plagiarized in the report to the total number of sentences in the report.
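A direct transcription of the manual-plagiarism procedure above, reading $C$ as the set of characters shared by both sentences (one plausible reading of the definition); the threshold values 0.8 are illustrative placeholders for the empirically tuned $\theta_1$, $\theta_2$:

```python
def overlap_ratios(s1, s2):
    """Character-overlap ratios (r1, r2) for one sentence pair."""
    common = set(s1) & set(s2)                   # C: characters shared by both
    h1 = sum(s1.count(c) for c in common)        # overlapping chars in s1
    h2 = sum(s2.count(c) for c in common)
    return h1 / len(s1), h2 / len(s2)

def manual_plagiarism_feature(report_sents, manual_sents, theta1=0.8, theta2=0.8):
    """Fraction of report sentences judged plagiarized from the manual.
    theta1/theta2 are illustrative; the paper tunes them on training data."""
    flagged = 0
    for s1 in report_sents:
        for s2 in manual_sents:
            r1, r2 = overlap_ratios(s1, s2)
            if r1 > theta1 and r2 > theta2:      # both ratios must exceed thresholds
                flagged += 1
                break                            # count each report sentence once
    return flagged / len(report_sents)

feat = manual_plagiarism_feature(
    ["the buck circuit steps down voltage", "my own novel analysis here"],
    ["the buck circuit steps down voltage"])
```

On this toy input only the verbatim sentence is flagged: the second sentence happens to share many characters with the manual sentence ($r_1 \approx 0.88$), but the reverse ratio is low ($r_2 = 0.6$), which shows why both thresholds must hold simultaneously.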
Similarity detection among student reports.
This method is designed to detect the similarity between student reports and identify potential inter-student cross-plagiarism.
Pairwise similarity calculation: For a report set $D = \{d_1, d_2, \ldots, d_n\}$ containing $n$ reports, the aforementioned method is used to calculate the proportion of similar sentences between any two reports $d_i$ and $d_j$.
Similarity matrix construction: A symmetric matrix $S \in \mathbb{R}^{n \times n}$ is constructed, whose element $s_{ij} = \left( \frac{d_{ij}}{\mathrm{len}(d_i)}, \frac{d_{ij}}{\mathrm{len}(d_j)} \right)$, with $d_{ij}$ the number of identical sentences between $d_i$ and $d_j$, represents the proportion of identical sentences in each respective report.
Feature quantification: The average value of the elements in each row of the matrix is calculated.
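The cross-plagiarism feature can be sketched likewise. Since each matrix element stores a pair of proportions, this sketch reduces it to the larger of the two, which is one reasonable scalar choice but not necessarily the paper's; reports are represented as lists of sentence strings.

```python
import numpy as np

def identical_sentence_count(r1, r2):
    """d_ij: number of sentences shared verbatim between two reports."""
    return len(set(r1) & set(r2))

def cross_plagiarism_features(reports):
    """Per-report feature: mean pairwise similarity (row average of S).
    s_ij is taken here as the larger of the two per-report proportions."""
    n = len(reports)
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = identical_sentence_count(reports[i], reports[j])
            S[i, j] = S[j, i] = max(d / len(reports[i]), d / len(reports[j]))
    return S.sum(axis=1) / (n - 1)      # average over the other n-1 reports

reports = [["a", "b", "c"], ["a", "b", "d"], ["x", "y", "z"]]
feats = cross_plagiarism_features(reports)
```

The first two toy reports share two of three sentences, so their feature values are elevated while the third report's value stays at zero.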
Plagiarism detection from external online sources has not been implemented in the current version, but the corresponding framework has been designed: in future work, the system can be connected to academic search engine APIs or cross-platform text matching services to identify plagiarism through hash comparison or semantic similarity calculation between report paragraphs and online resources. The plagiarism detection features in this study currently target explicit text reuse as one of the scoring dimensions and will be extended to external-source detection in subsequent research.

3.2.3. Scalability Design of the Feature Set

The multi-dimensional feature set is designed as an extensible module. The feature extractors corresponding to each dimension (e.g., MLP classifier, keyword matcher) are independent of one another. When adding a new dimension, only the extractor for the corresponding dimension needs to be trained, with no need to retrain the existing modules. For example, if the “academic writing standardization” dimension needs to be added, an independent MLP classifier can be trained based on a small amount of annotated data, and the output features can be concatenated into the existing feature vector. The original TC-ABR model can directly use the extended feature set for training or fine-tuning. This modular design ensures the flexibility and reusability of the framework, enabling it to adapt to the differentiated requirements of laboratory report evaluation standards across different disciplines and courses.

3.3. Construction of the Scoring Model

This section elaborates on the framework of the proposed multi-level feature and threshold control AdaBoostReg scoring model. Based on the multi-level features constructed in Section 3.2, this part introduces the feature selection method and the AdaBoostReg algorithm with dynamic threshold control.

3.3.1. Overall Architecture and Design Philosophy

The core of the MFTC-ABR model is an automated scoring system framework that achieves a balance between interpretability and prediction accuracy. The model operation consists of three stages:
Multi-level feature extraction (see Section 3.2 for details).
Feature selection based on recursive feature elimination (RFE).
AdaBoostReg algorithm with dynamic threshold control.

3.3.2. TC-AdaBoostReg Scoring Algorithm

This subsection elaborates on the core mechanisms of the TC-ABR algorithm. It first presents the motivation and theoretical basis for the algorithmic innovations, followed by the overall framework of the algorithm, and concludes with a detailed elaboration of the specific technical details of the algorithm.
First, AdaBoostReg adapts the standard AdaBoost algorithm to regression tasks by reconstructing the loss function and weight update strategy, but it retains inherent limitations in the automated scoring scenario. When a decision tree is used as the weak learner, the algorithm predicts scores by minimizing the MSE, much as in classification, yet it has two key shortcomings. First, it applies a unified weight update rule to all sample errors, failing to distinguish between acceptable scoring errors and prediction deviations that require critical attention. Second, it is equally sensitive to all deviations, which easily leads to overfitting in small-sample scenarios; the model may then learn rater-specific biases rather than universal evaluation principles.
To facilitate a more intuitive understanding of the algorithm, a straightforward interpretation is presented here. The core rationale of the TC-ABR algorithm lies in the differentiated handling of diverse prediction errors. The traditional AdaBoostReg algorithm adopts a unified weight update rule for all samples, which may cause the model to pay excessive attention to inherently hard-to-learn samples or ignore errors within the acceptable region. To address this limitation, TC-ABR introduces three key innovations:
  • Multi-Perspective Error Evaluation
Instead of relying solely on absolute error, the algorithm evaluates each sample from three complementary perspectives: absolute deviation, trend consistency (the variation pattern of predicted values with the number of iterations), and distribution position (the relative ranking of the sample in the overall score distribution).
2. Dynamic Threshold Division
Through kernel density estimation, the algorithm automatically identifies the natural demarcation points in the error distribution and divides the samples into three regions: the Acceptable Region (errors small enough to tolerate), the Stable Learning Region (errors warranting moderate attention), and the Focused Attention Region (errors requiring priority, critical handling).
3. Differentiated Weight Updating
Samples in different regions are processed with targeted weight adjustment strategies. Meanwhile, a history-aware mechanism is introduced to prevent the model from overfitting to samples with persistent prediction errors across iterations.
This design enables the model to distinguish between “acceptable scoring errors” and “deviations requiring critical attention”, thus achieving more robust learning in small-sample scenarios.
The overall framework of the proposed algorithm is illustrated in Figure 5.
The specific implementation details of the proposed algorithm are presented as follows.
  • Statistical Adaptive Fusion of Error Metrics
Traditional AdaBoostReg relies solely on absolute error to update sample weights, which cannot fully capture the systematic deviation, prediction stability, and distribution mismatch of samples in the scoring task. To address this limitation, this study designs three complementary error components to evaluate prediction errors from multiple perspectives, as detailed below.
(1)
Standardized absolute error: This component is designed to quantify the deviation between the predicted value and the true manual score, while improving the model’s robustness to outlier samples in the unevenly distributed laboratory report score dataset. It is standardized using the Median Absolute Deviation (MAD), a robust statistical metric that is insensitive to extreme outliers.
$$e_{k,i}^{\mathrm{abs}} = \frac{\left| y_i - G_k(x_i) \right|}{\mathrm{MAD}_k}$$
where $\mathrm{MAD}_k$ is the median absolute deviation of the residuals in the $k$-th iteration, calculated as follows:
$$\mathrm{MAD}_k = \mathrm{median}\left( \left| \left( y_i - G_k(x_i) \right) - \mathrm{median}\left( y - G_k(x) \right) \right| \right)$$
The median and MAD are used instead of the mean and standard deviation to improve robustness to outliers, and the standardization ensures the comparability of errors across different iterations.
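A minimal sketch of the MAD-standardized error on toy scores; note how the outlier residual inflates neither the location nor the scale estimate:

```python
import numpy as np

def mad_standardized_error(y_true, y_pred):
    """e_abs = |y - G(x)| / MAD, with MAD the median absolute deviation of the
    residuals (robust alternative to the standard deviation)."""
    resid = y_true - y_pred
    mad = np.median(np.abs(resid - np.median(resid)))
    return np.abs(resid) / mad

y_true = np.array([80., 85., 90., 70., 60.])
y_pred = np.array([78., 86., 88., 75., 90.])   # last sample is an outlier
e_abs = mad_standardized_error(y_true, y_pred)
```

Here MAD = 3 regardless of the outlier, so the outlier receives a large standardized error without distorting everyone else's scale.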
(2)
Trend consistency error: The trend consistency error is designed to measure the stability of the model’s prediction for a single sample across iterations, avoiding over-attention to samples with volatile prediction results. It is calculated as the difference between the current prediction and the predictions of the past L iterations, divided by the standard deviation of the predictions in each iteration to account for prediction volatility.
$$e_{k,i}^{\mathrm{trend}} = \frac{1}{L} \sum_{t=k-L}^{k-1} \frac{\left| G_t(x_i) - G_k(x_i) \right|}{\sigma_t^{\mathrm{pred}}}$$
where $L$ is the window size, and $\sigma_t^{\mathrm{pred}}$ is the standard deviation of the predicted values in the $t$-th iteration. Comparing the current prediction with the historical predictions measures the stability of the model's prediction for this sample, and normalizing by the standard deviation eliminates the influence of the scale of the predicted values.
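The trend-consistency error can be sketched per sample, assuming the prediction history and the per-iteration standard deviations are stored:

```python
import numpy as np

def trend_error(history, sigma_pred, L=3):
    """Trend-consistency error for one sample.
    history[t]  = G_t(x_i): this sample's prediction at iteration t (last = current).
    sigma_pred[t] = std of all predictions at iteration t (volatility normalizer)."""
    current = history[-1]
    past = np.array(history[-1 - L:-1])            # the L previous iterations
    sig = np.array(sigma_pred[-1 - L:-1])
    return np.mean(np.abs(past - current) / sig)

# Toy trajectories over 4 iterations for a stable vs. an oscillating sample:
stable = trend_error([80, 81, 80, 80], [5, 5, 5, 5])
volatile = trend_error([60, 90, 55, 85], [5, 5, 5, 5])
```

The oscillating sample accrues a far larger trend error, which is exactly the signal used to avoid over-attending to samples with volatile predictions.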
(3)
Distribution position error: This component is designed to capture systematic deviations of predicted values in the overall score distribution, ensuring the rank consistency and fairness of the final scoring results, which is a core requirement for educational assessment scenarios. The Empirical Distribution Function (EDF) is used to map the true values and predicted values to the interval [0, 1], and then the distance between them in this interval is calculated.
$$e_{k,i}^{\mathrm{quant}} = \left| F_k(y_i) - F_k\!\left( G_k(x_i) \right) \right|$$
where $F_k(\cdot)$ is the empirical distribution function, $F_k(z) = \frac{1}{n} \sum_{j=1}^{n} \mathbb{1}\!\left[ z_j \le z \right]$. By converting absolute errors into relative position differences in the distribution, systematic deviations of the predicted values in the overall distribution are identified.
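A sketch of the distribution-position error. The sample set on which $F_k$ is estimated is not fully specified above, so this sketch builds it from the true scores (an assumption):

```python
import numpy as np

def edf(sample):
    """Empirical distribution function F(z) = (1/n) * #{z_j <= z}, vectorized."""
    srt = np.sort(sample)
    return lambda z: np.searchsorted(srt, z, side="right") / len(srt)

def quantile_error(y_true, y_pred):
    """e_quant = |F(y_i) - F(G(x_i))|: rank displacement in the distribution."""
    F = edf(y_true)                # reference distribution: true scores (assumed)
    return np.abs(F(y_true) - F(y_pred))

y_true = np.array([60., 70., 80., 90.])
y_pred = np.array([62., 71., 95., 79.])   # last two samples swap ranks
e_q = quantile_error(y_true, y_pred)
```

The first two predictions land in the same quantile as their targets (zero position error), while the rank-swapped pair is penalized even though its absolute errors are moderate.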
The above three error components describe the model’s prediction performance from three orthogonal perspectives: numerical deviation, iterative stability, and distribution consistency, which can fully characterize the error characteristics of samples in the laboratory report scoring task. To integrate these components reasonably, a data-driven adaptive weight allocation mechanism is designed to assign dynamic weights to each error component according to its discriminatory power and stability in the current training stage.
2. Adaptive Weight Allocation Mechanism
Different error components have distinct discriminatory power in different training stages. For this reason, a data-driven importance evaluation mechanism is adopted to assign weights to each error component:
(1)
Calculation of discriminatory power index: Cohen’s d effect size is used as the discriminatory power indicator, which is a standardized measure of the mean difference. For each error component j , its ability to distinguish between the high-score group and the low-score group is calculated as the mean difference divided by the pooled standard deviation.
$$D_j = \frac{\left| \mu_j^{\mathrm{high}} - \mu_j^{\mathrm{low}} \right|}{\sigma_j^{\mathrm{pooled}}}$$
where $\mu_j^{\mathrm{high}}$ and $\mu_j^{\mathrm{low}}$ are the mean values of the $j$-th error component for the samples in the top 30% and bottom 30% of the manual scores, respectively, and $\sigma_j^{\mathrm{pooled}}$ is the pooled standard deviation:
$$\sigma_j^{\mathrm{pooled}} = \sqrt{ \frac{(n_h - 1)\,\sigma_h^2 + (n_l - 1)\,\sigma_l^2}{n_h + n_l - 2} }$$
where $\sigma_h$ and $\sigma_l$ represent the standard deviations of the high-score and low-score samples, respectively. A larger $D_j$ value indicates a stronger ability of the error component to distinguish between the high-score and low-score groups. According to Cohen's empirical rule, $n_h = n_l = 0.3N$ is adopted.
In addition to discriminatory power, the stability of error components is also considered to avoid selecting components with excessive variance or unstable distribution.
(2)
Calculation of stability index: The stability indicator of the $j$-th error component $e_j^k$ is defined as follows:
$$s_j^k = 1 - \frac{Q_3\!\left(e_j^k\right) - Q_1\!\left(e_j^k\right)}{\max\!\left(e_j^k\right) - \min\!\left(e_j^k\right)}$$
where $Q_1$ and $Q_3$ denote the first and third quartiles, respectively. The stability indicator $s_j^k \in [0, 1]$, and a larger value indicates smaller variation and a more stable distribution of the error component. This indicator evaluates how concentrated the component's values are, and components with high stability are more reliable.
(3)
Adaptive weight allocation: Based on the discriminatory power $d_j^k$ and stability $s_j^k$, the adaptive weight $\omega_j^k$ of the $j$-th error component is calculated via the softmax function:
$$\omega_j^k = \frac{\exp\!\left( \alpha d_j^k + \beta s_j^k \right)}{\sum_{l=1}^{3} \exp\!\left( \alpha d_l^k + \beta s_l^k \right)}$$
where α = 0.7 and β = 0.3, which control the relative importance of discriminatory power and stability, respectively. This parameter setting emphasizes the dominant role of discriminatory power while taking stability into appropriate consideration.
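The discriminatory-power and stability indices and their softmax fusion can be sketched together; the synthetic error components below are illustrative:

```python
import numpy as np

def cohens_d(err_component, scores, frac=0.3):
    """Cohen's d between the error values of the top-30% and bottom-30%
    scored samples, with the pooled standard deviation in the denominator."""
    order = np.argsort(scores)
    n = max(2, int(frac * len(scores)))
    low, high = err_component[order[:n]], err_component[order[-n:]]
    pooled = np.sqrt(((n - 1) * high.var(ddof=1) + (n - 1) * low.var(ddof=1))
                     / (2 * n - 2))
    return abs(high.mean() - low.mean()) / pooled

def stability(err_component):
    """s = 1 - IQR/range: closer to 1 means a more concentrated component."""
    q1, q3 = np.percentile(err_component, [25, 75])
    return 1 - (q3 - q1) / (err_component.max() - err_component.min())

def adaptive_weights(components, scores, alpha=0.7, beta=0.3):
    """Softmax over alpha*d + beta*s for the three error components."""
    logits = np.array([alpha * cohens_d(c, scores) + beta * stability(c)
                       for c in components])
    z = np.exp(logits - logits.max())           # shift for numerical stability
    return z / z.sum()

rng = np.random.default_rng(1)
scores = rng.uniform(40, 100, size=50)
comps = [np.abs(scores - 70) / 30 + rng.normal(0, 0.05, 50),  # score-linked
         rng.uniform(0, 1, 50),                               # uninformative
         rng.uniform(0, 1, 50)]
w = adaptive_weights(comps, scores)
```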
(4)
Calculation of comprehensive error: The three error components are weighted and fused via the adaptive weights to obtain the comprehensive error $\tilde{e}_i^k$ of sample $i$ in the $k$-th iteration:
$$\tilde{e}_i^k = \sum_{j=1}^{3} \omega_j^k \, \psi\!\left( e_{j,i}^k \right)$$
where $\psi(\cdot)$ denotes the 5% Winsorization of the original error values, which limits the influence of extreme values on the comprehensive error and enhances the robustness of the method:
$$\psi(z) = \begin{cases} Q_{0.05}(z), & z < Q_{0.05}(z) \\ z, & Q_{0.05}(z) \le z \le Q_{0.95}(z) \\ Q_{0.95}(z), & z > Q_{0.95}(z) \end{cases}$$
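The Winsorized fusion into the comprehensive error, sketched on synthetic components with one planted outlier:

```python
import numpy as np

def winsorize_5pct(z):
    """psi(.): clip values below the 5th / above the 95th percentile."""
    lo, hi = np.percentile(z, [5, 95])
    return np.clip(z, lo, hi)

def comprehensive_error(components, weights):
    """e_tilde_i = sum_j w_j * psi(e_{j,i}): winsorize each component, then fuse."""
    return sum(w * winsorize_5pct(c) for w, c in zip(weights, components))

rng = np.random.default_rng(2)
comps = [rng.normal(1, 0.2, 100) for _ in range(3)]
comps[0][0] = 50.0                         # an extreme outlier in component 1
weights = np.array([0.5, 0.3, 0.2])        # illustrative adaptive weights
e = comprehensive_error(comps, weights)
```

Without Winsorization the first sample's comprehensive error would be near 25; with it, the outlier is clipped to the component's 95th percentile before fusion.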
3. Determination of an Adaptive Threshold Based on Distribution Analysis
It is generally assumed that the comprehensive error follows a mixture distribution, that is, most samples produce “normal errors” and a small number produce “abnormal errors”. To identify the natural boundary between these two types of errors, a kernel density function is first constructed:
$$\hat{f}_k(e) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left( \frac{e - \tilde{e}_i^k}{h} \right)$$
where $K(\cdot)$ is the standard Gaussian kernel function $K(u) = (2\pi)^{-1/2} \exp\!\left( -\frac{u^2}{2} \right)$, and the bandwidth $h$ is determined using Silverman's rule:
$$h = 1.06\,\hat{\sigma}\,n^{-1/5}$$
Subsequently, density valley detection is performed: the first-order numerical derivative of the density function is computed to locate the local minima, i.e., the points where the derivative changes sign from negative to positive:
$$V_k = \left\{ e \in \mathbb{R} : \frac{d\hat{f}_k}{de}\!\left(e^-\right) < 0 \ \text{and} \ \frac{d\hat{f}_k}{de}\!\left(e^+\right) > 0 \right\}$$
If multiple local minima exist, the smallest one is selected as the core threshold $\theta_k^{\mathrm{core}}$. By locating the valley of the density function, the high-density region (normal errors) and the low-density region (abnormal errors) are naturally distinguished.
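A sketch of the KDE-based threshold: Gaussian kernel, Silverman bandwidth, and a grid-based valley search. The grid resolution (512 points) and the median fallback when no valley exists are our choices, not the paper's.

```python
import numpy as np

def kde_core_threshold(errors):
    """Density valley separating 'normal' from 'abnormal' errors.
    Gaussian KDE with Silverman bandwidth h = 1.06 * sigma * n^(-1/5); the
    smallest local minimum of the density on a grid is taken as theta_core."""
    n = len(errors)
    h = 1.06 * errors.std() * n ** (-0.2)
    grid = np.linspace(errors.min(), errors.max(), 512)
    # f_hat(e) = (1/(n*h)) * sum_i K((e - e_i)/h), K = standard Gaussian kernel
    u = (grid[:, None] - errors[None, :]) / h
    f = np.exp(-0.5 * u ** 2).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))
    d = np.diff(f)
    valleys = grid[1:-1][(d[:-1] < 0) & (d[1:] > 0)]    # sign change: - to +
    return valleys.min() if len(valleys) else np.median(errors)

rng = np.random.default_rng(3)
# Mixture: many normal errors around 1, a few abnormal errors around 6.
errors = np.concatenate([rng.normal(1, 0.3, 180), rng.normal(6, 0.5, 20)])
theta = kde_core_threshold(errors)
```

On this bimodal toy mixture the detected valley falls between the two clusters, which is the natural boundary the algorithm uses as $\theta_k^{\mathrm{core}}$.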
4. Partitioned Adaptive Sample Weight Update Mechanism
The traditional AdaBoostReg algorithm updates sample weights through the exponential decay factor β t , which reduces the weights of samples with smaller errors and relatively increases the weights of samples with larger errors. However, this update strategy has two potential problems. First, the error of experimental report scoring can be divided into acceptable error and unacceptable error, and the simple exponential update is difficult to apply differentiated attention to samples in different error intervals. Second, samples with extreme errors (which may be noise) will obtain excessively high weights, leading to model overfitting. Aiming at these problems, a three-region adaptive weight update mechanism based on threshold error is proposed, which is described in detail as follows.
An adjustable acceptability threshold $\delta$ is introduced to separate samples with acceptable errors from those with unacceptable errors, and $\theta_k^{\mathrm{core}}$ separates samples with normal errors from those requiring focused attention, with $\delta \le \theta_k^{\mathrm{core}}$. Then, in each iteration $k$, the comprehensive errors $\tilde{e}_i^k$ calculated above are sorted in ascending order, and the samples are divided into three regions based on $\delta$ and $\theta_k^{\mathrm{core}}$:
Acceptable region $R_A^k$: Contains samples with $\tilde{e}_i^k \le \delta$. These samples have been well mastered by the current model, and their prediction errors are within tolerance.
Stable learning region $R_B^k$: Contains samples with $\delta < \tilde{e}_i^k \le \theta_k^{\mathrm{core}}$. These samples still leave room for further learning.
Focused attention region $R_C^k$: Contains samples with $\tilde{e}_i^k > \theta_k^{\mathrm{core}}$. These samples have large prediction errors, may include hard samples or noisy data, and require focused attention.
Weight update mechanism: For each sample $i$, the weight adjustment factor is calculated according to the region it belongs to.
Region A (acceptable region):
$$\Delta w_i^k = \exp\!\left( -\gamma_A \left( 1 - \frac{\tilde{e}_i^k}{\delta} \right) \right), \quad i \in R_A^k$$
Region B (stable learning region):
$$\Delta w_i^k = \exp\!\left( \gamma_B \, \frac{\tilde{e}_i^k - \delta}{\theta_k^{\mathrm{core}} - \delta} \right), \quad i \in R_B^k$$
Region C (focused attention region):
$$\Delta w_i^k = \exp\!\left( \gamma_C \, \frac{\tilde{e}_i^k - \theta_k^{\mathrm{core}}}{\max_j \tilde{e}_j^k - \theta_k^{\mathrm{core}}} \right), \quad i \in R_C^k$$
where $\gamma_A$, $\gamma_B$, and $\gamma_C$ are region adjustment coefficients satisfying $\gamma_C > \gamma_B > 1 > \gamma_A > 0$.
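The three region factors can be sketched directly from the formulas. The coefficient values below satisfy $\gamma_C > \gamma_B > 1 > \gamma_A > 0$ but are otherwise illustrative, as are $\delta$, $\theta$, and the maximum error:

```python
import numpy as np

def region_factor(e, delta, theta, e_max, gA=0.5, gB=1.5, gC=2.5):
    """Delta_w for one sample with comprehensive error e.
    Region A (e <= delta): shrink the weight; B (delta < e <= theta): grow it
    moderately; C (e > theta): grow it sharply."""
    if e <= delta:
        return np.exp(-gA * (1 - e / delta))                   # < 1
    if e <= theta:
        return np.exp(gB * (e - delta) / (theta - delta))      # in (1, e^gB]
    return np.exp(gC * (e - theta) / (e_max - theta))          # up to e^gC

delta, theta, e_max = 1.0, 3.0, 8.0
factors = [region_factor(e, delta, theta, e_max) for e in (0.2, 2.0, 6.0)]
```

The three probe errors fall in regions A, B, and C respectively, giving a factor below 1 for the mastered sample and progressively larger factors for harder ones.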
Considering the long-term performance of samples, a cumulative attention mechanism is introduced to realize history-aware adjustment of samples:
Define the cumulative attention $A_i^k$ of sample $i$, with the update rule
$$A_i^k = \lambda A_i^{k-1} + (1 - \lambda)\, \mathbb{1}\!\left[ \tilde{e}_i^k > \delta \right]$$
where $\lambda \in [0, 1]$ is the forgetting factor, and $\mathbb{1}(\cdot)$ is the indicator function. $A_i^k$ records the historical attention level of a sample (i.e., the number of times its error has exceeded $\delta$, attenuated by the forgetting factor $\lambda$). This design avoids the overfitting caused by continuously attending to certain samples (which may be noise or outliers).
5. History-Aware Adjustment
For samples whose weights are to be increased ($\Delta w_i^k > 1$), the adjustment prevents excessive attention to certain fixed samples (which may be noise or outliers):
$$\Delta \tilde{\omega}_i^k = 1 + \frac{\Delta w_i^k - 1}{1 + \eta A_i^k}$$
For samples whose weights are to be decreased ($\Delta w_i^k < 1$ and $A_i^k > 0$), the adjustment attenuates the weight reduction for previously hard samples, preventing the model from forgetting the hard patterns it has learned:
$$\Delta \tilde{\omega}_i^k = 1 - \frac{1 - \Delta w_i^k}{\exp\!\left( \eta A_i^k \right)}$$
where $\eta > 0$ is the attenuation coefficient.
The final sample weight update is
$$w_i^{k+1} = \frac{w_i^k \, \Delta \tilde{\omega}_i^k}{\sum_{j=1}^{n} w_j^k \, \Delta \tilde{\omega}_j^k}$$
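Cumulative attention and the history-aware adjustment, sketched for a single sample that has repeatedly exceeded $\delta$ (the hyperparameter values $\lambda = 0.9$ and $\eta = 1$ are illustrative):

```python
import numpy as np

def update_attention(A_prev, e, delta, lam=0.9):
    """A_i^k = lam * A_i^{k-1} + (1 - lam) * 1[e > delta]."""
    return lam * A_prev + (1 - lam) * float(e > delta)

def history_adjust(dw, A, eta=1.0):
    """Damp weight increases for chronically hard samples; soften weight
    decreases for samples that were hard before, so they are not forgotten."""
    if dw > 1:
        return 1 + (dw - 1) / (1 + eta * A)        # damped increase
    if dw < 1 and A > 0:
        return 1 - (1 - dw) * np.exp(-eta * A)     # attenuated decrease
    return dw

# A sample that has been hard for 10 consecutive iterations:
A = 0.0
for _ in range(10):
    A = update_attention(A, e=5.0, delta=1.0)

raw = 3.0                                          # raw region-C boost
adjusted = history_adjust(raw, A)                  # boost is damped
shrunk = history_adjust(0.6, A)                    # reduction is softened

w = np.array([0.2, 0.5, 0.3]) * np.array([adjusted, 0.8, 1.1])
w /= w.sum()                                       # renormalize to a distribution
```

After 10 flagged iterations the attention reaches $A = 1 - 0.9^{10} \approx 0.65$, so the raw 3x boost is damped to roughly 2.2x while a 0.6x cut is softened toward 1.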
6. Weak Learner Weight Calculation
Define the weighted average error of each region as follows:
$$\epsilon_r^k = \frac{\sum_{i \in R_r^k} w_i^k \, \tilde{e}_i^k}{\sum_{i \in R_r^k} w_i^k}, \qquad r \in \{A, B, C\}$$
The comprehensive error of the weak learner $h_k$ is formulated as follows:
$$\epsilon_k = \omega_A \epsilon_A^k + \omega_B \epsilon_B^k + \omega_C \epsilon_C^k$$
where $\omega_A$, $\omega_B$, and $\omega_C$ are region weight coefficients satisfying $\omega_A, \omega_B, \omega_C > 0$, $\omega_A + \omega_B + \omega_C = 1$, and the inequality constraint $\omega_C > \omega_B > \omega_A$.
The weight of the weak learner is calculated as follows:
$$\alpha_k = \frac{1}{2} \ln \frac{1 - \epsilon_k}{\epsilon_k}$$
For numerical stability, $\epsilon_k$ is clipped with the following rule:
$$\epsilon_k \leftarrow \max\!\left( \min\!\left( \epsilon_k,\; 1 - 10^{-8} \right),\; 10^{-8} \right)$$
The ensemble prediction output of the proposed TC-AdaBoostReg is given by
$$H(x_i) = \sum_{k=1}^{T} \tilde{\alpha}_k \, h_k(x_i)$$
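The weak-learner weight with region-weighted error and stability clipping can be sketched directly; the region weights below respect $\omega_C > \omega_B > \omega_A$ and sum to 1 but are illustrative:

```python
import numpy as np

def learner_weight(eps_A, eps_B, eps_C, wA=0.2, wB=0.3, wC=0.5):
    """alpha_k = 0.5 * ln((1 - eps)/eps), where eps is the region-weighted
    comprehensive error, clipped to [1e-8, 1 - 1e-8] for numerical stability."""
    eps = wA * eps_A + wB * eps_B + wC * eps_C
    eps = np.clip(eps, 1e-8, 1 - 1e-8)
    return 0.5 * np.log((1 - eps) / eps)

alpha_good = learner_weight(0.05, 0.10, 0.20)   # accurate learner -> large alpha
alpha_bad = learner_weight(0.40, 0.50, 0.60)    # weak learner -> small alpha
```

Because the focused-attention region carries the largest coefficient, a learner that fails mainly on region-C samples is penalized harder than one that fails on already-acceptable samples.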

4. Results

This chapter validates the effectiveness of the proposed MFTC-ABR model via a series of systematic experiments. First, the performance of the proposed model is compared with that of classical machine learning, ensemble learning, and deep learning methods on the power electronics laboratory report dataset. Subsequently, the generalization ability of the TC-ABR algorithm as a general-purpose regressor is verified on two public benchmark datasets, namely the Boston Housing dataset and the Diabetes dataset. Finally, the contributions of the feature engineering framework and the core algorithm module are verified, respectively, through ablation experiments and parameter sensitivity analysis.

4.1. Data Sources

This section presents the experimental data used in this study, including the source, sample size, scoring rubrics, and basic structure of the laboratory reports from the Power Electronics Technology course.
The experimental data are derived from 155 undergraduate laboratory reports of the Buck Chopper Circuit experiment in the Power Electronics Technology course, covering three cohorts (Grade 2017 to Grade 2019). Incomplete submissions and those with critical defects were excluded during screening. Each report was independently scored by domain experts in accordance with explicit scoring rubrics (full score: 100 points). A report typically consists of three sections: Experimental Principle and Circuit Design, Experimental Procedures and Result Analysis, and Personal Insights and Reflections. Table 3 presents examples of report fragments and their corresponding feature categories. For example, the content in the second row of Table 3 is extracted from the report numbered 23 of the 2018 cohort, with the corresponding text of the experimental result analysis: “The trailing edge phenomenon observed at U2-2 is attributed to the turn-off behavior of the internal power diode during potential drop. This action causes reverse current to flow through resistors R11 and R10.” Based on our proposed feature extraction framework, this segment of experimental result analysis is classified as Level 2 (basic analysis).
To address the small-sample challenge while preserving data representativeness, a stratified sampling strategy based on score intervals was adopted. Table 4 and Figure 6 summarize the statistical information of the dataset. Within each score interval, samples were randomly partitioned into the training set and test set at a ratio of 4:1, which preserved the proportion of each performance level and avoided distribution bias that might be introduced by simple random sampling or k-fold cross-validation.

4.2. Experimental Setup

This section elaborates on the experimental configuration and parameter settings in detail, including the hyperparameter configurations of the three categories of baseline models (classical machine learning, ensemble learning, and deep learning), the parameter settings of the proposed TC-ABR algorithm, and the validation scheme on public datasets.
  • Validation on the Power Electronics Technology Experimental Report Dataset
To verify the effectiveness of the proposed MFTC-ABR automated scoring model on power electronics experimental reports, we conducted an evaluation based on the multi-dimensional features extracted in Section 3, combined with Recursive Feature Elimination (RFE) for scoring, and compared the results with multiple baseline models. The configuration of each model is as follows:
(1)
Classical Machine Learning Models
Support vector machine (SVM): Bayesian optimization was adopted for hyperparameter optimization, with a Radial Basis Function (RBF) kernel. The search space included C ∈ [0.1, 500], ε ∈ [0.01, 1], and γ = ‘auto’. The parameter set with optimal performance was selected for evaluation.
Bayesian ridge regression: The model used prior parameters α1 = 1 and α2 = 1, with the maximum number of training iterations set to 100 and a convergence tolerance of 1 × 10−4.
Decision tree regressor: The maximum tree depth was set to 3, and mean squared error minimization was used as the splitting criterion.
Stacking ensemble: A two-layer stacking ensemble model was implemented, with SVM and Gradient Boosting Regression Tree (GBRT) as base learners and linear regression as the meta-learner.
Other baseline models, including Linear Regression and Ridge Regression, were also included for comparison.
(2)
Ensemble Learning
To ensure the fairness and competitiveness of the evaluation, the MFTC-ABR model was compared with several widely recognized gradient boosting frameworks with excellent performance on structured data. The configuration of each baseline method is as follows:
XGBoost: max_depth = 4, η = 0.1, n_estimators = 100
LightGBM: Adopted GBDT algorithm, max_depth = 4, η = 0.05
CatBoost: iterations = 100, depth = 6, η = 0.1
AdaBoostReg: n_estimators = 3, learning_rate = 0.2
(3)
Deep Learning
Multiple deep learning models, including Long Short-Term Memory (LSTM) network, Convolutional Neural Network (CNN), and Transformer, were adopted as baseline scoring algorithms to realize automatic feature extraction, and their performance was compared with the feature engineering method proposed in this study. To save computational resources and improve training efficiency, Bayesian optimization was adopted for hyperparameter tuning of all models. The model was trained for a maximum of 150 epochs, and an early stopping mechanism was triggered if the Mean Absolute Error (MAE) did not improve for five consecutive epochs. The specific configuration of the deep learning models is as follows:
LSTM: optimal training epochs: 105; embedding dimension: 115; hidden layer dimension: 179; learning rate: 0.0007.
CNN: optimal training epochs: 12; embedding dimension: 113; learning rate: 0.001; kernel size: 186.
Transformer: optimal training epochs: 94; embedding dimension: 128; learning rate: 0.001; number of attention heads: 8; number of encoder layers: 4.
(4)
Early Stopping Strategy
To prevent overfitting and ensure a fair comparison of the performance across different models, a validation set-based early stopping mechanism was adopted for all models, with distinct stopping conditions set according to the characteristics of each model.
Deep Learning Models: The patience value is set to 10, and the minimum improvement threshold (min_delta) is set to 0.001. Specifically, training is terminated early when the reduction in validation set MAE is less than 0.001 for 10 consecutive epochs, and the model parameters corresponding to the lowest validation MAE are restored. This strategy is designed to terminate training after model performance saturates, thus avoiding invalid computation.
MFTC-ABR Model: Since AdaBoost-type algorithms are prone to gradual overfitting as the number of weak learners increases (the validation MAE decreases first and then increases), this study adopts a rise-detection early stopping strategy. When the validation set MAE shows a continuous increase for five consecutive added weak learners (i.e., no decrease compared to the previous iteration), the ensemble process is terminated, and the ensemble size corresponding to the lowest validation MAE is selected for the final model. This strategy ensures that the model stops in a timely manner before overfitting occurs.
2. Generalizability Validation of the Core Mechanism of the Scoring Algorithm
To rigorously evaluate the effectiveness of the proposed TC-ABR algorithm (i.e., the improved AdaBoost framework with dynamic threshold control) and to rule out the possibility that its performance advantage derives solely from domain-specific feature engineering, this study conducted supplementary validation on two classical benchmark regression datasets in the machine learning field: the Boston Housing dataset [32] and the Diabetes dataset [33]. These two datasets differ completely from the power electronics experimental report dataset in feature dimensionality, sample size, and noise distribution, and thus serve as a neutral test bed that isolates the influence of domain features and independently verifies the superiority of the core mechanism of TC-ABR in general regression tasks. In this validation stage, TC-ABR was treated as a general-purpose regression algorithm and compared horizontally with multiple baseline models, including the traditional AdaBoostReg as well as current mainstream baselines: TabPFN-2.5 [34], FT-Transformer [35], XGBoost [36], LightGBM [37], and LightGBM optimized via Bayesian optimization [38].
3. Ablation Experiments and Parameter Sensitivity Analysis
For the power electronics experiment report dataset, this study conducted three sets of controlled experiments: stepwise ablation of hierarchical features; individual validation of the contribution of the multi-error fusion mechanism, threshold partitioning mechanism, and historical awareness mechanism in the TC-ABR algorithm to the final results; and quantitative analysis of the impact of the introduced parameters δ, γ A , γ B , γ C , λ, and η on the Mean Absolute Error (MAE).

4.3. Experimental Results and Analysis

This section presents the results and analysis of three sets of comparative experiments: the performance comparison on the power electronics laboratory report dataset (horizontal comparison with classical machine learning, ensemble learning, and deep learning methods), the verification of generalization ability on public datasets, and the ablation experiments and parameter sensitivity analysis of the feature and algorithm modules.
  • Performance on Power Electronics Technology Experimental Reports
The experimental results of classical machine learning models on power electronics technology experimental reports are shown in Table 5. Mean Absolute Error (MAE), correlation coefficient, and scoring consistency rate (i.e., the proportion of samples within a specific error range) were adopted as evaluation metrics.
The proposed MFTC-ABR model exhibits superior performance on all evaluation metrics. In terms of Mean Absolute Error (MAE), it achieves a 61.3% reduction compared with the worst-performing baseline model (Bayesian ridge regression), indicating its high scoring accuracy. In terms of scoring consistency, the model reaches an accuracy of 0.82 within a deviation of 5 points, a 49.1% improvement compared with the traditional stacking ensemble method. It also achieves the highest accuracy of 0.91 within a deviation of 10 points, demonstrating its robust performance under different scoring tolerance thresholds. The high Pearson correlation coefficient (over 0.9) achieved by the model indicates that the multi-dimensional features extracted in the feature engineering stage of Section 3.2 effectively capture the teacher’s scoring logic, ensuring that the model can reliably distinguish between high-quality and low-quality reports.
The comparison results with ensemble learning models are shown in Table 6.
It can be seen from the comparison results in Table 6 that ensemble learning models, including XGBoost, CatBoost and Random Forest, all exhibit better performance than classical machine learning methods in the power electronics laboratory report scoring task, which verifies the effectiveness of tree-based ensemble models in small-sample scenarios. Among them, XGBoost and CatBoost both achieve over 0.91 in scoring consistency rate (error < 10 points) and over 0.97 in correlation coefficient, which are comparable to those of the proposed MFTC-ABR model. This phenomenon indicates that, given high-quality feature engineering, various powerful ensemble learners can effectively fit the scoring pattern, which verifies the universal effectiveness of the multi-dimensional feature system constructed in this study.
However, the proposed MFTC-ABR model still maintains significant advantages in key metrics. In terms of Mean Absolute Error (MAE), MFTC-ABR reaches 3.09, corresponding to a 24.4% reduction compared with CatBoost (MAE = 4.09) and a 27.6% reduction compared with XGBoost (MAE = 4.27). In terms of scoring consistency rate (error < 5 points), MFTC-ABR achieves 0.82, a 13.9% improvement compared with CatBoost and XGBoost (both 0.72). This advantage is mainly attributed to the precise modeling of fine-grained errors via the multi-error fusion and threshold partitioning mechanism in the TC-ABR algorithm, and this advantage is even more prominent in scenarios requiring high-precision scoring (e.g., error within five points). In addition, as a representative of Bagging-based models, Random Forest achieves an MAE of 4.35 and a five-point error consistency rate of 0.69, outperforming the single decision tree regressor (MAE = 6.63) but slightly underperforming XGBoost and CatBoost. This is consistent with the general rule that gradient boosting algorithms usually deliver better performance on structured data.
The above results demonstrate that, for small-sample scoring tasks in professional domains, the quality of feature engineering serves as the critical foundation determining model performance, while targeted optimization at the algorithm level (e.g., the error partitioned management of TC-ABR) can achieve further performance improvement on the basis of well-designed features. The advantages of MFTC-ABR in accuracy-sensitive metrics verify its practical value in small-sample, high-precision scoring scenarios.
The MAE of the MFTC-ABR model and deep learning models is shown in Figure 7.
As can be seen from Figure 7, the validation Mean Absolute Error (MAE) of the Convolutional Neural Network (CNN) decreases rapidly in the early stage, reaches its minimum at the 7th epoch, and then stabilizes, while its MAE on the test set exceeds 14.0. The validation MAE curves of the Long Short-Term Memory (LSTM) network and the Transformer both reach their minima at around the 80th epoch and then stabilize, yet remain higher than that of the CNN. The validation MAE of SciBERT levels off after the 350th epoch, and BERT-BiLSTM reaches an inflection point at approximately the 385th epoch, triggering early stopping, with a negligible subsequent decline. However, both the validation and test MAE values of these two models exceed 20, indicating that pre-trained language models offer no obvious advantage in small-sample engineering domains.
In contrast, the validation MAE of MFTC-ABR decreases rapidly within the first 10 weak learners, and is significantly lower than the minimum validation MAE of all deep learning models. It reaches the lowest value at the 15th weak learner, then starts to rise, triggering early stopping to avoid overfitting. Accordingly, the model with 15 weak learners is selected as the final model in this paper, which achieves a test MAE of 3.09, exhibiting a significant advantage over the best-performing deep learning baseline. MFTC-ABR not only achieves a lower MAE on the validation set, but also presents a more stable convergence process and lower overfitting risk, which demonstrates the significant superiority of the proposed multi-dimensional feature engineering and dynamic threshold-controlled AdaBoostReg algorithm framework in small-sample professional domains.
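The weak-learner selection procedure described above (tracking validation MAE after each boosting round and keeping the ensemble size at which it is lowest) can be sketched with scikit-learn's `AdaBoostRegressor` and its `staged_predict` method. The synthetic data, split ratio, and ensemble size below are illustrative placeholders, not the paper's actual configuration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the report feature matrix (hypothetical data).
X, y = make_regression(n_samples=155, n_features=8, noise=10.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

model = AdaBoostRegressor(n_estimators=40, random_state=0).fit(X_train, y_train)

# Validation MAE after each successive weak learner; the ensemble size that
# minimizes it is kept as the final model (mimicking early stopping).
val_mae = [mean_absolute_error(y_val, pred) for pred in model.staged_predict(X_val)]
best_n = int(np.argmin(val_mae)) + 1
```

The same curve, plotted per weak learner, is what Figure 7 reports for MFTC-ABR.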
2. Performance on Public Datasets
MAE and R2 were adopted as evaluation metrics, and only the top 10 models in terms of performance are presented. The experimental results are shown in Figure 8 and Figure 9.
The experimental results demonstrate that the effectiveness of the proposed TC-ABR method is highly dependent on the intrinsic characteristics of the data, and it delivers substantial performance improvements over the classical AdaBoostReg algorithm on both datasets. On the Boston Housing dataset, TC-ABR achieves the optimal performance (MAE = 2.0501), which verifies that the improved strategies adopted in TC-ABR can more effectively capture complex nonlinear relationships. However, on the diabetes disease progression dataset, which has an extremely low signal-to-noise ratio (SNR), the absolute performance of all models is constrained. Even in this scenario, TC-ABR still maintains a stable performance advantage over AdaBoostReg.
To further verify the generalization ability of the TC-ABR algorithm as a general-purpose regressor, this study conducted supplementary comparative experiments against mainstream state-of-the-art (SOTA) models for structured data on two classic benchmark regression datasets in machine learning: the Boston Housing dataset and the Diabetes dataset. The evaluated models include the widely used industrial gradient boosting tree models XGBoost and LightGBM (as well as a LightGBM variant tuned via Bayesian optimization), TabPFN-2.5, a pre-trained large model dedicated to tabular data, and the FT-Transformer architecture optimized for tabular data. The detailed performance comparison results for the two datasets are shown in Table 7 and Table 8, respectively.
On the Boston Housing dataset, which features a high signal-to-noise ratio (SNR) and a clear intrinsic correlation between features and labels, TC-ABR achieved a Mean Absolute Error (MAE) of 2.0501. This underperforms the 2.0027 of LightGBM tuned via Bayesian optimization and the 2.0285 of vanilla LightGBM, ranking third among all evaluated models on this indicator. This result objectively indicates that, in general regression scenarios with large samples and high SNR, a slight but acceptable gap remains between TC-ABR and mature, extensively engineered gradient boosting models in terms of minimizing average prediction deviation. At the same time, TC-ABR achieved the best performance among all evaluated models on two core metrics: Mean Squared Error (MSE) and the coefficient of determination (R2). Its MSE reached 8.1793, a relative reduction of 18.6% compared with the second-best LightGBM, and its R2 reached 0.8908, an improvement of 3.2 percentage points over the second-best LightGBM. Statistically, MAE assigns equal weight to all prediction errors and quantifies the model's overall average deviation, MSE imposes a higher penalty on extremely large errors, and R2 reflects the model's explanatory power over the overall data distribution. This discrepancy across metrics is directly related to the core design intention of TC-ABR. The target scenario of the algorithm is the automated scoring of laboratory reports in small-sample professional domains, where the core requirement is not the extreme minimization of average prediction deviation but the avoidance of extreme scoring deviations and the guarantee of overall scoring consistency and fairness.
Therefore, the algorithm design prioritizes the enhancement of the control ability for extreme errors rather than the extreme optimization of MAE, which is the core source of the above indicator discrepancy.
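The metric distinction drawn here can be made concrete with a small numeric sketch: two hypothetical scorers with identical MAE but very different behavior on extreme deviations, which MSE penalizes very differently. All score values below are invented for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([70.0, 75.0, 80.0, 85.0, 90.0])   # hypothetical expert scores

pred_a = y_true + 4.0          # scorer A: uniform 4-point deviation everywhere
pred_b = y_true.copy()
pred_b[0] += 20.0              # scorer B: mostly exact, one 20-point outlier

mae_a = mean_absolute_error(y_true, pred_a)   # 4.0
mae_b = mean_absolute_error(y_true, pred_b)   # 4.0 -- identical average deviation
mse_a = mean_squared_error(y_true, pred_a)    # 16.0
mse_b = mean_squared_error(y_true, pred_b)    # 80.0 -- the single outlier dominates
```

Under MAE the two scorers are indistinguishable, while MSE flags scorer B's extreme deviation, mirroring the argument that extreme-error control, rather than average deviation, is what matters for fair scoring.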
On the Diabetes dataset, whose core attributes are high noise and weak feature correlation, the inherently low SNR of the data significantly constrains the absolute fitting performance of all evaluated models. Even in this harsh test scenario, TC-ABR still achieved the best values among all evaluated models on the three core metrics of MAE, MSE, and R2. Specifically, its MAE reached 42.8737, a relative reduction of 5.3% compared with the second-best LightGBM, and its R2 reached 0.4276, an improvement of 2.0 percentage points over the second-best LightGBM. In contrast, pre-trained large models such as TabPFN-2.5 showed significant performance degradation in this scenario. This result validates the anti-interference ability and robustness of TC-ABR on low-SNR, high-noise data, characteristics that align closely with the requirements of this study's core scenario, where small-sample data in professional domains often suffer from annotation noise and uneven sample distribution.
Based on the experimental results of the two benchmark datasets, it can be objectively concluded that the TC-ABR algorithm proposed in this study has the competitiveness to rival the current top SOTA models in general structured data regression tasks. It achieves leading performance across all indicators in harsh scenarios with low SNR and high noise and realizes significant optimization of extreme error control and overall fitting ability in high SNR scenarios, fully verifying the universality of the proposed algorithmic improvements. Meanwhile, the slight gap in MAE on the high SNR large-sample dataset not only objectively reflects that there is still room for improvement of the algorithm in the extreme optimization of average deviation, but also more comprehensively clarifies the advantages and applicable boundaries of the algorithm. More importantly, this gap does not affect the core innovative conclusion of this study: TC-ABR can be well adapted to the core scenario of automated scoring of engineering laboratory reports, which is characterized by small samples, high professionalism, and sensitivity to extreme errors.
3. Ablation Experiments and Parameter Sensitivity Analysis
The ablation experimental results on the Power Electronics Experimental Report Dataset are shown in Table 9.
The ablation experimental results show that there are significant differences in the contribution of each feature layer to the scoring model. The removal of the Experimental Result Analysis Layer leads to a cliff-like drop in model performance (MAE soars from 3.09 to 16.45, and the correlation coefficient drops from 0.98 to 0.52), which fully confirms that this dimension is the most critical discriminant basis in the expert scoring system and validates the necessity of the three-level classification and weighted in-depth quantification strategy adopted for this dimension in Section 3.2 of this paper. The absence of the Experimental Completion Degree Layer and Principle Elaboration Layer also leads to performance loss to varying degrees, indicating that they are also effective components of the final scoring. However, the removal of the Plagiarism Detection Layer does not cause a significant change in model performance. To explore the underlying reason, a visual analysis of the original data distribution of this feature was performed, as shown in Figure 10.
As Figure 10 shows, this feature exhibits extremely low variability in the current dataset, with a variance of only 0.0017 and a standard deviation of 0.0410, and its data points cluster tightly around the mean. In other words, because most report samples in this dataset were completed independently, plagiarism is rare, so this feature dimension has limited discriminative power and the model is insensitive to changes in it. However, this finding does not mean that the plagiarism detection dimension itself is valueless; on the contrary, it highlights an advantage of the feature engineering framework adopted in this study: its interpretability allows us to diagnose the contribution of individual features and feed the diagnosis back into data collection and teaching practice.
The low contribution of this feature in this particular sample set stems from a limitation of the data distribution (an extreme imbalance between positive and negative samples). This indicates that future work on a more robust scoring system should deliberately collect samples containing more plagiarism cases, or introduce a more sophisticated cross-report plagiarism detection algorithm, to enhance the discriminative power of this dimension so that it can play its due role in the model. This precisely shows that the scoring-oriented deconstruction of dimensions is necessary, even though the current data fail to fully activate the potential of every dimension.
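The diagnosis of the near-constant plagiarism feature can be reproduced generically with a variance screen. The feature matrix below is synthetic, with one column drawn to have roughly the standard deviation (about 0.04) reported for the plagiarism feature; the 0.01 variance cutoff is an illustrative choice, not a value from the paper.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
# Hypothetical feature matrix: three informative columns plus a near-constant
# plagiarism-similarity column (std ~0.041, as observed for this dataset).
informative = rng.normal(size=(155, 3))
plagiarism = 0.10 + rng.normal(scale=0.041, size=(155, 1))
X = np.hstack([informative, plagiarism])

# Flag features whose variance falls below an illustrative 0.01 cutoff.
selector = VarianceThreshold(threshold=0.01)
selector.fit(X)
kept = selector.get_support()  # boolean mask of retained columns
```

The near-constant column is the only one screened out, matching the ablation finding that its removal barely changes model performance.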
The ablation experimental results of TC-ABR are shown in Table 10.
As can be seen from Table 10, the model performance degrades drastically when the threshold partitioning mechanism is removed, with the MAE rising to 5.38, corresponding to a relative increase of 74.1%. This result indicates that in the scoring scenario of power electronics experiment reports, which requires fine-grained differentiation between “acceptable errors” and “deviations requiring critical attention”, the three-zone differentiated weight update mechanism based on threshold partitioning is the core pillar of the algorithm’s accuracy. Once degraded to the uniform weight update scheme of traditional AdaBoost, the model cannot accurately identify hard samples with complex logic and ambiguous expressions, resulting in a significant increase in scoring deviation for a large number of reports and a sharp drop in scoring discrimination.
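A minimal sketch of the three-zone differentiated update described above follows. The thresholds separating the acceptable region, stable learning range, and focus range, and the decay/growth coefficients, are hypothetical placeholders rather than the paper's fitted values.

```python
import numpy as np

def three_zone_update(weights, errors, delta=2.0, tau=6.0,
                      gamma_a=0.9, gamma_b=1.1, gamma_c=1.5):
    """Differentiated weight update over three error intervals (illustrative
    re-implementation; all thresholds and coefficients are hypothetical).
    |e| <= delta       : acceptable region     -> weight decays by gamma_a
    delta < |e| <= tau : stable learning range -> mild growth by gamma_b
    |e| > tau          : focus range           -> strong growth by gamma_c
    """
    e = np.abs(errors)
    factor = np.where(e <= delta, gamma_a,
                      np.where(e <= tau, gamma_b, gamma_c))
    w = weights * factor
    return w / w.sum()  # renormalize to a probability distribution

w = np.full(4, 0.25)
errs = np.array([0.5, 3.0, 3.0, 10.0])  # one acceptable, two stable, one focus
w_new = three_zone_update(w, errs)
```

Unlike the uniform AdaBoost update, samples inside the acceptable region lose weight while focus-range samples gain it, which is the partitioned management the ablation shows to be the algorithm's core pillar.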
When the multi-error fusion mechanism is removed, the MAE rises to 4.72. This indicates that relying solely on absolute error is insufficient to fully capture the characteristics of prediction deviations. Especially when processing samples with small numerical errors but systematic deviations in scoring logic (e.g., reports with a correct understanding of principles but non-rigorous expressions), the multi-perspective strategy integrating absolute error, trend error, and distribution position error plays an irreplaceable role.
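The fusion idea can be sketched as a weighted combination of three error views. The concrete definitions used below for trend error (local slope disagreement) and distribution-position error (rank displacement within the batch), and the fusion weights, are plausible stand-ins rather than the paper's exact formulas.

```python
import numpy as np

def fused_error(y_true, y_pred, alpha=0.6, beta=0.2, gamma=0.2):
    """Fuse three error views (weights alpha/beta/gamma are hypothetical).
    - absolute error: magnitude of the deviation
    - trend error: 1 where the local slope of the predictions disagrees in
      sign with the local slope of the true scores
    - distribution-position error: normalized rank displacement in the batch
    """
    abs_err = np.abs(y_pred - y_true)
    trend = np.sign(np.diff(y_pred, prepend=y_pred[0])) != np.sign(
        np.diff(y_true, prepend=y_true[0]))
    rank_t = np.argsort(np.argsort(y_true))
    rank_p = np.argsort(np.argsort(y_pred))
    pos_err = np.abs(rank_t - rank_p) / max(len(y_true) - 1, 1)
    return alpha * abs_err + beta * trend.astype(float) + gamma * pos_err

y_true = np.array([60.0, 70.0, 80.0, 90.0])
y_pred = np.array([62.0, 69.0, 85.0, 88.0])
fused = fused_error(y_true, y_pred)
```

In this example the predictions preserve both the score trend and the ranking, so the fused error reduces to the scaled absolute error; a rank-scrambling prediction would be penalized even at the same absolute deviation.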
When the history-aware mechanism is removed, the MAE increases to 4.08. The history-aware mechanism models the long-term behavior of samples through cumulative attention, which effectively alleviates the model's overfitting to stubborn samples that are repeatedly mispredicted but whose errors may stem from the particularity of the test questions.
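A minimal sketch of how a cumulative, exponentially forgotten error record can damp the weights of persistently mispredicted samples follows; the forgetting factor, the attenuation form, and all coefficients are hypothetical, intended only to illustrate the mechanism's direction of effect.

```python
import numpy as np

def history_adjusted_weights(weights, errors, history, lam=0.8, eta=0.5):
    """One round of history-aware damping (illustrative sketch).
    lam: forgetting factor -- how strongly past errors persist in the record.
    eta: attenuation strength -- how hard a long error history damps a weight.
    """
    # Cumulative attention: exponentially forgotten record of past errors.
    history = lam * history + (1 - lam) * np.abs(errors)
    # Samples with a long record of large errors get their weight damped, so
    # the ensemble stops over-focusing on stubborn outliers.
    damp = 1.0 / (1.0 + eta * history)
    w = weights * damp
    return w / w.sum(), history

w = np.full(3, 1.0 / 3.0)
errs = np.array([1.0, 1.0, 8.0])   # sample 2 is the repeatedly mispredicted one
hist = np.zeros(3)
w, hist = history_adjusted_weights(w, errs, hist)
```

Iterating this update across boosting rounds keeps a stubborn sample's accumulated attention high and its effective weight suppressed, which is the overfitting relief the ablation attributes to this mechanism.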
Compared with the feature ablation experiments, the range of performance variation in the algorithm module ablation is smaller. This verifies that in few-shot application scenarios, the impact of feature engineering on the final results is more significant than that of algorithmic improvements to the model. For scoring tasks in professional fields with scarce data, constructing a multi-dimensional interpretable feature system that is closely aligned with evaluation criteria is the primary factor determining model performance, while algorithm-level optimization only provides incremental improvements on this basis. This precisely addresses the widespread phenomenon of “model prioritization over feature engineering” in existing research on automated scoring in professional fields: without the systematic deconstruction and feature-based representation of scoring dimensions, even state-of-the-art algorithms can hardly break through the performance bottleneck caused by data scarcity.
The results of the parameter sensitivity analysis of TC-ABR are shown in Figure 11.
As can be seen from Figure 11, the parameters differ markedly in sensitivity. δ and γ_C have the most significant impact on model performance: the MAE rises sharply when they deviate from their optimal values, presenting a steep U-shaped curve. γ_B and λ show moderate sensitivity: performance remains stable near the optimal interval, and the MAE increases gradually once they leave the reasonable range. γ_A and η show low sensitivity, with the MAE fluctuating only slightly over a wide range of values, indicating strong robustness to these parameters. Compared with the contribution of feature engineering, the fluctuation range of MAE caused by parameter changes is smaller than the performance degradation caused by feature removal in the feature ablation experiments. This corroborates that, in few-shot scoring tasks in professional fields, feature engineering plays a more decisive role in model performance than hyperparameter tuning. The TC-ABR algorithm maintains its advantages stably within a reasonable parameter range, demonstrating its reliability and practicability as an advanced regression tool.
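The one-at-a-time protocol behind such a sensitivity analysis (vary one hyperparameter over a grid while holding the others fixed, and record the validation MAE at each grid point) can be sketched generically. Since the TC-ABR-specific parameters are not reimplemented here, the example sweeps `learning_rate` of a plain `AdaBoostRegressor` on synthetic data as an analogue.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=155, n_features=8, noise=10.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

# One-at-a-time sweep: vary a single hyperparameter while holding the rest
# fixed, and record validation MAE at each grid point.
grid = [0.05, 0.1, 0.5, 1.0, 2.0]
curve = {}
for lr in grid:
    m = AdaBoostRegressor(n_estimators=30, learning_rate=lr, random_state=0)
    curve[lr] = mean_absolute_error(y_va, m.fit(X_tr, y_tr).predict(X_va))

best_lr = min(curve, key=curve.get)
```

Plotting `curve` for each parameter yields the per-parameter MAE curves of Figure 11; a steep U-shape marks a sensitive parameter, a flat curve a robust one.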

5. Discussion

This chapter conducts an in-depth discussion on the entirety of this research, as well as potential research directions not covered in this work. First, the performance of the MFTC-ABR model is analyzed from three dimensions: the decisive role of feature engineering, the effectiveness of the algorithmic mechanism, and parameter sensitivity and robustness. Second, the limitations and research boundaries of this study are systematically reflected based on four validity dimensions. Finally, the application value and future expansion directions of the MFTC-ABR model in engineering education practice are elaborated in detail.
To address the key challenges of data scarcity, complex evaluation requirements, and insufficient interpretability in the automated scoring of engineering experimental reports, this paper proposes an MFTC-ABR model that integrates multi-level interpretable features and ensemble learning with dynamic threshold control. Experimental results demonstrate that the proposed model achieves excellent performance on the power electronics experimental report dataset, with a mean absolute error (MAE) of 3.09 and a scoring consistency rate (within an absolute deviation of 5 points) of 82%. This result validates the effectiveness of combining scoring-oriented feature deconstruction with task-adaptive algorithm improvement in small-sample scenarios within professional domains.

5.1. The Decisive Role of Feature Engineering

Feature ablation experiments reveal significant differences in the contribution of each feature dimension to the model’s scoring performance. After removing the depth of result analysis feature, the model’s MAE surged from 3.09 to 16.45, and its correlation coefficient dropped from 0.98 to 0.52, showing a catastrophic performance decline. Removing the experimental procedure completion feature increased the MAE to 9.36, and removing the experimental principle comprehension feature raised the MAE to 5.36. In contrast, in the algorithm ablation experiments, even after removing the core threshold partitioning mechanism, the MAE only rose to 5.38, which is far lower than the performance degradation caused by the removal of core feature dimensions.
This stark contrast fully verifies that in small-sample professional scoring tasks, feature engineering has a far more decisive influence on the final results than algorithm-level optimization. Specifically, constructing a multi-dimensional interpretable feature system that is highly aligned with professional scoring standards is the cornerstone of model performance, while algorithm optimization only provides incremental improvement on this basis. This finding breaks the widespread model-centric over feature-centric tendency in existing research, and confirms that the multi-dimensional feature engineering framework proposed in this study effectively solves the core problems of over-simplified evaluation dimensions, insufficient model interpretability and poor generalization ability under small-sample conditions.

5.2. Effectiveness of the Innovative Algorithm Mechanisms

Ablation experiments on the TC-ABR algorithm show that all three innovative mechanisms make a significant positive contribution to scoring accuracy. Removing the threshold partitioning mechanism led to the most dramatic performance degradation (the MAE increased by 74.1% to 5.38). This indicates that in scoring scenarios such as power electronics experimental reports, which require fine-grained differentiation between acceptable errors and high-priority deviations, the adaptive threshold division based on kernel density estimation and the three-zone differentiated weight update mechanism are the core pillars of the algorithm's accuracy. Once degraded to the uniform weight update of the AdaBoostReg algorithm, the model cannot accurately identify hard samples with complex logic and ambiguous expressions, resulting in a significant increase in scoring deviation. Removing the multi-error fusion mechanism increased the MAE by 52.8% to 4.72, indicating that absolute error alone cannot fully capture the nature of prediction deviations. Especially when processing samples with small numerical errors but systematic deviations in scoring logic (e.g., reports with a correct understanding of principles but imprecise expression), the multi-perspective strategy integrating absolute error, trend error, and distribution position error plays an irreplaceable role. Removing the history-aware mechanism increased the MAE by 32.0% to 4.08, confirming that the cumulative attention mechanism effectively alleviates the model's overfitting to persistently mispredicted samples and enhances generalization stability.

5.3. Parameter Sensitivity and Robustness

Parameter sensitivity analysis indicates that the parameters differ significantly in their influence on model performance. δ (threshold of the acceptable region) and γ_C (growth coefficient of the focused attention region) are the most sensitive: when they deviate from their optimal values, the MAE rises sharply in a steep U-shaped curve, so these parameters should be prioritized for fine-tuning. γ_B (growth coefficient of the stable learning region) and λ (forgetting factor) have moderate sensitivity, with stable performance near the optimal interval. γ_A (decay coefficient of the acceptable region) and η (attenuation coefficient) have low sensitivity, with minimal MAE fluctuation over a wide range of values, reflecting the robustness of the algorithm design. Notably, the fluctuation range of MAE caused by parameter changes (3.09 to 5.38) is far smaller than the severe performance collapse caused by feature absence (3.09 to 16.45), which reconfirms the decisive role of feature engineering. The TC-ABR algorithm maintains its advantages stably within a reasonable parameter range, demonstrating its reliability and practicability as an advanced regression tool.

5.4. Validity Threats and Limitation Analysis

Although the MFTC-ABR model proposed in this paper has achieved favorable performance on the automated scoring task for engineering laboratory reports, potential validity threats exist, as in most empirical studies. This paper therefore reflects systematically on the limitations of this research from four dimensions: construct validity, internal validity, external validity, and reliability.
Construct Validity Threats: The core issue of construct validity is whether the measurement indicators used in this study truly reflect the core construct of "engineering laboratory report quality". This construct is quantified through four dimensions: comprehension of experimental principles, completion of experimental procedures, depth of result analysis, and plagiarism detection, yet this operationalization carries two potential threats. First, although the four dimensions were selected based on the teaching syllabus and expert scoring rubrics, they may not fully cover other critical aspects, such as creative thinking, critical thinking, and the standardization of academic writing. Second, the three-level classification system (Descriptive, Basic Analysis, Advanced Analysis) for "depth of result analysis" relies on an MLP classifier for semantic discrimination of text. The classifier's training labels come from expert judgments, which inherently carry a degree of subjectivity. This operationalization may compress the continuity and nuance of analytical thinking in engineering practice into discrete levels, posing a validity risk of insufficient representativeness.
Internal Validity Threats: The labels used for model training in this study are derived from manual scoring by domain experts. However, in engineering education practice, even with explicit scoring rubrics, there may be certain variability in scoring among different experts, as well as by the same expert at different time points. Such variability is an inherent characteristic of human scoring activities and one of the core problems that automated scoring systems seek to address. The high correlation between the model and expert scoring (with a correlation coefficient of 0.98) in this study indicates that the model has learned the dominant scoring patterns, but it may also implicitly capture the individual biases of raters. Future work will introduce a multi-rater label fusion mechanism to mitigate this impact. In addition, the plagiarism detection feature shows limited variance (with a variance of only 0.0017) in the current dataset, reflecting the scarcity of plagiarism in this sample. This suggests that the model performance may change on datasets with different distributions of this feature.
External Validity Threats: The experimental data of this study were derived from 155 laboratory reports of a single course (Power Electronics Technology), which limits the generalizability of the conclusions. First, laboratory reports from different engineering disciplines (e.g., mechanical engineering, civil engineering, computer science) have significant differences in text structure, evaluation criteria, and terminology usage. The “completion of experimental procedures” evaluation module in the proposed model, which relies on keyword extraction from the experimental manual, may be difficult to directly migrate to disciplines with low-quality or structurally inconsistent manuals. Second, there are differences in the writing style of experimental manuals and scoring standards among different universities, so the model’s performance on the current dataset may not be directly reproducible in other teaching environments. To partially alleviate the above concerns, this study validated the TC-ABR algorithm on two public benchmark datasets, the Boston Housing dataset and the Diabetes dataset, proving its generalization ability as a regression algorithm. However, the complete MFTC-ABR scoring system (including the feature engineering framework) still needs to be validated in more diverse disciplinary and institutional contexts. Future work will construct a multi-disciplinary, cross-institutional dataset of engineering laboratory reports to systematically evaluate the transferability of the system in different scenarios.
Reliability Threats: The reliability threats of this study mainly come from three aspects. First, inter-rater and intra-rater reliability. As stated under internal validity threats, even with explicit scoring rubrics, there may be variability in scoring among different experts and by the same expert at different time points. Second, model stability. The feature engineering process relies on manually constructed resources, including the keyword lexicon extracted from the experimental manual and the three-level classification of "depth of result analysis". Minor changes to these resources (such as updates to the experimental manual) may alter the feature distribution and thus affect the model output. Third, variability in data partitioning. Like most empirical studies based on limited samples, this study is affected by how the data are partitioned. A stratified sampling strategy (stratified by score intervals) was adopted to split the training and test sets, which helps ensure distributional similarity between them; however, different partitioning methods may still lead to performance fluctuations within a certain range. Such variability is a common statistical challenge in small-sample empirical studies, reflecting the inherent uncertainty of evaluating model performance under limited data.
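The stratified split described above (stratifying by discretized score intervals so that the train and test sets share the same score distribution) can be sketched as follows; the scores, bin edges, and split ratio are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
scores = rng.uniform(40, 100, size=155)      # hypothetical report scores
X = rng.normal(size=(155, 8))                # hypothetical feature matrix

# Stratify on discretized score intervals so the split preserves the score
# distribution (bin edges are illustrative, giving 5 intervals).
bins = np.digitize(scores, [60, 70, 80, 90])
X_tr, X_te, y_tr, y_te = train_test_split(
    X, scores, test_size=0.2, stratify=bins, random_state=0)
```

Passing the bin labels to `stratify` guarantees every score interval is represented in both splits in near-identical proportions, which is the distributional-similarity property the paper relies on.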

5.5. Discussion on the Feasibility of Data Augmentation

For small-sample scenarios, data augmentation is a potential means to improve the generalization ability of the model, but the augmentation strategy must be carefully selected in combination with the characteristics of professional engineering laboratory reports. During the model development process, this study systematically evaluated the feasibility of text-level data augmentation, including commonly used methods such as back-translation, synonym replacement, and sentence reordering, but found that these methods have inherent limitations in the professional engineering field. On the one hand, engineering laboratory reports contain a large number of professional terms (e.g., “Buck Chopper Circuit”, “Continuous Conduction Mode of inductor current”, “PWM duty cycle”), which lack replaceable synonyms in general corpora, and the back-translation process is very likely to cause mistranslation of professional concepts or semantic distortion. On the other hand, the main focus of expert scoring is whether students accurately describe the experimental operations, complete the specified experimental steps, and conduct in-depth analysis of experimental results, rather than the rhetorical flourish or sentence diversity of the text. Therefore, text-level augmentation not only makes it difficult to generate semantically authentic training samples, but may also introduce noise that deviates from the core essence of scoring, leading the model to learn spurious patterns.
In addition to text augmentation, feature space interpolation is another potential direction for data augmentation. Most features in the constructed multi-dimensional feature system are structured numerical features (e.g., keyword coverage, classification probability, analysis depth score), which are in principle suitable for augmentation in the feature space. However, feature space interpolation faces two fundamental challenges. First, there are currently no mature methods or prior studies to guide the selection of the threshold intervals and step sizes involved in interpolation, and unprincipled settings are very likely to destroy the inherent distribution structure of the features. Second, the "depth of result analysis" in this study's feature framework is itself a fuzzy quantitative concept (transformed from the three-level classification system). Introducing further fuzzy interpolation on top of this already fuzzy quantification would completely divorce the augmentation process from the real semantics of the original data, and the generated samples would lose logical interpretability. Moreover, excessive augmentation may introduce spurious correlations, causing the model to learn patterns that do not exist in real data.
Based on the above analysis, this study holds that data augmentation in professional fields must be coordinated with a strict theoretical and verification mechanism to ensure that the augmented feature space has real semantic logic rather than arbitrary values without basis. In the absence of reliable augmentation methods, the current study prioritizes the authenticity of data and the interpretability of features, which is the core consideration for not adopting data augmentation technology in this paper. Future work will explore more domain-adaptive augmentation strategies, such as combining active learning to prioritize the annotation of high-information hard samples, or constructing semantic-preserving constrained augmentation methods based on expert knowledge.

5.6. Analysis of Domain Generalizability and Architecture Flexibility

Although the MFTC-ABR framework proposed in this study is designed and validated with the power electronics laboratory report as the application scenario, its core design philosophy and modular architecture have good generalization potential, which can adapt to the laboratory report scoring requirements of different engineering disciplines and be flexibly integrated with other regression architectures.
The feature engineering framework follows three general principles: scoring-oriented deconstruction, semantic explicitness, and the synergy between domain prior knowledge and data-driven learning. These principles do not depend on any particular discipline; they are a common abstraction of engineering laboratory report scoring tasks. Migrating the framework to another engineering discipline therefore only requires redefining the scoring dimensions according to that discipline's evaluation criteria and constructing the corresponding feature extractors.
Taking mechanical engineering laboratory reports as an example, the scoring focus may include "drawing standardization", "rationality of structural design", and "rigor of error analysis". For these dimensions, the same methodological path as in this study can be followed: extract the core evaluation points by analyzing the teaching syllabus and scoring rubrics, confirm the representativeness of the dimensions through expert interviews, and then construct the corresponding feature extractors from domain knowledge (e.g., an image feature extraction module for drawing standardization, or keyword matching and logical reasoning features for design rationality).
Because the feature extractors for the individual dimensions are independent of one another, adding or replacing a dimension only requires training the extractor for the new dimension and concatenating its features into the existing feature vector. This modular design substantially reduces the cost of cross-domain migration. Meanwhile, the TC-ABR algorithm is decoupled from the feature engineering module: it operates on the feature vector alone, regardless of the source or meaning of the individual features, and can therefore be applied directly in other scenarios by supplying the corresponding feature vectors.
This design ensures the flexibility and scalability of the scoring system, enabling users to adapt it to the characteristics of specific tasks.
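The extractor-concatenation scheme described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class names, the `report` dictionary keys, and the placeholder feature values are all hypothetical.

```python
import numpy as np

class FeatureExtractor:
    """Interface: map one laboratory report to the features of one scoring dimension."""
    def extract(self, report: dict) -> np.ndarray:
        raise NotImplementedError

class PrincipleComprehensionExtractor(FeatureExtractor):
    def extract(self, report):
        # placeholder: in the paper this would be the output of a lightweight
        # text classifier for the "comprehension of principles" dimension
        return np.asarray([report.get("principle_score", 0.0)])

class DrawingStandardizationExtractor(FeatureExtractor):
    def extract(self, report):
        # hypothetical new dimension for a mechanical-engineering syllabus
        return np.asarray([report.get("drawing_score", 0.0)])

class FeaturePipeline:
    """Extractors are independent; adding a dimension = registering one more."""
    def __init__(self):
        self.extractors = []

    def register(self, extractor: FeatureExtractor) -> "FeaturePipeline":
        self.extractors.append(extractor)
        return self

    def transform(self, report: dict) -> np.ndarray:
        # concatenate per-dimension features into the vector fed to the regressor
        return np.concatenate([e.extract(report) for e in self.extractors])

pipeline = (FeaturePipeline()
            .register(PrincipleComprehensionExtractor())
            .register(DrawingStandardizationExtractor()))
vec = pipeline.transform({"principle_score": 0.8, "drawing_score": 0.6})
```

The regressor only ever sees `vec`, which is what keeps the scoring algorithm decoupled from the choice of dimensions.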

5.7. Theoretical Rationale for AdaBoostReg as the Base Framework

The core innovations of the TC-ABR algorithm, including multi-error fusion, dynamic threshold division, differentiated weight updating, and history-aware adjustment, are designed within the AdaBoost framework, but the underlying algorithmic ideas are general and can be adopted by other ensemble learning or meta-learning methods.
First, AdaBoost was chosen as the base framework because its strategy of focusing on hard samples through iterative sample reweighting naturally matches this study's need to distinguish different error levels. Traditional AdaBoost assigns higher weights to samples with larger errors, which is consistent with the idea that samples in the focus region require priority handling; however, its uniform update rule cannot distinguish acceptable errors from deviations requiring critical attention. This study therefore builds on AdaBoost and introduces dynamic thresholds and differentiated updating into its weight-iteration framework, refining the learning strategy from uniform attention to partitioned management. In this design, AdaBoost's inherent hard-sample focusing and the added error-partitioning strategy reinforce each other rather than being simply superimposed. Second, regarding portability, although these mechanisms are tightly integrated with AdaBoost in TC-ABR, their core computational modules are general. Kernel density estimation and density-valley detection can be used in any regression or classification task that needs to identify natural demarcation points in an error distribution; the three-region differentiated weight update strategy can be embedded in other ensemble learning or meta-learning frameworks to improve the handling of hard samples; and the history-aware adjustment mechanism is a general sample-weighting strategy, applicable wherever overfitting to noisy samples must be suppressed.
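A minimal numeric sketch of the two mechanisms named above: kernel-density valley detection and the three-region, history-aware weight update. All constants here (the damping factor, the boost factor, the 0.3 attenuation coefficient, the quantile fallback) are illustrative stand-ins, not the values used in TC-ABR.

```python
import numpy as np

def kde_valley_thresholds(errors, grid_size=200, bandwidth=None):
    """Gaussian KDE over the error distribution; return the first two density
    valleys (interior local minima) as the region boundaries t1 < t2."""
    errors = np.asarray(errors, dtype=float)
    if bandwidth is None:
        # Silverman's rule of thumb
        bandwidth = 1.06 * errors.std() * len(errors) ** (-1 / 5)
    grid = np.linspace(errors.min(), errors.max(), grid_size)
    dens = np.exp(-0.5 * ((grid[:, None] - errors[None, :]) / bandwidth) ** 2).mean(axis=1)
    is_valley = (dens[1:-1] < dens[:-2]) & (dens[1:-1] < dens[2:])
    valleys = grid[1:-1][is_valley]
    if len(valleys) < 2:
        # no clear valleys: fall back to fixed quantiles (illustrative choice)
        return tuple(np.quantile(errors, [0.5, 0.8]))
    return tuple(np.sort(valleys)[:2])

def update_weights(weights, errors, history, t1, t2, damp=0.5, boost=2.0):
    """Three-region differentiated update: decay weights of acceptable samples
    (error < t1), keep the stable region (t1..t2) unchanged, and boost focus
    samples (error > t2), attenuated the more often a sample has been 'hard'
    (the history-aware term, here 1 / (1 + 0.3 * count))."""
    w = np.asarray(weights, dtype=float).copy()
    focus = errors > t2
    acceptable = errors < t1
    history = history + focus.astype(int)
    w[acceptable] *= damp
    w[focus] *= boost / (1.0 + 0.3 * history[focus])
    w /= w.sum()                     # renormalise to a distribution
    return w, history

# demo: partition a synthetic trimodal error distribution, then update weights
rng = np.random.default_rng(0)
errs = np.concatenate([rng.normal(m, 0.3, 100) for m in (1.0, 5.0, 9.0)])
t1, t2 = kde_valley_thresholds(errs)
w, hist = update_weights(np.full(3, 1 / 3), np.array([0.5, 4.0, 8.0]),
                         np.zeros(3, dtype=int), t1, t2)
```

Inside an AdaBoost.R2-style loop, this update would replace the uniform weight rule after each weak learner is fitted.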

5.8. Practical Significance and Teaching Feedback

In addition to the technical contributions above, the proposed MFTC-ABR model has several practical values in engineering education. First, the system can be deployed as a lightweight backend service that integrates with existing Learning Management Systems (LMS) in universities, such as Chaoxing Xuexitong, Zhihuishu, Moodle, or self-built platforms. After teachers upload students' laboratory reports through the LMS interface, the system automatically performs feature extraction and scoring, and returns the total score together with the evaluation results of each dimension. Automated scoring is thus embedded in the existing teaching workflow without changing teachers' and students' habits, reducing the cost of adopting the technology. Second, the four-dimensional feature system constructed in this study (comprehension of experimental principles, completion of experimental procedures, depth of result analysis, and plagiarism detection) inherently supports diagnostic feedback. The system can output not only a total score but also the student's performance on each dimension. For example, when a student scores low on "depth of result analysis", the system can prompt the student to move from describing phenomena to theory-based analysis; when "completion of experimental procedures" is deficient, it can prompt the student to supplement the missing operational steps. Such fine-grained, dimension-based feedback helps students pinpoint their weaknesses, shifting the outcome from "knowing the score" to "knowing how to improve", and supports the formative evaluation goals of engineering education.
Third, the interpretability of the feature system lets teachers understand the basis of the model's scores, quickly identify common weaknesses across a class, and adjust the teaching content accordingly. It also substantially reduces teachers' repetitive reviewing burden; the time saved can be devoted to higher-value activities such as personalized guidance, on-site experimental Q&A, and teaching design. In summary, the proposed MFTC-ABR model offers a good balance of technical performance, teaching practicability, and implementation cost, providing an implementable technical path for the intelligent upgrading of engineering experimental teaching.
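The dimension-to-prompt feedback loop described above could look like the following sketch. The dimension names, the 0.6 threshold, and the prompt texts are hypothetical placeholders, not the system's actual rules.

```python
# Hypothetical mapping from weak dimensions to actionable prompts.
FEEDBACK_RULES = {
    "result_analysis_depth": "Deepen the analysis: move from describing the "
                             "phenomenon to relating it to the underlying theory.",
    "procedure_completion": "Supplement the missing experimental steps "
                            "(operations, settings, recorded readings).",
    "principle_comprehension": "Restate the experimental principle in your own "
                               "words rather than copying the manual.",
}

def diagnostic_feedback(dimension_scores, threshold=0.6):
    """Return the aggregate score plus a prompt for every dimension below threshold."""
    total = sum(dimension_scores.values()) / len(dimension_scores)
    prompts = [FEEDBACK_RULES[d] for d, s in sorted(dimension_scores.items())
               if s < threshold and d in FEEDBACK_RULES]
    return {"total": round(total, 2), "prompts": prompts}

report = {"principle_comprehension": 0.8,
          "procedure_completion": 0.9,
          "result_analysis_depth": 0.4}
feedback = diagnostic_feedback(report)
```

A backend service would return such a structure to the LMS alongside the predicted score.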

5.9. Challenges and Coping Strategies for AI-Generated Content Detection

With the wide adoption of large language models, students increasingly use AI tools to generate laboratory reports, which challenges the fairness and credibility of automated scoring systems. The plagiarism detection mechanism of the proposed MFTC-ABR model targets high similarity with the experimental manual and text reuse among students; it was not explicitly designed to identify AI-generated content. Notably, however, the model's scores do not rely directly on text fluency or surface features, but on interpretable features built strictly around the core dimensions that domain experts attend to when scoring (principle comprehension, procedure integrity, depth of result analysis). Scoring well on these dimensions requires accurately describing specific experimental operations, citing real data, and reasoning in line with theory. Text produced by a general-purpose large model is often semantically fluent but lacks substantive content tied to the specific experimental task, so it typically receives a low score under the current feature system, which objectively mitigates the direct impact of AI-generated content on scoring results.
Nevertheless, to safeguard academic integrity, AI-generated content detection should still be incorporated into the plagiarism detection module. Future work will introduce a multi-dimensional AI-generated content detection mechanism, specifically: (1) linguistic pattern analysis, building a classifier that distinguishes human writing from AI generation using features such as lexical diversity, sentence complexity, and logical coherence; (2) perplexity detection, using dedicated tools (such as GPTZero or DetectGPT) to compute the perplexity and burstiness of the text and identify traces of AI generation; (3) knowledge consistency verification, using a knowledge graph of experimental principles to detect principle errors or generic statements irrelevant to the experimental task. Fusing these features should enable automatic identification and early warning of AI-generated reports, further improving the credibility of the scoring system and the fairness of educational evaluation.
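Mechanism (1) can be sketched with a few classical stylometric statistics. These are illustrative features only, with hypothetical choices (a simple word regex, unigram entropy as a crude proxy for perplexity); a real detector would train a classifier on such features together with model-based perplexity scores.

```python
import math
import re
from collections import Counter

def stylometric_features(text):
    """Simple linguistic-pattern statistics that could feed a human-vs-AI
    classifier: lexical diversity, sentence-length 'burstiness', and a
    unigram-entropy proxy for perplexity. Illustrative, not a validated detector."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    counts = Counter(words)
    n = max(len(words), 1)
    ttr = len(counts) / n                                  # type-token ratio
    sent_lens = [len(re.findall(r"[A-Za-z']+", s)) for s in sentences]
    m = max(len(sent_lens), 1)
    mean_len = sum(sent_lens) / m
    # AI-generated text often has more uniform sentence lengths (low variance)
    var_len = sum((l - mean_len) ** 2 for l in sent_lens) / m
    # unigram entropy of the word distribution, a crude perplexity proxy
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {"ttr": ttr, "mean_sentence_len": mean_len,
            "sentence_len_var": var_len, "unigram_entropy": entropy}

features = stylometric_features(
    "The diode conducts. The diode blocks. "
    "Current falls sharply when the switch opens!")
```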

5.10. Future Work

Based on the above analysis, future work will proceed along the following directions, aiming to further improve the performance, generalizability, and practical educational value of the proposed MFTC-ABR model while addressing the validity threats identified in Section 5.
First, the multi-dimensional feature engineering system will be extended with multimodal information fusion. Since power electronics laboratory reports contain many non-text elements such as circuit diagrams, mathematical formulas, waveform graphs, and data tables, a multimodal feature extraction module will be integrated. Specifically: (1) Image feature extraction: CNN- or ViT-based identification of circuit topology and automatic extraction of key waveform parameters (amplitude, frequency, duty cycle); (2) Table feature parsing: Table Transformer or rule-matching methods to recover table structure, extract numerical experimental data, and compute deviations from theoretical values; (3) Formula semantic understanding: a formula recognition engine (such as Mathpix) to convert formula images into LaTeX representations, with semantic matching to judge the accuracy and rationality of each formula. These multimodal features will be fused with the existing text features to form a more comprehensive quality evaluation system for laboratory reports.
Second, the generalizability of the model will be verified across domains and scenarios, with adaptive optimization. To address the external-validity limitation of the current study being restricted to a single experiment in the Power Electronics Technology course, a multi-disciplinary, cross-institutional dataset of engineering laboratory reports will be constructed to systematically verify the complete MFTC-ABR scoring system in different disciplines and teaching scenarios. A multi-rater label fusion mechanism will be introduced into training to reduce the impact of inter-rater and intra-rater variability in manual scoring, improving the model's reliability and robustness in real teaching scenarios.
Third, build an intelligent teaching feedback and diagnosis system based on the proposed model, which outputs quantitative evaluation results, specific analysis of strengths and weaknesses, and actionable improvement suggestions, so as to realize the upgrade from “automated scoring” to “intelligent teaching diagnosis”.
Finally, the core TC-ABR algorithm will be optimized for extreme small-sample and low-SNR scenarios. Given the general scarcity of high-quality annotated data in professional engineering education, TC-ABR can be combined with advanced methods such as few-shot learning, semi-supervised learning, and active learning. In particular, an active learning strategy will be designed on top of the three-region sample partitioning mechanism to prioritize the annotation of the hard samples carrying the most information, further reducing the model's dependence on large-scale annotated data and improving its performance in extreme small-sample scenarios.
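The planned active-learning strategy, reusing the three-region partition to rank unlabelled samples for annotation, might look like the following. The ranking policy (focus region first, then by error magnitude) and the function name are assumptions of this sketch, not a specified design.

```python
import numpy as np

def select_for_annotation(errors, t1, t2, budget):
    """Rank unlabelled samples for expert annotation: focus-region samples
    (predicted error > t2) first, then the stable region (t1..t2), then the
    acceptable region, with larger errors first inside each region. The
    thresholds t1/t2 would come from the TC-ABR partitioning step."""
    errors = np.asarray(errors, dtype=float)
    region = np.where(errors > t2, 2, np.where(errors >= t1, 1, 0))
    # lexsort: last key is primary, so sort by (region desc, error desc)
    order = np.lexsort((-errors, -region))
    return order[:budget]

# demo: five unlabelled samples, annotate the three most informative
selected = select_for_annotation([0.2, 3.0, 9.0, 7.0, 0.5],
                                 t1=1.0, t2=5.0, budget=3)
```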

6. Conclusions

The automated scoring of engineering laboratory reports serves as a core support for the intelligent upgrading of engineering experimental teaching, while its large-scale implementation is fundamentally restricted by three persistent bottlenecks in the field: scarcity of high-quality expert-annotated data in professional domains, insufficient interpretability of existing models, and over-simplification of evaluation dimensions. Targeting the above bottlenecks, this study proposes an AdaBoost regression model based on multi-level feature engineering and dynamic threshold control, namely MFTC-ABR, for the automated scoring task of engineering laboratory reports.
The proposed model addresses the core bottlenecks of existing research through two complementary lines of targeted design. On the one hand, an interpretable four-dimensional feature system deeply aligned with professional scoring standards is constructed, fully deconstructing the multi-dimensional evaluation logic of laboratory reports and alleviating the over-simplified evaluation dimensions, weak interpretability, and limited generalization caused by data scarcity in professional scenarios. On the other hand, the AdaBoostReg base framework is improved into the TC-ABR algorithm, which integrates dynamic threshold division, differentiated weight updating, and a history-aware mechanism, significantly enhancing the model's robustness and resistance to overfitting in small-sample professional domains.
Systematic experimental validation demonstrates the effectiveness and generalization ability of the proposed method. On the self-built power electronics laboratory report dataset, the MFTC-ABR model achieves a mean absolute error (MAE) of 3.09 and an 82% scoring consistency rate within a 5-point error tolerance, outperforming mainstream baseline models including XGBoost and CatBoost. On two public benchmark datasets, the core TC-ABR algorithm outperforms the traditional AdaBoostReg algorithm, particularly in robustness under low signal-to-noise ratio conditions. The experimental results also confirm the decisive role of high-quality feature engineering in small-sample professional scoring tasks.
The current research is limited mainly to text-content analysis of laboratory reports, and the model's cross-disciplinary generalization remains to be verified. Subsequent research will address these limitations: integrating multimodal fusion modules for non-text elements such as charts and formulas to build a more comprehensive evaluation system; constructing a multi-disciplinary, cross-institutional laboratory report dataset to systematically verify and optimize the model's domain generalization; and designing an active learning strategy based on TC-ABR's three-region sample division mechanism to further reduce dependence on large-scale annotated data, ultimately yielding a lightweight, practical intelligent scoring system with fine-grained formative teaching feedback capabilities.

Author Contributions

Conceptualization, C.W. and J.S.; methodology, C.W.; software, C.W.; validation, C.W.; formal analysis, J.S.; investigation, J.S.; resources, J.S.; data curation, J.S.; writing—original draft preparation, C.W.; writing—review and editing, C.W.; visualization, C.W.; supervision, J.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (Grant No. U1304501). The APC was funded by Henan University of Science and Technology.

Data Availability Statement

The data presented in this study are available on request from the authors. The data are not publicly available due to privacy restrictions.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
MFTC-ABR: Multi-level Features and Dynamic Threshold Control AdaBoost Regression
AI: Artificial Intelligence
ML: Machine Learning
TC-ABR: Dynamic Threshold Control AdaBoost Regression

Figure 1. Overall Evaluation Framework for Laboratory Reports.
Figure 2. The vocabulary-text matrix expressed in TF-IDF.
Figure 3. Architecture of the fully connected neural network.
Figure 4. Cascaded hierarchical binary classification framework for experimental result analysis depth evaluation.
Figure 5. Framework of the sample weight update strategy.
Figure 6. Distribution proportion of sample scores.
Figure 7. Validation of MAE convergence curves and test performance comparison between the proposed MFTC-ABR model and all baseline models.
Figure 8. MAE and R2 results on the Boston Housing dataset.
Figure 9. MAE and R2 results on the Diabetes dataset.
Figure 10. Fluctuation plot of plagiarism detection features.
Figure 11. TC-ABR parameter sensitivity test.
Table 1. Comparison of classification results for the personal insight section of experimental principles.

Model  P       R     F1      Training Time  Number of Parameters
BERT   0.7333  0.98  0.8390  165.38 s       102 M
MLP    0.8846  0.92  0.9020  0.87 s         263,458
Table 2. Classification results of each section of the experimental result analysis.

        Mandatory                  Optional
Level   One     Two     Three     One     Two     Three
P       0.7123  0.6190  0.75      0.7730  0.7226  0.7
R       0.7619  0.5909  0.75      0.7933  0.7517  0.75
F1      0.7363  0.6047  0.75      0.7830  0.7368  0.7241
Table 3. Examples of experimental report fragments and corresponding feature categories.

Sample 17-4: "Electrical isolation is critical in this experiment for two purposes: to establish independent reference grounds with different potentials, and to reduce interference between ground loops." (Feature category: Personal Comprehension of Experimental Principles)
Sample 18-23: "The trailing edge phenomenon observed at U2-2 is attributed to the turn-off behavior of the internal power diode during potential drop. This action causes reverse current to flow through resistors R11 and R10." (Feature category: Experimental Result Analysis, Mandatory Procedure, Level 2)
Sample 18-23: "The aforementioned figure presents the curves of output voltage (UO) and D1 cathode potential versus inductor current at 12.5% PWM duty cycle. The data confirm that the inductor current operates in discontinuous conduction mode under this setting." (Feature category: Experimental Result Analysis, Optional Procedure, Level 1)
Sample 19-12: "Despite the noisy waveform generated by high-frequency signals, the current interruption phenomenon can be clearly observed before the duty cycle reaches 20.7%." (Feature category: Experimental Result Analysis, Optional Procedure, Level 2)
Table 4. Sample size and text length by cohort.

Cohort  Quantity  Score Range  Average Length
17      48        50~95        3007
18      58        55~100       4526
19      49        30~100       4858
Table 5. Comparison results with classical machine learning models.

Model                    MAE   Correlation Coefficient  Consistency Rate (≤5)  Consistency Rate (<10)
Linear Regression        6.72  0.91                     0.63                   0.81
Ridge Regression         6.90  0.91                     0.63                   0.81
SVM                      6.72  0.94                     0.45                   0.72
Bayesian Ridge           8     0.94                     0.36                   0.72
Decision Tree Regressor  6.63  0.93                     0.54                   0.81
Stacking                 5.18  0.95                     0.55                   0.82
MFTC-ABR                 3.09  0.98                     0.82                   0.91
Table 6. Comparison results with ensemble learning models.

Model          MAE    Correlation Coefficient  Consistency Rate (≤5)  Consistency Rate (<10)
XGBoost        4.27   0.97                     0.72                   0.91
LightGBM       18.55  0.12                     0.18                   0.27
CatBoost       4.09   0.98                     0.72                   0.91
Random Forest  4.35   0.96                     0.69                   0.88
AdaboostReg    4.73   0.92                     0.70                   0.89
MFTC-ABR       3.09   0.98                     0.82                   0.91
Table 7. Comparison results on the Boston Housing dataset.

Model              MAE     MSE      R2
XGBoost            2.1643  10.2447  0.8603
LightGBM           2.0285  10.0428  0.8631
LightGBM_Bayesian  2.0027  10.2682  0.8600
Tabpfn-2.5         2.0853  10.0803  0.8625
FT-Transformer     2.5131  15.7071  0.7858
TC-ABR             2.0501  8.1793   0.8908
Table 8. Comparison results on the Diabetes dataset.

Model              MAE      MSE      R2
XGBoost            46.6470  3319.84  0.3437
LightGBM           45.2690  3076.57  0.4193
LightGBM_Bayesian  46.4495  3250.0   0.3866
Tabpfn-2.5         50.5850  4127.75  0.2209
FT-Transformer     44.3733  3222.10  0.3918
TC-ABR             42.8737  3110.54  0.4276
Table 9. Results of feature ablation experiments.

Feature Dimension            MAE    Correlation Coefficient  Consistency Rate (≤5)  Consistency Rate (<10)
All Features                 3.09   0.98                     0.82                   0.91
-Principle Elaboration Layer 5.36   0.98                     0.64                   0.73
-Completion Degree Layer     9.36   0.9                      0.36                   0.45
-Result Analysis Layer       16.45  0.52                     0.1                    0.45
-Plagiarism Detection Layer  3.09   0.98                     0.82                   0.91
Table 10. Ablation experiment results of the TC-ABR algorithm.

Configuration                    MAE   Correlation Coefficient  Consistency Rate (≤5)  Consistency Rate (<10)
Full Algorithm                   3.09  0.98                     0.82                   0.91
-Multi-Error Fusion Mechanism    4.72  0.92                     0.54                   0.77
-Threshold Partitioning Mechanism 5.38 0.86                     0.42                   0.69
-Historical Awareness Mechanism  4.08  0.95                     0.66                   0.84