This chapter validates the effectiveness of the proposed MFTC-ABR model via a series of systematic experiments. First, the performance of the proposed model is compared with that of classical machine learning, ensemble learning, and deep learning methods on the power electronics laboratory report dataset. Subsequently, the generalization ability of the TC-ABR algorithm as a general-purpose regressor is verified on two public benchmark datasets, namely the Boston Housing dataset and the Diabetes dataset. Finally, the contributions of the feature engineering framework and the core algorithm module are verified, respectively, through ablation experiments and parameter sensitivity analysis.
4.2. Experimental Setup
This section elaborates on the experimental configuration and parameter settings in detail, including the hyperparameter configurations of the three categories of baseline models (classical machine learning, ensemble learning, and deep learning), the parameter settings of the proposed TC-ABR algorithm, and the validation scheme on public datasets.
To verify the effectiveness of the proposed MFTC-ABR automated scoring model on power electronics experimental reports, we conducted an evaluation based on the multi-dimensional features extracted in Section 3, combined with Recursive Feature Elimination (RFE) for scoring, and compared the results with multiple baseline models. The configuration of each model is as follows:
(1) Classical Machine Learning Models
Support vector machine (SVM): Bayesian optimization was adopted for hyperparameter optimization, with a Radial Basis Function (RBF) kernel. The search space included C ∈ [0.1, 500], ε ∈ [0.01, 1], and γ = ‘auto’. The parameter set with optimal performance was selected for evaluation.
Bayesian ridge regression: The model used prior parameters alpha_1 = 1 and alpha_2 = 1, with the maximum number of training iterations set to 100 and a convergence tolerance of 1 × 10⁻⁴.
Decision tree regressor: The maximum tree depth was set to 3, and mean squared error minimization was used as the splitting criterion.
Stacking ensemble: A two-layer stacking ensemble model was implemented, with SVM and Gradient Boosting Regression Tree (GBRT) as base learners and linear regression as the meta-learner.
Other baseline models, including Linear Regression and Ridge Regression, were also included for comparison.
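As a sketch, the classical baselines above can be assembled with scikit-learn roughly as follows. The Bayesian search over C and ε for the SVM is omitted for brevity, and parameter names follow current scikit-learn conventions (e.g., the BayesianRidge iteration cap is `max_iter` in recent releases, `n_iter` in older ones, and is left at its default here).

```python
from sklearn.svm import SVR
from sklearn.linear_model import BayesianRidge, LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor

# Hyperparameters follow the settings reported above.
baselines = {
    # C and epsilon would be tuned by Bayesian optimization over the stated ranges
    "SVM": SVR(kernel="rbf", gamma="auto"),
    "BayesianRidge": BayesianRidge(alpha_1=1.0, alpha_2=1.0, tol=1e-4),
    "DecisionTree": DecisionTreeRegressor(max_depth=3, criterion="squared_error"),
    # Two-layer stacking: SVM + GBRT as base learners, linear regression as meta-learner
    "Stacking": StackingRegressor(
        estimators=[("svm", SVR(kernel="rbf", gamma="auto")),
                    ("gbrt", GradientBoostingRegressor())],
        final_estimator=LinearRegression(),
    ),
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(),
}
```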
(2) Ensemble Learning Models
To ensure a fair and competitive evaluation, the MFTC-ABR model was compared with several widely recognized gradient boosting frameworks known for strong performance on structured data. The configuration of each baseline method is as follows:
XGBoost: max_depth = 4, η = 0.1, n_estimators = 100
LightGBM: Adopted GBDT algorithm, max_depth = 4, η = 0.05
CatBoost: iterations = 100, depth = 6, η = 0.1
AdaBoostReg: n_estimators = 3, learning_rate = 0.2
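For reference, the gradient-boosting baseline settings above can be collected as plain configuration dictionaries; note that η corresponds to the `learning_rate` argument in the Python APIs of XGBoost, LightGBM, and CatBoost.

```python
# Baseline hyperparameters as reported above; eta is passed as `learning_rate`
# in the xgboost/lightgbm/catboost Python APIs.
gbdt_configs = {
    "XGBoost":     {"max_depth": 4, "learning_rate": 0.1, "n_estimators": 100},
    "LightGBM":    {"boosting_type": "gbdt", "max_depth": 4, "learning_rate": 0.05},
    "CatBoost":    {"iterations": 100, "depth": 6, "learning_rate": 0.1},
    "AdaBoostReg": {"n_estimators": 3, "learning_rate": 0.2},
}
```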
(3) Deep Learning Models
Multiple deep learning models, including the Long Short-Term Memory (LSTM) network, Convolutional Neural Network (CNN), and Transformer, were adopted as baseline scoring algorithms to realize automatic feature extraction, and their performance was compared with the feature engineering method proposed in this study. To save computational resources and improve training efficiency, Bayesian optimization was adopted for hyperparameter tuning of all models. Each model was trained for a maximum of 150 epochs, with early stopping on the validation Mean Absolute Error (MAE) applied as described under the Early Stopping Strategy below. The specific configuration of the deep learning models is as follows:
LSTM: optimal training epochs: 105; embedding dimension: 115; hidden layer dimension: 179; learning rate: 0.0007.
CNN: optimal training epochs: 12; embedding dimension: 113; learning rate: 0.001; kernel size: 186.
Transformer: optimal training epochs: 94; embedding dimension: 128; learning rate: 0.001; number of attention heads: 8; number of encoder layers: 4.
(4) Early Stopping Strategy
To prevent overfitting and ensure a fair comparison of the performance across different models, a validation set-based early stopping mechanism was adopted for all models, with distinct stopping conditions set according to the characteristics of each model.
Deep Learning Models: The patience value is set to 10, and the minimum improvement threshold (min_delta) is set to 0.001. Specifically, training is terminated early when the reduction in validation set MAE is less than 0.001 for 10 consecutive epochs, and the model parameters corresponding to the lowest validation MAE are restored. This strategy is designed to terminate training after model performance saturates, thus avoiding invalid computation.
MFTC-ABR Model: Since AdaBoost-type algorithms are prone to gradual overfitting as the number of weak learners increases (the validation MAE first decreases and then increases), this study adopts a rise-detection early stopping strategy. When the validation MAE fails to decrease for five consecutive added weak learners (i.e., shows no improvement over the previous iteration), the ensemble process is terminated, and the ensemble size corresponding to the lowest validation MAE is selected for the final model. This strategy ensures that boosting stops before overfitting sets in.
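The rise-detection strategy can be sketched with a standard scikit-learn AdaBoostRegressor via `staged_predict`; the TC-ABR internals are not reproduced here, and `load_diabetes` merely stands in for the (non-public) report dataset.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Stand-in data: the power electronics report dataset is not public.
X, y = load_diabetes(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# Fit with a generous cap on weak learners, then prune by validation MAE.
model = AdaBoostRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
val_mae = [mean_absolute_error(y_val, p) for p in model.staged_predict(X_val)]

# Rise detection: stop once the validation MAE has failed to decrease for
# `patience` consecutive added weak learners, then keep the best prefix.
patience, run, stop_at = 5, 0, len(val_mae)
for i in range(1, len(val_mae)):
    run = run + 1 if val_mae[i] >= val_mae[i - 1] else 0
    if run >= patience:
        stop_at = i + 1
        break

best_size = int(np.argmin(val_mae[:stop_at])) + 1  # ensemble size with lowest val MAE
```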
2. Generalizability Validation of the Core Mechanism of the Scoring Algorithm
To rigorously evaluate the effectiveness of the proposed TC-ABR algorithm (i.e., the improved AdaBoost framework with dynamic threshold control) and to rule out the possibility that its performance advantage derives solely from domain-specific feature engineering, this study conducted supplementary validation on two classical benchmark regression datasets in the machine learning field: the Boston Housing dataset [32] and the Diabetes dataset [33]. These two datasets differ completely from the power electronics experimental report dataset in feature dimensionality, sample size, and noise distribution, and thus serve as a neutral test bed to isolate the influence of domain features and independently verify the core mechanism of TC-ABR in general regression tasks. In this validation, TC-ABR was treated as a general-purpose regression algorithm and compared horizontally with multiple baseline models, including the traditional AdaBoostReg. In addition, the TC-ABR algorithm was compared with current mainstream baseline models, namely TabPFN-2.5 [34], FT-Transformer [35], XGBoost [36], LightGBM [37], and LightGBM optimized via Bayesian optimization [38].
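A minimal version of this comparison can be run on the Diabetes dataset with scikit-learn. TC-ABR itself is not public, so the classical AdaBoostReg baseline and a generic GBDT are shown; the Boston Housing dataset was removed from recent scikit-learn releases and must be fetched separately.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

results = {}
for name, est in [("AdaBoostReg", AdaBoostRegressor(random_state=0)),
                  ("GBDT", GradientBoostingRegressor(random_state=0))]:
    # `neg_mean_absolute_error` is negated so that higher is better; flip back.
    results[name] = -cross_val_score(
        est, X, y, cv=5, scoring="neg_mean_absolute_error").mean()
```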
3. Ablation Experiments and Parameter Sensitivity Analysis
For the power electronics experiment report dataset, this study conducted three sets of controlled experiments: stepwise ablation of hierarchical features; individual validation of the contribution of the multi-error fusion mechanism, threshold partitioning mechanism, and historical awareness mechanism in the TC-ABR algorithm to the final results; and quantitative analysis of the impact of the introduced parameters δ, , λ, and η on the Mean Absolute Error (MAE).
4.3. Experimental Results and Analysis
This section presents the results and analysis of three sets of comparative experiments: the performance comparison on the power electronics laboratory report dataset (horizontal comparison with classical machine learning, ensemble learning, and deep learning methods), the verification of generalization ability on public datasets, and the ablation experiments and parameter sensitivity analysis of the feature and algorithm modules.
The experimental results of classical machine learning models on power electronics technology experimental reports are shown in Table 5. Mean Absolute Error (MAE), correlation coefficient, and scoring consistency rate (i.e., the proportion of samples within a specific error range) were adopted as evaluation metrics.
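These three metrics can be computed as follows; the helper name and the example scores are purely illustrative.

```python
import numpy as np

def scoring_metrics(y_true, y_pred, tolerances=(5, 10)):
    """MAE, Pearson correlation, and consistency rate (the share of
    predictions whose absolute error is within each tolerance)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = np.abs(y_true - y_pred)
    metrics = {"MAE": err.mean(),
               "r": np.corrcoef(y_true, y_pred)[0, 1]}
    for t in tolerances:
        metrics[f"consistency@{t}"] = (err <= t).mean()
    return metrics

# Hypothetical teacher scores vs. model predictions:
m = scoring_metrics([80, 72, 90, 65], [78, 75, 88, 70])
```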
The proposed MFTC-ABR model exhibits superior performance on all evaluation metrics. In terms of Mean Absolute Error (MAE), it achieves a 61.3% reduction compared with the worst-performing baseline model (Bayesian ridge regression), indicating its high scoring accuracy. In terms of scoring consistency, the model reaches an accuracy of 0.82 within a deviation of 5 points, a 49.1% improvement over the traditional stacking ensemble method. It also achieves the highest accuracy of 0.91 within a deviation of 10 points, demonstrating robust performance under different scoring tolerance thresholds. The high Pearson correlation coefficient (over 0.9) achieved by the model indicates that the multi-dimensional features extracted in the feature engineering stage of Section 3.2 effectively capture the teacher's scoring logic, ensuring that the model can reliably distinguish between high-quality and low-quality reports.
The comparison results with ensemble learning models are shown in Table 6.
It can be seen from the comparison results in Table 6 that ensemble learning models, including XGBoost, CatBoost, and Random Forest, all exhibit better performance than classical machine learning methods in the power electronics laboratory report scoring task, which verifies the effectiveness of tree-based ensemble models in small-sample scenarios. Among them, XGBoost and CatBoost both achieve over 0.91 in scoring consistency rate (error < 10 points) and over 0.97 in correlation coefficient, which are comparable to those of the proposed MFTC-ABR model. This indicates that, given high-quality feature engineering, various powerful ensemble learners can effectively fit the scoring pattern, which verifies the universal effectiveness of the multi-dimensional feature system constructed in this study.
However, the proposed MFTC-ABR model still maintains significant advantages in key metrics. In terms of Mean Absolute Error (MAE), MFTC-ABR reaches 3.09, corresponding to a 24.4% reduction compared with CatBoost (MAE = 4.09) and a 27.6% reduction compared with XGBoost (MAE = 4.27). In terms of scoring consistency rate (error < 5 points), MFTC-ABR achieves 0.82, a 13.9% improvement compared with CatBoost and XGBoost (both 0.72). This advantage is mainly attributed to the precise modeling of fine-grained errors via the multi-error fusion and threshold partitioning mechanism in the TC-ABR algorithm, and this advantage is even more prominent in scenarios requiring high-precision scoring (e.g., error within five points). In addition, as a representative of Bagging-based models, Random Forest achieves an MAE of 4.35 and a five-point error consistency rate of 0.69, outperforming the single decision tree regressor (MAE = 6.63) but slightly underperforming XGBoost and CatBoost. This is consistent with the general rule that gradient boosting algorithms usually deliver better performance on structured data.
The above results demonstrate that, for small-sample scoring tasks in professional domains, the quality of feature engineering serves as the critical foundation determining model performance, while targeted optimization at the algorithm level (e.g., the error partitioned management of TC-ABR) can achieve further performance improvement on the basis of well-designed features. The advantages of MFTC-ABR in accuracy-sensitive metrics verify its practical value in small-sample, high-precision scoring scenarios.
The MAE of the MFTC-ABR model and the deep learning models is shown in Figure 7.
As can be seen from Figure 7, the validation Mean Absolute Error (MAE) of the Convolutional Neural Network (CNN) decreases rapidly in the early stage, reaches its minimum at the 7th epoch, and then stabilizes, while its test-set MAE remains above 14.0. The validation MAE of the Long Short-Term Memory (LSTM) network and the Transformer each reaches its minimum at around the 80th epoch and then stabilizes, yet remains higher than that of the CNN. The validation MAE of SciBERT levels off after the 350th epoch, and BERT-BiLSTM reaches an inflection point at approximately the 385th epoch, triggering early stopping with a negligible subsequent decline. However, both the validation and test MAE values of these two models exceed 20, indicating that pre-trained language models offer no obvious advantage in small-sample engineering domains.
In contrast, the validation MAE of MFTC-ABR decreases rapidly within the first 10 weak learners, and is significantly lower than the minimum validation MAE of all deep learning models. It reaches the lowest value at the 15th weak learner, then starts to rise, triggering early stopping to avoid overfitting. Accordingly, the model with 15 weak learners is selected as the final model in this paper, which achieves a test MAE of 3.09, exhibiting a significant advantage over the best-performing deep learning baseline. MFTC-ABR not only achieves a lower MAE on the validation set, but also presents a more stable convergence process and lower overfitting risk, which demonstrates the significant superiority of the proposed multi-dimensional feature engineering and dynamic threshold-controlled AdaBoostReg algorithm framework in small-sample professional domains.
2. Performance on Public Datasets
MAE and R² were adopted as evaluation metrics, and only the top 10 models in terms of performance are presented. The experimental results are shown in Figure 8 and Figure 9.
The experimental results demonstrate that the performance of the proposed TC-ABR method depends on the intrinsic characteristics of the data, yet it delivers substantial improvements over the classical AdaBoostReg algorithm on both datasets. On the Boston Housing dataset, TC-ABR achieves the best performance among the models in this comparison (MAE = 2.0501), verifying that its improved strategies more effectively capture complex nonlinear relationships. On the diabetes disease progression dataset, whose signal-to-noise ratio (SNR) is extremely low, the absolute performance of all models is constrained; even in this scenario, TC-ABR maintains a stable advantage over AdaBoostReg.
To further verify the generalization ability of the TC-ABR algorithm as a general-purpose regressor, this study conducted supplementary comparative experiments against mainstream state-of-the-art (SOTA) models for structured data on the two classic benchmark regression datasets: the Boston Housing dataset and the Diabetes dataset. The evaluated models include the widely used industrial gradient boosting tree models XGBoost and LightGBM, LightGBM optimized via Bayesian optimization, TabPFN-2.5 (a pre-trained model dedicated to tabular data), and the FT-Transformer architecture designed for tabular data. The detailed performance comparison results on the two datasets are shown in Table 7 and Table 8, respectively.
On the Boston Housing dataset, which features a high signal-to-noise ratio (SNR) and clear intrinsic correlation between features and labels, TC-ABR achieved a Mean Absolute Error (MAE) of 2.0501. This underperforms the 2.0027 of LightGBM optimized via Bayesian optimization and the 2.0285 of vanilla LightGBM, ranking third among all evaluated models on this indicator. This result objectively indicates that, in general regression scenarios with large samples and high SNR, a slight though acceptable gap remains between TC-ABR and mature, industrially optimized gradient boosting models in minimizing the average prediction deviation. At the same time, TC-ABR achieved the best performance among all evaluated models on two core metrics: Mean Squared Error (MSE) and the coefficient of determination (R²). Its MSE reached 8.1793, a relative reduction of 18.6% compared with the second-best LightGBM, and its R² reached 0.8908, a relative improvement of 3.2 percentage points over the second-best LightGBM. Statistically, MAE assigns equal weight to all prediction errors and quantifies the model's overall average deviation, whereas MSE penalizes extremely large errors more heavily, and R² reflects the model's explanatory power for the overall distribution of the data. This discrepancy across metrics is directly related to the core design intention of TC-ABR: its target scenario is the automated scoring of laboratory reports in small-sample professional domains, where the core requirement is not the extreme optimization of average prediction deviation but the avoidance of extreme scoring deviations and the guarantee of overall scoring consistency and fairness. The algorithm design therefore prioritizes the control of extreme errors over the extreme optimization of MAE, which is the main source of the discrepancy observed above.
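The distinction between the metrics can be made concrete with two error profiles that share the same MAE: the profile containing a single extreme deviation is penalized far more heavily by MSE, which is exactly the behavior the design prioritizes.

```python
import numpy as np

uniform_errors = np.array([3.0, 3.0, 3.0, 3.0])  # steady small deviations
spiky_errors = np.array([1.0, 1.0, 1.0, 9.0])    # one extreme deviation

mae_uniform, mae_spiky = uniform_errors.mean(), spiky_errors.mean()  # both 3.0
mse_uniform = (uniform_errors ** 2).mean()  # 9.0
mse_spiky = (spiky_errors ** 2).mean()      # (1 + 1 + 1 + 81) / 4 = 21.0
```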
On the Diabetes dataset, characterized by high noise and weak feature correlation, the inherently low SNR of the data significantly constrains the absolute fitting performance of all evaluated models. Even in this harsh test scenario, TC-ABR still achieved the best values among all evaluated models on the three core metrics of MAE, MSE, and R². Specifically, its MAE reached 42.8737, a relative reduction of 5.3% compared with the second-best LightGBM, and its R² reached 0.4276, a relative improvement of 2.0 percentage points over the second-best LightGBM. In contrast, pre-trained models such as TabPFN-2.5 showed significant performance degradation in this scenario. This result validates the anti-interference ability and robustness of TC-ABR on low-SNR, high-noise data, characteristics closely aligned with the core scenario of this study, where small-sample data in professional domains often suffer from annotation noise and uneven sample distribution.
Based on the results on the two benchmark datasets, it can be objectively concluded that the TC-ABR algorithm proposed in this study is competitive with current top SOTA models on general structured-data regression tasks. It achieves leading performance across all indicators in the harsh low-SNR, high-noise scenario and delivers significant gains in extreme error control and overall fitting ability in the high-SNR scenario, verifying the generality of the proposed algorithmic improvements. Meanwhile, the slight MAE gap on the high-SNR, larger-sample dataset objectively reflects remaining room for improvement in optimizing the average deviation and more clearly delineates the advantages and applicable boundaries of the algorithm. More importantly, this gap does not affect the core conclusion of this study: TC-ABR is well suited to the target scenario of automated scoring of engineering laboratory reports, which is characterized by small samples, high professionalism, and sensitivity to extreme errors.
3. Ablation Experiments and Parameter Sensitivity Analysis
The ablation experimental results on the Power Electronics Experimental Report Dataset are shown in Table 9.
The ablation results show significant differences in the contribution of each feature layer to the scoring model. Removing the Experimental Result Analysis Layer causes a precipitous drop in model performance (the MAE soars from 3.09 to 16.45, and the correlation coefficient drops from 0.98 to 0.52), which confirms that this dimension is the most critical discriminant basis in the expert scoring system and validates the necessity of the three-level classification and weighted in-depth quantification strategy adopted for this dimension in Section 3.2. The absence of the Experimental Completion Degree Layer and the Principle Elaboration Layer also causes performance losses to varying degrees, indicating that they are likewise effective components of the final scoring. However, removing the Plagiarism Detection Layer does not significantly change model performance. To explore the underlying reason, a visual analysis of the original data distribution of this feature was performed, as shown in Figure 10.
It can be seen from Figure 10 that the sample data of this feature exhibit extremely low volatility in the current dataset, with a variance of only 0.0017 and a standard deviation of 0.0410, and the data points are closely distributed around the mean. In other words, since most report samples in this dataset were completed independently, plagiarism is rare, so this feature dimension has limited discriminative power and the model is insensitive to changes in it. However, this does not mean that the plagiarism detection dimension itself is valueless; on the contrary, it reflects an advantage of the feature engineering framework adopted in this study: its interpretability enables us to diagnose the contribution of individual features and feed the findings back into data collection and teaching practice.
The low contribution of this feature in the present sample set stems from a limitation of the data distribution (an extreme imbalance between positive and negative samples). This indicates that, to build a more robust scoring system in future work, it will be necessary to deliberately collect samples containing more plagiarism cases or to introduce a more sophisticated cross-report plagiarism detection algorithm, so that this dimension can play its due role in the model. This precisely shows that the dimensions deconstructed according to scoring orientation are necessary, even though the current data fail to fully activate their potential.
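Such a near-constant feature can be diagnosed directly from its dispersion. The sample below is synthetic, generated merely to mimic the reported distribution (variance ≈ 0.0017, standard deviation ≈ 0.041); the mean value of 0.12 is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic plagiarism-similarity scores clustered tightly around the mean,
# mimicking the distribution reported for the real dataset.
plagiarism_feature = rng.normal(loc=0.12, scale=0.041, size=200)

variance, std = plagiarism_feature.var(), plagiarism_feature.std()
# A feature this flat carries little discriminative signal; it could also be
# flagged automatically with sklearn.feature_selection.VarianceThreshold.
is_low_discriminative = variance < 0.01
```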
The ablation experimental results of TC-ABR are shown in Table 10.
As can be seen from Table 10, model performance degrades drastically when the threshold partitioning mechanism is removed, with the MAE rising to 5.38, a relative increase of 74.1%. This result indicates that in the scoring scenario of power electronics experiment reports, which requires fine-grained differentiation between "acceptable errors" and "deviations requiring critical attention", the three-zone differentiated weight update mechanism based on threshold partitioning is the core pillar of the algorithm's accuracy. Once degraded to the uniform weight update scheme of traditional AdaBoost, the model cannot accurately identify hard samples with complex logic and ambiguous expressions, resulting in a significant increase in scoring deviation for a large number of reports and a sharp drop in scoring discrimination.
When the multi-error fusion mechanism is removed, the MAE rises to 4.72. This indicates that relying solely on absolute error is insufficient to fully capture the characteristics of prediction deviations. Especially when processing samples with small numerical errors but systematic deviations in scoring logic (e.g., reports with a correct understanding of principles but non-rigorous expressions), the multi-perspective strategy integrating absolute error, trend error, and distribution position error plays an irreplaceable role.
When the historical awareness mechanism is removed, the MAE increases to 4.08. The historical awareness mechanism models the long-term performance of samples through cumulative attention, which effectively alleviates the model’s overfitting to certain stubborn samples that are repeatedly mispredicted, yet may stem from the particularity of the test questions.
Compared with the feature ablation experiments, the range of performance variation in the algorithm module ablation is smaller. This verifies that in few-shot application scenarios, the impact of feature engineering on the final results is more significant than that of algorithmic improvements to the model. For scoring tasks in professional fields with scarce data, constructing a multi-dimensional interpretable feature system that is closely aligned with evaluation criteria is the primary factor determining model performance, while algorithm-level optimization only provides incremental improvements on this basis. This precisely addresses the widespread phenomenon of “model prioritization over feature engineering” in existing research on automated scoring in professional fields: without the systematic deconstruction and feature-based representation of scoring dimensions, even state-of-the-art algorithms can hardly break through the performance bottleneck caused by data scarcity.
The results of the parameter sensitivity analysis of TC-ABR are shown in Figure 11.
As can be seen from Figure 11, there are significant differences in the sensitivity of each parameter. δ and
have the most significant impact on model performance: the MAE rises sharply when they deviate from the optimal values, presenting a steep U-shaped curve.
and λ have moderate sensitivity: the performance remains stable near the optimal interval, and the MAE increases gradually when they exceed the reasonable range.
and η have low sensitivity, with the MAE fluctuating slightly over a wide range of values, indicating that the model has strong robustness to these parameters. Compared with the contribution of feature engineering, the fluctuation range of MAE caused by parameter changes is smaller than the performance degradation caused by feature removal in the feature ablation experiments. This corroborates that in few-shot scoring tasks in professional fields, the decisive role of feature engineering on model performance is greater than that of hyperparameter tuning. The TC-ABR algorithm can stably exert its advantages within a reasonable parameter range, which demonstrates its reliability and practicability as an advanced regression tool.
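The one-at-a-time protocol behind Figure 11 can be sketched as follows. TC-ABR's δ, λ, and η are not public, so a standard AdaBoostRegressor's learning rate stands in for the swept parameter, and `load_diabetes` stands in for the report dataset; only the protocol, not the reported curves, is reproduced.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

# Vary one hyperparameter while holding all others fixed, recording the
# cross-validated MAE at each setting (the "sensitivity curve").
sensitivity = {}
for lr in [0.05, 0.2, 0.5, 1.0]:
    est = AdaBoostRegressor(learning_rate=lr, random_state=0)
    sensitivity[lr] = -cross_val_score(
        est, X, y, cv=3, scoring="neg_mean_absolute_error").mean()
```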