Among 653 patients who underwent chest CT, the highest DLP value was 914.5, and the lowest DLP value was 68.81. The breast-specific average radiation dose was calculated using the CatBoostPSO model was 9.76 mGy (1.08–52.7), the average weight was 73.1 kg (46–133), the average height was 1.59 m (1.47–1.77), and the average age was 58.6 (33–88).
Among the five regression machine learning models used, CatBoostPSO demonstrated superior performance across all metrics. It achieved the lowest MSE (0.3795), MAE (0.3846), and MAPE (4.37%) values, while also achieving the highest R2 value of 0.9875. The CatBoost and Gradient Boost models stand out as algorithms that produce the closest predictions to real data. These two models showed sensitivity to high-frequency changes and were able to closely follow the actual curves even in extreme values.
3.1. Machine Learning Models
In this study, we deliberately focused on five regression algorithms—CatBoost, Gradient Boosting, Extra Trees, AdaBoost, and Random Forest—because tree-based ensemble methods are well-suited for low-dimensional, structured clinical data and can model non-linear relationships and feature interactions effectively with a moderate sample size. Given that our predictors are tabular (weight, height, BMI, breast thickness, age, and DLP) and the cohort size is 653 patients, we did not include deep learning architectures, which typically require larger datasets and image-based inputs to provide a clear performance advantage and are less interpretable in routine clinical practice.
In this case, the performance of five regression machine learning models—CatBoost, Gradient Boosting, Extra Trees, AdaBoost, and Random Forest—that are optimized using the Particle Swarm Optimization (PSO) algorithm is examined in detail. The evaluation metrics considered are Mean Squared Error (MSE), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), and Coefficient of Determination (R
2), which collectively provide an overall impression of the predictive capability and generalization performance of the models. The parameters of each model were optimized via PSO, enabling an unbiased and automated search for the most suitable combinations of parameters with which to reduce error on the test data. The results of the experiments are presented in
Table 1. Each model’s performance is discussed in detail as follows.
CatBoostPSO demonstrated superior performance across all metrics, with the lowest MSE (0.3795), MAE (0.3846), and MAPE (4.37%), while also achieving the highest R
2 value of 0.9875. The results show that not only is the magnitude of error small, but the explanatory power of the model is also high. This can be ascribed to CatBoost’s ability to handle complex feature interactions as well as its regularization technique that avoids overfitting. The best combination shows that the tree depth was relatively shallow(4) with a moderate value of the learning rate (0.1787), with a compromise between the complexity of the model and its accuracy. GradientBoostPSO also gave promising results with the mean squared error (MSE) value of 1.1458 and the value of R-squared (R
2) as 0.9622, outperforming conventional ensemble models like Random Forest and AdaBoost. The value of tree depth was biased towards being stable with low overfitting, along with a high number of estimators (200) that helped retain the accuracy. ExtraTreePSO followed closely, maintaining an R
2 value of 0.9504. However, both MSE and MAE values were moderately higher than those of CatBoost and Gradient Boosting, thus indicating a slightly less robust fit. The relatively deep trees (depth = 12) in the best configuration may have led to increased variance. AdaBoostPSO and RandomForestPSO showed comparably weaker performance, with MSE values above 2.4. Although all these models still achieved R
2 scores higher than 0.91, the higher MAE and MAPE values imply lower precision in pointwise predictions. Notably, the used base estimator in AdaBoost adopted a tuned tree depth of 7, indicating attempts to capture more non-linear relationships; however, the performance lagged behind boosting methods such as CatBoost. In general, the experiments confirm that gradient-based ensemble methods, in particular CatBoost, significantly outperform bagging-based methods such as Random Forest in terms of both predictive accuracy and generalization capability when optimized using swarm intelligence. By incorporating PSO, the search process for the best hyperparameters was carried out in a more efficient, unbiased manner, thus proving the value of metaheuristic optimization in machine learning algorithms. In the figures produced for every model, the actual value as well as the predictions were shown through the use of the sample number, as shown in
Figure 1. By closely examining the figures, the sensitivity of the models to changes over time was determined. Overall, it was found that every model was well able to predict low, medium, and high values, especially well in terms of predicting changes that happened within specific periods. But for those points that experienced sudden jumps in terms of high values, some models were not so good in predicting such extreme values, thus increasing their prediction errors. Specifically, the CatBoost model, along with the Gradient Boost model, can be seen as the algorithms that generated predictions closest to the real value. These two models demonstrated sensitivity to high-frequency changes and were able to closely follow the actual curves even at extreme values. In contrast, while the Extra Trees and AdaBoost models effectively tracked general trends, prediction deviations became more pronounced, especially at peak values. The Random Forest model, on the other hand, produced a relatively smooth prediction curve and diverged from the actual data in some time periods. This is due to the model’s overly smoothing effect. In conclusion, the findings reveal that CatBoost and Gradient Boost models are more successful in problems involving dynamic structures such as time series prediction.
Figure 2 contains scatter plots reflecting the relationship between actual values and predicted values to evaluate the regression performance of CatBoost, Gradient Boost, Extra Trees, AdaBoost, and Random Forest algorithms. In each sub-graph, the horizontal axis shows the actual values, while the vertical axis shows the values predicted by the model; a red dotted line has been added as the perfect prediction line (y = x). The proximity to this line is a visual indicator of the model’s prediction accuracy. When the graphs are examined as a whole, all models make their predictions very close to the actual values in most cases. Most of the points are clustered close to the red reference line, indicating that the models have learned the general trends correctly. However, in each model, deviations from the reference line are noticeable, especially at high-value outliers. These deviations indicate that the models perform relatively poorly in predicting extreme values and that systematic errors may occur in some cases. In particular, the CatBoost and Gradient Boost models are positioned closer to the red line across a wide range of values, indicating that they provide more stable and generalizable predictions. Extra Trees and Random Forest models also perform similarly well, but prediction deviations are observed in a few high-value examples. AdaBoost, on the other hand, provides successful predictions in the low value range but shows more deviation at medium and high levels. In conclusion, these graphs show that models are evaluated not only based on statistical metrics but also in terms of visual consistency. CatBoost and Gradient Boost algorithms stand out in regression accuracy, but all models require careful analysis and potential model improvements at high values.
Figure 3 shows the distribution of prediction errors for five different regression algorithms (CatBoost, Gradient Boost, Extra Trees, AdaBoost, and Random Forest), visualized using histograms and density curves. The horizontal axis represents prediction errors (actual value—predicted value), while the vertical axis represents the frequency of these errors. These graphs are very valuable for analyzing the deviation behavior of models and seeing to what extent predictions are prone to error. When the graphs are examined in general, it can be observed that prediction errors in all models are concentrated close to zero, i.e., within the correct prediction range. In particular, the error distribution in the CatBoost, Gradient Boost, and Extra Trees models is quite symmetrical, with a dense accumulation around the center (around 0). This implies that such models forecast without systematic bias and tend to perform with small errors. The CatBoost model was highly accurate with most errors between −2 and +2. Both the Gradient Boost and Extra Trees models also have a concentrated distribution of errors, but more infrequent but meaningful deviations are recorded at extreme points. The error distribution in the AdaBoost model is more spread out, and there is greater variation between +5 and −5. The Random Forest model primarily follows the central distribution but produces prediction errors greater than 10 for certain outliers. Overall, these distribution charts are crucial to study not only the average performance of the models, but also their reliability and their capacity to respond in unexpected situations. CatBoost and Gradient Boost models demonstrate a stable prediction performance with tight ranges and central density for the distribution of error, whereas for AdaBoost and Random Forest models’ performances, more cautious analyses and deeper optimization are needed due to the deviations of extreme values.
Figure 4 juxtaposes how four different regression success metrics (MSE, MAE, MAPE, and R
2) of the CatBoost, Gradient Boosting, Extra Trees, AdaBoost, and Random Forest models change with model iterations (number of trees or number of iterations in the learning process). These graphs were used to monitor the learning curve and convergence pattern of the models. Each graph provides comprehensive information about stability and performance overall by illustrating the way the model’s performance was enhanced during training. The CatBoost model showed a steady decrease in MSE, MAE, and MAPE values as iterations progressed, indicating that the model learned more with each step. The increase observed in the R
2 value illustrates that the explanatory power and precision of the model were enhanced throughout. Similarly, for the Gradient Boosting algorithm, improvements in metrics and a steady increase in the R
2 score were observed, indicating that the model has acquired good generalization power. For the AdaBoost algorithm, there was a sudden drop in error metrics, such as MSE and MAE, especially in the first few iterations, after which the metric values stabilized. This indicates that AdaBoost learns quickly in the first few iterations but then becomes slower. The R
2 plot also confirms this, plateauing after a while. The Random Forest and Extra Trees models, however, showed negligible variation in the iterations. The reason why the measurements were nearly fixed in these two models is that the algorithms were trained using a certain number of trees, and these structures were not recalculated very often. Therefore, it can be said that these models are efficient in strong initial performance with their ensemble model rather than an iterative improvement mechanism. Overall, CatBoost and Gradient Boosting models are not only good at high performance but also at stable and continuous improvement in learning processes. AdaBoost is efficient with its fast-learning capability, while Extra Trees and Random Forest models are efficient in consistent but solid performance.
In
Figure 2 and
Figure 3, all models exhibit low and centrally concentrated errors for the majority of low–to–medium breast dose values, whereas larger and more variable residuals are observed for a small number of highest-dose outliers, particularly in the AdaBoost and Random Forest models. This behavior likely reflects the limited number of observations in the extreme upper tail and the smoothing effect of bagging-based ensembles, which prioritize fit in the densely populated central region. CatBoost and Gradient Boost maintain tighter error distributions across almost the entire range, although a slight increase in error is also seen for the rare highest-dose cases. Future work will focus on collecting more high-dose examples and exploring tail-sensitive training strategies to further improve performance for extreme values.
Figure 5 shows the hyperparameter optimization processes performed by applying the Particle Swarm Optimization (PSO) algorithm to the CatBoost, Gradient Boost (GB), Extra Trees, AdaBoost, and Random Forest models. The plots show how the best score (p_best) of 10 particles and the global best score (g_best) change with the iterations. The
Y-axis represents Mean Squared Error (MSE), and the
X-axis represents the iterations. When the graphs are considered collectively, it can be observed that with an increase in iterations in each model, both the particle best scores and the global best scores individually decrease substantially. This condition reflects that the PSO algorithm can optimize the hyperparameter space effectively and obtain lower errors at every iteration. The CatBoost model is observed to have declining MSE values at a fast and consistent rate. A better rate was observed in the first 5 iterations, beyond which both particle differences lowered, and the global best score became constant. The same pattern was observed for the Gradient Boosting algorithm, although particle differences were greater at the initial steps. Scores converged towards each other within approximately 10 iterations. The Random Forest and Extra Trees models both exhibited the same trend of an accelerating convergence pattern, steep drops in MSE values from the first 3–5 iterations. This indicates that the two models had both learned through hyperparameter tuning in the early stages. However, the subsequent stabilization of errors shows that the models reached a limited learning capacity and that further iterations did not provide marginal gains. In the AdaBoost model, the initial MSE values are higher, and there is greater variance among the particles. However, as iterations progress, errors decrease significantly, indicating that AdaBoost can be significantly improved through optimization. Overall, these graphs demonstrate that the PSO algorithm is a powerful tool for minimizing model error by providing effective optimization in the hyperparameter settings of regression models. Additionally, when comparing models, CatBoost and Gradient Boosting algorithms stand out in the optimization process by achieving lower and more stable MSE values. This study visually highlights the critical impact of hyperparameter optimization on model performance.
In this study, the outputs of the mathematical model developed for the prediction of personalized breast radiation dose are evaluated comparatively with different machine learning algorithms. Five different regression algorithms (CatBoost, Gradient Boosting, Extra Trees, AdaBoost and Random Forest) were trained with PSO-based hyperparameter optimization and the generalizability and error levels of the model were analyzed in detail. The results show that the CatBoost algorithm is superior in all performance metrics (MSE, MAE, MAPE, R2) and successfully follows time series trends. The Gradient Boosting model performed close to CatBoost and proved to be a strong alternative. The error scatter plots confirm that these two models are free of systematic bias and operate with low variance. The PSO process significantly improved the accuracy of the models and clearly demonstrated the contribution of optimization to model performance. Models such as AdaBoost and Random Forest produced less consistent results, with higher error rates, especially at extreme values. Therefore, in areas that require precision, such as personal dose estimation, the combination of gradient-based methods and meta-heuristic optimization techniques provides more reliable results. This approach provides an important basis for the development of individualized decision support systems in the medical field.
3.2. Interpretation of Prediction Results Using XAI
In this section, the LIME (Local Interpretable Model-Agnostic Explanations) method was applied to evaluate the explainability of the regression-based machine learning model developed to calculate breast-specific radiation dose in thoracic CT examinations. The aim was to clearly identify which clinical or demographic characteristics influence the radiation dose prediction for individual patient-specific breast organ samples and to what extent they influence it.
When examining the LIME visualization in
Figure 6, the model prediction for the relevant sample is significantly influenced by the weight variable represented by the condition “Weight > 0.49.” The value of this variable in the sample is 85.00, contributing a high positive value of +3.300 units to the model prediction. This indicates that the patient’s body weight is an important factor affecting imaging parameters in radiological examinations. It is known that increasing the radiation dose may be necessary to maintain imaging quality in individuals with higher body mass, and the model’s learning in this direction is consistent with clinical reality.
On the other hand, the TotalDLP variable (example value: 206.00), evaluated in the range ‘−0.29 < TotalDLP ≤ 0.22’, contributes negatively by reducing the estimate by −1.352 units. This effect may reflect the impact of dose history from previously administered imaging sessions or automatic dosimetric optimization systems. Similarly, the height variable, which is 1.65 m under the condition ‘Height > 0.73’, reduces the estimated value by −0.787 units, indicating that the model evaluates this anthropometric variable in a balancing manner. In contrast, the breast tissue quality (Bt) and age variables, expressed as ‘Bt > 0.61’ and ‘−0.05 < Age ≤ 0.80’, contribute +1.058 and +0.643 units to the model prediction, respectively. This situation indicates that factors such as breast density and age may be effective in determining radiological protocols and supports the model’s successful learning of this clinical information.
The explanation in
Figure 7 shows that the model’s prediction in this example is heavily influenced by negative contributions. In particular, the TotalDLP value, which is measured at 117.60 units in the example and is included in the condition ‘TotalDLP ≤ −0.54’, has been the variable that most strongly reduces the model prediction by −6.357 units. This indicates that the total radiation level predicted by the model may be limited in examples with lower initial doses. Similarly, the patient’s age (Age ≤ −0.80, 49.00 years) and body weight (Weight ≤ −0.65, 56.00 kg) are also among the other important factors that reduce the model prediction by −2.386 and −2.317 units, respectively. These variables reflect the impact of patient anthropometry on protocol selection in radiological applications. The model has successfully learned that radiation should be minimized in low-weight and younger patients, especially in sensitive areas such as the breast. In contrast, the height of 1.55 m under the condition ‘Height ≤ −0.71’ increased the model’s prediction by +1.061 units, and the breast tissue quality (Bt) of 57.00 under the condition ‘Bt > 0.61’ increased it by +1.051 units, acting as variables that positively contributed to the prediction. This finding suggests that there may be a possibility of dose increase due to imaging difficulties in shorter individuals and that breast structure characteristics are related to dosimetric requirements.
The strongest negative contribution to the model estimate shown in
Figure 8 comes from the Total DLP (total dose length multiplier) variable, which is measured at 170.00 units and falls within the range “−0.54 < TotalDLP ≤ −0.29” (−3.501 units). This indicates that the total dose level previously administered or recorded in the system has a reducing effect on the additional dose to be administered in a new scan. Such model behaviors are shaped in a manner consistent with clinical protocols aimed at limiting patients’ cumulative radiation exposure. Similarly, the body weight (‘Weight ≤ −0.65’, 63.00 kg) and breast tissue density (Bt ≤ −0.57, 39.00) variables also contributed negative effects of −2.412 and −1.015 units, respectively. This suggests that lower radiation doses may be sufficient, particularly in individuals with lower body mass and breast tissue with lower density. The age variable, evaluated under the condition “−0.05 < Age ≤ 0.80” and measured as 61.00 years, increased the model estimate by +1.227 units, while the height variable, measured as 1.53 m under the condition “Height ≤ −0.71”, increased the estimate by +0.996 units, thereby positively influencing the estimate. This situation may reflect the possibility of modifying protocol parameters to improve image quality in older and shorter patients.