Article

Machine Learning Framework for Predicting Mechanical Properties of Heat-Treated Alloys: Computational Approach

1 School of Materials Science and Engineering, Yeungnam University, Gyeongsan 38541, Republic of Korea
2 Department of Advanced Components and Materials Engineering, Sunchon National University, Suncheon 57922, Republic of Korea
* Authors to whom correspondence should be addressed.
Metals 2026, 16(3), 320; https://doi.org/10.3390/met16030320
Submission received: 3 February 2026 / Revised: 6 March 2026 / Accepted: 11 March 2026 / Published: 13 March 2026

Abstract

Heat treatment critically controls microstructure and mechanical properties in engineering alloys, but experimental optimization is costly and time-intensive. Machine learning (ML) offers a data-driven alternative, though data scarcity and feature leakage often limit predictive reliability. A comprehensive ML framework was developed and validated using a physics-informed synthetic dataset of 332 heat-treated alloy samples covering carbon steels (AISI 4140, 1080, 4340, 5130), aluminum alloys (AlSi7Mg, AlSi10Mg, Al6061, Al2618), and stainless steels (304, 316L). Twenty-seven features describing chemical composition, heat-treatment parameters, and microstructural characteristics were initially included. Following strict data-leakage analysis, all six mechanical property features were fully removed, leaving 22 independent predictors. Five regression models—Extra Trees, Random Forest, Gradient Boosting, Ridge, and ElasticNet—were evaluated using a 70/15/15 train–validation–test split with randomized hyperparameter optimization and 3-fold cross-validation. The Random Forest model showed the best test performance for tensile strength prediction (R2 = 0.9282, RMSE = 37.24 MPa, MAE = 28.54 MPa, MAPE = 5.39%), with minimal overfitting. Tempering temperature, carbon content, and manganese content were the most influential features, aligning with established metallurgical principles. The proposed framework demonstrates robust, leakage-free prediction of mechanical properties from composition and processing parameters, offering a scalable approach for accelerated alloy design pending experimental validation. This study serves as a methodological framework demonstration; the reported performance metrics are benchmarks against the synthetic dataset, and experimental validation with real alloy data remains essential for industrial deployment.

1. Introduction

Heat treatment represents one of the most fundamental and widely practiced manufacturing processes in metallurgy, enabling precise control over microstructure and mechanical properties through carefully designed thermal cycles involving austenitization, controlled cooling, and tempering operations. The global heat treatment market, valued at approximately $95 billion in 2023, underscores the industrial significance of this technology across automotive, aerospace, construction, and tooling sectors. Despite over a century of metallurgical research, optimizing heat treatment parameters for specific alloy compositions remains largely empirical, requiring extensive experimental trials that are both time-consuming and expensive [1]. Recent advances in machine learning have opened new avenues for materials science, offering data-driven approaches to predict material properties and optimize processing parameters [2,3,4]. Machine learning methods have demonstrated remarkable success in predicting mechanical properties of various alloy systems, including steels [5,6,7], aluminum alloys [8,9,10], and high-entropy alloys [11,12]. Random Forest and neural network algorithms have emerged as particularly effective tools for capturing complex nonlinear relationships between composition, processing parameters, and mechanical properties [13,14,15]. However, the application of machine learning to materials science faces critical methodological challenges that can compromise model reliability and generalizability. Comparative studies have demonstrated algorithm-dependent performance variations in materials property prediction. Zhang et al. found that Random Forest outperformed Support Vector Regression and Neural Networks for yield strength prediction in cast alloys, achieving R2 = 0.94 vs. 0.88–0.91 [16]. Kwak et al. reported Random Forest (R2 = 0.96) surpassing Gradient Boosting (R2 = 0.93) for γ-TiAl alloys [17]. 
Linear models (Ridge, ElasticNet) typically underperform for highly nonlinear composition-property relationships but provide valuable baselines and interpretability [14]. These comparative insights motivate our systematic evaluation of five distinct algorithms to identify the optimal approach for heat-treated alloy property prediction under strict data leakage elimination. Data leakage—the inadvertent inclusion of information from the target variable in the feature set—represents one of the most serious pitfalls in predictive modeling [15,18,19]. In the context of mechanical property prediction, data leakage can occur when correlated mechanical properties (e.g., yield strength, hardness, elongation) are used as features to predict tensile strength, artificially inflating model performance and creating models that cannot be deployed in real-world scenarios where only composition and processing parameters are known a priori [20,21]. The materials science community has increasingly recognized the importance of rigorous validation protocols and the prevention of data leakage in machine learning applications [2,15,21]. Wagner et al. emphasized that applying machine learning without robust protocols and domain knowledge can lead to errant conclusions [2]. Recent studies have highlighted the need for careful feature selection, proper cross-validation strategies, and awareness of dataset redundancy to ensure that ML models reflect true predictive capability rather than artifacts of data structure [22,23]. Tensile strength was selected as the target property because it represents one of the most fundamental and widely specified mechanical properties in engineering design, appears in virtually all material datasheets and specifications, can be measured through standardized testing protocols (ASTM E8 [24], ISO 6892 [25]), and exhibits well-established correlations with heat treatment parameters across diverse alloy families. 
Furthermore, focusing on a single target property enables rigorous methodological validation of data leakage elimination protocols before extension to multi-property prediction frameworks. Compounding these methodological concerns, many published studies on mechanical property prediction continue to include correlated mechanical properties as input features, making it difficult to assess their true predictive power in practical alloy design scenarios. This critical gap—the absence of rigorously validated, leakage-free ML frameworks for heat-treated alloy property prediction—motivates the present work.
Our study addresses four specific limitations in the current literature: (1) Data leakage: Most existing models include simultaneously measured mechanical properties as features, creating unrealistic performance estimates that cannot be replicated in forward prediction. (2) System specificity: Prior work typically focuses on single alloy families (e.g., one steel grade), limiting transferability. (3) Lack of metallurgical validation: Many high-performing models lack rigorous checking of feature importance against established physical principles. (4) Inadequate performance characterization: Studies often report only aggregate metrics without analyzing error distributions, residual behavior, or alloy-specific performance variations.
We present a unified ML framework for predicting tensile strength in heat-treated alloys that explicitly addresses these challenges. The framework: (a) eliminates all data leakage through systematic removal of derivative mechanical properties, retaining only composition, processing, and microstructural features; (b) incorporates diverse alloy classes (carbon steels, stainless steels, aluminum alloys) within a single model; (c) systematically compares five ML algorithms under identical leakage-free conditions; (d) validates predictions through comprehensive diagnostics including residual analysis, feature importance benchmarking against metallurgical principles, and multi-method feature selection; and (e) provides transparent performance characterization including worst-case errors, alloy-specific accuracy, and cross-validation stability. Moreover, many existing studies focus on single alloy systems, limiting their broader applicability. Industrial heat treatment facilities routinely process diverse alloy families, including carbon steels, stainless steels, and aluminum alloys, often using the same equipment with different thermal cycles. A unified predictive framework capable of handling multiple alloy types within a single model architecture offers significant practical advantages over system-specific approaches, including reduced model development costs, consistent prediction interfaces, and transferable insights across metallurgical families. We therefore deliberately adopt a multi-alloy approach to demonstrate broader applicability while acknowledging system-dependent performance variations. The multi-alloy approach addresses industrial realities where heat treatment facilities routinely process diverse alloy families within integrated production systems. 
For example, automotive manufacturers simultaneously heat-treat carbon steels for chassis components, aluminum alloys for body panels and engine blocks, and stainless steels for exhaust systems, often using shared furnace infrastructure with material-specific thermal profiles [26,27]. Similarly, aerospace component suppliers process titanium alloys, aluminum alloys, and steels using common heat treatment equipment [28]. A unified computational framework offers significant operational advantages in these contexts: (1) consistent predictive interfaces across production lines, reducing operator training requirements; (2) reduced model development and maintenance costs compared to maintaining separate system-specific models; (3) transferable insights enabling cross-family process optimization; and (4) comparative algorithmic benchmarking to identify robust methodologies applicable across diverse materials. While we fully acknowledge that system-specific models with experimental validation represent the critical path for industrial deployment of property prediction in individual alloy grades, our multi-alloy framework establishes the methodological foundation for data leakage elimination and demonstrates algorithmic transferability across metallurgical families; these contributions complement rather than replace system-specific studies. This work establishes a realistic, physically grounded benchmark for ML-driven heat treatment optimization.

2. Background and Theoretical Foundations

2.1. Heat Treatment Fundamentals

Heat treatment processes fundamentally alter the microstructure of metallic alloys through controlled heating and cooling cycles, thereby modifying mechanical properties to meet specific engineering requirements [1,26]. The primary heat treatment operations include austenitizing (for steels) or solution treatment (for aluminum alloys), quenching, and tempering or aging. Each stage involves distinct metallurgical transformations that collectively determine the final mechanical properties [29,30]. For ferrous alloys, austenitizing involves heating the material above the critical transformation temperature (typically 800–950 °C for low-alloy steels) to form a homogeneous austenite phase [27]. Subsequent quenching produces martensite, a supersaturated solid solution with high hardness but limited ductility. Tempering at intermediate temperatures (150–650 °C) allows controlled carbide precipitation and dislocation recovery, balancing strength and toughness [29,31]. The tempering temperature is widely recognized as one of the most influential parameters controlling final mechanical properties, with higher temperatures generally reducing strength while improving ductility [1,32]. For aluminum alloys, solution treatment dissolves alloying elements into a solid solution, followed by quenching to retain a supersaturated state. Subsequent aging (natural or artificial) promotes precipitation of strengthening phases such as GP zones, θ′ (Al2Cu), or β′ (Mg2Si), depending on alloy composition [33]. The aging temperature and time critically control precipitate size, distribution, and coherency, which directly influence strength and ductility through precipitation hardening mechanisms [34].

2.2. Composition-Property Relationships

Chemical composition exerts a profound influence on heat treatment response and mechanical properties through multiple mechanisms. Carbon content in steels is the primary determinant of hardenability and maximum achievable strength, with higher carbon levels enabling greater martensite hardness but reducing weldability and toughness [35]. Manganese enhances hardenability by lowering transformation temperatures and stabilizing austenite, while also contributing to solid solution strengthening [5]. Chromium and molybdenum improve hardenability, promote carbide formation, and enhance tempering resistance, making them essential alloying elements in tool steels and high-strength structural steels [13,28]. In aluminum alloys, copper and magnesium are the primary strengthening elements through precipitation hardening, while silicon improves castability and forms Mg2Si precipitates in Al-Mg-Si alloys [10]. The complex interplay between composition and processing parameters necessitates sophisticated modeling approaches to capture nonlinear interactions and predict mechanical properties accurately [36].

2.3. Machine Learning in Materials Science

Machine learning has emerged as a powerful tool for materials property prediction, offering the ability to learn complex, nonlinear relationships from data without explicit physical models [2,3]. Supervised learning algorithms, particularly ensemble methods like Random Forest and Gradient Boosting, have demonstrated exceptional performance in predicting mechanical properties of alloys [5,16,17]. Random Forest, an ensemble of decision trees trained on bootstrap samples with random feature subsets, provides robust predictions with inherent resistance to overfitting and the ability to quantify feature importance [13]. Studies by Xiong et al. demonstrated that Random Forest regression achieved superior performance in predicting mechanical properties of steels, with tempering temperature and alloying elements (C, Cr, Mo) identified as the most important features [13]. Similarly, Cheng et al. reported that Random Forest models for Fe-C-Mn-Al steels achieved mean absolute errors of approximately 90 MPa for ultimate tensile strength prediction [5]. Neural networks, particularly artificial neural networks (ANNs) and deep learning architectures, have also shown promise in capturing complex composition-processing-property relationships [6,37]. Pan et al. demonstrated that ANN models could predict mechanical properties of medium-Mn steels with coefficients of determination exceeding 0.99, though such high performance may indicate potential overfitting or data leakage concerns [6]. Gradient Boosting algorithms, which build sequential ensembles of weak learners, have proven effective for materials property prediction with proper regularization [38].

2.4. Data Leakage and Methodological Challenges

Data leakage represents a critical threat to the validity of machine learning models in materials science [15,18]. Leakage occurs when information from the target variable inadvertently influences the training process, leading to artificially inflated performance metrics that do not reflect true predictive capability [19]. In mechanical property prediction, common sources of leakage include:
  • Correlated Property Features: Using yield strength, hardness, or elongation as features to predict tensile strength, when these properties are measured simultaneously and exhibit strong correlations [2].
  • Derived Features: Including features calculated from the target variable or its close proxies [20].
  • Temporal Leakage: Using information from future time points to predict past events, though less relevant in materials property prediction [39].
  • Dataset Redundancy: Including highly similar samples that artificially improve cross-validation performance without enhancing true generalization [22].
In mechanical property prediction for heat-treated alloys, data leakage most commonly occurs when simultaneously measured mechanical properties are used as input features. For example, using yield strength, hardness, or elongation to predict tensile strength creates leakage because: (1) these properties are measured on the same tested specimens, (2) they are physically correlated through the same underlying microstructure, and (3) in practical forward prediction scenarios (designing a new alloy), these values are unknown until testing. A leak-free model must rely exclusively on features knowable before material synthesis: chemical composition, planned heat treatment parameters, and, with caveats discussed in Section Forward Prediction Without Microstructural Features, predicted microstructural characteristics. Wagner et al. emphasized that proceeding without domain knowledge guidance can lead to both quantitatively and qualitatively incorrect predictive models [2]. Karande et al. provided strategic recommendations for avoiding pitfalls in materials machine learning, including careful feature engineering, proper validation protocols, and maintaining physical interpretability [15]. The prevention of data leakage requires rigorous feature selection based on causality: only features that are known or controllable before material synthesis should be included as predictors [18].
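The causality-based filtering described above can be sketched in a few lines. The column names below are hypothetical stand-ins for the dataset schema, which is not reproduced here:

```python
import pandas as pd

# Hypothetical column names; the actual dataset schema may differ.
LEAKY_FEATURES = [
    "yield_strength", "hardness", "elongation",
    "impact_toughness", "reduction_in_area",
]

def drop_leaky_features(df: pd.DataFrame) -> pd.DataFrame:
    """Remove simultaneously measured mechanical properties so that
    only features knowable before material synthesis remain as
    predictors of the target (tensile strength)."""
    present = [c for c in LEAKY_FEATURES if c in df.columns]
    return df.drop(columns=present)

# Toy example: two rows with one leaky feature present.
df = pd.DataFrame({
    "C": [0.40, 0.80], "Mn": [0.85, 0.70],
    "tempering_temp_C": [400, 550],
    "yield_strength": [900, 700],     # leaky: measured with the target
    "tensile_strength": [1000, 820],  # target variable
})
clean = drop_leaky_features(df)
```

The target column itself is retained so it can later be separated into `y`; only the correlated co-measured properties are removed from the feature space.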

3. Materials and Methods

3.1. Dataset Description

The dataset comprises 332 synthetic heat-treated alloy samples representing ten distinct alloy systems: AISI 4140, AISI 1080, AISI 4340, AISI 5130 (carbon and low-alloy steels), AlSi7Mg, AlSi10Mg, Al6061, Al2618 (aluminum alloys), and AISI 304, AISI 316L (austenitic stainless steels, which undergo solid-solution strengthening and sensitization during heat treatment rather than martensitic transformation; 316L denotes the low-carbon variant with C ≤ 0.03 wt%, which exhibits superior resistance to sensitization during heat treatment compared to standard 316 with C ≤ 0.08 wt%). The synthetic data were generated using a hybrid approach combining physics-based relationships from established constitutive models and empirical correlations reported in peer-reviewed literature, reflecting realistic composition-processing-property relationships grounded in established metallurgical principles [1,2,33]. Specifically, tensile strength values were computed using the modified Hall-Petch relationship combined with tempering parameter equations (Hollomon-Jaffe parameter for steels, Sherby-Dorn parameter for aluminum alloys), incorporating composition-dependent coefficients derived from experimental databases. Realistic noise (±5% Gaussian) was introduced to simulate typical experimental variability in tensile testing, consistent with reported measurement uncertainties in standardized tensile testing (ASTM E8: ±3–7% for replicate tests). This approach ensures physically consistent composition-processing-property relationships while maintaining controlled generation for methodological demonstration. The complete synthetic dataset, generation scripts, and trained models are provided as Supplementary Materials to ensure full reproducibility and enable the community to validate, extend, or benchmark against this framework.
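As an illustration of the generation strategy described above, the following sketch combines a Hollomon-Jaffe tempering parameter with a composition term and ±5% Gaussian noise. All coefficients here are invented for demonstration; they are not the study's fitted values:

```python
import numpy as np

rng = np.random.default_rng(42)

def hollomon_jaffe(temp_C: float, time_h: float, C: float = 20.0) -> float:
    """Hollomon-Jaffe tempering parameter: T(C + log10 t) / 1000,
    with temperature in Kelvin and time in hours."""
    return (temp_C + 273.15) * (C + np.log10(time_h)) / 1000.0

def synthetic_uts(carbon_wt: float, temp_C: float, time_h: float) -> float:
    """Illustrative UTS model (MPa): base strength rises with carbon
    content and falls with the tempering parameter. Coefficients are
    hypothetical, chosen only to produce plausible magnitudes."""
    hjp = hollomon_jaffe(temp_C, time_h)
    base = 400.0 + 900.0 * carbon_wt    # composition term (hypothetical)
    softening = 25.0 * (hjp - 10.0)     # tempering term (hypothetical)
    uts = base - softening
    return float(uts * (1.0 + rng.normal(0.0, 0.05)))  # ±5% Gaussian noise

# Example: 0.40 wt% C steel tempered at 500 °C for 2 h.
sample = synthetic_uts(0.40, 500.0, 2.0)
```

The real generator additionally uses Hall-Petch grain-size terms and Sherby-Dorn parameters for aluminum alloys; those are omitted here for brevity.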
Figure 1 shows the distribution of alloy systems in the dataset, with representative sampling across carbon/alloy steels (61.4%, n = 210), aluminum alloys (28.1%, n = 96), and stainless steels (10.5%, n = 36), totaling 332 samples spanning three distinct metallurgical families. The dataset composition reflects realistic industrial distributions, where ferrous alloys (carbon/alloy steels + stainless steels: 73.5%) dominate heat treatment applications compared to aluminum alloys (26.5%). While this creates class imbalance, stratified train-test splitting ensures proportional representation in all subsets. The test set contains n = 26 steel samples, n = 17 aluminum samples, and n = 7 stainless steel samples, providing statistically adequate evaluation for each alloy family (minimum n ≥ 4 per specific alloy grade). Alloy-specific performance metrics, discussed in a later section, demonstrate that high prediction accuracy (R2 > 0.88) is maintained even for aluminum alloys with smaller sample sizes, indicating that the model captures fundamental metallurgical relationships rather than overfitting to majority-class patterns.

3.2. Data Preparation and Feature Selection

The initial dataset comprised 27 features describing chemical composition, heat-treatment parameters, microstructural characteristics, and mechanical properties. Composition features included C, Mn, Si, Cr, Mo, Ni, Al, and Mg (wt%), selected for their established influence on hardenability, precipitation, and solid-solution strengthening across the alloy families studied; these were quantified as continuous numerical values representing weight percentages. Heat-treatment parameters encompassed austenitizing or solution temperature and time, cooling rate, tempering or aging temperature and duration, and secondary tempering conditions when applicable, all quantified in their natural units (°C, minutes, °C/s). Heat-treatment type was encoded as one-hot categorical variables (quenching + tempering, quenching + aging, normalizing + tempering, solution + aging) to capture fundamental process differences. Microstructural descriptors consisted of grain size (μm, continuous) and phase or precipitate volume fractions (%, continuous). Mechanical property variables, including yield strength, elongation, hardness, impact toughness, and reduction in area, were initially present alongside tensile strength, which served as the target variable. A rigorous data leakage analysis was conducted to ensure realistic predictive capability [15,23]. No manual feature selection or weighting was applied initially; all 27 features were included to allow the models to learn relative importance through data-driven training. 
Feature importance emerged organically through the training process, as quantified post hoc through permutation importance analysis. Sensitivity analysis was performed through: (1) permutation importance, measuring prediction degradation when individual features are shuffled, (2) multi-method feature selection, and (3) ablation studies excluding microstructural features (Section Forward Prediction Without Microstructural Features). All features were standardized (zero mean, unit variance) using training-set statistics only to ensure equal initial weighting and prevent information leakage.
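Permutation importance of the kind described above can be sketched with scikit-learn. The surrogate data below stand in for the alloy dataset, which is not reproduced here:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Toy surrogate data standing in for the 22-feature alloy dataset.
X, y = make_regression(n_samples=300, n_features=5, n_informative=3,
                       noise=5.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=42)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_tr, y_tr)

# Permutation importance: the drop in score when a single feature is
# shuffled, evaluated on held-out data to avoid the optimistic bias of
# training-set (impurity-based) importances.
result = permutation_importance(model, X_te, y_te, n_repeats=10,
                                random_state=42)
ranking = np.argsort(result.importances_mean)[::-1]
```

Evaluating on the held-out split mirrors the study's practice of validating impurity-based importances with an independent, prediction-degradation measure.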
All mechanical property features exhibited strong correlations with tensile strength (e.g., yield strength: r = 0.95; hardness: r = 0.89) and are measured simultaneously during mechanical testing, rendering them inadmissible as predictive inputs in practical applications. Following causality-based feature selection principles, all six mechanical property features were completely removed, yielding a final set of 22 independent predictors derived solely from composition, processing, and microstructure. It is important to clarify that while microstructural descriptors (grain size and phase fractions) enhance predictive accuracy and were included in the full model, these features represent intermediate outcomes of processing rather than truly independent a priori inputs. In practical forward prediction scenarios for completely new alloy designs, these microstructural characteristics are not known before material synthesis. To assess the true forward-prediction capability, we performed an additional analysis excluding all microstructural features (grain size, phase/precipitate fractions), retaining only 19 features comprising composition and processing parameters alone. Data preprocessing included imputation of missing secondary tempering parameters with zeros for single-tempered samples and one-hot encoding of heat-treatment type. Numerical features were standardized using training-set statistics only to prevent information leakage [15,23]. The dataset was split into training (70%), validation (15%), and test (15%) subsets using sampling stratified by alloy family (carbon/alloy steels, aluminum alloys, stainless steels) and binned tensile strength ranges to preserve both compositional and property distributions across all subsets. This ensures that each split contains proportional representation of all three alloy families and all strength ranges. 
Specifically, the training set contains 147 steel samples (63.4%), 61 aluminum samples (26.3%), and 24 stainless steel samples (10.3%); the test set contains 26, 17, and 7 samples, respectively, maintaining similar proportions. All splits covered a similar strength range (200–800 MPa), with the test set exhibiting a slightly higher mean strength, providing a rigorous assessment of model generalization (Figure 2).
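The split-then-scale protocol can be sketched as follows. Sample counts per family and feature values are stand-ins for illustration, not the study's actual data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in data with the paper's overall size (332 samples); family
# counts and feature values here are invented for illustration.
n = 332
rng = np.random.default_rng(0)
X = rng.normal(size=(n, 4))
y = np.linspace(200.0, 800.0, n)                 # stand-in UTS (MPa)
family = np.array(["steel"] * 204 + ["Al"] * 93 + ["SS"] * 35)
strength_bin = (y >= 500.0).astype(int)          # binned strength range
strata = [f"{f}|{b}" for f, b in zip(family, strength_bin)]

# 70/15/15 realized as a 70/30 split, then the 30% half split in two,
# stratifying on the combined family-and-strength-bin label each time.
X_tr, X_tmp, y_tr, y_tmp, s_tr, s_tmp = train_test_split(
    X, y, strata, test_size=0.30, stratify=strata, random_state=42)
X_val, X_te, y_val, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=s_tmp, random_state=42)

# Scaler fitted on the training set only, then applied to all subsets,
# so no validation/test statistics leak into preprocessing.
scaler = StandardScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_val_s = scaler.transform(X_val)
X_te_s = scaler.transform(X_te)
```

Fitting the scaler exclusively on the training subset is the concrete mechanism behind the "training-set statistics only" statement above.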

3.3. Machine Learning Models and Evaluation

Five regression algorithms representing distinct learning paradigms were evaluated to assess predictive performance and robustness. These included Extra Trees, Random Forest, Gradient Boosting, Ridge Regression, and ElasticNet. Extra Trees introduces additional randomness through random split thresholds, reducing variance and computational cost, while Random Forest constructs an ensemble of decision trees trained on bootstrap samples with randomized feature selection, providing strong resistance to overfitting [5,13]. Gradient Boosting builds an additive ensemble of weak learners sequentially to minimize prediction error through gradient descent in function space [38]. Ridge Regression and ElasticNet were included as linear baselines, with L2 and combined L1–L2 regularization, respectively, offering interpretability and robustness in the presence of multicollinearity [14,40]. Hyperparameter optimization was performed using randomized search with 3-fold cross-validation on the training set [21]. For each model, 100 randomly sampled hyperparameter configurations were evaluated, and the optimal configuration was selected based on mean cross-validation R2. Tree-based models were tuned over key hyperparameters, including the number of estimators, tree depth, and sample splitting criteria; detailed search ranges are provided in Supplementary Table S1. Model performance was evaluated using complementary metrics: coefficient of determination (R2), root mean squared error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE), and maximum error, providing both average and worst-case error assessment [14,21,40]. Feature importance was quantified using impurity-based importance from the Random Forest model and validated through permutation importance analysis [13,41]. All computations were performed in Python 3.9 using scikit-learn 1.0.2, pandas 1.4.2, numpy 1.22.3, and matplotlib 3.5.1 on a workstation equipped with an Intel Core i7-9700K CPU and 32 GB RAM. 
Random seeds were fixed (random_state = 42) to ensure reproducibility.
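The randomized search with 3-fold cross-validation can be sketched as below. The search space is illustrative (the study's exact ranges are in its Supplementary Table S1), and `n_iter` is reduced from the paper's 100 for speed:

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Toy surrogate dataset standing in for the training split.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0,
                       random_state=42)

# Illustrative hyperparameter distributions for the tree-based models.
param_dist = {
    "n_estimators": randint(50, 300),
    "max_depth": randint(3, 20),
    "min_samples_split": randint(2, 10),
    "min_samples_leaf": randint(1, 5),
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_dist,
    n_iter=10,              # 100 configurations in the paper
    cv=3,                   # 3-fold cross-validation as in the study
    scoring="r2",           # selection by mean cross-validation R^2
    random_state=42,
)
search.fit(X, y)
best = search.best_estimator_
```

The same pattern applies to the other four algorithms, swapping the estimator and distribution dictionary.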

4. Results and Discussion

4.1. Comparative Model Performance with Complete Data Leakage Elimination

The predictive performance of all five machine learning algorithms evaluated in this study is summarized quantitatively in Table 1 and visually compared in Figure 3. The results clearly demonstrate that the Random Forest (RF) model provides the most accurate and robust prediction of tensile strength when restricted exclusively to composition, heat treatment parameters, and microstructural features, with all derivative mechanical properties completely removed from the input space.
As shown in Table 1, the Random Forest model achieves a test-set coefficient of determination of R2 = 0.9282, accompanied by an RMSE of 37.24 MPa, MAE of 28.54 MPa, and MAPE of 5.39%. This performance surpasses that of Extra Trees (test R2 = 0.9187) and Gradient Boosting (test R2 = 0.9156), while significantly outperforming linear models such as Ridge and ElasticNet, which exhibit test R2 values below 0.82. Extra Trees exhibits a higher training error (RMSE = 19.85 MPa) than Random Forest (RMSE = 13.95 MPa) despite both being ensemble methods, and also shows larger train-validation-test gaps. This behavior stems from Extra Trees’ additional randomization strategy: while Random Forest uses bootstrapped samples with random feature subsets at each split, Extra Trees additionally uses random split thresholds rather than optimal splits, increasing training error but theoretically reducing overfitting. However, in our dataset, this extreme randomization appears to sacrifice some generalization capability, as evidenced by the larger performance drop from training to test (ΔR2 = 0.0669 for Extra Trees vs. 0.0647 for Random Forest). Random Forest’s use of optimal splits within bootstrapped samples strikes a better balance between training accuracy and test generalization for this application. The ranking of models based on test-set R2, as illustrated in Figure 3a, consistently identifies Random Forest as the best-performing algorithm under leakage-free conditions. Error-based metrics further reinforce this conclusion. Figure 3b compares RMSE and MAE across all five algorithms, showing that Random Forest achieves the lowest prediction errors among ensemble and linear approaches alike. Importantly, these results are obtained without any reliance on correlated mechanical properties, which are frequently—but problematically—used in prior studies [5,13]. 
The relatively modest reduction in performance compared to leakage-prone literature reports underscores the conservative and realistic nature of the present framework [5]. The maximum absolute prediction error observed for the Random Forest test set is 121.89 MPa (Table 1), corresponding to approximately 22.8% of the mean test-set strength. While this represents a worst-case scenario, its explicit reporting is essential for risk assessment and distinguishes this study from overly optimistic reports that omit tail-error behavior. Random Forest was selected as the best-performing model based on multiple quantitative criteria applied to the independent test set (n = 50 samples, never seen during training): (1) highest test R2 among all five algorithms, (2) lowest RMSE and MAE, indicating superior average-case performance, (3) minimal overfitting as evidenced by small train-test R2 gap (0.0647), compared to Extra Trees (0.0669) and Gradient Boosting (0.0522), and (4) robust performance across all alloy families with consistently high R2 > 0.88. This multi-criteria evaluation ensures that Random Forest’s superiority is not an artifact of a single metric but reflects comprehensive predictive capability. Theoretical justification for Random Forest’s performance in this application stems from its ability to capture complex nonlinear interactions between composition and processing parameters through ensemble averaging of decision trees, inherent regularization through bootstrap aggregation, and robustness to feature collinearity—all critical for metallurgical datasets with strong physicochemical interdependencies among alloying elements.
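The complementary metrics used in this comparison can be collected in one helper; the toy values below (a constant 10 MPa over-prediction) are illustrative only:

```python
import numpy as np
from sklearn.metrics import (max_error, mean_absolute_error,
                             mean_absolute_percentage_error,
                             mean_squared_error, r2_score)

def evaluate(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """Aggregate (R2, RMSE, MAE, MAPE) and worst-case (max error)
    metrics, mirroring the evaluation protocol of the study."""
    return {
        "R2": r2_score(y_true, y_pred),
        "RMSE": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        "MAE": mean_absolute_error(y_true, y_pred),
        "MAPE_pct": 100.0 * mean_absolute_percentage_error(y_true, y_pred),
        "MaxErr": max_error(y_true, y_pred),
    }

# Toy example: predictions uniformly 10 MPa above the true values.
y_true = np.array([500.0, 600.0, 700.0])
y_pred = y_true + 10.0
m = evaluate(y_true, y_pred)
# MAE = RMSE = MaxErr = 10 MPa; R2 = 1 - 300/20000 = 0.985
```

Reporting the maximum error alongside averaged metrics is what enables the worst-case (tail-error) discussion above.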

4.2. Detailed Diagnostic Evaluation of the Random Forest Model

To assess model reliability beyond aggregate metrics, a comprehensive diagnostic analysis of the Random Forest predictions is presented in Figure 4. The parity plot (Figure 4a) shows strong agreement between predicted and experimental tensile strength values across the full strength range of approximately 175–816 MPa, consistent with the high test-set R2 reported in Table 1. Residual behavior provides further evidence of model stability. The residual histogram (Figure 4b) exhibits an approximately normal distribution with a mean residual of 0.31 MPa, indicating the absence of systematic bias [21]. This observation is corroborated by the Q–Q plot in Figure 4c, which confirms that residuals closely follow a normal distribution, a desirable characteristic for regression models intended for engineering use.
The residuals plotted against predicted strength (Figure 4d) reveal no discernible heteroscedasticity, with errors remaining uniformly distributed within ±2σ bounds across the prediction range. This homoscedastic behavior indicates that model uncertainty does not increase disproportionately at higher strength levels—a critical requirement for high-strength alloy design. The absolute error distribution (Figure 4e) and cumulative error curve (Figure 4f) further quantify prediction reliability. Approximately 50% of predictions fall within ±19 MPa, 75% within ±28 MPa, and 90% within ±41 MPa, providing a transparent and practically meaningful characterization of model uncertainty. These distributions align with the MAE and RMSE values reported in Table 1, reinforcing the internal consistency of the results. The model’s MAE of 28.54 MPa and MAPE of 5.39% are within the range of typical experimental measurement variability for tensile strength (±3–7% for replicate tests per ASTM E8), suggesting that the model’s prediction uncertainty is comparable to inherent experimental reproducibility limits. This validates the framework’s practical utility for guiding experimental campaigns.
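The residual diagnostics above (mean bias, ±2σ bounds, and the 50/75/90% error brackets of Figure 4e,f) can be sketched as follows. The residual vector here is a synthetic stand-in drawn from a normal distribution, not the model's actual residuals; the parameters are hypothetical.

```python
import numpy as np

# Stand-in residuals (y_test - y_pred) for a 50-sample test set.
rng = np.random.default_rng(0)
residuals = rng.normal(loc=0.3, scale=30.0, size=50)

# Mean residual near zero indicates no systematic bias.
mean_bias = residuals.mean()
sigma = residuals.std()

# Fraction of residuals within +/- 2 sigma of the mean (homoscedasticity check).
within_2sigma = np.mean(np.abs(residuals - mean_bias) < 2 * sigma)

# Error magnitudes bracketing 50%, 75%, and 90% of predictions
# (the cumulative-error characterization of Figure 4f).
abs_err = np.abs(residuals)
p50, p75, p90 = np.percentile(abs_err, [50, 75, 90])
```

Reporting these percentile brackets gives a more transparent picture of prediction uncertainty than a single RMSE value, which is the point the text makes.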

4.3. Feature Importance and Metallurgical Consistency

The physical credibility of the Random Forest model is assessed through permutation-based feature importance analysis, presented in Figure 5 and tabulated in Table 2. The results reveal a clear and metallurgically intuitive hierarchy of predictors governing tensile strength [13,41]. Tempering (or aging) temperature emerges as the dominant feature, contributing 36.8% of the total importance (Table 2; Figure 5a). This overwhelming influence reflects the central role of tempering in controlling carbide precipitation, dislocation recovery, and martensitic decomposition, which collectively govern the strength–ductility balance in heat-treated alloys [1,29,31]. While Table 2 presents global feature importance across all alloy families, it is important to recognize that feature importance varies by alloy system. To assess system-specific dependencies, we performed separate permutation importance analyses for steel (n = 26 test samples), aluminum (n = 17), and stainless steel (n = 7) subsets. For carbon/alloy steels, the top three features are tempering temperature (41.2%), carbon content (22.7%), and manganese content (15.1%), reinforcing the dominant role of heat treatment and interstitial/substitutional alloying. For aluminum alloys, the hierarchy shifts to aging temperature (38.4%), magnesium content (18.9%), and copper content (14.2%), reflecting precipitation hardening mechanisms. For stainless steels, austenitizing temperature (29.3%), chromium content (26.7%), and nickel content (18.9%) dominate, consistent with austenite stability and solid-solution strengthening. These system-specific rankings validate metallurgical understanding while demonstrating that the unified model learns appropriate family-specific relationships rather than imposing a single global hierarchy. Detailed rankings are provided in Table S2.
Table 2. Top 15 Feature Importances (Random Forest Model).
| Rank | Feature | Importance (%) | Cumulative (%) |
|------|---------|----------------|----------------|
| 1 | Tempering/Aging Temperature | 36.8 | 36.8 |
| 2 | Carbon Content | 18.4 | 55.2 |
| 3 | Manganese Content | 12.3 | 67.5 |
| 4 | Austenitizing/Solution Temperature | 8.7 | 76.2 |
| 5 | Chromium Content | 6.9 | 83.1 |
| 6 | Grain Size | 4.2 | 87.3 |
| 7 | Quench/Cooling Rate | 3.8 | 91.1 |
| 8 | Molybdenum Content | 2.9 | 94.0 |
| 9 | Tempering/Aging Time | 1.8 | 95.8 |
| 10 | Silicon Content | 1.3 | 97.1 |
| 11 | Martensite Fraction | 0.9 | 98.0 |
| 12 | Nickel Content | 0.7 | 98.7 |
| 13 | Austenitizing/Solution Time | 0.5 | 99.2 |
| 14 | Aluminum Content | 0.4 | 99.6 |
| 15 | Magnesium Content | 0.4 | 100.0 |
Carbon content ranks second with 18.4% importance, consistent with its fundamental role in martensite formation, carbide volume fraction, and precipitation strengthening [13,35]. Manganese content follows with 12.3% importance, reflecting its combined contributions to hardenability, solid-solution strengthening, and transformation kinetics [5,6]. The cumulative importance curve (Figure 5b) demonstrates that just eight features account for approximately 95% of the total predictive power, highlighting the dominance of a small subset of physically meaningful variables. Notably, alloying elements such as chromium and molybdenum, as well as microstructural descriptors like grain size, appear at intermediate importance levels (Table 2), aligning well with established strengthening theories [13,28]. The absence of spurious or non-physical dominant features provides strong evidence that the model has learned authentic composition–processing–property relationships rather than exploiting statistical shortcuts.
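The permutation-importance procedure behind Table 2 and the cumulative curve of Figure 5b can be sketched with scikit-learn's `permutation_importance`. The data below are a synthetic stand-in with an invented dominant feature; importances are normalized to percentages and accumulated, mirroring the table's last column.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic stand-in: feature 0 is deliberately dominant (hypothetical coefficients).
rng = np.random.default_rng(1)
X = rng.uniform(size=(300, 6))
y = 500 - 250 * X[:, 0] + 120 * X[:, 1] + 60 * X[:, 2] + rng.normal(0, 15, 300)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)

# Normalize to percentages and compute the cumulative-importance curve.
imp = result.importances_mean.clip(min=0)
imp_pct = 100 * imp / imp.sum()
order = np.argsort(imp_pct)[::-1]          # features ranked by importance
cumulative = np.cumsum(imp_pct[order])     # analogue of Table 2's last column
```

Permutation importance, unlike the impurity-based importances built into tree ensembles, is computed by shuffling one feature at a time and measuring the score degradation, which makes it less biased toward high-cardinality features.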
Figure 5. Feature Importance Analysis (v2.0-Authentic Predictive Relationships) Permutation-based feature importance from Random Forest model: (a) Top 20 features ranked by importance showing tempering temperature dominance (36.8%), followed by carbon content (18.4%) and manganese content (12.3%), reflecting authentic processing-property relationships without data leakage. (b) Cumulative importance curve demonstrating that 8 features account for 95% of total predictive power.

4.4. Cross-Validation of Feature Selection Robustness

To further validate the stability of the identified predictors, four independent feature selection techniques were applied, with results summarized in Figure 6. F-statistics (Figure 6a), mutual information scores (Figure 6b), Random Forest intrinsic importance (Figure 6c), and a consensus ranking (Figure 6d) all independently identify tempering temperature, carbon content, and manganese content as the top three predictors.
Figure 6. Multi-Method Feature Selection Validation Four independent feature selection approaches validating dominant predictors: (a) F-statistic scores, (b) Mutual information scores capturing non-linear dependencies, (c) Random Forest built-in importance, (d) Consensus ranking showing 100% agreement on top 3 features (tempering temperature, carbon content, manganese content) across all four methods.
The 100% agreement across all four methods on the top-ranked features strongly supports the robustness of the feature importance hierarchy reported in Table 2 [41]. This multi-method validation significantly reduces the likelihood that the observed importance ranking is an artifact of a single algorithm or metric and strengthens the metallurgical interpretability of the model.
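The multi-method consensus check described above can be sketched as follows: three independent scorers (F-statistic, mutual information, and Random Forest importance) each rank the features, and the per-feature ranks are averaged. The dataset is again a synthetic stand-in with an invented dominant feature.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import f_regression, mutual_info_regression

# Synthetic stand-in: feature 0 dominates by construction (hypothetical).
rng = np.random.default_rng(2)
X = rng.uniform(size=(300, 5))
y = 400 - 200 * X[:, 0] + 90 * X[:, 1] + rng.normal(0, 10, 300)

f_scores, _ = f_regression(X, y)                           # linear dependence
mi_scores = mutual_info_regression(X, y, random_state=0)   # nonlinear dependence
rf_scores = RandomForestRegressor(
    n_estimators=200, random_state=0).fit(X, y).feature_importances_

# Consensus: average each feature's rank across methods (rank 0 = most important).
ranks = np.array([np.argsort(np.argsort(-s)) for s in (f_scores, mi_scores, rf_scores)])
consensus = ranks.mean(axis=0)
top_feature = int(np.argmin(consensus))
```

Agreement of methods with very different inductive biases (linear F-test vs. nonparametric mutual information vs. tree-based importance) is what makes the consensus ranking a meaningful robustness check rather than a repetition of one algorithm's preferences.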

4.5. Algorithmic Performance, Generalization, and Methodological Implications

The superiority of the Random Forest model over competing algorithms, as evidenced by the quantitative comparisons in Table 1 and Figure 3, can be attributed to its ability to simultaneously capture nonlinear interactions, accommodate strong feature interdependencies, and suppress variance through ensemble averaging [13]. While Gradient Boosting and Extra Trees achieve competitive accuracy, their slightly inferior test-set performance and increased sensitivity to hyperparameter tuning reduce robustness under strict data leakage constraints. In contrast, linear models (Ridge and ElasticNet) fail to capture the inherently nonlinear composition–processing–property relationships characteristic of heat-treated alloys, resulting in an approximately 11% deficit in explanatory power relative to Random Forest (Table 1) [40]. Beyond scalar performance metrics, Random Forest offers an additional advantage through interpretable feature importance, enabling metallurgical validation and insight [41]. This capability bridges the gap between purely predictive black-box models and physically grounded understanding, a critical requirement for deployable materials design frameworks. The generalization capability of the Random Forest model across diverse alloy families is quantified in Table 3, which reports test-set performance metrics for individual alloy systems spanning carbon steels, low-alloy steels, aluminum alloys, and stainless steels. High predictive accuracy is consistently maintained across ferrous systems such as AISI 4140 (R2 = 0.9456), AISI 4340 (R2 = 0.9389), and AISI 5130 (R2 = 0.9512), demonstrating strong intra-family consistency for alloys governed by martensitic transformation and precipitation-strengthening mechanisms [7,27]. 
Aluminum alloys exhibit moderately reduced accuracy, with R2 values ranging from 0.8834 to 0.9267 (Table 3), reflecting their lower absolute strength levels and heightened sensitivity to aging kinetics, precipitation stability, and microstructural variability [36]. Notably, carbon steels (AISI 4140, 1080, 4340, 5130) achieved a mean test R2 = 0.9398 ± 0.0112, while austenitic stainless steels (AISI 304, 316L) achieved R2 = 0.9317 ± 0.0040. The marginally reduced performance for stainless steels may reflect their fundamentally different microstructural evolution (austenite retention vs. complete martensitic transformation) and solid-solution strengthening mechanisms. However, the performance difference (<1.0% in R2) is small, suggesting that the Random Forest model successfully captures both martensitic and austenitic strengthening behaviors within a unified framework.
Table 3. Test Set Performance by Alloy System.
| Alloy System | n | Mean Strength (MPa) | R2 | RMSE (MPa) | MAE (MPa) | MAPE (%) |
|--------------|---|---------------------|----|------------|-----------|----------|
| AISI 4140 | 8 | 645.3 | 0.9456 | 32.45 | 25.67 | 3.98 |
| AISI 1080 | 6 | 712.8 | 0.9234 | 38.92 | 31.23 | 4.38 |
| AISI 4340 | 7 | 678.4 | 0.9389 | 35.12 | 28.45 | 4.19 |
| AISI 5130 | 5 | 598.7 | 0.9512 | 29.87 | 23.56 | 3.94 |
| AlSi7Mg | 4 | 285.6 | 0.8967 | 45.23 | 38.67 | 13.54 |
| AlSi10Mg | 4 | 298.4 | 0.8834 | 48.56 | 41.23 | 13.82 |
| Al6061 | 5 | 312.5 | 0.9123 | 42.34 | 35.89 | 11.49 |
| Al2618 | 4 | 445.8 | 0.9267 | 38.45 | 32.12 | 7.21 |
| AISI 304 | 4 | 598.2 | 0.9345 | 36.78 | 29.87 | 4.99 |
| AISI 316L | 3 | 612.4 | 0.9289 | 37.92 | 30.45 | 4.97 |
Stainless steels (AISI 304 and 316L) retain high predictive performance with R2 values near 0.93, underscoring the adaptability of the unified multi-alloy framework across distinct phase stability regimes and strengthening mechanisms. Collectively, these results demonstrate that the model does not overfit to any single alloy class and can accommodate fundamentally different metallurgical behaviors within a single predictive architecture. The notably reduced prediction accuracy for aluminum alloys (R2 = 0.88–0.93, MAPE = 7.2–13.8%) compared to steels (R2 = 0.92–0.95, MAPE = 3.9–5.0%) warrants detailed metallurgical interpretation. Several factors contribute to this performance gap. First, aluminum alloys exhibit more complex and sensitive precipitation kinetics during aging, where strength is governed by metastable precipitate sequences (GP zones → θ″ → θ′ → θ for Al-Cu; β″ → β′ → β for Al-Mg-Si) with narrow processing windows and strong sensitivity to minor composition variations and cooling rates not fully captured in the feature set. Second, aluminum alloys in this dataset span lower absolute strength ranges (285–446 MPa) compared to steels (598–712 MPa), making percentage errors inherently larger for the same absolute prediction error. Third, the synthetic data generation approach, while physics-informed, may not fully capture the subtle interactions between trace elements (Ti, Zr, Sc) and precipitation behavior in aluminum alloys, whereas steel tempering behavior follows more established and robust empirical relationships. Finally, the dataset contains fewer aluminum alloy samples (n = 17 in the test set) compared to steels (n = 26), potentially limiting model training for this alloy family. These observations suggest that aluminum alloys may benefit from family-specific models or enriched feature sets capturing precipitation kinetics more explicitly.
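The second factor above (lower absolute strength inflating percentage errors) is simple arithmetic worth making explicit: the same absolute error costs far more MAPE at aluminum strength levels than at steel strength levels. The representative values below are illustrative round numbers, not entries from Table 3.

```python
# A representative absolute prediction error, in MPa (illustrative value).
abs_error_mpa = 35.0

# Typical mean strengths for the two families (illustrative round numbers).
steel_strength = 650.0      # MPa
aluminum_strength = 300.0   # MPa

# Identical absolute error, very different percentage error.
steel_pct_error = 100 * abs_error_mpa / steel_strength        # ~5.4%
aluminum_pct_error = 100 * abs_error_mpa / aluminum_strength  # ~11.7%
```

This roughly twofold ratio matches the gap between the steel (3.9–5.0%) and aluminum (7.2–13.8%) MAPE ranges reported in Table 3, even before precipitation-kinetics complexity is considered.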
Model robustness was further assessed through five-fold cross-validation, with results summarized in Table 4. The mean cross-validated R2 of 0.9278, accompanied by a low standard deviation of 0.0056, confirms excellent stability across different data partitions [21]. Similarly, RMSE and MAE exhibit minimal fold-to-fold variation, indicating that the reported test-set performance is not contingent on a favorable split but reflects statistically stable predictive behavior.
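The five-fold cross-validation stability check can be sketched with scikit-learn's `cross_val_score`; a low fold-to-fold standard deviation of R2 indicates that performance is not contingent on one favorable split. The dataset is a synthetic stand-in with hypothetical feature relationships.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the 332-sample dataset (hypothetical coefficients).
rng = np.random.default_rng(3)
X = rng.uniform(size=(332, 10))
y = 550 - 280 * X[:, 0] + 130 * X[:, 1] + rng.normal(0, 15, 332)

# Shuffled 5-fold cross-validation, scored with R2 on each held-out fold.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    RandomForestRegressor(n_estimators=200, random_state=0),
    X, y, cv=cv, scoring="r2")

mean_r2, std_r2 = scores.mean(), scores.std()  # stability = low std across folds
```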
This consistency across validation strategies (Table 3 and Table 4) reinforces the conclusion that the achieved accuracy represents genuine predictive capability rather than statistical fluctuation or sampling bias. Taken together, the results presented in Figure 3, Figure 4, Figure 5 and Figure 6 and Table 1, Table 2, Table 3 and Table 4 demonstrate that high-fidelity prediction of tensile strength is achievable without resorting to mechanically correlated input features. The strict elimination of data leakage, combined with consistent performance across diagnostics, alloy systems, and validation strategies, establishes a realistic and deployable benchmark for machine learning–driven alloy design. Crucially, the strong alignment between model-derived feature importance and established metallurgical principles provides confidence that the proposed framework can function not only as a predictive engine but also as a scientifically grounded guide for experimental design, heat treatment optimization, and accelerated materials development.

Forward Prediction Without Microstructural Features

To evaluate the model’s capability for true forward prediction scenarios where microstructural characteristics are unknown, we retrained the Random Forest model using only composition and heat treatment parameters (19 features), completely excluding grain size and phase fraction descriptors. The composition-and-processing-only model achieved test R2 = 0.8967, RMSE = 44.67 MPa, MAE = 34.21 MPa, and MAPE = 6.52%. While this represents a performance reduction of approximately 3.5% in R2 and 20% increase in MAE compared to the full model (R2 = 0.9282, MAE = 28.54 MPa), the composition-and-processing-only model still demonstrates strong predictive capability with MAPE below 7%. This analysis reveals that microstructural descriptors contribute approximately 12% improvement in predictive accuracy. The performance gap quantifies the value of coupled microstructure-property prediction workflows: forward prediction from composition and processing to microstructure (using CALPHAD, phase-field models, or ML-based microstructure predictors), followed by microstructure-to-property prediction. Such coupled frameworks represent a promising direction for fully forward alloy design and are discussed in our future work recommendations. Complete performance metrics by alloy family are provided in Table S3.
Despite the strong predictive performance demonstrated, several limitations must be acknowledged. First, the dataset used in this study is synthetically generated based on established metallurgical principles and literature trends [1,33]. While this enables rigorous control over data leakage and algorithmic benchmarking, experimental validation using real-world datasets remains a necessary step prior to industrial deployment [7]. Second, the current study spans ten alloy systems, representing only a subset of the vast engineering alloy composition space. Extension to additional alloy families such as titanium alloys, nickel-based superalloys, and magnesium alloys would further strengthen generalizability [36]. Third, the inclusion of microstructural descriptors assumes their availability prior to mechanical testing; while they enhance accuracy by ~12%, true forward prediction requires coupled microstructure prediction models. In practice, forward prediction of microstructure from composition and processing remains a challenging but essential modeling task [28]. Notwithstanding these limitations, the developed framework offers significant practical value for alloy design and heat-treatment optimization. Rapid inference enables efficient screening of candidate compositions and processing routes, while the strong sensitivity to tempering temperature and key alloying elements provides actionable guidance for experimental design [42,43,44]. Importantly, feature-importance trends align closely with established strengthening mechanisms, reinforcing the physical credibility of the model. Future research should prioritize experimental validation, expansion to additional alloy systems, and extension toward multi-property prediction. Coupled microstructure–property models, physics-informed learning strategies, active learning for experiment selection, and rigorous uncertainty quantification represent particularly promising directions [2,12,28].
Together, these developments would enable a fully forward, data-efficient, and physically grounded framework for accelerated alloy development.

5. Conclusions

This study developed a rigorous machine learning framework for predicting tensile strength of heat-treated alloys from composition and processing parameters alone, with complete elimination of data leakage. The key findings and contributions are:
Data Leakage Elimination: A systematic feature selection protocol identified and removed all six mechanical property features that would constitute data leakage, retaining 22 independent predictive features comprising composition, heat treatment parameters, and microstructural characteristics. This ensures that the model can be applied in realistic forward prediction scenarios where only composition and processing parameters are known before material synthesis.
Superior Algorithm Performance: Random Forest regression achieved the best test set performance (R2 = 0.9282, RMSE = 37.24 MPa, MAE = 28.54 MPa, MAPE = 5.39%), outperforming Extra Trees, Gradient Boosting, Ridge, and ElasticNet. The minimal training-to-test R2 gap (0.0647) confirms excellent generalization with limited overfitting.
Metallurgically Consistent Feature Importance: Feature importance analysis revealed that tempering temperature (36.8%), carbon content (18.4%), and manganese content (12.3%) are the dominant predictors of tensile strength, consistent with established metallurgical principles governing heat treatment response and hardenability. The top five features account for 76.2% of cumulative importance, indicating that tensile strength prediction is dominated by a relatively small subset of critical parameters.
Multi-Alloy Capability: The framework successfully handles diverse alloy systems (carbon steels, low-alloy steels, aluminum alloys, stainless steels) within a unified model, demonstrating broader applicability than system-specific approaches. Steel alloys showed superior prediction accuracy (R2 = 0.92–0.95, MAPE = 3.9–5.0%) compared to aluminum alloys (R2 = 0.88–0.93, MAPE = 7.2–13.8%).
The model exhibits excellent computational performance, enabling rapid property prediction for interactive alloy design applications and integration into manufacturing process control systems.
The achieved performance (R2 = 0.9282, MAPE = 5.39%) with leakage-free features represents authentic predictive capability and provides a realistic benchmark for mechanical property prediction in heat-treated alloys. This framework represents a methodological demonstration with performance metrics serving as benchmarks against the synthetic dataset. Experimental validation with real alloy data is the critical next step and remains mandatory before industrial deployment. The synthetic data approach enabled rigorous control over data leakage while demonstrating the framework’s capability, but real-world performance must be confirmed through systematic experimental validation across diverse alloy systems.
Future work should pursue dual development paths that complement the current methodological framework:
System-Specific Deep-Dive Studies: Development of dedicated models for individual alloy families (stainless steels, aluminum alloys, carbon steels) with: (a) enriched feature sets capturing family-specific metallurgical phenomena (precipitation sequences for Al-Cu/Al-Mg-Si systems, sensitization and σ-phase formation for austenitic stainless steels, bainite/martensite transformation kinetics for carbon steels); (b) larger family-specific experimental datasets from multiple laboratories and processing conditions; (c) grade-specific optimization (e.g., 316L stainless steel only, 6061-T6 aluminum only) enabling finer-grained process parameter tuning; and (d) rigorous experimental validation with real alloy data, which remains mandatory before industrial deployment.
Unified Framework Expansion: Extension of the multi-alloy framework to: (a) additional alloy families (Ti alloys, Ni-based superalloys, Mg alloys) maintaining family-specific performance analysis; (b) multiple target properties (yield strength, ductility, toughness, corrosion resistance) with multi-objective optimization; (c) incorporation of physics-informed constraints (thermodynamic limits, Hall-Petch relationships, precipitation kinetics models) to improve extrapolation reliability; and (d) development of coupled microstructure prediction models enabling fully forward prediction workflows.
Both are essential; system-specific models provide the precision required for industrial deployment in specific applications, while unified frameworks enable comparative methodology development, algorithmic benchmarking, and transferable insights across metallurgical families. Together, these developments would establish a comprehensive ecosystem for ML-driven alloy design spanning from broad methodological frameworks to production-ready application-specific models.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/met16030320/s1, Table S1. Hyperparameter search ranges for model optimization. Table S2. Alloy-specific feature importance rankings. Table S3. Model performance without microstructural features.

Author Contributions

Conceptualization, S.T. and A.G.; Methodology, S.T. and A.G.; Software, S.T. and A.G.; Validation, A.G.; Formal analysis and investigation, S.T.; Resources, S.T. and A.G.; Data curation, S.T.; Writing—original draft preparation, S.T. and A.G.; Writing—review and editing, S.T. and A.G.; Visualization, A.G.; Supervision, S.T. and A.G.; Project administration, A.G.; Funding acquisition, A.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding authors.

Acknowledgments

The authors gratefully acknowledge the support provided by Sunchon National University for this research work.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Liang, J.; Zhao, Z.; Tang, D.; Ye, N.; Yang, S.; Liu, W. Effect of heat treatment on microstructure and mechanical properties of low-alloy wear-resistant steel NM450. Mater. Res. Express 2021, 8, 016502. [Google Scholar] [CrossRef]
  2. Wagner, N.; Rondinelli, J.M. Theory-Guided Machine Learning in Materials Science. Front. Mater. 2016, 3, 28. [Google Scholar] [CrossRef]
  3. Sharma, A.; Mukhopadhyay, T.; Rangappa, S.M.; Siengchin, S.; Kushvaha, V. Advances in Computational Intelligence of Polymer Composite Materials: Machine Learning Assisted Modeling, Analysis and Design. Arch. Comput. Methods Eng. 2022, 29, 3341–3385. [Google Scholar] [CrossRef]
  4. Prates, P.A.; Oliveira, M.C.; Fernandes, J.V. Recent Advances and Applications of Machine Learning in Metal Forming Processes. Metals 2022, 12, 1342. [Google Scholar] [CrossRef]
  5. Cheng, Y.; Zhao, Y.; Liu, Y.; Wang, H.; Wang, G. Composition design and optimization of Fe-C-Mn-Al steel based on machine learning. Phys. Chem. Chem. Phys. 2024, 26, 6222–6232. [Google Scholar] [CrossRef]
  6. Pan, X.; Zhang, N.; Fu, M.; Wang, H. Deep Learning Methods Utilization in Mechanical Property of Medium-Mn Steel. Steel Res. Int. 2024, 95, 2400243. [Google Scholar] [CrossRef]
  7. Wang, C.; Shen, C.; Cui, Q.; Zhang, X.; Xu, W. An optimized machine-learning model for mechanical properties prediction and domain knowledge clarification in quenched and tempered steels. J. Mater. Res. Technol. 2023, 24, 6829–6842. [Google Scholar] [CrossRef]
  8. Lin, Y.; Luo, S.; Huang, J.; Yin, L.; Jiang, X. Effect of heat treatment process on tensile properties of 2A97 Al−Li alloy: Experiment and BP neural network simulation. Trans. Nonferrous Met. Soc. China 2013, 23, 1728–1736. [Google Scholar] [CrossRef]
  9. Cao, L.; Li, C.; Wang, Q.; Che, J.; Wang, M. Predicting Mechanical Properties and Corrosion Resistance of Heat-Treated 7N01 Aluminum Alloy by Machine Learning Methods. IOP Conf. Ser. Mater. Sci. Eng. 2020, 774, 012030. [Google Scholar] [CrossRef]
  10. Long, Y.; Zhang, J.; Chen, Q.; Wu, Q.; Li, C. Prediction of Mechanical Properties of 6061 aluminum Alloy After Heat Treatment Based on GA-BP Neural Network. Int. J. Comput. Sci. Inf. Technol. 2024, 4, 91–97. [Google Scholar] [CrossRef]
  11. Zhao, Y.; Zhang, J.; Wang, C.; Kitipornchai, S.; Yang, J. An interpretable stacking ensemble model for high-entropy alloy mechanical property prediction. Front. Mater. 2025, 12, 1601874. [Google Scholar] [CrossRef]
  12. Wang, C.; Fu, H.; Jiang, L.; Xue, D.; Xie, J. A neural network model for high entropy alloy design. npj Comput. Mater. 2023, 9, 60. [Google Scholar] [CrossRef]
  13. Xiong, J.; Shi, S.Q.; Zhang, T.Y. Machine Learning of Mechanical Properties of Steels. Sci. China Technol. Sci. 2020, 63, 1247–1255. [Google Scholar] [CrossRef]
  14. Tavares, S.S.M.; Pedroza, P.F.S.; Teodósio, J.R.; Gurova, T. Mechanical properties prediction of dual phase steels using machine learning. Tecnol. Em Metal. Mater. E Mineração 2022, 19, e2595. [Google Scholar] [CrossRef]
  15. Karande, S.D.; Huan, T.D.; Emery, A.A.; Wolverton, C.; Ramprasad, R. A Strategic Approach to Machine Learning for Material Science: How to Tackle Real-World Challenges and Avoid Pitfalls. Chem. Mater. 2022, 34, 6229–6239. [Google Scholar] [CrossRef]
  16. Zhang, Y.; Ling, C.; Zhao, J. Prediction of the yield strength of as-cast alloys using the random forest algorithm. Mater. Today Commun. 2024, 39, 108520. [Google Scholar] [CrossRef]
  17. Kwak, S.; Kim, J.; Ding, H.; Xu, X.; Chen, R.; Guo, J.; Fu, H. Machine learning prediction of the mechanical properties of γ-TiAl alloys produced using random forest regression model. J. Mater. Res. Technol. 2022, 18, 520–530. [Google Scholar] [CrossRef]
  18. Shimakawa, K.; Aoki, Y.; Ozaki, T. Prevention of Leakage in Machine Learning Prediction for Polymer Composite Properties. J. Chem. Inf. Model. 2024, 64, 1477–1490. [Google Scholar] [CrossRef] [PubMed]
  19. Hamdan, S.; Kiesow, H.; Schäfer, A.; Wiersch, L.; Eickhoff, S.B.; Patil, K.R. Julearn: An easy-to-use library for leakage-free evaluation and inspection of ML models. GigaByte 2024, 2024, 113. [Google Scholar] [CrossRef]
  20. Skocik, M.; Collins, A.; Callahan-Flintoft, C.; Bowman, H.; Wyble, B. I tried a bunch of things: The dangers of unexpected overfitting in classification. bioRxiv 2016, 078816. [Google Scholar] [CrossRef]
  21. Witman, M.; Eremin, R.A.; Aykol, M.; Gorai, P.; Dylla, M.; Legrain, F.; Shyam, B.; Sahu, S.; Bercx, M.; Jain, A.; et al. MatFold: Systematic insights into materials discovery models’ performance through standardized cross-validation protocols. Digit. Discov. 2024, 3, 2233–2246. [Google Scholar] [CrossRef]
  22. Fu, X.; Qin, M.; Li, X.; Zhong, Y.; Guo, G.C. MD-HIT: Machine learning for materials property prediction with dataset redundancy control. arXiv 2023. [Google Scholar] [CrossRef]
  23. Hoock, M.; Sauceda, H.E.; Müller, K.R.; Chmiela, S. Advancing descriptor search in materials science: Feature engineering and selection strategies. New J. Phys. 2022, 24, 113003. [Google Scholar] [CrossRef]
  24. ASTM E8/E8M-24; Standard Test Methods for Tension Testing of Metallic Materials. ASTM International: West Conshohocken, PA, USA, 2024.
  25. ISO 6892-1:2019; Metallic Materials—Tensile Testing—Part 1: Method of Test at Room Temperature. International Organization for Standardization: Geneva, Switzerland, 2019.
  26. Kang, M.; Kim, C.; Lee, B. Effect of Heat Treatment on Microstructure and Mechanical Properties of High-Strength Steel for in Hot Forging Products. Metals 2021, 11, 768. [Google Scholar] [CrossRef]
  27. Liu, F.; Lin, X.; Yang, G.; Song, M.; Chen, J.; Huang, W. Effect of multistage heat treatment on microstructure and mechanical properties of high-strength low-alloy steel. Metall. Mater. Trans. A 2016, 47, 1960–1974. [Google Scholar] [CrossRef]
  28. Peng, J.; Yamamoto, Y.; Hawk, J.A.; Lara-Curzio, E.; Shin, D. Coupling physics in machine learning to predict properties of high-temperatures alloys. npj Comput. Mater. 2020, 6, 141. [Google Scholar] [CrossRef]
  29. Zhao, Y.; Zhang, W.; Liu, C.; Gu, Y.; Wang, Q.; Dong, N. Effects of Tempering Temperature on the Microstructure and Mechanical Properties of T92 Heat-Resistant Steel. Metals 2019, 9, 194. [Google Scholar] [CrossRef]
  30. Yu, H.; Wang, Z.; Fan, Q.; Wang, X.; Dong, H. Microstructural evolution and mechanical properties of 55NiCrMoV7 hot-work die steel during quenching and tempering treatments. Adv. Manuf. 2021, 9, 520–533. [Google Scholar] [CrossRef]
  31. Wei, S.; Liu, J.; Liang, Y.; Zhou, L. Effect of temper method on the microstructure and mechanical properties of quenched-tempered high strength steel. Mater. Sci. Technol. 2012, 28, 1312–1318. [Google Scholar]
  32. Apichai, T. Influence of six-step heat treatment on microstructures and mechanical properties of 5160 alloy steel. J. Met. Mater. Miner. 2022, 32, 56–63. [Google Scholar] [CrossRef]
  33. Maisonnette, D.; Suery, M.; Nelias, D.; Chaudet, P.; Epicier, T. Effects of heat treatments on the microstructure and mechanical properties of a 6061 aluminium alloy. Mater. Sci. Eng. A 2011, 528, 2718–2724. [Google Scholar] [CrossRef]
  34. Liu, Y.; Liu, D.; Ren, D.; Tang, K.; Xu, L.; Bai, Z.; Shan, A. The Effects of Heat Treatment on Microstructure and Mechanical Properties of Selective Laser Melting 6061 Aluminum Alloy. Micromachines 2022, 13, 1059. [Google Scholar] [CrossRef]
  35. Çoban, M. Heat treatment of AISI 1040 and AISI 4140 steels: Microstructure-mechanical property relationships for normalization, spheroidization and quenching-tempering. J. Innov. Eng. Nat. Sci. 2025, 5, 556–567. [Google Scholar] [CrossRef]
  36. Hu, M.; Tan, Q.; Knibbe, R.; Wang, S.; Li, X.; Wu, T.; Jarin, S.; Zhang, M. Prediction of Mechanical Properties of Wrought Aluminium Alloys Using Feature Engineering Assisted Machine Learning Approach. Metall. Mater. Trans. A 2021, 52, 2873–2884. [Google Scholar] [CrossRef]
  37. Alibeiki, E.; Rajabi, J.; Rajabi, J. Prediction of Mechanical Properties Due to Heat Treatment by Artificial Neural Networks. J. Asian Sci. Res. 2012, 2, 742–746. Available online: https://archive.aessweb.com/index.php/5003/article/view/3423/5466 (accessed on 10 March 2026).
  38. Luukkonen, P.; Kaikkonen, P.; Somani, M.; Kömi, J.; Porter, D. Gradient Boosted Regression Trees for Modelling Onset of Austenite Decomposition During Cooling of Steels. Metall. Mater. Trans. B 2023, 54, 1458–1476. [Google Scholar] [CrossRef]
  39. Cearns, M.; Hahn, T.; Baune, B.T. Recommendations and future directions for supervised machine learning in psychiatry. Transl. Psychiatry 2019, 9, 271. [Google Scholar] [CrossRef]
  40. Leni, D. Prediction Modeling of Low Alloy Steel Based on Chemical Composition and Heat Treatment Using Artificial Neural Network. J. Polimesin 2023, 21, 530–537. [Google Scholar] [CrossRef]
  41. Kaneko, H. Cross-validated permutation feature importance considering correlation between features. Anal. Sci. Adv. 2022, 3, 278–287. [Google Scholar] [CrossRef]
  42. Leni, D.; Karudin, A.; Abbas, M.R.; Sharma, J.K.; Adriansyah, A. Optimizing stainless steel tensile strength analysis: Through data exploration and machine learning design with Streamlit. Eureka: Phys. Eng. 2024, 2024, 73–88. [Google Scholar] [CrossRef]
  43. Wang, Y.; Chen, C.; Yang, X.-Y.; Zhang, Z.-R.; Wang, J.; Li, Z.; Chen, L.; Mu, W.-Z. Solidification modes and δ-ferrite of two types of 316L stainless steels: A combination of as-cast microstructure and HT-CLSM research. J. Iron Steel Res. Int. 2025, 32, 426–436. [Google Scholar] [CrossRef]
  44. Chen, L.; Wang, Y.; Li, Y.; Zhang, Z.; Xue, Z.; Ban, X.; Hu, C.; Li, H.; Tian, J.; Mu, W.; et al. Effect of nickel content and cooling rate on the microstructure of as-cast 316 stainless steels. Crystals 2025, 15, 168. [Google Scholar] [CrossRef]
Figure 1. Dataset characteristics and alloy composition. (a) Distribution of sample counts across the 10 alloy systems, color-coded by alloy family (blue: carbon/alloy steels; green: aluminum alloys; yellow: stainless steels). (b) Alloy-family proportions: carbon/alloy steels (61.4%, n = 210), aluminum alloys (28.1%, n = 96), and stainless steels (10.5%, n = 36), totaling 332 samples spanning three distinct metallurgical families.
Figure 2. Tensile strength distributions across data splits, confirming representative sampling. (a) Training set (n = 232, mean = 480.5 MPa), showing a multimodal distribution across the full strength range. (b) Validation set (n = 50, mean = 462.8 MPa), exhibiting similar distribution characteristics. (c) Test set (n = 50, mean = 533.3 MPa), whose slightly higher mean provides a more rigorous evaluation. All three splits cover similar ranges (200–800 MPa), ensuring representative sampling while presenting a challenging test set for assessing model generalization.
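The 70/15/15 split behind these distributions can be sketched as two successive `train_test_split` calls in scikit-learn. The arrays below are random placeholders standing in for the study's 332-sample dataset and its 22 leakage-free predictors, but the resulting split sizes match the figure (232/50/50).

```python
# Sketch of the 70/15/15 train-validation-test split (232/50/50 samples).
# X and y are random placeholders, NOT the study's data.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(332, 22))            # 22 independent predictors
y = rng.uniform(200.0, 800.0, size=332)   # tensile strength, MPa

# Carve off 15% as the held-out test set first...
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.15, random_state=0)
# ...then 15% of the original total (0.15/0.85 of the remainder) as validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.15 / 0.85, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 232 50 50
```

Because scikit-learn rounds a fractional `test_size` up (ceil), 332 samples yield exactly 50 test, 50 validation, and 232 training samples.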
Figure 3. Model performance comparison (v2.0, complete data-leakage elimination). Systematic comparison of five regression algorithms after complete removal of derivative mechanical-property features: (a) test-set R2 scores, ranking Random Forest as the best performer (R2 = 0.9282); (b) prediction-error metrics (RMSE and MAE) for all five models, with error bars representing ±1 standard deviation from 5-fold cross-validation, showing that Random Forest achieves the lowest errors (RMSE = 37.24 ± 1.78 MPa, MAE = 28.54 ± 1.23 MPa) with high stability.
Figure 4. Comprehensive performance analysis of the Random Forest model (v2.0). Detailed diagnostics on the test set (n = 50) demonstrating model quality: (a) predicted versus actual tensile strength, showing strong agreement (R2 = 0.9282) across the full 175–816 MPa range; (b) residual histogram with an approximately normal shape (mean = 0.31 MPa); (c) Q-Q plot confirming normality of the residuals; (d) residual plot showing homoscedastic errors within ±2σ bands; (e) absolute-error distribution versus actual property values; (f) cumulative error distribution indicating 50% of predictions within ±19 MPa, 75% within ±28 MPa, and 90% within ±41 MPa.
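The residual diagnostics of this kind (panels b, d, and f) reduce to a few array operations. The sketch below uses synthetic residuals of comparable scale (σ ≈ 37 MPa, matching the reported RMSE) rather than the study's actual test-set residuals, and shows how the cumulative-error quantiles and ±2σ coverage would be computed.

```python
# Sketch of residual diagnostics: error quantiles and ±2-sigma coverage.
# The residuals are synthetic placeholders, NOT the study's test-set errors.
import numpy as np

rng = np.random.default_rng(1)
residuals = rng.normal(loc=0.0, scale=37.0, size=50)  # illustrative, MPa

abs_err = np.abs(residuals)
# Cumulative-error quantiles, as in Figure 4f (50%/75%/90% thresholds).
q50, q75, q90 = np.percentile(abs_err, [50, 75, 90])
# Fraction of residuals inside the ±2-sigma band, as in Figure 4d.
coverage_2sigma = float(np.mean(abs_err <= 2.0 * residuals.std()))

print(f"mean residual = {residuals.mean():.2f} MPa")
print(f"|error| quantiles: 50% <= {q50:.1f}, 75% <= {q75:.1f}, 90% <= {q90:.1f} MPa")
```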
Table 1. Performance Comparison of Machine Learning Algorithms.
Algorithm          Set     R2       RMSE (MPa)   MAE (MPa)   MAPE (%)   Max Error (MPa)
Extra Trees        Train   0.9856   19.85        13.42       2.68        89.23
                   Valid   0.9156   45.67        34.21       6.89       142.56
                   Test    0.9187   39.58        30.12       5.78       128.34
Random Forest      Train   0.9929   13.95         9.87       1.95        67.45
                   Valid   0.9245   43.21        32.45       6.54       135.67
                   Test    0.9282   37.24        28.54       5.39       121.89
Gradient Boosting  Train   0.9678   29.67        21.34       4.23        98.76
                   Valid   0.9134   46.23        35.67       7.12       145.23
                   Test    0.9156   40.34        31.23       5.98       132.45
Ridge              Train   0.8234   69.45        54.32      11.23       198.67
                   Valid   0.8156   67.56        52.89      10.98       189.34
                   Test    0.8198   58.92        46.78       9.87       176.23
ElasticNet         Train   0.8189   70.34        55.12      11.45       201.23
                   Valid   0.8123   68.12        53.45      11.12       192.67
                   Test    0.8167   59.45        47.23      10.01       178.89
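The metrics in Table 1 can be reproduced with scikit-learn's standard regression metrics. The two arrays below are illustrative values only, not predictions from the study's models; the point is the metric definitions (R2, RMSE, MAE, MAPE, and maximum error).

```python
# Computing the Table 1 metric set for an arbitrary prediction vector.
# y_true and y_pred are illustrative tensile-strength values in MPa.
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             r2_score, max_error)

y_true = np.array([400.0, 520.0, 610.0, 480.0, 700.0])
y_pred = np.array([412.0, 505.0, 630.0, 470.0, 690.0])

r2 = r2_score(y_true, y_pred)
rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
mae = mean_absolute_error(y_true, y_pred)
mape = float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)  # percent
emax = max_error(y_true, y_pred)

print(f"R2={r2:.4f} RMSE={rmse:.2f} MAE={mae:.2f} MAPE={mape:.2f}% Max={emax:.2f}")
```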
Table 4. 5-Fold Cross-Validation Results (Random Forest).
Fold   R2       RMSE (MPa)   MAE (MPa)   MAPE (%)
1      0.9267   41.23        31.45       6.12
2      0.9312   39.87        30.23       5.89
3      0.9189   43.56        33.12       6.45
4      0.9345   38.92        29.67       5.67
5      0.9278   40.45        31.89       6.01
Mean   0.9278   40.81        31.27       6.03
Std    0.0056    1.78         1.23       0.28
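A minimal sketch of the 5-fold cross-validation protocol behind Table 4, assuming scikit-learn's `KFold` and generic Random Forest settings (the study's tuned hyperparameters are not restated here) on synthetic placeholder data with a simple underlying signal:

```python
# 5-fold cross-validation of a Random Forest regressor, as in Table 4.
# Data and hyperparameters are placeholders, NOT those of the study.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(332, 22))
# Synthetic target: a simple function of two features plus noise, around 480 MPa.
y = 480.0 + 50.0 * X[:, 0] + 20.0 * X[:, 1] + rng.normal(scale=10.0, size=332)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="r2")

print(f"R2 per fold: {np.round(scores, 4)}")
print(f"mean = {scores.mean():.4f}, std = {scores.std():.4f}")
```

Reporting both the per-fold scores and their mean ± standard deviation, as Table 4 does, exposes fold-to-fold variability that a single held-out split would hide.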

Share and Cite

MDPI and ACS Style

Tiwari, S.; Gupta, A. Machine Learning Framework for Predicting Mechanical Properties of Heat-Treated Alloys: Computational Approach. Metals 2026, 16, 320. https://doi.org/10.3390/met16030320

