Article

Predicting the Cooling Rate in Steel-Part Heat Treatment via Random Forests

Graduate School of Advanced Science and Engineering, Hiroshima University, Higashi-Hiroshima 739-8527, Japan
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(21), 11676; https://doi.org/10.3390/app152111676
Submission received: 14 October 2025 / Revised: 28 October 2025 / Accepted: 30 October 2025 / Published: 31 October 2025

Abstract

Heat treatment is a thermal-processing method involving controlled heating and cooling cycles designed to achieve the desired properties of materials. Among these parameters, the cooling rate plays a crucial role, as it significantly influences the resulting material properties. In this paper, we investigated the feasibility of random forests in estimating the cooling-rate parameters for the steel-part heat treatment process. Random forests are particularly appealing because they build an ensemble of expressive decision trees from which the cooling rate can be estimated from interactions among the features of the metal parts. Our computational experiments using real-world data from industrial-scale operations demonstrated the advantageous properties of random forest regression models, particularly when combined with a random oversampling scheme. We also found that the chemical composition—specifically carbon and chromium content—as well as the weight of the steel parts, are key features that predict the cooling rate of steel parts. Furthermore, our validation using real-world cooling scenarios aligned closely with the practical insights of seasoned operators who routinely recommend cooling parameters for the metal-normalizing process. Our results highlight the effectiveness of the ensemble approach of random forest for practical applicability in industrial-scale heat treatment.

1. Introduction

Iron-steel metals are widely used in construction and structural applications, as well as in consumer products involving machines, cars, gears, shafts, bolts, and axles. Heat treatment is advantageous for metals by inducing phase changes and by modifying mechanical properties such as hardness, toughness, impact resistance, ductility, and corrosion resistance. Heat treatment is a thermal processing technique that tunes the microstructure of metals by subjecting them to controlled heating and cooling cycles, often without changing the overall shape. Heat treatment parameters such as the heating temperature, the holding time, and the cooling rate determine the final mechanical properties of metals. As such, different combinations of heat-treatment parameters can lead to a diverse set of mechanical properties.
The heat treatment of iron–steel metals is often conducted in carbon-emitting heating furnaces [1]. The steel parts are placed on trays, heated to a desired temperature (often above 900 °C), and then cooled in the furnace (annealing), in air (normalizing), or in water/oil (hardening), depending on the desired target properties [2]. Annealing cools metals slowly and produces soft, ductile, and workable properties. In contrast, hardening involves faster cooling, resulting in metals that are hard and strong but brittle. Normalizing falls between these two processes, with moderate cooling that produces balanced mechanical properties [2].
The cooling rate during the heat treatment of metals allows for tuning the type, size, and distribution of precipitates, thereby determining a balance between strength—when precipitates are fine and densely or evenly distributed to impede dislocation motion—and ductility, when precipitates are spaced, allowing for plastic deformation [3]. Concomitant with the above, the cooling rate in heat treatment determines the width of layers in layered structures [4], the width of lamellar structures within grains [5], the refinement of coarse, fully lamellar microstructures [6], the extent of continuous layers in grain boundaries [5], the precipitate formation (e.g., boundary vs. intragranular) and changes in the fracture mechanism (e.g., intergranular cracking vs. transgranular cracking) [7], the spacing between dendrites—the tree-like crystal structures that form when a metal cools [8], the grain boundaries, and the generation of dislocation sites in accordance with the Hall–Petch effect [9,10,11], the growth and distribution of austenite and corrosion resistance [12], the magnetic permeability of amorphous magnetic alloys [13], the dislocation density and microstrain formation [14], and the kinetics of atomic nucleation–diffusion in phase transformations [15].
Estimating the quantitative relationship between the cooling rate after heat exposure and the resulting microstructure in metal alloys has attracted significant attention in the manufacturing community due to its practical benefits. It is known that the microstructural properties of metals are influenced by both cooling rates and chemical compositions [16,17,18,19], yet the relationships are known to be nonlinear, making them well suited for machine learning and data-based approaches. Li et al. [20] predicted intermetallic morphology in recycled Al-Si-Cu alloys using machine learning based on cooling rates and chemical compositions. Gao et al. [18] established an exponential relationship between the critical cooling rate in U75V rail steel and its chemical composition, specifically based on carbon (C), manganese (Mn), silicon (Si), and vanadium (V). Geng et al. [21] found that not only cooling time but also chemical composition—specifically carbon (C), molybdenum (Mo), manganese (Mn), and nickel (Ni)—determined the hardness of low-alloy steel in welding applications. Afflerbach et al. [22] used features of the constituent chemical elements of bulk metallic glasses to predict the critical cooling rate for glass formation. Schultz et al. [23] used calculated features derived from chemical elements and molecular properties to predict the critical cooling rate of metallic glasses. Koç et al. [24] used neural networks to forecast cooling rates based on pressure, distance to sections, and the duration across the H-section profiles of steel beams S275/HEA120, S275/HEB120, and S275/HEB14. Liu et al. [25] proposed an equation to predict the critical cooling rate of U71Mn rail steel as a function of holding temperatures and austenitizing times, both of which are key parameters determining the microstructure and stability of austenite.
Rapid cooling rates refine metal microstructures, suppress atomic diffusion, reduce micropores and grain sizes, and lead to a homogeneous distribution of precipitates and higher corrosion resistance. However, overly rapid cooling may result in brittle microstructures, making metals prone to cracking and fracture; therefore, the cooling rate must be carefully controlled for each metal alloy. Although the above-mentioned studies explored the relationships between cooling rates, chemical composition, and other mechanical aspects of the heat treatment of metals, their applicability to the diverse metal alloys relevant to industrial-scale operations remains unclear. From a pragmatic standpoint, it is desirable to elucidate whether geometry, weight, and material play as significant a role as chemical composition in predicting the cooling rate in the heat treatment of industrial-scale iron–steel parts. This question has remained elusive in the related works. As such, we aim to forecast the cooling rate for industrial-scale steel parts involved in the operations of the Tsukimi Factory of Nagato Co., Ltd. (Hiroshima, Japan), an industrial plant that operates furnaces and cooling devices for the heat treatment of steel parts. In this setting, heat treatment is performed using specialized equipment (heating furnaces), and steel parts are heated in trays and then cooled in air via a cooling fan, as shown in Figure 1, or by immersing them in oil. Heat-treatment parameters such as heating temperatures and cooling rates are entered by the operator, with the cooling rate typically set based on the results of past, similar steel parts and the intuition of experienced operators. This practice often leads to issues with operator dependence, accuracy, and repeatability.
In this paper, to tackle the aforementioned issues, we investigate the feasibility of regression and classification models aided by the ensemble approach of random forests [26,27]. Our goal is to model decision trees that predict the cooling-rate parameter based on features such as chemical composition, material type, weight, and the geometry of industrial-scale iron–steel parts—information that is obtained for each part solely from daily plant operations. Additionally, we aim to analyze feature importance when decision trees successfully predict the cooling rate. Our computational experiments demonstrate the merits of regression-based random forest models, particularly when combined with an oversampling scheme. Furthermore, our validation using real-world cooling scenarios shows that the cooling parameters recommended by the random forest model consistently produced steel parts whose hardenability was within the standard acceptable range. Our findings highlight the practical merits of random forests for predicting cooling rates in the normalization of industrial-scale steel parts.
Of the following sections, Section 2 explains the learning mechanism of the random forest algorithm [26,27], Section 3 presents our computational experiments and results, and Section 4 concludes the study by summarizing the key findings.

2. Random Forest

In this section, we describe the key algorithm involved in constructing the trees within random forests. Let the training dataset be denoted as $X = \{x_0, x_1, \ldots, x_{n-1}\}$, where $n$ is the total number of samples in the dataset, and let $y = \{y_0, y_1, \ldots, y_{n-1}\}$ be the associated target values. Our goal is to find a tree-based mapping function, $\tau : X \to y$. In what follows, we review the major components of the random forest implementation from scikit-learn [26,27], which constructs the function $\tau$ from a collection of trees (also known as a forest).

2.1. Bootstrap Sample

Each tree, $T$, in a random forest in scikit-learn [26,27] is trained on a bootstrap sample, $X^*$, a subset of $X$, constructed as $X^* = \{x_{u_0}, x_{u_1}, \ldots, x_{u_{k-1}}\}$ and $y^* = \{y_{u_0}, y_{u_1}, \ldots, y_{u_{k-1}}\}$, where $u_i \in [0, n-1]$, $i \in [0, k-1]$, denotes the index of the selected sample. The bootstrap sample $X^*$ is constructed as follows: $m$ sample indices $I_0, I_1, \ldots, I_{m-1}$ are generated; each $I_j \in [0, n-1]$ is drawn independently and uniformly at random (sampling with replacement):
$$I_j \sim \mathrm{Uniform}(0,\, n-1), \qquad j = 0, \ldots, m-1. \tag{1}$$
As such, the set $U = \{u_0, u_1, \ldots, u_{k-1}\} = \{I_j : j = 0, \ldots, m-1\}$ corresponds to the unique (non-repeating) indices associated with the bootstrap samples $x_{u_0}, x_{u_1}, \ldots, x_{u_{k-1}}$. Then, for each $i = 0, \ldots, k-1$, the sample weight $w_i$ is the number of times the index $u_i$ appears among the generated indices:
$$w_i = \sum_{j=0}^{m-1} \mathbb{1}\{I_j = u_i\}, \qquad \text{where } \mathbb{1}\{I_j = u_i\} = \begin{cases} 1 & \text{if } I_j = u_i, \\ 0 & \text{otherwise.} \end{cases}$$
The above mechanism randomly selects (with replacement) $k$ samples from the original dataset; samples picked multiple times receive a higher weight and are therefore fitted more closely than lower-weight samples. This mechanism extends Breiman’s original bootstrap sampling mechanism [28] and bagging (bootstrap aggregating) [29], which learn multiple versions of trees on bootstrap subsets of the training (learning) set.
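As an illustration only (not part of the scikit-learn implementation), the bootstrap-with-weights mechanism can be sketched in a few lines of Python; the function and variable names below are our own.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def bootstrap_with_weights(n, m):
    """Draw m indices uniformly with replacement from [0, n-1] and return
    the unique indices u_i together with their multiplicities w_i."""
    indices = rng.integers(0, n, size=m)               # I_0, ..., I_{m-1}
    unique, counts = np.unique(indices, return_counts=True)
    return unique, counts                               # u_i and w_i

# Example: a bootstrap of the same size as a 10-sample training set.
u, w = bootstrap_with_weights(n=10, m=10)
print(u)  # unique sampled indices
print(w)  # number of times each index was drawn (sample weights)
```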

2.2. Tree Construction

The tree-construction algorithm in random forests follows a depth-first, stack-based routine (Algorithm 1) by default [26,27]. The inputs to the algorithm are the bootstrap sample $X^* = \{x_{u_0}, x_{u_1}, \ldots, x_{u_{k-1}}\}$, the target $y^* = \{y_{u_0}, y_{u_1}, \ldots, y_{u_{k-1}}\}$, and the associated sample weights, $w = (w_0, w_1, \ldots, w_{k-1})$. The output is the (binary) decision tree, $T$.
Instead of recursion, a stack is used to keep track of nodes to be processed. Each stack record corresponds to a node in the tree and contains information about the node’s data range (which is useful to compute regression metrics), tree depth, and parent information.
The root node is first pushed onto the stack, and the main loop pops nodes off the stack, splits them as needed, and pushes the resulting child nodes if they are not leaves. In what follows, we describe the key components of the tree construction routine within a regression framework.
Each node operates on a subset of the bootstrap sample, $X^*$, rendered from the data range $[s, e)$ as follows:
$$\text{Sample:}\;\; X^*_{[s,e)} = \{x_{u_s}, x_{u_{s+1}}, \ldots, x_{u_{e-1}}\}, \qquad \text{Target:}\;\; y^*_{[s,e)} = \{y_{u_s}, y_{u_{s+1}}, \ldots, y_{u_{e-1}}\}, \qquad \text{Sample weights:}\;\; w_{[s,e)} = (w_s, w_{s+1}, \ldots, w_{e-1}),$$
where $s$ denotes the start index of the data range and $e$ denotes the end index of the data range, with $s, e \in [0, k-1]$ and $e > s$. As node $\nu$ operates on the data range $[s, e)$, the node impurity $\Phi(\nu)$ is computed as follows:
$$\Phi(\nu) = \frac{\sum_{h \in [s,e)} w_h\, (y^*_{u_h})^2}{\sum_{h \in [s,e)} w_h} - \left( \frac{\sum_{h \in [s,e)} w_h\, y^*_{u_h}}{\sum_{h \in [s,e)} w_h} \right)^{2}, \tag{2}$$
where $h$ denotes the index within the data range $[s, e)$. The node impurity $\Phi(\nu)$ represents a surrogate of the mean squared error (MSE), derived under the assumption that the mean target of a small bootstrap subset can approximate the target prediction of that subset; that is, assuming small $n_\nu$ and equal sample weights, the impurity is derived from the MSE as follows:
$$\mathrm{MSE}(\nu) = \frac{1}{n_\nu}\sum_{i=0}^{n_\nu - 1}(y_i - \hat{y}_i)^2 \approx \frac{1}{n_\nu}\sum_{i=0}^{n_\nu - 1} y_i^2 - \frac{1}{n_\nu^2}\left(\sum_{i=0}^{n_\nu - 1} y_i\right)^{2}, \tag{3}$$
where $y_i$ is the target value and $\hat{y}_i$ is the predicted value. The above is a special case of Equation (2) for equal-weighted samples.
Algorithm 1: Binary decision tree construction.
Furthermore, the tree-construction routine checks whether a node, $\nu$, is a leaf based on depth, sample counts, and impurity, as follows:
$$\mathrm{is\_leaf}(\nu) = \left(d_\nu \geq d_{\max}\right) \vee \left(n_\nu < s_{\min}\right) \vee \left(n_\nu < 2\,l_{\min}\right) \vee \left(w_\nu < 2\,w_{\min}\right) \vee \left(\Phi(\nu) \leq \epsilon_i\right), \tag{4}$$
where $\vee$ is the logical OR, $d_\nu$ is the current node depth, $d_{\max}$ is the maximum allowed depth, $n_\nu$ is the number of samples at the node, $s_{\min}$ is the minimum number of samples required to split, $l_{\min}$ is the minimum number of samples per leaf, $w_\nu$ is the weighted number of samples at the node, $w_{\min}$ is the minimum weighted number of samples per leaf, $\Phi(\nu)$ denotes the impurity of the node, and $\epsilon_i$ is a small positive threshold for impurity (corresponding to machine epsilon, which for double precision is approximately $2.22 \times 10^{-16}$).
As such, if the node $\nu$ is a leaf, the node value $\theta(\nu)$ is computed as follows:
$$\theta(\nu) = \frac{\sum_{h \in [s,e)} w_h\, y^*_{u_h}}{\sum_{h \in [s,e)} w_h}, \tag{5}$$
where $\theta(\nu)$ denotes the (output) prediction, approximated by the bootstrap target mean over the node's (small) data range.
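The node-level quantities above can be summarized in a short sketch; this is our own simplified illustration of Equations (2), (4), and (5), with parameter names mirroring the symbols $d_{\max}$, $s_{\min}$, $l_{\min}$, and $w_{\min}$, and it is not taken from the scikit-learn source.

```python
import numpy as np

EPS_IMPURITY = np.finfo(np.float64).eps  # machine epsilon, ~2.22e-16

def node_impurity(y, w):
    """Weighted-variance surrogate of the MSE, as in Equation (2)."""
    w_sum = w.sum()
    mean = (w * y).sum() / w_sum
    return (w * y**2).sum() / w_sum - mean**2

def is_leaf(depth, n_samples, w_sum, impurity,
            max_depth, min_samples_split, min_samples_leaf, min_weight_leaf):
    """Leaf test of Equation (4): stop on depth, counts, weight, or purity."""
    return (depth >= max_depth
            or n_samples < min_samples_split
            or n_samples < 2 * min_samples_leaf
            or w_sum < 2 * min_weight_leaf
            or impurity <= EPS_IMPURITY)

def leaf_value(y, w):
    """Weighted target mean of Equation (5), used as the leaf prediction."""
    return (w * y).sum() / w.sum()

# Toy data range [s, e) of a node: targets and bootstrap sample weights.
y = np.array([10.0, 12.0, 11.0, 30.0])
w = np.array([1.0, 2.0, 1.0, 1.0])
print(node_impurity(y, w), leaf_value(y, w))
```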
On the other hand, if the node $\nu$ is not a leaf, the best split is found, and the child nodes are pushed onto the stack with the split data ranges. The best split, $p^*$, is found via a greedy search [26,27] over the data range $[s, e)$, as follows:
$$(p^*, f^*, t^*) = \operatorname*{arg\,max}_{f,\; t \in H,\; p \in \mathcal{P}} \; \psi(p), \tag{6}$$
where $p$ denotes the split index over the range $[s, e)$, with $p > s$ and $p < e$; $\mathcal{P}$ denotes the candidate set of split points; $f$ denotes the selected feature of the bootstrap sample; $t$ denotes the threshold associated with feature $f$; $H$ denotes the collection of candidate thresholds, each of which is associated with the candidate set of split points $\mathcal{P}$; and $\psi(p)$ denotes the proxy impurity improvement measure resulting from splitting the data range $[s, e)$ at index $p$.
For each split index $p \in \mathcal{P}$, the child nodes can be denoted as follows:
$$\text{Left node:}\;\; \nu_L \text{ with data range } [s, p), \qquad \text{Right node:}\;\; \nu_R \text{ with data range } [p, e).$$
The proxy impurity improvement $\psi(p)$ can then be computed as follows:
$$\psi(p) = \left( \frac{\sum_{h \in [s,p)} w_h\, y^*_{u_h}}{\sum_{h \in [s,p)} w_h} \right)^{2} + \left( \frac{\sum_{h \in [p,e)} w_h\, y^*_{u_h}}{\sum_{h \in [p,e)} w_h} \right)^{2}, \tag{7}$$
where $y^*_{u_h}$ denotes the corresponding bootstrap target value at sample point $h$, on either the left node $\nu_L: [s, p)$ or the right node $\nu_R: [p, e)$. The proxy impurity improvement measure can be obtained from (3) under the assumptions that the target mean of a small bootstrap subset approximates the prediction of that subset and that the overall sum of target values remains constant while the best split between the left and right nodes is searched for, as follows:
$$\mathrm{MSE}(\nu_L) + \mathrm{MSE}(\nu_R) = \frac{1}{n_L}\sum_{i=0}^{n_L - 1}(y_i - \hat{y}_i)^2 + \frac{1}{n_R}\sum_{i=0}^{n_R - 1}(y_i - \hat{y}_i)^2 \approx -\frac{1}{n_L^2}\left(\sum_{h \in [s,p)} y^*_{u_h}\right)^{2} - \frac{1}{n_R^2}\left(\sum_{h \in [p,e)} y^*_{u_h}\right)^{2}, \tag{8}$$
where $n_L$ ($n_R$) is the number of samples at the left (right) node. Then, for equal-weighted samples, the impurity improvement becomes the following:
$$\varphi(p) = \frac{1}{n_L^2}\left(\sum_{h \in [s,p)} y^*_{u_h}\right)^{2} + \frac{1}{n_R^2}\left(\sum_{h \in [p,e)} y^*_{u_h}\right)^{2}, \tag{9}$$
where $n_L$ ($n_R$) denotes the number of bootstrap samples in the left (right) node after splitting the node $\nu$ at data point $p$, and the positive sign of the terms in $\psi(p)$ is due to the maximization in (6).
The greedy search mechanism that solves (6) in [26,27] selects the best split point $p^*$ that maximizes the proxy impurity improvement (7), which is an approximate surrogate of the mean squared error, over a set of selected features and decision thresholds; thus, the resulting partition is expected to minimize regression errors. Concretely speaking, the search mechanism first selects $D$ features uniformly at random, $F = (f_1, f_2, \ldots, f_D)$, and then each selected feature, $f_d$, of the bootstrap sample
$$X^*_{[s,e)} = \{x_{u_s}(f_d), x_{u_{s+1}}(f_d), \ldots, x_{u_{e-1}}(f_d)\} \tag{10}$$
is sorted, such that
$$x_{u_{(s)}}(f_d) \leq x_{u_{(s+1)}}(f_d) \leq \cdots \leq x_{u_{(e-1)}}(f_d), \tag{11}$$
where $x_{u_{(s)}}, x_{u_{(s+1)}}, \ldots, x_{u_{(e-1)}}$ denote the sorted values of $x_{u_s}, x_{u_{s+1}}, \ldots, x_{u_{e-1}}$ in ascending order. Then, the candidate set of split points, $\mathcal{P}$, is computed as follows:
$$\mathcal{P} = \left\{ p_c(f_d) \;\middle|\; d = 1, \ldots, D;\; c = 1, \ldots, C_d \right\}, \tag{12}$$
where $f_d$ is a selected feature with $d \in [1, D]$, $c$ denotes the ordinal number of a split point, $C_d$ denotes the number of split points over feature $f_d$, and $p_c(f_d)$ denotes the $c$-th split point when the selected feature $f_d$ is used, which is computed iteratively over the data range $[s, e)$:
$$p_0(f_d) = s, \qquad p_{c+1}(f_d) = 1 + \max\left\{ i \in \mathbb{N} \;\middle|\; p_c(f_d) \leq i < e-1,\; x_{u_{(i+1)}}(f_d) \leq x_{u_{(i)}}(f_d) + \epsilon_p \right\}, \tag{13}$$
where $i$ denotes the split position, $x_{u_{(i)}}(f_d)$ denotes the value of feature $f_d$ of the $i$-th sorted bootstrap sample $x_{u_{(i)}}$, and $\epsilon_p$ denotes a small threshold value, $10^{-7}$, avoiding false mismatches caused by tiny floating-point rounding differences.
Also, for each split point, $p \in \mathcal{P}$, over feature $f_d$, the corresponding threshold on the feature value can be computed as follows:
$$t(f_d, p) = \frac{x_{u_{p-1}}(f_d) + x_{u_p}(f_d)}{2}, \tag{14}$$
and the candidate set of thresholds is defined as follows:
$$H = \left\{ t\!\left(f_d, p_c(f_d)\right) \;\middle|\; d = 1, \ldots, D;\; c = 1, \ldots, C_d \right\}. \tag{15}$$
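For concreteness, the sketch below is a simplified, single-feature version of this greedy scan (not the Cython splitter of scikit-learn): it sorts one feature, skips runs of (near-)identical values using a tolerance analogous to $\epsilon_p$, evaluates the proxy improvement of Equation (7) at each candidate split, and returns the midpoint threshold of Equation (14) for the best split.

```python
import numpy as np

EPS_P = 1e-7  # tolerance for treating adjacent feature values as equal

def best_split_one_feature(x, y, w):
    """Greedy scan over one feature: sort, enumerate candidate split points,
    and keep the split that maximizes the proxy improvement of Equation (7)."""
    order = np.argsort(x)
    xs, ys, ws = x[order], y[order], w[order]
    best_psi, best_threshold = -np.inf, None
    for p in range(1, len(xs)):
        # Skip positions inside a run of (near-)identical feature values.
        if xs[p] <= xs[p - 1] + EPS_P:
            continue
        left_mean = (ws[:p] * ys[:p]).sum() / ws[:p].sum()
        right_mean = (ws[p:] * ys[p:]).sum() / ws[p:].sum()
        psi = left_mean**2 + right_mean**2              # Equation (7)
        if psi > best_psi:
            best_psi = psi
            best_threshold = (xs[p - 1] + xs[p]) / 2.0  # Equation (14)
    return best_threshold, best_psi

x = np.array([1.2, 3.4, 3.4, 5.0, 7.5])
y = np.array([10.0, 12.0, 11.0, 30.0, 31.0])
w = np.ones_like(y)
print(best_split_one_feature(x, y, w))
```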

2.3. Prediction

The random forest routine in the scikit-learn implementation [27] generates a collection (forest) of trees, $\{T_0, T_1, \ldots, T_{N-1}\}$, each generated via Algorithm 1. For a new observation, $x$, the prediction of a tree, $T(x)$, can be obtained by traversing the nodes of the tree using the associated best features, $f^*$, and thresholds, $t^*$. Once the tree reaches a leaf node, $\nu$, the (output) prediction of the tree is given by its node value, that is, $T(x) = \theta(\nu)$. Then, the tree-based mapping function, $\tau$, is represented by the mean of the (output) prediction values of the collection of trees:
$$\tau(x) = \frac{1}{N}\sum_{a=0}^{N-1} T_a(x), \tag{16}$$
where $N$ is the number of trees in the collection (forest).
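This averaging can be checked directly against scikit-learn, where the fitted trees are exposed via the estimators_ attribute; the toy regression data below are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# The mean of the per-tree predictions (Equation (16)) matches forest.predict().
per_tree = np.stack([tree.predict(X[:3]) for tree in forest.estimators_])
print(np.allclose(per_tree.mean(axis=0), forest.predict(X[:3])))  # True
```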

2.4. Best-First Mechanism

The above describes the (default) depth-first configuration for random forest tree construction. When the maximum number of leaf nodes is specified by the user, the random forest routine in [26,27] switches to a best-first tree-construction mechanism, in which nodes are popped from the stack according to the highest priority (the greatest impurity improvement), thus expanding (splitting) the most promising nodes first. After construction, the tree is pruned to meet the specified size constraint.
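In scikit-learn, this mode is triggered by the max_leaf_nodes parameter; the value used below is only an example.

```python
from sklearn.ensemble import RandomForestRegressor

# With max_leaf_nodes set, each tree is grown best-first (the node with the
# greatest impurity improvement is expanded first), subject to at most 32 leaves.
forest_best_first = RandomForestRegressor(n_estimators=100, max_leaf_nodes=32,
                                          random_state=0)
```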

2.5. Classification Problems

Although the above describes the main tenets of a regression-based tree-construction framework, trees for classification can be constructed using either a depth-based approach (the default configuration in [27]) or a best-first approach (when the maximum number of leaf nodes is specified). However, instead of using MSE-based proxies as impurity metrics, random forests for classification use the count-based Gini index [30], which measures how mixed the classes are within a node; lower impurity indicates that the node contains mostly samples from a single class. Similar to (7), the proxy impurity improvement at node $\nu$ combines the Gini-based metrics of the left node, $\nu_L$, and the right node, $\nu_R$.
The random forest for classification generates a collection of trees, $\{T_0, T_1, \ldots, T_{N-1}\}$, where $N$ is the number of classification trees in the forest. Let $K$ be the number of classes. The node value $\theta(\nu)$ in a classification tree is a class-probability vector, that is, $\theta(\nu) \in [0, 1]^K$ with $\sum_{\lambda=1}^{K} \theta(\nu)_\lambda = 1$. For a new observation, $x$, the prediction of a tree, $T(x)$, is obtained by traversing the nodes of the tree using the associated best features, $f^*$, and thresholds, $t^*$. Once the tree reaches a leaf node, $\nu$, the classification prediction of the tree is given by the node value, i.e., $T(x) = \theta(\nu)$. The class output vector for the random forest is then computed through a probabilistic aggregation across all trees:
$$\hat{p}(y = \lambda \mid x) = \frac{1}{N}\sum_{a=0}^{N-1} \left[T_a(x)\right]_\lambda, \tag{17}$$
where $\hat{p}(y = \lambda \mid x)$ is the estimated probability that $x$ belongs to class $\lambda$. Thus, the tree-based classification function $\tau$ predicts the class that has the highest mean probability across all trees:
$$\tau(x) = \operatorname*{arg\,max}_{\lambda \in \{1, \ldots, K\}} \; \hat{p}(y = \lambda \mid x). \tag{18}$$
Rather than using majority voting [28], this probabilistic aggregation approach [27]—also known as bagging class-probability estimation [29]—takes into account the confidence of each tree in its prediction and allows probability estimates to be provided as needed.
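The probabilistic aggregation of Equations (17) and (18) can likewise be verified against scikit-learn on illustrative data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8, n_classes=3,
                           n_informative=5, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Equation (17): the forest's class probabilities are the mean of the per-tree
# class-probability vectors; Equation (18): the predicted class is the arg max.
per_tree = np.stack([tree.predict_proba(X[:3]) for tree in clf.estimators_])
mean_proba = per_tree.mean(axis=0)
print(np.allclose(mean_proba, clf.predict_proba(X[:3])))              # True
print(np.array_equal(mean_proba.argmax(axis=1), clf.predict(X[:3])))  # True
```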

3. Computational Experiments

This section presents our computational experiments and the observations obtained from evaluating the feasibility of using random forests to predict cooling conditions during the annealing process of steel parts.

3.1. Dataset

Our dataset consists of real-world records compiled from daily operations at Tsukimi Factory, an industrial plant that operates furnaces and cooling devices for the heat treatment of steel parts. Our goal is to predict the cooling conditions during the annealing process of steel parts. Accordingly, the target variable (also called the dependent or response variable) represents the cooling rate parameter to be set on the equipment. This parameter adjusts the fan speed: the higher the value, the faster the fan operates. Possible values are integers ranging from 0 to 60.
To avoid implying any unintended order or relationship between categories, categorical data were transformed using one-hot encoding prior to model training (a brief sketch of this step follows the feature descriptions below). Our dataset consists of 993 observations and 21 input features, representing real-world cooling conditions from the heat treatment process of industrial steel parts. The explanatory variables (or input features, in the context of random forests) are divided into the following types:
Chemical Composition: The percentage of elements contained in a steel part, expressed as nine items: carbon (C), silicon (Si), manganese (Mn), phosphorus (P), sulfur (S), nickel (Ni), chromium (Cr), molybdenum (Mo), and copper (Cu). During heat treatment, these elements affect the treatment results in different ways, with carbon and chromium being particularly influential. The microstructural properties of metal alloys are influenced by both cooling rates and chemical composition [16,17,18,19]. Gao et al. [18] found an exponential relationship between the critical cooling rate in U75V rail steel and its chemical composition, specifically carbon (C), manganese (Mn), silicon (Si), and vanadium (V). Afflerbach et al. [22] used features of constituent elements, such as Al, Cu, Ni, Fe, B, Zr, Si, Co, Mg, and Ti, to predict the critical cooling rate for glass formation. Schultz et al. [23] utilized calculated features derived from chemical elements and molecular properties to predict the critical cooling rate of metallic glasses.
Material: The metals that serve as raw materials for steel parts are used differently depending on the application. For each material quality, standard values for the content of the above-mentioned chemical components are established. In our dataset, seven different material qualities were included as categorical data.
Weight: The weight of each steel part. During heat treatment, a phenomenon called the mass effect occurs, in which the greater the material's mass, the weaker the heat-treatment effect.
Shape: Geometric characteristics of the steel part, captured as four visually distinguishable shapes: (1) gear shaft, for rod-shaped parts with gears; (2) washer, for cylindrical parts whose thickness is greater than their height; (3) ring, for cylindrical parts whose height is greater than their thickness; and (4) block, for parts without holes that are nearly three-dimensional.
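A minimal sketch of the one-hot encoding step is given below; the rows, material labels, and column names are hypothetical stand-ins, since the plant records are not public.

```python
import pandas as pd

# Hypothetical rows mimicking the feature types described above.
df = pd.DataFrame({
    "C": [0.45, 0.20], "Cr": [1.05, 0.15], "weight_kg": [120.0, 15.5],
    "material": ["grade_A", "grade_B"], "shape": ["gear shaft", "ring"],
    "cooling_rate": [35, 20],
})

# One-hot encode the categorical columns so that no ordinal relationship
# is implied between material qualities or shapes.
X = pd.get_dummies(df.drop(columns="cooling_rate"), columns=["material", "shape"])
y = df["cooling_rate"]
print(X.columns.tolist())
```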

3.2. Model Settings

We evaluated the following types of random forest configurations:
  • Random forest with default configuration. Here, we used the default parameter settings of the random forest implementation of scikit-learn [27]. The key representative parameters are as follows: the number of estimators, $n_T = 100$; the maximum allowed depth, $d_{\max}$, set to the largest integer representable as a 32-bit signed integer in NumPy (i.e., $2^{31} - 1$); the minimum number of samples required to split a node, $s_{\min} = 2$; the minimum number of samples per leaf, $l_{\min} = 1$; and the minimum weighted count for splits, $w_{\min} = 0$. Here, the number of estimators, $n_T$, corresponds to the number of trees in the random forest.
  • Random forest with oversampling. Since the acquired real-world data are imbalanced—that is, the distribution of outputs is uneven—we implemented the oversampling technique (RandomOverSampler in [27]) to ensure a more balanced distribution of output classes. Random oversampling randomly duplicates samples from the minority class until all classes are equally distributed. Although alternative techniques for handling imbalanced data exist, such as resampling or class weighting, random oversampling is a straightforward method that avoids the creation of synthetic samples, thereby maintaining alignment with the real-world cooling conditions of steel parts.
  • Random forest with hyperparameter tuning. In this configuration, we performed hyperparameter tuning on several key parameters of the random forest. We considered the following parameters and their respective ranges: the number of estimators, $n_T \in [50, 500]$; the maximum allowed depth, $d_{\max} \in [5, 200]$; the minimum number of samples required to split a node, $s_{\min} \in [2, 100]$; the minimum number of samples per leaf, $l_{\min} \in [1, 100]$; and the maximum number of features considered when searching for the best split at each node during tree construction, $f_{\max} \in \{\sqrt{\cdot},\, \log_2,\, 0.3,\, 0.5,\, 0.8,\, 5,\, 15,\, 25,\, 50,\, 75,\, 100\}$. Here, $f_{\max}$ represents either a function (or fraction) of the total number of features when specified as such, or an absolute integer value otherwise. The parameters were selected to balance the trade-off between bias and variance, as extreme values can either increase or decrease the level of randomness, thereby affecting overfitting and underfitting in the model. Due to the computationally expensive nature of the objective function, we used 10-fold cross-validation and Bayesian optimization to efficiently search for optimized parameters using a surrogate-based optimization approach [31].
  • Random forests for both regression and classification. Since the real-world observations of cooling conditions for steel parts comprise a discrete set of output values in the range $[0, 60]$, it is possible to train both classification and regression models and evaluate their corresponding performance. Thus, we trained both classification and regression models for each of the aforementioned random forest configurations. Furthermore, for the regression models, to enable a meaningful comparison with the discrete nature of the observed outputs, we approximated each continuous prediction by rounding it to the nearest integer. This ensures that predictions reflect real-world observations, enhancing interpretability and producing outputs consistent with the expected format, thereby increasing their relevance for decision-making and further analysis. A brief sketch of these configurations and the rounding step is given at the end of this subsection.
Furthermore, to ensure a relevant evaluation of training and testing performance, we used a split ratio of 75% for training and 25% for testing. During the training–test split, we used stratification to prevent a bias toward underrepresented classes. Our computational experiments were conducted on an Intel i9 9900K @ 3.6 GHz.
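The minimal sketch below illustrates the regression configurations described above. It assumes the RandomOverSampler implementation from the imbalanced-learn package and uses synthetic stand-in data, since the plant records are not public; it is an illustration, not the code used in our experiments.

```python
import numpy as np
from imblearn.over_sampling import RandomOverSampler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded dataset (993 rows, 21 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(993, 21))
y = rng.choice([10, 20, 30, 40], size=993, p=[0.6, 0.2, 0.15, 0.05])  # imbalanced

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          stratify=y, random_state=0)

# (1) Default configuration (100 trees, scikit-learn defaults).
rf_default = RandomForestRegressor().fit(X_tr, y_tr)

# (2) Random oversampling: duplicate minority-class rows of the training set
#     until the discrete cooling-rate values are evenly represented.
X_bal, y_bal = RandomOverSampler(random_state=0).fit_resample(X_tr, y_tr)
rf_over = RandomForestRegressor().fit(X_bal, y_bal)

# (3) Hyperparameter tuning over the ranges listed above used 10-fold
#     cross-validation with Bayesian optimization (e.g., via scikit-optimize's
#     BayesSearchCV); omitted here for brevity.

# Continuous and nearest-integer (rounded) test metrics: MSE, R^2, MSE_n, R_n^2.
pred = rf_over.predict(X_te)
print(mean_squared_error(y_te, pred), r2_score(y_te, pred))
print(mean_squared_error(y_te, np.rint(pred)), r2_score(y_te, np.rint(pred)))
```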

3.3. Example of a Decision Tree for Predicting the Cooling Parameter

To exemplify the generation of both a tree and a random forest, we outline the configuration of a tree in Figure 2 and the configuration of a random forest in Figure 3. In particular, Figure 2a shows the configuration of a single tree: the feature values (denoted by x[number]), the decision thresholds, the corresponding decision branches (True to the left and False to the right), the node impurity (IMP), the number of samples in the node's span, and the node value. The reader may note that, for a regression model, the value of a leaf node denotes the output of the tree. Figure 2b shows the values of the proxy impurity improvement when searching for the best splitting points in (6).
Furthermore, Figure 3 shows the configuration of all trees when the number of trees is set to five. The reader may note the relatively different configuration of each tree in terms of decision thresholds and node values. For a regression problem, the regression output is computed as the average of the five trees in Figure 3, following (16).
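Views similar to Figures 2 and 3 can be reproduced from any fitted forest. The sketch below fits a small illustrative forest of five trees on toy data and prints the split features, thresholds, and leaf values of its first tree; sklearn.tree.plot_tree renders the same tree graphically, including impurity and sample counts.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import export_text

# A small forest of five trees, as in Figure 3, fitted on toy regression data.
X, y = make_regression(n_samples=100, n_features=5, random_state=0)
small_forest = RandomForestRegressor(n_estimators=5, max_depth=3,
                                     random_state=0).fit(X, y)

# Text view of the first tree: split features (feature_0, ...), thresholds,
# and leaf values, analogous to the tree shown in Figure 2a.
print(export_text(small_forest.estimators_[0], decimals=2))
```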

3.4. Data Distribution

The real-world data on the cooling conditions of steel parts exhibit an imbalanced distribution, as shown by the training–testing split in Figure 4. Nevertheless, the stratification scheme used during splitting helps preserve the class distribution. Furthermore, Figure 4 shows that the oversampling technique yields an even distribution of the training data.

3.5. Regression with Default Configuration

To give a glimpse of the performance of random forest models under the default configuration scheme, Table 1 shows the results of 10 independent runs in terms of learning time and training–testing metrics. Here, metrics with the subscript $n$, such as $\mathrm{MSE}_n$, indicate that the metric is calculated after each continuous prediction is rounded to the nearest integer. By observing the results in Table 1, we note the following facts:
  • Computation Time: The random forest learned a suitable fit quickly, in around 0.19–0.20 s per trial.
  • Training Metrics: MSE ranges from 3.17 to 4.41, with high $R^2$ values above 0.98, indicating a strong model fit.
  • Testing Metrics: Test MSE varies more widely (11.43 to 33.44), with $R^2$ values between 0.84 and 0.94, showing a reasonable degree of variability in generalization.
  • Metrics with Nearest Approximation: $\mathrm{MSE}_n$ and $R_n^2$ closely follow their continuous counterparts, confirming consistency across scales.
  • Overall Performance: The model demonstrates solid training accuracy and generally good testing performance, with some trials showing decreased test accuracy.
Table 1. Regression performance metrics (training and testing) with default configuration.
Trial | Time (s) | Train MSE ↓ | Train R² ↑ | Train MSEn ↓ | Train Rn² ↑ | Test MSE ↓ | Test R² ↑ | Test MSEn ↓ | Test Rn² ↑
1 | 0.19 | 4.20 | 0.9816 | 4.42 | 0.9806 | 11.43 | 0.9351 | 13.00 | 0.9262
2 | 0.19 | 3.69 | 0.9829 | 4.59 | 0.9788 | 16.09 | 0.9249 | 17.68 | 0.9175
3 | 0.19 | 4.41 | 0.9808 | 4.90 | 0.9787 | 20.14 | 0.8813 | 21.36 | 0.8741
4 | 0.20 | 3.50 | 0.9836 | 3.95 | 0.9815 | 26.67 | 0.8794 | 27.04 | 0.8777
5 | 0.19 | 3.87 | 0.9833 | 4.22 | 0.9818 | 19.72 | 0.8804 | 20.10 | 0.8781
6 | 0.20 | 3.93 | 0.9833 | 4.44 | 0.9811 | 16.47 | 0.8937 | 16.73 | 0.8920
7 | 0.20 | 4.13 | 0.9816 | 4.25 | 0.9810 | 18.43 | 0.9038 | 20.63 | 0.8923
8 | 0.20 | 3.79 | 0.9825 | 3.74 | 0.9827 | 13.06 | 0.9387 | 13.67 | 0.9359
9 | 0.19 | 3.17 | 0.9853 | 3.55 | 0.9836 | 33.44 | 0.8429 | 35.92 | 0.8312
10 | 0.20 | 3.80 | 0.9812 | 4.19 | 0.9793 | 20.49 | 0.9190 | 22.09 | 0.9126
Note: Arrows describe the desirable direction of the metric for better performance: ↓ implies lower is better, ↑ implies higher is better.

3.6. Regression with Oversampling

Also, to give a glimpse of the performance of random forest models under an oversampling scheme, Table 2 shows the results of 10 independent runs in terms of learning time and training–testing metrics. As above, metrics with the subscript $n$ indicate that these metrics were calculated after each continuous prediction was rounded to the nearest integer. By observing the results in Table 2, we note the following facts:
  • Computation Time: All trials completed rapidly, with times tightly clustered between 0.42 and 0.44 s.
  • Training Metrics: Training MSE values are low (0.68 to 1.48), and $R^2$ values are consistently high (0.9961 to 0.9982), indicating an excellent model fit.
  • Testing Metrics: Test MSE ranges from 11.25 to 29.19, with $R^2$ values between 0.8685 and 0.9487, reflecting strong but variable generalization.
  • Metrics with Nearest Approximation: $\mathrm{MSE}_n$ and $R_n^2$ closely track their continuous counterparts, confirming stable performance across scales.
  • Overall Performance: Random oversampling yields robust training accuracy and generally strong testing results, with a few trials showing higher test error yet still achieving reasonable predictive performance.
Table 2. Regression performance metrics (training and testing) with random oversampling.
Trial | Time (s) | Train MSE ↓ | Train R² ↑ | Train MSEn ↓ | Train Rn² ↑ | Test MSE ↓ | Test R² ↑ | Test MSEn ↓ | Test Rn² ↑
1 | 0.42 | 1.10 | 0.9971 | 1.26 | 0.9966 | 11.25 | 0.9487 | 12.02 | 0.9452
2 | 0.44 | 0.99 | 0.9974 | 1.02 | 0.9973 | 21.46 | 0.9005 | 21.19 | 0.9017
3 | 0.42 | 0.78 | 0.9979 | 0.83 | 0.9978 | 21.59 | 0.9015 | 23.62 | 0.8922
4 | 0.42 | 0.88 | 0.9977 | 0.91 | 0.9976 | 25.79 | 0.8804 | 27.43 | 0.8728
5 | 0.42 | 1.03 | 0.9972 | 1.09 | 0.9971 | 18.21 | 0.9180 | 20.42 | 0.9080
6 | 0.44 | 1.19 | 0.9968 | 1.25 | 0.9967 | 29.19 | 0.8685 | 30.22 | 0.8639
7 | 0.44 | 0.86 | 0.9977 | 0.95 | 0.9975 | 18.60 | 0.9152 | 18.61 | 0.9151
8 | 0.43 | 0.76 | 0.9980 | 0.79 | 0.9979 | 11.66 | 0.9459 | 12.02 | 0.9443
9 | 0.44 | 0.68 | 0.9982 | 0.72 | 0.9981 | 13.29 | 0.9384 | 12.97 | 0.9399
10 | 0.42 | 1.48 | 0.9961 | 1.57 | 0.9958 | 15.51 | 0.9281 | 15.67 | 0.9274
Note: Arrows describe the desirable direction of the metric for better performance: ↓ implies lower is better, ↑ implies higher is better.

3.7. Regression with Hyperparameter Tuning

Also, to show the performance when using the hyperparameter tuning scheme, Table 3 shows the results of learning time and training–testing metrics. By observing the results in Table 3, we note the following facts:
  • Computation Time: Trials completed relatively quickly, with learning times ranging from 0.16 to 2.12 s and an average of 0.87 s.
  • Training Performance: The mean training MSE is low, at 0.99, with very high $R^2$ values averaging 0.9974.
  • Testing Performance: The test MSE averages 18.46, with an $R^2$ of around 0.9173, showing competitive predictive accuracy compared with training.
  • MSE with Nearest Approximation: $\mathrm{MSE}_n$ and $R_n^2$ values closely mirror their continuous counterparts, confirming stable model behavior across scales.
  • Variability: Standard deviations are moderate for MSE (training: 0.28; testing: 5.52) and very low for the $R^2$ coefficients (training: 0.0009; testing: 0.023), indicating consistent performance across trials.
Table 3. Regression performance metrics (training and testing) with hyperparameter tuning.
Trial | Time (s) | Train MSE ↓ | Train R² ↑ | Train MSEn ↓ | Train Rn² ↑ | Test MSE ↓ | Test R² ↑ | Test MSEn ↓ | Test Rn² ↑
1 | 2.01 | 1.08 | 0.9971 | 1.19 | 0.9968 | 11.13 | 0.9492 | 12.42 | 0.9433
2 | 2.12 | 0.99 | 0.9974 | 1.04 | 0.9972 | 22.20 | 0.8971 | 22.76 | 0.8945
3 | 0.17 | 0.79 | 0.9979 | 0.84 | 0.9978 | 23.91 | 0.8909 | 24.99 | 0.8860
4 | 0.17 | 0.88 | 0.9976 | 0.94 | 0.9975 | 22.56 | 0.8954 | 21.70 | 0.8994
5 | 0.73 | 1.05 | 0.9972 | 1.10 | 0.9971 | 17.70 | 0.9203 | 19.21 | 0.9135
6 | 0.18 | 1.15 | 0.9969 | 1.23 | 0.9967 | 26.69 | 0.8798 | 28.58 | 0.8713
7 | 1.69 | 0.84 | 0.9978 | 0.89 | 0.9976 | 19.01 | 0.9133 | 20.85 | 0.9049
8 | 0.21 | 0.73 | 0.9981 | 0.78 | 0.9979 | 11.15 | 0.9483 | 12.02 | 0.9442
9 | 0.16 | 0.62 | 0.9984 | 0.71 | 0.9981 | 13.93 | 0.9354 | 14.67 | 0.9320
10 | 1.25 | 1.47 | 0.9961 | 1.54 | 0.9959 | 15.53 | 0.9280 | 16.27 | 0.9246
Note: Arrows describe the desirable direction of the metric for better performance: ↓ implies lower is better, ↑ implies higher is better.

3.8. Classification with Default Configuration

Furthermore, to show the performance of random forest classification models under a default configuration scheme, Table 4 shows the results and metrics over 10 independent runs. By observing the results in Table 4, we note the following facts:
  • Computation Time: The classification models were learned in about 0.13 s per trial, indicating fast learning.
  • Training Performance: The classification models fit the training data with very low MSEs (0.78–2.06) and high $R^2$ values (0.9905–0.9964), showing strong learning from the training data.
  • Testing Performance: The models achieved higher MSEs (27.14–63.39) and reduced $R^2$ values (0.7061–0.8778), indicating some overfitting and moderate generalization.
  • Variability: The classification models showed low variability in training metrics and moderate variability in testing metrics, implying fluctuations in model generalization across trials.
Table 4. Classification performance metrics (training and testing) with default parameters.
Trial | Time (s) | Train MSE ↓ | Train R² ↑ | Test MSE ↓ | Test R² ↑
1 | 0.13 | 2.03 | 0.9905 | 28.21 | 0.8713
2 | 0.13 | 1.52 | 0.9929 | 52.10 | 0.7584
3 | 0.13 | 1.32 | 0.9938 | 46.65 | 0.7872
4 | 0.13 | 2.06 | 0.9905 | 63.39 | 0.7061
5 | 0.13 | 1.62 | 0.9924 | 27.14 | 0.8778
6 | 0.13 | 1.76 | 0.9918 | 41.94 | 0.8111
7 | 0.13 | 1.35 | 0.9937 | 29.60 | 0.8650
8 | 0.13 | 0.92 | 0.9957 | 50.37 | 0.7664
9 | 0.13 | 0.78 | 0.9964 | 42.50 | 0.8029
10 | 0.13 | 1.39 | 0.9936 | 33.67 | 0.8439
Note: Arrows describe the desirable direction of the metric for better performance: ↓ implies lower is better, ↑ implies higher is better.

3.9. Classification with Oversampling

To show the performance of classification models under an oversampling scheme, Table 5 shows the rendered results in terms of learning and performance metrics. By observing Table 5, we note the following facts:
  • Computation Time: Slightly increased compared with the default parameters, ranging from 0.20 to 0.21 s per trial, yet remaining consistent and efficient.
  • Training Performance: The models achieved low MSEs (0.73–1.96) and high $R^2$ values (0.9948–0.9980), indicating an excellent fit to the training data.
  • Testing Performance: The models achieved improved MSEs in some trials compared with the default parameters, with MSEs ranging from 28.57 to 59.56 and $R^2$ values ranging from 0.7238 to 0.8697, showing moderate generalization and a reduction in overfitting.
  • Variability: The models showed low variability in computation times and training metrics and moderate variability in testing metrics, reflecting fluctuations in model generalization across trials.
Table 5. Classification performance metrics (training and testing) with random oversampling.
Trial | Time (s) | Train MSE ↓ | Train R² ↑ | Test MSE ↓ | Test R² ↑
1 | 0.21 | 1.42 | 0.9962 | 28.57 | 0.8697
2 | 0.21 | 1.19 | 0.9968 | 45.69 | 0.7881
3 | 0.20 | 0.93 | 0.9975 | 47.56 | 0.7830
4 | 0.21 | 1.42 | 0.9962 | 59.56 | 0.7238
5 | 0.21 | 1.42 | 0.9962 | 33.58 | 0.8487
6 | 0.20 | 1.62 | 0.9957 | 44.76 | 0.7984
7 | 0.20 | 1.14 | 0.9970 | 31.84 | 0.8547
8 | 0.20 | 0.73 | 0.9980 | 52.08 | 0.7585
9 | 0.21 | 0.78 | 0.9979 | 34.08 | 0.8420
10 | 0.20 | 1.96 | 0.9948 | 34.17 | 0.8415
Note: Arrows describe the desirable direction of the metric for better performance: ↓ implies lower is better, ↑ implies higher is better.

3.10. Classification with Hyperparameter Tuning

Finally, to show the performance of classification models with the hyperparameter tuning scheme, Table 6 shows the rendered performance metrics. By observing Table 6, we note the following facts:
  • Computation Time: The models were learned with more variability than in the previous approaches, with learning times ranging from 0.09 to 1.80 s per trial, reflecting the added computational cost of 10-fold cross-validation.
  • Training Performance: The models achieved low MSEs (0.73–1.96) and high $R^2$ values (0.9948–0.9980), indicating a strong model fit on the training data.
  • Testing Performance: The models achieved test MSEs ranging from 15.65 to 54.46 and $R^2$ values from 0.7516 to 0.9274, suggesting better generalization and reduced overfitting in certain cases.
  • Variability: The models showed high variability in computation times and moderate variability in testing metrics, reflecting the effects of cross-validation and parameter tuning on model performance.
Table 6. Classification performance metrics using hyperparameter tuning.
Trial | Time (s) | Train MSE ↓ | Train R² ↑ | Test MSE ↓ | Test R² ↑
1 | 1.66 | 1.42 | 0.9962 | 17.99 | 0.9179
2 | 0.23 | 1.19 | 0.9968 | 43.62 | 0.7977
3 | 0.09 | 0.93 | 0.9975 | 54.46 | 0.7516
4 | 0.41 | 1.42 | 0.9962 | 43.15 | 0.7999
5 | 0.16 | 1.42 | 0.9962 | 23.39 | 0.8947
6 | 0.17 | 1.62 | 0.9957 | 36.82 | 0.8342
7 | 1.08 | 1.14 | 0.9970 | 28.72 | 0.8690
8 | 0.11 | 0.73 | 0.9980 | 33.35 | 0.8454
9 | 1.58 | 0.78 | 0.9979 | 15.65 | 0.9274
10 | 1.80 | 1.96 | 0.9948 | 21.84 | 0.8987
Note: Arrows describe the desirable direction of the metric for better performance: ↓ implies lower is better, ↑ implies higher is better.

3.11. Overall Comparison

In order to compare the performance of the above-mentioned random forest models, we performed 100 independent runs and compared the MSE and $R^2$ metrics, as shown in Figure 5 and Figure 6, respectively. By looking at these results, one can note the following facts:
  • Random forests utilizing oversampling and hyperparameter tuning outperform the default configuration in both MSE and R 2 metrics. However, the difference in performance between oversampling and hyperparameter tuning alone is not substantial. Given that hyperparameter tuning is a relatively computationally intensive process, the oversampling approach provides comparable performance benefits with significantly lower computational costs, making it a more efficient choice in practice.
  • Although classification models initially outperformed regression models under default parameter configurations, the performance gap between the two approaches diminished significantly when oversampling and hyperparameter tuning were applied. In fact, regression models consistently outperformed classification models across all instances on the testing dataset. This finding underscores the practical advantages of using regression trees for estimating the cooling conditions of steel parts.
Figure 5. Comparison of MSE over 100 independent runs (smaller MSE values are better).
Figure 6. Comparison of the $R^2$ coefficient over 100 independent runs (higher values of $R^2$ are better).

3.12. Feature Importance

To illustrate the importance of each feature in constructing decision trees within a random forest, Figure 7 presents the degree of importance for nine independent random forest models derived from a regression under an oversampling scheme. By examining the results in Figure 7, we observed the following key findings:
  • Chemical Composition: Carbon and chromium were found to play significant roles in determining the cooling conditions of steel parts. In heat treatment, carbon is widely recognized as the most influential element, as it directly affects wear resistance, hardness, and strength. The microstructural properties of metals are influenced by both cooling rates and chemical composition [16,17,18,19]. Under air cooling, increasing the carbon content induces an increase in hardness and brittleness [32]. Our results showed that carbon was the most important feature when decision trees were constructed to predict the cooling parameters of steel parts. Furthermore, chromium, which is also considered important in heat treatment due to its effect on hardenability and corrosion resistance, exhibited relatively high importance values in our analysis.
  • Material: No consistent pattern was observed regarding the importance of the seven materials in model development. While some materials contributed more than others, there were no notable commonalities among them. Overall, this suggests that material type is not a significant factor in determining the cooling conditions of steel parts.
  • Weight: After carbon, the weight of each steel part emerged as the second most important factor in building decision trees for estimating cooling parameters. This can be attributed to the well-known mass effect in steel-part annealing: heavier parts are less likely to be quenched thoroughly to the center, whereas smaller parts cool more quickly and uniformly, making complete quenching easier. Thus, differences in mass directly influence the efficiency of heat transfer and cooling, establishing weight as a crucial factor in determining optimal cooling conditions.
  • Shape: Although it was initially thought that the ease of heat conduction due to part shape might have a significant impact, overall, none of the shapes emerged as particularly important factors. However, the washer and ring shapes showed relatively higher importance. This is likely because both shapes have a hollow center, which facilitates more efficient cooling compared to other shapes.
Based on the above findings, the amounts of chemical components—particularly carbon and chromium—as well as the weight of the steel parts were identified as key factors in constructing highly effective decision trees for determining optimal cooling conditions. Using only these explanatory variables, it is possible to build decision trees that achieve a coefficient of determination $R^2$ of 0.92, indicating performance nearly equivalent to that of regression models including all explanatory variables. Conversely, since other factors contribute only marginally to model improvement, focusing on chemical composition and weight is both efficient and practical for real-world applications. These observations align closely with the practical expertise of seasoned operators who routinely recommend cooling parameters for the metal-annealing process.
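The feature-importance analysis and the reduced-feature check described above can be sketched as follows; the data, column names, and resulting importance values are synthetic stand-ins (the actual importances are those plotted in Figure 7).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with illustrative column names; the real plant
# records behind Figure 7 are not public.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "C": rng.uniform(0.1, 0.6, 400),
    "Cr": rng.uniform(0.0, 1.5, 400),
    "weight_kg": rng.uniform(5, 500, 400),
    "shape_ring": rng.integers(0, 2, 400),
})
y = 40 * X["C"] + 10 * X["Cr"] + 0.02 * X["weight_kg"] + rng.normal(0, 1, 400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)

# Impurity-based feature importances, as plotted in Figure 7.
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))

# Retrain on the top-ranked features only and report R^2 on the held-out split,
# mirroring the reduced-feature check described above.
top = importances.sort_values(ascending=False).head(3).index
rf_top = RandomForestRegressor(random_state=0).fit(X_tr[top], y_tr)
print(rf_top.score(X_te[top], y_te))
```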
Our results highlight that both the intrinsic properties of steel parts (determined mainly by the C and Cr composition) and extrinsic factors such as part weight collectively govern the cooling parameter and, thus, the resulting microstructure. From a heat-transfer perspective, the chemical composition in terms of C and Cr influences several key thermal properties:
(1) Surface oxide characteristics: Chromium (Cr) is known to form stable and protective oxide layers [33]. These layers act as thermal barriers and reduce the rate of heat transfer from the steel to the environment. In contrast, steels with higher carbon (C) but low chromium (Cr) tend to form less-protective iron oxides, which are less effective at impeding heat flow; thus, the composition of both C and Cr influences surface heat transfer during cooling.
(2) Thermal conductivity within the steel: A higher C content reduces thermal conductivity because C atoms act as scattering centers for heat carriers (electrons and phonons), making it difficult for heat to move from the steel's interior to its surface. Similarly, the addition of Cr changes the steel's microstructure and lattice, introducing further scattering sites, which lowers thermal conductivity compared to plain carbon steels. As a result, steels with higher C and Cr content exhibit more distorted crystal structures and increased electron scattering, impeding the conduction of heat and causing them to cool more slowly.
(3) Latent heat of phase transformation: The latent heat released or absorbed during steel phase transformations is strongly influenced by C and Cr. While carbon can shift the transformation temperatures of steel, chromium can stabilize certain phases and alter the transformation pathway. These changes affect the amount of energy that must be removed for the steel to fully transform, thus potentially impacting the cooling rate.
(4) Specific heat: Carbon enables the formation of iron carbide, which changes the microstructure and thereby influences how thermal energy is absorbed and stored in the steel. Chromium primarily influences the steel's hardenability, thereby potentially lowering the critical cooling rate required to form martensite during quenching.
(5) Surface heat-transfer coefficient: Chromium (Cr) enhances the formation of a dense, protective oxide layer, potentially reducing the thermal conductivity at the surface and decreasing the rate at which heat can be transferred to the cooling medium.
Additionally, to validate our observations, we processed approximately 40 steel parts in a real-world setting using parameter values estimated from one of the constructed decision trees. The applied parameters consistently resulted in the hardenability of the steel parts falling within the standard acceptable range. These results show the practical feasibility of using random forests for predicting the cooling conditions in steel part annealing.
Further accuracy improvements are expected as data continue to be accumulated. While the random forest inherently includes stochastic elements, prediction performance can be maintained by increasing the number of trees and the tree depth and by using ensemble averaging. Additionally, since training takes less than one second (about 0.4 s when random oversampling is used), carefully selected additional data have the potential to improve the generalization of existing models by covering a wider range of cooling conditions.
Although our study has focused on air-based cooling rate parameters, further research to elucidate the feasibility of random forests in modeling other commonly used heat treatment processes with high repeatability—such as vacuum heat treatment (HT) combined with gas cooling—remains highly relevant for industry. These processes are valued for their controlled cooling rates, which help achieve consistent microstructures and mechanical properties. Given the diversity of cooling media—including water, oil, polymers, and gases—each with distinct effects on steel microstructure, developing reliable cooling rate prediction models for such standard processes has the potential to enhance predictability and quality control. This targeted approach could also streamline industrial process optimization and reduce trial and error in production environments.
Our results demonstrate the feasibility of using random forests to predict cooling rates in steel part annealing. However, as steels exhibit diverse transformation behaviors, depending on their composition and cooling conditions—as illustrated by the variety of continuous cooling transformation (CCT) diagrams available for different grades—an enhanced dataset that includes a broader range of steels, along with validation against experimental cooling diagrams, has the potential to aid in constructing robust random forest models applicable across different steel grades and heat treatment scenarios.
Figure 7. Feature importance over nine independent instances using regression under an oversampling scheme.
Furthermore, although we considered the geometry of steel parts as one feature, size and batch limits may play important roles in cooling rate prediction. For instance, oversized steel parts and larger batch sizes may lead to non-uniform temperature distribution and cooling rates, potentially impacting microstructure composition. In future work, we plan to investigate the effect of steel part size and batch limits on cooling-rate prediction within heat treatment contexts, further elucidating the reliability of random forests within practical boundaries.

4. Conclusions

We investigated the feasibility of using random forests to predict cooling-rate parameters for the heat treatment of steel parts. Our computational experiments, based on real-world records collected from the daily operations of an industrial furnace and cooling device operator, demonstrated the advantageous properties of regression random forest models, particularly when combined with an oversampling scheme. We also found that the chemical composition—specifically, the carbon and chromium content—as well as the weight of the steel parts, are key factors in constructing highly effective decision trees for estimating optimal cooling conditions. These findings highlight the effectiveness and practical value of random forests for estimating cooling conditions in steel part annealing. The ensemble approach of random forests, along with their ability to handle feature interactions, makes them well-suited to modeling useful decision trees, further emphasizing their practical applicability for industrial annealing processes.

Author Contributions

Conceptualization, I.N., V.P., Y.I. and K.N.; Methodology, I.N., V.P., Y.I. and K.N.; Software, I.N. and V.P.; Validation, I.N.; Investigation, I.N. and V.P.; Resources, I.N.; Data curation, I.N. and V.P.; Writing—original draft, V.P.; Writing—review & editing, I.N., V.P., Y.I. and K.N.; Visualization, I.N. and V.P.; Supervision, V.P., Y.I. and K.N.; Project administration, I.N., V.P., Y.I. and K.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Yi, Q.; Wu, X.; Zhuo, J.; Li, C.; Li, C.; Cao, H. An incremental data-driven approach for carbon emission prediction and optimization of heat treatment processes. Case Stud. Therm. Eng. 2025, 72, 106320.
  2. Mudda, S.; Hegde, A.; Sharma, S.; Gurumurthy, B. Optimization of various heat treatment parameters for superior combination of hardness and impact energy of AISI 1040 steel. J. Mater. Res. Technol. 2025, 38, 3900–3908.
  3. Yang, P.; Su, H.; Guo, Y.; Zhou, H.; Xia, L.; Shen, Z.; Wang, M.; Zhang, Z.; Guo, M. Influence of cooling rate during the heat treatment process on the precipitates and ductility behavior of inconel 718 superalloy fabricated by selective laser melting. Mater. Sci. Eng. A 2024, 902, 146603.
  4. Yan, M.; Zhang, H.; Yang, F.; Gui, Y.; Han, Z.; Fu, H. The Effect of Heat Treatment on the Microstructure and Mechanical Properties of Powder Metallurgy Ti-48Al Alloy. Metals 2024, 14, 661.
  5. Lütjering, G. Influence of processing on microstructure and mechanical properties of (α + β) titanium alloys. Mater. Sci. Eng. A 1998, 243, 32–45.
  6. Zhang, K.; Hu, R.; Li, J.; Yang, J.; Gao, Z. Grain refinement of 1 at.% Ta-containing cast TiAl-based alloy by cyclic air-cooling heat treatment. Mater. Lett. 2020, 274, 127940.
  7. Yao, X.; Wang, W.; Qi, X.; Lv, Y.; Yang, W.; Li, T.; Chen, J. Effects of heat treatment cooling methods on precipitated phase and mechanical properties of CoCrFeMnNi–Mo5C0.5 high entropy alloy. J. Mater. Res. Technol. 2024, 29, 3566–3574.
  8. Milenkovic, S.; Sabirov, I.; LLorca, J. Effect of the cooling rate on microstructure and hardness of MAR-M247 Ni-based superalloy. Mater. Lett. 2012, 73, 216–219.
  9. Patil, M.A.; Ghara, T.; Das, B.; Kulkarni, D.M. Influence of cooling rate on microstructure, dislocation density, and associated hardness of laser direct energy deposited Inconel 718. Surf. Coat. Technol. 2025, 495, 131575.
  10. Hall, E.O. The Deformation and Ageing of Mild Steel: III Discussion of Results. Proc. Phys. Soc. Sect. B 1951, 64, 747.
  11. Petch, N.J. The Cleavage Strength of Polycrystals. J. Iron Steel Inst. 1953, 174, 25–28.
  12. Shin, B.H.; Park, J.; Jeon, J.; Heo, S.B.; Chung, W. Effect of cooling rate after heat treatment on pitting corrosion of super duplex stainless steel UNS S 32750. Anti-Corros. Methods Mater. 2018, 65, 492–498.
  13. Skulkina, N.; Ivanov, O.; Deanisov, N.; Chekis, V. Cooling rate upon in air heat treatment and magnetic properties of amorphous soft magnetic alloys. J. Magn. Magn. Mater. 2019, 470, 156–158.
  14. Ahmed, M.; Savvakin, D.G.; Ivasishin, O.M.; Pereloma, E.V. The effect of cooling rates on the microstructure and mechanical properties of thermo-mechanically processed Ti–Al–Mo–V–Cr–Fe alloys. Mater. Sci. Eng. A 2013, 576, 167–177.
  15. Xu, J.; Zeng, W.; Zhao, Y.; Sun, X.; Du, Z. Influence of cooling rate following heat treatment on microstructure and phase transformation for a two-phase alloy. J. Alloys Compd. 2016, 688, 301–309.
  16. Feng, L.; Hu, F.; Zhou, W.; Ke, R.; Zhang, G.; Wu, K.; Qiao, W. Influences of Alloying Elements on Continuous Cooling Phase Transformation and Microstructures of Extremely Fine Pearlite. Metals 2019, 9, 70.
  17. Dey, I.; Ghosh, S.K.; Saha, R. Effects of cooling rate and strain rate on phase transformation, microstructure and mechanical behaviour of thermomechanically processed pearlitic steel. J. Mater. Res. Technol. 2019, 8, 2685–2698.
  18. Gao, M.; Yang, J.; Zhang, Y.; Song, H. Effect of Alloying Elements on Pearlite Critical Cooling Rate of U75V Rail-steel. Trans. Indian Inst. Met. 2023, 76, 665–673.
  19. Abdali, A.; Hossein Nedjad, S.; Hamed Zargari, H.; Saboori, A.; Yildiz, M. Predictive tools for the cooling rate-dependent microstructure evolution of AISI 316L stainless steel in additive manufacturing. J. Mater. Res. Technol. 2024, 29, 5530–5538.
  18. Gao, M.; Yang, J.; Zhang, Y.; Song, H. Effect of Alloying Elements on Pearlite Critical Cooling Rate of U75V Rail-steel. Trans. Indian Inst. Met. 2023, 76, 665–673. [Google Scholar] [CrossRef]
  19. Abdali, A.; Hossein Nedjad, S.; Hamed Zargari, H.; Saboori, A.; Yildiz, M. Predictive tools for the cooling rate-dependent microstructure evolution of AISI 316L stainless steel in additive manufacturing. J. Mater. Res. Technol. 2024, 29, 5530–5538. [Google Scholar] [CrossRef]
  20. Li, Q.; Wang, J.; Xue, C.; Miao, Y.; Hou, Q.; Meng, Y.; Yang, X.; Li, X. Quantifying the effects of cooling rates on Fe-rich intermetallics in recycled Al-Si-Cu alloys by machine learning. J. Alloys Compd. 2025, 1014, 178718. [Google Scholar] [CrossRef]
  21. Geng, X.; Mao, X.; Wu, H.H.; Wang, S.; Xue, W.; Zhang, G.; Ullah, A.; Wang, H. A hybrid machine learning model for predicting continuous cooling transformation diagrams in welding heat-affected zone of low alloy steels. J. Mater. Sci. Technol. 2022, 107, 207–215. [Google Scholar] [CrossRef]
  22. Afflerbach, B.T.; Francis, C.; Schultz, L.E.; Spethson, J.; Meschke, V.; Strand, E.; Ward, L.; Perepezko, J.H.; Thoma, D.; Voyles, P.M.; et al. Machine Learning Prediction of the Critical Cooling Rate for Metallic Glasses from Expanded Datasets and Elemental Features. Chem. Mater. 2022, 34, 2945–2954. [Google Scholar] [CrossRef]
  23. Schultz, L.E.; Afflerbach, B.; Voyles, P.M.; Morgan, D. Machine learning metallic glass critical cooling rates through elemental and molecular simulation based featurization. J. Mater. 2025, 11, 100964. [Google Scholar] [CrossRef]
  24. Akif Koç, M.; Ahlatci, H.; Esen, İ.; Eroğlu, M. Prediction of accelerated cooling rates in H-section steels using artificial neural networks. Case Stud. Therm. Eng. 2025, 74, 106939. [Google Scholar] [CrossRef]
  25. Liu, A.; Liu, F.; Zhao, S.; Ren, R.; Zhou, Y.; Shi, T.; Liu, J. Prediction of critical cooling rate under rapid heating rate and short holding time for pearlitic rail steel. J. Mater. Res. Technol. 2025, 36, 6534–6541. [Google Scholar] [CrossRef]
  26. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
  27. Scikit-Learn: Machine Learning in Python, Version 1.7.2. 2025. Available online: https://scikit-learn.org/stable/ (accessed on 7 September 2025).
  28. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  29. Breiman, L. Bagging predictors. Mach. Learn. 1996, 24, 123–140. [Google Scholar] [CrossRef]
  30. Breiman, L.; Friedman, J.; Olshen, R.A.; Stone, C.J. Classification and Regression Trees, 1st ed.; Chapman and Hall/CRC: New York, NY, USA, 1984. [Google Scholar] [CrossRef]
  31. Sequential Model-Based Optimization in Python, Version 0.8.1. 2025. Available online: https://scikit-optimize.github.io/stable/index.html (accessed on 7 September 2025).
  32. Wang, X.; Chen, Y.; Wei, S.; Zuo, L.; Mao, F. Effect of Carbon Content on Abrasive Impact Wear Behavior of Cr-Si-Mn Low Alloy Wear Resistant Cast Steels. Front. Mater. 2019, 6, 153. [Google Scholar] [CrossRef]
  33. Hebbale, A.M.; Ramesh, M.; Petru, J.; Chandramouli, T.; Srinath, M.; Shetty, R.K. A microstructural study and high-temperature oxidation behaviour of plasma sprayed NiCrAlY based composite coatings. Results Eng. 2025, 25, 103926. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
