Thermal Prediction for Efficient Management of Temperatures in System-in-Package (SiP) Using Machine Learning (ML)

Oukaira, Aziz; Baba, Mhamed Filali; Ettahri, Ouafaa; Lakhssassi, Ahmed

doi:10.3390/app16115468

¹ Electrical Engineering Department, Université de Moncton, Moncton, NB E1A 3E9, Canada

² Department of Engineering and Computer Science, University of Québec in Outaouais, Gatineau, QC J8X 3X7, Canada

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(11), 5468; https://doi.org/10.3390/app16115468 (registering DOI)

Submission received: 25 April 2026 / Revised: 21 May 2026 / Accepted: 28 May 2026 / Published: 31 May 2026

(This article belongs to the Special Issue Modern Trends and Applications in Thermal Energy Storage)

Abstract

System-in-Package (SiP) technology integrates processors, memory stacks, and radio-frequency modules within millimeter-scale enclosures, generating localized thermal peaks that passive cooling cannot address and that are too costly for finite-element solvers to track in real time. Machine learning offers a tractable alternative, but ensemble and generative families have not been jointly evaluated for this task. This work has three objectives: (1) to assess whether ensemble and generative models can predict node-level temperatures in SiP modules with surrogate-grade accuracy; (2) to quantify how a structural mismatch between a generative training objective and a deterministic regression task affects prediction quality; and (3) to identify the family offering the best trade-off between accuracy, computational efficiency, and interpretability. Three paradigms are compared on a finite-element dataset of 10,201 nodes: Random Forest, Extreme Gradient Boosting, and a Variational Autoencoder using normalized three-dimensional coordinates as inputs. Random Forest delivers the strongest accuracy (mean squared error 0.098 °C²; coefficient of determination 0.997); Extreme Gradient Boosting attains the lowest inference latency (0.0044 ms per node, 0.8 MB); the Variational Autoencoder incurs a two-orders-of-magnitude regression penalty consistent with its generative objective but preserves a temperature-coherent latent geometry. Ensemble methods are recommended for accurate, interpretable thermal prediction, while the Variational Autoencoder suits downstream anomaly detection.

Keywords: compact System-in-Package; Internet of Things; machine learning; extreme gradient boosting; random forest; variational autoencoder; thermal management; node-level temperature prediction

1. Introduction

Over the past decade, the progressive miniaturization of electronic systems has been shaped in large part by System-in-Package (SiP) technology, which consolidates heterogeneous components such as processors, memory chips, radio-frequency modules, and power management units into a single compact module [1,2]. The benefits are well established: shorter interconnects reduce signal latency, a reduced board footprint enables smaller form factors, and tighter integration can improve overall energy efficiency. Yet these architectural advantages are accompanied by a thermal penalty that is consistently underestimated at the design stage. When numerous high-power-density components occupy a shared, sealed enclosure, the heat they collectively generate has limited dissipation pathways, and localized temperature peaks, commonly referred to as hotspots, develop at both the die and package levels [3]. Left undetected, these peaks initiate a cascade of degradation mechanisms, including electromigration, dielectric breakdown, and solder fatigue, each of which markedly shortens the operational lifetime of the system [4].

The consequences of inadequate thermal management in SiP architectures are well documented across multiple failure modes and timescales. Thermal shock events are rapid, spatially non-uniform temperature transients that arise when a device is subjected to sudden environmental or load changes; they generate mechanical stress through steep temperature gradients and the differential thermal expansion of co-packaged materials. When unaccounted for during initial design, such events can cause irreversible damage to advanced mixed-signal circuits [4]. In three-dimensional heterogeneous stacks, this challenge is qualitatively distinct from that encountered in conventional single-die designs: the thermal behavior of one component directly perturbs its neighbors through thermal crosstalk, a coupling effect confirmed both experimentally and through simulation [5]. Furthermore, because SiP modules are sealed and their form factors exclude conventional heat-sink attachments, the junction-to-ambient thermal resistance is fundamentally higher than in discrete designs. The long-term reliability implications of this condition are significant, as thermal cycling in critical layers such as Through-Silicon Vias (TSVs) and solder joints represents a dominant wear-out mechanism that can bring about system failure well before the intended end of life [6,7].

Traditional thermal management solutions such as heat sinks, thermal interface materials, and fan-based cooling systems were designed for earlier, simpler chip architectures in which heat was distributed more evenly across the device surface. In the context of modern SiP modules, these conventional approaches fall short for several compounding reasons. First, the physical form factor of SiP modules precludes bulky cooling hardware. Second, the power consumption profiles of co-packaged components vary continuously with workload, rendering static cooling strategies fundamentally ineffective. Third, and perhaps most consequentially for the design workflow, physically accurate simulation tools, such as finite element method (FEM) analysis and computational fluid dynamics (CFD), are too computationally demanding for real-time thermal monitoring and prohibitively expensive to run repeatedly during the design exploration phase [1,8]. More advanced structural solutions, such as the Chip Cooling Laminate Chip (CCLC) technology demonstrated in prior work [1], have shown promising results in achieving more uniform temperature distributions within SiP modules, but their widespread adoption remains constrained by manufacturing complexity and associated cost. Complementary efforts in embedded thermal sensing, including ring-oscillator-based monitoring networks [1] and optimized thermal sensor allocation strategies for on-chip peak detection [2], confirm that proactive, predictive thermal management is not merely desirable but necessary for next-generation compact electronic systems. Precision temperature monitoring approaches, including those leveraging high-precision CMOS sensing for IoT-grade applications [9], further underscore the growing demand for accurate, low-latency thermal characterization across integration scales.

Machine learning offers a concrete way out of this impasse. Rather than relying on numerical solvers, ML models learn thermal patterns directly from data: given a dataset of node coordinates and their corresponding temperatures, a trained algorithm can uncover the hidden spatial relationships governing thermal distribution at a fraction of the computational cost. This makes ML approaches both fast at inference and flexible in adapting to the dynamic, heterogeneous conditions that characterize SiP thermal environments [10,11]. The potential of this data-driven paradigm has already been validated in adjacent domains. Cloud data center operators have deployed ML-based thermal prediction to manage server temperatures efficiently under variable workloads [10], and surrogate ML models have been applied to optimize thermal energy storage in electronics cooling systems, leveraging phase-change materials [8]. Beyond cooling optimization, machine learning has also been applied to IoT-oriented anomaly detection [12], short-term electrical load forecasting [13], and real-time temperature estimation in power devices [14], reflecting the breadth of data-driven thermal and reliability modeling in modern electronic systems.

Among the many ML algorithms available, two ensemble methods stand out for structured prediction tasks: XGBoost (Extreme Gradient Boosting) and Random Forest. XGBoost builds a sequence of decision trees, where each new tree tries to correct the mistakes of the previous ones, resulting in a highly accurate model that handles complex, non-linear data relationships well. In electronics reliability studies, XGBoost has achieved residual prediction errors as low as 0.01–0.02% for component lifespan estimation under accelerated stress conditions, outperforming both neural networks and K-nearest-neighbor classifiers [15]. Random Forest, by contrast, trains a large collection of independent decision trees in parallel and aggregates their outputs by averaging. This mechanism makes the method naturally resistant to overfitting, and the associated feature importance scores provide engineers with interpretable insight into which physical parameters most strongly influence thermal behavior. In a representative study on medium-voltage switchgear, a Random Forest model successfully predicted abnormal temperature rises from operating parameters including current, voltage, load status, and ambient temperature. This approach enabled early fault detection before equipment failure occurred [16].

A third paradigm explored in this study is the Variational Autoencoder (VAE), a class of deep generative models that has attracted growing interest for anomaly detection and data augmentation in engineering contexts where labeled failure data are scarce [17,18,19]. Rather than directly predicting an output label, a VAE first compresses input data into a compact probabilistic latent representation and then reconstructs it, with the training objective being the maximization of the Evidence Lower Bound (ELBO) rather than the minimization of a direct regression loss. This architectural distinction raises a fundamental question that has not been addressed in the SiP thermal domain: when applied to node-level temperature prediction, does the statistical richness of the VAE latent space compensate for its weaker point-estimation accuracy relative to optimized ensemble methods?

It is important to note that other ML paradigms have demonstrated promising results in related thermal prediction domains. Graph convolutional networks (GCNs), which explicitly exploit the graph-structured topology of mesh or circuit netlist representations, have recently been applied to chip-level thermal field estimation [20] and could, in principle, leverage the spatial adjacency between SiP nodes more directly than coordinate-based methods. Physics-Informed Neural Networks (PINNs) incorporate governing heat-transfer partial differential equations directly into the training loss, enabling generalization to unseen boundary conditions without full retraining. The present study deliberately excludes these paradigms for two reasons. First, the available dataset provides only nodal coordinates and temperatures without explicit graph connectivity or boundary condition labels, making GCN and PINN approaches inapplicable without additional data engineering effort that is orthogonal to the present comparative objective. Second, this study aims to establish a rigorous baseline using off-the-shelf, widely accessible algorithms that practitioners can deploy immediately without specialized graph infrastructure. Extensions to GCN- and PINN-based architectures are identified as a primary direction for future work (Section 5).

This study addresses that question directly. We present a complete, end-to-end comparison of three ML paradigms, namely XGBoost, Random Forest, and Variational Autoencoders, applied specifically to the problem of predicting node-level temperatures within SiP architectures. To our knowledge, no prior study has benchmarked these three model families under identical experimental conditions for this task. The pipeline encompasses every stage of the modeling process: raw data cleaning, coordinate normalization, model training, hyperparameter optimization via RandomizedSearchCV, and multi-dimensional evaluation combining mean squared error regression and confusion matrix classification into high, medium, and low temperature categories. This study is organized around three explicit research objectives:

1. Predictive accuracy: Can ensemble and generative ML models predict SiP node temperatures accurately enough to serve as real-time complements to physical sensor networks?
2. Objective compatibility: How does the structural mismatch between a generative ELBO objective and a deterministic regression task quantitatively affect prediction accuracy and classification reliability?
3. Practical guidance: Which model family offers the best trade-off among predictive accuracy, computational efficiency, and engineering interpretability for deployment in thermal management pipelines?

The scientific contribution of this work is threefold: (i) concrete empirical evidence of how each ML family performs under SiP-specific conditions; (ii) a quantitative demonstration of the impact of systematic hyperparameter tuning on predictive accuracy; and (iii) a formal characterization of the structural incompatibility between generative model training objectives and deterministic thermal regression, a finding with direct practical implications for engineers implementing ML-based thermal management pipelines in next-generation compact electronic systems [17,19].

The remainder of this paper is organized as follows. Section 2 reviews the relevant literature and presents the mathematical foundations of each ML model. Section 3 describes the study objectives and the full experimental methodology, including data preparation, model architecture, and evaluation metrics. Section 4 presents and discusses the comparative results. Section 5 concludes the paper and outlines directions for future research.

2. Background, State of the Art, and Theoretical Foundations

2.1. Current Context and Challenges in SiP Thermal Management

The thermal challenge inherent to System-in-Package (SiP) architectures stems from a combination of structural and physical constraints that distinguish this integration paradigm from conventional single-die designs. As documented, the density trend associated with heterogeneous integration has pushed modern multicore processors into operating regimes where thermal hotspots are no longer transient anomalies but permanent design constraints, with uncontrolled temperature gradients accelerating electromigration, oxide breakdown, and solder fatigue by orders of magnitude [21]. In an SiP context, this situation is further compounded by thermal crosstalk between co-packaged components, a coupling effect confirmed experimentally and through simulation in 2.5D and 3D heterogeneous stacks [22], and by the structurally elevated junction-to-ambient thermal resistance that results from the sealed, heat-sink-incompatible form factor of these modules. It was demonstrated that reducing this resistance by one order of magnitude is a prerequisite for eliminating active refrigeration cooling in next-generation high-performance computing environments, underscoring the criticality of smarter thermal management strategies [22].

From a modeling perspective, the finite element method (FEM) and computational fluid dynamics (CFD) remain the reference standard for accurate thermal analysis. However, their computational cost scales super-linearly with problem dimensionality, which makes them impractical for real-time monitoring or iterative design optimization across heterogeneous SiP fleets. This limitation has led to growing interest in data-driven surrogate models, where machine learning approaches are used to replace physics-based simulations for specific tasks [21]. Recent surveys confirm that ensemble ML methods and deep neural networks can achieve thermal prediction accuracies competitive with FEM, while operating at inference speeds several orders of magnitude faster [20,21]. Representative results include the work of Chen et al. [23], who demonstrated that adaptive ML predictors consistently reduce peak temperature by several degrees Celsius compared to classical threshold-based control in NoC architectures, and that of Zhang et al. [24], who validated lightweight ML thermal models, achieving temperature reductions of up to 11.9 degrees Celsius with no associated performance penalty in HPC environments. More recently, Miao et al. [20] showed that graph-based neural networks can predict full-chip temperature maps for large-scale multicore chips in real time, substantially outperforming traditional compact thermal models in both speed and spatial resolution.

2.2. Random Forest: Algorithm, Theory, and Relevance

2.2.1. Mathematical Foundations

The Random Forest algorithm, introduced by Breiman [25], is a supervised ensemble learning method that constructs a large set of diverse decision trees and combines their predictions. Its effectiveness stems from two main mechanisms: reducing variance through bootstrap aggregation (bagging) and decorrelating trees by randomly selecting subsets of features at each split. Together, these mechanisms ensure that the ensemble generalizes substantially better than any individual tree.

Step 1: Bootstrap aggregation.

Let the training dataset be

S_{n} = \{(X_{1}, Y_{1}), \dots, (X_{n}, Y_{n})\}, \quad X_{i} \in R^{p}, \quad Y_{i} \in R

(1)

where p is the number of input features (spatial coordinates) and n is the number of training samples. B bootstrap subsets are drawn by sampling $S_{n}$ with replacement. The k-th tree $h (x; Θ_{k})$ is trained on its own bootstrap subset, where $Θ_{k}$ encodes the random sampling seed. The samples not drawn for a given tree form its Out-of-Bag (OOB) set, which serves as an unbiased validation set without requiring a separate hold-out partition.
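The bootstrap step can be illustrated with a minimal sketch, using only the Python standard library. The helper name `bootstrap_with_oob` and the fixed seed are assumptions made for illustration; the node count is the one quoted in the abstract. The sketch draws one bootstrap sample and collects the indices that were never drawn, which form the OOB set of that tree.

```python
import random


def bootstrap_with_oob(n_samples, rng):
    """Draw n_samples indices with replacement; return (in_bag, oob) index lists."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    drawn = set(in_bag)
    # Indices never drawn form the Out-of-Bag set for this tree.
    oob = [i for i in range(n_samples) if i not in drawn]
    return in_bag, oob


rng = random.Random(0)  # fixed seed (illustrative), plays the role of Theta_k
n = 10_201              # node count quoted in the abstract
in_bag, oob = bootstrap_with_oob(n, rng)
oob_fraction = len(oob) / n
print(f"in-bag draws: {len(in_bag)}, OOB samples: {len(oob)} ({oob_fraction:.1%})")
```

For large n, the probability that a given sample is never drawn is (1 - 1/n)^n, which tends to 1/e, so roughly 37% of the samples are expected to land in the OOB set of each tree.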

Step 2: Random feature selection and node splitting.

At each split node t, the algorithm draws a random subset of $m_{try} \leq p$ features (conventional setting: $m_{try} \approx p / 3$ for regression). The best split $(j^{*}, s^{*})$, defined by a feature index and a threshold, is the one that minimizes the sample-weighted impurity of the two resulting child nodes. The impurity of a node t is its intra-node Mean Squared Error (MSE):

i (t) = \frac{1}{N_{t}} \sum_{x_{i} \in t} {(y_{i} - \bar{y})}^{2}, \quad \bar{y} = \frac{1}{N_{t}} \sum_{x_{i} \in t} y_{i}

(2)

where $N_{t}$ is the number of samples in node t, $y_{i}$ is the actual target temperature, and $\bar{y}$ is the node mean.
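To make the split criterion concrete, the following sketch searches for the best threshold on a single feature by minimizing the sample-weighted MSE impurity of the two children, with the node impurity computed as in Equation (2). The function names and the toy data are hypothetical and are not taken from the study's dataset; a real Random Forest repeats this search over the $m_{try}$ candidate features at every node.

```python
def mse_impurity(ys):
    """Intra-node MSE impurity i(t) of Equation (2)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / len(ys)


def best_split(xs, ys):
    """Return (threshold, weighted child impurity) of the best split on one feature."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    best_threshold, best_score = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between identical feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * mse_impurity(left) + len(right) * mse_impurity(right)) / n
        if score < best_score:
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_score = score
    return best_threshold, best_score


# Hypothetical toy data: one normalized coordinate and a temperature in °C.
x_norm = [0.0, 0.125, 0.25, 0.75, 0.875, 1.0]
temps = [40.0, 41.0, 40.5, 70.0, 71.0, 69.5]
threshold, child_impurity = best_split(x_norm, temps)
print(threshold, child_impurity, mse_impurity(temps))
```

On this toy data the selected threshold falls in the gap between the cool and the hot group, and the weighted child impurity is far below the impurity of the unsplit node.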

Step 3: Prediction by averaging.

For a new input x, the forest prediction is the arithmetic mean of all B individual tree outputs:

{\hat{f}}_{RF} (x) = \frac{1}{B} \sum_{k = 1}^{B} h (x; Θ_{k})

(3)

Generalization bound.

Breiman’s theoretical analysis [25] shows that the ensemble’s prediction error $P E^{*}$ satisfies

P E^{*} \leq \bar{ρ} \cdot \frac{1 - s^{2}}{s^{2}}

(4)

where $\bar{ρ}$ is the mean pairwise inter-tree correlation and s is the mean individual tree predictive strength. Random feature selection keeps $\bar{ρ}$ small; bootstrap training keeps s large. This antagonistic tuning via $m_{try}$ makes the error bound controllable without any assumption on the data distribution, which is a property of particular relevance for heterogeneous SiP thermal datasets.

Feature importance (mean decrease in impurity).

The native feature importance metric ranks each input feature j by its cumulative contribution to impurity reduction across all trees and all nodes where j was selected as the splitting variable:

FI (j) = \frac{1}{B} \sum_{k = 1}^{B} \sum_{\substack{t \in T_{k} \\ j \text{ used at } t}} \frac{N_{t}}{N} Δ i (t, j)

(5)

where $T_{k}$ is the set of split nodes of tree k, $Δ i (t, j)$ is the impurity reduction achieved by splitting node t on feature j, and N is the total number of training samples. This metric is used in our study to identify which spatial coordinates ($X_{norm}$, $Y_{norm}$, $Z_{norm}$) contribute most to temperature prediction within the SiP mesh.
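Equation (5) amounts to a weighted sum over split records, as the short sketch below shows. The split records are invented purely for illustration and do not reflect any result of this study; the function name `mean_decrease_impurity` and the tuple layout (feature, N_t, impurity reduction) are likewise assumptions.

```python
from collections import defaultdict


def mean_decrease_impurity(forest, n_total):
    """Equation (5): average over trees of the sum of (N_t / N) * delta_i per feature."""
    importance = defaultdict(float)
    for tree in forest:
        for feature, n_t, delta_i in tree:
            importance[feature] += (n_t / n_total) * delta_i
    n_trees = len(forest)
    return {feature: value / n_trees for feature, value in importance.items()}


# Hypothetical split records for B = 2 trees trained on N = 100 samples.
forest = [
    [("Z_norm", 100, 30.0), ("X_norm", 60, 5.0), ("Y_norm", 40, 2.0)],
    [("Z_norm", 100, 28.0), ("Y_norm", 50, 4.0)],
]
fi = mean_decrease_impurity(forest, n_total=100)

# Normalizing so the scores sum to 1 makes them easier to compare.
total = sum(fi.values())
normalized = {feature: value / total for feature, value in fi.items()}
print(fi)
print(normalized)
```

Splits near the root act on more samples, so the factor N_t / N gives them more weight than splits deep in the tree.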

2.2.2. State of the Art (2023–2025): Random Forest for Thermal and Microelectronic Prediction

The recent literature consistently positions Random Forest as a first-choice surrogate for multiphysics prediction tasks in electronics and thermal systems. Acharya et al. [26] applied Random Forest alongside Support Vector Regression and neural networks to predict the steady-state and transient junction temperatures of a silicon carbide (SiC) half-bridge power electronics module, using a data bank of 2592 steady-state and 1200 transient simulation samples. Random Forest achieved R² values exceeding 99.5% across all thermal response variables, with SHAP-based feature attribution confirming that encapsulant thermal conductivity and heat-sink cooling conditions were the dominant drivers of hotspot temperature, directly validating the interpretability advantage over black-box alternatives [26].

In the domain of LED junction temperature estimation, Azarifar et al. [27] applied Random Forest and XGBoost regressors to predict junction temperatures of white LEDs from optical characteristics (color coordinates and luminous flux), using a dataset compiled from manufacturer datasheets and dynamic opto-thermal measurements. The study demonstrated that Random Forest provided reliable junction temperature predictions without direct temperature measurement, achieving competitive accuracy even under wafer-level probing constraints, where infrared access is unavailable. The concurrent benchmarking of Random Forest against XGBoost in this study provides a direct methodological precedent for the comparative protocol adopted here [27].

He and Ding [28] deployed a just-in-time learning (JITL) Random Forest framework for real-time prediction of temperature-sensitive emissions in thermal power plant combustion systems, using 17,281 operational records spanning current, inlet temperature, coal feed rate, and airflow parameters. The study demonstrated that the JITL-RF model outperformed static regression baselines by adapting continuously to operating condition shifts, a robustness property directly transferable to the dynamic workload conditions of SiP thermal management [28].

2.2.3. Motivation and Expected Contribution in This Study

Taken together, these studies make a clear case for including Random Forest in this benchmark. Three properties are particularly relevant to the SiP thermal prediction problem:

Handling heterogeneous features without preprocessing: SiP datasets combine spatial coordinates, normalized node identifiers, and thermal labels that span different scales and distributions. Random Forest handles such heterogeneity natively, without requiring standardization of the target or feature transformation beyond the normalization already applied.
Physical interpretability via feature importance: In an SiP context, knowing which spatial coordinate (X, Y, or Z) most influences temperature is actionable knowledge for thermal design. The MDI-based feature importance metric provides this insight at negligible computational cost, making Random Forest the most interpretable of the three candidates evaluated here.
Robustness to the ensemble size: The OOB error provides an unbiased estimate of generalization performance throughout training, allowing early stopping without a separate validation set, which is an advantage when labeled SiP thermal data is limited.

On the basis of references [26,27,28], we expect Random Forest to deliver the best regression accuracy among the three models evaluated, primarily because its bagging mechanism is better aligned with the regression objective (MSE minimization) than the generative objective optimized by the VAE.

2.3. XGBoost: Algorithm, Theory, and Relevance

2.3.1. Mathematical Foundations

XGBoost (Extreme Gradient Boosting), formalized by Chen and Guestrin [29], extends gradient boosting by combining a second-order Taylor approximation of the loss function with explicit L1/L2 regularization of tree structure, yielding a method that is simultaneously more accurate, faster, and more resistant to overfitting than standard gradient boosted trees.

Additive model and regularized objective.

The prediction for sample i is the sum of outputs of K sequential trees:

{\hat{y}}_{i} = \sum_{k = 1}^{K} f_{k} (x_{i}), f_{k} \in F

(6)

where $F$ is the space of CART regression trees. The model parameters $ϕ = \{f_{1}, \dots, f_{K}\}$ are found by minimizing the regularized objective:

L (ϕ) = \sum_{i = 1}^{n} l ({\hat{y}}_{i}, y_{i}) + \sum_{k = 1}^{K} Ω (f_{k})

(7)

with the regularization term

Ω (f) = γ T + \frac{1}{2} λ {∥ ω ∥}^{2}

(8)

where $l ({\hat{y}}_{i}, y_{i}) = {({\hat{y}}_{i} - y_{i})}^{2} / 2$ is the MSE loss, T is the number of leaves, $ω \in R^{T}$ is the leaf score vector, $γ$ is the minimum gain threshold for a new split, and $λ$ is the L2 (Ridge) regularization coefficient on leaf scores.

Second-Order Taylor Expansion.

At step t, a new tree $f_{t}$ is added to the current model ${\hat{y}}_{i}^{(t - 1)}$. The step-t objective is

L^{(t)} = \sum_{i = 1}^{n} l (y_{i}, {\hat{y}}_{i}^{(t - 1)} + f_{t} (x_{i})) + Ω (f_{t})

(9)

Applying a second-order Taylor expansion around ${\hat{y}}_{i}^{(t - 1)}$ and dropping constant terms yields

L^{(t)} \approx \sum_{i = 1}^{n} [g_{i} f_{t} (x_{i}) + \frac{1}{2} h_{i} f_{t}^{2} (x_{i})] + Ω (f_{t})

(10)

where

g_{i} = \partial_{\hat{y}} l (y_{i}, \hat{y}) |_{{\hat{y}}_{i}^{(t - 1)}} \quad \text{and} \quad h_{i} = \partial_{\hat{y}}^{2} l (y_{i}, \hat{y}) |_{{\hat{y}}_{i}^{(t - 1)}}

are the first-order gradient and second-order Hessian of the loss, respectively.

Optimal Leaf Weights and Structure Score.

For leaf j with index set $I_{j} = \{i ∣ q (x_{i}) = j\}$, where q maps each sample to the index of the leaf it falls into, define the aggregated statistics:

G_{j} = \sum_{i \in I_{j}} g_{i} and H_{j} = \sum_{i \in I_{j}} h_{i}

(11)

The closed-form optimal leaf weight is

ω_{j}^{*} = - \frac{G_{j}}{H_{j} + λ}

(12)

Substituting $ω_{j}^{*}$ back into (10) yields the Structure Score:

L^{*} = - \frac{1}{2} \sum_{j = 1}^{T} \frac{G_{j}^{2}}{H_{j} + λ} + γ T

(13)

For any candidate split of node t into left (L) and right (R) child nodes, the split gain is

Gain = \frac{1}{2} [\frac{G_{L}^{2}}{H_{L} + λ} + \frac{G_{R}^{2}}{H_{R} + λ} - \frac{{(G_{L} + G_{R})}^{2}}{H_{L} + H_{R} + λ}] - γ

(14)

A split is accepted only if $Gain > 0$; otherwise, the branch is pruned. This condition makes XGBoost both noise-robust and computationally lean: $γ$ directly controls the minimum information gain required to justify tree complexity.
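The leaf-weight and gain formulas can be checked by hand on a small example. The sketch below assumes the squared-error loss $l = {(\hat{y} - y)}^{2} / 2$, for which the gradient is $\hat{y} - y$ and the Hessian is 1; the function names, the four toy temperatures, and the values of $λ$ and $γ$ are hypothetical choices made for illustration, and the code is not the XGBoost library implementation.

```python
def grad_hess_mse(y_true, y_pred):
    """Gradient and Hessian of l = (y_pred - y_true)^2 / 2 with respect to y_pred."""
    g = [p - t for t, p in zip(y_true, y_pred)]
    h = [1.0 for _ in y_true]
    return g, h


def leaf_weight(g, h, lam):
    """Optimal leaf weight of Equation (12): -G / (H + lambda)."""
    return -sum(g) / (sum(h) + lam)


def leaf_score(g, h, lam):
    """Per-leaf term G^2 / (H + lambda) appearing in Equations (13) and (14)."""
    return sum(g) ** 2 / (sum(h) + lam)


def split_gain(g, h, left, right, lam, gamma):
    """Split gain of Equation (14) for index lists `left` and `right`."""
    g_l, h_l = [g[i] for i in left], [h[i] for i in left]
    g_r, h_r = [g[i] for i in right], [h[i] for i in right]
    parent = leaf_score(g_l + g_r, h_l + h_r, lam)
    return 0.5 * (leaf_score(g_l, h_l, lam) + leaf_score(g_r, h_r, lam) - parent) - gamma


# Hypothetical toy node: two cool and two hot samples, current prediction = their mean.
y_true = [40.0, 41.0, 70.0, 71.0]
y_pred = [55.5] * 4
g, h = grad_hess_mse(y_true, y_pred)

lam, gamma = 1.0, 1.0  # illustrative regularization values
gain = split_gain(g, h, left=[0, 1], right=[2, 3], lam=lam, gamma=gamma)
w_left = leaf_weight(g[:2], h[:2], lam)
w_right = leaf_weight(g[2:], h[2:], lam)
print(gain, w_left, w_right)
```

Here G_L = 30 and G_R = -30 with H_L = H_R = 2, so each child contributes 900 / 3 = 300, the parent contributes 0, and the gain is 0.5 × 600 - 1 = 299. The leaf weights are -10 and +10, which move the predictions toward the two group means while $λ$ shrinks the step slightly.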

Remark 1.

For the MSE loss $l = {(y_{i} - {\hat{y}}_{i})}^{2} / 2$, the gradient and Hessian simplify to $g_{i} = {\hat{y}}_{i} - y_{i}$ and $h_{i} = 1$, respectively. Equation (12) then reduces to a shrunken mean of the residuals in the leaf, $ω_{j}^{*} = \sum_{i \in I_{j}} (y_{i} - {\hat{y}}_{i}) / (|I_{j}| + λ)$, and the Gain Formula (14) measures the reduction in total squared-error loss achieved by the split, net of the complexity penalty $γ$.

2.3.2. State of the Art (2023–2025): XGBoost for Thermal and Electronic Prediction

XGBoost has established a strong track record for thermal and materials property prediction in electronics. Bhandari et al. [30] applied XGBoost to predict the thermal conductivity of additively manufactured alloys, achieving MAE = 0.86 W/m·K and R² = 0.96 on the test set, and confirmed through SHAP analysis that composition features interact non-linearly, a finding directly analogous to the non-linear coordinate–temperature relationships expected in SiP systems. Crucially, this study demonstrated that hyperparameter tuning (via grid search over tree depth, learning rate, and regularization) reduced test MAE by 77% compared to default parameters, a compelling precedent for the RandomizedSearchCV optimization applied in the present work [30].

Mathiyazhagan S. [31] deployed XGBoost for microchannel heat-sink thermal performance prediction under non-uniform heat load conditions, training on 560 experimental data points spanning 22 geometric and boundary condition features. XGBoost achieved average R² = 0.98 and MAE = 2.1 °C across six thermal response variables including hotspot temperature, thermal resistance, and Nusselt number, substantially outperforming Artificial Neural Networks, LightGBM, and K-Nearest Neighbors on the same dataset. The authors attributed XGBoost’s superiority to its ability to capture interaction effects between geometric parameters and boundary conditions through the second-order Taylor expansion, which aligns precisely with the mechanism identified in the mathematical foundations above [31].

Miao et al. [20] demonstrated, in their benchmark of ML methods for real-time chip temperature prediction, that while graph convolutional networks achieve the highest spatial resolution, XGBoost provides the best accuracy–latency trade-off for per-node temperature estimation from structured input features, which is the exact task formulation of the present study. Their result motivates XGBoost as the computationally lightweight complement to graph-based approaches for SiP thermal management applications.

2.3.3. Motivation and Expected Contribution in This Study

Three features make XGBoost particularly suited for this problem:

Second-order thermal sensitivity: the Hessian $h_{i}$ captures the curvature of the MSE loss surface, enabling XGBoost to model rapid, non-monotonic temperature variations around hotspots with higher fidelity than gradient-only (first-order) methods such as standard gradient boosting or linear regressors.
Regularization-driven noise robustness: The $λ$ parameter (L2 penalty on leaf scores) and the $γ$ parameter (minimum split gain) together act as an adaptive filter, suppressing the influence of noisy or outlier temperature readings, a critical property in SiP monitoring where individual sensor readings may be corrupted by electromagnetic interference or thermal transients.
Efficient hyperparameter optimization: The closed-form gain formula enables XGBoost to evaluate every candidate split in O(p × N) time, making RandomizedSearchCV over large hyperparameter grids computationally feasible even on modest hardware, directly enabling the optimization protocol applied in this study.

We anticipate that XGBoost will achieve lower MSE than the VAE but marginally higher MSE than Random Forest after optimization, consistent with the patterns observed in [27,31]. The relative performance gap between these two ensemble methods is a primary empirical output of this study.

2.4. Variational Autoencoder: Algorithm, Theory, and Relevance

2.4.1. Mathematical Foundations

The Variational Autoencoder (VAE), introduced by Kingma and Welling [32] and comprehensively reviewed in [33], is a deep generative model that learns a probabilistic mapping between observed data and a structured continuous latent space. Unlike classical deterministic autoencoders, a VAE parameterizes the encoder output as a probability distribution, typically a diagonal Gaussian, enabling both generation of new samples and principled uncertainty quantification.

Generative Process.

Let $x \in R^{d}$ be an observed data vector (node coordinates + temperature) and $z \in R^{k}$ a latent variable ($k ≪ d$). The VAE assumes

z \sim p (z) = N (0, I)

(15)

x \sim p_{θ} (x ∣ z)

(16)

where (15) is the standard normal prior over the latent space and (16) is the conditional likelihood parameterized by the decoder network $θ$.

Marginal Log-Likelihood and Intractability.

The training objective is to maximize the marginal log-likelihood

\log p_{θ} (x) = \log \int p_{θ} (x ∣ z) p (z) d z

(17)

This integral is intractable for neural network decoders because it has no closed form and requires integrating over every value of the continuous latent variable z. Variational inference introduces an approximate posterior $q_{ϕ} (z ∣ x)$, parameterized by the encoder network $ϕ$, to make the problem tractable.

KL Divergence and the Evidence Lower Bound (ELBO).

The dissimilarity between the approximate posterior $q_{ϕ} (z ∣ x)$ and the true (intractable) posterior $p_{θ} (z ∣ x)$ is measured by the Kullback–Leibler (KL) divergence:

D_{KL} (q_{ϕ} (z ∣ x) ∥ p_{θ} (z ∣ x)) \geq 0

(18)

Minimizing this divergence is equivalent to maximizing the Evidence Lower Bound (ELBO), derived by applying Jensen’s inequality to (17):

L (x, ϕ, θ) = \underbrace{E_{q_{ϕ} (z ∣ x)} [\log p_{θ} (x ∣ z)]}_{\text{reconstruction term}} - \underbrace{D_{KL} (q_{ϕ} (z ∣ x) ∥ p (z))}_{\text{KL regularization}}

(19)
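The link between the divergence in (18) and the bound in (19) can be made explicit with the standard decomposition of the marginal log-likelihood, written here as a worked step in the notation of (17)–(19):

```latex
\begin{aligned}
\log p_{\theta}(x)
 &= E_{q_{\phi}(z \mid x)}\!\left[\log \frac{p_{\theta}(x \mid z)\, p(z)}{q_{\phi}(z \mid x)}\right]
  + D_{KL}\!\left(q_{\phi}(z \mid x) \,\|\, p_{\theta}(z \mid x)\right) \\
 &= L(x, \phi, \theta)
  + D_{KL}\!\left(q_{\phi}(z \mid x) \,\|\, p_{\theta}(z \mid x)\right)
\end{aligned}
```

Because the left-hand side does not depend on $ϕ$, reducing the divergence in (18) necessarily increases $L$; and because that divergence is non-negative, $L$ never exceeds $log p_{θ} (x)$.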

The two terms of (19) have complementary roles:

Reconstruction term: $E_{q_{ϕ}} [log p_{θ} (x ∣ z)]$ rewards the decoder for accurately reconstructing x from latent sample z. For a Gaussian decoder with fixed variance, this term reduces, up to a positive scale factor and an additive constant, to the negative squared error between x and its reconstruction $\hat{x}$.
KL regularization: $- D_{KL} (q_{ϕ} (z ∣ x) ∥ p (z))$ penalizes the encoder for deviating from the standard normal prior, ensuring a smooth, well-structured latent space that supports interpolation and anomaly scoring.

Closed-Form KL for Diagonal Gaussian Encoder.

When $q_{ϕ} (z ∣ x) = N (μ_{ϕ}, diag (σ_{ϕ}^{2}))$, the KL term has the closed form:

D_{KL} = - \frac{1}{2} \sum_{j = 1}^{k} [1 + log σ_{ϕ, j}^{2} - μ_{ϕ, j}^{2} - σ_{ϕ, j}^{2}]

(20)

where j indexes the k dimensions of the latent space, and $μ_{ϕ, j}$, $σ_{ϕ, j}^{2}$ are the mean and variance of the j-th latent dimension output by the encoder.
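As a minimal numerical sketch of (20), the closed-form KL term can be evaluated directly from the encoder outputs. The function name `kl_diag_gaussian` and the choice to pass the log-variance rather than the variance are illustrative assumptions, not details taken from the study's implementation.

```python
import numpy as np

def kl_diag_gaussian(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions (Eq. 20)."""
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=-1)

# Encoder output identical to the prior: the divergence vanishes.
kl_prior = kl_diag_gaussian([0.0, 0.0], [0.0, 0.0])

# Shifting the mean by one unit in a single dimension costs 0.5 nats.
kl_shifted = kl_diag_gaussian([1.0, 0.0], [0.0, 0.0])
```

Working with the log-variance keeps the variance positive without any explicit constraint, which is why many VAE implementations have the encoder emit it directly.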

Reparameterization Trick.

Standard backpropagation cannot pass gradients through a stochastic sampling step $z \sim N (μ_{ϕ}, σ_{ϕ}^{2})$. The VAE resolves this by reparameterizing:

z = μ_{ϕ} (x) + σ_{ϕ} (x) ⊙ ε, ε \sim N (0, I)

(21)

where ⊙ denotes element-wise multiplication and $ε$ is an independent noise vector not connected to the computation graph. This factorization isolates all stochasticity in $ε$, allowing gradients to flow back through $μ_{ϕ}$ and $σ_{ϕ}$ via standard automatic differentiation.
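The sampling step in (21) can be sketched in a few lines of NumPy. This illustrates only the forward computation (NumPy performs no automatic differentiation); the function name `reparameterize` and the example values are assumptions chosen for illustration.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), as in Eq. (21)."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.exp(0.5 * np.asarray(log_var, dtype=float))
    eps = rng.standard_normal(mu.shape)  # all randomness is confined to eps
    return mu + sigma * eps

rng = np.random.default_rng(42)
n_samples = 100_000
mu = np.tile([1.0, -2.0], (n_samples, 1))
log_var = np.tile(np.log([0.25, 4.0]), (n_samples, 1))  # sigma = 0.5 and 2.0
z = reparameterize(mu, log_var, rng)
```

The empirical mean and standard deviation of `z` approach the requested parameters, which confirms that the deterministic transform of `eps` reproduces the intended Gaussian. In an autodiff framework, the same expression lets gradients reach `mu` and `sigma` because `eps` is treated as a constant input.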

Anomaly Score.

A trained VAE detects out-of-distribution inputs through their reconstruction error. For input x, the anomaly score is

A (x) = {∥ x - \hat{x} ∥}^{2}

(22)

where $\hat{x} = E_{z \sim q_{ϕ} (z ∣ x)} [E_{p_{θ} (x' ∣ z)} [x']]$ is the reconstructed input, i.e., the decoder mean averaged over latent samples drawn from the encoder. Inputs outside the training distribution generate high $A (x)$ since their latent codes map to low-probability regions of $p (z)$, causing the decoder to produce poor reconstructions. This property underpins the VAE’s utility for unsupervised thermal anomaly detection [18,34].
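A minimal sketch of the score in (22) follows. The quantile-based threshold in `flag_anomalies` is an illustrative assumption: the text does not specify how a decision threshold would be chosen, and the example vectors are made up.

```python
import numpy as np

def anomaly_score(x, x_hat):
    """Squared Euclidean reconstruction error per sample, as in Eq. (22)."""
    x = np.atleast_2d(np.asarray(x, dtype=float))
    x_hat = np.atleast_2d(np.asarray(x_hat, dtype=float))
    return np.sum((x - x_hat) ** 2, axis=1)

def flag_anomalies(scores, reference_scores, quantile=0.99):
    """Flag scores above a quantile of scores seen in normal operation (assumed rule)."""
    threshold = np.quantile(reference_scores, quantile)
    return scores > threshold, threshold

# Two hypothetical 4-feature inputs and their reconstructions.
x = np.array([[0.2, 0.5, 0.1, 0.3], [0.2, 0.5, 0.1, 0.9]])
x_hat = np.array([[0.2, 0.5, 0.1, 0.3], [0.2, 0.5, 0.1, 0.4]])
scores = anomaly_score(x, x_hat)

reference = np.linspace(0.0, 0.1, 101)  # stand-in for scores on normal data
flags, threshold = flag_anomalies(scores, reference)
```

The first sample is reconstructed exactly and scores zero; the second deviates in one feature by 0.5 and scores 0.25, which exceeds the reference threshold.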

2.4.2. State of the Art (2023–2025): VAEs for Anomaly Detection and Thermal Applications

The recent literature confirms the growing adoption of VAEs for unsupervised fault detection and process monitoring in industrial and electronic systems. Jakubowski et al. [18] demonstrated that a VAE trained solely on normal operational data could detect wear anomalies in hot-strip mill rolls and turbofan engines through reconstruction error monitoring, without requiring any labeled failure examples. The authors further combined the VAE anomaly scores with a Random Forest surrogate model and SHAP explanations, establishing a precedent for the complementary use of generative and ensemble models in the same diagnostic pipeline.

Liu et al. [34] proposed a multi-channel, multi-scale convolutional attention VAE (MCA-VAE) for high-precision anomaly detection in industrial process monitoring, demonstrating that incorporating attention mechanisms into the VAE encoder substantially improved sensitivity to subtle temporal and cross-channel anomalies. Applied to sensor fault detection in factory environments, the MCA-VAE achieved superior detection rates compared to LSTM-AE and static VAE baselines, validating that architectural enhancements to the vanilla VAE can overcome its inherent smoothing tendency.

Zhu et al. [35] applied a physics-informed VAE (PI-VAE) framework to thermal system modeling in power plants, embedding thermodynamic constraints directly into the VAE training loss as physical inconsistency penalties. This hybrid approach enforced that reconstructed thermal state sequences satisfy energy balance equations, substantially reducing the number of training samples required for accurate modeling. This precedent demonstrates the feasibility of physics-constrained VAEs for thermal applications and points toward a future extension of the present work.

2.4.3. Motivation and Expected Limitations in This Study

Three capabilities justify including the VAE in this comparative evaluation despite its generative rather than discriminative architecture:

Unsupervised thermal anomaly detection: The reconstruction error A(x) provides a model-free anomaly score that does not require labeled examples of thermal failure, an important practical advantage for SiP systems where failure data is rare or absent.
Latent space interpolation and stress scenario generation: By sampling from unexplored regions of the latent space, the trained VAE can synthesize temperature distributions corresponding to operating conditions not present in the training data, enabling systematic robustness testing of the SiP.
Robustness to incomplete data: The encoder can infer plausible latent representations from partially corrupted inputs (e.g., missing spatial coordinates), providing resilience to sensor failures during real-time monitoring.

However, a fundamental structural limitation must be acknowledged explicitly: the VAE optimizes the ELBO, a lower bound on the log-likelihood rather than a direct regression loss such as MSE. The KL regularization term encourages the encoder to produce smooth, diffuse latent representations, which introduces a distributional blurring that degrades point-prediction accuracy compared to models specifically trained to minimize MSE. This trade-off between generative flexibility and regression precision is a central empirical question of the present study, and we anticipate that the VAE will exhibit significantly higher MSE than both ensemble methods, a hypothesis directly testable through the experimental protocol described in Section 3.

3. Study Objectives and Experimental Methodology

3.1. Methodological Steps

The present study pursues three primary methodological steps in the context of thermal management for System-in-Package (SiP) architectures:

1. Development of predictive models capable of estimating temperature distributions at mesh nodes from their normalized three-dimensional spatial coordinates ( $X_{norm}, Y_{norm}, Z_{norm}$ ).
2. Systematic evaluation and optimization of model performance through rigorous hyperparameter search procedures using RandomizedSearchCV, with full reproducibility guaranteed by setting random_state = 42 at every stage of the pipeline.
3. Comparative analysis of three heterogeneous machine learning paradigms, gradient boosting (XGBoost), ensemble bagging (Random Forest), and deep variational generation (Variational Autoencoder), under a unified controlled experimental framework.

3.2. Data Preparation Pipeline

3.2.1. Raw Data Loading and Cleaning

Raw datasets comprising nodal spatial coordinates and associated temperature measurements were loaded from structured tab-separated files. The nodal temperature data used in this study were obtained from finite element method (FEM) simulations performed with COMSOL Multiphysics on a representative SiP module geometry, the same as in [1]. Node coordinates are expressed in micrometers (μm) and temperatures in degrees Celsius (°C). Preprocessing involved removal of the non-standard header row, column renaming for clarity, and conversion of temperature values to double-precision floating-point representation. A critical preprocessing challenge arose from the use of comma characters as decimal separators in the temperature source file. This was resolved by applying a systematic character substitution (comma to decimal point) prior to type casting. Following this transformation, no missing values were detected in either the coordinate or temperature datasets, validating the integrity of the preprocessing pipeline.
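The cleaning steps described above can be sketched with pandas as follows. The file layout, the header text, and the column names `node_id` and `temperature_C` are hypothetical stand-ins, since the exact export format is not reproduced in the text.

```python
import io
import pandas as pd

# Hypothetical tab-separated export with a non-standard header row
# and comma decimal separators in the temperature column.
raw = "% node\ttemperature\n1\t25,31\n2\t26,80\n3\t31,07\n"

temps = pd.read_csv(
    io.StringIO(raw),
    sep="\t",
    skiprows=1,                           # drop the non-standard header row
    header=None,
    names=["node_id", "temperature_C"],   # rename columns for clarity
    dtype=str,                            # read as text so the substitution is explicit
)

# Comma-to-point substitution, then cast to double precision.
temps["temperature_C"] = (
    temps["temperature_C"].str.replace(",", ".", regex=False).astype("float64")
)
temps["node_id"] = temps["node_id"].astype(int)

n_missing = int(temps.isna().sum().sum())  # integrity check after casting
```

pandas can also parse such files in one step with `read_csv(..., decimal=",")`; the explicit substitution is shown here because it mirrors the procedure described in the text.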

3.2.2. Feature Normalization (Z-Score Standardization)

Coordinate features $(X, Y, Z)$ were standardized using Z-score normalization to ensure unit-variance, zero-mean distributions across all spatial dimensions. For each feature vector $x$, the transformation is defined as

x^{'} = \frac{x - μ}{σ}

(23)

where $μ$ and $σ$ denote the empirical mean and standard deviation computed over the training partition. The resulting normalized features $x^{'}$ are dimensionless. Post-normalization, all three coordinate axes exhibited $μ = 0$ and $σ = 1$, confirming successful standardization.
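A minimal sketch of (23) with scikit-learn is given below, using synthetic coordinates in place of the FEM mesh. The array names and the uniform coordinate range are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
coords = rng.uniform(0.0, 5000.0, size=(1000, 3))  # synthetic (X, Y, Z) in micrometers

coords_train, coords_test = train_test_split(coords, test_size=0.2, random_state=42)

scaler = StandardScaler().fit(coords_train)  # mu and sigma estimated on training rows only
train_norm = scaler.transform(coords_train)
test_norm = scaler.transform(coords_test)    # reuses the training statistics

train_mean = train_norm.mean(axis=0)
train_std = train_norm.std(axis=0)
```

Fitting the scaler on the training partition yields exactly zero mean and unit standard deviation there, while the test partition is only approximately standardized because it reuses the training statistics.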

3.2.3. Dataset Merging and Partitioning

Temperature and coordinate datasets were merged on shared node identifiers, producing a joint dataset composed of feature matrix $X \in R^{n \times 3}$ (normalized coordinates) and target vector $y \in R^{n}$ (temperature values). For all models, a stratified 80/20 train–test split was applied with random_state = 42, ensuring identical partitions across every execution of the pipeline.

3.3. Evaluation Protocol

Performance was quantified along five complementary dimensions on the test partition:

MSE = \frac{1}{n} \sum_{i = 1}^{n} {(y_{i} - {\hat{y}}_{i})}^{2}

(24)

RMSE = \sqrt{MSE}

(25)

MAE = \frac{1}{n} \sum_{i = 1}^{n} |y_{i} - {\hat{y}}_{i}|

(26)

MAPE = \frac{100}{n} \sum_{i = 1}^{n} \frac{|y_{i} - {\hat{y}}_{i}|}{|y_{i}|}

(27)

R^{2} = 1 - \frac{\sum_{i = 1}^{n} {(y_{i} - {\hat{y}}_{i})}^{2}}{\sum_{i = 1}^{n} {(y_{i} - \bar{y})}^{2}}

(28)
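The five metrics in (24)–(28) can be computed together as in the following sketch. The function name `regression_metrics` and the four-point example are illustrative assumptions; the MAPE term presumes no target temperature equals zero.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE, MAPE (in percent), and R^2 as defined in Eqs. (24)-(28)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    return {
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "MAE": np.mean(np.abs(err)),
        "MAPE": 100.0 * np.mean(np.abs(err) / np.abs(y_true)),
        "R2": 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2),
    }

# Hypothetical temperatures in degrees Celsius.
metrics = regression_metrics([25.0, 26.0, 30.0, 40.0], [25.5, 26.0, 29.0, 41.0])
```

For this example the errors are −0.5, 0, 1, and −1 °C, giving an MSE of 0.5625, an RMSE of 0.75, and a MAE of 0.625.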

A confusion matrix analysis was performed after discretizing predicted and actual temperatures into three ordinal categories: low ($T \leq 25$ °C), medium ($25 < T \leq 30$ °C), and high ($T > 30$ °C). The thresholds adopted in this study carry a degree of physical justification in the SiP context. The value of 25 °C corresponds to the IEEE standard ambient reference temperature and 30 °C marks a pragmatic monitoring threshold for compact sealed modules, below which passive dissipation is generally sufficient and above which active thermal management becomes necessary.
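The discretization can be sketched as follows. The helper name `thermal_class` and the six example temperatures are assumptions for illustration; the inclusive upper bounds follow the class definitions given in the text.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def thermal_class(temps_c, low_max=25.0, high_min=30.0):
    """Map temperatures to 0 = low (T <= low_max), 1 = medium, 2 = high (T > high_min)."""
    # right=True makes each bin closed on its upper edge, matching T <= 25 and T <= 30.
    return np.digitize(temps_c, bins=[low_max, high_min], right=True)

y_true = np.array([24.0, 25.0, 25.1, 30.0, 30.1, 45.0])
y_pred = np.array([24.8, 25.3, 26.0, 29.5, 29.9, 44.0])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(thermal_class(y_true), thermal_class(y_pred), labels=[0, 1, 2])
```

Passing other boundaries through `low_max` and `high_min` reuses the same helper for a different partition of the temperature range.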

Class distribution and imbalance. The fixed-threshold protocol produces a severely imbalanced class distribution (low = 7.1%, medium = 81.0%, high = 12.0%), which structurally disadvantages all models in the high class, particularly the VAE, whose KL regularization reinforces majority-class predictions. A complementary quartile-based protocol ($Q_{1} = 25.20$ °C, $Q_{3} = 26.80$ °C) is therefore applied in parallel, yielding a statistically balanced distribution (35.5%/39.8%/24.7%) that enables a more meaningful assessment of each model’s discriminative capacity. Future work should consider oversampling strategies such as SMOTE to further improve high-class prediction reliability (see Figure 1).

Spatial correlation between neighboring nodes. The three models in this study treat each node independently, without explicitly encoding neighborhood thermal coupling. To quantify the degree of spatial autocorrelation present in the thermal field, the mean absolute temperature difference $| Δ T |$ to the five nearest neighbors was computed for each node via a k-NN search ($k = 5$) in normalized coordinate space. The global mean $| Δ T | = 0.371$ °C confirms strong positive spatial autocorrelation: nodes in close spatial proximity tend to share similar temperatures. This autocorrelation is not spatially uniform, however; the four localized hotspot clusters identified in the exploratory analysis exhibit $| Δ T |$ values reaching 5–6 °C at their boundaries, reflecting the steep thermal gradients generated by the active-component heat sources. This finding has two direct implications for the present study. First, it justifies the use of spatial coordinates as predictive features: the coordinate values implicitly encode neighborhood proximity and therefore carry autocorrelation information even without explicit graph connectivity. Second, it motivates future extensions to graph-structured models (GCNs) that can exploit spatial adjacency more directly than coordinate-only regressors.
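The neighbor-difference statistic can be sketched with scikit-learn's `NearestNeighbors`. The synthetic fields below are assumptions used only to show the contrast between a spatially smooth field and the same values shuffled across nodes; they are not the study's data.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mean_abs_neighbor_delta(coords_norm, temps_c, k=5):
    """Per-node mean |T_i - T_j| over the k nearest neighbors j of node i."""
    # k + 1 neighbors are requested because each node is returned as its own nearest neighbor.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(coords_norm)
    _, idx = nn.kneighbors(coords_norm)
    neighbor_temps = temps_c[idx[:, 1:]]  # drop the self-match in column 0
    return np.mean(np.abs(neighbor_temps - temps_c[:, None]), axis=1)

rng = np.random.default_rng(42)
coords = rng.uniform(-1.7, 1.7, size=(2000, 3))   # synthetic normalized coordinates
smooth_field = 25.0 + 2.0 * coords[:, 0]          # temperature varies smoothly along X
shuffled_field = rng.permutation(smooth_field)    # same values, spatial structure destroyed

delta_smooth = mean_abs_neighbor_delta(coords, smooth_field).mean()
delta_shuffled = mean_abs_neighbor_delta(coords, shuffled_field).mean()
```

A small global mean relative to the shuffled baseline indicates positive spatial autocorrelation. Dropping column 0 as the self-match assumes that no two nodes share identical coordinates.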

3.4. Model Architectures and Training Procedures

3.4.1. XGBoost (Extreme Gradient Boosting)

XGBoost [36] was applied as a regression model that maps the normalized node coordinates to predict temperatures. The initial configuration followed the baseline parameters:

n_estimators = 100, max_depth = 3, learning_rate = 0.1.

Hyperparameter optimization was conducted via RandomizedSearchCV (20 iterations, 5-fold cross-validation) over the following search space:

\begin{matrix} n_{est} \in \{100, 200, 300, 500\}, α \in \{0.01, 0.05, 0.1, 0.2\}, d_{max} \in \{3, 4, 5, 6\}, \\ colsample \in \{0.6, 0.8, 1.0\}, subsample \in \{0.6, 0.8, 1.0\}, ω_{min} \in \{1, 3, 5\} \end{matrix}

The inclusion of min_child_weight and the extension of learning_rate and max_depth ranges target the non-uniform thermal gradient regimes characteristic of SiP hotspots. Setting random_state = 42 in RandomizedSearchCV guarantees full reproducibility of the sampled combination.

3.4.2. Random Forest

A RandomForestRegressor was trained on the same normalized coordinate features [37]. The initial configuration used 100 estimators with random_state = 42. Hyperparameter search covered

\begin{matrix} n_{est} \in \{100, 200, 300, 500\}, d_{max} \in \{\text{None}, 10, 20, 30, 40, 50\}, \\ \text{min\_samples\_split} \in \{2, 5, 10\}, \text{min\_samples\_leaf} \in \{1, 2, 4\}, \\ \text{max\_features} \in \{\text{sqrt}, \text{log2}, \text{None}\}, \text{bootstrap} \in \{\text{True}, \text{False}\} \end{matrix}

The explicit inclusion of bootstrap in the search space enables empirical assessment of its impact on regression performance, a parameter fixed by default in prior configurations. Mean decrease in impurity (MDI) feature importance scores were computed to quantify the relative contribution of each coordinate axis.
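The Random Forest search can be sketched as follows on synthetic data. The search space is the one listed above; the synthetic hotspot field, the `neg_mean_squared_error` scoring choice, and the reduced `n_iter` and `cv` values are assumptions made to keep the example fast (the study reports 20 iterations and 5-fold cross-validation).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

rng = np.random.default_rng(42)
X = rng.standard_normal((400, 3))                         # stand-in normalized coordinates
y = 25.0 + 3.0 * np.exp(-(X[:, 0] ** 2 + X[:, 1] ** 2))   # synthetic hotspot; Z is uninformative

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

param_distributions = {
    "n_estimators": [100, 200, 300, 500],
    "max_depth": [None, 10, 20, 30, 40, 50],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2", None],
    "bootstrap": [True, False],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=4,                           # reduced from 20 for speed (assumption)
    cv=3,                               # reduced from 5 for speed (assumption)
    scoring="neg_mean_squared_error",   # assumed; the scoring metric is not stated in the text
    random_state=42,                    # fixes the sampled parameter combinations
)
search.fit(X_train, y_train)

best_model = search.best_estimator_     # refit on the full training partition
test_mse = float(np.mean((best_model.predict(X_test) - y_test) ** 2))
mdi = best_model.feature_importances_   # mean decrease in impurity per coordinate
```

`search.best_params_` holds the selected configuration, and `mdi` gives the impurity-based importance of each coordinate axis, normalized to sum to one.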

3.4.3. Variational Autoencoder (VAE)

Methodological note on cross-paradigm comparison. Comparing a generative VAE, trained to maximize the Evidence Lower Bound (ELBO) on a joint reconstruction-plus-KL objective, with regressors directly minimizing MSE (XGBoost, Random Forest) is, by design, an unbalanced comparison for point-estimation tasks. We retain this comparison deliberately, with full disclosure of its asymmetry: our explicit goal is to quantify the performance gap induced by the ELBO/MSE objective mismatch on an SiP thermal field, rather than to claim a fair benchmark of generic predictive capacity. The VAE results in this study should accordingly be interpreted as a structural characterization of objective incompatibility, not as evidence against generative models in general. This framing is restated in the conclusion (Finding 2) and supports the recommendation that the VAE be deployed for tasks aligned with its training objective, namely latent space analysis and unsupervised anomaly detection.

The VAE implementation underwent fundamental methodological corrections relative to the initial configurations, motivated by two structural incompatibilities identified post hoc:

Input normalization: all four input features, temperature, $X_{norm}, Y_{norm}, Z_{norm}$ , are normalized to the range of [0, 1] using MinMaxScaler. Temperatures are scaled with $T_{min} = 25.0$ °C and $T_{max} = 76.0$ °C, fitted exclusively on the training partition to prevent test-set leakage. This step is essential because raw temperatures between 20 and 76 °C and standardized coordinates between $- 1.7$ and $+ 1.7$ operate on fundamentally different scales, which disrupts KL divergence optimization and causes the network to prioritize reconstruction of the dominant-scale feature.
Linear output activation: the decoder uses a linear activation rather than the sigmoid activation of the initial configurations. Sigmoid constrains outputs to [0, 1], rendering it physically impossible to reconstruct normalized temperatures above $1.0$ and guaranteeing an artificial minimum MSE floor of order 730.

Inverse transform for metric computation. Because the VAE is trained on the normalized range, all reported regression metrics (MSE, RMSE, MAE, R²) are computed on the original physical temperature scale (°C) after applying the inverse MinMaxScaler transform to the decoder output:

\hat{T} [°C] = {\hat{T}}_{norm} (T_{max} - T_{min}) + T_{min}, \quad T_{min} = 25.0 °C, \quad T_{max} = 76.0 °C

(29)

Without this inverse transform, the reported metrics would be expressed in normalized units and would not be directly comparable to the XGBoost and Random Forest baselines, which operate natively on °C. The same convention is applied to all VAE entries in Table 1.
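The effect of the inverse transform in (29) on the reported error can be illustrated with a short sketch. The helper names and the constant 0.01 offset standing in for a decoder error are assumptions; the scaling bounds are those stated in the text.

```python
import numpy as np

T_MIN, T_MAX = 25.0, 76.0  # scaling bounds stated in the text (degrees Celsius)

def to_normalized(t_celsius):
    """Min-max scale temperatures to the normalized range."""
    return (np.asarray(t_celsius, dtype=float) - T_MIN) / (T_MAX - T_MIN)

def to_celsius(t_norm):
    """Inverse transform of Eq. (29): normalized value back to degrees Celsius."""
    return np.asarray(t_norm, dtype=float) * (T_MAX - T_MIN) + T_MIN

t_true_c = np.array([25.0, 30.1, 50.5, 76.0])
t_hat_norm = to_normalized(t_true_c) + 0.01   # hypothetical decoder output, off by 0.01

mse_norm = np.mean((t_hat_norm - to_normalized(t_true_c)) ** 2)
mse_celsius = np.mean((to_celsius(t_hat_norm) - t_true_c) ** 2)
```

Because the transform is affine, the MSE in °C equals the normalized MSE multiplied by $(T_{max} - T_{min})^{2} = 51^{2} = 2601$, so metrics left in normalized units would understate the error by more than three orders of magnitude.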

The corrected architecture is defined as follows. The encoder maps the input vector $x \in R^{4}$ through two fully connected layers with batch normalization, $Dense (128, ReLU) + BN \to Dense (64, ReLU) + BN$, producing the latent parameters $μ_{ϕ} \in R^{2}$ and $log σ_{ϕ}^{2} \in R^{2}$. The sampling layer draws the latent code via the reparameterization trick, $z = μ_{ϕ} (x) + σ_{ϕ} (x) ⊙ ε$, where $ε \sim N (0, I)$. The decoder reconstructs the input from $z$ through $Dense (64, ReLU) + BN \to Dense (128, ReLU) + BN \to Dense (4, linear)$. Training minimizes the negative ELBO (19) using the Adam optimizer ($l_{r} = 10^{- 3}$), a maximum of 50 epochs, and three callbacks [32].

4. Results and Discussion

4.1. Exploratory Data Analysis

Prior to model training, descriptive statistics were computed for all variables following preprocessing (Figure 2). The normalized coordinate distributions confirmed successful standardization: $X_{norm}$, $Y_{norm}$, and $Z_{norm}$ each exhibited zero mean and unit standard deviation. The temperature distribution reveals a marked positive asymmetry (Figure 2), with the majority of nodes falling in the medium class ($25 < T \leq 30$ °C) and a right tail extending toward high-temperature values ($> 30$ °C) corresponding to hotspot zones. A class-wise boxplot of the temperature distribution is shown in Figure 3.

The pairwise linear correlations between the four variables are summarized in Figure 4.

4.2. XGBoost Model Results

Following hyperparameter optimization via RandomizedSearchCV (20 iterations, 5-fold cross-validation, random_state = 42), the XGBoost model was evaluated on the held-out test partition using the optimal configuration identified over the extended search space. The best combination retrieved comprised 500 estimators, a maximum tree depth of 4, a learning rate of 0.1, a column sub-sampling ratio of 1.0, a row sub-sampling ratio of 0.8, and a minimum child weight of 3. This configuration achieved a test MSE of 0.1091, an RMSE of 0.3304, a MAE of 0.1778, and a coefficient of determination $R^{2}$ of 0.9967, confirming the model’s capacity to capture the non-linear spatial temperature gradients of the SiP mesh after systematic tuning. Residual analysis (Figure 5) reveals a characteristic heteroscedastic pattern in which the variance in prediction errors increases with the magnitude of predicted temperatures. This funnel-shaped dispersion around the zero-residual reference line indicates that predictive accuracy diminishes in the high-temperature regime, a behavior consistent with the structural under-representation of hotspot nodes in the training distribution and with the inherently non-linear thermal gradients that develop around them. Positive residuals correspond to underestimation, and negative residuals to overestimation of actual nodal temperatures. The feature importance analysis (Figure 6) further reveals that the normalized coordinate $Y_{norm}$ contributes most strongly to the boosting decisions (gain-based importance of 0.5116, against 0.4884 for $X_{norm}$), suggesting a directional dominance in the spatial temperature gradient consistent with the thermal spreading patterns observed in planar SiP geometries.

4.3. Random Forest Model Results

Following hyperparameter optimization via RandomizedSearchCV (20 iterations, 5-fold cross-validation, random_state = 42), the Random Forest model was evaluated on the held-out test partition using the optimal configuration identified over the extended search space, which now explicitly includes the bootstrap parameter absent from prior configurations. The best combination retrieved comprised 500 estimators, a maximum depth of 30, a minimum samples split of 2, a minimum samples per leaf of 1, a maximum features setting of log2, and bootstrap set to True. This configuration achieved a test MSE of 0.0978, an RMSE of 0.3128, a MAE of 0.1112, and a coefficient of determination R² of 0.9970, establishing Random Forest as the strongest regression model among the three paradigms evaluated. Residual analysis (Figure 7) demonstrates a largely symmetric dispersion pattern around the zero-residual reference line, with a mild heteroscedastic tendency observable at elevated temperature values, consistent with the distributional scarcity of hotspot nodes in the training partition. The feature importance analysis (Figure 8), computed via mean decrease in impurity (MDI), identifies the normalized coordinate $X_{norm}$ as the most influential predictor of nodal temperature, followed by $Y_{norm}$. This ranking suggests directional anisotropy in the thermal distribution within the SiP structure, consistent with the lateral heat-spreading patterns typically observed in planar package geometries, and provides engineers with physically interpretable guidance for thermal design decisions. The RMSE curve as a function of the number of estimators (Figure 9) further confirms convergence of the ensemble error, with marginal gains beyond 100 trees, validating the selected configuration as computationally efficient without sacrificing predictive accuracy.

Interpretability: MDI vs. permutation importance. To assess the robustness of the MDI ranking, a model-agnostic permutation importance analysis was conducted on the held-out test partition (20 repetitions). The two methods yield a fully consistent ranking: $X_{norm}$ (MDI = 0.5262) and $Y_{norm}$ (MDI = 0.4738) are the sole informative predictors, and $Z_{norm}$ carries zero importance under both criteria (Figure 10). This agreement confirms that the MDI ranking is not an artifact of the impurity-based computation.
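The cross-check between MDI and permutation importance can be sketched with scikit-learn as follows. The synthetic field (in which the third coordinate carries no signal), the 100-tree forest, and the `neg_mean_squared_error` scoring choice are assumptions for illustration; only the 20 repetitions and the use of the held-out partition come from the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.standard_normal((600, 3))                         # stand-in normalized coordinates
y = 25.0 + 3.0 * np.exp(-(X[:, 0] ** 2 + X[:, 1] ** 2))   # Z has no influence by construction

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_train, y_train)

mdi = model.feature_importances_  # impurity-based, computed from the training fit
perm = permutation_importance(
    model, X_test, y_test,
    n_repeats=20,                        # 20 repetitions, as in the text
    scoring="neg_mean_squared_error",    # assumed scoring metric
    random_state=42,
)

for name, m, p in zip(["X_norm", "Y_norm", "Z_norm"], mdi, perm.importances_mean):
    print(f"{name}: MDI={m:.4f}  permutation={p:.4f}")
```

MDI is derived from the training-time splits, whereas permutation importance measures the loss in held-out performance when one feature is shuffled, so agreement between the two is evidence that a ranking is not specific to one attribution method.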

4.4. Variational Autoencoder Results

The VAE was trained for 50 effective epochs before early stopping was triggered, with the best model weights automatically restored from the checkpoint saved at epoch 44. The training and validation loss curves (Figure 11) exhibit a regular monotonic decrease in total ELBO loss across both partitions, with validation loss following a parallel trajectory throughout training, confirming adequate generalization without overfitting. Point-prediction assessment on the held-out test partition yielded a temperature MSE of 32.6080, an RMSE of 5.7103, a MAE of 3.0179, and an R² of −0.0008. These results represent a substantial improvement over the initial compact configuration and are directly attributable to two architectural corrections applied in this study. First, replacing the sigmoid output activation with a linear activation removes the saturation constraint that previously prevented the decoder from reconstructing temperature values outside the unit interval, which constituted a physically imposed error floor of order 730 regardless of training duration or architecture capacity. Second, applying a MinMaxScaler to all four input features prior to training ensures that the temperature variable and the normalized coordinates operate on commensurate scales, thereby stabilizing KL divergence optimization and preventing the network from disproportionately weighting the dominant-scale feature. The distribution of per-sample reconstruction errors (Figure 12) and the actual-versus-reconstructed temperature scatter plot (Figure 13) confirm that the model produces physically plausible reconstructions across the full temperature range of the dataset, a behavior entirely absent from the sigmoid-based configurations reported in prior work. 
The two-dimensional latent space visualization (Figure 14), obtained by projecting all test-set nodes through the encoder, reveals a physically meaningful organization in which high-temperature nodes cluster in a distinct region of the latent space, separated from low-temperature nodes by a smooth continuous gradient. This spatial coherence of the latent representation confirms that the VAE has successfully learned a compressed encoding of the thermal field, and is consistent with the hypothesis that reconstruction error thresholding could support unsupervised thermal anomaly detection, a necessary but not sufficient condition for operational deployment, which requires validation with labeled fault data and computation of AUC-ROC, precision–recall, and F1 metrics. This direction is identified as future work.

4.5. Confusion Matrix Analysis

To complement the MSE-based regression evaluation, predicted and actual temperatures were discretized into three ordinal categories and examined through confusion matrices. Two classification protocols were applied in parallel, each designed to answer a different question about model behavior.

The first protocol partitions temperatures using fixed engineering thresholds (low: $T \leq 25$ °C; medium: $25 < T \leq 30$ °C; high: $T > 30$ °C). These boundaries are not arbitrary. In the SiP context, 25 °C is the ambient reference temperature standardized by IEEE, and 30 °C marks a pragmatic first-line monitoring threshold for compact sealed modules, the point below which passive heat dissipation is generally adequate and above which proactive thermal management becomes operationally necessary. This partition is therefore a deliberate engineering choice: it keeps the classification results grounded in the language that thermal designers actually use and ensures that the confusion matrices can be read directly as guidance for system-level decisions.

The second protocol partitions the same temperature range using quartile thresholds computed on the training partition, $Q_{1} = 25.20$ °C and $Q_{3} = 26.80$ °C, yielding three classes of approximately equal size: low ($T \leq Q_{1}$, ≈35.5% of nodes), medium ($Q_{1} < T \leq Q_{3}$, ≈39.8%), and high ($T > Q_{3}$, ≈24.7%). The motivation is straightforward: the fixed 25–30 °C boundaries concentrate nearly 78% of the dataset observations in the medium class and reduce the high class to fewer than 1% of nodes, which makes the resulting confusion matrix metrics statistically meaningless. By contrast, the quartile partition distributes nodes evenly across all three classes, giving each cell of the confusion matrix enough samples to reflect genuine model behavior rather than the artifact of an imbalanced distribution. Reporting both partitions side by side is therefore deliberate and complementary: the fixed-threshold matrices anchor the results to engineering practice and design vocabulary, while the quartile matrices reveal the true discriminative capacity of each model under statistically fair conditions. Both partitions were analyzed via confusion matrices (Figure 15, Figure 16 and Figure 17).

4.5.1. Fixed-Threshold Classification (Low/Medium/High)

Under the fixed-threshold protocol, the Random Forest model exhibited the most balanced classification profile, with a strong diagonal concentration indicating a high proportion of correct assignments across all three categories (Figure 15). Residual confusion was primarily observed at the low–medium boundary, consistent with the smooth thermal gradient that characterizes the transition zone in the SiP field. XGBoost produced a comparable diagonal structure (Figure 16), with slightly lower density at the low–medium boundary, reflecting its marginally higher regression MSE. The VAE exhibited severe classification collapse (Figure 17): virtually all predictions were assigned to the medium category regardless of the true class label, a behavior that is the direct classification-domain manifestation of the ELBO-versus-MSE objective mismatch documented in Section 4.4. This collapse is further amplified by the severe class imbalance inherent to the fixed-threshold scheme (≈78% medium, ≈0.3% high), which deprives both ensemble models and the VAE of sufficient high-class training examples.

4.5.2. Quartile-Threshold Classification (Nominal/Transitional/Critical)

Under the quartile-threshold protocol, the balanced class distribution (≈35.5%/39.8%/24.7%) allows a more rigorous assessment of each model’s discriminative capacity across the full temperature range. The Random Forest model maintained its leading position (Figure 18), achieving a well-populated diagonal across all three classes and demonstrating that its strong regression accuracy translates directly into reliable multi-class thermal classification, including in the critical zone. XGBoost produced a classification profile qualitatively similar to Random Forest (Figure 19), with slightly reduced density in the nominal–transitional boundary region, confirming its competitive but second-ranked position. The VAE showed a markedly improved distribution relative to the fixed-threshold configuration (Figure 20): the predictions now collapse predominantly toward the critical class rather than toward transitional, which reflects the distributional blurring induced by KL regularization combined with the upper-tail bias of the reconstructed temperature distribution. While this remains an inferior classification profile compared to the ensemble methods, a non-negligible fraction of nominal instances are now correctly assigned, confirming that the VAE retains some discriminative structure absent under the fixed-threshold protocol. The VAE is a functionally active generative model, even if its accuracy remains inferior to that of the ensemble methods for deterministic temperature prediction. Taken together, the two confusion matrix analyses serve complementary roles: the fixed-threshold matrices provide engineering-interpretable evidence directly aligned with SiP design practice, while the quartile-threshold matrices provide statistically robust evidence of each model’s actual discriminative capacity, free from the distorting effect of class imbalance.

4.6. Global Performance Comparison

Table 1 and Figure 21 consolidate the key performance metrics across all evaluated model configurations. The results unequivocally establish that ensemble tree-based methods outperform the VAE for deterministic nodal temperature regression in SiP systems.

Random Forest achieved the lowest error across all regression metrics, with a test MSE of 0.0978 and an R² of 0.9970, confirming that bootstrap aggregation is strongly aligned with the MSE minimization objective on spatially structured tabular data. The feature importance analysis further identified $X_{norm}$ as the dominant spatial predictor, providing physically interpretable insight into the directional anisotropy of the SiP thermal field, an advantage that neither XGBoost nor the VAE can offer in the same transparent, cost-free manner. XGBoost ranked second with a test MSE of 0.1091 and R² of 0.9967, a performance gap of approximately 11% relative to Random Forest that narrows considerably after hyperparameter optimization compared to the baseline. The residual boxplot (Figure 22) confirms that both ensemble methods produce compact, symmetric residual distributions centered near zero, while the VAE exhibits a substantially wider spread. The R² comparison (Figure 23) summarizes this hierarchy visually: Random Forest and XGBoost are virtually indistinguishable on the regression axis, while the VAE’s near-zero R² ($- 0.0008$) confirms that, in its current configuration, it provides no predictive advantage over a constant mean predictor for this regression task.

Table 2 presents the cross-study comparison of node-level temperature prediction performance between the present work and [20].

Two observations emerge from this cross-study comparison:

The optimized Random Forest achieves an MSE of 0.0978, which falls below the threshold of 0.5 reported for the Graph Convolutional Network of [20], despite operating on a feature space of dramatically reduced dimensionality (three normalized spatial coordinates), without power map information, graph topology encoding, or adjacency matrix construction.
Both studies independently converge on the same structural conclusion regarding the positioning of ensemble methods: they provide the best trade-off between accuracy and implementation cost for structured per-node temperature estimation, a finding whose consistency across two independent datasets and two distinct chip architectures strengthens its generalizability.

Comparative interpretability. Random Forest delivers intrinsic MDI ($X_{norm}$: 0.5262; $Y_{norm}$: 0.4738; $Z_{norm}$: 0.000), validated by permutation importance (Section 4.3, Figure 10). XGBoost provides gain-based attribution ($Y_{norm}$: 0.5116; $X_{norm}$: 0.4884). The VAE offers no feature ranking but supports anomaly scoring through its latent space geometry. For practitioners requiring interpretable, audit-ready models, Random Forest is the recommended choice.
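The permutation-importance check mentioned above can be sketched in a few lines. The example below is a generic illustration of the technique, not the study's implementation: `toy_model` is a hypothetical stand-in for a fitted regressor that depends only on the first two coordinates, the data are synthetic, and the 20 repeats mirror the setting stated in the caption of Figure 10.

```python
import random

def mse(model, rows, targets):
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)

def permutation_importance(model, rows, targets, n_features, repeats=20, seed=0):
    """Mean increase in MSE when one feature column is shuffled."""
    rng = random.Random(seed)
    base = mse(model, rows, targets)
    scores = []
    for j in range(n_features):
        increases = []
        for _ in range(repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
            increases.append(mse(model, shuffled, targets) - base)
        scores.append(sum(increases) / repeats)
    return scores

def toy_model(row):
    """Hypothetical regressor: uses x and y, ignores z."""
    x, y, _z = row
    return 40.0 + 6.0 * x + 5.0 * y

data_rng = random.Random(1)
rows = [(data_rng.random(), data_rng.random(), data_rng.random()) for _ in range(200)]
targets = [toy_model(r) for r in rows]

imp_x, imp_y, imp_z = permutation_importance(toy_model, rows, targets, n_features=3)
```

A feature the model never uses, such as the third coordinate here, receives an importance of exactly zero, which is the behavior that an independent check of a zero MDI score is expected to reproduce.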

Computational efficiency. Table 3 and Figure 24 report training time, per-node inference latency, and serialized model size. XGBoost achieves the lowest latency (0.0044 ms/node) and footprint (0.8 MB). Random Forest incurs higher memory (232.8 MB) but delivers the strongest accuracy. The VAE exhibits the longest inference time (0.1437 ms/node). All three models deliver sub-millisecond inference, confirming tractability for real-time SiP thermal monitoring.
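Per-node latency figures of this kind are typically obtained by timing a batched prediction call and dividing by the number of nodes. The sketch below shows one way to do that with the standard library; `toy_predict` is a hypothetical placeholder for a trained model's batch prediction, the best-of-five timing strategy is an assumption, and the batch size of 2041 simply matches the test-set size given in Table 1.

```python
import time

def per_node_latency_ms(predict, nodes, repeats=5):
    """Best-of-N wall-clock time of one batched call, in ms per node."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        predict(nodes)
        best = min(best, time.perf_counter() - start)
    return 1000.0 * best / len(nodes)

def toy_predict(nodes):
    """Hypothetical stand-in for a trained model's batch prediction."""
    return [40.0 + 6.0 * x + 5.0 * y for x, y, _z in nodes]

# Synthetic normalized coordinates for 2041 nodes.
nodes = [(i / 2040, (i % 51) / 50, 0.0) for i in range(2041)]
latency = per_node_latency_ms(toy_predict, nodes)
```

Taking the minimum over several repeats reduces the influence of scheduler noise; measured values depend on hardware and batch size, so they are only comparable when all models are timed under the same conditions.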

Spatial error distribution. The absolute prediction error $|e_{i}|$ was computed for each test-set node. Both RF and XGB exhibit elevated errors in the thermal gradient transition zones surrounding the four hotspot clusters, the regions where coordinate-only features cannot encode neighborhood thermal coupling. This result provides quantitative motivation for future GCN extensions.
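A simple way to locate the regions of elevated error is to rank test nodes by their absolute error. The sketch below assumes that $|e_{i}|$ is the absolute difference between predicted and actual temperature at node $i$; the helper `top_error_nodes`, the coordinates, and the temperatures are hypothetical illustration values.

```python
def top_error_nodes(coords, y_true, y_pred, k=3):
    """Return the k nodes with the largest absolute error |pred_i - true_i|."""
    abs_err = [abs(p - t) for t, p in zip(y_true, y_pred)]
    ranked = sorted(range(len(abs_err)), key=lambda i: abs_err[i], reverse=True)
    return [(coords[i], abs_err[i]) for i in ranked[:k]]

# Toy nodes: normalized (x, y, z) coordinates and temperatures in °C.
coords = [(0.0, 0.0, 0.0), (0.5, 0.5, 0.0), (0.25, 0.5, 0.0), (1.0, 1.0, 0.0)]
y_true = [30.0, 48.0, 40.0, 31.0]
y_pred = [30.25, 46.0, 41.0, 31.0]

worst = top_error_nodes(coords, y_true, y_pred, k=2)
```

Plotting the returned coordinates over the package layout would show whether the largest errors cluster around steep thermal gradients.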

Generalization scope and transfer assumptions. The three models were trained and tested on a single SiP geometry consisting of $N = 10{,}201$ nodes extracted from one FEM simulation. The performance figures reported here therefore characterize behavior on package layouts whose component placement, die-stack height, and substrate thermal conductivity are similar to those of the training configuration. Direct transfer to substantially different package configurations, for example, modules with markedly different component densities, alternative heat-source distributions, or different boundary conditions (heat-sink attachment, ambient flow regime), cannot be guaranteed without retraining or fine-tuning on representative data from the target design. Two practical implications follow. First, for design-space exploration within a single SiP family, the trained models can be re-applied to new node–coordinate inputs at sub-millisecond latency without modification. Second, deployment across a fleet of heterogeneous SiP designs would require either (i) a per-design training cycle on a small FEM-generated dataset or (ii) transfer-learning techniques that adapt a pre-trained model with a fraction of the data required for training from scratch. Both routes are within reach, given the modest training budget reported in Table 3 (0.1–0.5 min per model), and item 6 of the future work agenda elaborates on this direction.

5. Conclusions

5.1. Results and Discussion

This study conducted a rigorous comparative evaluation of three machine learning paradigms, Random Forest, XGBoost, and a Variational Autoencoder (VAE), applied to node-level temperature prediction in System-in-Package (SiP) architectures. All three models were trained and evaluated under identical conditions on a common FEM dataset ($N = 10{,}201$ nodal temperatures), isolating algorithmic differences from differential access to physical information.

1. Ensemble methods constitute effective and practically tractable surrogates for FEM-based thermal field prediction. Random Forest achieved MSE = 0.0978 °C² and R² = 0.9970; XGBoost attained MSE = 0.1091 °C² and R² = 0.9967, both from three normalized spatial coordinates alone. The Random Forest result (MSE = 0.0978) falls below the GCN benchmark (<0.5) of [20], which requires power maps and full graph topology. MDI feature attribution ($X_{norm}$: 0.5262; $Y_{norm}$: 0.4738; $Z_{norm}$: 0.000), independently confirmed by permutation importance, reveals a planar thermal anisotropy that can guide sensor placement and design decisions.
2. The structural incompatibility between the ELBO training objective and deterministic regression accounts for the VAE’s inferior point-prediction performance. The VAE achieved R² = −0.0008 and MSE = 32.608 °C² (RMSE = 5.710 °C), performing no better than a constant mean predictor. The revised architecture employs a linear decoder activation and MinMaxScaler normalization, and its training converged at epoch 44 under early stopping. The regression failure is structural: the KL regularization term penalizes the sharp, point-estimating latent representations that deterministic regression requires. The temperature-structured latent space is consistent with the hypothesis of reconstruction-error-based anomaly detection, a claim that requires experimental validation with labeled fault data before operational deployment.
3. The three paradigms present complementary accuracy, interpretability, and computational profiles, guiding model selection for different deployment contexts. XGBoost achieves the lowest inference latency (0.0044 ms/node) and a small serialized footprint (0.8 MB), positioning it as the preferred model for embedded real-time monitoring. Random Forest provides the highest regression fidelity and the richest engineering interpretability through intrinsic MDI feature attribution. The VAE, while unsuitable for deterministic regression, offers the most compact representation (0.1 MB) and a structured latent geometry that may support anomaly detection.
4. The coordinate-only input feature space is a deliberate and bounded design choice enabling rigorous cross-paradigm comparison. The exclusion of power dissipation maps, material conductivities, graph topology, and boundary condition labels ensures that observed performance differences are attributable solely to algorithmic characteristics. This constraint limits the applicability of the trained models to designs whose spatial–thermal structure is similar to that of the training configuration and identifies physics-augmented features as the highest-priority extension.
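The tension between the ELBO objective and point regression noted in item 2 can be illustrated numerically. The sketch below uses the standard closed-form KL divergence between a diagonal Gaussian posterior and a standard normal prior; the squared-error reconstruction term, the `beta` weight, and the two-dimensional toy latent vectors are assumptions for illustration and do not reproduce the trained VAE.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))

def negative_elbo(x, x_hat, mu, log_var, beta=1.0):
    """Squared-error reconstruction term plus a beta-weighted KL penalty."""
    reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return reconstruction + beta * kl_to_standard_normal(mu, log_var)

# A posterior equal to the prior costs nothing.
kl_prior_like = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])

# A sharp, off-center posterior (small variance, shifted mean) is penalized.
kl_sharp = kl_to_standard_normal([2.0, 0.0], [-6.0, -6.0])

# A broad posterior close to the prior is penalized far less.
kl_broad = kl_to_standard_normal([0.5, 0.0], [0.0, 0.0])
```

The penalty grows as the posterior becomes narrower and moves away from the origin, which is exactly the kind of sharp, sample-specific encoding that an accurate point estimate would need.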

5.2. Perspectives and Future Work

1. Physics-augmented feature engineering: Incorporating per-node power dissipation, material conductivities, heat-source locations, and boundary condition labels would substantially improve accuracy and cross-design generalization.
2. Graph-structured and physics-informed architectures: GCNs can exploit explicit mesh adjacency; PINNs incorporate heat-conduction equations into the training loss. Both address the spatial correlation limitation quantified in Section 3.3.
3. Hybrid ensemble–generative architectures: VAE-derived latent features fed to a Random Forest regressor could combine ensemble regression accuracy with generative uncertainty quantification and anomaly detection capabilities.
4. Rigorous anomaly detection validation: Operational deployment of the VAE requires labeled out-of-distribution fault data enabling AUC-ROC, precision–recall, and F1 evaluation across detection thresholds.
5. Integration with active thermal management: The sub-millisecond inference latency of all three models renders them compatible with closed-loop control architectures (DVFS controllers, fan-speed modules, or thermoelectric coolers), closing the gap between ML prediction and actionable thermal management.
6. Cross-design generalization and transfer learning: Domain adaptation and design-agnostic representations should be explored to reduce per-design training data requirements for new SiP configurations.

Together, these directions position machine learning as a scientifically principled and scalable complement to physics-based solvers for proactive thermal management in next-generation System-in-Package architectures.

Author Contributions

A.O. developed the first version of the proposed approach and wrote the first version of the paper; M.F.B. completed the work and improved the approach; O.E. wrote and revised this version; A.L., senior author and SiP specialist, supervised all steps of the work. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AUC-ROC	Area Under the Curve–Receiver Operating Characteristic
BCs	Boundary Conditions
CCLC	Chip Cooling Laminate Chip
CFD	Computational Fluid Dynamics
DVFS	Dynamic Voltage and Frequency Scaling
ELBO	Evidence Lower Bound
FEM	Finite Element Method (comparison baseline)
GCN	Graph Convolutional Network
IoT	Internet of Things
KL	Kullback–Leibler (divergence)
MAE	Mean Absolute Error
MAPE	Mean Absolute Percentage Error
MDI	Mean Decrease in Impurity
ML	Machine Learning
MSE	Mean Squared Error
PINN	Physics-Informed Neural Network
RF	Random Forest
RMSE	Root Mean Squared Error
SiP	System-in-Package
SMOTE	Synthetic Minority Over-sampling Technique
SoC	System-on-Chip
TSV	Through-Silicon Via
VAE	Variational Autoencoder
XGBoost	Extreme Gradient Boosting

References

1. Oukaira, A.; Said, D.; Zbitou, J.; Lakhssassi, A. Advanced Thermal Control Using Chip Cooling Laminate Chip (CCLC) with Finite Element Method for System-in-Package (SiP) Technology. Electronics 2023, 14, 3154.
2. Memik, S.O.; Mukherjee, R.; Ni, M.; Long, J. Optimizing Thermal Sensor Allocation for Microprocessors. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2008, 27, 516–527.
3. Wang, Z.; Dong, R.; Ye, R.; Singh, S.S.K.; Wu, S.; Chen, C. A Review of Thermal Performance of 3D Stacked Chips. Int. J. Heat Mass Transf. 2024, 235, 126212.
4. Gabriel, O.E.; Huitink, D.R. Failure Mechanisms Driven Reliability Models for Power Electronics: A Review. J. Electron. Packag. 2023, 145, 020801.
5. Sharma, M.K.; Ramos-Alvarado, B. Thermal Management of 3-D Heterogeneously Integrated Microelectronics: Challenges and Future Research Directions. Commun. Eng. 2026, 5, 28.
6. Zhang, H.; Tian, M.; Gu, X. Thermal Management of Through-Silicon Vias and Back-End-of-Line Layers in 3D ICs: A Comprehensive Review. Microelectron. Eng. 2025, 298, 112325.
7. Oukaira, A.; Oumlaz, M.; Zbitou, J.; Lakhssassi, A. Integrated Thermal Management Strategies for 3D Chip Stacking with Through-Silicon Vias (TSV). In Proceedings of the 4th International Conference on Innovative Research in Applied Science, Engineering and Technology (IRASET), Fez, Morocco, 16–17 May 2024; pp. 1–4.
8. Chuttar, A.; Banerjee, D. Machine Learning (ML) Based Thermal Management for Cooling of Electronics Chips by Utilizing Thermal Energy Storage (TES) in Packaging That Leverages Phase Change Materials (PCM). Electronics 2021, 10, 2785.
9. Sun, Y.; Cheng, J.; Luo, Z.; Zeng, Y. A Sub-0.01 °C Resolution All-CMOS Temperature Sensor with 0.43 °C/−0.38 °C Inaccuracy and 1.9 pJ·K² Resolution FoM for IoT Applications. Micromachines 2024, 15, 1132.
10. Ilager, S.; Ramamohanarao, K.; Buyya, R. Thermal Prediction for Efficient Energy Management of Clouds Using Machine Learning. IEEE Trans. Parallel Distrib. Syst. 2021, 32, 1044–1056.
11. Pereira, D.; Oliveira, R.; Kim, H.S. A Machine Learning Approach for Prediction of Signaling SiP Dialogs. IEEE Access 2021, 9, 44094–44106.
12. Balega, M.; Farag, W.; Wu, X.-W.; Ezekiel, S.; Good, Z. Enhancing IoT Security: Optimizing Anomaly Detection through Machine Learning. Electronics 2024, 13, 2148.
13. Zabin, R.; Haque, K.F.; Abdelgawad, A. PredXGBR: A Machine Learning Framework for Short-Term Electrical Load Prediction. Electronics 2024, 13, 4521.
14. Hu, Z.; Cui, M.; Wu, X. Real-Time Temperature Prediction of Power Devices Using an Improved Thermal Equivalent Circuit Model and Application in Power Electronics. Micromachines 2024, 15, 63.
15. Qiu, Y.; Li, Z. Neural Network-Based Approach for Failure and Life Prediction of Electronic Components under Accelerated Life Stress. Electronics 2024, 13, 1512.
16. Liang, H. Predicting Temperature Rise in Medium Voltage Switchgear within the Smart Grid Environment and the Application of a Random Forest Model. In Proceedings of the 10th International Forum on Electrical Engineering and Automation (IFEEA), Nanjing, China, 3–5 November 2023; pp. 584–590.
17. Zhu, J.; Jiang, M.; Liu, Z. Fault Detection and Diagnosis in Industrial Processes with Variational Autoencoder: A Comprehensive Study. Sensors 2022, 22, 227.
18. Jakubowski, J.; Stanisz, P.; Bobek, S.; Nalepa, G.J. Anomaly Detection in Asset Degradation Process Using Variational Autoencoder and Explanations. Sensors 2022, 22, 291.
19. Pham, T.; Lee, J.-H.; Park, C.-S. MST-VAE: Multi-Scale Temporal Variational Autoencoder for Anomaly Detection in Multivariate Time Series. Appl. Sci. 2022, 12, 10078.
20. Miao, D.; Duan, G.; Chen, D.; Zhu, Y.; Zheng, X. Real-Time Temperature Prediction for Large-Scale Multi-Core Chips Based on Graph Convolutional Neural Networks. Electronics 2025, 14, 1223.
21. Pagani, S.; Manoj, P.D.S.; Jantsch, A.; Henkel, J. Machine Learning for Power, Energy, and Thermal Management on Multicore Processors: A Survey. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2020, 39, 101–116.
22. Zhang, Y.; Sarvey, T.; Bakir, M.S. Thermal Challenges for Heterogeneous 3D ICs and Opportunities for Air Gap Thermal Isolation. In Proceedings of the 2014 International 3D Systems Integration Conference (3DIC), Kinsale, Ireland, 1–3 December 2014; pp. 1–5.
23. Chen, K.-C.; Liao, Y.-H.; Chen, C.-T.; Wang, L.-Q. Adaptive Machine Learning-Based Proactive Thermal Management for NoC Systems. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2023, 31, 1114–1124.
24. Zhang, K.; Guiliani, A.; Ogrenci-Memik, S.; Memik, G.; Yoshii, K.; Sankaran, R.; Beckman, P. Machine Learning-Based Temperature Prediction for Runtime Thermal Management Across System Components. IEEE Trans. Parallel Distrib. Syst. 2018, 29, 405–419.
25. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
26. Acharya, P.V.; Lokanathan, M.; Ouroua, A.; Hebner, R.; Strank, S.; Bahadur, V. Machine Learning-Based Predictions of Benefits of High Thermal Conductivity Encapsulation Materials for Power Electronics Packaging. J. Electron. Packag. 2021, 143, 041109.
27. Azarifar, M.; Ocaksonmez, K.; Cengiz, C.; Aydoğan, R.; Arik, M. Machine Learning to Predict Junction Temperature Based on Optical Characteristics in Solid-State Lighting Devices: A Test on WLEDs. Micromachines 2022, 13, 1245.
28. He, K.; Ding, H. Prediction of NOx Emissions in Thermal Power Plants Using a Dynamic Soft Sensor Based on Random Forest and Just-in-Time Learning Methods. Sensors 2024, 24, 4442.
29. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
30. Bhandari, U.; Chen, Y.; Ding, H.; Zeng, C.; Emanet, S.; Gradli, P.R.; Guo, S. Machine-Learning-Based Thermal Conductivity Prediction for Additively Manufactured Alloys. J. Manuf. Mater. Process. 2023, 7, 160.
31. Shanmugam, M.; Maganti, L.S. Machine Learning-Based Thermal Performance Study of Microchannel Heat Sink under Non-Uniform Heat Load Conditions. Appl. Therm. Eng. 2024, 253, 123769.
32. Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. arXiv 2013, arXiv:1312.6114.
33. Kingma, D.P.; Welling, M. An Introduction to Variational Autoencoders. Found. Trends Mach. Learn. 2019, 12, 307–392.
34. Liu, J.; Huang, Y.; Wu, D.; Yang, Y.; Chen, Y.; Chen, L.; Zhang, Y. Multi-Channel Multi-Scale Convolution Attention Variational Autoencoder (MCA-VAE): An Interpretable Anomaly Detection Algorithm Based on Variational Autoencoder. Sensors 2024, 24, 5316.
35. Zhu, B.; Ren, S.; Weng, Q.; Si, F. A Physics-Informed Variational Autoencoder for Modeling Power Plant Thermal Systems. Energies 2025, 18, 4742.
36. Gong, J.; Chu, S.; Mehta, R.K.; McCaughey, A.J.H. XGBoost Model for Electrocaloric Temperature Change Prediction in Ceramics. npj Comput. Mater. 2022, 8, 140.
37. Naing, W.Y.N.; Htike, Z.Z. Forecasting of Monthly Temperature Variations Using Random Forests. ARPN J. Eng. Appl. Sci. 2015, 10, 10109–10112.

Figure 1. Class distribution comparison: fixed engineering thresholds (left, low = 7.1%, medium = 81.0%, high = 12.0%) vs. quartile-based protocol (right, balanced at 35.5%/39.8%/24.7%). The fixed protocol produces severe class imbalance that disadvantages all models in the high-temperature regime. The triangle (△) above the left panel marks the severely imbalanced fixed-threshold distribution; the check-mark (✓) above the right panel marks the statistically balanced quartile distribution.

Figure 2. Descriptive statistics of the preprocessed dataset. Bars represent the empirical histogram of nodal temperatures; the solid curve is a kernel density estimate (KDE) of the same distribution.

Figure 3. Temperature distribution and boxplot by category. Open circles denote individual outlier nodes whose temperature lies beyond 1.5× the interquartile range above the upper whisker.

Figure 4. Correlation heatmap between variables.

Figure 5. Residuals vs. predicted values of XGBoost. Each red dot represents one held-out test node; the horizontal dashed line at residual = 0 is the ideal zero-error reference.

Figure 6. XGBoost feature importance scores.

Figure 7. Residuals vs. predicted values of Random Forest. Each red dot represents one held-out test node; the horizontal dashed line at residual = 0 is the ideal zero-error reference.

Figure 8. Random Forest-optimized feature importance scores (MDI).

Figure 9. RMSE vs. number of trees.

Figure 10. RF MDI vs. permutation importance (20 repeats). Both confirm $X_{norm}$ and $Y_{norm}$ as sole predictors and $Z_{norm} = 0$.

Figure 11. Training and validation loss curves of VAE.

Figure 12. Distribution of reconstruction errors of VAE. The green curve is a kernel density estimate of the per-sample reconstruction error; the dashed red line marks the empirical mean (8.6739).

Figure 13. Actual vs. reconstructed temperature of VAE. Each red dot represents one test-set node (actual temperature on the abscissa, VAE-reconstructed temperature on the ordinate); the dashed line is the 1:1 perfect-prediction reference.

Figure 14. 2D latent space colored by temperature.

Figure 15. Confusion matrix Random Forest predictions, fixed thresholds.

Figure 16. Confusion matrix XGBoost predictions, fixed thresholds.

Figure 17. Confusion matrix VAE predictions, fixed thresholds.

Figure 18. Confusion matrix Random Forest predictions, quartile thresholds.

Figure 19. Confusion matrix XGBoost predictions, quartile thresholds.

Figure 20. Confusion matrix VAE predictions, quartile thresholds.

Figure 21. Model comparison: grouped bar chart of MSE/RMSE/MAE.

Figure 22. Residual boxplot comparison across all three models. Red dots inside each box mark the empirical mean of the residuals; the horizontal dashed line at zero is the perfect-prediction reference.

Figure 23. R² scores and classification accuracy comparison.

Figure 24. Computational efficiency: training time, inference latency (log scale), serialized model size.

Table 1. Comparative performance summary of evaluated machine learning models on the held-out test set (n = 2041 nodes). All metrics are computed on the original temperature scale (°C) after inverse MinMaxScaler transform for the VAE. The VAE MSE is ≈333× higher than that of Random Forest (32.608 vs. 0.0978), a structural consequence of the ELBO/MSE objective mismatch.

Model	MSE (°C²)	RMSE (°C)	MAE (°C)	MAPE (%)	$R^{2}$
Random Forest	0.0978	0.3128	0.1112	0.31	0.9970
XGBoost	0.1091	0.3304	0.1778	0.58	0.9967
VAE	32.608	5.7103	3.0179	11.12	−0.0008

Note: Bold values indicate the best score across the three models for each metric.

Table 2. Cross-study comparison of node-level temperature prediction performance between the present work and [20]. Both studies report MSE on per-node test partitions.

Model	Study	MSE (Test)	Input Features	Interpretability
Graph Convolutional Network	[20]	<0.5	Power map	Low
Random Forest	Present study	0.0978	3D normalized	High (MDI)
XGBoost	Present study	0.1091	3D normalized	Moderate
VAE	Present study	32.608	3D normalized	Low

Note: Bold indicates the best MSE across all entries. Reference [20] requires power map data and explicit graph topology encoding as inputs; the present models rely solely on three normalized spatial coordinates, without structural or power information.

Table 3. Computational efficiency.

Model	Training	Inference/Node	Memory
XGBoost	0.1 min	0.0044 ms	0.8 MB
Random Forest	0.5 min	0.0833 ms	232.8 MB
VAE	0.2 min	0.1437 ms	0.1 MB

Note: Bold values indicate the best score across the three models (lowest inference latency and smallest serialized footprint).

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
