3.1. Overall System Architecture Design
The data center thermal management system based on distributed fiber optic temperature sensing and model predictive control is designed to achieve high-precision perception of thermal fields, accurate prediction of thermal loads, and optimal regulation of cooling equipment. The overall system architecture follows a closed-loop cyber–physical control paradigm of perception–modeling–decision–execution, as illustrated in
Figure 2. This architecture is vertically organized into four functional layers.
The bottom layer is the physical equipment layer, which consists of server racks, precision air conditioners (CRAC units), cooling towers, chilled water pumps, and distributed fiber optic sensors. These components form the physical infrastructure that generates heat loads, executes cooling actions, and provides continuous temperature measurements. The second layer is the data acquisition layer, which is responsible for temperature signal demodulation, multi-source data fusion and aggregation, and communication protocol conversion to ensure reliable and low-latency data transmission. The third layer is the intelligent computing layer, which hosts the training and inference tasks of the hybrid thermal prediction model as well as the optimization solving tasks of the model predictive controller. The top layer is the human–machine interface layer, providing visualization dashboards, alarm management services, and operation and maintenance decision support for operators.
A central concept in the proposed framework is thermal symmetry, which characterizes the spatial uniformity of temperature distribution across the data center. To quantify thermal symmetry, this study introduces the Thermal Symmetry Index (TSI), mathematically defined as:

$$\mathrm{TSI} = \frac{\sigma_T}{\bar{T}}$$

where $\sigma_T$ denotes the spatial standard deviation of temperature measurements across all sensing points (°C), and $\bar{T}$ represents the spatial mean temperature (°C). A lower TSI value indicates better thermal uniformity, with TSI = 0 representing a perfectly uniform thermal field. Based on industry practice and ASHRAE guidelines [29], TSI values can be interpreted as follows: TSI < 0.03 indicates excellent thermal symmetry with minimal hotspot risk; 0.03 ≤ TSI < 0.05 represents good thermal symmetry acceptable for normal operation; 0.05 ≤ TSI < 0.08 indicates moderate asymmetry requiring attention; and TSI ≥ 0.08 suggests poor thermal symmetry with significant hotspot risk requiring immediate intervention. The proposed MPC controller explicitly incorporates TSI minimization as an optimization objective to achieve spatially balanced thermal regulation.
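For illustration, the index and its interpretation bands can be sketched in a few lines, assuming the TSI is the coefficient of variation of the sensed temperatures (spatial standard deviation divided by spatial mean, matching the definitions above):

```python
import statistics

def thermal_symmetry_index(temps_c):
    """TSI = spatial standard deviation / spatial mean of all sensing points."""
    mean_t = statistics.fmean(temps_c)
    sigma_t = statistics.pstdev(temps_c)  # population std over the sensing points
    return sigma_t / mean_t

def classify_tsi(tsi):
    """Map a TSI value to the qualitative bands described above."""
    if tsi < 0.03:
        return "excellent"
    if tsi < 0.05:
        return "good"
    if tsi < 0.08:
        return "moderate asymmetry"
    return "poor"

# A perfectly uniform thermal field yields TSI = 0.
assert thermal_symmetry_index([24.0, 24.0, 24.0]) == 0.0
```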
The hardware platform adopts an edge–cloud collaborative deployment mode. High-performance embedded computing devices are deployed at the edge side to execute temperature data preprocessing and fast control response tasks with stringent real-time requirements, while GPU server clusters are deployed at the cloud side to support offline training of deep learning models and large-scale historical data storage and analytics. Tanasiev et al. [
30] proposed an IoT-enhanced monitoring and control solution for HVAC systems that integrates heterogeneous devices through MQTT protocols and RESTful APIs, enabling real-time perception and remote management of equipment status via intelligent sensor nodes and edge computing applications. Inspired by this design philosophy, the proposed system constructs a hierarchical data acquisition and communication network for large-scale thermal sensing. Serale et al. [
31] further investigated IoT system architectures for MPC-based control and highlighted that well-designed communication topology and data synchronization mechanisms are essential for guaranteeing the real-time performance and stability of predictive control systems. The main design parameters of the proposed system are summarized in
Table 3.
The temperature upper limit threshold of 27 °C is determined based on ASHRAE TC 9.9 guidelines for data center thermal management [
29], which recommend that server inlet temperatures be maintained within 18–27 °C for Class A1 data centers to ensure reliable IT equipment operation. This threshold provides a safety margin below the critical temperature of 32 °C, above which server throttling or emergency shutdown may occur. The selection balances thermal safety requirements with energy efficiency considerations, as operating closer to the upper limit reduces overcooling and associated energy waste.
3.2. Distributed Fiber Optic Temperature Sensing Subsystem Design
The distributed fiber optic temperature sensing subsystem constitutes the perceptual foundation of the proposed thermal symmetry management system. Its primary function is to acquire high-density spatiotemporal distribution information of the temperature field within the data center machine room, thereby enabling continuous and fine-grained observation of thermal dynamics. In contrast to traditional point-type temperature sensors, whose deployment density is constrained by wiring complexity and installation cost and thus provides only sparse discrete sampling, distributed fiber optic sensing enables continuous temperature profiling along the entire fiber path, making it possible to capture spatial temperature gradients and evolving hotspot structures under complex airflow environments.
Ashry et al. [
32] systematically reviewed the deployment of fiber-optic distributed sensing technologies in the oil and gas industry, covering Rayleigh-based distributed acoustic sensing (DAS), Raman-based distributed temperature sensing (DTS), and Brillouin-based distributed temperature and strain sensing (DTSS). Their survey highlights that these sensing systems provide continuous real-time measurements along the full length of optical fiber cables and are particularly suitable for long-distance, large-scale monitoring applications. Lu et al. [
33] further presented a comprehensive review of distributed optical fiber sensors based on Rayleigh, Brillouin, and Raman scattering mechanisms, emphasizing their extensive applications in energy infrastructure monitoring, power generation systems, and pipeline inspection. Their study demonstrates the long-term stability, robustness, and reliability of distributed sensing technologies under complex operating conditions, together with diverse trade-offs in spatial resolution, sensing range, and temperature accuracy.
By leveraging these technical advantages, the proposed sensing subsystem establishes a high-resolution thermal perception layer for data centers, enabling continuous observation of three-dimensional thermal fields and providing a reliable data foundation for hybrid thermal modeling and symmetry-aware predictive control.
The proposed system adopts a distributed fiber optic temperature sensing scheme based on stimulated Brillouin scattering. When pulsed light propagates along an optical fiber, photons interact inelastically with acoustic phonons in the fiber medium, generating Brillouin backscattered light with a frequency shift. The Brillouin frequency shift exhibits a linear dependence on the local temperature of the fiber, which can be expressed as [18]:

$$\nu_B(T) = \nu_B(T_0) + C_T \,(T - T_0)$$

where $\nu_B(T)$ denotes the Brillouin frequency shift at temperature $T$ (in GHz), $\nu_B(T_0)$ is the Brillouin frequency shift at the reference temperature $T_0$, and $C_T$ represents the temperature sensitivity coefficient, which is typically approximately 1.1 MHz/°C for standard single-mode optical fibers. Here, $T$ denotes the measured temperature (in °C), and $T_0$ is the reference temperature, commonly set to 25 °C.
Through spectral analysis and time-domain localization of the backscattered optical signals, both temperature values and spatial position information along the entire fiber can be obtained simultaneously. The spatial resolution of the sensing system is determined by the width of the probing optical pulses, while the temperature resolution depends primarily on the accuracy of spectral demodulation.
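Inverting the linear shift model above recovers temperature from a demodulated Brillouin frequency; a minimal sketch follows, where the reference shift of 10.85 GHz is an assumed typical value for standard single-mode fiber at 1550 nm, not a figure from this study:

```python
def brillouin_temperature(nu_b_ghz, nu_b0_ghz=10.85, c_t_mhz_per_c=1.1, t0_c=25.0):
    """Recover temperature from the linear model nu_B(T) = nu_B(T0) + C_T*(T - T0).

    nu_b_ghz      : measured Brillouin frequency shift (GHz)
    nu_b0_ghz     : shift at the reference temperature T0 (GHz; assumed value)
    c_t_mhz_per_c : temperature sensitivity coefficient (MHz/°C)
    t0_c          : reference temperature (°C)
    """
    delta_mhz = (nu_b_ghz - nu_b0_ghz) * 1000.0  # GHz -> MHz
    return t0_c + delta_mhz / c_t_mhz_per_c

# A +11 MHz shift above the reference corresponds to +10 °C at 1.1 MHz/°C.
assert abs(brillouin_temperature(10.861) - 35.0) < 1e-6
```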
Barrias et al. [
34] reviewed the application status of distributed optical fiber sensors in civil engineering, demonstrating that although fiber Bragg grating (FBG) sensors offer high measurement accuracy, they essentially belong to quasi-distributed sensing schemes, in which the number of measurement points is limited by the number of gratings deployed along the fiber. In contrast, truly distributed sensing technologies based on Rayleigh, Brillouin, or Raman scattering provide continuous temperature measurements along the entire fiber length, enabling dense spatial sampling of large-scale infrastructures.
Bense et al. [
35] extensively reviewed the application of distributed temperature sensing (DTS) as a downhole monitoring tool in hydrogeology, demonstrating both passive and active DTS modes for a wide range of monitoring scenarios. Their work verifies the long-term stability, robustness, and environmental adaptability of distributed sensing systems in complex operating environments, providing valuable references for the technology selection and deployment strategies of temperature sensing systems in large-scale data center infrastructures.
The detailed specifications of the distributed fiber optic temperature sensing system are summarized in
Table 4. The sensing system employs Brillouin optical time-domain analysis (BOTDA) technology with a spatial resolution of 0.5 m and temperature accuracy of ±0.1 °C. The measurement uncertainty analysis follows the GUM (Guide to the Expression of Uncertainty in Measurement) framework, with Type A uncertainty evaluated from repeated measurements and Type B uncertainty estimated from instrument specifications and calibration certificates.
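The GUM-style combination of Type A and Type B contributions can be sketched as follows, assuming the Type B components are already expressed as standard uncertainties (the readings and component values are illustrative, not the system's calibration data):

```python
import math

def combined_standard_uncertainty(type_a_readings, type_b_components):
    """Combine a Type A uncertainty (std of the mean of repeated readings)
    with Type B standard uncertainties via root-sum-of-squares."""
    n = len(type_a_readings)
    mean = sum(type_a_readings) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in type_a_readings) / (n - 1))
    u_a = s / math.sqrt(n)  # experimental standard deviation of the mean
    return math.sqrt(u_a ** 2 + sum(u ** 2 for u in type_b_components))

# With identical readings, only the Type B components contribute.
assert abs(combined_standard_uncertainty([24.0] * 5, [0.03, 0.04]) - 0.05) < 1e-9
```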
The spatial layout of the sensing fibers directly determines the observability, integrity, and reconstruction accuracy of the three-dimensional thermal field. To achieve uniform coverage while preserving fine-grained resolution in thermally critical regions, the proposed system adopts a hybrid deployment strategy that combines serpentine routing with hotspot-aware densification, as illustrated in
Figure 3. The main trunk cable is arranged in a serpentine pattern along the upper spaces of both cold and hot aisles, ensuring continuous coverage across all rack rows. In key heat-exchange locations, including precision air-conditioner outlets, rack inlet faces, and hot-aisle return regions, local sampling density is increased through fiber coiling and localized routing, enabling high-resolution observation of thermal gradients and transient hotspots. The total fiber length of 1800 m is deployed across six rack rows, with approximately 300 m allocated to each row. The fiber is secured using cable ties and mounting clips at intervals of 1.0 m to prevent displacement and vibration-induced measurement noise. The detailed layout parameters and regional sampling strategies are summarized in
Table 5.
The fiber optic sensing system incorporates several safety features to ensure reliable operation in the data center environment. The sensing fiber is enclosed in a flame-retardant low-smoke zero-halogen (LSZH) jacket that meets IEC 60332-1 fire safety standards, preventing fire propagation and toxic gas emission. The cable routing avoids direct contact with high-temperature surfaces (>60 °C) and maintains a minimum clearance of 50 mm from power cables to minimize electromagnetic interference. Rodent protection is provided through the use of armored fiber cables in accessible areas and protective conduits in raised floor sections. The fiber installation does not obstruct airflow paths or impede equipment maintenance access.
To ensure system reliability under partial fiber failure conditions, the proposed framework incorporates a fault detection and data recovery mechanism. The sensing system continuously monitors the optical power level and Brillouin frequency shift quality along the fiber. When a fiber break or excessive attenuation is detected at a specific location, the system automatically identifies the affected measurement points and activates interpolation-based data recovery using neighboring healthy sensing points. For critical monitoring zones, redundant fiber loops are deployed to provide backup sensing capability. The MPC controller is designed to maintain stable operation with up to 15% of sensing points unavailable, utilizing a robust state estimation algorithm that weights available measurements according to their spatial proximity to the missing points. In the event of extensive fiber failure exceeding this threshold, the system automatically switches to a conservative control mode with increased safety margins until repair is completed.
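A simplified sketch of the interpolation-based recovery step is given below; the actual system uses proximity-weighted robust state estimation, whereas this hypothetical helper only linearly interpolates between the nearest healthy neighbors and enforces the 15% availability threshold from the text:

```python
def recover_missing(profile, healthy_mask):
    """Fill unavailable sensing points from neighboring healthy ones.

    profile      : temperature readings along the fiber (°C); failed entries ignored
    healthy_mask : booleans, True where the reading is trusted
    """
    healthy = [i for i, ok in enumerate(healthy_mask) if ok]
    if len(healthy) / len(profile) < 0.85:
        raise RuntimeError("over 15% of points unavailable: switch to conservative mode")
    out = list(profile)
    for i, ok in enumerate(healthy_mask):
        if ok:
            continue
        left = max((j for j in healthy if j < i), default=None)
        right = min((j for j in healthy if j > i), default=None)
        if left is None:
            out[i] = profile[right]   # no left neighbor: copy nearest right
        elif right is None:
            out[i] = profile[left]    # no right neighbor: copy nearest left
        else:                         # interpolate between nearest healthy neighbors
            w = (i - left) / (right - left)
            out[i] = (1 - w) * profile[left] + w * profile[right]
    return out

# A failed point is recovered as the midpoint of its two healthy neighbors.
assert recover_missing([24.0, 24.0, 99.0] + [26.0] * 7,
                       [True, True, False] + [True] * 7)[2] == 25.0
```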
3.3. Hybrid Thermal Prediction Model Design
Accurate temperature prediction constitutes the fundamental prerequisite for the implementation of model predictive control in data center thermal management. Purely physics-based models exhibit strong interpretability and extrapolation capability; however, their formulation is often complex and computationally intensive, which limits their suitability for real-time control. In contrast, purely data-driven models are easy to train and efficient in inference, but their generalization ability is inherently constrained by the distribution range of training data and may degrade under unseen operating conditions.
To address these limitations, the proposed system develops a hybrid thermal prediction model that integrates thermodynamic physical equations with deep neural networks, achieving a balance between physical consistency, predictive accuracy, and computational efficiency. From the perspective of cloud computing energy efficiency optimization, Buyya et al. [
36] analyzed the application potential of data-driven methods in data center management and emphasized that hybrid modeling strategies can effectively overcome the intrinsic limitations of single-paradigm approaches. Buyya et al. [
37] further provided a comprehensive review of energy-efficiency innovations and next-generation cloud computing technologies, highlighting that the integration of physical models with data-driven learning has become a key methodological trend for intelligent and sustainable data center operation.
Motivated by these insights, the proposed hybrid prediction framework is designed to leverage the structural prior and extrapolation capability of thermodynamic models while exploiting the nonlinear representation power of deep neural networks for complex thermal dynamics, thereby providing a robust and scalable foundation for symmetry-aware predictive control.
The overall architecture of the hybrid thermal prediction model is illustrated in
Figure 4. The model is composed of three tightly coupled components: a physical constraint layer, a feature extraction layer, and a prediction output layer. The physical constraint layer establishes macroscopic thermal balance equations for the data center machine room based on the principle of energy conservation, providing physically interpretable structural priors for the learning model.
Under steady-state operating conditions, the thermal balance of the machine room can be expressed as [38]:

$$Q_{\mathrm{IT}} + Q_{\mathrm{env}} = Q_{\mathrm{cool}} + Q_{\mathrm{loss}}$$

where $Q_{\mathrm{IT}}$ denotes the total heat generation power of IT equipment (kW), $Q_{\mathrm{env}}$ represents the heat gain introduced by envelope heat transfer and infiltration air (kW), $Q_{\mathrm{cool}}$ denotes the effective cooling capacity of the cooling system (kW), and $Q_{\mathrm{loss}}$ represents other heat dissipation losses (kW).

For dynamic operating conditions, considering the thermal storage effect of machine room air and equipment, the thermal balance equation can be extended as [38]:

$$\rho V c_p \frac{\mathrm{d}T}{\mathrm{d}t} = Q_{\mathrm{IT}} + Q_{\mathrm{env}} - Q_{\mathrm{cool}} - Q_{\mathrm{loss}}$$

where $\rho$ denotes the air density (kg/m³), $V$ is the effective machine room volume (m³), $c_p$ is the specific heat capacity of air at constant pressure (kJ/(kg·°C)), $T$ is the average machine room temperature (°C), and $t$ denotes time (s).
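The dynamic balance can be integrated numerically to propagate the average room temperature; a minimal forward-Euler sketch follows, with all default parameter values (air density, room volume, specific heat) chosen for illustration only:

```python
def step_room_temperature(t_c, q_it_kw, q_env_kw, q_cool_kw, q_loss_kw,
                          dt_s=10.0, rho=1.2, volume_m3=1500.0, cp_kj=1.005):
    """One forward-Euler step of the lumped balance
    rho*V*cp * dT/dt = Q_IT + Q_env - Q_cool - Q_loss  (kW == kJ/s)."""
    net_kw = q_it_kw + q_env_kw - q_cool_kw - q_loss_kw
    thermal_capacity_kj_per_c = rho * volume_m3 * cp_kj
    return t_c + net_kw * dt_s / thermal_capacity_kj_per_c

# When cooling exactly matches the total heat gain, the temperature holds steady.
assert step_room_temperature(24.0, 300.0, 20.0, 310.0, 10.0) == 24.0
```

With a positive net heat load the room warms, and with surplus cooling it cools, which is the basic dynamic the MPC prediction model must capture.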
The formulation of the physical constraint layer is inspired by the fast fluid dynamics (FFD) modeling paradigm proposed by Han et al. [
38], who developed a data center thermal simulation model based on open-source fast fluid dynamics solvers. Their improved upwind scheme enables the coupled solution of advection and diffusion equations, achieving a favorable trade-off between computational efficiency and numerical accuracy. Compared with conventional CFD solvers requiring 464.8 h of computation time, the FFD model reduces simulation time to 7.6 h while preserving sufficient accuracy, and can achieve annual energy savings of 53.4–58.8% through optimal thermal design and operation.
In parallel, Athavale et al. [
21] systematically compared multiple data-driven thermal modeling approaches, including artificial neural networks, support vector regression, and Gaussian process regression for data center temperature prediction. Their experimental results indicate that Gaussian process regression achieves the best average prediction error of 0.56 °C, providing a strong benchmark for validating the predictive accuracy of learning-based thermal models.
Motivated by these studies, the physical constraint layer in the proposed hybrid model encodes macroscopic thermodynamic principles into the learning framework, enabling the deep neural network to respect energy conservation laws while learning complex nonlinear thermal dynamics from data. This hybrid modeling strategy improves prediction robustness under dynamic workloads and unseen operating conditions, and establishes a physically consistent foundation for symmetry-aware model predictive control.
From a methodological standpoint, recent progress in deep learning-based signal modeling, time–frequency analysis, and optimization-inspired neural networks has provided powerful tools for constructing physically consistent and interpretable prediction models. A series of studies have demonstrated that combining signal processing theory, deep temporal networks, and optimization-driven learning architectures can significantly enhance prediction accuracy, stability, and interpretability in complex dynamic systems [
39,
40,
41,
42,
43,
44]. These advances offer important methodological support for the proposed hybrid physical–AI thermal prediction framework.
The feature extraction layer adopts a cascaded architecture composed of a Temporal Convolutional Network (TCN) and a Bidirectional Gated Recurrent Unit (BiGRU) to capture multi-scale temporal dependencies and long-range correlations in temperature sequences. The TCN module employs causal convolution and dilated convolution to achieve an exponentially expanding receptive field with limited network depth, enabling efficient modeling of long-term thermal evolution patterns.
The convolutional output of the TCN can be formulated as [28]:

$$h_t^{(l)} = \sigma\!\left(W^{(l)} \ast h^{(l-1)}_{t-(k-1)d:t} + b^{(l)}\right)$$

where $h_t^{(l)}$ denotes the hidden-state output of the $l$-th layer at time $t$, $W^{(l)}$ is the convolution kernel weight matrix of layer $l$, $h^{(l-1)}_{t-(k-1)d:t}$ represents the hidden-state sequence of layer $l-1$ from time $t-(k-1)d$ to $t$, $d$ denotes the dilation factor, $b^{(l)}$ is the bias vector, and $\sigma(\cdot)$ denotes the nonlinear activation function.
On top of the TCN encoder, a BiGRU module is introduced to further enhance sequential representation capability by modeling bidirectional temporal dependencies. The BiGRU propagates information in both forward and backward directions, enabling the network to capture both historical thermal inertia and future trend consistency from the learned latent features. This cascaded TCN–BiGRU architecture effectively alleviates the vanishing gradient problem and achieves faster convergence compared with conventional LSTM-based recurrent networks.
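The exponential receptive-field growth and the causal structure described above can be illustrated with a minimal pure-Python sketch (kernel size, layer count, and weights are illustrative, not the configuration used in the study):

```python
def receptive_field(kernel_size, num_layers):
    """Receptive field of stacked dilated causal convolutions with
    dilation doubling per layer (d = 1, 2, 4, ...)."""
    rf = 1
    for layer in range(num_layers):
        rf += (kernel_size - 1) * (2 ** layer)
    return rf

def causal_dilated_conv(x, weights, dilation, bias=0.0):
    """1-D causal dilated convolution: output at t sees only x[t], x[t-d], ..."""
    out = []
    for t in range(len(x)):
        acc = bias
        for i, w in enumerate(weights):
            j = t - i * dilation
            if j >= 0:  # positions before the sequence start are zero-padded
                acc += w * x[j]
        out.append(acc)
    return out

# Four layers of kernel-3 convolutions already cover 31 time steps.
assert receptive_field(kernel_size=3, num_layers=4) == 31
```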
The effectiveness of the TCN–BiGRU architecture for data center thermal prediction has been experimentally validated by Lin et al. [
28], who demonstrated its superior accuracy and training efficiency in multi-objective thermal optimization scenarios.
To prevent overfitting and enhance model generalization, several regularization techniques are incorporated into the training process. Dropout regularization with a rate of 0.3 is applied after each TCN residual block and BiGRU layer to prevent co-adaptation of neurons. L2 weight regularization is applied to all trainable parameters to constrain model complexity. The dataset is split into training (70%), validation (15%), and testing (15%) subsets, with the validation set used for hyperparameter tuning and early stopping. Early stopping with a patience of 20 epochs monitors the validation loss and terminates training when no improvement is observed, preventing overfitting to the training data. Additionally, the physical constraint layer serves as an implicit regularizer by enforcing thermodynamic consistency, which restricts the solution space to physically plausible predictions and improves generalization to unseen operating conditions. Data augmentation through Gaussian noise injection (σ = 0.05 °C) is applied during training to improve robustness against sensor noise.
The model’s generalization capability across different operating conditions is ensured through several design choices. The input features include normalized environmental variables (outdoor temperature, humidity) that capture seasonal variations, allowing the model to adapt to different ambient conditions. The physical constraint layer provides structural priors that remain valid across different data center configurations, reducing the need for extensive retraining when deploying to new facilities. For adaptation to significantly different data center layouts or cooling system configurations, transfer learning can be employed by freezing the physical constraint layer and fine-tuning only the deep learning components with limited local data (typically 3–7 days of operation). Cross-validation experiments across different load profiles demonstrated that the hybrid model maintains prediction RMSE below 0.5 °C for load variations within ±30% of the training distribution.
In the output stage, the learned deep features are fused with physical constraint priors through residual connections, enabling the model to generate multi-step temperature forecasts while respecting thermodynamic consistency. For input feature construction, domain knowledge from data center cooling systems is incorporated. Yu et al. [
45] systematically reviewed passive and active cooling strategies for data centers, providing guidance for selecting airflow, heat exchange, and equipment operation variables as thermal drivers. Perez-Lombard et al. [
46] further analyzed global building energy consumption patterns and identified HVAC systems as major energy consumers, accounting for approximately 50% of total building energy usage. These insights motivate the inclusion of HVAC-related operational variables as key explanatory features in the hybrid prediction model.
In a broader perspective of intelligent sensing systems, recent advances in high-throughput perception, deep learning-based recognition, and real-time intelligent decision-making have demonstrated the feasibility of constructing end-to-end closed-loop systems from sensors to insights. Representative studies have shown that modern industrial intelligence platforms increasingly rely on large-scale sensing, deep neural perception, and edge–cloud collaborative computing to support real-time control and optimization [
47,
48,
49,
50,
51,
52]. These developments further validate the technical paradigm adopted in this work, namely high-density sensing, intelligent modeling, and closed-loop optimization for complex industrial infrastructures.
The optimization objective of the MPC controller is formulated to minimize the total energy consumption of the cooling system over the prediction horizon while enforcing smooth control actions to avoid frequent equipment adjustments and mechanical wear. The objective function is defined as [31]:

$$J = \sum_{k=1}^{N_p} \left[ \alpha\, P_{\mathrm{cool}}(k) + \beta \left\| T(k) - T_{\mathrm{ref}} \right\|^2 \right] + \sum_{k=0}^{N_c - 1} \gamma \left\| \Delta u(k) \right\|^2$$

where $J$ denotes the objective function value, $N_p$ is the prediction horizon length, $N_c$ is the control horizon length, $P_{\mathrm{cool}}(k)$ represents the cooling system power at step $k$ (kW), $T(k)$ denotes the predicted temperature vector at step $k$, $T_{\mathrm{ref}}$ is the reference temperature setpoint, and $\Delta u(k)$ denotes the control input increment at step $k$. The weighting coefficients $\alpha$, $\beta$, and $\gamma$ balance the trade-off among energy efficiency, thermal safety, and control smoothness.
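For illustration, the objective can be evaluated for a candidate control sequence as follows; the weight values and the summation shape are a hedged sketch of the energy, tracking, and smoothness terms described above, not the tuned coefficients used in the study:

```python
def mpc_cost(p_cool, t_pred, t_ref, du, alpha=1.0, beta=10.0, gamma=0.1):
    """Evaluate J = sum_k [alpha*P_cool(k) + beta*||T(k)-T_ref||^2]
                  + sum_k gamma*||du(k)||^2 for a candidate sequence.

    p_cool : cooling power per prediction step (kW)
    t_pred : predicted temperature vector per prediction step (°C)
    t_ref  : reference setpoint (°C)
    du     : control increment vector per control step
    """
    tracking = sum(alpha * p + beta * sum((ti - t_ref) ** 2 for ti in t)
                   for p, t in zip(p_cool, t_pred))
    smoothness = sum(gamma * sum(d ** 2 for d in step) for step in du)
    return tracking + smoothness

# Perfect tracking with zero control moves costs only the energy term.
assert mpc_cost([100.0, 100.0], [[24.0], [24.0]], 24.0, [[0.0]]) == 200.0
```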
The total cooling system power is decomposed into three major components: precision air conditioners, chilled water pumps, and cooling tower fans. The corresponding power model is given by [53]:

$$P_{\mathrm{cool}} = P_{\mathrm{CRAC}} + P_{\mathrm{pump}} + P_{\mathrm{fan}}$$

where $P_{\mathrm{CRAC}}$ denotes the compressor power of precision air-conditioning units (kW), $P_{\mathrm{pump}}$ represents the aggregated power of chilled water pumps and cooling water pumps (kW), and $P_{\mathrm{fan}}$ denotes the cooling tower fan power (kW).
The formulation of the energy consumption model and airflow-related control variables is supported by experimental and field studies. Cho et al. [
53] performed measurements and predictive analysis of air distribution systems in high-compute-density data centers, revealing that reasonable airflow organization and cooling system configuration can significantly improve thermal management efficiency and reduce energy consumption. Lazic et al. [
54] from Google further demonstrated the practical effectiveness of model predictive control in large-scale production data centers, achieving substantial energy savings through real-world deployments. Their results provide strong empirical evidence for the effectiveness and engineering feasibility of MPC-based cooling optimization.
The constraint set of the MPC controller consists of two categories: temperature safety constraints and physical constraints of cooling equipment. The temperature safety constraints ensure that the predicted rack inlet temperature remains below a predefined upper bound to prevent server frequency throttling or emergency shutdown caused by overheating. The safety constraint is formulated as:

$$T_{\mathrm{in},i}(k) \le T_{\max}, \quad i = 1, 2, \ldots, N_r$$

where $T_{\mathrm{in},i}(k)$ denotes the predicted inlet temperature of rack $i$ at step $k$, $T_{\max}$ represents the upper safety threshold of inlet temperature, and $N_r$ denotes the total number of racks.
In addition to thermal safety, physical constraints of cooling equipment are imposed to ensure reliable and safe operation. These constraints limit both the admissible range and the rate of change of control variables, which are expressed as:

$$u_{\min} \le u(k) \le u_{\max}$$

and

$$\Delta u_{\min} \le \Delta u(k) \le \Delta u_{\max}$$

where $u(k)$ denotes the control input vector at step $k$, including the supply air temperature setpoint, airflow rate setting, and chilled water valve opening. The vectors $u_{\min}$ and $u_{\max}$ define the lower and upper bounds of the control variables, respectively, while $\Delta u_{\min}$ and $\Delta u_{\max}$ specify the allowable range of control input increments.
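In the actual controller these box and rate constraints are enforced inside the QP itself; as a simplified illustration of how they jointly restrict each control move, a hypothetical element-wise projection can be sketched (all bound values illustrative):

```python
def clamp_control(u_prev, u_desired, u_min, u_max, du_min, du_max):
    """Project a desired control move onto u_min <= u <= u_max and
    du_min <= u - u_prev <= du_max, element-wise."""
    out = []
    for up, ud, lo, hi, dlo, dhi in zip(u_prev, u_desired, u_min, u_max, du_min, du_max):
        u = min(max(ud, up + dlo), up + dhi)  # rate-of-change limit first
        u = min(max(u, lo), hi)               # then the absolute bounds
        out.append(u)
    return out

# A large requested jump in a setpoint is limited to the allowed increment.
assert clamp_control([18.0], [25.0], [16.0], [24.0], [-1.0], [1.0]) == [19.0]
```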
To ensure real-time performance under dynamic workloads, the MPC optimization problem is formulated as a quadratic programming (QP) problem and solved using the OSQP (Operator Splitting Quadratic Program) solver, which is specifically designed for embedded and real-time applications. The computational cost scales cubically with the problem size, on the order of $O((m+n)^3)$, where $m$ is the number of control variables and $n$ is the state dimension. With the configured prediction horizon of 30 min (180 sampling points at 10 s intervals) and control horizon of 5 min (30 control actions), the average solver computation time is 127 ms with a maximum of 312 ms on the edge computing device (Advantech Co., Ltd., Taipei, Taiwan), equipped with an Intel Core i7-10700 processor (Intel Corporation, Santa Clara, CA, USA) and 32 GB DDR4 RAM, which is well within the 5 min control update period.
System stability under model uncertainty and sensor noise is ensured through several mechanisms. First, the rolling-horizon strategy inherently provides feedback correction, as the optimization is re-executed at each control cycle using updated state measurements, compensating for prediction errors. Second, constraint tightening is employed by setting the effective temperature threshold 0.5 °C below the actual safety limit (26.5 °C instead of 27 °C), providing a safety margin against prediction uncertainty. Third, the control increment constraints limit the rate of change of control actions, preventing aggressive responses to noisy measurements and ensuring smooth transitions. Fourth, a Kalman filter-based state estimator processes the raw distributed fiber optic measurements to reduce sensor noise before feeding into the MPC controller, with the filter covariance matrices tuned based on the sensor uncertainty analysis in
Table 4. These combined mechanisms ensure robust and stable control performance under practical operating conditions with measurement noise standard deviation up to 0.3 °C and model prediction error up to 0.5 °C.
By explicitly embedding both thermal safety constraints and actuator physical limitations into the rolling-horizon optimization problem, the proposed MPC controller guarantees feasible and stable control actions under dynamic workloads and varying environmental conditions, thereby ensuring reliable and energy-efficient thermal regulation of the data center.
The solution process of the MPC controller is illustrated in
Figure 5. At each control cycle, the controller acquires real-time temperature measurements from the distributed fiber optic sensing system and collects operating status information of cooling equipment. The hybrid thermal prediction model is then invoked to generate multi-step temperature trajectories over the prediction horizon. Based on the predicted thermal evolution, the control problem is formulated as a constrained quadratic programming (QP) problem and solved to obtain the optimal control sequence. Finally, only the first control action in the sequence is applied to the actuators, and the entire optimization procedure is repeated at the next sampling instant following the rolling-horizon strategy.
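The receding-horizon procedure described above can be sketched as a generic loop; the four function hooks are hypothetical placeholders for the sensing, prediction, QP-solving, and actuation stages, and only the first move of each solved sequence is applied:

```python
def rolling_horizon_loop(measure, predict, solve_qp, apply, n_cycles):
    """Generic receding-horizon loop: re-measure, re-predict, and re-solve
    every cycle, applying only the first action of each optimal sequence."""
    applied = []
    for _ in range(n_cycles):
        state = measure()                    # fiber-optic temperatures + equipment status
        trajectory = predict(state)          # multi-step thermal forecast
        u_sequence = solve_qp(state, trajectory)  # constrained QP solution
        apply(u_sequence[0])                 # first action only; rest discarded
        applied.append(u_sequence[0])
    return applied

# With toy stand-in hooks, exactly one action per cycle is applied.
actions = rolling_horizon_loop(lambda: 24.0,
                               lambda s: [s],
                               lambda s, tr: [s + 1.0, s + 2.0],
                               lambda u: None,
                               3)
assert actions == [25.0, 25.0, 25.0]
```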
Serale et al. [
31] provided a comprehensive review of model predictive control for enhancing the energy efficiency of buildings and HVAC systems, covering problem formulation, practical applications, and future opportunities. Their work highlights the importance of real-time optimization, reliable communication architectures, and human–machine interaction interfaces in large-scale energy systems, which offers valuable guidance for the control system implementation and operational interface design of the proposed data center thermal management platform.
By embedding real-time feedback and rolling optimization mechanisms into the control loop, the proposed MPC framework can effectively compensate for load disturbances, modeling uncertainties, and environmental fluctuations. This closed-loop predictive control architecture ensures safe, stable, and energy-efficient thermal regulation of the data center under dynamically varying operating conditions.