A Deep Multimodal Fusion Framework for Noncontact Temperature Detection in Ceramic Roller Kilns

Cai, Kuiyang; Tu, Shanchuan; Wang, Shujuan

doi:10.3390/app16052530


by Kuiyang Cai ^1,2, Shanchuan Tu ^1,2 and Shujuan Wang ^1,2,*

¹ School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China

² Southwest United Graduate School, Kunming 650092, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(5), 2530; https://doi.org/10.3390/app16052530

Submission received: 22 January 2026 / Revised: 28 February 2026 / Accepted: 4 March 2026 / Published: 6 March 2026


Abstract

Accurate temperature control in ceramic roller kilns is critical for ensuring product quality; however, it remains challenging due to nonlinear thermal dynamics and the spatial lag inherent in traditional contact-based sensors. To address the limitations of sparse wall-mounted thermocouples and optical interference in kiln images, this paper presents a multimodal spatiotemporal fusion network (MST-FusionNet) for noncontact temperature detection of ceramic bodies on roller tracks. The proposed network integrates in-furnace combustion image sequences with distributed thermocouple measurements. First, a physics-informed pseudo-heatmap generation strategy based on Gaussian distributions is introduced to align discrete thermocouple readings with visual features, enabling effective early-stage multimodal fusion. Second, a residual compensation mechanism uses thermocouple data as a stable reference to learn local temperature deviations from visual and temporal features. In addition, an attention-enhanced LSTM module is employed to model combustion dynamics and suppress unreliable frames caused by smoke and flame fluctuations. Experimental results on a real industrial dataset show that the proposed method achieves a mean absolute error of 0.9164 °C and a root mean squared error of 1.2422 °C, demonstrating better performance than single-modal methods and simple fusion baselines. The proposed framework exhibits stable spatial characteristics across different roller positions and helps bridge the spatial discrepancy between boundary measurements and the actual thermal state of ceramic products, providing an effective solution for temperature detection in roller kilns.

Keywords:

ceramic roller kiln; temperature detection; multimodal data fusion; soft sensing; spatiotemporal modeling; computer vision; residual learning

1. Introduction

The roller kiln has become key equipment in the manufacturing of architectural ceramics due to its high operational efficiency. However, it is characterized by complex internal heat flow patterns and uneven heat transfer. These thermal instabilities often lead to product defects, such as deformation and color variation, as well as excessive energy consumption [1].

With the ceramic industry’s commitment to carbon neutrality goals, traditional experience-based manual control strategies struggle to balance energy conservation requirements with the demand for high-quality production. To address these challenges, modern roller kiln systems have integrated oxygen-enriched nonlinear combustion technology [2], which enables the roller kiln to achieve higher flame temperatures, accelerated reaction rates, reduced heat loss from exhaust gases, and lower nitrogen oxide emissions. The accuracy of temperature control during the sintering process is closely related to the final sintering quality, determining the density, mechanical strength, and aesthetic appearance of ceramic products. Although oxygen-enriched combustion can significantly improve energy efficiency, it fundamentally changes the thermodynamic environment inside the kiln. Compared with traditional air combustion, oxygen-enriched conditions are characterized by faster flame propagation, a higher core temperature, and significant changes in radiation spectral characteristics. As a result, the temperature field inside the kiln exhibits highly nonlinear, strongly coupled, and time-varying characteristics [3]. This complex combustion state greatly increases the difficulty of precise temperature regulation. Therefore, precise detection of the sintering zone temperature is a prerequisite for intelligent control of the roller kiln, improvement of product quality, and reduction in energy consumption, and thus has significant engineering and practical value.

Traditionally, industrial kilns have relied on contact-based sensors such as wall-mounted thermocouples to monitor internal conditions. Although these sensors are convenient, they measure only the localized atmospheric or refractory wall temperature rather than the actual temperature of the ceramic workpieces on the rollers. This inherent spatial gap creates a significant thermal lag, making real-time, high-precision control challenging [4]. To address the spatial limitations of thermocouples, noncontact measurements have been increasingly adopted. Recent advances in image processing technology have further enabled the extraction of temperature representations directly from combustion flame images [5,6,7]. However, single-modal visual approaches are highly susceptible to harsh in-kiln environments, suffering from optical interference caused by smoke, dust, and variable fuel combustion states. Consequently, relying solely on visual data often results in fluctuating accuracy, whereas dependence on point-based thermocouples fails to capture the continuous spatial temperature distribution across the kiln bed. To overcome the limitations of single-modal systems, recent studies in high-temperature industrial applications, such as blast furnaces in metallurgy and boiler combustion in power plants, have begun exploring multimodal data fusion [8,9]. These approaches typically integrate sensor data with visual features to improve predictive stability in industrial soft sensing applications [10,11]. However, directly applying integration methods developed for other high-temperature industries to ceramic roller kilns presents a unique challenge: existing works lack a physics-aware mechanism to map localized sensor data onto the visual space. Such alignment is essential for accurately capturing the spatial thermal lag and optical interference specific to ceramic kilns. In short, existing approaches for kiln temperature estimation either rely on a single sensing modality or lack effective mechanisms to align heterogeneous information with the underlying spatial–temporal thermal characteristics of the kiln [12]. In particular, how to integrate visual combustion information with sparse thermocouple measurements in a physically meaningful and robust manner remains an open challenge.

To address these challenges, this study aims to develop a physically interpretable multimodal spatiotemporal fusion framework capable of accurately estimating the in situ product temperature in ceramic roller kilns under complex combustion and spatial-lag conditions. To this end, we propose MST-FusionNet. The main contributions are summarized as follows:

We propose MST-FusionNet, a multimodal temperature detection framework for ceramic roller kilns that integrates furnace interior combustion images with wall-mounted thermocouple measurements. This framework aims to bridge the spatial discrepancy between boundary wall measurements and the actual thermal state of the ceramic bodies on the roller.
We propose a physics-informed pseudo-heatmap construction strategy, in which discrete one-dimensional thermocouple readings are transformed into Gaussian-smoothed two-dimensional spatial distributions aligned with the image plane.
We design a residual compensation mechanism within the spatiotemporal network. The thermocouple data establish a stable reference temperature, while the vision-based deep network learns the temperature deviations caused by local combustion non-uniformity. This mechanism is intended to improve the model’s robustness under fluctuating combustion conditions.
We introduce an attention-enhanced temporal model that adaptively emphasizes informative frames and suppresses unreliable visual information, capturing the dynamic evolution of the combustion state and the thermal lag within the kiln.

The novelty of this work lies not only in improving predictive accuracy, but also in providing a physically motivated and structurally interpretable multimodal fusion paradigm tailored to high-temperature industrial environments. By embedding domain knowledge into the fusion process, the proposed framework aims to enhance robustness, generalization potential, and practical applicability in real-world kiln systems.

2. Related Work

Temperature monitoring in the kiln is mainly achieved through manual observation, contact detection, and infrared thermal imaging. In the complex and dynamic environment inside the kiln, these traditional methods face significant challenges. Manual observation relies on the experience and judgment of the operator, which is prone to large subjective errors and lacks repeatability. Contact measurement usually requires inserting thermocouples into the furnace wall, which introduces several operational limitations. Thermocouples measure only the temperature of the air near the boundary wall and cannot directly, promptly, or accurately describe the actual thermal state of the ceramic products as they move rapidly on the rollers. This limitation leads to a significant spatial lag between the measured data and the actual sintering process. In addition, the furnace is continuously exposed to extremely high temperatures and corrosive environments. Such harsh conditions accelerate the aging and corrosion of the thermocouple protective sleeve, resulting in a gradual decrease in measurement accuracy over time, and frequent sensor replacements increase maintenance costs. Infrared detection methods struggle to meet the requirements of long-term, continuous, and real-time monitoring, especially in environments with dense smoke and intense physical and chemical reactions. Given these limitations, computer vision-based noncontact soft sensing has emerged as an active research area, offering a reliable alternative that addresses the shortcomings of traditional methods.

2.1. Computer Vision-Based Temperature Perception

Computer vision–based soft measurement methods estimate furnace temperature fields by analyzing radiation information from flame or wall images. Research in this field has evolved from static hand-crafted feature extraction to dynamic spatiotemporal modeling, and more recently to end-to-end deep learning frameworks with increasing attention to interpretability.

2.1.1. Hand-Crafted Features and Segmentation Strategies

Early studies focused on extracting visual features such as flame color, texture descriptors, and geometric characteristics, followed by shallow machine learning classifiers. A key prerequisite for these methods is accurate region-of-interest segmentation. However, complex brightness variations and dust interference inside kilns make simple thresholding unreliable.

To address this issue, Zhang et al. [13] proposed a hybrid segmentation approach combining Otsu multi-thresholding and K-means clustering to separate combustion regions before applying SVM classification. In contrast, Li et al. [14] avoided explicit segmentation by extracting heterogeneous global and local features using multivariate image analysis (MIA), principal component analysis (PCA), and bag-of-visual-words (BoVW), followed by extreme learning machine (ELM) classification. Although effective under controlled conditions, these methods rely on manually designed descriptors and exhibit limited capacity to capture complex nonlinear thermal dynamics.

2.1.2. Spatiotemporal and Dynamic Feature Analysis

Single-frame analysis is often insufficient for modeling combustion evolution under smoke interference or unstable flame oscillations. Therefore, researchers have increasingly incorporated spatiotemporal information from video sequences.

Chen et al. [15,16] constructed dynamic indicators such as short-term energy and sample entropy to quantify flame fluctuation intensity, demonstrating improved robustness to blur and noise compared with static luminance features. Jiang et al. [17] further introduced nonlinear dynamic system analysis to extract chaotic features from flame intensity signals for combustion stability evaluation.

Deep spatiotemporal models have achieved further improvements. Lyu et al. [18] proposed a CNN–LSTM framework in which CNN layers extract spatial representations and LSTM units model temporal dependencies. Visualization of feature maps and hidden state trajectories enhanced the physical interpretability of the learned representations.

2.1.3. Deep Learning and Model Explainability

In recent years, deep learning has significantly advanced automated feature extraction in combustion monitoring. Wang et al. [19] demonstrated that CNN-based models outperform shallow classifiers in combustion state identification and heat release estimation. Later studies addressed complex scenes and limited labeled data through instance segmentation [20], generative augmentation [4], and transfer learning with pre-trained backbones [21].

Despite these advances, concerns remain regarding the “black-box” nature of neural networks in industrial safety systems. Zhou et al. [22] incorporated class activation maps (CAM) into a CNN architecture to visualize attention regions, confirming that the model focuses on physically meaningful flame–wall interaction zones. This line of research highlights the growing importance of interpretable deep learning in high-temperature industrial environments.

2.2. Multi-Sensor and Multimodal Data Fusion for Thermal State Recognition

Visual-based soft measurement provides detailed spatial information about combustion and radiation. However, its reliability may deteriorate under severe occlusion, light saturation, or abrupt process disturbances. To improve robustness in complex industrial environments, multi-sensor measurement and data fusion have been widely studied for thermal state estimation.

From a theoretical perspective, Liu et al. [23] summarized fundamental paradigms of multi-sensor fusion, including sensor-level, feature-level, and decision-level fusion, emphasizing their importance for inferring indirectly measurable physical states. In high-temperature furnaces, where direct material temperature measurement is often infeasible, such heterogeneous data integration is particularly valuable. In metallurgical industries, early fusion applications mainly relied on numerical process variables. Wang et al. [24] proposed a CNN-based model utilizing 23 heterogeneous operational parameters to predict the endpoint oxygen content in converter steelmaking. Their results showed that deep learning-based fusion outperformed shallow neural networks in capturing complex thermochemical couplings. Although visual data were not incorporated, this study demonstrated the feasibility of inferring latent thermal states from indirect multi-source measurements. More recently, multimodal frameworks integrating visual and numerical information have attracted increasing attention. Cui et al. [11] combined tuyere image gray-scale features with blast furnace operational variables using time-series neural networks, achieving improved prediction accuracy for hot metal temperature and silicon content. Zhou et al. [25] proposed a Dual-Channel Fusion Analysis Network (DCFANet), in which tuyere images were processed by CNN modules and sequential numerical variables were modeled using GRU with attention mechanisms, followed by feature-level fusion for temperature prediction. Industrial experiments indicated improved robustness compared with numerical-only models. In addition, Zhou et al. [10] introduced a temperature fusion clustering stacking (TFC-stacking) framework based on spatial thermocouple measurements. By constructing temperature distribution maps and extracting high-level spatial features using convolutional autoencoders, followed by clustering and ensemble learning, their method enhanced prediction accuracy and robustness. Although the target variable differs from solid product temperature, their explicit modeling of spatial thermal fields provides useful insights for addressing non-uniform temperature distribution in continuous high-temperature processes.

These studies collectively demonstrate that heterogeneous sensing modalities provide complementary information for inferring the thermal state of furnaces. However, several fundamental differences distinguish the ceramic roller kiln scenario addressed in this study.

First, most existing multimodal approaches focus on predicting bulk molten metal temperature or aggregated process-level variables. In contrast, our task targets the temperature of moving solid ceramic bodies transported along roller tracks, leading to spatial–temporal misalignment between wall thermocouple measurements and the actual workpiece temperature.

Second, prior frameworks generally treat visual and numerical signals as parallel streams and perform early or late fusion without explicit spatial alignment. In roller kilns, wall-mounted thermocouples measure localized boundary temperatures that exhibit spatial lag relative to the roller surface. To address this inconsistency, we introduce a physics-inspired pseudo-heatmap construction strategy that transforms discrete thermocouple readings into Gaussian-smoothed spatial distributions aligned with the image plane, thereby embedding spatial priors into the convolutional backbone.

Third, most multimodal regression models directly learn absolute temperature mappings from fused features. Under dynamic kiln conditions characterized by flame flickering and radiation fluctuations, such direct regression can lead to unstable predictions. Instead, we adopt a residual compensation strategy: thermocouple data provide a stable baseline estimate, and the visual–temporal network learns only local temperature deviations. This coarse-to-fine modeling paradigm improves robustness while mitigating spatial lag effects.

To achieve precise temperature detection for the ceramic slabs on the roller track, we propose a multimodal temperature detection method for ceramic roller kilns based on physical perception enhancement and a spatiotemporal attention mechanism, combining in-furnace images with distributed wall-mounted thermocouple measurements. Table 1 summarizes the methodological differences between representative multimodal fusion approaches in high-temperature industries and the proposed MST-FusionNet framework.

3. Preliminaries

The roller kiln is a typical continuous firing system in the modern ceramic industry, featuring a narrow and elongated tunnel-like structure. Its core conveying mechanism consists of a series of parallel rotating rollers. These rollers are laterally arranged across the cross-section of the working channel and transport ceramic bodies from the kiln entrance to the exit to achieve a continuous production process. Unlike traditional kilns, the roller kiln eliminates heavy kiln cars and sliding plates, thereby improving thermal efficiency and energy utilization. The schematic diagram of the roller kiln is shown in Figure 1. Based on the structural continuity of the kiln, the roller kiln can be divided into three main thermodynamic zones according to the temperature curve and process requirements: the preheating zone, the sintering zone, and the cooling zone.

Preheating Zone (Ambient∼900 °C): This zone usually operates under slightly negative pressure and serves as the preparation stage for the green body prior to sintering. This part of the kiln has no independent heating device; instead, residual heat from the sintering zone is used to dry the green body in a counter-flow manner. Residual moisture evaporates from the ceramic green body, preparing its internal microstructure for subsequent high-temperature treatment.
Sintering Zone (900 °C∼1250 °C): This zone represents the core stage of the manufacturing process, where the critical physicochemical transformations occur. Heating is actively supplied by arrays of open-flame burners or electric elements mounted above and below the roller plane, maintaining a micro-positive pressure environment to ensure uniform heat distribution. Under these intense thermal conditions, the ceramic microstructure undergoes a phase transition, which determines the final density and mechanical integrity of the product.
Cooling Zone (1250 °C∼Ambient): This section represents the controlled cooling area of the ceramic product after sintering. This section is typically divided into three phases: rapid cooling, slow cooling, and final cooling. To ensure thermal and flow-field isolation, roller kilns usually install baffle walls at the junction of the sintering zone and the cooling zone. These physical barriers are crucial for preventing the backflow of smoke and alleviating the lateral temperature differences that could lead to structural defects.

4. Methods

The internal temperature field in the sintering zone of ceramic roller kilns exhibits significant nonlinearity, strong coupling, and time-varying behavior. In addition, single-sensor information is inherently limited. To address these challenges, we propose a multimodal spatiotemporal fusion network MST-FusionNet supported by physical perception enhancement mechanisms and residual compensation strategies. The main objective of this architecture is to integrate high-dimensional, unstructured flame images captured inside the kiln with low-dimensional, structured thermocouple measurements obtained from the kiln walls, thereby enabling precise temperature detection of ceramic components near multiple rollers. As shown in Figure 2, the entire framework consists of four key modules:

An Image Feature Extraction Module enhanced by pseudo-heatmaps, where a pseudo-heatmap is defined as a physics-inspired Gaussian-smoothed spatial distribution constructed from discrete thermocouple measurements to approximate their thermal influence range and achieve cross-modal spatial alignment;
A Thermocouple Feature Encoding and Baseline Temperature Estimation Module;
A Temporal Modeling Module augmented with a multi-head attention (MHA) mechanism;
A Multimodal Fusion Module based on residual compensation.

This network adopts a coarse-to-fine approach. Firstly, stable thermocouple data is utilized to establish the reference temperature. Subsequently, the deep visual and temporal sub-networks explicitly learn temperature deviations caused by local combustion inhomogeneity. By separating the reference trend from the local fluctuations, this design is intended to enhance the stability and generalization ability of the model.

4.1. Image Feature Extraction Module Enhanced by Pseudo-Heatmaps

4.1.1. Pseudo-Heatmaps

The data used in this study are multimodal. Flame images are represented in a high-dimensional format ($H \times W \times C$), while thermocouple readings are in a low-dimensional form ($1 \times N$). To bridge this dimensional disparity and take the spatial topology of the thermocouple array into account, we propose a pseudo-heatmap construction method based on physical principles. This technique converts the discrete thermocouple data $T$ into a continuous two-dimensional heat map $\mathbf{H}$ whose spatial size matches the resolution of the input image ($H \times W$).

The pseudo-heatmap construction is physically inspired by the fundamental solution of the heat conduction equation for a localized heat source. In practical industrial environments, the temperature sensed by a wall-mounted thermocouple is not purely local; it is coupled to the surrounding spatial region through thermal radiation and convection. To provide an intuitive interpretation of our pseudo-heatmap generation process, we draw inspiration from the classical two-dimensional heat diffusion model. Mathematically, assuming the kiln cross-section is an isotropic thermal medium, the heat transfer from a point source (the thermocouple) can be described by the 2D transient heat conduction equation:

$$\frac{\partial T}{\partial t} = \alpha \left( \frac{\partial^{2} T}{\partial x^{2}} + \frac{\partial^{2} T}{\partial y^{2}} \right), \tag{1}$$

where $T(x, y, t)$ denotes temperature, $t$ is the time, and $\alpha$ is the effective thermal diffusivity of the local kiln environment.

For an instantaneous point heat source in an infinite 2D plane, the fundamental solution to this partial differential equation takes the form of a Gaussian distribution:

$$T(x, y, t) = \frac{Q}{4 \pi \alpha t} \exp\left( - \frac{x^{2} + y^{2}}{4 \alpha t} \right), \tag{2}$$

where $Q$ represents the heat source strength, and $x$ and $y$ denote the spatial offsets from the thermocouple. Comparing the exponential term with the standard Gaussian kernel $\exp\left( - \frac{x^{2} + y^{2}}{2 \sigma^{2}} \right)$ gives $4 \alpha t = 2 \sigma^{2}$, that is, $\sigma = \sqrt{2 \alpha t}$.
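As a quick numerical check of this correspondence, the sketch below compares the exponential factor of Equation (2) with the Gaussian kernel evaluated at $\sigma = \sqrt{2 \alpha t}$ for a few offsets. The function names and the values of `alpha` and `t` are illustrative assumptions, not kiln parameters from this study.

```python
import math


def heat_kernel_shape(x, y, alpha, t):
    """Exponential factor of Equation (2): exp(-(x^2 + y^2) / (4 * alpha * t))."""
    return math.exp(-(x * x + y * y) / (4.0 * alpha * t))


def gaussian_shape(x, y, sigma):
    """Unnormalized Gaussian kernel: exp(-(x^2 + y^2) / (2 * sigma^2))."""
    return math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))


def effective_sigma(alpha, t):
    """Diffusion scale implied by matching the two exponents: sigma = sqrt(2 * alpha * t)."""
    return math.sqrt(2.0 * alpha * t)


# Arbitrary illustrative values (assumptions, not measured quantities).
alpha, t = 0.5, 4.0
sigma = effective_sigma(alpha, t)

offsets = [(0.0, 0.0), (1.0, 0.0), (1.5, -2.0), (3.0, 3.0)]
max_gap = max(
    abs(heat_kernel_shape(x, y, alpha, t) - gaussian_shape(x, y, sigma))
    for x, y in offsets
)
print("sigma =", sigma, "largest difference =", max_gap)
```

The two profiles coincide up to floating-point rounding, which is the sense in which $\sigma$ can stand in for the product of diffusivity and elapsed time.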

This correspondence suggests that $\sigma$ can be interpreted as an effective diffusion scale, jointly reflecting the diffusivity and the elapsed diffusion time. Although our pseudo-heatmap generation process does not explicitly solve the physical diffusion equation, this analogy provides a physics-inspired understanding of how $\sigma$ controls the spatial smoothness and influence range of the generated pseudo-heatmaps. Larger values of $\sigma$ correspond to more spatially diffused and smoother representations, while smaller values emphasize localized responses. In practice, because the precise real-time thermal diffusivity $\alpha$ of the turbulent kiln gas is difficult to measure directly, $\sigma$ acts as a learnable or hyper-parameterized proxy for $\sqrt{2 \alpha t}$. Sensitivity analysis (see Section 5.4.8) demonstrates that the model performance remains stable within a reasonable $\sigma$ range, indicating robustness rather than over-reliance on a specific heuristic parameter.

According to the actual installation positions on the kiln wall, the physical positions of the sensors are projected onto the image coordinate system. The image-plane coordinates of the $i$-th visible thermocouple are $(x_{i}, y_{i})$, and its corresponding temperature value is $T_{i}$. To maintain the spatial independence of the readings from different sensors, we avoid the traditional method of aggregating all points onto a single-channel image and instead construct a multi-channel sparse tensor $S \in \mathbb{R}^{I \times H \times W}$, with one channel per visible thermocouple. For the $i$-th channel, corresponding to the $i$-th visible thermocouple, the pixel initialization is defined as follows:

$$S_{i}(x, y) = \begin{cases} T_{i}, & \text{if } x = x_{i} \text{ and } y = y_{i}, \\ 0, & \text{otherwise}. \end{cases} \tag{3}$$

We introduce a two-dimensional Gaussian kernel $G$ of size $K \times K$. This kernel performs spatial diffusion on the discrete temperature readings, thereby approximating the inherent spatial continuity and diffusion characteristics of the temperature field. The Gaussian kernel $G$ is defined as

$$G(u, v) = \frac{1}{Z} \exp\left( - \frac{u^{2} + v^{2}}{2 \sigma^{2}} \right), \tag{4}$$

where $(u, v)$ represents the relative coordinates within the kernel, $\sigma$ is the standard deviation controlling the effective range of thermal influence, and $Z$ denotes a normalization constant ensuring that all elements of the kernel sum to 1. This normalization prevents artificial amplification or attenuation of temperature values during smoothing. The smoothed heat map $\mathbf{H}_{i}$ corresponding to the $i$-th visible thermocouple is obtained by convolving the $i$-th channel $S_{i}$ of the sparse tensor with the Gaussian kernel $G$:

$$\mathbf{H}_{i}(x, y) = (S_{i} * G)(x, y). \tag{5}$$

This operation allows the temperature information originally concentrated at a single point to diffuse spatially, forming a localized thermal distribution centered at $(x_{i}, y_{i})$ with gradually decreasing intensity, as shown in Figure 3.
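The construction in Equations (3)–(5) can be sketched in a few lines. The following is a minimal, dependency-free illustration rather than the authors' implementation: the function names, the image size, the sensor positions, and the values of `size` and `sigma` are all assumptions chosen for the example. Because each sparse channel holds a single nonzero pixel, convolving it with $G$ reduces to stamping $T_{i} \cdot G$ around $(x_{i}, y_{i})$; pixels falling outside the image are simply dropped.

```python
import math


def gaussian_kernel(size, sigma):
    """K x K Gaussian kernel normalized so that its elements sum to 1 (Equation (4))."""
    if size % 2 != 1:
        raise ValueError("kernel size must be odd so that it has a centre pixel")
    r = size // 2
    raw = [[math.exp(-(u * u + v * v) / (2.0 * sigma ** 2))
            for u in range(-r, r + 1)]
           for v in range(-r, r + 1)]
    z = sum(sum(row) for row in raw)  # normalization constant Z
    return [[w / z for w in row] for row in raw]


def pseudo_heatmap(height, width, sensors, size=5, sigma=1.0):
    """Build one H x W channel per sensor.

    sensors: list of (x_i, y_i, T_i) with integer pixel coordinates.
    Each channel equals the sparse map of Equation (3) convolved with G (Equation (5)).
    """
    kernel = gaussian_kernel(size, sigma)
    r = size // 2
    channels = []
    for xi, yi, ti in sensors:
        channel = [[0.0] * width for _ in range(height)]
        # Convolution of a single-pixel source with G: T_i * G(x - x_i, y - y_i).
        for dv in range(-r, r + 1):
            for du in range(-r, r + 1):
                x, y = xi + du, yi + dv
                if 0 <= x < width and 0 <= y < height:
                    channel[y][x] = ti * kernel[dv + r][du + r]
        channels.append(channel)
    return channels


# Hypothetical example: two visible thermocouples on a 12 x 16 pixel grid.
maps = pseudo_heatmap(12, 16, [(4, 3, 1180.0), (12, 8, 1215.0)], size=5, sigma=1.0)
print(len(maps), "channels of size", len(maps[0]), "x", len(maps[0][0]))
```

With a kernel normalized to sum to 1, each channel's values add up to the original reading whenever the kernel footprint lies fully inside the image, while the peak stays at the sensor location.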

4.1.2. Image Feature Extraction

We adopt an early-stage fusion strategy at the channel level to fuse visual features with thermal field features. Each frame $X_{t}$ in the original combustion sequence is concatenated along the channel dimension with the constructed multi-channel pseudo-heatmap $\mathbf{H}$ to form an enhanced multimodal input tensor that fuses visual texture and spatial temperature priors. After early fusion, each frame in the image sequence can be expressed as $X_{t} \in \mathbb{R}^{(3 + N) \times H \times W}$.

For feature extraction, we select ResNet-18 as the visual feature extractor. This choice is motivated by the trade-off between representational capacity and computational efficiency. Deeper backbones provide stronger nonlinear modeling ability but introduce substantially higher computational cost and latency. In preliminary experiments, deeper architectures such as ResNet-34 showed marginal accuracy improvement (<0.2 °C MAE) while increasing inference latency by over 70% and model size by approximately 78%. Therefore, ResNet-18 was selected as a balanced architecture that satisfies both accuracy and deployment constraints.

To adapt to the extended input dimensions, we reconfigure the first convolutional layer of ResNet-18 to accept $3 + N$ input channels instead of the standard RGB triplet. We adopt a mixed strategy for weight initialization: the weights corresponding to the original RGB channels inherit the parameters pre-trained on ImageNet to exploit transfer learning, while the weights of the new pseudo-heatmap channels are initialized using the Xavier method. This approach helps maintain numerical stability during the early stages of training. Considering that kiln combustion textures and temperature gradients are primarily captured by middle- and low-level features, we remove the first max pooling layer of the original ResNet-18 [26] to reduce the loss of local thermal texture information due to early downsampling. After passing through the convolutional backbone network, each frame is mapped to a high-dimensional channel-level fused feature vector:

$$f_{t}^{cha} = \Phi(X_{t}), \quad f_{t}^{cha} \in \mathbb{R}^{D}, \tag{6}$$

where $\Phi(\cdot)$ represents the convolutional feature extraction operator. By stacking the frame-level features $f_{t}^{cha}$ along the temporal dimension, the channel-level fusion feature tensor of the sequence is obtained as $F^{cha} \in \mathbb{R}^{L \times D}$, where $L$ is the number of frames in the sequence.
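To make the channel bookkeeping and the mixed initialization concrete, the sketch below works on plain nested lists instead of a deep learning framework. It is an illustration under stated assumptions: `concat_channels` and `expand_first_conv` are hypothetical helper names, the "pretrained" filter bank is a small stand-in rather than real ImageNet weights, and the Xavier variant (uniform, with fan sizes computed over the new channel slice only) is one possible reading of "the Xavier method".

```python
import math
import random


def concat_channels(frame, heatmaps):
    """Early fusion: stack a 3 x H x W frame and N x H x W heatmaps along the channel axis."""
    h, w = len(frame[0]), len(frame[0][0])
    for ch in heatmaps:
        if len(ch) != h or len(ch[0]) != w:
            raise ValueError("heatmap resolution must match the frame")
    return list(frame) + list(heatmaps)


def expand_first_conv(rgb_weights, n_extra, seed=0):
    """Extend a first-layer filter bank from 3 to 3 + n_extra input channels.

    rgb_weights[o][c] is the k x k filter of output channel o for RGB channel c.
    RGB slices are copied unchanged; new slices are drawn Xavier-uniform.
    """
    rng = random.Random(seed)
    out_ch = len(rgb_weights)
    k = len(rgb_weights[0][0])
    # Assumption: fan sizes are computed over the newly added slice only.
    fan_in, fan_out = n_extra * k * k, out_ch * k * k
    bound = math.sqrt(6.0 / (fan_in + fan_out))
    expanded = []
    for o in range(out_ch):
        copied = [[row[:] for row in rgb_weights[o][c]] for c in range(3)]
        fresh = [[[rng.uniform(-bound, bound) for _ in range(k)] for _ in range(k)]
                 for _ in range(n_extra)]
        expanded.append(copied + fresh)
    return expanded, bound


# Toy sizes (assumptions): a 4 x 6 frame and N = 2 pseudo-heatmap channels.
H, W, N = 4, 6, 2
frame = [[[0.5] * W for _ in range(H)] for _ in range(3)]
heatmaps = [[[0.0] * W for _ in range(H)] for _ in range(N)]
fused = concat_channels(frame, heatmaps)  # (3 + N) x H x W

# Stand-in for pretrained weights: 4 output channels, 3 input channels, 3 x 3 filters.
pretrained = [[[[0.01 * (o + c)] * 3 for _ in range(3)] for c in range(3)] for o in range(4)]
weights, bound = expand_first_conv(pretrained, N)
print(len(fused), "input channels;", len(weights[0]), "filter slices per output channel")
```

Copying the RGB slices leaves the pretrained response to the visual channels untouched at the start of training, while the bounded random slices let the pseudo-heatmap channels contribute without dominating the first-layer activations.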

4.2. Thermocouple Feature Encoding and Baseline Temperature Estimation Module

We construct a lightweight multi-layer perceptron (MLP) that directly takes the thermocouple vector $T$ as input and outputs the baseline temperature of the ceramic bodies on the rollers:

$$y_{base} = F_{base}(T) \in \mathbb{R}^{M}, \tag{7}$$

where $M$ represents the number of temperature measurement points on the roller track to be detected. This branch relies only on thermocouple data and can provide an initial estimate without image information, offering a stable reference for the subsequent deep networks.

Because some thermocouples inside the kiln may be visually occluded, relying solely on injected thermal image channels is insufficient to represent the overall thermal state. We encode the thermocouple vector

T

through a single-layer MLP into frame-level thermal features:

f^{th} = ϕ (T) \in R^{128},

(8)

and it is replicated across the sequence dimension to each frame, concatenated with the frame-level fusion feature of the image channel, and then linearly projected back to D dimensions to obtain the frame-level fusion feature:

f_{t}^{fra} = W [f_{t}^{cha}; f^{th}] + b \in R^{D},

(9)

The resulting frame-level fusion features incorporate both current radiation texture information and global temperature background information. In addition, we construct a sequence-level branch encoding

g = ψ (T) \in R^{128}

for the thermocouple vector and use it in the final fusion stage.
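The frame-level fusion in Equations (8) and (9) amounts to broadcasting one thermocouple embedding over all frames, concatenating, and projecting back to D dimensions. The sketch below uses NumPy with random weights; the backbone width D = 512, the ReLU activation inside ϕ, and the weight scales are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
L, D, N_TC = 10, 512, 15  # D = 512 is an assumed backbone output width

F_cha = rng.normal(size=(L, D))   # stacked frame-level features f_t^{cha}
T = rng.normal(size=N_TC)         # standardized thermocouple vector

# phi: one linear layer followed by ReLU (assumed form of the single-layer MLP).
W_phi = rng.normal(scale=0.1, size=(128, N_TC))
b_phi = np.zeros(128)
f_th = np.maximum(W_phi @ T + b_phi, 0.0)             # Eq. (8), shape (128,)

# Replicate the thermal feature for every frame, then concatenate and project.
f_th_seq = np.tile(f_th, (L, 1))                      # (L, 128)
concat = np.concatenate([F_cha, f_th_seq], axis=1)    # (L, D + 128)
W = rng.normal(scale=0.02, size=(D, D + 128))
b = np.zeros(D)
F_fra = concat @ W.T + b                              # Eq. (9), shape (L, D)
print(F_fra.shape)  # (10, 512)
```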

4.3. Temporal Modeling Module Augmented with a MHA Mechanism

Combustion inside the kiln is inherently dynamic, characterized by random flame flickering, turbulent gas flow in the flue, and significant thermal lag. These factors have temporal correlations and cannot be adequately captured by static analysis. Therefore, we propose an LSTM-attention architecture to capture sequential dynamic features. An LSTM unit consists of input, forget, and output gates, which jointly regulate the memory unit and hidden state to alleviate the problem of gradient vanishing. The mathematical formulas of these gating structures are as follows:

\begin{matrix} i_{t} & = σ (W_{i} f_{t}^{fra} + U_{i} h_{t - 1} + b_{i}), \\ f_{t} & = σ (W_{f} f_{t}^{fra} + U_{f} h_{t - 1} + b_{f}), \\ o_{t} & = σ (W_{o} f_{t}^{fra} + U_{o} h_{t - 1} + b_{o}), \\ {\tilde{c}}_{t} & = tanh (W_{c} f_{t}^{fra} + U_{c} h_{t - 1} + b_{c}), \\ c_{t} & = f_{t} \times c_{t - 1} + i_{t} \times {\tilde{c}}_{t}, \\ h_{t} & = o_{t} \times tanh (c_{t}), \end{matrix}

(10)

where W, U, and b represent the input weights, recurrent weights, and bias terms for each gate, respectively.

i_{t}

determines which information to update,

f_{t}

decides which information to discard, and

o_{t}

determines which part of the memory cell

c_{t}

will be output.

{\tilde{c}}_{t}

is a new candidate value vector created by the hyperbolic tangent function.

c_{t}

retains useful features and discards useless ones. The LSTM output

h_{t}

is obtained by controlling the activation of the memory cell

c_{t}

through the output gate

o_{t}

. The output of the LSTM network is denoted as

H = [h_{1}, \dots, h_{L}]

, which characterizes the temporal evolution of the flame state within the window.
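For readers who prefer code to gate equations, Equation (10) can be written as a single recurrent step. The following NumPy sketch uses small illustrative dimensions and random parameters; the helper names `init_params`, `lstm_step`, and `run_lstm` are hypothetical, and a practical implementation would rely on a framework LSTM layer instead.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(input_dim, hidden, rng):
    gates = ("i", "f", "o", "c")
    return {
        "W": {g: rng.normal(scale=0.1, size=(hidden, input_dim)) for g in gates},
        "U": {g: rng.normal(scale=0.1, size=(hidden, hidden)) for g in gates},
        "b": {g: np.zeros(hidden) for g in gates},
    }

def lstm_step(x, h_prev, c_prev, p):
    """One step of Eq. (10): W acts on the input, U on the previous hidden state."""
    W, U, b = p["W"], p["U"], p["b"]
    i = sigmoid(W["i"] @ x + U["i"] @ h_prev + b["i"])        # input gate
    f = sigmoid(W["f"] @ x + U["f"] @ h_prev + b["f"])        # forget gate
    o = sigmoid(W["o"] @ x + U["o"] @ h_prev + b["o"])        # output gate
    c_tilde = np.tanh(W["c"] @ x + U["c"] @ h_prev + b["c"])  # candidate memory
    c = f * c_prev + i * c_tilde                              # element-wise update
    h = o * np.tanh(c)
    return h, c

def run_lstm(F_fra, p, hidden):
    h, c = np.zeros(hidden), np.zeros(hidden)
    outputs = []
    for x in F_fra:            # iterate over the L frames of the window
        h, c = lstm_step(x, h, c, p)
        outputs.append(h)
    return np.stack(outputs)   # H = [h_1, ..., h_L]

rng = np.random.default_rng(0)
params = init_params(input_dim=8, hidden=4, rng=rng)
H = run_lstm(rng.normal(size=(10, 8)), params, hidden=4)
print(H.shape)  # (10, 4)
```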

In industrial kiln scenarios, image frames may be affected by smoke occlusion, localized extinction or intensification of flames, and camera exposure variations, leading to unreliable information in certain frames. Therefore, we introduce a multi-head self-attention mechanism (MHA) after the LSTM to enable the network to adaptively focus on key frames that are more informative for temperature discrimination within the sequence. In our implementation, the hidden dimension of the LSTM is set to 256. The multi-head self-attention module is implemented with 4 attention heads and a total embedding dimension of 256, resulting in 64-dimensional projections per head. A dropout rate of 0.1 is applied within the attention mechanism. The output computation of the multi-head self-attention mechanism is as follows:

A (Q, K, V) = softmax (\frac{Q K^{T}}{\sqrt{d_{k}}}) V

(11)

where

Q = K = V = H

. No additional positional encoding is introduced, since temporal dependencies are already modeled by the preceding LSTM layer.

d_{k}

denotes the dimensionality of the key vectors. We then perform global average pooling over the sequence dimension:

z = \frac{1}{L} \sum_{t = 1}^{L} A_{t} \in R^{256},

(12)

Through this mechanism, the model automatically focuses on the image frames that contribute most to the final temperature prediction, thereby improving robustness under complex operating conditions.
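Equations (11) and (12) can be illustrated with a single-head version of the attention step. This NumPy sketch is a simplification under stated assumptions: it omits the learned query, key, and value projections, the split into 4 heads, and dropout, and it applies the softmax directly with Q = K = V = H. The function name `self_attention_pool` is hypothetical.

```python
import numpy as np

def self_attention_pool(H):
    """Scaled dot-product self-attention with Q = K = V = H, then mean pooling."""
    d_k = H.shape[1]
    scores = (H @ H.T) / np.sqrt(d_k)                      # (L, L) frame similarity
    scores = scores - scores.max(axis=1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True) # row-wise softmax
    A = weights @ H                                        # Eq. (11), shape (L, d_k)
    z = A.mean(axis=0)                                     # Eq. (12), shape (d_k,)
    return z, weights

rng = np.random.default_rng(0)
H = rng.normal(size=(10, 256))   # LSTM outputs for a window of L = 10 frames
z, attn = self_attention_pool(H)
print(z.shape, attn.shape)  # (256,) (10, 10)
```

Each row of `attn` shows how strongly one frame draws on the others, which is where frames degraded by smoke or exposure changes can receive lower weight.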

4.4. Multimodal Fusion Module Based on Residual Compensation

At the output stage, we adopt a dual-stream residual prediction strategy for temperature detection. Specifically, the visual-sequential features

z

and the thermocouple sequence-level encoding features

g

are concatenated, and the temperature residual term is estimated through a fully connected network:

r = F_{res} ([z; g]) \in R^{M}.

(13)

The model then outputs the temperature detection values for the M measurement points on the roller kiln as

\hat{p} = y_{base} + r \in R^{M},

(14)
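The residual compensation in Equations (7), (13), and (14) can be sketched in a few lines. In this NumPy illustration, F_base and F_res are reduced to single linear layers with random weights, which is an assumption made for brevity; the actual networks are multi-layer.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N_TC = 6, 15

T = rng.normal(size=N_TC)   # standardized thermocouple vector
z = rng.normal(size=256)    # pooled visual-temporal feature, Eq. (12)
g = rng.normal(size=128)    # sequence-level thermocouple encoding

# F_base and F_res reduced to single linear layers (illustrative assumption).
W_base, b_base = rng.normal(scale=0.1, size=(M, N_TC)), np.zeros(M)
W_res, b_res = rng.normal(scale=0.01, size=(M, 256 + 128)), np.zeros(M)

y_base = W_base @ T + b_base                  # Eq. (7): thermocouple-only baseline
r = W_res @ np.concatenate([z, g]) + b_res    # Eq. (13): learned residual
p_hat = y_base + r                            # Eq. (14): final detection
print(p_hat.shape)  # (6,)
```

Because the visual branch only contributes the correction r, an uninformative image sequence pushes the output back toward the thermocouple baseline rather than toward an arbitrary value.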

5. Experiments

To assess the performance of the proposed temperature detection method for ceramic roller kilns, we conducted a series of experiments. The experimental procedure consists of two main stages: data collection, and model training and validation. The data collection experiments were carried out on a single-section experimental platform corresponding to the sintering zone of a roller kiln, as shown in Figure 4. This platform, built from actual production equipment, includes four subsystems (a gas supply system, a heat exchange system, a combustion system, and a numerical control system) and closely resembles a real production environment. The model training and validation experiments were implemented in Python using the PyTorch deep learning framework. The development environment consisted of Python 3.10 and PyTorch 2.4.0 with CUDA 12.2 support. The hardware configuration included an Intel i9-11900K processor, an NVIDIA RTX 3090 GPU, and 64 GB of system memory.

5.1. Data Preparation

5.1.1. Data Collection

The data used in this study were collected from the sintering zone of the roller kiln using the single-section experimental platform shown in Figure 4. The kiln is equipped with an image monitoring system (a color industrial camera and a cooling system) and thermocouple sensors uniformly distributed along the kiln wall. Fifteen thermocouples are installed along the longitudinal direction of the kiln wall, with a spacing of approximately 150 cm between adjacent thermocouples, covering the primary combustion zone. The color industrial camera, operating at 30 fps, acquired combustion video inside the roller kiln, while the thermocouples provided point-wise temperature measurements. To obtain the approximate temperature of the ceramic body on the roller that corresponds to each combustion-state image, PTCR (Process Temperature Control Ring) measurement rings were arranged on the rollers. Specifically, in each data collection experiment, the rings were placed on the rollers and, after a period of combustion heating, the industrial camera recorded video of the combustion conditions inside the kiln for subsequent frame extraction and processing. The thermocouple data were recorded first, after which firing was stopped and the kiln was allowed to cool. Once a ring had cooled, its outer diameter was measured with a micrometer, and the corresponding temperature was read from the outer-diameter–temperature conversion table. This value reflects the temperature near the ceramic body on the roller, as indicated by the combustion-state image inside the kiln. The temperature values obtained from the PTCR rings are used as ground-truth labels for both the image data and the thermocouple measurements.
Finally, from multiple experimental runs conducted at different kiln temperature settings, we selected 13 sets of image and temperature data, totaling over 30,000 images of the kiln’s internal combustion state and covering a temperature range of 600 °C to 1200 °C, which is the temperature range of the sintering zone. Across the experimental runs, both the kiln load (a natural gas flow rate of 30 m³/h and an oxygen content of 27% in the combustion air) and the ceramic product type remained constant. Samples of the images of the kiln’s internal combustion state at different temperatures are shown in Figure 5.

5.1.2. Data Processing

Thermocouple and label data standardization: The temperature data $T = \{T_{1}, T_{2}, \dots, T_{n}\}$ collected by the wall thermocouples and the label data $P = \{P_{1}, P_{2}, \dots, P_{m}\}$ have different physical distributions. To eliminate the effect of differing temperature scales during model training, we construct separate normalizers. For each sample in the training set, the standardization is computed as follows:

$T_{n}^{'} = \frac{T_{n} - μ^{T}}{σ^{T} + ε},$

(15)

$P_{m}^{'} = \frac{P_{m} - μ^{P}}{σ^{P} + ε},$

(16)

where $μ^{T}$ and $μ^{P}$ represent the mean values of the thermocouple data and the labels in the training set, respectively; $σ^{T}$ and $σ^{P}$ represent the standard deviations of the thermocouple data and the labels in the training set, respectively; and $ε$ is a small constant introduced to prevent division by zero.
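A minimal sketch of the two separate normalizers in Equations (15) and (16) is given below, including the inverse transform needed to return predictions to the Celsius scale. The class name `Standardizer`, the per-channel statistics, the value of ε, and the synthetic temperatures are assumptions for illustration only.

```python
import numpy as np

class Standardizer:
    """Z-score normalizer fitted on training data only."""

    def __init__(self, eps=1e-8):
        self.eps = eps

    def fit(self, x):
        # Per-channel statistics over the training samples (assumption).
        self.mean = x.mean(axis=0)
        self.std = x.std(axis=0)
        return self

    def transform(self, x):
        return (x - self.mean) / (self.std + self.eps)

    def inverse_transform(self, x_std):
        return x_std * (self.std + self.eps) + self.mean

rng = np.random.default_rng(0)
# Synthetic stand-ins for thermocouple readings and PTCR labels, in degrees Celsius.
T_train = 900.0 + 50.0 * rng.normal(size=(200, 15))
P_train = 950.0 + 40.0 * rng.normal(size=(200, 6))

t_norm = Standardizer().fit(T_train)   # Eq. (15)
p_norm = Standardizer().fit(P_train)   # Eq. (16)
T_std = t_norm.transform(T_train)
P_std = p_norm.transform(P_train)
P_back = p_norm.inverse_transform(P_std)  # back to the Celsius scale
```

Validation data would be passed through `transform` with the statistics fitted on the training set, so that no information from the validation runs enters the normalizers.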
Image normalized chromaticity transformation: The combustion environment inside a roller kiln is complex. Raw RGB images captured by industrial cameras contain not only temperature-related chromatic features but are also sensitive to illumination disturbances. Flame flickering causes significant brightness fluctuations inside the kiln, while the camera’s automatic gain and exposure control continuously adjust gray levels according to ambient lighting conditions. Consequently, the same temperature may correspond to different pixel intensities over time. These brightness variations caused by non-thermal factors appear as visual noise. According to Planck’s blackbody radiation law, the radiative emission of an object is directly related to its temperature and, in principle, independent of illumination intensity. If raw images are directly fed into the model, it may learn spurious features associated with brightness variations rather than intrinsic temperature information. To address this, we first scale the images to $224 \times 224$ and then apply a normalized chromaticity transformation. The chromaticity components after the transformation are computed as follows:

$\begin{matrix} r (x, y) & = \frac{R (x, y)}{R (x, y) + G (x, y) + B (x, y) + ϵ}, \\ g (x, y) & = \frac{G (x, y)}{R (x, y) + G (x, y) + B (x, y) + ϵ}, \\ b (x, y) & = \frac{B (x, y)}{R (x, y) + G (x, y) + B (x, y) + ϵ}, \end{matrix}$

(17)

where $ϵ$ is a small constant introduced to prevent the denominator from becoming zero. After the transformation, the chromaticity components of each pixel satisfy $r + g + b = 1$ (up to the negligible $ϵ$ term), so the pixel’s color attributes can be fully characterized using only the r and g components. These attributes are invariant to scaling of illumination intensity. As illustrated in Figure 6, the chromaticity-transformed images effectively suppress brightness variations caused by flame flickering, while highlighting color texture features that are strongly correlated with the spatial distribution of the temperature field.
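The invariance to brightness scaling claimed for Equation (17) can be checked numerically. The NumPy sketch below applies the transformation to a random frame and to a copy dimmed by half; the function name `normalized_chromaticity`, the value of ϵ, and the random test image are assumptions.

```python
import numpy as np

def normalized_chromaticity(img, eps=1e-6):
    """Map an (H, W, 3) RGB image to per-pixel chromaticity ratios, Eq. (17)."""
    img = np.asarray(img, dtype=np.float64)
    total = img.sum(axis=-1, keepdims=True) + eps   # R + G + B + eps
    return img / total

rng = np.random.default_rng(0)
# Random stand-in for a resized 224 x 224 kiln frame (values kept above zero).
frame = rng.integers(1, 256, size=(224, 224, 3)).astype(np.float64)

chroma = normalized_chromaticity(frame)
dimmed = normalized_chromaticity(0.5 * frame)   # simulate a global exposure drop

print(np.allclose(chroma, dimmed, atol=1e-6))   # True
```

Halving every channel leaves the ratios unchanged apart from the small ϵ term, which is why a global gain or exposure change has almost no effect on the transformed image.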
Sliding window sequence sample generation: Because thermal inertia and flame flickering make ceramic firing a dynamic process, we construct a sliding-window temporal sampling mechanism with a window length of L and a sliding step of S. This produces an input image sequence $X^{in} \in R^{L \times 3 \times H \times W}$.
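The sliding-window sampling can be expressed as a list of frame index ranges. The helper below is a hypothetical illustration in plain Python; it returns half-open (start, end) pairs, so a run of n frames yields floor((n - L) / S) + 1 windows when n is at least L.

```python
def sliding_window_indices(num_frames, length, stride):
    """Return (start, end) frame index pairs with end exclusive."""
    if num_frames < length:
        return []
    return [(s, s + length) for s in range(0, num_frames - length + 1, stride)]

windows = sliding_window_indices(num_frames=12, length=10, stride=1)
print(windows)  # [(0, 10), (1, 11), (2, 12)]
```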

Figure 6. Comparison between original combustion images and chromaticity-transformed images. The top row shows representative raw RGB frames captured inside the kiln, which are affected by illumination fluctuations and flame flickering. The bottom row presents the corresponding normalized chromaticity images, where pixel values are computed as ratios of RGB components. This transformation suppresses brightness variations caused by exposure and gain changes while preserving color-related features that are more strongly correlated with temperature distribution.

5.1.3. Train–Validation Split

To ensure rigorous evaluation and avoid data leakage, we adopted a run-level data partitioning strategy. Specifically, the 13 experimental runs were divided such that entire runs were assigned either to the training set or to the validation set. In this way, we prevented overlapping combustion sequences from the same firing campaign from appearing in both subsets. Within each training run, the first 80% of each sequence was used for training and the remaining 20% for validation. It is worth emphasizing that the proposed data partitioning strategy establishes a gap larger than the sliding window between the training and validation sets to prevent data leakage during the generation of samples by the temporal sliding window mechanism.
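The role of the temporal gap can be illustrated with a small helper. This sketch covers only the chronological 80/20 split with a guard gap mentioned above, not the assignment of whole runs; the function name `chronological_split`, the run length of 1000 frames, and the gap of 11 frames are assumptions.

```python
def chronological_split(num_frames, train_frac, window, gap):
    """Split frame indices into train/validation ranges separated by a guard gap."""
    if gap <= window:
        raise ValueError("gap should exceed the sliding-window length")
    cut = int(num_frames * train_frac)
    train = range(0, cut)                 # earliest frames
    val = range(cut + gap, num_frames)    # frames after the guard gap
    return train, val

train, val = chronological_split(num_frames=1000, train_frac=0.8, window=10, gap=11)
print(train[-1], val[0])  # 799 811
```

Sliding windows are then generated separately inside each range, so no validation window shares frames with, or lies within one window length of, a training window.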

It should be noted that the dataset was divided into training and validation subsets without including a completely independent external test set collected under separate operational campaigns. This decision was primarily constrained by the limited number of experimental runs available on the industrial single-section platform, as each data acquisition process involves high operational costs and strict safety procedures. To mitigate potential overfitting and ensure fair evaluation, we adopted run-level splitting and introduced a temporal gap larger than the sliding window length to prevent data leakage between subsets. Furthermore, cross-temperature generalization experiments (Section 5.4.5) were conducted to further assess the model’s robustness under distribution shifts. Nevertheless, the absence of an independent test set remains a limitation of the current study. Future work will focus on collecting additional long-term industrial data from different production batches and operating conditions to enable evaluation on a fully independent test set.

5.2. Loss Function

During the model training phase, we design an adaptive weighted loss function. This loss function does not compute the loss in the standardized feature space but operates directly in the Celsius measurement scale to ensure physical consistency in temperature detection and enhance the model’s robustness to thermal fluctuations across rollers. Let

{\hat{P}}_{b, m}^{(e)}

and

P_{b, m}

denote the predicted and ground-truth temperatures of the b-th sample at the m-th monitoring point during the e-th training epoch. These temperature values are obtained by inverse-transforming the network outputs using the pre-computed normalization statistics in Equation (16). The total loss function is defined as the weighted average of the smoothed L1 errors over the batch and spatial dimensions:

L^{(e)} = \frac{1}{B \times M} \sum_{b = 1}^{B} \sum_{m = 1}^{M} ω_{m}^{(e)} \cdot L_{smooth} ({\hat{P}}_{b, m}^{(e)} - P_{b, m}),

(18)

where B is the batch size,

M = 6

is the number of roller measurement points, and

ω_{m}^{(e)}

represents the adaptive weight for the m-th roller at epoch e.

L_{smooth} (\cdot)

denotes the Smooth L1 loss with a threshold

β = 1.0

, which combines the differentiability of MSE near zero with the robustness of MAE to outliers:

L_{smooth} (x) = \begin{cases} 0.5 x^{2}, & \text{if } | x | < β \\ | x | - 0.5 β, & \text{otherwise} \end{cases}.

(19)
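Equations (18) and (19) can be sketched in plain Python as follows. The piecewise function is written as it appears in Equation (19); for the threshold β = 1.0 used here it matches the usual Smooth L1 definition. The function names and the two-point example values are illustrative assumptions (the paper uses M = 6).

```python
def smooth_l1(x, beta=1.0):
    """Smooth L1 as written in Eq. (19): quadratic near zero, linear elsewhere."""
    ax = abs(x)
    return 0.5 * x * x if ax < beta else ax - 0.5 * beta

def weighted_loss(pred, target, weights, beta=1.0):
    """Eq. (18): weighted Smooth L1 averaged over batch and measurement points."""
    B, M = len(pred), len(weights)
    total = 0.0
    for b in range(B):
        for m in range(M):
            total += weights[m] * smooth_l1(pred[b][m] - target[b][m], beta)
    return total / (B * M)

# Illustrative Celsius-scale values for one sample and two measurement points.
pred = [[900.5, 955.0]]
target = [[900.0, 952.0]]
loss = weighted_loss(pred, target, weights=[1.0, 1.0])
print(loss)  # 1.3125
```

The first point falls in the quadratic region (0.5 × 0.5² = 0.125) and the second in the linear region (3 − 0.5 = 2.5), giving (0.125 + 2.5) / 2 = 1.3125.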

The complex turbulent flow inside the roller kiln and variations in temperature stability across roller positions pose challenges for accurate temperature detection. To mitigate the risk that the model overfits stable regions while neglecting regions with large fluctuations, we introduce a dynamic weighting mechanism. At the end of each epoch (i.e., the e-th epoch), we calculate the MAE of each roller m on the validation set, denoted as

E_{m}^{(e)}

. By normalizing the local error by the global average error, we obtain the target weight

τ_{m}^{(e)}

:

τ_{m}^{(e)} = \frac{E_{m}^{(e)}}{\frac{1}{M} \sum_{k = 1}^{M} E_{k}^{(e)}} .

(20)

To ensure training stability and avoid abrupt oscillations in the loss landscape, the weights are updated using an Exponential Moving Average (EMA) strategy:

ω_{m}^{(e + 1)} = Clip (λ ω_{m}^{(e)} + (1 - λ) τ_{m}^{(e)}, ξ_{min}, ξ_{max}),

(21)

where

λ

is the momentum factor controlling the update speed. The

Clip (\cdot)

function constrains the weights within

[ξ_{min}, ξ_{max}]

to prevent gradient explosion or vanishing. The weights are re-normalized to sum to M. This mechanism encourages the network to dynamically focus on roller positions that are currently difficult to predict, thereby achieving spatially balanced detection performance.
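The weight update in Equations (20) and (21), followed by the re-normalization to sum M, can be sketched as below. The values λ = 0.9, ξ_min = 0.5, and ξ_max = 2.0 are hypothetical placeholders, as are the function name `update_weights` and the per-roller errors in the example.

```python
def update_weights(weights, val_mae, lam=0.9, w_min=0.5, w_max=2.0):
    """One epoch-end update: target weights, EMA, clipping, re-normalization."""
    M = len(weights)
    mean_err = sum(val_mae) / M
    targets = [e / mean_err for e in val_mae]                    # Eq. (20)
    clipped = [min(max(lam * w + (1.0 - lam) * t, w_min), w_max)  # Eq. (21)
               for w, t in zip(weights, targets)]
    scale = M / sum(clipped)                                      # weights sum to M
    return [c * scale for c in clipped]

# Illustrative per-roller validation MAE values in degrees Celsius.
old = [1.0] * 6
new = update_weights(old, val_mae=[1.0, 2.0, 1.0, 1.0, 0.5, 0.5])
print(max(new) == new[1])  # True
```

The roller with the largest validation error receives the largest weight in the next epoch, while the momentum term keeps the change gradual. Note that the final rescaling can move a weight slightly outside the clipping interval; whether the original implementation re-clips afterwards is not stated.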

5.3. Baselines and Evaluation Metrics

5.3.1. Baselines

To comprehensively evaluate the effectiveness of the proposed multimodal temperature detection framework, a set of representative baseline methods was designed following commonly adopted technical routes in industrial temperature soft sensing and combustion state analysis. Specifically, we considered four baseline methods:

Baseline 1 (Thermocouple-only MLP): A multilayer perceptron takes the thermocouple vector T as input and directly predicts the temperatures at the six target roller positions. This baseline corresponds to conventional data-driven soft sensors relying solely on structured sensor readings.
Baseline 2 (Image-only CNN): Each frame in the input sequence is processed by a ResNet-18 backbone, and the frame-wise features are aggregated by average pooling to predict the six-point temperature. This baseline represents vision-only noncontact measurement methods, in which combustion images are processed frame by frame by a convolutional neural network without temporal modeling.
Baseline 3 (Image-only CNN–LSTM): A ResNet-18 extracts frame-level features that are fed into an LSTM to capture temporal dependencies. The final hidden representation is mapped to six temperature outputs. This baseline evaluates whether temporal modeling alone is sufficient for temperature regression without thermocouple fusion.
Baseline 4 (Early-fusion): A spatiotemporal fusion model that concatenates image sequence features extracted by ResNet-18 with thermocouple features encoded by an MLP, followed by LSTM-based temporal modeling and regression. This model corresponds to an early fusion strategy without physics-aware enhancement or residual compensation.

5.3.2. Evaluation Metrics

All models are evaluated on the validation set using temperature errors after inverse normalization to the Celsius scale. The primary metric is the Mean Absolute Error (MAE):

MAE = \frac{1}{B M} \sum_{b = 1}^{B} \sum_{m = 1}^{M} |P_{b, m} - {\hat{P}}_{b, m}|,

(22)

where B is the number of validation samples, and M is the number of target roller positions. In addition, Root Mean Squared Error (RMSE) reflects the sensitivity to large deviations:

RMSE = \sqrt{\frac{1}{B M} \sum_{b = 1}^{B} \sum_{m = 1}^{M} {(P_{b, m} - {\hat{P}}_{b, m})}^{2}},

(23)

For practical kiln monitoring, we further report the MAE of each temperature measurement point to quantify spatially non-uniform detection performance:

{MAE}_{m} = \frac{1}{B} \sum_{b = 1}^{B} |P_{b, m} - {\hat{P}}_{b, m}|, m = 1, \dots, 6 .

(24)
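The three metrics in Equations (22) to (24) can be computed directly from nested lists of predictions and labels. The plain-Python sketch below uses illustrative values for two samples and two measurement points; the function names are hypothetical.

```python
import math

def mae(pred, target):
    """Eq. (22): mean absolute error over samples and measurement points."""
    errs = [abs(p - t) for pr, tr in zip(pred, target) for p, t in zip(pr, tr)]
    return sum(errs) / len(errs)

def rmse(pred, target):
    """Eq. (23): root mean squared error over samples and measurement points."""
    sq = [(p - t) ** 2 for pr, tr in zip(pred, target) for p, t in zip(pr, tr)]
    return math.sqrt(sum(sq) / len(sq))

def per_point_mae(pred, target):
    """Eq. (24): mean absolute error for each measurement point separately."""
    n = len(pred)
    return [sum(abs(pr[m] - tr[m]) for pr, tr in zip(pred, target)) / n
            for m in range(len(pred[0]))]

# Illustrative Celsius-scale values after inverse normalization.
pred = [[901.0, 948.0], [899.0, 953.0]]
target = [[900.0, 950.0], [900.0, 950.0]]

overall_mae = mae(pred, target)
overall_rmse = rmse(pred, target)
point_mae = per_point_mae(pred, target)
print(overall_mae, point_mae)  # 1.75 [1.0, 2.5]
```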

5.3.3. Implementation Details

From the collected dataset (13 experimental runs, more than 30,000 frames), the samples were split into training and validation sets at the run level (i.e., entire experimental runs are assigned to either the training or validation set) to avoid leakage between training and validation sequences. With a sequence length

L = 10

and stride

S = 1

, each training sample contains a clip

X^{in} \in R^{10 \times 3 \times H \times W}

aligned with its thermocouple vector

T \in R^{15}

and six-point temperature labels

P \in R^{6}

.

5.4. Results

5.4.1. Performance Comparison

To quantitatively evaluate the proposed method, we compared the performance of MST-FusionNet with four representative baseline methods on the validation set. As shown in Table 2, the proposed MST-FusionNet achieves the best performance among the compared methods with an MAE of 0.9164 °C and an RMSE of 1.2422 °C, outperforming all single-modal and naive multimodal baselines.

This performance difference reveals the limitations of relying on a single modality. Baseline 1, which uses only thermocouples, provides a reasonable but rough estimate, with an MAE of 8.6563 °C. This indicates that although the wall-mounted thermocouples capture the overall thermal trend, they suffer from spatial lag and cannot accurately reflect the temperature of the ceramic body on the roller. The vision-only methods also perform poorly: the MAE values of Baseline 2 and Baseline 3 are 17.2476 °C and 6.1107 °C, respectively. This result indicates the inherent difficulty of regressing absolute temperature from combustion images. Noise caused by smoke, dust, and nonlinear lighting changes inside the kiln makes it difficult for deep learning models to function without a stable temperature reference. Baseline 4, which directly concatenates image and thermocouple features, achieves an MAE of 3.1314 °C. This shows that fusing unstable flame images with low-dimensional thermocouple data improves detection accuracy, but there is still room for improvement. Compared with Baseline 4, the proposed MST-FusionNet reduces the MAE by more than 70% (from 3.1314 °C to 0.9164 °C). This gap provides evidence for the effectiveness of the residual compensation strategy, which uses thermocouple data to establish a stable reference value and uses visual features only to learn local residual deviations. The MHA mechanism further down-weights unreliable frames caused by environmental interference, so that the final detection relies mainly on high-quality visual information.

While the proposed MST-FusionNet achieves an MAE below 1 °C, it is important to acknowledge the inherent measurement uncertainty of the PTCR temperature measurement rings used as the ground-truth reference. In practical industrial settings, PTCR rings are subject to specific measurement accuracy limits (typically ±3 °C) and resolution constraints dependent on the micrometer calibration (e.g., 0.01 mm). Furthermore, potential sources of bias may arise from localized thermal gradients during the firing process and slight deviations in the physical placement of the rings on the rollers. Consequently, the reported MAE of 0.9164 °C should be interpreted as the model’s high fidelity in mapping multimodal inputs to the PTCR reference measurements, rather than an absolute deviation from the true thermodynamic temperature. This sub-1 °C error demonstrates the network’s capability to capture complex spatiotemporal features, mitigate the spatial lag of wall-mounted sensors, and operate robustly within the inherent uncertainty bounds of the physical calibration instruments.

5.4.2. Per-Roller Error Analysis

The accuracy and stability of temperature measurement in different spatial positions within the kiln are equally important for industrial applications. During the sintering process, thermodynamic conditions such as airflow turbulence and flame intensity variations may undergo significant changes along the roller track. We analyzed the MAE at six different roller positions to evaluate the spatial generalization ability of the proposed method, as summarized in Table 3.

The results show that the proposed method maintains high detection accuracy at all monitoring points, with the MAE ranging from a minimum of 0.6575 °C at point P1 to a maximum of 1.4389 °C at point P2. This differs substantially from Baseline 1, which exhibits a consistent systematic error ranging from 7.6717 °C to 10.0083 °C across all points. The comparison demonstrates that MST-FusionNet learns to compensate for the inherent spatial lag of wall-mounted sensors and can accurately infer the temperature near multiple positions of the ceramic components. Among the vision-only models, the MAE of Baseline 2 reaches 19.9082 °C at point P2, and that of Baseline 3 reaches 6.8743 °C at point P1. In contrast, the proposed method demonstrates strong spatial stability. Although our proposed method exhibits a slight increase in error at points P1 and P2, this may be attributed to more complex turbulent flow and heat exchange dynamics in the central sintering zone. However, these deviations remain within acceptable industrial sintering tolerance ranges. In summary, MST-FusionNet exhibits uniform and reliable detection capabilities over the entire monitoring area and meets the requirements of multi-point online temperature monitoring.

5.4.3. Visualization Analysis

To provide a more intuitive assessment of MST-FusionNet’s detection reliability, we visualized the correlation between the predicted values and the ground truth, as well as the temperature tracking performance across the validation samples. As shown in Figure 7, the single-modal methods show noticeable limitations. Using only thermocouples (Baseline 1) results in a parallel shift from the diagonal line, which indicates a systematic deviation caused by spatial lag. The results of Baseline 2 and Baseline 3 show a highly dispersed distribution and significant variance. This indicates that these models are limited in capturing complex nonlinear mappings of flame features without a stable reference. The proposed MST-FusionNet shows the strongest linear correlation with the ground truth. Within the temperature range of 600–1200 °C, the detection points are all concentrated on the

y = x

diagonal, with no significant outliers. This distribution confirms that the multimodal fusion strategy effectively reduces the deviation associated with hard sensors and suppresses visual noise.

The model’s ability to accurately capture the temperature fluctuations between consecutive samples is important for industrial monitoring. Figure 8 shows the temperature tracking curve at roller position P2. Figure 8a indicates that Baseline 3 has difficulty adapting to specific temperature trends, exhibiting pronounced oscillations and noticeable deviations from the ground-truth values. As shown in Figure 8b, the proposed MST-FusionNet closely matches the measurement curve. When the validation samples exhibit step-like temperature changes, the proposed method tracks these variations with minimal lag and without overreaction. The error comparison in Figure 8c shows that the proposed method maintains a stable error close to zero, whereas the reference method exhibits large fluctuations. This result indicates that MST-FusionNet achieves higher stability and accuracy when handling temperature tracking under varying operating conditions.

5.4.4. Ablation Study

To comprehensively evaluate the contribution of each component in the proposed MST-FusionNet, we sequentially removed each module from the full model while keeping the remaining components unchanged. The quantitative results are summarized in Table 4. As shown in Table 4, removing the residual compensation mechanism results in the most significant performance degradation, with the MAE increasing from 0.9164 °C to 2.3283 °C. This confirms that learning temperature deviations relative to a thermocouple baseline is more effective than directly regressing absolute temperature values. Eliminating the pseudo-heatmap module increases the MAE to 1.9689 °C, demonstrating that spatial alignment between thermocouple readings and visual features plays a critical role in enhancing spatial sensitivity. Removing the multi-head attention mechanism increases the MAE to 1.4584 °C, indicating that adaptive temporal weighting enhances robustness under fluctuating combustion conditions. These results verify that all three modules contribute positively to the final performance, with the residual compensation mechanism providing the largest individual contribution.

To further clarify the incremental contribution of each component, we conducted a progressive addition study starting from the naive early-fusion baseline (Baseline 4). Modules were added individually to quantify their independent performance gains. The results are summarized in Table 5. As shown in Table 5, introducing the residual compensation mechanism alone reduces MAE from 3.1314 °C to 1.4571 °C, which represents the most substantial improvement among the individual components. This demonstrates that modeling residual temperature deviations relative to thermocouple baselines fundamentally stabilizes the regression task. Adding the pseudo-heatmap module independently reduces the MAE to 2.4585 °C, indicating that physics-inspired spatial alignment enhances feature integration between heterogeneous modalities. Introducing the MHA module alone reduces MAE to 2.6058 °C, confirming that temporal attention improves sequence robustness by suppressing unreliable frames. Finally, combining all three components yields the best performance (MAE = 0.9164 °C), suggesting that the modules provide complementary benefits rather than redundant effects.
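The relative improvements implied by the progressive addition study can be recomputed from the MAE values quoted above. The short Python sketch below uses only those reported numbers; the variable and function names are hypothetical.

```python
baseline4_mae = 3.1314  # naive early-fusion baseline, degrees Celsius

# MAE values reported in the text for each configuration.
reported_mae = {
    "residual compensation only": 1.4571,
    "pseudo-heatmap only": 2.4585,
    "MHA only": 2.6058,
    "full MST-FusionNet": 0.9164,
}

def relative_reduction(before, after):
    """Percentage reduction in MAE relative to the starting value."""
    return 100.0 * (before - after) / before

reductions = {name: relative_reduction(baseline4_mae, value)
              for name, value in reported_mae.items()}

for name, pct in reductions.items():
    print(f"{name}: {pct:.1f}% lower MAE than Baseline 4")
```

For the full model this gives a reduction of about 70.7%, in line with the improvement of more than 70% over Baseline 4 reported in the performance comparison.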

Beyond the quantitative improvements observed in the ablation study, the results provide insight into the structural effectiveness of the proposed design. The substantial contribution of the residual compensation mechanism indicates that anchoring predictions to thermocouple-based baseline estimation stabilizes regression under combustion fluctuations. The improvement introduced by the pseudo-heatmap module further suggests that diffusion-inspired spatial alignment reduces representational inconsistency between discrete sensor readings and visual features. Meanwhile, the attention mechanism enhances robustness by emphasizing temporally informative combustion states. These findings collectively demonstrate that incorporating physically motivated spatial priors and structured residual learning improves both stability and interpretability in multimodal industrial soft sensing. The progressive addition results clearly reveal the relative contribution of each component, addressing the interpretability of architectural design choices.

5.4.5. Cross-Temperature Generalization

To evaluate the generalization ability of the proposed model under unseen temperature conditions, we conducted cross-temperature generalization experiments. Unlike the previous experiments, in which the training and validation sets followed the same distribution, here the data at 800 °C, 900 °C, and 1000 °C were held out as the validation set to verify the model’s detection ability under cross-temperature distribution shifts, while data at the remaining temperatures were used for training. These temperature levels were selected because they lie within the core operating range of the sintering zone and represent typical industrial firing conditions. Holding out intermediate temperature segments allows us to assess whether the model maintains stable performance when combustion morphology and radiation characteristics vary within the main production regime, rather than under identical temperature distributions. The experimental results are shown in Table 6.

From the overall results under the cross-temperature test conditions, the detection errors of all methods are noticeably higher than those observed under the same temperature distribution. This result indicates that in the nonlinear temperature field of the ceramic roller kiln, it is more challenging for the model to achieve cross-temperature generalization than to perform regression under the same temperature distribution. Under this setting, the model not only needs to learn the mapping between temperature and multimodal features, but also to adapt to systematic changes in combustion morphology, radiation characteristics, and imaging conditions in different temperature segments. Therefore, some degree of performance degradation is unavoidable. Baseline 1, which only uses thermocouple data, shows a certain stability under cross-temperature conditions, with an MAE of 12.15 °C, but the error level is still relatively high. This indicates that point-based thermocouple measurements have inherent limitations in characterizing the actual heating state of the ceramics. Baseline 2 and Baseline 3, which rely solely on image data, show a more significant performance decline under cross-temperature conditions. Among them, Baseline 3, which incorporates an LSTM, exhibits a further increase in MAE to 30.89 °C. This suggests that in the absence of temperature benchmark constraints, the transferability of visual features across temperature domains is limited, and their reliability decreases. Temporal modeling may instead amplify the error caused by the shift in operating conditions. Among the multimodal methods, Baseline 4 makes some progress compared to the single-modal model, but its MAE still reaches 13.90 °C. The results show that simple multimodal fusion and temporal modeling still struggle to cope with the distribution shift caused by changes in combustion state and radiation characteristics in the cross-temperature scenario.
In contrast, the proposed method achieves an MAE of 4.7253 °C in the cross-temperature generalization experiments, the best result among the compared methods. Although its performance is lower than under the same-distribution setting, the model retains some spatial perception capability through the pseudo-heatmap and reduces the MAE by about 7.4 °C compared with Baseline 1. The proposed multimodal temporal fusion framework is thus more robust to temperature variations than the baselines and partially alleviates the feature distribution shift that arises under cross-temperature conditions.
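The size of this improvement can be checked directly from the MAE values in Table 6. The short sketch below only redoes that arithmetic; the helper name `mae_reduction` is introduced here for illustration.

```python
# MAE values (°C) copied from Table 6 (cross-temperature setting).
CROSS_TEMP_MAE = {
    "Baseline 1": 12.1494,
    "Baseline 2": 18.8407,
    "Baseline 3": 30.8856,
    "Baseline 4": 13.9010,
    "Proposed": 4.7253,
}


def mae_reduction(reference: str, candidate: str = "Proposed"):
    """Return the (absolute in °C, relative in %) MAE reduction."""
    ref = CROSS_TEMP_MAE[reference]
    cand = CROSS_TEMP_MAE[candidate]
    absolute = ref - cand
    return absolute, 100.0 * absolute / ref


abs_b1, rel_b1 = mae_reduction("Baseline 1")
abs_b4, rel_b4 = mae_reduction("Baseline 4")
```

With these values, the reduction is about 7.42 °C (roughly 61%) relative to Baseline 1 and about 9.18 °C (roughly 66%) relative to Baseline 4.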

The degradation observed in the cross-temperature validation experiments can be interpreted from a thermodynamic and visual representation perspective. Different temperature regimes correspond to distinct combustion states, flame morphology distributions, and radiative intensity characteristics. Excluding intermediate temperature ranges from training introduces a distribution shift not only in pixel-level features but also in the underlying thermal field structure. Despite this shift, the model maintains stable performance relative to baseline methods, indicating that the proposed architecture captures invariant spatial–temporal relationships between visual cues and thermocouple signals. This suggests that the physics-inspired alignment and residual compensation mechanisms provide structural regularization that partially mitigates distribution sensitivity, even when explicit temperature regimes are unseen during training.

It should be noted that although cross-temperature validation partially evaluates robustness under distribution shifts, all experiments were conducted on the same physical kiln platform. Therefore, the current results primarily demonstrate intra-platform generalization rather than cross-factory transferability. Validating the framework across multiple industrial sites with different kiln scales, burner configurations, and production recipes remains an important direction for future work.

5.4.6. Comparison with Existing Literature

To place the proposed MST-FusionNet in context, we compared our experimental results with several recent temperature detection models reported in the literature for similar high-temperature industrial applications. A summary of this comparison is presented in Table 7. Existing methods applied in similar industrial environments typically report detection errors ranging from 5.4 °C to 8.0 °C. For instance, the multi-information fusion network in [11] and the data-driven ensemble framework in [10] reported errors of 5.7 °C and 7.2 °C, respectively, and the dual-channel fusion model (DCFANet) [25] reported a minimum error of 5.4 °C in blast furnace temperature prediction. In comparison, the proposed MST-FusionNet achieves an MAE of 0.9164 °C for temperature detection in roller kilns under our specific experimental conditions. Direct numerical comparisons should be interpreted carefully because the datasets and kiln types differ, but these literature values provide a meaningful reference for the performance expected in similar high-temperature soft-sensing tasks.

It is also important to consider the inherent physical tolerance of the ground-truth measurements. In this study, the standard temperature measuring rings used for calibration have a manufacturing and measurement tolerance of approximately ±3.0 °C. The MAE of MST-FusionNet (0.9164 °C) falls well within this tolerance range, which indicates that the model fits the provided calibration data and tracks it stably, supporting its viability for industrial temperature monitoring. Finally, the proposed framework achieves an MAE of 4.7253 °C under the cross-temperature conditions, an accuracy comparable to that of existing methods applied in similar industrial environments.

5.4.7. Industrial Deployment and Real-Time Feasibility

To assess the practical applicability of the proposed framework in industrial environments, we conducted a detailed analysis of model complexity and inference performance. The model contains 12.91 million trainable parameters, corresponding to an approximate memory footprint of 49.33 MB. This model size falls well within the memory constraints of modern industrial GPUs and edge computing platforms.
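As a rough sanity check on the reported footprint, the parameter count can be converted to memory under the assumption that each parameter is stored as a 32-bit float (4 bytes). The storage format is not stated in the text, so this is an assumption of the sketch.

```python
def parameter_memory_mib(n_params: float, bytes_per_param: int = 4) -> float:
    """Memory needed to store the parameters, in MiB (1 MiB = 1024**2 bytes)."""
    return n_params * bytes_per_param / (1024 ** 2)


# Parameter count as reported in the text (rounded to 0.01 million).
N_PARAMS = 12.91e6

# Assumes float32 storage; half-precision weights would halve this figure.
footprint_mib = parameter_memory_mib(N_PARAMS)
```

This gives about 49.25 MiB, close to the reported 49.33 MB. The small gap may come from rounding of the parameter count or from non-trainable buffers, but that explanation is a guess rather than something the text confirms.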

Inference latency was evaluated using synthetic inputs that match the real deployment configuration (sequence length $L = 10$, image resolution $224 \times 224$, and 15 thermocouple channels). After a warm-up phase of 20 iterations, latency was recorded over 100 forward passes. In practical industrial roller temperature monitoring systems, data acquisition intervals typically range from 1 to 5 s. Therefore, the measured inference time (approximately $0.13$ s) is significantly shorter than the operational sampling interval, providing substantial computational margin for real-time deployment. The proposed model can thus be integrated into existing monitoring pipelines without introducing latency bottlenecks.
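The measurement protocol (20 warm-up iterations followed by 100 timed forward passes) can be sketched with a generic timing harness. The harness below is illustrative only: `measure_latency` and `dummy_forward` are hypothetical names, and the dummy workload merely stands in for a forward pass of the real model.

```python
import statistics
import time
from typing import Callable, Dict


def measure_latency(
    forward: Callable[[], object],
    warmup: int = 20,
    iters: int = 100,
) -> Dict[str, float]:
    """Time `forward` over `iters` calls after `warmup` untimed calls."""
    for _ in range(warmup):
        forward()  # untimed: lets caches and lazy initialisation settle
    timings = []
    for _ in range(iters):
        start = time.perf_counter()
        forward()
        timings.append(time.perf_counter() - start)
    return {
        "mean_s": statistics.mean(timings),
        "max_s": max(timings),
        "n": len(timings),
    }


# Hypothetical stand-in for the model's forward pass on synthetic inputs.
calls = {"count": 0}


def dummy_forward() -> int:
    calls["count"] += 1
    return sum(i * i for i in range(1000))


stats = measure_latency(dummy_forward)

# Shortest acquisition interval mentioned in the text (1 s).
MIN_SAMPLING_INTERVAL_S = 1.0
fits_realtime = stats["mean_s"] < MIN_SAMPLING_INTERVAL_S
```

When timing a model on an accelerator that executes asynchronously, the device would also need to be synchronized before each timer read; that detail is omitted here because the hardware used is not specified in the text.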

From a computational complexity perspective, the dominant cost arises from the ResNet18 backbone and temporal modeling modules. Given the moderate sequence length $L = 10$ and hidden dimension (256), the temporal overhead remains secondary to the spatial convolution operations. The pseudo-heatmap generation introduces negligible additional computational burden due to its sparse construction and small-kernel convolution.

Overall, the proposed framework achieves a balanced trade-off between predictive accuracy and computational efficiency, satisfying industrial real-time requirements while maintaining robust temperature prediction performance.

Regarding robustness to potential sensor failures or data corruption, the multimodal design inherently provides partial redundancy. If visual signals are temporarily degraded due to smoke or camera exposure issues, the thermocouple baseline branch can still provide stable coarse predictions. Conversely, if individual thermocouple readings become noisy or unavailable, the visual-temporal branch can compensate through learned spatial correlations. In practical deployment, abnormal sensor readings can be detected using threshold-based monitoring or statistical anomaly detection, and missing inputs can be handled through zero-masking or last-value holding strategies. Future research will further investigate explicit missing-modality training schemes to enhance fault tolerance.
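The threshold-based check and the two fallback strategies mentioned above (last-value holding and zero-masking) might look like the following sketch. The plausibility range, the function name `clean_readings`, and the example readings are all assumptions made for illustration; the text does not specify how these strategies would be implemented.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical plausibility limits for a thermocouple reading (°C).
VALID_RANGE_C = (0.0, 1400.0)


def clean_readings(
    current: Sequence[Optional[float]],
    previous: Sequence[float],
    valid_range: Tuple[float, float] = VALID_RANGE_C,
    strategy: str = "hold",
) -> Tuple[List[float], List[bool]]:
    """Replace missing or out-of-range readings and report a validity mask.

    strategy="hold" reuses the previous valid value of the same channel;
    strategy="zero" writes 0.0 (zero-masking, which in practice would
    usually be applied to normalized inputs).
    """
    low, high = valid_range
    cleaned: List[float] = []
    mask: List[bool] = []
    for value, last in zip(current, previous):
        ok = value is not None and low <= value <= high
        if ok:
            cleaned.append(float(value))
        elif strategy == "hold":
            cleaned.append(float(last))
        elif strategy == "zero":
            cleaned.append(0.0)
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        mask.append(ok)
    return cleaned, mask


# Made-up readings: channel 1 is missing, channel 2 is implausibly high.
previous = [900.0, 905.0, 910.0]
current = [902.0, None, 5000.0]
held, held_mask = clean_readings(current, previous, strategy="hold")
zeroed, zero_mask = clean_readings(current, previous, strategy="zero")
```

Returning the validity mask alongside the cleaned values lets a downstream component distinguish a real reading from a substituted one, which a missing-modality training scheme could exploit.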

5.4.8. Sensitivity Analysis of Gaussian Heatmap Parameters

To evaluate the robustness of the proposed Gaussian-based perception heatmap, we conducted a sensitivity analysis on the diffusion scale $\sigma$. We varied $\sigma$ from 1 to 9 and retrained the model under the same experimental settings. As shown in Figure 9, MAE generally decreases as $\sigma$ increases from 1 to 6, indicating that moderate spatial diffusion helps capture the spatially coupled thermal behavior. When $\sigma$ becomes larger (over 7), MAE increases again, suggesting over-smoothing effects where local variations are suppressed. Importantly, the performance remains relatively stable over a broad range ($\sigma \in [2, 7]$), demonstrating that the method does not rely on narrowly tuned hyperparameters. We select $\sigma = 6$ as the default setting since it achieves the best MAE. Regarding the kernel size, it is not treated as an independent hyperparameter. In our implementation, the Gaussian kernel window is set to $K = 6\sigma + 1$, which covers approximately $\pm 3\sigma$ around the center. Since the Gaussian tails beyond this range contribute negligibly, this choice ensures numerical consistency and avoids truncation artifacts, while preserving the physical interpretation of $\sigma$ as an effective diffusion length.
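The relation between $\sigma$ and the kernel window can be illustrated with a small sketch. It builds a Gaussian window of size $K = 6\sigma + 1$ and estimates how much of the Gaussian weight falls inside it. The peak value of 1 (no normalization) and the helper names are assumptions made here; the text does not state how the kernel is scaled.

```python
import math
from typing import List


def gaussian_kernel_2d(sigma: int) -> List[List[float]]:
    """2-D Gaussian on a K x K window with K = 6*sigma + 1 and peak value 1."""
    radius = 3 * sigma
    axis = [
        math.exp(-(d * d) / (2.0 * sigma ** 2))
        for d in range(-radius, radius + 1)
    ]
    # The 2-D Gaussian is separable, so it is the outer product of the axis.
    return [[wy * wx for wx in axis] for wy in axis]


def mass_inside_window(sigma: int, wide: int = 10) -> float:
    """Fraction of 1-D Gaussian weight inside +/-3*sigma, measured against
    a much wider +/-(wide*sigma) window that stands in for the full tails."""
    def total(radius: int) -> float:
        return sum(
            math.exp(-(d * d) / (2.0 * sigma ** 2))
            for d in range(-radius, radius + 1)
        )
    return total(3 * sigma) / total(wide * sigma)


kernel = gaussian_kernel_2d(6)      # default sigma selected in the text
K = len(kernel)                     # window size, 6 * 6 + 1
coverage_1d = mass_inside_window(6)
```

For $\sigma = 6$ the window is $37 \times 37$, and along one axis it retains more than 99.7% of the weight, which is consistent with the statement that the truncated tails contribute negligibly.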

6. Conclusions

In this study, we developed a multimodal spatiotemporal fusion network (MST-FusionNet) for noncontact temperature estimation in ceramic roller kilns. The proposed framework aims to alleviate the limitations of individual sensing modalities by integrating thermocouple measurements with kiln imagery, enabling temperature prediction at multiple roller measurement points.

To bridge the dimensional gap between heterogeneous data sources, a pseudo-heatmap generation strategy was designed to spatially align one-dimensional thermocouple readings with two-dimensional visual features. A residual compensation formulation was adopted, in which thermocouple data provide a stable baseline while visual information contributes to the estimation of local temperature deviations. This design simplifies the learning objective compared with direct absolute regression in highly dynamic combustion environments. In addition, an LSTM augmented with a multi-head attention mechanism was used to capture temporal dependencies and reduce the influence of transient disturbances such as smoke and flame fluctuations. Experimental results indicate that MST-FusionNet achieves a mean absolute error (MAE) of 0.9164 °C on the validation dataset, showing consistent improvements over baseline models and ablated variants. The ablation analysis suggests that residual learning plays a central role in stabilizing prediction, while pseudo-heatmap alignment and temporal attention contribute to spatial sensitivity and robustness.

Nevertheless, several limitations should be carefully considered. First, the proposed method depends on thermocouple measurements as reference signals. The spatial arrangement, calibration accuracy, and long-term stability of these sensors inevitably influence pseudo-heatmap construction and residual estimation. In real industrial settings, calibration drift or uneven sensor placement may affect prediction reliability. Therefore, appropriate sensor maintenance and layout optimization remain necessary conditions for practical application.

Second, the dataset used in this work consists of 13 experimental runs collected from a single-section roller kiln under controlled conditions. Although run-level data splitting and cross-temperature validation were employed to reduce data leakage and examine robustness to operating variations, all experiments were conducted on the same physical platform. As such, the present results mainly reflect intra-platform generalization capability. Differences in kiln geometry, burner configuration, airflow structure, or fuel type may lead to distinct thermal field patterns and visual characteristics. Additional validation on multi-site industrial datasets will be required to fully assess cross-platform transferability.

Third, while the temporal attention mechanism mitigates the influence of moderate visual disturbances, extreme smoke occlusion, camera degradation, or abrupt combustion instabilities may still compromise visual feature quality. Furthermore, real-world industrial systems may experience temporary sensor failures or data loss. Although the multimodal architecture provides a degree of redundancy, explicit fault-tolerant training strategies and systematic missing-modality handling were not explored in depth in this study. Incorporating uncertainty modeling, anomaly detection, and robustness-oriented training schemes represents an important direction for future research.

In summary, the proposed framework provides a preliminary yet practical exploration of multimodal temperature estimation in roller kilns. While further validation and robustness enhancement are needed, the results suggest that multimodal spatiotemporal fusion has potential value for improving temperature monitoring accuracy in industrial thermal processes.

Author Contributions

Conceptualization, S.W. and K.C.; methodology, K.C.; validation, K.C.; formal analysis, K.C.; investigation, K.C. and S.T.; resources, S.W.; data curation, K.C. and S.T.; writing—original draft preparation, K.C.; writing—review and editing, K.C. and S.W.; visualization, K.C.; supervision, S.W.; project administration, S.W.; funding acquisition, S.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Yunnan Provincial Science and Technology Project at Southwest United Graduate School (Grant No. 202302AQ370003-4), and National Natural Science Foundation of China (Grant No. 62562043).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author due to confidentiality agreements related to the university–industry collaborative research platform.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Hussnain, S.A.; Farooq, M.; Amjad, M.; Riaz, F.; Tahir, Z.U.R.; Sultan, M.; Hussain, I.; Shakir, M.A.; Qyyum, M.A.; Han, N.; et al. Thermal analysis and energy efficiency improvements in tunnel kiln for sustainable environment. Processes 2021, 9, 1629.
2. Chen, P.; Sui, M.; Wang, S.; Li, F. Simulation and experimental research on uniform heating of roller-hearth furnace with oxygen-enriched pulse combustion. Fuel Process. Technol. 2025, 272, 108213.
3. Shi, W.; Sui, M.; Li, F.; Chen, Y.; Luo, Y.; Wang, H. Nonlinear oxygen-enriched combustion and uniform heating characteristics of ceramic kilns. Appl. Therm. Eng. 2025, 284, 129087.
4. Zhang, Y.; Gu, Z.; Yu, H.; Shi, S. Flame Combustion State Detection Method of Cement Rotary Furnace Based on Improved RE-DDPM and DAF-FasterNet. Appl. Sci. 2024, 14, 10640.
5. Zhang, X.; Xu, L.; Yang, Y.; Zhang, B.; Xu, H. Temperature measurement of coal fired flame in the cement kiln by raw image processing. Measurement 2018, 129, 471–478.
6. Zhang, R.; Cheng, Y.; Li, Y.; Zhou, D.; Cheng, S. Image-based flame detection and combustion analysis for blast furnace raceway. IEEE Trans. Instrum. Meas. 2019, 68, 1120–1131.
7. Vanhaeverbeke, J.; Verstockt, S.; Van Hoecke, S. Flame monitoring and anomaly detection in steel reheating furnaces based on thermal video using a hybrid AI computer vision system. Sci. Rep. 2025, 15, 31300.
8. An, J.; Wu, M.; He, Y. A temperature field detection system for blast furnace based on multi-source information fusion. Intell. Autom. Soft Comput. 2013, 19, 625–634.
9. Zhu, Y.H.; Liu, Y.F. Application of flame image recognition-based information fusion technology to roller kiln temperature detection. Appl. Mech. Mater. 2013, 239, 769–774.
10. Zhou, G.; Liu, W.; Yu, Y.; Saxén, H. Advancing blast furnace thermal state prediction: A data-driven approach using thermocouple integration and multimodal modeling. Steel Res. Int. 2025, 96, 433–447.
11. Cui, G.M.; Jiang, Z.G.; Liu, P.L.; Chen, Z.H.; Shi, L. Prediction of blast furnace temperature based on multi-information fusion of image and data. In Proceedings of the 2018 Chinese Automation Congress (CAC), Xi’an, China, 30 November–2 December 2018; pp. 2317–2322.
12. Ge, Z.; Song, Z.; Ding, S.X.; Huang, B. Data mining and analytics in the process industry: The role of machine learning. IEEE Access 2017, 5, 20590–20616.
13. Zhang, R.; Lu, S.; Yu, H.; Wang, X. Recognition method of cement rotary kiln burning state based on Otsu-Kmeans flame image segmentation and SVM. Optik 2021, 243, 167418.
14. Li, W.; Wang, D.; Chai, T. Burning state recognition of rotary kiln using ELMs with heterogeneous features. Neurocomputing 2013, 102, 144–153.
15. Chen, H.; Zhang, X.; Hong, P.; Hu, H.; Yin, X. Recognition of the temperature condition of a rotary kiln using dynamic features of a series of blurry flame images. IEEE Trans. Ind. Inform. 2015, 12, 148–157.
16. Chen, H.; Yan, T.; Zhang, X. Burning condition recognition of rotary kiln based on spatiotemporal features of flame video. Energy 2020, 211, 118656.
17. Jiang, Y.; Chen, H.; Zhang, X.; Zhou, Y.; Wang, L. Combustion condition recognition of coal-fired kiln based on chaotic characteristics analysis of flame video. IEEE Trans. Ind. Inform. 2021, 18, 3843–3852.
18. Lyu, Z.; Jia, X.; Yang, Y.; Hu, K.; Zhang, F.; Wang, G. A comprehensive investigation of LSTM-CNN deep learning model for fast detection of combustion instability. Fuel 2021, 303, 121300.
19. Wang, Z.; Song, C.; Chen, T. Deep learning based monitoring of furnace combustion state and measurement of heat release rate. Energy 2017, 131, 106–112.
20. Arroyo, J.; Pillajo, C.; Barrio, J.; Compais, P.; Tavares, V.D. Deep Learning Techniques for Enhanced Flame Monitoring in Cement Rotary Kilns Using Petcoke and Refuse-Derived Fuel (RDF). Sustainability 2024, 16, 6862.
21. Hu, Y.; Zheng, W.; Wang, X.; Qin, B. Working condition recognition based on transfer learning and attention mechanism for a rotary kiln. Entropy 2022, 24, 1186.
22. Zhou, Y.; Zhang, C.; Han, X.; Lin, Y. Monitoring combustion instabilities of stratified swirl flames by feature extractions of time-averaged flame images using deep learning method. Aerosp. Sci. Technol. 2021, 109, 106443.
23. Liu, Z.; Xiao, G.; Liu, H.; Wei, H. Multi-sensor measurement and data fusion. IEEE Instrum. Meas. Mag. 2022, 25, 28–36.
24. Wang, Z.; Bao, Y.; Gu, C. Convolutional Neural Network-Based Method for Predicting Oxygen Content at the End Point of Converter. Steel Res. Int. 2023, 94, 2200342.
25. Zhou, G.; Li, M.; Jiang, D.; Mattila, O.; Yu, Y.; Saxén, H. A Dual-Channel Multimodal Data Fusion Approach for Hot Metal Temperature Prediction in Blast Furnaces. J. Sustain. Metall. 2025, 11, 1263–1281.
26. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.

Figure 1. Schematic illustration of the ceramic roller kiln structure and thermodynamic zoning. The roller kiln consists of three major zones: pre-heating zone (ambient–900 °C), sintering zone (900–1250 °C), and cooling zone (1250 °C–ambient). Ceramic bodies are transported on rotating rollers from the entrance to the exit. Combustion air, auxiliary air, and fuel are supplied to the sintering zone, while flue gases are discharged at both ends. The schematic highlights the spatial arrangement of heating devices, airflow direction, and thermal zone transitions relevant to temperature detection.

Figure 2. Overall architecture of the proposed MST-FusionNet for multimodal temperature detection in ceramic roller kilns. The framework integrates combustion image sequences and wall-mounted thermocouple measurements. First, discrete thermocouple readings are transformed into multi-channel Gaussian pseudo-heatmaps and concatenated with chromaticity-transformed image frames for early spatial fusion. The fused features are extracted by a modified ResNet-18 backbone. In parallel, thermocouple vectors are encoded through MLP branches to provide baseline temperature estimation and global thermal context. Temporal dependencies are modeled using an LSTM followed by a multi-head self-attention (MHA) module to emphasize informative frames. Finally, a residual compensation mechanism combines baseline predictions and learned residuals to output temperatures at six roller positions.

Figure 3. Multi-channel Gaussian pseudo-heatmaps constructed from visible thermocouple measurements. Each subfigure corresponds to one visible thermocouple projected onto the image plane. The discrete temperature reading at the sensor location is spatially diffused using a two-dimensional Gaussian kernel, forming a localized thermal distribution centered at the sensor position. Different channels are constructed independently to preserve spatial separability before being concatenated with combustion images for early fusion. The blue bullets indicate the positions of different thermocouples in the images.

Figure 4. Experimental platform for data collection in the sintering zone of the roller kiln. The platform consists of a roller-hearth kiln system, oxygen supply system, combustion system, thermocouple array, and a color CCD camera equipped with a cooling mechanism. The thermocouples are distributed along the kiln wall for boundary temperature measurement, while the camera captures combustion image sequences. PTCR temperature measurement rings placed on rollers are used to obtain ground-truth ceramic body temperatures.

Figure 5. Representative combustion images at different temperature levels in the sintering zone. The images correspond to approximate kiln temperatures of (a) 610 °C, (b) 690 °C, (c) 785 °C, (d) 950 °C, (e) 1010 °C, and (f) 1070 °C. As temperature increases, flame intensity and chromatic distribution gradually change, reflecting variations in combustion morphology and radiation characteristics.

Figure 7. Scatter plots comparing predicted and measured temperatures on the validation set for different methods. Each subplot corresponds to one model. The horizontal axis denotes PTCR-measured ground-truth temperatures (°C), and the vertical axis denotes predicted temperatures (°C). The diagonal line represents perfect prediction (y = x). Compared with single-modal baselines, the proposed MST-FusionNet shows the strongest linear correlation and minimal dispersion across the full temperature range (600–1200 °C).

Figure 8. Temperature tracking performance comparison on Roller 2 across validation samples. (a) Tracking curve of the vision-only baseline compared with ground truth; (b) Tracking curve of MST-FusionNet compared with ground truth; (c) Absolute error curves of different methods. The proposed method closely follows step-like temperature changes with minimal lag and reduced oscillation, demonstrating improved temporal stability under dynamic combustion conditions.

Figure 9. Sensitivity analysis of the Gaussian diffusion scale $\sigma$ on MAE. The kernel size is set to $K = 6\sigma + 1$ to cover approximately $\pm 3\sigma$, ensuring negligible truncation error.

Table 1. Comparison of representative multimodal fusion approaches in high-temperature industrial applications. The table summarizes differences in industrial scenarios, data modalities, fusion strategies, spatial alignment mechanisms, residual modeling, and whether the methods target solid workpiece temperature estimation.

Paper	Industrial Scenario	Data	Target Variable	Fusion Strategy	Spatial Alignment Mechanism	Residual Modeling	Focus on Solid Workpieces
[24]	Converter Steelmaking	23 heterogeneous process variables	Oxygen content at the end point	Deep learning (CNN)	✘	✘	✘
[11]	Blast furnace	Image + operational data	Hot metal temperature, Si content	Image gray features + process variables (feature-level fusion)	✘	✘	✘
[25]	Blast furnace	Tuyere images + sequential numerical data (blast parameters and gas compositions)	Hot metal temperature	Dual-channel CNN–GRU fusion	✘	✘	✘
[10]	Blast furnace	Thermocouple	Thermal state	Thermocouple spatial modeling + ensemble learning	Spatial encoding of thermocouples	✘	✘
[8]	Blast furnace	Infrared image + cross temperature-measurer + wall thermocouple + coke/ore ratio	Burden surface temperature field	Fusion based on Reliability Theory and Kalman Filter	✘	✘	✘
This paper	Ceramic roller kiln	Image + thermocouple	Temperature of ceramic bodies	Early fusion + temporal modeling + residual compensation	Gaussian pseudo-heatmap alignment	✔	✔

Table 2. Overall temperature detection performance on the validation set (°C). MAE and RMSE are computed after inverse normalization to the Celsius scale. Lower values indicate better detection accuracy.

Method	Modalities	Temporal Modeling	MAE	RMSE
Baseline 1	Thermo	None	8.6563	11.7361
Baseline 2	Image	None	17.2476	24.7732
Baseline 3	Image	LSTM	6.1107	16.8946
Baseline 4	Image + Thermo	LSTM	3.1314	8.4193
Proposed	Image + Thermo	LSTM + MHA	0.9164	1.2422

Table 3. Per-roller mean absolute error (MAE) on the validation set (°C). The results evaluate spatial detection consistency across six roller positions.

Method	P1	P2	P3	P4	P5	P6
Baseline 1	10.0083	9.4110	8.7454	8.2106	7.8909	7.6717
Baseline 2	17.8761	19.9082	19.8723	16.8757	15.2483	13.7048
Baseline 3	6.8743	6.6424	6.2383	5.5502	5.3160	6.0426
Baseline 4	3.5732	4.0030	2.9682	2.7100	2.5227	3.0112
Proposed	0.6575	1.4389	0.7719	0.8950	0.8051	0.9300

Table 4. Ablation study evaluating the contribution of individual components. Each variant removes one key module (Residual Compensation, Pseudo-Heatmap Alignment, or MHA). MAE is reported in °C.

Variant	Residual	Heatmap	MHA	MAE
w/o Residual	✘	✔	✔	2.3283
w/o Heatmap	✔	✘	✔	1.9689
w/o Attention	✔	✔	✘	1.4584
Proposed	✔	✔	✔	0.9164

Table 5. Progressive addition study starting from the naïve early-fusion baseline (Baseline 4). Each component is added independently to quantify its individual contribution to performance improvement, with MAE reported in °C.

Model	MHA	Pseudo-Heatmap	Residual	MAE
Baseline 4	✘	✘	✘	3.1314
+MHA	✔	✘	✘	2.6058
+Heatmap	✘	✔	✘	2.4585
+Residual	✘	✘	✔	1.4571
Proposed	✔	✔	✔	0.9164

Table 6. Cross-temperature generalization performance (°C). Validation data include temperature levels of 800 °C, 900 °C, and 1000 °C that are excluded from training. Results evaluate robustness under temperature distribution shifts.

Method	Modalities	Temporal Modeling	MAE	RMSE
Baseline 1	Thermo	None	12.1494	15.3796
Baseline 2	Image	None	18.8407	23.8339
Baseline 3	Image	LSTM	30.8856	35.5087
Baseline 4	Image + Thermo	LSTM	13.9010	17.7788
Proposed	Image + Thermo	LSTM + MHA	4.7253	6.4535

Table 7. Comparison between the proposed method and representative temperature detection approaches reported in the literature. Reported errors are extracted from published experimental validations in high-temperature industrial contexts. Direct numerical comparison should be interpreted cautiously due to differences in datasets and application scenarios.

Reference	Target Application	Modality & Method	Reported Lowest Error (°C)
[5]	Cement rotary kiln	Raw image processing & improved ratio pyrometry	8.0
[10]	Blast furnace hearth	Multimodal ensemble model (thermocouple + spatial temp)	7.2
[11]	Blast furnace hearth	Time series neural network with multi-information fusion (Tuyere images + online data)	5.7
[25]	Blast furnace	DCFANet (CNN for Tuyere images + GRU-Attention for numerical data)	5.4
Ours	Roller kiln	Physics-aware pseudo-heatmap alignment (Visual + Boundary data)	0.9164

Note: The reported errors for references [5,10,11,25] are extracted from their respective experimental validations in high-temperature industrial contexts.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
