1. Introduction
With the rapid growth of urbanization in China, the total production of municipal solid waste (MSW) continues to rise [1], leading to the phenomenon of ‘cities surrounded by garbage’ in many areas [2]. Exploring effective ways to reduce, recycle, and harmlessly treat MSW has therefore become a major issue that cannot be ignored. Among current treatment methods, municipal solid waste incineration (MSWI) has emerged as the mainstream disposal method, compared with landfilling and composting, owing to its significant volume reduction and high treatment efficiency [3,4]. The core of this technology lies in converting MSW into ash, flue gas, and recoverable thermal energy through high-temperature combustion, which not only effectively reduces secondary pollution but also enables the repeated utilization of waste resources [5,6]. Recycling transforms waste into resources, incineration generates electricity, and residues can be made into building materials, alleviating landfill pressure, promoting the green and low-carbon transformation of waste management, and supporting sustainable development [7].
While MSWI offers significant advantages in volume reduction and energy recovery, its environmental benefits depend strongly on the stability of the combustion state. Under normal combustion conditions, pollutant generation is effectively suppressed. However, when combustion enters an abnormal state, characterized by disordered flame morphology and an imbalanced temperature field within the furnace, incomplete combustion occurs. This leads to a substantial increase in the emission concentrations of harmful substances such as carbon monoxide (CO) and dioxins [8]. These pollutants not only pose direct risks to human health but also enter the soil and water bodies through atmospheric deposition, causing persistent environmental pollution and disrupting ecosystem balance. Therefore, accurate identification and real-time control of different combustion states is a crucial step toward suppressing pollutant generation at the source and ensuring the environmental cleanliness of the incineration process. Unstable combustion not only manifests as flame oscillation and reduced efficiency but also directly causes rapid ash deposition on the heat transfer surfaces. This ‘insulating layer’ severely hinders heat exchange, and contact between high-temperature ash and metallic materials accelerates chemical corrosion, thereby shortening equipment lifespan [9]. If operators fail to make timely and precise adjustments to such fluctuations, the risks can escalate dramatically; in the worst case, delayed adjustments can trigger complete thermal runaway, damaging equipment and causing severe safety accidents and economic losses. In modern municipal solid waste incineration power plants, maintaining a stable combustion process is therefore the lifeline that determines whether the entire system can operate efficiently and economically [10].
However, many current MSWI plants still rely largely on experienced operators for combustion control. Operators must constantly monitor screens or furnace windows, visually observing the flame’s color, shape, and flickering frequency to judge the combustion conditions, and then manually adjust key parameters such as fuel supply, air intake, and the ratio of primary to secondary air. This approach presents several limitations. First, human judgment is inherently subjective and varies significantly across operators due to differences in experience and attention, leading to inconsistent control outcomes. Second, the limited reaction speed and endurance of human operators hinder their ability to respond to rapid combustion fluctuations. Most critically, the reliance on manual observation precludes intelligent, precise, and real-time optimized control. Therefore, the valuable hands-on experience accumulated by senior engineers and operators must be transformed, through data analysis, machine learning, and other advanced technologies, into an intelligent knowledge base that machines can understand and execute. In this way, the system can autonomously recognize subtle changes in combustion conditions in real time and predict potential risks. Achieving this goal not only significantly reduces the total emissions of flue gas pollutants but also promotes energy recovery and resource recycling through more efficient combustion, enabling MSWI to truly become a green industry for sustainable development.
To overcome the over-reliance on experienced experts and the strong subjectivity and variability in recognizing combustion states in traditional MSWI processes, artificial intelligence technologies have been increasingly applied in this field in recent years [11]. Cao et al. [12] proposed a DQN-PL model that integrates GA-SA multi-threshold segmentation and deep reinforcement learning for flame state recognition. It extracts shape-statistical features, performs feature selection and dimensionality reduction, and employs a pseudo-label-enhanced classification strategy, achieving high accuracy across five combustion states. Guo et al. [13] proposed an image-recognition-based method for identifying the burning state of a waste incinerator, which enables rapid judgment of the in-furnace burning state and feeds the result into the automatic control system, improving the intelligent control level of the incinerator. Yu et al. [14] combined neural networks with infrared thermal imaging to detect equipment malfunctions and used Support Vector Machines to classify flame images, achieving high flame-classification accuracy. Yang et al. [15] proposed a feature extraction method based on the YOLOv5 algorithm and realized recognition of combustion states in the head layer during the MSWI process. Omiotek and Kotyra [16] introduced a method for processing and classifying flame images based on the pre-trained VGG16 model and showed that it could efficiently recognize poor combustion states. Zhang et al. [17] combined three feature enhancement strategies, a multi-scale attention module, a deformable multi-head attention module, and a contextual feature fusion module, to effectively integrate local and global features of flame combustion, improving model performance and robustness and achieving accurate recognition of MSWI flame states (see Figure 1).
In constructing deep learning models to recognize the flame state of MSWI, most approaches rely on Convolutional Neural Networks (CNNs). However, traditional CNNs often exhibit inherent shortcomings in feature extraction, such as insufficient utilization of spatial information, chaotic feature distribution, and limited semantic expression capability. Addressing these issues therefore opens the way to more efficient network architectures. Xu et al. [18] proposed a novel attention mechanism called Efficient Local Attention (ELA) to tackle the problems of channel dimensionality reduction and model complexity that arise when traditional CNNs exploit spatial information. By using one-dimensional convolution and group normalization, ELA efficiently encodes spatial positional information while preserving channel dimensions and is applicable to various CNN architectures. Ouyang et al. [19] proposed the Efficient Multi-scale Attention (EMA) module to address the insufficiencies of CNNs in processing multi-scale features and the limited richness of feature representation caused by insufficient inter-channel information interaction. The module uses two parallel branches to encode global information for recalibrating channel weights and aggregates output features through cross-dimensional interaction to capture pixel-level pairwise relationships. Vong et al. [20] introduced a Spatial Pyramid Pooling (SPP) layer in CNNs to remove the traditional requirement for fixed input image sizes. The SPP layer divides the input into a fixed number of blocks and takes the maximum value in each block, providing a fixed-size output for subsequent fully connected layers; this enables CNNs to handle input images of different sizes while preserving original image details, improving prediction accuracy. Zhang et al. [21] proposed Attention-Guided Repair for Robustness (AR2) to enhance the robustness of CNNs against common image disturbances. AR2 aligns the Class Activation Maps (CAMs) of clean and contaminated images and adopts an iterative repair strategy that alternates CAM-guided refinement and fine-tuning, thereby enhancing attention consistency under input perturbations; experiments show that AR2 significantly improves robustness across multiple benchmarks while maintaining high accuracy on clean data. Li et al. [22] proposed the Spatial Group-wise Enhance (SGE) module, which generates attention factors for spatial positions within each semantic group, adjusts the importance of sub-features, enhances the feature representation of key regions, and suppresses noise, effectively improving the capability of CNNs in semantic feature learning.
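To make the ELA mechanism concrete, the following is a minimal, hypothetical PyTorch sketch of ELA-style attention as described in [18]: the feature map is pooled along each spatial axis, each pooled sequence is encoded with a 1D convolution plus group normalization, and the resulting directional attention maps reweight the input. The kernel size and group count here are illustrative assumptions, not the authors’ exact configuration.

```python
import torch
import torch.nn as nn

class ELASketch(nn.Module):
    """Hypothetical sketch of Efficient Local Attention (ELA)-style gating.

    Pools features along each spatial axis, encodes positional information
    with a depthwise 1D convolution + group normalization, and reweights
    the input without reducing the channel dimension.
    """
    def __init__(self, channels: int, kernel_size: int = 7, groups: int = 16):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=kernel_size // 2, groups=channels, bias=False)
        self.gn = nn.GroupNorm(groups, channels)
        self.act = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        x_h = x.mean(dim=3)   # (b, c, h): pooled along width
        x_w = x.mean(dim=2)   # (b, c, w): pooled along height
        a_h = self.act(self.gn(self.conv(x_h))).view(b, c, h, 1)
        a_w = self.act(self.gn(self.conv(x_w))).view(b, c, 1, w)
        return x * a_h * a_w  # channel dimension preserved throughout
```

Because the attention maps keep the full channel width, this design avoids the channel reduction used in earlier coordinate-attention variants.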
The above research indicates that deep learning-driven artificial intelligence technology, with its excellent feature extraction and representation capabilities, has demonstrated outstanding performance in MSWI combustion condition identification and related fields. By monitoring flame states in real time and guiding technicians in dynamically optimizing incineration parameters, artificial intelligence systems can significantly enhance combustion stability and efficiency, reducing energy consumption and the generation of incomplete combustion products, thereby elevating the level of intelligence in process control and strengthening the system’s ability to handle complex operating conditions. However, MSWI flame images are complex, often exhibiting intricate shapes, high noise, significant individual differences, and blurred boundaries; combined with the inherent limitations of CNN-based deep neural networks, existing models cannot fully and effectively extract flame features for recognition. To address these limitations, this paper proposes the PRTNet model to achieve precise capture and identification of flame combustion features. The main contributions of this paper are as follows:
(1) We design a novel hybrid architecture, PRTNet, which effectively combines the advantages of CNNs and Transformers and efficiently aggregates multi-scale feature information, achieving efficient recognition of MSWI flame combustion states.
(2) We combine ELA with SGE to form a Local-Semantic Enhanced Attention (LSEA) module and embed it into the ResNet backbone, establishing multi-scale spatial correlations between fine-grained textures and combustion patterns and significantly improving the residual network’s recognition accuracy on flame regions.
(3) We propose a feature-adaptive fusion Transformer (FAFT) with a global-local adaptive fusion (GLAF) module at its core, which attends to the overall spatial distribution of the flame while preserving key details such as edges and bright spots, enhancing the integrity and discriminative power of flame features.
(4) We design a Cross-Scale Feature Guided Aggregation (CFGA) module to efficiently fuse shallow high-resolution spatial details, mid-level transitional features, and deep high-semantic information, strengthening multi-scale flame feature integration and significantly improving feature extraction and recognition performance in complex combustion scenarios.
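As an illustration of contribution (4), cross-scale aggregation of the kind CFGA performs can be sketched roughly as follows: project shallow, mid, and deep feature maps to a common channel width, resize them to the shallow resolution, and blend them with learned per-scale spatial weights. This is a generic sketch under our own assumptions, not the authors’ CFGA implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossScaleFusionSketch(nn.Module):
    """Generic cross-scale aggregation sketch (not the exact CFGA design)."""
    def __init__(self, c_shallow: int, c_mid: int, c_deep: int, c_out: int):
        super().__init__()
        # 1x1 projections bring every scale to a common channel width
        self.proj = nn.ModuleList(
            nn.Conv2d(c, c_out, kernel_size=1) for c in (c_shallow, c_mid, c_deep)
        )
        # learned per-scale spatial weights (softmax over the three scales)
        self.weight = nn.Conv2d(3 * c_out, 3, kernel_size=1)

    def forward(self, x_shallow, x_mid, x_deep):
        h, w = x_shallow.shape[-2:]
        feats = []
        for proj, x in zip(self.proj, (x_shallow, x_mid, x_deep)):
            f = proj(x)
            if f.shape[-2:] != (h, w):
                # upsample coarser scales to the shallow (highest) resolution
                f = F.interpolate(f, size=(h, w), mode="bilinear", align_corners=False)
            feats.append(f)
        w_scale = torch.softmax(self.weight(torch.cat(feats, dim=1)), dim=1)
        # spatially varying weighted sum of the three aligned scales
        return sum(w_scale[:, i:i + 1] * feats[i] for i in range(3))
```

The softmax gating lets each spatial location draw predominantly from whichever scale is most informative there, which is the general intuition behind guided multi-scale aggregation.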
3. Experimental Results and Analysis
To evaluate the effectiveness of the algorithm, this section first establishes the evaluation metrics and provides a detailed description of the flame image dataset used in the study. Validation is then conducted through a phased experimental design: the overall performance of the algorithm is tested first, followed by ablation studies and comparative experimental analyses. The quantitative results consistently indicate that the proposed method offers competitive advantages in both accuracy and potential for engineering application.
3.1. Evaluation Metrics
To quantitatively evaluate the performance of classification models, this study selects four core metrics: Accuracy, Precision, Recall, and F1-score. Accuracy reflects the proportion of correctly classified samples overall; Precision measures the proportion of predicted positive samples that are actually positive; Recall represents the proportion of actual positive samples that are correctly identified; the F1-score is the harmonic mean of Precision and Recall, providing a comprehensive and balanced evaluation. Quantitative analysis of experimental results based on these metrics can effectively reveal the performance characteristics and limitations of the model. The calculation formulas for the above evaluation metrics are as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1-score = 2 × Precision × Recall / (Precision + Recall)

where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.
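As a concrete check, the four metrics can be computed directly from label/prediction pairs. The sketch below uses macro averaging over classes, which is one common convention for multi-class tasks; the averaging scheme is our assumption, since the text does not state it.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged Precision, Recall, and F1-score."""
    classes = sorted(set(y_true))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

    precisions, recalls, f1s = [], [], []
    for c in classes:
        # one-vs-rest counts for class c
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)

    n = len(classes)
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n
```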
3.2. Experimental Setup
We built our PRTNet model using PyTorch v2.5.1 and trained it on a single NVIDIA GeForce RTX 3090 GPU. The model input consists of flame images of size 576 × 576 × 3. The Adam optimizer was used with an initial learning rate of 1 × 10⁻⁴ and a weight decay of 5 × 10⁻⁵, with a cosine annealing schedule applied to the learning rate. The network was trained for 100 epochs with a batch size of 32, and the best-performing model was selected for testing.
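The optimization setup described above can be reproduced approximately as follows. The placeholder model, the scheduler’s `T_max`, and per-epoch stepping are our assumptions; the text states only Adam, the learning rate, the weight decay, and cosine annealing.

```python
import torch
import torch.nn as nn

# placeholder model standing in for PRTNet (4 combustion-state classes)
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 576 * 576, 4))

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=5e-5)
# cosine annealing of the learning rate over the 100 training epochs
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

# training loop skeleton: optimizer.step() per batch, scheduler.step() per epoch
```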
3.3. Flame Burning Images Dataset
This experiment uses the flame combustion image dataset created by Pan et al. [12], with data sourced from a municipal solid waste incineration (MSWI) plant in Beijing. The plant monitors the combustion state of waste on the left and right furnace grates in real time via high-temperature endoscopes installed on both sides of the back wall of the incinerator furnace. The video signals are transmitted over coaxial cables and stored on an industrial control computer’s video capture card. The dataset classifies combustion states into four categories: normal combustion, partial combustion, flashover, and smoldering. The dataset construction process includes the following key steps: (1) removing video segments that cannot clearly reflect the combustion state; (2) having plant operation experts screen stable combustion frames under typical operating conditions based on classification standards (typical state illustrations are shown in Figure 1) and label the states; (3) standardizing the classified videos using a timed sampling algorithm developed on the MATLAB 7.0 platform, extracting key frames at a fixed interval of one frame per minute, since the combustion state changes only slowly over short periods. Quantitative statistics show that a total of 3289 and 2685 valid image samples were obtained for the left and right furnace grates, respectively, covering the multidimensional flame features of the different combustion states. Given that the flame images of the left and right furnace grates exhibit symmetrical distribution characteristics, this study merges the images of both grates into a unified dataset to enhance the model’s generalization ability. The number of data samples corresponding to each typical combustion state is shown in Table 1.
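The fixed-interval key-frame extraction in step (3) amounts to converting a time interval into a frame stride. A minimal sketch of that step is shown below; the function name and the example frame rate are our own (the original used a MATLAB implementation whose details are not given).

```python
def sample_frame_indices(total_frames: int, fps: float, interval_seconds: float = 60.0):
    """Indices of the frames retained when sampling once per `interval_seconds`."""
    step = max(1, int(round(fps * interval_seconds)))
    return list(range(0, total_frames, step))

# e.g. a 25 fps recording sampled once per minute keeps one frame every 1500 frames
```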
3.4. Model Experiment Results
Figure 9 shows the changes in the model’s loss and accuracy during training. The left panel displays the trends of training loss and validation loss over the number of training epochs. The horizontal axis represents the number of training epochs, and the vertical axis represents the loss value. Both curves show a downward trend: the training loss decreases rapidly in the early stages and gradually approaches zero later, indicating that the model fits the training set well. Although the validation loss fluctuates to some extent after initially decreasing, the overall trend also stabilizes at a low level. This indicates that the model demonstrates strong generalization ability on the validation set, and its pattern of change closely matches the training loss curve.
The right panel depicts the evolution of training accuracy and validation accuracy over the course of training epochs. Both accuracy curves show an upward trend: the training accuracy increases significantly during training and eventually stabilizes at a high level close to 1.0, further confirming the model’s excellent fit to the training data. The validation accuracy curve fluctuates somewhat in the early stages but generally maintains an upward trend and stabilizes later on, with the final value showing only a small gap compared to the training accuracy. This indicates that the model’s performance on the validation set is comparable to that on the training set, demonstrating good generalization ability and showing no obvious signs of overfitting.
The synchronized optimization trends of training and validation metrics jointly confirm the effectiveness of the training process and ensure the reliability of the model’s performance evaluation on the validation set. To more intuitively demonstrate the performance of the PRTNet model proposed in this study, we present three-dimensional confusion matrices, as shown in Figure 10, summarizing the classification results obtained by the model on the test set.
Table 2 details the quantitative performance metrics of the PRTNet model on the four-class MSWI flame state recognition task. The overall performance of the model is excellent, with average accuracy, precision, recall, and F1 score reaching 96.29%, 96.30%, 96.25%, and 96.27%, respectively, fully demonstrating the model’s high accuracy and reliability in the MSWI flame state classification task. Notably, the model performs particularly well in recognizing the partial combustion state, with all four metrics exceeding 97%. The model also achieves high recognition accuracy for the normal combustion, flashover, and smoldering states, with accuracy and F1 scores above 95%.
Figure 11 presents the detailed classification performance data from Table 2 through intuitive visual charts, clearly illustrating the differences in the model’s ability to distinguish various combustion states and its overall balance. The data distribution in the figure clearly indicates that the model maintains similarly high levels across all evaluation metrics for each category, with no significant weaknesses observed. This balanced and excellent performance distribution strongly validates that the PRTNet model possesses outstanding robustness and generalization capability when handling complex and variable MSWI flame images.
3.5. Ablation Experiments
To further investigate the effectiveness of each module, we conducted ablation experiments both on each module’s contribution to overall network performance and on the main components within the modules.
Table 3 presents the detailed results of the module ablation experiments, where different modules were gradually added to the traditional ResNet architecture to verify their contributions. Network2, which independently introduces the LSEA module, achieved significant improvements in accuracy, precision, recall, and F1 score over the baseline, with increases of 1.57%, 1.59%, 1.64%, and 1.61%, respectively. This confirms that the LSEA module effectively enhances the model’s ability to capture flame features by improving the local detail and semantic representation of flame textures. Similarly, Network3 and Network4, which apply the FAFT and CFGA modules, respectively, also brought considerable performance gains: Network3’s accuracy and F1 score improved by 1.68% and 1.72% over the baseline, while Network4’s increased by 1.35% and 1.48%. This indicates that the feature-adaptive fusion mechanism of FAFT strengthens the representation of flame spatial distribution and key details, whereas CFGA effectively enhances the model’s recognition ability in complex combustion scenarios by aggregating multi-scale features.
Further analysis of module combinations revealed that the synergy between any two modules yields additional performance improvements. Network5 achieved an accuracy of 95.73% and an F1 score of 95.60%; Network6 and Network7 also outperformed the single-module variants, with accuracies rising to 95.26% and 95.18% and F1 scores increasing to 95.25%. This demonstrates that the modules are functionally complementary, promote one another, and jointly optimize the model’s feature extraction and recognition performance.
Overall, the Network8 model, which integrates all three modules, achieved the best values in all evaluation metrics, with the four metrics improving to 96.29%, 96.30%, 96.25%, and 96.27%, respectively. This result strongly validates that the proposed combination of modules can fully leverage their respective advantages and work collaboratively, thereby achieving optimal model performance in the flame combustion state recognition task. To more intuitively demonstrate the performance contribution of each module, we selected the results of accuracy and F1 score for visualization, as shown in Figure 12.
Figure 13 illustrates the accuracy trends during training and validation for different network configurations. The left panel shows that the PRTNet model exhibits remarkable rapid-convergence behavior during training: its training accuracy rises sharply in the early epochs and stabilizes quickly, demonstrating a strong fitting capacity. In the right panel, the validation curve of PRTNet displays only minor fluctuations, further underscoring the robustness of its design. Overall, PRTNet delivers a combined advantage of fast convergence, high accuracy, and low volatility in both training and validation, significantly outperforming the alternative configurations.
In addition, we conducted ablation experiments on the main components of the LSEA and FAFT modules to explore their effectiveness. The LSEA module was decomposed into its ELA and SGE components, each embedded into ResNet separately, to verify the coordinated and complementary effects of the two.
Table 4 shows the results of the ablation experiments on the LSEA module, and Figure 14 provides a visual comparison of accuracy and F1 scores.
The experimental results clearly indicate that introducing either the ELA or SGE module individually improves both accuracy and F1 score to varying degrees. Moreover, integrating ELA and SGE into the LSEA module yields a significant synergistic enhancement: accuracy increases by 1.57 percentage points to 94.43%, and the F1 score likewise rises to 94.43%. This 1.57-percentage-point gain exceeds the improvements obtained from ELA and SGE individually, strongly confirming the functional complementarity between local texture enhancement and the semantic guidance mechanism. The LSEA module shows a consistent upward trend across all evaluation metrics, validating the design concept that it effectively enhances the discriminability of flame features by optimizing the feature representation space.
For the ablation study of the FAFT module, we focused on its core component, the three branches of GLAF, investigating the complementarity between its global and local branches as well as the necessity of the adaptive gating mechanism. The results of the ablation experiments and the visualization of the two main metrics are shown in Table 5 and Figure 15.
The experimental results systematically reveal the synergistic mechanism of the three branches in the FAFT module. When only the global branch is enabled, the model achieves an accuracy of 95.52% and an F1 score of 95.51%, verifying its effectiveness in modeling long-range spatial dependencies in flames; when only the local branch is enabled, both accuracy and F1 score reach 95.24%, highlighting its advantage in enhancing fine-grained features such as flame edge textures. However, when both branches are activated without a gating mechanism, the accuracy unexpectedly drops to 94.98%, indicating that directly weighting and fusing the smooth semantic features of the global branch with the sharp texture features of the local branch causes representation conflicts. After introducing adaptive gating, the model’s performance jumps to the optimal level. This confirms that the gating network dynamically generates spatially sensitive weights, preserving the global branch’s robustness to flame deformation while retaining the local branch’s sensitivity to flame-core microstructures, thus resolving feature conflicts with minimal additional parameters and achieving intelligent fusion of complementary ‘semantic-texture’ representations.
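The adaptive gating validated in this ablation can be sketched as a learned, spatially varying convex combination of the two branch outputs. The following is our own minimal illustration of this idea, not the exact GLAF gating network:

```python
import torch
import torch.nn as nn

class GatedFusionSketch(nn.Module):
    """Spatially varying gate blending global- and local-branch features."""
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),  # per-position, per-channel weights in (0, 1)
        )

    def forward(self, f_global: torch.Tensor, f_local: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([f_global, f_local], dim=1))
        # convex combination: g -> 1 trusts the global branch, g -> 0 the local one
        return g * f_global + (1.0 - g) * f_local
```

Because the output is a convex combination at every position, the fused feature cannot drift outside the range spanned by the two branches, which is one plausible reason gated fusion avoids the representation conflicts observed with naive weighted addition.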
3.6. Comparative Experiment
To validate the efficiency of PRTNet, we selected a diverse set of models for comparison, including convolution-based architectures such as DenseNet [25], EfficientNet [26], ConvNeXt V2 [27], and RegNet [28], as well as the Transformer-based models ViT [29], PVT [30], and FastViT [31]. Comparing different families of architectures allows us to comprehensively evaluate the performance of the proposed model. All models were trained and tested on the same dataset using identical hyperparameter settings and training strategies to ensure a fair comparison. The results of these models on the test dataset are shown in Table 6.
The CNN architectures overall outperformed the Transformer architectures on the test dataset. Among them, DenseNet achieved the best performance, with all metrics exceeding 93%; RegNet ranked second, with accuracy and F1 scores close to 92%; in contrast, ConvNeXt V2 performed slightly worse overall, with all metrics below 90%. The Transformer models showed relatively weaker performance: the best among them, PVT, achieved an F1 score of 88.57%, slightly lower than the best CNN model. The performance differences among ViT, FastViT, and PVT indicate that a pyramid structure can partially alleviate the limitations of plain ViT. The Transformers’ constrained performance is likely attributable to their reliance on large-scale pretraining data, while the flame dataset in this study is limited in size, making it difficult to fully optimize the feature extraction capability of the global attention mechanism.
PRTNet surpasses all comparison models with a significant advantage, improving accuracy and F1 score by 2.86% and 2.83%, respectively, compared with the second-best model, DenseNet. As shown in Figure 16, PRTNet independently forms a high-density cluster in the three-dimensional performance space, highlighting the effectiveness of our proposed model.
Figure 17 illustrates the accuracy trends of different models during training and validation. The left panel shows the accuracy of each model on the training set, while the right panel displays the accuracy on the validation set. Although the validation accuracies of the models fluctuate, the overall trend is upward. Among them, PRTNet consistently achieves the highest validation accuracy with the least fluctuation, demonstrating stronger generalization ability and robustness.
The visualization of the test results of each model based on the confusion matrix is shown in Figure 18. Comparative analysis indicates that the method proposed in this paper demonstrates superior balance and effectiveness in distinguishing MSWI combustion states compared with the other classification networks.
4. Conclusions and Discussion
This paper focuses on the problem of intelligent flame state recognition in municipal solid waste incineration (MSWI) and proposes and systematically validates the PRTNet hybrid architecture model. The model first embeds local-semantic enhanced attention in the ResNet backbone, cascading ELA with SGE to achieve a two-stage flame feature refinement of ‘global localization-local purification’. Next, it designs a feature-adaptive fusion Transformer to simultaneously model long-range dependencies and local high-frequency details of the flames, using a lightweight gating mechanism to achieve adaptive fusion of semantic and texture information. Finally, it utilizes the Cross-Scale Feature Guided Aggregation (CFGA) module to efficiently merge shallow high-resolution details with deep high-semantic information under channel-spatial attention guidance, generating a unified feature representation that is both discriminative and robust. Experimental results on the MSWI dataset show that PRTNet achieves leading performance in the four-class combustion state recognition task (Acc: 96.29%, Pre: 96.30%, Rec: 96.25%, F1: 96.27%), significantly outperforming several state-of-the-art models, including DenseNet, EfficientNet, ConvNeXt V2, RegNet, ViT, PVT, and FastViT (with the F1 score improved by up to 2.83%). Ablation studies systematically verified the effectiveness of the LSEA, FAFT, and CFGA modules individually and the significant gains from their synergy (overall performance improvement of over 4%), particularly highlighting the crucial role of the adaptive gating mechanism in FAFT in resolving conflicts between global and local features.
Although the PRTNet model demonstrates excellent recognition performance in experiments, it still faces certain challenges in practical industrial deployment. First, the model is relatively sensitive to input image quality. In real operating environments, cameras may be obstructed by smoke, contaminated by lens dirt, or affected by strong glare, leading to degraded image quality and consequently impacting recognition stability. Second, the model’s computational complexity is relatively high, primarily due to compute-intensive operations such as deformable convolutions and multi-head attention mechanisms, which may make it difficult to meet real-time inference requirements on resource-constrained edge devices. Additionally, the model is trained on the current dataset; if deployed directly in incineration plants with different furnace types, waste compositions, or camera installation positions, performance may decline due to differences in data distribution.
Future research could focus on the following aspects: first, reducing the computational burden through model lightweighting techniques to enhance real-time capability; second, introducing domain adaptation or incremental learning mechanisms to strengthen the model’s adaptability across different scenarios and operating conditions; third, integrating multimodal information, such as infrared temperature and flue gas composition data from other sensors, to build a more robust state recognition system; fourth, deeply integrating the recognition module with the combustion control system to achieve closed-loop intelligent regulation from state perception to parameter optimization, thereby advancing the MSWI process toward comprehensive intelligent development. Overall, the method proposed in this study provides an effective solution for recognizing complex combustion states. Its further optimization and practical application will offer strong support for energy conservation, emission reduction, and safe operation in municipal solid waste incineration processes.