1. Introduction
With the rapid growth of urbanization in China, the total production of municipal solid waste (MSW) continues to rise [1], leading to the phenomenon of ‘cities surrounded by garbage’ in many areas [2]. Exploring effective ways to reduce, recycle, and harmlessly treat MSW has therefore become a major issue that cannot be ignored. Among current treatment methods, municipal solid waste incineration (MSWI) has emerged as the mainstream disposal method, compared with landfilling and composting, owing to its significant volume reduction and high treatment efficiency [3,4]. The core of this technology lies in converting MSW into ash, flue gas, and recoverable thermal energy through high-temperature combustion, which not only effectively reduces secondary pollution but also enables the repeated utilization of waste resources [5,6]. Recycling transforms waste into resources, incineration generates electricity, and residues can be made into building materials, alleviating landfill pressure, promoting the green and low-carbon transformation of waste management, and supporting sustainable development [7].
While MSWI offers significant advantages in volume reduction and energy recovery, its environmental benefits depend strongly on the stability of the combustion state. Under normal combustion conditions, pollutant generation is effectively suppressed. However, when combustion enters an abnormal state, characterized by disordered flame morphology and an imbalanced temperature field within the furnace, incomplete combustion occurs. This leads to a substantial increase in the emission concentrations of harmful substances such as carbon monoxide (CO) and dioxins [8]. These pollutants not only pose direct risks to human health but also enter the soil and water bodies through atmospheric deposition, causing persistent environmental pollution and disrupting ecosystem balance. Therefore, accurate identification and real-time control of different combustion states is a crucial step toward suppressing pollutant generation at the source and ensuring the environmental cleanliness of the incineration process. Unstable combustion not only manifests as flame oscillation and reduced efficiency but also directly causes rapid ash deposition on the heat transfer surfaces. This ‘insulating layer’ severely hinders heat exchange, and contact between high-temperature ash and metallic materials accelerates chemical corrosion, thereby shortening equipment lifespan [9]. If operators fail to make timely and precise adjustments to such fluctuations, the risks can escalate dramatically; in the worst case, delayed adjustments can trigger complete thermal runaway, damaging equipment and causing severe safety accidents and economic losses. In modern municipal solid waste incineration power plants, maintaining a stable combustion process is therefore the lifeline that determines whether the entire system can operate efficiently and economically [10].
However, many current MSWI plants still rely largely on experienced operators for combustion control. Operators must constantly monitor screens or furnace windows, visually observing the flame’s color, shape, and flickering frequency to judge the combustion conditions, and then manually adjust key parameters such as fuel supply, air intake, and the ratio of primary to secondary air. This approach presents several limitations. First, human judgment is inherently subjective and varies significantly across operators due to differences in experience and attention, leading to inconsistent control outcomes. Second, the limited reaction speed and endurance of human operators hinder their ability to respond to rapid combustion fluctuations. Most critically, the reliance on manual observation precludes intelligent, precise, and real-time optimized control. Therefore, the valuable hands-on experience accumulated by senior engineers and operators must be transformed, through data analysis, machine learning, and other advanced technologies, into an intelligent knowledge base that machines can understand and execute. In this way, the system can autonomously recognize subtle changes in combustion conditions in real time and predict potential risks. Achieving this goal not only significantly reduces the total emissions of flue gas pollutants but also promotes energy recovery and resource recycling through more efficient combustion, enabling MSWI to truly become a green industry for sustainable development.
To overcome the over-reliance on experienced experts and the strong subjectivity and variability in recognizing combustion states in traditional MSWI processes, artificial intelligence technologies have been increasingly applied in this field in recent years [11]. Cao et al. [12] proposed a DQN-PL model that integrates GA-SA multi-threshold segmentation and deep reinforcement learning for flame state recognition. It extracts shape-statistical features, performs feature selection and dimensionality reduction, and employs a pseudo-label-enhanced classification strategy, achieving high accuracy across five combustion states. Guo et al. [13] proposed an image-recognition-based method for identifying the burning state of a waste incinerator, which enables rapid judgment of the in-furnace burning state and feeds the result into the automatic control system, improving the intelligent control level of the incinerator. Yu et al. [14] combined neural networks with infrared thermal imaging to detect equipment malfunctions and used Support Vector Machines to classify flame images, achieving high flame-classification accuracy. Yang et al. [15] proposed a feature extraction method based on the YOLOv5 algorithm and realized recognition of combustion states in the head layer during the MSWI process. Omiotek and Kotyra [16] introduced a method for processing and classifying flame images based on the pre-trained VGG16 model and showed that it could efficiently recognize poor combustion states. Zhang et al. [17] combined three feature enhancement strategies, a multi-scale attention module, a deformable multi-head attention module, and a contextual feature fusion module, to effectively integrate local and global features of flame combustion, improving model performance and robustness and achieving accurate recognition of MSWI flame states (see Figure 1).
In constructing deep learning models to recognize the flame state of MSWI, most approaches rely on Convolutional Neural Networks (CNNs). However, traditional CNNs often exhibit inherent shortcomings in feature extraction, such as insufficient utilization of spatial information, chaotic feature distribution, and limited semantic expression capability. Addressing these issues therefore opens the way to more efficient network architectures. Xu et al. [18] proposed a novel attention mechanism called Efficient Local Attention (ELA) to tackle the problems of channel dimensionality reduction and model complexity that arise when traditional CNNs exploit spatial information. By using one-dimensional convolution and group normalization, ELA efficiently encodes spatial positional information while preserving channel dimensions and is applicable to various CNN architectures. Ouyang et al. [19] proposed the Efficient Multi-scale Attention (EMA) module to address the insufficiencies of CNNs in processing multi-scale features and the limited richness of feature representation caused by insufficient inter-channel information interaction. The module uses two parallel branches to encode global information for recalibrating channel weights and aggregates output features through cross-dimensional interaction to capture pixel-level pairwise relationships. Vong et al. [20] introduced a Spatial Pyramid Pooling (SPP) layer in CNNs to remove the traditional requirement for fixed input image sizes. The SPP layer divides the input into a fixed number of blocks and takes the maximum value in each block, providing a fixed-size output for subsequent fully connected layers; this enables CNNs to handle input images of different sizes while preserving original image details, improving prediction accuracy. Zhang et al. [21] proposed Attention-Guided Repair for Robustness (AR2) to enhance the robustness of CNNs against common image disturbances. AR2 aligns the Class Activation Maps (CAMs) of clean and contaminated images and adopts an iterative repair strategy that alternates CAM-guided refinement and fine-tuning, thereby enhancing attention consistency under input perturbations; experiments show that AR2 significantly improves robustness across multiple benchmarks while maintaining high accuracy on clean data. Li et al. [22] proposed the Spatial Group-wise Enhance (SGE) module, which generates attention factors for spatial positions within each semantic group, adjusts the importance of sub-features, enhances the feature representation of key regions, and suppresses noise, effectively improving the capability of CNNs in semantic feature learning.
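To make the ELA mechanism concrete, the following is a minimal, hypothetical PyTorch sketch of ELA-style attention as described in [18]: the feature map is pooled along each spatial axis, each pooled sequence is encoded with a 1D convolution plus group normalization, and the resulting directional attention maps reweight the input. The kernel size and group count here are illustrative assumptions, not the authors’ exact configuration.

```python
import torch
import torch.nn as nn

class ELASketch(nn.Module):
    """Hypothetical sketch of Efficient Local Attention (ELA)-style gating.

    Pools features along each spatial axis, encodes positional information
    with a depthwise 1D convolution + group normalization, and reweights
    the input without reducing the channel dimension.
    """
    def __init__(self, channels: int, kernel_size: int = 7, groups: int = 16):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=kernel_size // 2, groups=channels, bias=False)
        self.gn = nn.GroupNorm(groups, channels)
        self.act = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        x_h = x.mean(dim=3)   # (b, c, h): pooled along width
        x_w = x.mean(dim=2)   # (b, c, w): pooled along height
        a_h = self.act(self.gn(self.conv(x_h))).view(b, c, h, 1)
        a_w = self.act(self.gn(self.conv(x_w))).view(b, c, 1, w)
        return x * a_h * a_w  # channel dimension preserved throughout
```

Because the attention maps keep the full channel width, this design avoids the channel reduction used in earlier coordinate-attention variants.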
The above research indicates that deep learning-driven artificial intelligence technology, with its excellent feature extraction and representation capabilities, has demonstrated outstanding performance in MSWI combustion condition identification and related fields. By monitoring flame states in real time and guiding technicians in dynamically optimizing incineration parameters, artificial intelligence systems can significantly enhance combustion stability and efficiency, reducing energy consumption and the generation of incomplete combustion products, thereby elevating the level of intelligence in process control and strengthening the system’s ability to handle complex operating conditions. However, MSWI flame images are complex, often exhibiting intricate shapes, high noise, significant individual differences, and blurred boundaries; combined with the inherent limitations of CNN-based deep neural networks, existing models cannot fully and effectively extract flame features for recognition. To address these limitations, this paper proposes the PRTNet model to achieve precise capture and identification of flame combustion features. The main contributions of this paper are as follows:
(1) We design a novel hybrid architecture, PRTNet, which effectively combines the advantages of CNNs and Transformers and efficiently aggregates multi-scale feature information, achieving efficient recognition of MSWI flame combustion states.
(2) We combine ELA with SGE to form a Local-Semantic Enhanced Attention (LSEA) module and embed it into the ResNet backbone, establishing multi-scale spatial correlations between fine-grained textures and combustion patterns and significantly improving the residual network’s recognition accuracy on flame regions.
(3) We propose a feature-adaptive fusion Transformer (FAFT) with a global-local adaptive fusion (GLAF) module at its core, which attends to the overall spatial distribution of the flame while preserving key details such as edges and bright spots, enhancing the integrity and discriminative power of flame features.
(4) We design a Cross-Scale Feature Guided Aggregation (CFGA) module to efficiently fuse shallow high-resolution spatial details, mid-level transitional features, and deep high-semantic information, strengthening multi-scale flame feature integration and significantly improving feature extraction and recognition performance in complex combustion scenarios.
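As an illustration of contribution (4), cross-scale aggregation of the kind CFGA performs can be sketched roughly as follows: project shallow, mid, and deep feature maps to a common channel width, resize them to the shallow resolution, and blend them with learned per-scale spatial weights. This is a generic sketch under our own assumptions, not the authors’ CFGA implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossScaleFusionSketch(nn.Module):
    """Generic cross-scale aggregation sketch (not the exact CFGA design)."""
    def __init__(self, c_shallow: int, c_mid: int, c_deep: int, c_out: int):
        super().__init__()
        # 1x1 projections bring every scale to a common channel width
        self.proj = nn.ModuleList(
            nn.Conv2d(c, c_out, kernel_size=1) for c in (c_shallow, c_mid, c_deep)
        )
        # learned per-scale spatial weights (softmax over the three scales)
        self.weight = nn.Conv2d(3 * c_out, 3, kernel_size=1)

    def forward(self, x_shallow, x_mid, x_deep):
        h, w = x_shallow.shape[-2:]
        feats = []
        for proj, x in zip(self.proj, (x_shallow, x_mid, x_deep)):
            f = proj(x)
            if f.shape[-2:] != (h, w):
                # upsample coarser scales to the shallow (highest) resolution
                f = F.interpolate(f, size=(h, w), mode="bilinear", align_corners=False)
            feats.append(f)
        w_scale = torch.softmax(self.weight(torch.cat(feats, dim=1)), dim=1)
        # spatially varying weighted sum of the three aligned scales
        return sum(w_scale[:, i:i + 1] * feats[i] for i in range(3))
```

The softmax gating lets each spatial location draw predominantly from whichever scale is most informative there, which is the general intuition behind guided multi-scale aggregation.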
3. Experimental Results and Analysis
To evaluate the effectiveness of the algorithm, this section first establishes the evaluation metrics and provides a detailed description of the flame image dataset used in the study. Validation is then conducted through a phased experimental design: the overall performance of the algorithm is tested first, followed by ablation studies and comparative experimental analyses. The quantitative results consistently indicate that the proposed method offers competitive advantages in both accuracy and potential for engineering application.
3.1. Evaluation Metrics
To quantitatively evaluate the performance of classification models, this study selects four core metrics: Accuracy, Precision, Recall, and F1-score. Accuracy reflects the proportion of correctly classified samples overall; Precision measures the proportion of predicted positive samples that are actually positive; Recall represents the proportion of actual positive samples that are correctly identified; the F1-score is the harmonic mean of Precision and Recall, providing a comprehensive and balanced evaluation. Quantitative analysis of experimental results based on these metrics can effectively reveal the performance characteristics and limitations of the model. The calculation formulas for the above evaluation metrics are as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1-score = 2 × Precision × Recall / (Precision + Recall)

where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.
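As a concrete check, the four metrics can be computed directly from label/prediction pairs. The sketch below uses macro averaging over classes, which is one common convention for multi-class tasks; the averaging scheme is our assumption, since the text does not state it.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged Precision, Recall, and F1-score."""
    classes = sorted(set(y_true))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

    precisions, recalls, f1s = [], [], []
    for c in classes:
        # one-vs-rest counts for class c
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)

    n = len(classes)
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n
```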
3.2. Experimental Setup
We built our PRTNet model using PyTorch v2.5.1 and trained it on a single NVIDIA GeForce RTX 3090 GPU. The model input consists of flame images of size 576 × 576 × 3. The Adam optimizer was used with an initial learning rate of 1 × 10⁻⁴ and a weight decay of 5 × 10⁻⁵, with a cosine annealing schedule applied to the learning rate. The network was trained for 100 epochs with a batch size of 32, and the best-performing model was selected for testing.
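The optimization setup described above can be reproduced approximately as follows. The placeholder model, the scheduler’s `T_max`, and per-epoch stepping are our assumptions; the text states only Adam, the learning rate, the weight decay, and cosine annealing.

```python
import torch
import torch.nn as nn

# placeholder model standing in for PRTNet (4 combustion-state classes)
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 576 * 576, 4))

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=5e-5)
# cosine annealing of the learning rate over the 100 training epochs
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

# training loop skeleton: optimizer.step() per batch, scheduler.step() per epoch
```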
3.3. Flame Burning Images Dataset
This experiment uses the flame combustion image dataset created by Pan et al. [12], with data sourced from a municipal solid waste incineration (MSWI) plant in Beijing. The plant monitors the combustion state of waste on the left and right furnace grates in real time via high-temperature endoscopes installed on both sides of the back wall of the incinerator furnace. The video signals are transmitted over coaxial cables and stored on an industrial control computer’s video capture card. The dataset classifies combustion states into four categories: normal combustion, partial combustion, flashover, and smoldering. The dataset construction process includes the following key steps: (1) removing video segments that cannot clearly reflect the combustion state; (2) having plant operation experts screen stable combustion frames under typical operating conditions based on classification standards (typical state illustrations are shown in Figure 1) and label the states; (3) standardizing the classified videos using a timed sampling algorithm developed on the MATLAB 7.0 platform, extracting key frames at a fixed interval of one frame per minute, since the combustion state changes only slowly over short periods. Quantitative statistics show that a total of 3289 and 2685 valid image samples were obtained for the left and right furnace grates, respectively, covering the multidimensional flame features of the different combustion states. Given that the flame images of the left and right furnace grates exhibit symmetrical distribution characteristics, this study merges the images of both grates into a unified dataset to enhance the model’s generalization ability. The number of data samples corresponding to each typical combustion state is shown in Table 1.
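The fixed-interval key-frame extraction in step (3) amounts to converting a time interval into a frame stride. A minimal sketch of that step is shown below; the function name and the example frame rate are our own (the original used a MATLAB implementation whose details are not given).

```python
def sample_frame_indices(total_frames: int, fps: float, interval_seconds: float = 60.0):
    """Indices of the frames retained when sampling once per `interval_seconds`."""
    step = max(1, int(round(fps * interval_seconds)))
    return list(range(0, total_frames, step))

# e.g. a 25 fps recording sampled once per minute keeps one frame every 1500 frames
```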
3.4. Model Experiment Results
Figure 9 shows the changes in the model’s loss and accuracy during training. The left panel displays the trends of training loss and validation loss over the number of training epochs. The horizontal axis represents the number of training epochs, and the vertical axis represents the loss value. Both curves show a downward trend: the training loss decreases rapidly in the early stages and gradually approaches zero later, indicating that the model fits the training set well. Although the validation loss fluctuates to some extent after initially decreasing, the overall trend also stabilizes at a low level. This indicates that the model demonstrates strong generalization ability on the validation set, and its pattern of change closely matches the training loss curve.
The right panel depicts the evolution of training accuracy and validation accuracy over the course of training epochs. Both accuracy curves show an upward trend: the training accuracy increases significantly during training and eventually stabilizes at a high level close to 1.0, further confirming the model’s excellent fit to the training data. The validation accuracy curve fluctuates somewhat in the early stages but generally maintains an upward trend and stabilizes later on, with the final value showing only a small gap compared to the training accuracy. This indicates that the model’s performance on the validation set is comparable to that on the training set, demonstrating good generalization ability and showing no obvious signs of overfitting.
The synchronized optimization trends of training and validation metrics jointly confirm the effectiveness of the training process and ensure the reliability of the model’s performance evaluation on the validation set. To more intuitively demonstrate the performance of the PRTNet model proposed in this study, we present three-dimensional confusion matrices, as shown in Figure 10, summarizing the classification results obtained by the model on the test set.
Table 2 details the quantitative performance metrics of the PRTNet model on the four-class MSWI flame state recognition task. The overall performance of the model is excellent, with average accuracy, precision, recall, and F1 score reaching 96.29%, 96.30%, 96.25%, and 96.27%, respectively, fully demonstrating the model’s high accuracy and reliability in the MSWI flame state classification task. Notably, the model performs particularly well in recognizing the partial combustion state, with all four metrics exceeding 97%. The model also achieves high recognition accuracy for the normal combustion, flashover, and smoldering states, with accuracy and F1 scores above 95%.
Figure 11 presents the detailed classification performance data from Table 2 through intuitive visual charts, clearly illustrating the differences in the model’s ability to distinguish various combustion states and its overall balance. The data distribution in the figure clearly indicates that the model maintains similarly high levels across all evaluation metrics for each category, with no significant weaknesses observed. This balanced and excellent performance distribution strongly validates that the PRTNet model possesses outstanding robustness and generalization capability when handling complex and variable MSWI flame images.
3.5. Ablation Experiments
To further investigate the effectiveness of each module, we conducted ablation experiments both on each module’s contribution to overall network performance and on the main components within the modules.
Table 3 presents the detailed results of the module ablation experiments, where different modules were gradually added to the traditional ResNet architecture to verify their contributions. Network2, which independently introduces the LSEA module, achieved significant improvements in accuracy, precision, recall, and F1 score over the baseline, with increases of 1.57%, 1.59%, 1.64%, and 1.61%, respectively. This confirms that the LSEA module effectively enhances the model’s ability to capture flame features by improving the local detail and semantic representation of flame textures. Similarly, Network3 and Network4, which apply the FAFT and CFGA modules, respectively, also brought considerable performance gains: Network3’s accuracy and F1 score improved by 1.68% and 1.72% over the baseline, while Network4’s increased by 1.35% and 1.48%. This indicates that the feature-adaptive fusion mechanism of FAFT strengthens the representation of flame spatial distribution and key details, whereas CFGA effectively enhances the model’s recognition ability in complex combustion scenarios by aggregating multi-scale features.
Further analysis of module combinations revealed that the synergy between any two modules yields additional performance improvements. Network5 achieved an accuracy of 95.73% and an F1 score of 95.60%; Network6 and Network7 also outperformed the single-module variants, with accuracies rising to 95.26% and 95.18% and F1 scores increasing to 95.25%. This demonstrates that the modules are functionally complementary, promote one another, and jointly optimize the model’s feature extraction and recognition performance.
Overall, the Network8 model, which integrates all three modules, achieved the best values in all evaluation metrics, with the four metrics improving to 96.29%, 96.30%, 96.25%, and 96.27%, respectively. This result strongly validates that the proposed combination of modules can fully leverage their respective advantages and work collaboratively, thereby achieving optimal model performance in the flame combustion state recognition task. To more intuitively demonstrate the performance contribution of each module, we selected the results of accuracy and F1 score for visualization, as shown in Figure 12.
Figure 13 illustrates the accuracy trends during training and validation for different network configurations. The left panel shows that the PRTNet model exhibits remarkable rapid-convergence behavior during training: its training accuracy rises sharply in the early epochs and stabilizes quickly, demonstrating a strong fitting capacity. In the right panel, the validation curve of PRTNet displays only minor fluctuations, further underscoring the robustness of its design. Overall, PRTNet delivers a combined advantage of fast convergence, high accuracy, and low volatility in both training and validation, significantly outperforming the alternative configurations.
In addition, we conducted ablation experiments on the main components of the LSEA and FAFT modules to explore their effectiveness. The LSEA module was decomposed into its ELA and SGE components, each embedded into ResNet separately, to verify the coordinated and complementary effects of the two.
Table 4 shows the results of the ablation experiments on the LSEA module, and Figure 14 provides a visual comparison of accuracy and F1 scores.
The experimental results clearly indicate that introducing either the ELA or SGE module individually improves both accuracy and F1 score to varying degrees. Moreover, integrating ELA and SGE into the LSEA module yields a significant synergistic enhancement: accuracy increases by 1.57 percentage points to 94.43%, and the F1 score likewise rises to 94.43%. This 1.57-percentage-point gain exceeds the improvements obtained from ELA and SGE individually, strongly confirming the functional complementarity between local texture enhancement and the semantic guidance mechanism. The LSEA module shows a consistent upward trend across all evaluation metrics, validating the design concept that it effectively enhances the discriminability of flame features by optimizing the feature representation space.
For the ablation study of the FAFT module, we focused on its core component, the three branches of GLAF, investigating the complementarity between its global and local branches as well as the necessity of the adaptive gating mechanism. The results of the ablation experiments and the visualization of the two main metrics are shown in Table 5 and Figure 15.
The experimental results systematically reveal the synergistic mechanism of the three branches in the FAFT module. When only the global branch is enabled, the model achieves an accuracy of 95.52% and an F1 score of 95.51%, verifying its effectiveness in modeling long-range spatial dependencies in flames; when only the local branch is enabled, both accuracy and F1 score reach 95.24%, highlighting its advantage in enhancing fine-grained features such as flame edge textures. However, when both branches are activated without a gating mechanism, the accuracy unexpectedly drops to 94.98%, indicating that directly weighting and fusing the smooth semantic features of the global branch with the sharp texture features of the local branch causes representation conflicts. After introducing adaptive gating, the model’s performance jumps to the optimal level. This confirms that the gating network dynamically generates spatially sensitive weights, preserving the global branch’s robustness to flame deformation while retaining the local branch’s sensitivity to flame-core microstructures, thus resolving feature conflicts with minimal additional parameters and achieving intelligent fusion of complementary ‘semantic-texture’ representations.
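The adaptive gating validated in this ablation can be sketched as a learned, spatially varying convex combination of the two branch outputs. The following is our own minimal illustration of this idea, not the exact GLAF gating network:

```python
import torch
import torch.nn as nn

class GatedFusionSketch(nn.Module):
    """Spatially varying gate blending global- and local-branch features."""
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),  # per-position, per-channel weights in (0, 1)
        )

    def forward(self, f_global: torch.Tensor, f_local: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([f_global, f_local], dim=1))
        # convex combination: g -> 1 trusts the global branch, g -> 0 the local one
        return g * f_global + (1.0 - g) * f_local
```

Because the output is a convex combination at every position, the fused feature cannot drift outside the range spanned by the two branches, which is one plausible reason gated fusion avoids the representation conflicts observed with naive weighted addition.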
3.6. Comparative Experiment
To validate the efficiency of PRTNet, we selected a diverse set of models for comparison, including convolution-based architectures such as DenseNet [25], EfficientNet [26], ConvNeXt V2 [27], and RegNet [28], as well as the Transformer-based models ViT [29], PVT [30], and FastViT [31]. Comparing different families of architectures allows us to comprehensively evaluate the performance of the proposed model. All models were trained and tested on the same dataset using identical hyperparameter settings and training strategies to ensure a fair comparison. The results of these models on the test dataset are shown in Table 6.
The CNN architectures overall outperformed the Transformer architectures on the test dataset. Among them, DenseNet achieved the best performance, with all metrics exceeding 93%; RegNet ranked second, with accuracy and F1 scores close to 92%; in contrast, ConvNeXt V2 performed slightly worse overall, with all metrics below 90%. The Transformer models showed relatively weaker performance: the best among them, PVT, achieved an F1 score of 88.57%, slightly lower than the best CNN model. The performance differences among ViT, FastViT, and PVT indicate that a pyramid structure can partially alleviate the limitations of plain ViT. The Transformers’ constrained performance is likely attributable to their reliance on large-scale pretraining data, while the flame dataset in this study is limited in size, making it difficult to fully optimize the feature extraction capability of the global attention mechanism.
PRTNet surpasses all comparison models with a significant advantage, improving accuracy and F1 score by 2.86% and 2.83%, respectively, compared with the second-best model, DenseNet. As shown in Figure 16, PRTNet independently forms a high-density cluster in the three-dimensional performance space, highlighting the effectiveness of our proposed model.
Figure 17 illustrates the accuracy trends of different models during training and validation. The left panel shows the accuracy of each model on the training set, while the right panel displays the accuracy on the validation set. Although the validation accuracies of the models fluctuate, the overall trend is upward. Among them, PRTNet consistently achieves the highest validation accuracy with the least fluctuation, demonstrating stronger generalization ability and robustness.
The visualization of the test results of each model based on the confusion matrix is shown in Figure 18. Comparative analysis indicates that the method proposed in this paper demonstrates superior balance and effectiveness in distinguishing MSWI combustion states compared with the other classification networks.
4. Conclusions and Discussion
This paper focuses on the problem of intelligent flame state recognition in municipal solid waste incineration (MSWI) and proposes and systematically validates the PRTNet hybrid architecture model. The model first embeds local-semantic enhanced attention in the ResNet backbone, cascading ELA with SGE to achieve a two-stage flame feature refinement of ‘global localization-local purification’. Next, it designs a feature-adaptive fusion Transformer to simultaneously model long-range dependencies and local high-frequency details of the flames, using a lightweight gating mechanism to achieve adaptive fusion of semantic and texture information. Finally, it utilizes the Cross-Scale Feature Guided Aggregation (CFGA) module to efficiently merge shallow high-resolution details with deep high-semantic information under channel-spatial attention guidance, generating a unified feature representation that is both discriminative and robust. Experimental results on the MSWI dataset show that PRTNet achieves leading performance in the four-class combustion state recognition task (Acc: 96.29%, Pre: 96.30%, Rec: 96.25%, F1: 96.27%), significantly outperforming several state-of-the-art models, including DenseNet, EfficientNet, ConvNeXt V2, RegNet, ViT, PVT, and FastViT (with the F1 score improved by up to 2.83%). Ablation studies systematically verified the effectiveness of the LSEA, FAFT, and CFGA modules individually and the significant gains from their synergy (overall performance improvement of over 4%), particularly highlighting the crucial role of the adaptive gating mechanism in FAFT in resolving conflicts between global and local features.
Although the PRTNet model demonstrates excellent recognition performance in experiments, it still faces certain challenges in practical industrial deployment. First, the model is relatively sensitive to input image quality. In real operating environments, cameras may be obstructed by smoke, contaminated by lens dirt, or affected by strong glare, leading to degraded image quality and consequently impacting recognition stability. Second, the model’s computational complexity is relatively high, primarily due to compute-intensive operations such as deformable convolutions and multi-head attention mechanisms, which may make it difficult to meet real-time inference requirements on resource-constrained edge devices. Additionally, the model is trained on the current dataset; if deployed directly in incineration plants with different furnace types, waste compositions, or camera installation positions, performance may decline due to differences in data distribution.
Future research could focus on the following aspects: first, reducing the computational burden through model lightweighting techniques to enhance real-time capability; second, introducing domain adaptation or incremental learning mechanisms to strengthen the model’s adaptability across different scenarios and operating conditions; third, integrating multimodal information, such as infrared temperature and flue gas composition data from other sensors, to build a more robust state recognition system; fourth, deeply integrating the recognition module with the combustion control system to achieve closed-loop intelligent regulation from state perception to parameter optimization, thereby advancing the MSWI process toward comprehensive intelligent development. Overall, the method proposed in this study provides an effective solution for recognizing complex combustion states. Its further optimization and practical application will offer strong support for energy conservation, emission reduction, and safe operation in municipal solid waste incineration processes.