Article

Optimizing Backbone Networks Through Hybrid–Modal Fusion: A New Strategy for Waste Classification

1 College of Mathematics and Computer Science, Zhejiang A & F University, Hangzhou 311300, China
2 Zhejiang Provincial Key Laboratory of Forestry Intelligent Monitoring and Information Technology, Hangzhou 311300, China
3 College of Information Science and Technology, Zhejiang Shuren University, Hangzhou 311300, China
4 College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou 310027, China
5 State Key Laboratory of CAD & CG, Hangzhou 310027, China
* Authors to whom correspondence should be addressed.
These authors contributed equally to this work and should be considered co-first authors.
Sensors 2025, 25(10), 3241; https://doi.org/10.3390/s25103241
Submission received: 9 April 2025 / Revised: 12 May 2025 / Accepted: 19 May 2025 / Published: 21 May 2025

Highlights

What are the main findings?
  • A novel hybrid–modal fusion model, HFWC-Net (Hybrid–Modal Fusion Waste Classification Network), is proposed for efficient waste classification.
What is the implication of the main finding?
  • HFWC-Net substantially reduces computational cost and training time while maintaining high performance.
  • HFWC-Net can improve the accuracy of automated waste sorting and thereby help protect the natural environment.

Abstract

With rapid urbanization, effective waste classification is a critical challenge. Traditional manual methods are time-consuming, labor-intensive, costly, and error-prone, resulting in reduced accuracy. Deep learning has revolutionized this field. Convolutional neural networks such as VGG and ResNet have dramatically improved automated sorting efficiency, and Transformer architectures like the Swin Transformer have further enhanced performance and adaptability in complex sorting scenarios. However, these approaches still struggle in complex environments and with diverse waste types, often suffering from limited recognition accuracy, poor generalization, or prohibitive computational demands. To overcome these challenges, we propose an efficient hybrid–modal fusion method, the Hybrid–Modal Fusion Waste Classification Network (HFWC-Net), for precise waste image classification. HFWC-Net leverages a Transformer-based hierarchical architecture that integrates CNNs and Transformers, enhancing feature capture and fusion across varied image types for superior scalability and flexibility. By incorporating advanced techniques such as the Agent Attention mechanism and the LionBatch optimization strategy, HFWC-Net not only improves classification accuracy but also significantly reduces classification time. Comparative experimental results on the public datasets Garbage Classification and TrashNet and on our self-built MixTrash dataset demonstrate that HFWC-Net achieves Top-1 accuracy rates of 98.89%, 96.43%, and 94.35%, respectively. These findings indicate that HFWC-Net attains the highest accuracy among the compared methods, offering significant advantages in accelerating classification efficiency and supporting automated waste management applications.

1. Introduction

With the acceleration of urbanization and rapid population growth, the improvement of residents’ living standards has led to a diversification of consumption structures and a sharp increase in domestic waste. It is expected that global solid waste generation will reach 2.2 billion tons per year by 2025 [1]. Effectively managing this growing amount of waste has become a significant challenge. While waste sorting and recycling are effective methods for dealing with urban waste and protecting the environment, the wide variety and shapes of waste require substantial human investment for accurate classification [2]. Additionally, residents’ insufficient awareness of classification and incomplete implementation of relevant policies result in unsatisfactory waste classification outcomes [3]. Accurate waste classification technology can effectively distinguish different types of waste, significantly improve the feasibility of harmless treatment, ensure hazardous waste is specially treated, and reduce threats to the environment. Automated waste processing technology, leveraging artificial intelligence and machine learning, further enhances sorting efficiency and reduces operating costs [4,5]. Therefore, developing effective automatic waste classification methods has significant academic and practical importance.
Manual waste classification poses significant challenges due to its reliance on human labor, which is inherently time-consuming, costly, and prone to errors [6]. Workers are required to visually inspect and categorize heterogeneous waste materials—including plastics, metals, and organics—often under suboptimal conditions. This process is not only inefficient but also subject to inconsistencies arising from human fatigue and subjectivity, leading to frequent misclassification. For example, non-recyclable plastics may be incorrectly sorted as recyclable, thereby compromising the effectiveness of downstream recycling operations and increasing processing costs. The variability in waste appearance—caused by contamination, deformation, or degradation—further complicates accurate sorting, reducing throughput and diminishing resource recovery rates. These limitations place a growing burden on waste management systems, especially in the context of escalating urban waste volumes. Moreover, improper sorting of hazardous materials, such as batteries, can result in significant environmental risks, including soil and water contamination. In light of these challenges, there is an urgent need for robust, automated classification solutions. Leveraging the capabilities of deep learning, such systems promise substantial improvements in accuracy, efficiency, scalability, and overall cost-effectiveness, making them a critical advancement for modern waste management [7].
Previous studies have applied deep learning to waste classification with promising results. In 2016, Yang et al. from Stanford University introduced the TrashNet dataset, comprising 2527 images, which has become a cornerstone for developing waste classification models [8]. Subsequent research has leveraged transfer learning and innovative architectures to enhance performance. For example, Zhang et al. applied DenseNet169 to their NWNU-TRASH dataset, outperforming other algorithms despite high computational costs [9]. Wu et al. improved the VGG architecture for better feature extraction, though it struggled with varying object scales and interpretability [10]. Similarly, Lin et al. developed MSWNet using ResNet50 for urban waste sorting, boosting efficiency via transfer learning but requiring substantial resources [11]. Hossen et al. proposed GCDN-Net, enhancing interpretability with activation mapping (Score-CAM), yet it demands extensive labeled data and processing power [12]. Despite these advancements, limitations such as limited sample diversity, insufficient accuracy, and high computational demands persist. To address these, Transformer models have emerged as a promising alternative. Hu et al.’s Swin Transformer improved accuracy and efficiency but faced challenges with imbalanced datasets and complex environments [13]. Alrayes et al. achieved 95.8% accuracy on TrashNet using an enhanced Transformer, though it too required significant resources [14]. For resource-constrained settings, lightweight models have been explored. Xia et al.’s YOLO-MTG offers robust multi-target detection under varying lighting, albeit with slower inference on some hardware [15]. Gupta et al.’s SmartBin, built on InceptionNet and Raspberry Pi, enables real-time classification but still demands considerable power [16]. While these methods have propelled waste classification forward, their practical deployment is often hindered by resource demands, scalability, and adaptability. Thus, selecting and tailoring models to specific application needs is critical for achieving optimal results. Existing studies have combined image classification technology and machine learning methods to improve waste classification based on image recognition, but they still have certain limitations. These limitations include (1) low diversity of waste image samples, far from actual samples; (2) insufficient accuracy of waste classification models to meet actual needs; (3) deficiencies in real-time response and deployment, hindering timely and effective waste classification; and (4) limited generalization ability, restricting applicability in different scenarios.
To overcome the limitations of existing automated classification systems, this study introduces an efficient hybrid–modal fusion approach named HFWC-Net. ‘Hybrid–modal fusion’ refers to the integration of multiple image features through a hybrid architecture that combines convolutional neural networks (CNNs) and Transformer models, enhancing the model’s ability to perceive complex waste types. Unlike traditional multi-scale fusion, which focuses on extracting features at different spatial scales, or multimodal fusion, which emphasizes integrating information from distinct data sources (e.g., images and text), hybrid–modal fusion achieves finer-grained feature integration at the architectural level. Specifically, HFWC-Net leverages CNNs for local feature extraction and Transformers for global feature perception, employing a layered design and an Agent Attention mechanism to effectively fuse diverse image characteristics. This hybrid structure not only improves adaptability to complex backgrounds and varied waste types but also enhances computational efficiency and training speed through the LionBatch optimization strategy. By incorporating this approach, HFWC-Net addresses key challenges in waste classification, offering a robust solution for modern waste management.
The main contributions of this study include (1) constructing a new dataset, MixTrash, containing 135 categories of waste, with highly diverse image data meeting real-world classification needs; (2) integrating a new attention mechanism, Agent Attention, into the backbone model to effectively manage computing resources and focus on relevant input data, improving model performance with high expressiveness and low computational complexity; (3) proposing a new optimization strategy, LionBatch, combining model pruning technology with a new optimizer to reduce computational requirements and improve operational efficiency, thereby reducing resource consumption while maintaining high accuracy.

2. Materials and Methods

2.1. Self-Built Dataset MixTrash

Currently, there are few public datasets available in the field of waste identification, with most related studies relying on the TrashNet dataset [8]. The TrashNet dataset is a small collection of recyclable waste images, including six categories: glass, paper, cardboard, plastic, metal, and general waste, comprising a total of 2527 photos. However, this dataset has several shortcomings: the sample size is too small; the distribution of different types of waste is uneven; the background of the images is too uniform, which does not reflect real-world conditions and hinders the generalization ability of trained models. To address these limitations, we constructed a new waste image dataset, MixTrash, using Internet searches and manual photography. The MixTrash dataset includes 135 different categories of waste, such as banana peels, old toothbrushes, cans, old clothes, wastepaper, leftovers, and waste batteries, totaling 52,324 images (Figure 1). The images in MixTrash feature varied backgrounds, and the number of images for different types of waste is balanced, ensuring high data diversity that better meets the needs of real-world scenarios.
The MixTrash dataset is a multi-category image dataset specifically designed for waste recognition tasks, comprising four primary waste categories—recyclables, kitchen waste, hazardous waste, and other waste—encompassing a total of 135 subclasses (Table 1). The dataset is developed based on a standardized waste classification system: recyclables include 74 subclasses such as metal tools (e.g., anvils, scissors), plastic products (e.g., shampoo bottles, plastic bottles), wooden items (e.g., chairs, wooden shovels), textiles (e.g., old clothes), and paper products (e.g., books, paper cups); kitchen waste comprises 21 subclasses covering easily perishable organic matter (e.g., fruit cores, peels) and processed food residues (e.g., sausages, biscuits); hazardous waste includes 15 subclasses such as toxic metal items (e.g., thermometers, button batteries) and pure electronic waste (e.g., power banks, circuit boards); and other waste features 26 subclasses like contaminated composite materials (e.g., wet wipes, old gloves). All images were captured in real-world settings (e.g., kitchens, streets) under varied lighting conditions and with diverse object appearances. Compared to existing datasets such as TrashNet, MixTrash provides notable enhancements in sample scale, inter-class balance, and environmental diversity, offering essential data support for the development of efficient and robust waste classification models.

2.2. HFWC-Net

The Transformer architecture, with its self-attention mechanism, can effectively capture global features by focusing on the relationships between various regions in an image. Traditional convolutional neural networks (CNNs) primarily process information through local receptive fields, which can limit their ability to capture broader contextual information, resulting in inferior performance in complex image analysis tasks. Additionally, the hierarchical structure of CNNs often requires more layers to expand the receptive field, leading to larger, harder-to-optimize models. The Transformer architecture addresses these shortcomings: it is not only better suited for complex image analysis tasks, but its design also naturally supports parallel computing. This makes Transformers more efficient in processing large-scale datasets and accelerating the training process. To further enhance performance, this study proposes a new network architecture, HFWC-Net. HFWC-Net is designed to combine the efficient self-attention mechanism of the CSWin Transformer with deep feature processing capabilities specifically tailored for complex images. This approach leverages the CSWin Transformer’s strengths in multi-scale and multi-dimensional information processing. HFWC-Net addresses the limitations of CNN models in image processing through fine-grained feature fusion and efficient optimization strategies, thereby improving the model’s overall performance and adaptability.
HFWC-Net’s hybrid–modal fusion framework effectively integrates the strengths of both CNNs and Transformers into a unified architecture. This fusion is achieved through a hybrid design that combines the local feature extraction capabilities of CNNs with the global context modeling power of Transformers. Specifically, the network first employs a convolutional token embedding layer to capture fine-grained local features from input waste images, such as edges and textures. This is followed by a sequence of CSWin blocks and HFWC blocks, which leverage cross-shaped Window Attention and Agent Attention mechanisms to model long-range dependencies and multi-scale contextual relationships. Merge Blocks play a crucial role in progressively reducing spatial dimensions while increasing channel depth, thereby enabling seamless integration of local and global features within the network’s four-stage hierarchical structure. This hybrid fusion strategy offers several advantages: by incorporating the global perceptual ability of Transformers, it overcomes the limited receptive field of CNNs; and through a coarse-to-fine integration scheme, it enriches feature diversity while reducing computational overhead. These characteristics make HFWC-Net a highly practical and efficient solution for intelligent waste classification tasks.
The overall architecture of HFWC-Net is shown in Figure 2. This neural network utilizes a four-layer pyramid structure to efficiently process and abstract image features. Each layer’s design goals and structures are carefully configured to optimize feature extraction and representation capabilities. The first three layers of the network form the basic feature extraction and processing modules, each comprising multiple CSWin Blocks. These blocks employ the self-attention mechanism of a cross-shaped window to process image data, which is particularly suited for capturing local features and their spatial correlations. In this four-layer pyramid structure, the spatial resolution of the image is gradually reduced through the fusion module, Merge Block, while the channel dimension is doubled at each upward layer. This design allows the network to progressively enhance the abstraction ability of features while maintaining a low computational cost. The Merge Block is primarily responsible for reducing the number of tokens and merging features from adjacent layers, enabling the model to capture more comprehensive and high-level features as the number of layers increases.
Through this pyramid setting, the first three layers are primarily responsible for feature extraction, ranging from basic to relatively complex. The CSWin Block in each layer refines and strengthens the features at different scales, preparing them for advanced feature integration. The final layer utilizes the HFWC Block, which replaces the traditional cross-shaped window self-attention mechanism with the Agent Attention mechanism. This effectively reduces the number of features involved in the calculation. The introduction of agent tokens allows the attention calculation to focus on significant features, avoiding redundant computations of the full-size feature matrix. This achieves linear complexity and significantly improves the model’s computational efficiency. By optimizing the processing flow and reducing the computational burden, the HFWC Block not only enhances the model’s capability to handle complex data but also improves its ability to capture details and distinguish different categories. Through innovative structural design, HFWC-Net effectively improves the depth and breadth of feature processing, enhances the model’s expressiveness, and optimizes computational efficiency. This enables HFWC-Net to more accurately understand objects in complex categories, making it particularly effective for challenging tasks such as waste classification.
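To make the four-stage design concrete, the following is a minimal PyTorch sketch of the pyramid described above. The stem, stage depths, and channel widths are illustrative assumptions, and the placeholder blocks merely stand in for the CSWin and HFWC blocks detailed in Sections 2.3 and 2.5; none of the names or hyperparameters are taken from the released implementation.

```python
import torch
import torch.nn as nn

class PlaceholderBlock(nn.Module):
    """Stand-in for a CSWin/HFWC block; the real blocks use cross-shaped
    window attention (Section 2.3) or Agent Attention (Section 2.5)."""
    def __init__(self, dim):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),  # local (depth-wise) mixing
            nn.GELU(),
            nn.Conv2d(dim, dim, 1),                          # channel mixing
        )

    def forward(self, x):
        return x + self.body(x)                              # residual connection

class MergeDown(nn.Module):
    """Halves spatial resolution and doubles channels between stages
    (a convolutional stand-in for the Merge Block of Section 2.4)."""
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.down = nn.Conv2d(dim_in, dim_out, 3, stride=2, padding=1)
        self.norm = nn.GroupNorm(1, dim_out)   # channel-wise LayerNorm equivalent in NCHW

    def forward(self, x):
        return self.norm(self.down(x))

class HFWCNetSkeleton(nn.Module):
    """Four-stage pyramid: convolutional token embedding, three CSWin-style
    stages joined by merge/downsampling modules, a final Agent-Attention-style
    stage, and a classification head."""
    def __init__(self, in_chans=3, embed_dim=64, depths=(2, 4, 6, 2), num_classes=135):
        super().__init__()
        self.token_embed = nn.Conv2d(in_chans, embed_dim, 7, stride=4, padding=3)
        dims = [embed_dim * 2 ** i for i in range(4)]        # channels double each stage
        self.stages = nn.ModuleList(
            [nn.Sequential(*[PlaceholderBlock(dims[i]) for _ in range(d)])
             for i, d in enumerate(depths)]
        )
        self.merges = nn.ModuleList([MergeDown(dims[i], dims[i + 1]) for i in range(3)])
        self.head = nn.Linear(dims[-1], num_classes)

    def forward(self, x):
        x = self.token_embed(x)                  # (B, C, H/4, W/4)
        for i, stage in enumerate(self.stages):
            x = stage(x)
            if i < 3:
                x = self.merges[i](x)            # downsample + channel expansion
        return self.head(x.mean(dim=(2, 3)))     # global average pooling + classifier

logits = HFWCNetSkeleton()(torch.randn(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 135])
```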

2.3. Cswin Transformer Block (CST Block)

The CSWin Transformer is an innovative Transformer architecture designed for computer vision tasks, proposed by Microsoft Research Asia in 2021 [17]. Through its novel self-attention mechanism and position encoding method, the CSWin Transformer optimizes the theoretical applicability of the Transformer architecture and demonstrates significant performance advantages in actual vision tasks.
The structure of the CSWin Transformer Block is shown in Figure 3. Layer normalization (LN) is used at the beginning and middle of the module to normalize the mean and variance of the input layer [18]. Between the two layer normalization stages, the cross-shaped window self-attention mechanism (CSWSA) is employed. CSWSA allows the model to focus on key regions of the image while maintaining computational efficiency by forming a cross-shaped window on the input image or feature map. Following CSWSA is a multilayer perceptron (MLP), which typically consists of several fully connected layers and uses nonlinear activation functions such as ReLU between these layers [19]. The MLP further processes the output of the self-attention layer, increasing the nonlinearity of the network and enabling the model to learn more complex feature representations.
One of the core features of the CSWin Transformer is its cross-shaped window self-attention mechanism. Unlike traditional Transformers that apply self-attention across the entire image or within fixed-size windows, CSWSA first splits the input image into multiple small windows. Each window is further divided into a cross center and four corners, with the center window forming a cross shape covering strips in the horizontal and vertical directions (Figure 4). Within each cross-shaped window, the model calculates self-attention independently, meaning it evaluates the relationship of each patch within the window with all other pixels in the same window. This design allows the model to more effectively capture long-range dependencies in the image while reducing computational complexity.
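As a rough illustration of this block layout, the sketch below shows the pre-norm residual structure (LN, attention, LN, MLP). Standard multi-head self-attention is used as a stand-in for CSWSA, since the actual partition of heads into horizontal and vertical strip windows is not reproduced here.

```python
import torch
import torch.nn as nn

class CSWinBlockSkeleton(nn.Module):
    """Pre-norm Transformer block layout of the CST Block:
    LN -> (cross-shaped window) self-attention -> residual,
    LN -> MLP -> residual. Plain multi-head attention stands in for CSWSA."""
    def __init__(self, dim, num_heads=4, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        hidden = int(dim * mlp_ratio)
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):                       # x: (B, N, C) patch tokens
        y = self.norm1(x)
        y, _ = self.attn(y, y, y, need_weights=False)
        x = x + y                               # residual connection around attention
        x = x + self.mlp(self.norm2(x))         # residual connection around MLP
        return x

tokens = torch.randn(1, 56 * 56, 64)            # e.g., stage-1 tokens of a 224x224 image
print(CSWinBlockSkeleton(64)(tokens).shape)     # torch.Size([1, 3136, 64])
```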

2.4. Merge Block

Merge Block is a dedicated module in this study designed to enhance feature fusion by enabling efficient integration and transmission of multi-level feature information. The module comprises three core stages: reshaping, convolution-based downsampling, and normalization (Figure 5). First, the input sequence of tokens is reshaped to reconstruct its spatial structure, enabling the model to reestablish spatial correlations among features. This transformation is essential for bridging the gap between sequence-based representations and spatial operations. Next, a 3 × 3 convolution with a stride of 2 (denoted as “DownConv”) is applied to simultaneously reduce spatial resolution and expand the channel dimension, thereby enriching the semantic content while reducing computational cost. Finally, LN is applied to stabilize the feature distribution and facilitate more robust training dynamics. Unlike traditional vision Transformer modules—such as ViT’s fixed Patch Merging or Swin Transformer’s window-based downsampling—the Merge Block leverages learnable convolutional operations, allowing it to adaptively capture local spatial dependencies and contextual information during the fusion process. Its CNN-based design not only enhances local detail modeling but also ensures higher computational efficiency and better compatibility with modern hardware. Within the HFWC-Net framework, the Merge Block serves as a crucial inter-stage connector, promoting seamless and effective feature propagation across hierarchical layers.
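A minimal sketch of this reshape / DownConv / LayerNorm sequence is given below; the token-sequence interface and the exact doubling of the channel dimension are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class MergeBlockSketch(nn.Module):
    """Merge Block sketch: reshape tokens to a 2-D map, apply a stride-2 3x3
    convolution that expands the channel dimension ("DownConv"), then LayerNorm."""
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.down_conv = nn.Conv2d(dim_in, dim_out, kernel_size=3, stride=2, padding=1)
        self.norm = nn.LayerNorm(dim_out)

    def forward(self, x, h, w):                          # x: (B, N, C) with N = h * w
        b, n, c = x.shape
        x = x.transpose(1, 2).reshape(b, c, h, w)        # restore spatial structure
        x = self.down_conv(x)                            # (B, C_out, h/2, w/2)
        b, c2, h2, w2 = x.shape
        x = x.reshape(b, c2, h2 * w2).transpose(1, 2)    # back to a token sequence
        return self.norm(x), h2, w2

tokens = torch.randn(1, 56 * 56, 64)
merged, h, w = MergeBlockSketch(64, 128)(tokens, 56, 56)
print(merged.shape, h, w)  # torch.Size([1, 784, 128]) 28 28
```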

2.5. HFWC Block

HFWC Block features an advanced Agent Attention Mechanism (AA) to optimize information processing flow (Figure 6). The module employs layer normalization (LN) at both the initial and middle stages to standardize the mean and variance of the input layer. Between these two layer normalization stages, the Agent Attention Mechanism is integrated. This mechanism optimizes attention calculation by introducing agent tokens, which reduces the complexity of direct calculations and improves the efficiency and accuracy of information processing. Following the Agent Attention Mechanism, the module also incorporates a multilayer perceptron (MLP), which consists of several fully connected layers and embeds nonlinear activation functions between layers. The HFWC Block with Agent Attention excels in computational efficiency and fast processing, making it suitable for various application scenarios.
Agent Attention was proposed by Han et al. to balance computational efficiency and representation capabilities [20]. In visual Transformers, the traditional global attention mechanism has high expressive ability but also high computational complexity, limiting its application in high-resolution scenes. The cross-attention mechanism expands the attention area of each token within the Transformer block by performing self-attention in horizontal and vertical strips. However, processing high-resolution images and large-scale data can still pose a high computational burden. Agent Attention combines the advantages of Softmax [21] and linear attention [22]. It introduces agent tokens, simplifies the aggregation process of global information, maintains high expressive ability, and reduces computational complexity, thus overcoming the performance bottlenecks encountered by the cross-attention mechanism in processing large-scale data. This attention mechanism can more effectively focus on key areas of the image, thereby improving classification accuracy and efficiency, which is particularly advantageous for processing large amounts of data and complex tasks.
The Agent Attention mechanism inserts an additional set of agent tokens A into the attention triplet (Q, K, V) to form a new quadruple attention paradigm (Q, K, V, A). The Agent Attention mechanism consists of two traditional Softmax operations (Figure 7). The first Softmax is applied to the triplet (A, K, V), where A is used as a query to aggregate information from the value V into a new agent value V_A. The attention matrix is computed between A and K, which avoids directly processing all token pairs and greatly reduces computational complexity. The second Softmax is computed on the triplet (Q, A, V_A), where V_A is the result of the first step. The newly introduced token A acts as a “proxy” for the query Q: it directly collects information from K and V and then passes the results to Q. The query tokens Q therefore no longer need to communicate directly with the original key K and value V. This reduces the quadratic complexity of Softmax attention to linear complexity while retaining the global context modeling capability.
The algorithm of the Agent Attention mechanism is as follows. First, we simplify Softmax and linear attention to
$$O_S = \mathrm{Softmax}(QK^\top)V \triangleq \mathrm{Attn}_S(Q, K, V), \qquad O_\phi = \phi(Q)\,\phi(K)^\top V \triangleq \mathrm{Attn}_\phi(Q, K, V)$$
where Q, K, and V denote the query, key, and value matrices and φ(·) denotes the kernel feature map used in linear attention; Agent Attention can then be written as
$$O_A = \mathrm{Attn}_S\big(Q, A, \mathrm{Attn}_S(A, K, V)\big)$$
equivalent to
$$O_A = \mathrm{Softmax}(QA^\top)\,\mathrm{Softmax}(AK^\top)\,V = \phi_q(Q)\,\phi_k(K)^\top V = \mathrm{Attn}_{\phi_q/\phi_k}(Q, K, V)$$
The Agent Attention mechanism consists of two Softmax operations, responsible for agent aggregation and agent broadcasting, respectively. Specifically, the agent token A is first regarded as a query, and attention calculation is performed between A, K, and V to aggregate the agent feature V_A from all values. Subsequently, A is used as the key and V_A as the value in the second attention calculation of the query matrix Q to broadcast the global information of the agent feature to each query token, obtaining the final output O. The newly defined agent token A essentially acts as an agent of Q, aggregating global information from K and V and then broadcasting it back to Q, maintaining the global context modeling capability. To better utilize the position information, the Agent Attention adds carefully designed agent biases B_1 and B_2:
$$O_A = \mathrm{Softmax}(QA^\top + B_2)\,\mathrm{Softmax}(AK^\top + B_1)\,V$$
Based on these designs, and with a depth-wise convolution term DWC(V) added to restore feature diversity, the Agent Attention mechanism can be expressed as
$$O_A = \mathrm{Softmax}(QA^\top + B_2)\,\mathrm{Softmax}(AK^\top + B_1)\,V + \mathrm{DWC}(V)$$
Modern Transformer models typically use sparse attention [23] or Window Attention [17] to reduce the computational burden of Softmax attention. Benefiting from its linear complexity, Agent Attention instead enjoys a large and even global receptive field while keeping computational costs manageable, drawing on the strengths of both Softmax and linear attention to balance efficiency and expressiveness. By introducing agent tokens to aggregate and broadcast relevant information, Agent Attention effectively focuses the model’s capacity on important features within the input data.
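The following is a minimal single-head sketch of the two-softmax paradigm above: agent aggregation Softmax(AK^T)V followed by agent broadcasting Softmax(QA^T)V_A. Obtaining the agent tokens by pooling the queries, the number of agents, and the omission of the agent biases B_1, B_2 and of the DWC(V) term are all simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AgentAttentionSketch(nn.Module):
    """Single-head Agent Attention: a small set of agent tokens A first
    aggregates information from (K, V), then broadcasts it back to the
    queries Q, giving linear complexity in the token count N for a fixed
    number of agents. Agent biases and the DWC(V) term are omitted."""
    def __init__(self, dim, num_agents=49):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.num_agents = num_agents
        self.scale = dim ** -0.5

    def forward(self, x):                                 # x: (B, N, C)
        q, k, v = self.qkv(x).chunk(3, dim=-1)            # each (B, N, C)
        # Agent tokens: pool the queries down to num_agents tokens.
        a = F.adaptive_avg_pool1d(q.transpose(1, 2), self.num_agents).transpose(1, 2)
        # Agent aggregation: Softmax(A K^T) V  ->  agent values V_A (B, num_agents, C)
        v_a = torch.softmax(a @ k.transpose(-2, -1) * self.scale, dim=-1) @ v
        # Agent broadcasting: Softmax(Q A^T) V_A  ->  output for every query token
        out = torch.softmax(q @ a.transpose(-2, -1) * self.scale, dim=-1) @ v_a
        return self.proj(out)

tokens = torch.randn(1, 14 * 14, 512)                     # e.g., stage-4 tokens
print(AgentAttentionSketch(512)(tokens).shape)            # torch.Size([1, 196, 512])
```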

2.6. New Optimization Strategy LionBatch

AdamW has become a cornerstone optimizer in deep learning due to its decoupled weight decay mechanism, which separates regularization from gradient updates to enhance training stability [24]. However, despite its effectiveness, AdamW has some significant drawbacks, particularly in terms of memory consumption and computational efficiency. AdamW stores multiple moving averages of gradients and squared gradients for each parameter, which can quickly become resource-intensive, especially in models with numerous parameters. This architecture leads to inefficient GPU memory utilization and elevated computational costs during backpropagation. Furthermore, AdamW exhibits pronounced sensitivity to hyperparameter configurations (e.g., learning rate, β_1 momentum coefficient, weight decay coefficient), necessitating meticulous tuning to avoid suboptimal convergence or training instability. To address these limitations, the Lion optimizer introduces a resource-efficient paradigm by redefining momentum utilization [25]. Unlike AdamW’s adaptive learning rate mechanism, Lion employs a sign-based update rule that retains only a single momentum buffer, reducing memory consumption while eliminating redundant gradient magnitude calculations. This design enables linear computational scaling with model size, enhancing training throughput for deep architectures. Crucially, Lion replaces AdamW’s complex hyperparameter dependencies with a simplified two-parameter framework (learning rate, β_1 momentum), demonstrating superior robustness across tasks. This simplicity allows Lion to converge more quickly and efficiently, especially on large datasets, without the need for fine-tuning complex learning rates or momentum parameters. This makes Lion more accessible and effective for researchers, especially when dealing with complex models and large datasets that require both speed and accuracy.
Traditional pruning techniques reduce model size and computational complexity by removing some weights or neurons in the network, including unimportant weight pruning and random pruning [26]. While traditional pruning helps improve model running speed and reduce storage requirements, it often leads to a decline in model performance, especially when the pruning ratio is high, as the model may lose key information. To address the shortcomings of traditional pruning techniques, Qin et al. from the National University of Singapore proposed an unbiased dynamic data pruning method, InfoBatch [27]. Traditional pruning methods often introduce biases in gradient expectations by discarding important samples that help the model generalize. InfoBatch avoids this bias by adjusting the gradient weights of the remaining samples while pruning them (called gradient rescaling), maintaining the same gradient expectations as the unpruned dataset.
To better optimize the model, this paper proposes a new optimization strategy, LionBatch, which combines the Lion optimizer and InfoBatch pruning. The main algorithm flow is shown in Figure 8 and Algorithm 1. By integrating these two technologies, our new optimization strategy not only improves the training speed and resource utilization efficiency of the model but also significantly reduces the computing resources required during the training process while maintaining high classification accuracy. LionBatch tracks momentum during model training and computes updates through sign operations, thereby reducing memory overhead and ensuring update magnitudes are consistent across all dimensions. LionBatch implements dynamic pruning, allowing pruning decisions to be adjusted based on the real-time performance of the data during training. This allows finer-grained control of the training process. Through this dynamic and unbiased pruning method, LionBatch can reduce training costs while maintaining model performance, achieving lossless training acceleration. These innovative designs demonstrate significant advantages in improving model efficiency and effectiveness.
The detailed steps of the LionBatch algorithm are as follows.
Algorithm 1. LionBatch Optimization Strategy
given β_1, β_2, λ, η, f, r
initialize θ_0, m_0, z_0, p_0 ← 0
while θ_t not converged do
    update sample scores based on loss:
        p_t ← ∫_z Loss(z_t, θ_t) ρ(z_t) dz_t
        θ_t ← arg min_θ (1 − r) p_t
        g_t ← ∇_θ f(θ_{t−1})
    update model parameters:
        c_t ← β_1 m_{t−1} + (1 − β_1) g_t
        θ_t ← θ_{t−1} − η_t (sign(c_t) + λ θ_{t−1})
    update EMA of g_t:
        m_t ← β_2 m_{t−1} + (1 − β_2) g_t
end while
return θ_t
First, calculate the average loss of the entire dataset and directly prune the samples with smaller losses. After each epoch or every few epochs of training, recalculate the average loss of the entire dataset. For pruned or untrained samples, keep the previously recorded loss; for trained samples, use the loss updated after training. Data samples p with losses below the average loss are pruned with a certain probability r. Because the number of samples is reduced, the gradient g over the remaining dataset would otherwise deviate from the expected gradient of the original dataset; to address this, the gradients of the retained samples with losses below the average are rescaled. Then, the Lion optimizer is used to optimize the pruned data subset: the current update momentum c is computed from the gradient g and the previous momentum m, the sign function determines the update direction for the model parameters θ, and the previous momentum m is combined with the current gradient to update the exponential moving average (EMA). Finally, this process is repeated to progressively optimize the model parameters until the desired performance indicators are achieved.
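As a rough illustration of these two ingredients, the sketch below pairs a sign-based Lion update step with InfoBatch-style soft pruning and loss rescaling for samples below the mean loss. The function names, the pruning probability, and the 1/(1 − r) rescaling factor are assumptions for illustration, not the exact LionBatch implementation.

```python
import torch

def lion_update(params, grads, momenta, lr=1e-4, beta1=0.9, beta2=0.99, wd=1e-2):
    """Sign-based Lion step: interpolate momentum and gradient to get the
    update direction, apply sign() plus decoupled weight decay, then update
    the single momentum buffer (an EMA of the gradients)."""
    for p, g, m in zip(params, grads, momenta):
        c = beta1 * m + (1 - beta1) * g                    # update direction c_t
        p.data.add_(torch.sign(c) + wd * p.data, alpha=-lr)
        m.mul_(beta2).add_(g, alpha=1 - beta2)             # m_t = EMA of gradients

def prune_and_rescale(losses, prune_prob=0.5):
    """InfoBatch-style soft pruning: samples whose loss is below the mean are
    dropped with probability prune_prob; surviving below-mean samples have
    their loss rescaled by 1 / (1 - prune_prob) so the expected gradient
    stays consistent with the unpruned dataset (rescaling factor assumed)."""
    below_mean = losses < losses.mean()
    drop = below_mean & (torch.rand_like(losses) < prune_prob)
    keep = ~drop
    scale = torch.where(below_mean & keep,
                        torch.full_like(losses, 1.0 / (1.0 - prune_prob)),
                        torch.ones_like(losses))
    return keep, scale

# Toy usage: per-sample losses of one batch and a single parameter tensor.
losses = torch.rand(8)
keep, scale = prune_and_rescale(losses)
weighted_loss = (losses[keep] * scale[keep]).mean()        # loss over retained samples

w = torch.randn(4, requires_grad=True)
grad = torch.randn_like(w)                                 # stand-in for an autograd gradient
momentum = torch.zeros_like(w)
lion_update([w], [grad], [momentum])
```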
The LionBatch optimization strategy significantly reduces training time and computing resource consumption while maintaining high classification accuracy. By combining sign-based momentum updates with dynamic data pruning, LionBatch enables the model to converge faster, reducing the number of training iterations. This strategy not only makes the optimization process more efficient and stable but also reduces the risk of overfitting and underfitting. Additionally, LionBatch dynamically removes redundant training samples, avoiding unnecessary computational burden while preserving the model’s accuracy. The LionBatch strategy excels in optimizing computing resource consumption, which is particularly important for resource-constrained environments such as embedded or mobile devices, making it feasible to deploy deep learning models on these platforms.

2.7. Experimental Environment Set-Up

After building the HFWC-Net model, we used the PyTorch deep learning framework to train it in graphics processing unit (GPU) mode. In this experiment, two computer devices were used for the entire training and validation process. The detailed characteristics of the two devices are shown in Table 2 and Table 3. Machine 1 is primarily used to train CNN and Transformer architecture networks due to its strong computing power and efficient video memory management. This capability allows Machine 1 to handle a large number of matrix operations and data processing effectively. Machine 2, on the other hand, is used to run VMamba, as VMamba is more compatible with the Linux system, making its training on this device more efficient and convenient.

2.8. Experimental Evaluation Metrics

In this study, we used various deep learning models to conduct experiments on multiple datasets. To properly evaluate the performance of these models, we adopted established evaluation metrics, including Top1-Accuracy (Top1-Acc), precision, recall, and F1 score. These metrics are computed using data extracted from the confusion matrix, which contains key parameters such as the total number of test samples (ATs), true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). All values are calculated using the global confusion matrix, which consists of the results from cross-validation [28].
$$\mathrm{Top1\text{-}Acc} = \frac{TP_s}{AT_s}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$\mathrm{F1\text{-}score} = \frac{2\,TP}{2\,TP + FP + FN}$$
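For reference, a short snippet computing these metrics from a confusion matrix is given below; the macro averaging over classes for precision, recall, and F1 is an assumption, since the averaging scheme is not specified above.

```python
import numpy as np

def classification_metrics(conf_mat):
    """Top-1 accuracy and macro-averaged precision, recall, and F1 from a
    square confusion matrix (rows = true class, columns = predicted class)."""
    conf_mat = np.asarray(conf_mat, dtype=float)
    tp = np.diag(conf_mat)                       # correctly classified per class
    fp = conf_mat.sum(axis=0) - tp               # predicted as class c but wrong
    fn = conf_mat.sum(axis=1) - tp               # true class c but missed
    top1_acc = tp.sum() / conf_mat.sum()         # correct predictions / all test samples
    precision = np.mean(tp / np.maximum(tp + fp, 1e-12))
    recall = np.mean(tp / np.maximum(tp + fn, 1e-12))
    f1 = np.mean(2 * tp / np.maximum(2 * tp + fp + fn, 1e-12))
    return top1_acc, precision, recall, f1

# Example with a small 3-class confusion matrix
cm = [[50, 2, 1], [3, 45, 2], [0, 4, 43]]
print(classification_metrics(cm))
```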

3. Results

3.1. Comparative Experimental Results

To evaluate the performance of the model proposed in this paper, we conducted a series of comparative experiments on the public datasets Garbage Classification and TrashNet [8] and on the self-built waste dataset MixTrash. We selected numerous advanced classification models as references, including VGGNet [10], GoogleNet [16], ResNet [11], DenseNet [9], MobileNet [29], EfficientNet [30], ConvNext [31], ViT [32], DeiT [33], Swin Transformer [13], TNT [34], PiT [35], CaiT [36], BiFormer [37], and VMamba [38]. The experiments used the same loss function, learning strategy, and image preprocessing method to ensure the fairness of the results. The experimental results indicate that the improved model proposed in this study performs exceptionally well across all three datasets. The following sections present the results of the comparative experiments.

3.1.1. Comparison Results of Garbage Classification Dataset

The results show the Top1-Acc, precision, recall, F1 score, and number of parameters for different models on the Garbage Classification dataset (Table 4). The experimental results reveal that traditional CNN models such as VGGNet, GoogleNet, ResNet, and DenseNet achieve similar accuracy, with Top1-Acc ranging from 93% to 94%. Among them, VGGNet has the highest Top1-Acc at 93.97%, but it also has a much larger number of parameters compared to the other three traditional CNN models. In contrast, lightweight models like MobileNet and EfficientNet achieve higher accuracy with fewer parameters by optimizing computational efficiency, with EfficientNet’s Top1-Acc being 2.74% higher than VGGNet’s. Transformer architecture models, including ViT, DeiT, Swin Transformer, CSwin Transformer, TNT, PiT, CaiT, and BiFormer, perform well, with Top1-Acc ranging from 95% to 97%, exceeding that of traditional CNN models by more than 2%. Among these, CSwin Transformer stands out by introducing cross-shaped Window Attention to effectively capture spatial patterns, achieving a Top-1 accuracy of 97.03%. This makes CSwin a strong baseline among vision Transformers, balancing accuracy and model complexity. Additionally, ConvNext combines the advantages of traditional CNN and Transformer models, achieving a Top1-Acc of 96.74%, though its parameter count far exceeds that of most CNN and Transformer models. VMamba also demonstrated efficient performance, achieving a Top1-Acc of 98.45%, second only to the HFWC-Net proposed in this study. The HFWC-Net proposed in this study achieved a Top1-Acc of 98.89% on the Garbage Classification dataset, significantly higher than traditional CNN models and current mainstream Transformer models. Particularly in terms of parameter volume, HFWC-Net offers similar accuracy to VMamba but with significantly fewer parameters, translating to lower operating costs and faster processing speeds in actual deployment. Moreover, HFWC-Net’s accuracy is 2.77% higher than that of the Swin Transformer, which performs well within the Transformer architecture, while its parameter volume is about 13% lower than that of the Swin Transformer. This reduction is particularly important in resource-constrained application scenarios. The outstanding performance of HFWC-Net is attributed to its innovative network architecture and optimization strategies, which greatly enhance the model’s learning efficiency and generalization ability. These results fully demonstrate the significant advantages of HFWC-Net in handling Garbage Classification tasks.
Figure 9 shows the trend of Top1-Acc for the different models on the Garbage Classification dataset as the number of training epochs increases. It can be observed from the figure that the accuracy of most models increases rapidly within the first 50 epochs, with the accuracy of each model tending to stabilize as it approaches 300 epochs. Notably, some traditional CNN models, such as VGGNet and ResNet, exhibit slow convergence speeds during training. Specifically, the accuracy of these models still fluctuates significantly around 100 epochs, which may be related to their deeper network structures. In contrast, lightweight models such as MobileNet and EfficientNet show faster convergence speeds due to their optimized structures. This indicates that these models can efficiently complete training tasks under limited resources. Transformer models such as Swin Transformer, CSwin Transformer, and BiFormer not only converge faster than traditional CNNs throughout the training cycle but also demonstrate higher accuracy in the early stages compared to traditional CNNs and lightweight models. This highlights the superior capability of Transformer models in handling classification tasks. It is worth mentioning that the HFWC-Net model achieved high accuracy early in the training process, second only to VMamba. Around 50 epochs, the accuracy of the HFWC-Net model surpassed VMamba, fully validating the efficiency and effectiveness of our proposed method.

3.1.2. Comparison Results of TrashNet Dataset

The results show the Top1-Acc, precision, recall, F1 score, and number of parameters for different models on the TrashNet dataset (Table 5). The experimental results reveal significant differences in accuracy among traditional CNN models such as VGGNet, GoogleNet, ResNet, and DenseNet. Among these, GoogleNet has the lowest Top1-Acc at 83.50%. In contrast, lightweight models such as MobileNet and EfficientNet exhibit good performance. Notably, EfficientNet’s performance is comparable to traditional heavy models like VGGNet, but with only 13% of VGGNet’s parameters. Transformer architecture models, including ViT, TNT, CSwin Transformer, and BiFormer, perform well, with Top1-Acc rates above 90%, significantly higher than traditional CNN models. Among them, the baseline CSwin Transformer demonstrates particularly strong performance, outperforming most other Transformer variants. Its balanced accuracy and training stability underscore the effectiveness of the cross-shaped attention mechanism in capturing both local and global features, making it especially well-suited for complex classification tasks such as waste categorization. VMamba exhibited even better performance, with a Top1-Acc of 95.83%, 3.98% higher than ConvNext, further proving the advantages of these emerging models in handling complex classification tasks. The HFWC-Net proposed in this study performed the best among all models, achieving a Top1-Acc of 96.43% on the TrashNet dataset, significantly higher than all models except VMamba, demonstrating its substantial improvement in accuracy.
Figure 10 shows the Top1-Acc trends of different models as they train on the TrashNet dataset. It can be observed that the accuracy of most models fluctuates greatly in the early stages of training, stabilizing only after approximately 300 training cycles. Traditional CNN models such as VGGNet, ResNet50, DenseNet169, and GoogleNet improve slowly in the early stages, converge relatively slowly, and ultimately achieve lower final accuracy compared to newer models. Lightweight models like MobileNet and EfficientNet show faster convergence speeds, but their accuracy is only about 0.6% higher than that of VGGNet, the best-performing traditional model. Among the Transformer models, ViT, TNT, and BiFormer perform well in both convergence speed and final accuracy, demonstrating the advantages of the Transformer architecture in processing image classification tasks. It is particularly noteworthy that HFWC-Net and VMamba improve rapidly in accuracy in the early stages of training, far exceeding all other models. HFWC-Net eventually surpasses VMamba at around 120 training cycles, indicating the efficiency of its optimization strategy and network architecture. These results highlight the design advantages of HFWC-Net in achieving high accuracy, making it ideal for handling complex image classification problems.

3.1.3. Comparison Results of MixTrash Dataset

The results show the Top1-Acc, precision, recall, F1 score, and parameter count of different models on our self-built dataset MixTrash (Table 6). The experimental results indicate that traditional CNN models such as VGGNet, GoogleNet, ResNet, and DenseNet perform poorly. The Top1-Acc of VGGNet is only 72.11%, reflecting its inefficiency in processing complex datasets due to its large number of parameters and deep network structure. Among lightweight models, EfficientNet excels. Utilizing a compound scaling method, EfficientNet optimizes in multiple dimensions and achieves excellent performance with relatively few parameters. The overall performance of Transformer architecture models is relatively average, with Top1-Acc between 83% and 86%, due to the scale and complexity of the MixTrash dataset limiting the potential of the Transformer models. BiFormer performs relatively well, with a Top1-Acc of 87.28%. Its unique bidirectional feature extraction mechanism provides advantages in processing complex image tasks, but its overall performance still lags behind the emerging efficient models HFWC-Net and VMamba. CSwin Transformer, introduced as the baseline model, delivers highly competitive performance on MixTrash. With a Top1-Acc of 92.50%, it surpasses most Transformer variants, demonstrating stable convergence and strong generalization. VMamba achieved a Top1-Acc of 93.89% on the MixTrash dataset, demonstrating extremely high classification performance. However, its training time per epoch is the highest among all models, suggesting a significant computational cost despite its excellent accuracy. The HFWC-Net proposed in this study performed best among all models, achieving a Top1-Acc of 94.35% on the MixTrash dataset. Compared to other models, HFWC-Net not only has an advantage in accuracy but also achieves higher optimization efficiency regarding parameter quantity. For instance, compared to the Swin Transformer with a similar parameter count, HFWC-Net reduces the number of parameters by 11.72% while increasing accuracy by 9.03%. This outstanding performance fully demonstrates the great potential and advantages of HFWC-Net in processing complex data features and achieving efficient classification.
Figure 11 and Figure 12 show the relationship between the Top1-Acc, training time, and model parameter count of different models on the MixTrash dataset. The scatter plot intuitively illustrates the performance and training efficiency of each model. Among traditional CNN models, VGGNet exhibits the lowest Top1-Acc, the longest training time, and the largest number of parameters. This indicates that VGGNet has a complex structure and low efficiency, making it difficult to handle complex datasets. Although GoogleNet, ResNet, and DenseNet have shorter training times, their Top1-Acc is relatively low. Lightweight models such as MobileNet and EfficientNet have shorter training times and higher Top1-Acc compared to most traditional CNN models. Particularly, EfficientNet achieves an accuracy of 87.30%, demonstrating excellent classification performance while maintaining a low number of parameters and short training time. Transformer architecture models such as ViT, DeiT, TNT, and PiT have accuracy rates between 83% and 86%, with training times slightly longer than those of lightweight models but shorter than those of traditional CNN models. While these models improve upon traditional CNNs, they do not surpass lightweight models. Swin Transformer and CaiT achieve high accuracy but have relatively long training times and large parameter counts, highlighting both their advantages and disadvantages on complex datasets. BiFormer performs relatively well, with a moderate training time and parameter count, showing certain advantages. CSwin Transformer, used as a baseline model in this study, demonstrates outstanding performance as well. Although its training time is longer than some lightweight models, it significantly outperforms most other Transformer variants, reflecting a strong balance between performance and computational cost. VMamba achieves a Top1-Acc close to the highest, but its training time is the longest, about five times that of mainstream Transformer models. This indicates that VMamba has outstanding classification performance but low training efficiency. HFWC-Net performs exceptionally well, standing out with the highest Top1-Acc and moderate training time. Under similar parameter conditions, HFWC-Net’s training time is significantly lower than VMamba’s, being only about one-third of it, and its accuracy is 6.09% higher than ConvNext. This demonstrates HFWC-Net’s excellent classification performance and high training efficiency. Overall, HFWC-Net exhibits the best overall performance across various indicators, fully demonstrating its powerful capability in waste classification tasks.

3.2. Ablation Experiment Results

In addition, on the MixTrash dataset, we conducted ablation experiments to verify the effectiveness of each improvement strategy. The experimental results clearly show that each improvement strategy, including the introduction of a new attention mechanism and optimized training algorithm technology, significantly enhances the model’s classification performance. This study uses CSwin Transformer as the baseline network for this experiment. First, the self-attention mechanism of the last layer of the CSwin Block was replaced with a parallel Agent Attention mechanism to prevent loss caused by incomplete feature extraction. The improved model benefits from the Transformer’s powerful global feature extraction capability, with an accuracy increase of 1.03% compared to the backbone network (Table 7). To accelerate network training, we introduced the LionBatch optimization strategy, which significantly reduced training time while maintaining high classification accuracy. The experimental results indicate that Model F, utilizing the LionBatch strategy, shortened the training time and reduced the computational load, despite having similar parameter counts and FLOPs. Ultimately, the proposed HFWC-Net model achieved an accuracy of 94.35% on the MixTrash dataset, an increase of 1.85% compared to the original model. Additionally, the training time was reduced from 24.67 h to 24.22 h. Therefore, HFWC-Net can quickly and accurately classify waste in actual scenarios, meeting the needs of real-time waste classification.
To verify the performance improvement of the HFWC-Net model compared to the original model, we used the test set to evaluate the classification effects of both models. According to the number of correctly predicted samples in the confusion matrix, the improved model increased the identification accuracy of recyclable waste, kitchen waste, hazardous waste, and other waste by 1%, 2%, 1%, and 3%, respectively (Figure 13). This performance improvement indicates that HFWC-Net can more accurately identify and classify different types of garbage in classification tasks. The confusion matrix of the original model reveals that the main reason for the lower accuracy is the confusion between kitchen waste and other garbage during the classification process. This issue may arise from similarities in appearance and characteristics between some food waste and other waste, causing difficulty for the model in distinguishing these categories. Specifically, kitchen waste and other garbage may share similarities in color, shape, and texture, making it challenging for the original model to accurately differentiate them. HFWC-Net significantly enhances the model’s feature extraction capabilities and classification performance by introducing innovative technologies such as multi-level feature fusion and new optimization strategies. These improvements enable HFWC-Net to better capture subtle differences in garbage images and improve the ability to identify complex features. Overall, the improvement of the HFWC-Net model not only increases the overall classification accuracy but also demonstrates the effectiveness and superiority of HFWC-Net in Garbage Classification tasks. This provides a more reliable solution for practical applications.

3.3. Real-World Performance Metrics

To assess the real-world applicability of HFWC-Net for waste classification, we evaluated its inference efficiency on the MixTrash dataset using Machine 1 (as detailed in Table 2). The experiment focused on single-image inference to simulate practical scenarios, such as real-time waste sorting in automated systems. Table 8 presents the GPU memory consumption and average inference time for HFWC-Net alongside several baseline models, including ResNet, EfficientNet, ConvNeXt, ViT, Swin Transformer, CSwin Transformer, and BiFormer. HFWC-Net demonstrates a balanced performance, consuming 266.2 MB of GPU memory and achieving an average inference time of 7.9 ms per image. Compared to lightweight models like EfficientNet, HFWC-Net uses slightly more memory but maintains competitive inference speed. In contrast, models like ConvNeXt and Swin Transformer exhibit significantly higher memory usage, making them less suitable for resource-constrained environments despite faster or comparable inference times. ViT offers the fastest inference but sacrifices accuracy on complex datasets like MixTrash, as shown in prior results. CSwin Transformer performs similarly to HFWC-Net in memory usage but is slightly slower, while BiFormer is the least efficient, with the longest inference time and higher memory demand. These results highlight HFWC-Net’s efficiency, making it well-suited for real-time waste classification on edge devices while maintaining high accuracy.
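A sketch of how such single-image latency and peak GPU memory figures can be collected in PyTorch is shown below; the warm-up length, repeat count, and input size are assumptions and may differ from the exact protocol used for Table 8.

```python
import time
import torch

def measure_inference(model, input_size=(1, 3, 224, 224), warmup=20, repeats=100):
    """Average single-image inference time (ms) and peak GPU memory (MB) for a
    model in eval mode; CUDA work is synchronized around the timed region."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    x = torch.randn(*input_size, device=device)
    with torch.no_grad():
        for _ in range(warmup):                      # warm-up to stabilize clocks/caches
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
            torch.cuda.reset_peak_memory_stats()
        start = time.perf_counter()
        for _ in range(repeats):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        end = time.perf_counter()
    latency_ms = (end - start) / repeats * 1e3
    peak_mb = torch.cuda.max_memory_allocated() / 2**20 if device == "cuda" else float("nan")
    return latency_ms, peak_mb

# Example usage with any nn.Module, e.g., a torchvision baseline:
# from torchvision.models import resnet50
# print(measure_inference(resnet50()))
```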

4. Discussion

4.1. Interpreting the Superior Performance of HFWC-Net in Waste Classification

The experimental results underscore the remarkable efficacy of the HFWC-Net in addressing the challenges of waste image classification, achieving Top-1 accuracy rates of 98.89%, 96.43%, and 94.35% on the Garbage Classification, TrashNet, and MixTrash datasets, respectively. These figures not only highlight HFWC-Net’s exceptional performance but also validate the core hypothesis driving this study: that a Transformer-based architecture, augmented by hybrid–modal fusion, the Agent Attention mechanism, and the LionBatch optimization strategy, can substantially outperform traditional convolutional neural networks (CNNs) and existing Transformer models in both accuracy and processing efficiency. This success is particularly evident when considering the diversity and complexity of the datasets tested. For instance, the Garbage Classification dataset, with its broad range of waste categories, benefits from HFWC-Net’s ability to integrate multi-modal features, achieving a near-perfect 98.89% accuracy. Similarly, on TrashNet—a smaller, less diverse dataset with 2527 images across six categories—HFWC-Net’s 96.43% accuracy demonstrates its adaptability to varying dataset scales, surpassing previous benchmarks like Alrayes et al.’s 95.8% [14]. The self-built MixTrash dataset, with its 135 categories and 52,324 images reflecting real-world variability, further showcases HFWC-Net’s robustness, achieving 94.35% accuracy despite the increased complexity and sample diversity.
This superior performance can be attributed to the synergistic design of HFWC-Net’s components. The Transformer-based backbone, leveraging the CSWin Transformer’s cross-shaped window self-attention, excels at capturing global and local feature relationships, overcoming the limitations of CNNs’ localized receptive fields. The hybrid–modal fusion approach enhances this capability by integrating diverse image attributes—such as shapes, colors, and textures—into a cohesive feature representation, as evidenced by the model’s consistent high precision, recall, and F1 scores across all datasets (e.g., 98.82% F1 score on Garbage Classification). The Agent Attention mechanism further refines this process by focusing computational resources on salient features, reducing redundancy and boosting efficiency, a critical factor in achieving linear complexity over traditional quadratic attention models. Meanwhile, the LionBatch optimization strategy accelerates training and reduces resource demands, as seen in the ablation study (Table 7), where its inclusion cuts training time from 24.67 to 24.22 h while maintaining accuracy gains. Collectively, these innovations enable HFWC-Net to process large-scale, heterogeneous waste data swiftly and accurately, aligning with the study’s aim to develop a scalable solution for automated waste classification.

4.2. Limitations and Future Directions of HFWC-Net

Despite its advantages, the proposed waste classification method HFWC-Net still presents several limitations. First, the model’s feature extraction capabilities may be insufficient when processing low-resolution garbage images or images with missing fine details, which can lead to reduced classification accuracy. This issue becomes particularly prominent in complex real-world scenes captured by cameras, where factors such as lighting variations, occlusions, and motion blur can significantly degrade model performance. Additionally, for anomalous data or garbage categories that are underrepresented or absent in the training set, HFWC-Net may exhibit limited generalization ability, increasing the likelihood of misclassification or low-confidence predictions. This reflects the model’s current challenges when dealing with long-tail distributions and open-set environments, which are common in real-world waste classification tasks. Moreover, although the LionBatch optimization strategy effectively reduces computational resource consumption and accelerates convergence, its performance is highly dependent on the quality and stability of training batches. In scenarios involving imbalanced data distributions or significant variation in sample features, LionBatch may result in slower convergence or increased performance fluctuations due to instability in the optimization process.
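To make this batch dependence concrete, the sketch below re-implements, under our own simplifying assumptions, the two ideas LionBatch draws on: the Lion update rule of Chen et al. [25] and an InfoBatch-style dynamic pruning step [27]. The function names and the mean-loss threshold are illustrative rather than the exact LionBatch implementation; the thresholding step shows why skewed or unstable per-batch loss statistics can change which samples are kept and therefore affect convergence.

```python
import torch

@torch.no_grad()
def lion_step(param, grad, momentum, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.01):
    """One Lion update (Chen et al. [25]): sign of an interpolated momentum."""
    update = torch.sign(beta1 * momentum + (1 - beta1) * grad)
    param.add_(update + weight_decay * param, alpha=-lr)     # theta <- theta - lr*(sign(...) + wd*theta)
    momentum.mul_(beta2).add_(grad, alpha=1 - beta2)          # m <- beta2*m + (1-beta2)*g

def prune_easy_samples(losses, keep_ratio=0.5):
    """InfoBatch-style pruning [27]: drop part of the low-loss ('well-learned') samples
    and up-weight the kept ones so the expected gradient remains unbiased."""
    threshold = losses.mean()                                 # per-batch statistic; sensitive to imbalance
    easy = losses < threshold
    keep_easy = easy & (torch.rand_like(losses) < keep_ratio)
    mask = (~easy) | keep_easy                                # samples that stay in the iteration
    weights = torch.ones_like(losses)
    weights[keep_easy] = 1.0 / keep_ratio                     # compensate for the dropped easy samples
    return mask, weights
```

In a training loop, `mask` would select the samples used for backpropagation and `weights` would rescale their losses before averaging, which is where unstable batch statistics can introduce the fluctuations noted above.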
To address the existing limitations and further enhance the capabilities of HFWC-Net, future research will focus on several key directions. First, introducing more adaptive learning mechanisms will be a primary goal. These mechanisms could include techniques such as meta-learning, which allows the model to quickly adapt to new, unseen categories and improve generalization in dynamic environments. Self-supervised learning methods may also be explored to make the model more robust to variations in the dataset and improve its ability to generalize with minimal labeled data. Next, we aim to integrate advanced data augmentation strategies that will improve the model’s ability to handle rare and unseen categories. These strategies could include synthetic data generation through techniques like Generative Adversarial Networks (GANs), which can create realistic samples of rare or underrepresented waste types, thus enriching the training data. Mixup-based augmentation can also be used to improve the model’s ability to generalize by blending features from different classes. In terms of computational efficiency, we will explore parallel computing frameworks and distributed training architectures to optimize the training speed and resource utilization. Techniques like model pruning and quantization could be employed to reduce the computational burden and make the model more suitable for resource-constrained environments such as edge devices or mobile applications. By addressing these areas, we believe future work will enhance the versatility and practicality of HFWC-Net.
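As a concrete illustration of the mixup-based augmentation mentioned above, the following minimal sketch blends pairs of images and their one-hot labels; the function name, the choice of alpha, and the 135-class assumption are placeholders for demonstration rather than part of the current HFWC-Net pipeline.

```python
import numpy as np
import torch
import torch.nn.functional as F

def mixup_batch(images, labels, num_classes, alpha=0.2):
    """Blend pairs of samples and their one-hot labels (mixup-style augmentation).

    Exposes the classifier to interpolated waste images, which can help with
    rare categories; alpha controls the interpolation strength.
    """
    lam = np.random.beta(alpha, alpha)
    perm = torch.randperm(images.size(0))
    mixed_images = lam * images + (1.0 - lam) * images[perm]
    one_hot = F.one_hot(labels, num_classes).float()
    mixed_labels = lam * one_hot + (1.0 - lam) * one_hot[perm]
    return mixed_images, mixed_labels

# Usage inside a training step (soft targets require a soft-label loss):
# imgs, targets = mixup_batch(imgs, targets, num_classes=135)
# loss = torch.sum(-targets * F.log_softmax(model(imgs), dim=-1), dim=-1).mean()
```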

5. Conclusions

This study proposes a new Transformer-based framework, HFWC-Net, which integrates the local feature extraction capabilities of CNNs with the global contextual understanding of hybrid–modal visual Transformers and enhances this CNN-Transformer synergy with the Agent Attention architecture. The use of agent tokens overcomes the limitations of the conventional query-key-value (triplet) attention paradigm, improving the model's flexibility, adaptability, and information capture, and thereby increasing training efficiency. In addition, HFWC-Net employs LionBatch, an unbiased dynamic selection and adaptive pruning method that reduces the number of samples processed in each training iteration. This approach improves computational efficiency, significantly speeds up model training, and optimizes the utilization of computing resources.
However, challenges remain, such as limited performance on low-resolution images and unseen categories, highlighting constraints in feature extraction and generalization. Future work will focus on developing adaptive learning techniques and robust preprocessing methods to improve performance in complex scenarios. Additionally, efforts will be directed toward optimizing the model for resource-constrained environments. Overall, HFWC-Net offers a powerful and efficient solution for waste classification, with the potential to advance environmental sustainability and improve urban waste management practices.

Author Contributions

Conceptualization, Q.D.; methodology, Q.D.; validation, H.Z. and Q.W.; formal analysis, Q.D., C.C. and Q.W.; investigation, H.H., G.Z. and T.H.; data curation, H.Z. and H.Y.; writing—original draft preparation, Q.D.; writing—review and editing, H.Z. and Q.W.; visualization, Q.D. and Q.L.; supervision, H.Y. and H.H.; project administration, H.Z. and J.H.; funding acquisition, G.Z. and J.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Zhejiang Provincial Natural Science Foundation of China under Grant No. LY24F020005 (funder: H.Z., amount: CNY 100,000), Research Project of Zhejiang Provincial Department of Education (No. Y202147814, funder: Q.W., amount: CNY 10,000) and Zhejiang Undergraduate Innovation Plan (Xinmiao Talent Program No. 2024R412A032, funder: C.C., amount: CNY 10,000).

Data Availability Statement

The datasets and source code used in this study are available upon reasonable request.

Acknowledgments

The authors would like to thank the anonymous reviewers for their constructive comments and suggestions, which significantly contributed to improving the manuscript. Additionally, we express our gratitude to Qun Wang for his valuable contributions to the validation process, providing essential support during the development of this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Li, J.; Chen, J.; Sheng, B.; Li, P.; Yang, P.; Feng, D.D.; Qi, J. Automatic Detection and Classification System of Domestic Waste via Multimodel Cascaded Convolutional Neural Network. IEEE Trans. Ind. Inf. 2022, 18, 163–173.
2. Yang, S.; Wei, J.; Cheng, P. Spillover of Different Regulatory Policies for Waste Sorting: Potential Influence on Energy-Saving Policy Acceptability. Waste Manag. 2021, 125, 112–121.
3. Shu, T.; Li, X.; Guo, H.; Bai, B.; Nie, X. Research on the Innovative Model of the Whole Chain of MSW Classification in China. In Proceedings of the 2021 International Conference on Big Data and Intelligent Decision Making (BDIDM), Guilin, China, 23–25 July 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 230–235.
4. Malik, M.; Sharma, S.; Uddin, M.; Chen, C.-L.; Wu, C.-M.; Soni, P.; Chaudhary, S. Waste Classification for Sustainable Development Using Image Recognition with Deep Learning Neural Network Models. Sustainability 2022, 14, 7222.
5. Ahmed, M.I.B.; Alotaibi, R.B.; Al-Qahtani, R.A.; Al-Qahtani, R.S.; Al-Hetela, S.S.; Al-Matar, K.A.; Al-Saqer, N.K.; Rahman, A.; Saraireh, L.; Youldash, M.; et al. Deep Learning Approach to Recyclable Products Classification: Towards Sustainable Waste Management. Sustainability 2023, 15, 11138.
6. Olawade, D.B.; Fapohunda, O.; Wada, O.Z.; Usman, S.O.; Ige, A.O.; Ajisafe, O.; Oladapo, B.I. Smart Waste Management: A Paradigm Shift Enabled by Artificial Intelligence. Waste Manag. Bull. 2024, 2, 244–263.
7. Guo, H.; Wu, S.; Tian, Y.; Zhang, J.; Liu, H. Application of Machine Learning Methods for the Prediction of Organic Solid Waste Treatment and Recycling Processes: A Review. Bioresour. Technol. 2021, 319, 124114.
8. Aral, R.A.; Keskin, S.R.; Kaya, M.; Haciomeroglu, M. Classification of TrashNet Dataset Based on Deep Learning Models. In Proceedings of the 2018 IEEE International Conference on Big Data (Big Data), Seattle, WA, USA, 10–13 December 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 2058–2062.
9. Zhang, Q.; Yang, Q.; Zhang, X.; Bao, Q.; Su, J.; Liu, X. Waste Image Classification Based on Transfer Learning and Convolutional Neural Network. Waste Manag. 2021, 135, 150–157.
10. Yanyan, W.; Yajie, W.; Chenglei, W.; Yinghao, S. A Novel Garbage Images Classification Method Based on Improved VGG. In Proceedings of the 2022 34th Chinese Control and Decision Conference (CCDC), Hefei, China, 15–17 August 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1571–1575.
11. Lin, K.; Zhao, Y.; Wang, L.; Shi, W.; Cui, F.; Zhou, T. MSWNet: A Visual Deep Machine Learning Method Adopting Transfer Learning Based upon ResNet 50 for Municipal Solid Waste Sorting. Front. Environ. Sci. Eng. 2023, 17, 77.
12. Hossen, M.M.; Ashraf, A.; Hasan, M.; Majid, M.E.; Nashbat, M.; Kashem, S.B.A.; Kunju, A.K.A.; Khandakar, A.; Mahmud, S.; Chowdhury, M.E.H. GCDN-Net: Garbage Classifier Deep Neural Network for Recyclable Urban Waste Management. Waste Manag. 2024, 174, 439–450.
13. Hu, H.; Wang, S.; Zhang, C.; Pan, Y. Garbage Image Classification Algorithm Based on Swin Transformer. In Proceedings of the 2023 WRC Symposium on Advanced Robotics and Automation (WRC SARA), Beijing, China, 19 August 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 414–419.
14. Alrayes, F.S.; Asiri, M.M.; Maashi, M.S.; Nour, M.K.; Rizwanullah, M.; Osman, A.E.; Drar, S.; Zamani, A.S. Waste Classification Using Vision Transformer Based on Multilayer Hybrid Convolution Neural Network. Urban Clim. 2023, 49, 101483.
15. Xia, Z.; Zhou, H.; Yu, H.; Hu, H.; Zhang, G.; Hu, J.; He, T. YOLO-MTG: A Lightweight YOLO Model for Multi-Target Garbage Detection. Signal Image Video Process. 2024, 18, 5121–5136.
16. Gupta, T.; Joshi, R.; Mukhopadhyay, D.; Sachdeva, K.; Jain, N.; Virmani, D.; Garcia-Hernandez, L. A Deep Learning Approach Based Hardware Solution to Categorise Garbage in Environment. Complex Intell. Syst. 2022, 8, 1129–1152.
17. Dong, X.; Bao, J.; Chen, D.; Zhang, W.; Yu, N.; Yuan, L.; Chen, D.; Guo, B. CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022.
18. Xiong, R.; Yang, Y.; He, D.; Zheng, K.; Zheng, S.; Xing, C.; Zhang, H.; Lan, Y.; Wang, L.; Liu, T.-Y. On Layer Normalization in the Transformer Architecture. In Proceedings of the 37th International Conference on Machine Learning, Virtual, 13–18 July 2020.
19. Zheng, H.; Wang, G.; Li, X. Swin-MLP: A Strawberry Appearance Quality Identification Method by Swin Transformer and Multi-Layer Perceptron. Food Meas. 2022, 16, 2789–2800.
20. Han, D.; Ye, T.; Han, Y.; Xia, Z.; Pan, S.; Wan, P.; Song, S.; Huang, G. Agent Attention: On the Integration of Softmax and Linear Attention. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2024.
21. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017.
22. Han, D.; Pan, X.; Han, Y.; Song, S.; Huang, G. FLatten Transformer: Vision Transformer Using Focused Linear Attention. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 5938–5948.
23. Roy, A.; Saffar, M.; Vaswani, A.; Grangier, D. Efficient Content-Based Sparse Attention with Routing Transformers. Trans. Assoc. Comput. Linguist. 2021, 9, 53–68.
24. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. arXiv 2019, arXiv:1711.05101.
25. Chen, X.; Liang, C.; Huang, D.; Real, E. Symbolic Discovery of Optimization Algorithms. Adv. Neural Inf. Process. Syst. 2023, 36, 49205–49233.
26. Liang, T.; Glossner, J.; Wang, L.; Shi, S.; Zhang, X. Pruning and Quantization for Deep Neural Network Acceleration: A Survey. Neurocomputing 2021, 461, 370–403.
27. Qin, Z.; Wang, K.; Zheng, Z.; Gu, J.; Peng, X.; Xu, Z.; Zhou, D.; Shang, L.; Sun, B.; Xie, X.; et al. InfoBatch: Lossless Training Speed Up by Unbiased Dynamic Data Pruning. arXiv 2023, arXiv:2303.04947.
28. Nisha, N.N.; Podder, K.K.; Chowdhury, M.E.H.; Rabbani, M.; Wadud, M.S.I.; Al-Maadeed, S.; Mahmud, S.; Khandakar, A.; Zughaier, S.M. A Deep Learning Framework for the Detection of Abnormality in Cerebral Blood Flow Velocity Using Transcranial Doppler Ultrasound. Diagnostics 2023, 13, 2000.
29. Tian, X.; Shi, L.; Luo, Y.; Zhang, X. Garbage Classification Algorithm Based on Improved MobileNetV3. IEEE Access 2024, 12, 44799–44807.
30. Liu, W.; Ouyang, H.; Liu, Q.; Cai, S.; Wang, C.; Xie, J.; Hu, W. Image Recognition for Garbage Classification Based on Transfer Learning and Model Fusion. Math. Probl. Eng. 2022, 2022, 4793555.
31. Qiao, Y.; Zhang, Q.; Qi, Y.; Wan, T.; Yang, L.; Yu, X. A Waste Classification Model in Low-Illumination Scenes Based on ConvNeXt. Resour. Conserv. Recycl. 2023, 199, 107274.
32. Huang, K.; Lei, H.; Jiao, Z.; Zhong, Z. Recycling Waste Classification Using Vision Transformer on Portable Device. Sustainability 2021, 13, 11572.
33. Islam, N.; Hasan Jony, M.M.; Hasan, E.; Sutradhar, S.; Rahman, A.; Islam, M.M. EWasteNet: A Two-Stream Data Efficient Image Transformer Approach for E-Waste Classification. In Proceedings of the 2023 IEEE 8th International Conference on Software Engineering and Computer Systems (ICSECS), Penang, Malaysia, 25–27 August 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 435–440.
34. Han, K.; Xiao, A.; Wu, E.; Guo, J.; Xu, C.; Wang, Y. Transformer in Transformer. Adv. Neural Inf. Process. Syst. 2021, 34, 15908–15919.
35. Heo, B.; Yun, S.; Han, D.; Chun, S.; Choe, J.; Oh, S.J. Rethinking Spatial Dimensions of Vision Transformers. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021.
36. Touvron, H.; Cord, M.; Sablayrolles, A.; Synnaeve, G.; Jegou, H. Going Deeper with Image Transformers. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 32–42.
37. Zhu, L.; Wang, X.; Ke, Z.; Zhang, W.; Lau, R. BiFormer: Vision Transformer with Bi-Level Routing Attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023.
38. Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Liu, Y. VMamba: Visual State Space Model. Adv. Neural Inf. Process. Syst. 2024, 37, 103031–103063.
Figure 1. Illustration of representative images from the MixTrash dataset.
Figure 2. The overall architecture of our proposed HFWC-Net.
Figure 3. The architecture of CST block.
Figure 4. Cross-Shaped Window Self-Attention.
Figure 5. The architecture of Merge block.
Figure 6. The architecture of HFWC Block.
Figure 7. Agent Attention.
Figure 8. Illustration of the LionBatch framework.
Figure 9. Training curve performance of the proposed method and other methods on the Garbage Classification.
Figure 10. Training curve performance of the proposed method and other methods on the TrashNet.
Figure 11. Training curve performance of the proposed method and other methods on the MixTrash.
Figure 12. Scatter plot of MixTrash dataset result comparison.
Figure 13. Confusion matrix.
Table 1. Categories and Sample Counts of the MixTrash Dataset.

Category | Subcategory | Quantity | Subcategory | Quantity | Subcategory | Quantity
Recyclable Waste | anvil | 432 | iron | 83 | scissors | 317
 | ashtray | 137 | kettle | 184 | scoop | 174
 | bag | 463 | keyboard | 85 | seasoningbottle | 436
 | book | 756 | knapsack | 93 | shampoobottle | 417
 | cage | 322 | milkbox | 135 | shoes | 621
 | cap | 103 | mobilephone | 538 | sodacan | 687
 | carton | 586 | mouse | 339 | stapler | 385
 | chair | 66 | oldscale | 402 | steelball | 423
 | chopsticks | 371 | oldclothes | 407 | table | 138
 | cigarettecase | 611 | paperbags | 183 | teapot | 147
 | comb | 186 | papercup | 579 | tincan | 525
 | cosmeticbottles | 414 | patchpanel | 779 | toys | 666
 | counter | 702 | pillow | 382 | trousers | 496
 | desklamp | 68 | pingpangracket | 275 | trunk | 363
 | earphone | 728 | plasticbag | 424 | tshirt | 564
 | electricfan | 536 | plasticbottle | 654 | tweezers | 139
 | electrickettle | 174 | plasticbowl | 417 | tyrepump | 114
 | eyeglass | 93 | plastichanger | 442 | usedtires | 337
 | filepocket | 501 | plushtoys | 703 | vacuumcup | 144
 | foambox | 271 | pot | 466 | vat | 302
 | gasstove | 288 | quilt | 97 | wasterpaper | 567
 | glassbottles | 645 | razor | 315 | watch | 98
 | glasscup | 561 | remotecontrol | 536 | woodenshovel | 156
 | hairdrier | 93 | router | 199
 | hairstick | 116 | ruler | 256
Food Waste | applecore | 679 | eggshell | 401 | rice | 559
 | bananapeel | 622 | fishbone | 419 | snack | 123
 | bingtanghulu | 133 | frenchfries | 76 | vegetableleaf | 558
 | biscuit | 488 | sausage | 807 | wastebone | 379
 | bread | 160 | icecream | 449 | wastemeal | 417
 | cake | 259 | pulp | 433 | watermelonpeel | 517
 | chinese_vermicelli | 67 | residueoftea | 696 | wiltedflowers | 109
Hazardous Waste | battery | 663 | glue | 222 | mosquitoswatter | 295
 | batterybutton | 706 | iccard | 75 | nail | 177
 | bulb | 587 | insecticide | 352 | ointment | 401
 | circuitboard | 299 | thermometer | 553 | powerbank | 402
 | capsule | 709 | modulatortube | 588 | pregnancykit | 661
Residual Waste | adhesivetape | 853 | knittingwool | 82 | stickynote | 142
 | bandaid | 87 | lighter | 440 | strawhat | 210
 | ceramics | 485 | lunchbox | 669 | toothbrush | 481
 | cigarettebutt | 614 | mask | 682 | toothpick | 135
 | cottonswab | 595 | oldgloves | 352 | toothpick | 135
 | desiccant | 533 | pencil | 524 | towel | 148
 | featherduster | 286 | poker | 148 | umbrella | 754
 | firehydrant | 282 | socks | 858 | wetwipes | 193
 | flowerpot | 412 | sponge | 168
Table 2. Machine 1 Basic Properties.

Hardware | Device Information
Video memory | 24 GB
Memory | 128 GB
Hard disk | 1 TB
Processor | Montage Jintide(R) C6230R
Graphics processor | NVIDIA GeForce RTX 3090
Operating system | Windows 11, Python 3.9, PyTorch 2.1
Table 3. Machine 2 Basic Properties.

Hardware | Device Information
Video memory | 12 GB
Memory | 64 GB
Hard disk | 1 TB
Processor | Intel(R) Core(TM) i7-12800F CPU
Graphics processor | NVIDIA GeForce RTX 3080 Ti
Operating system | Ubuntu 20.04 LTS, Python 3.8, PyTorch 2.0
Table 4. Comparison of the results of the proposed method and other classification methods on the Garbage Classification.

Method | Top1-Acc (%) | Precision (%) | Recall (%) | F1-Score (%) | Params (M)
VGGNet | 93.97 | 93.81 | 93.84 | 93.79 | 134.26
GoogleNet | 92.89 | 93.32 | 93.32 | 93.26 | 21.80
ResNet | 93.29 | 93.65 | 93.61 | 93.58 | 23.50
DenseNet | 93.52 | 93.66 | 93.58 | 93.52 | 12.33
MobileNet | 94.74 | 94.63 | 94.65 | 94.59 | 4.32
EfficientNet | 96.71 | 96.51 | 96.48 | 96.47 | 17.55
ConvNext | 96.74 | 96.42 | 96.35 | 96.36 | 49.52
ViT | 94.97 | 94.90 | 94.90 | 94.87 | 22.08
DeiT | 94.87 | 95.11 | 95.06 | 95.02 | 21.71
Swin Transformer | 96.12 | 96.23 | 96.19 | 96.18 | 48.94
CSwin Transformer | 97.03 | 97.13 | 97.08 | 97.05 | 45.27
TNT | 95.45 | 95.38 | 95.32 | 95.30 | 23.42
PiT | 95.29 | 95.55 | 95.52 | 95.48 | 22.96
CaiT | 96.32 | 96.17 | 96.10 | 96.09 | 46.58
BiFormer | 96.45 | 96.43 | 96.39 | 96.38 | 25.09
VMamba | 98.45 | 98.41 | 98.36 | 98.33 | 49.98
HFWC-Net (ours) | 98.89 | 98.93 | 98.87 | 98.82 | 43.18
Table 5. Comparison of the results of the proposed method and other classification methods on the TrashNet.

Method | Top1-Acc (%) | Precision (%) | Recall (%) | F1 Score (%) | Params (M)
VGGNet | 88.27 | 88.15 | 88.07 | 87.94 | 134.26
GoogleNet | 83.50 | 83.61 | 82.90 | 82.50 | 21.80
ResNet | 86.48 | 85.69 | 84.89 | 84.92 | 23.50
DenseNet | 85.09 | 84.50 | 83.50 | 83.29 | 12.33
MobileNet | 86.68 | 86.98 | 85.88 | 85.58 | 4.32
EfficientNet | 88.87 | 88.59 | 88.27 | 88.07 | 17.55
ConvNext | 91.85 | 91.13 | 91.05 | 90.94 | 49.52
ViT | 90.06 | 89.88 | 89.66 | 89.62 | 22.08
DeiT | 88.87 | 86.71 | 86.68 | 86.61 | 21.71
Swin Transformer | 89.66 | 88.46 | 88.47 | 88.32 | 48.94
CSwin Transformer | 92.36 | 92.17 | 92.24 | 92.08 | 45.27
TNT | 90.26 | 89.93 | 89.66 | 89.56 | 23.42
PiT | 88.47 | 88.90 | 88.87 | 88.86 | 22.96
CaiT | 88.67 | 88.06 | 86.28 | 86.66 | 46.58
BiFormer | 90.26 | 89.95 | 90.06 | 89.96 | 25.09
VMamba | 95.83 | 95.78 | 95.57 | 95.48 | 49.98
HFWC-Net (ours) | 96.43 | 96.44 | 96.39 | 96.32 | 43.18
Table 6. Comparison of the results of the proposed method and other classification methods on the MixTrash.

Method | Top1-Acc (%) | Precision (%) | Recall (%) | F1 Score (%) | Params (M) | Training Time per Epoch (mm:ss)
VGGNet | 72.11 | 71.13 | 70.66 | 69.97 | 134.26 | 02:22
GoogleNet | 77.65 | 78.10 | 76.63 | 76.40 | 21.80 | 01:38
ResNet | 80.30 | 80.25 | 79.36 | 78.85 | 23.50 | 01:36
DenseNet | 80.59 | 81.11 | 80.12 | 79.76 | 12.33 | 02:21
MobileNet | 82.37 | 81.95 | 81.35 | 80.92 | 4.32 | 01:37
EfficientNet | 87.30 | 87.46 | 87.19 | 87.00 | 17.55 | 01:59
ConvNext | 88.26 | 88.79 | 88.20 | 87.86 | 49.52 | 03:07
ViT | 83.72 | 83.25 | 83.17 | 82.83 | 22.08 | 01:37
DeiT | 83.56 | 82.27 | 82.33 | 81.92 | 21.71 | 02:11
Swin Transformer | 85.32 | 86.05 | 85.51 | 85.09 | 48.94 | 02:38
CSwin Transformer | 92.50 | 92.74 | 92.37 | 92.19 | 45.27 | 04:28
TNT | 83.64 | 84.22 | 83.36 | 83.11 | 23.42 | 02:46
PiT | 84.48 | 83.88 | 83.85 | 83.54 | 22.96 | 01:37
CaiT | 85.78 | 86.83 | 85.72 | 85.32 | 46.58 | 03:31
BiFormer | 87.28 | 87.83 | 87.24 | 87.12 | 25.09 | 04:50
VMamba | 93.89 | 93.92 | 93.75 | 93.27 | 49.98 | 12:12
HFWC-Net (ours) | 94.35 | 94.48 | 94.22 | 93.92 | 43.18 | 03:53
Table 7. Ablation of the results of the proposed method on the MixTrash.

Model | Agent Attention | Lion | InfoBatch | Top1-Acc (%) | Params (M) | FLOPs (G) | Total Training Time (h)
A | | | | 92.50 | 45.27 | 13.2 | 24.67
B | | | | 93.53 | 43.18 | 12.9 | 24.05
C | | | | 94.04 | 45.27 | 13.2 | 24.97
D | | | | 93.52 | 45.27 | 13.2 | 23.74
E | | | | 94.13 | 43.18 | 12.9 | 24.25
F | | | | 93.99 | 45.27 | 13.2 | 24.08
G | | | | 93.55 | 43.18 | 12.9 | 23.93
H | | | | 94.35 | 43.18 | 12.9 | 24.22
Table 8. GPU Memory Consumption and Inference Time of Models on MixTrash Dataset (Single Image).

Model | GPU Memory (MB) | Average Inference Time (ms)
ResNet | 279.1 | 4.2
EfficientNet | 206.7 | 7.5
ConvNext | 569.8 | 5.4
ViT | 249.5 | 3.1
Swin Transformer | 568.3 | 7.6
CSwin Transformer | 263.4 | 8.3
BiFormer | 290.0 | 14.5
HFWC-Net (ours) | 266.2 | 7.9