Multi-Scale Context Fusion Network for Urban Solid Waste Detection in Remote Sensing Images
Abstract
1. Introduction
- We propose an effective multi-scale context fusion network for detecting urban solid waste in remote sensing images, thereby improving the monitoring of illegal waste dumping. As an intelligent auxiliary tool, this solution can provide a basis for the rational siting and construction of landfill facilities.
- To explore features at different levels, we design an effective guidance fusion module. By using spatial attention mechanisms and large kernel convolutions, it not only helps guide low-level features to retain critical information but also extracts richer features under different receptive fields.
- To capture more representative context information, we introduce a novel context awareness module. By using heterogeneous convolutions and gating mechanisms, it not only captures anisotropic features but also improves feature representation.
- To fuse multi-scale features, we build an innovative multi-scale interaction module. By using cross guidance and coordinate perception, it not only enhances important features but also fuses low-level information with high-level information.
- To substantiate the reliability of our method, we conduct relevant assessments on two representative benchmark datasets. The empirical findings demonstrate that our method is superior to other deep learning models and can achieve consistent performance improvements on different object detectors.
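The guidance fusion idea in the second bullet (a spatial attention gate that helps low-level features retain critical information) can be illustrated with a minimal NumPy sketch. All names, shapes, and the fixed fusion weights here are illustrative assumptions, not the paper's implementation; a trained model would fuse the pooled maps with a learned large-kernel convolution.

```python
import numpy as np

def spatial_attention_gate(x):
    """Compute a CBAM-style spatial gate for a (C, H, W) feature map.

    A trained module would fuse the pooled maps with a large-kernel
    convolution; here we simply average them before the sigmoid.
    """
    avg_map = x.mean(axis=0, keepdims=True)   # (1, H, W) channel average
    max_map = x.max(axis=0, keepdims=True)    # (1, H, W) channel maximum
    fused = 0.5 * (avg_map + max_map)         # stand-in for learned fusion
    return 1.0 / (1.0 + np.exp(-fused))       # sigmoid -> values in (0, 1)

def guide_low_level(low_feat, gate):
    """Reweight low-level features so salient regions are retained."""
    return low_feat * gate                    # gate broadcasts over channels

rng = np.random.default_rng(0)
low = rng.standard_normal((8, 16, 16))        # toy (C, H, W) low-level map
gate = spatial_attention_gate(low)
guided = guide_low_level(low, gate)
```

The gate suppresses background positions and preserves salient ones without changing the tensor shape, which is why it can be dropped between any two feature levels.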
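The third bullet's context awareness module pairs heterogeneous (asymmetric) convolutions with a gating mechanism to capture anisotropic features. The sketch below, again a NumPy approximation with fixed averaging kernels standing in for learned weights, shows the core pattern: a 1×k pass and a k×1 pass capture row-wise and column-wise context separately before a sigmoid gate modulates the input.

```python
import numpy as np

def conv_along_axis(x, kernel, axis):
    """Same-padded 1-D convolution of a 2-D map along one axis."""
    return np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="same"), axis, x)

def anisotropic_context(x, k=3):
    """Capture direction-sensitive context with a 1xk and a kx1 pass,
    then gate the input with the fused response (illustrative only)."""
    kernel = np.ones(k) / k                      # stand-in for learned weights
    horiz = conv_along_axis(x, kernel, axis=1)   # 1xk: row-wise context
    vert = conv_along_axis(x, kernel, axis=0)    # kx1: column-wise context
    fused = horiz + vert
    gate = 1.0 / (1.0 + np.exp(-fused))          # gating mechanism in (0, 1)
    return x * gate

feat = np.arange(16.0).reshape(4, 4)             # toy single-channel map
out = anisotropic_context(feat)
```

Decomposing a k×k kernel into 1×k and k×1 passes is what lets the module respond differently to horizontal and vertical structure, which is useful for elongated waste piles, at a lower cost than a full square kernel.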
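The fourth bullet's multi-scale interaction module combines cross guidance with coordinate perception. The following NumPy sketch (a simplified stand-in; the `fuse_levels` fusion rule and all shapes are assumptions for illustration) shows the coordinate-attention-style idea of pooling along each spatial axis separately so positional information survives, then letting each level's gate reweight the other level before fusion.

```python
import numpy as np

def coordinate_gate(x):
    """Coordinate-attention-style gate for a (C, H, W) map: pool along
    each spatial axis separately to keep positional information."""
    h_desc = x.mean(axis=2, keepdims=True)    # (C, H, 1) row descriptor
    w_desc = x.mean(axis=1, keepdims=True)    # (C, 1, W) column descriptor
    sig = lambda t: 1.0 / (1.0 + np.exp(-t))
    return sig(h_desc) * sig(w_desc)          # broadcasts to (C, H, W)

def fuse_levels(low, high_upsampled):
    """Cross guidance: each level's coordinate gate reweights the
    other level before a simple additive fusion."""
    return (low * coordinate_gate(high_upsampled)
            + high_upsampled * coordinate_gate(low))

rng = np.random.default_rng(1)
low = rng.standard_normal((4, 8, 8))
high = rng.standard_normal((4, 8, 8))         # assumed already upsampled
fused = fuse_levels(low, high)
```

Because the two pooled descriptors keep one spatial axis each, the resulting gate can localize *where* along rows and columns the important responses lie, unlike global average pooling, which collapses all position information.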
2. Materials and Methods
2.1. Datasets
2.2. Model Architecture
2.3. Guidance Fusion Module
2.4. Context Awareness Module
2.5. Multi-Scale Interaction Module
3. Results
3.1. Implementation Details
3.2. Evaluation Metrics
3.3. Performance Comparison
3.4. Generalization Analysis
3.5. Visualization Analysis
3.6. Ablation Studies
4. Discussion
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Wikurendra, E.A.; Csonka, A.; Nagy, I.; Nurika, G. Urbanization and benefit of integration circular economy into waste management in Indonesia: A review. Circ. Econ. Sustain. 2024, 4, 1219–1248. [Google Scholar] [CrossRef]
- Cheng, J.; Shi, F.; Yi, J.; Fu, H. Analysis of the factors that affect the production of municipal solid waste in China. J. Clean. Prod. 2020, 259, 120808. [Google Scholar] [CrossRef]
- Wu, W.; Zhang, M. Exploring the motivations and obstacles of the public’s garbage classification participation: Evidence from Sina Weibo. J. Mater. Cycl. Waste Manag. 2023, 25, 2049–2062. [Google Scholar] [CrossRef]
- Kuang, Y.; Lin, B. Public participation and city sustainability: Evidence from urban garbage classification in China. Sustain. Cities Soc. 2021, 67, 102741. [Google Scholar] [CrossRef]
- Maalouf, A.; Mavropoulos, A. Re-assessing global municipal solid waste generation. Waste Manag. Res. 2023, 41, 936–947. [Google Scholar] [CrossRef] [PubMed]
- Voukkali, I.; Papamichael, I.; Loizia, P.; Zorpas, A.A. Urbanization and solid waste production: Prospects and challenges. Environ. Sci. Pollut. Res. 2024, 31, 17678–17689. [Google Scholar] [CrossRef] [PubMed]
- Teshome, Y.; Habtu, N.; Molla, M.; Ulsido, M. Municipal solid wastes quantification and model forecasting. Glob. J. Environ. Sci. Manag. 2023, 9, 227–240. [Google Scholar]
- Li, Y.; Zhang, X. Intelligent X-ray waste detection and classification via X-ray characteristic enhancement and deep learning. J. Clean. Prod. 2024, 435, 140573. [Google Scholar] [CrossRef]
- Li, Y.; Zhang, X. Relation-aware graph convolutional network for waste battery inspection based on X-ray images. Sustain. Energy Technol. Assess. 2024, 63, 103651. [Google Scholar] [CrossRef]
- Zhang, C.; Zhang, Y.; Lin, H. Multi-scale feature interaction network for remote sensing change detection. Remote Sens. 2023, 15, 2880. [Google Scholar] [CrossRef]
- Cheng, Y.; Wang, W.; Zhang, W.; Yang, L.; Wang, J.; Ni, H.; Guan, T.; He, J.; Gu, Y.; Tran, N.N. A multi-feature fusion and attention network for multi-scale object detection in remote sensing images. Remote Sens. 2023, 15, 2096. [Google Scholar] [CrossRef]
- Li, Y.; Zhang, X. Multi-modal deep learning networks for RGB-D pavement waste detection and recognition. Waste Manag. 2024, 177, 125–134. [Google Scholar] [CrossRef] [PubMed]
- Shang, R.; Zhang, J.; Jiao, L.; Li, Y.; Marturi, N.; Stolkin, R. Multi-scale adaptive feature fusion network for semantic segmentation in remote sensing images. Remote Sens. 2020, 12, 872. [Google Scholar] [CrossRef]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef]
- Zhang, H.; Chang, H.; Ma, B.; Wang, N.; Chen, X. Dynamic r-cnn: Towards high quality object detection via dynamic training. In Proceedings of the 16th European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 260–275. [Google Scholar]
- Zhu, C.; He, Y.; Savvides, M. Feature selective anchor-free module for single-shot object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 840–849. [Google Scholar]
- Yang, Z.; Liu, S.; Hu, H.; Wang, L.; Lin, S. Reppoints: Point set representation for object detection. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9657–9666. [Google Scholar]
- Kim, K.; Lee, H.S. Probabilistic anchor assignment with iou prediction for object detection. In Proceedings of the 16th European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 355–371. [Google Scholar]
- Chen, Q.; Wang, Y.; Yang, T.; Zhang, X.; Cheng, J.; Sun, J. You only look one-level feature. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 13039–13048. [Google Scholar]
- Ge, Z. Yolox: Exceeding yolo series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
- Cheng, Y.; Wang, W.; Ren, Z.; Zhao, Y.; Liao, Y.; Ge, Y.; Wang, J.; He, J.; Gu, Y.; Wang, Y.; et al. Multi-scale feature fusion and transformer network for urban green space segmentation from high-resolution remote sensing images. Int. J. Appl. Earth Obs. Geoinf. 2023, 124, 103514. [Google Scholar] [CrossRef]
- Wang, Z.; Xu, M.; Wang, Z.; Guo, Q.; Zhang, Q. ScribbleCDNet: Change detection on high-resolution remote sensing imagery with scribble interaction. Int. J. Appl. Earth Obs. Geoinf. 2024, 128, 103761. [Google Scholar] [CrossRef]
- Chang, J.; Dai, H.; Zheng, Y. Cag-fpn: Channel self-attention guided feature pyramid network for object detection. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Seoul, Republic of Korea, 14–19 April 2024; pp. 9616–9620. [Google Scholar]
- Dong, J.; Wang, Y.; Yang, Y.; Yang, M.; Chen, J. MCDNet: Multilevel cloud detection network for remote sensing images based on dual-perspective change-guided and multi-scale feature fusion. Int. J. Appl. Earth Obs. Geoinf. 2024, 129, 103820. [Google Scholar] [CrossRef]
- Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the 15th European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
- Zhang, Q.L.; Yang, Y.B. Sa-net: Shuffle attention for deep convolutional neural networks. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Toronto, ON, Canada, 6–11 June 2021; pp. 2235–2239. [Google Scholar]
- Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 3146–3154. [Google Scholar]
- Wang, D.; Zhang, C.; Han, M. MLFC-Net: A multi-level feature combination attention model for remote sensing scene classification. Comput. Geosci. 2022, 160, 105042. [Google Scholar] [CrossRef]
- Chen, Y.; Wang, X.; Zhang, J.; Shang, X.; Hu, Y.; Zhang, S.; Wang, J. A new dual-branch embedded multivariate attention network for hyperspectral remote sensing classification. Remote Sens. 2024, 16, 2029. [Google Scholar] [CrossRef]
- Wu, F.; Hu, T.; Xia, Y.; Ma, B.; Sarwar, S.; Zhang, C. WDFA-YOLOX: A wavelet-driven and feature-enhanced attention YOLOX network for ship detection in SAR images. Remote Sens. 2024, 16, 1760. [Google Scholar] [CrossRef]
- Im, J.; Jensen, J.R.; Jensen, R.R.; Gladden, J.; Waugh, J.; Serrato, M. Vegetation cover analysis of hazardous waste sites in Utah and Arizona using hyperspectral remote sensing. Remote Sens. 2012, 4, 327–353. [Google Scholar] [CrossRef]
- Youme, O.; Bayet, T.; Dembele, J.M.; Cambier, C. Deep learning and remote sensing: Detection of dumping waste using UAV. Proced. Comput. Sci. 2021, 185, 361–369. [Google Scholar] [CrossRef]
- Maharjan, N.; Miyazaki, H.; Pati, B.M.; Dailey, M.N.; Shrestha, S.; Nakamura, T. Detection of river plastic using UAV sensor data and deep learning. Remote Sens. 2022, 14, 3049. [Google Scholar] [CrossRef]
- Liao, Y.H.; Juang, J.G. Real-time UAV trash monitoring system. Appl. Sci. 2022, 12, 1838. [Google Scholar] [CrossRef]
- Zhou, L.; Rao, X.; Li, Y.; Zuo, X.; Liu, Y.; Lin, Y.; Yang, Y. SWDet: Anchor-based object detector for solid waste detection in aerial images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 16, 306–320. [Google Scholar] [CrossRef]
- Sun, X.; Yin, D.; Qin, F.; Yu, H.; Lu, W.; Yao, F.; He, Q.; Huang, X.; Yan, Z.; Wang, P.; et al. Revealing influencing factors on global waste distribution via deep-learning based dumpsite detection from satellite imagery. Nat. Commun. 2023, 14, 1444. [Google Scholar] [CrossRef]
- Zhang, S.; Chi, C.; Yao, Y.; Lei, Z.; Li, S.Z. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9759–9768. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully convolutional one-stage object detection. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9626–9635. [Google Scholar]
- Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 42, 318–327. [Google Scholar] [CrossRef]
- Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 658–666. [Google Scholar]
- Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 13713–13722. [Google Scholar]
- Chen, K.; Wang, J.; Pang, J.; Cao, Y.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; Liu, Z.; Xu, J.; et al. MMDetection: Open mmlab detection toolbox and benchmark. arXiv 2019, arXiv:1906.07155. [Google Scholar]
- Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
- Kong, T.; Sun, F.; Liu, H.; Jiang, Y.; Li, L.; Shi, J. Foveabox: Beyound anchor-based object detection. IEEE Trans. Image Process. 2020, 29, 7389–7398. [Google Scholar] [CrossRef]
- Chen, Z.; Yang, C.; Li, Q.; Zhao, F.; Zha, Z.J.; Wu, F. Disentangle your dense object detector. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual, 20–24 October 2021; pp. 4939–4948. [Google Scholar]
- Feng, C.; Zhong, Y.; Gao, Y.; Scott, M.R.; Huang, W. Tood: Task-aligned one-stage object detection. In Proceedings of the IEEE International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 3490–3499. [Google Scholar]
- Zhang, H.; Wang, Y.; Dayoub, F.; Sunderhauf, N. Varifocalnet: An iou-aware dense object detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 8514–8523. [Google Scholar]
- Ying, Z.; Zhou, J.; Zhai, Y.; Quan, H.; Li, W.; Genovese, A.; Piuri, V.; Scotti, F. Large-scale high-altitude UAV-based vehicle detection via pyramid dual pooling attention path aggregation network. IEEE Trans. Intell. Transp. Syst. 2024. [Google Scholar] [CrossRef]
- Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
- Su, B.; Zhang, H.; Li, J.; Zhou, Z. Toward generalized few-shot open-set object detection. IEEE Trans. Image Process. 2024, 33, 1389–1402. [Google Scholar] [CrossRef] [PubMed]
Method | AP | AP50 | AP75 | APS | APM | APL | AR
---|---|---|---|---|---|---|---
Reppoints [17] | 43.5 | 75.4 | 44.3 | 26.7 | 37.7 | 46.5 | 56.5 |
FoveaBox [46] | 45.6 | 75.7 | 48.9 | 29.8 | 41.0 | 48.1 | 57.2 |
PAA [18] | 46.0 | 77.8 | 48.2 | 35.7 | 40.9 | 48.8 | 61.8 |
FSAF [16] | 46.9 | 76.9 | 49.0 | 28.1 | 42.6 | 49.5 | 58.8 |
DDOD [47] | 49.2 | 78.7 | 51.9 | 36.7 | 44.7 | 51.7 | 60.4 |
TOOD [48] | 50.0 | 78.3 | 55.2 | 44.1 | 46.3 | 52.5 | 61.7 |
VFNet [49] | 50.2 | 78.6 | 53.7 | 35.0 | 43.9 | 53.7 | 61.6 |
ATSS [38] | 50.6 | 78.8 | 54.5 | 35.4 | 44.3 | 53.9 | 61.7 |
YOLOF [19] | 31.3 | 60.2 | 28.4 | 19.0 | 24.8 | 34.8 | 51.6 |
YOLOX-S [20] | 55.3 | 70.6 | 58.1 | 32.8 | 51.7 | 57.3 | 59.4 |
SWDet [36] | - | 77.6 | 58.4 | - | - | - | - |
BCANet [37] | 48.0 | 79.9 | 50.0 | 27.6 | 42.9 | 50.9 | 59.2 |
PDPAPAN [50] | 44.0 | 76.3 | 45.6 | 23.4 | 39.5 | 46.5 | 54.0 |
CAGFPN [24] | 46.5 | 76.4 | 48.7 | 26.2 | 41.2 | 49.3 | 55.9 |
Ours | 58.6 | 81.8 | 65.7 | 40.5 | 54.0 | 60.9 | 66.6 |
Method | AP | AP50 | AP75 | APS | APM | APL | AR
---|---|---|---|---|---|---|---
Reppoints [17] | 36.1 | 63.4 | 37.1 | −1 | 33.8 | 33.7 | 55.0 |
FoveaBox [46] | 38.7 | 62.3 | 40.5 | −1 | 37.4 | 36.3 | 56.1 |
PAA [18] | 36.6 | 61.8 | 36.6 | −1 | 35.6 | 33.8 | 60.0 |
FSAF [16] | 37.9 | 61.9 | 38.5 | −1 | 38.0 | 33.3 | 52.6 |
DDOD [47] | 39.1 | 62.3 | 40.3 | −1 | 38.2 | 35.6 | 56.7 |
TOOD [48] | 37.5 | 60.2 | 38.3 | −1 | 37.4 | 35.2 | 55.8 |
VFNet [49] | 38.3 | 61.4 | 39.7 | −1 | 34.0 | 36.3 | 57.1 |
ATSS [38] | 38.6 | 60.7 | 39.6 | −1 | 39.3 | 35.8 | 56.8 |
YOLOF [19] | 27.3 | 49.3 | 28.1 | −1 | 25.4 | 26.2 | 49.1 |
YOLOX-S [20] | 28.4 | 36.9 | 28.6 | −1 | 34.1 | 21.6 | 34.1 |
BCANet [37] | 39.0 | 64.3 | 40.0 | −1 | 39.1 | 35.6 | 56.3 |
PDPAPAN [50] | 35.8 | 60.6 | 37.6 | −1 | 33.3 | 33.2 | 50.2 |
CAGFPN [24] | 36.9 | 57.9 | 40.0 | −1 | 35.5 | 33.5 | 48.1 |
Ours | 40.3 | 62.8 | 40.7 | −1 | 39.6 | 37.7 | 55.0 |
Method | AP | AP50 | AP75 | APS | APM | APL
---|---|---|---|---|---|---
ks = [1, 3] | 53.8 | 81.6 | 58.1 | 34.4 | 47.4 | 57.0 |
ks = [3, 1] | 54.2 | 79.7 | 59.9 | 38.6 | 47.7 | 57.4 |
ks = [3, 3] | 54.1 | 80.7 | 60.3 | 37.3 | 47.4 | 57.4 |
All | 54.9 | 81.2 | 59.8 | 33.4 | 48.5 | 58.1 |
Method | AP | AP50 | AP75 | APS | APM | APL
---|---|---|---|---|---|---
HLF+MLF | 54.1 | 80.3 | 58.9 | 39.6 | 46.8 | 57.7 |
LLF+MLF | 54.2 | 80.9 | 59.8 | 41.8 | 48.2 | 57.2 |
All | 54.5 | 81.8 | 59.5 | 41.9 | 46.1 | 58.5 |
Method | AP | AP50 | AP75 | APS | APM | APL
---|---|---|---|---|---|---
CAP | 56.2 | 80.3 | 62.8 | 40.3 | 50.8 | 58.9 |
CMP | 56.5 | 82.4 | 62.1 | 41.8 | 49.7 | 59.8 |
CCP | 56.7 | 80.9 | 63.7 | 38.0 | 51.3 | 59.3 |
CAP+CMP | 57.6 | 82.5 | 63.2 | 42.6 | 52.6 | 60.2 |
CMP+CCP | 57.3 | 82.4 | 62.7 | 39.9 | 52.8 | 59.7 |
CAP+CCP | 57.8 | 82.3 | 64.4 | 49.2 | 50.3 | 61.1 |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Li, Y.; Zhang, X. Multi-Scale Context Fusion Network for Urban Solid Waste Detection in Remote Sensing Images. Remote Sens. 2024, 16, 3595. https://doi.org/10.3390/rs16193595