WMN: A Multi-Scale Nested Mixture-of-Experts-Based Method for High-Resolution Remote-Sensing Solid Waste Site Extraction and Monitoring

Wang, Kaiqi; Liu, Jianhua; Li, Chen; Yu, Bing

doi:10.3390/app16126259


¹ The School of Geomatics and Urban Spatial Informatics, Beijing University of Civil Engineering and Architecture, Beijing 100044, China

² Mobile Geospatial Big Data Cloud Service Innovation Team, Beijing 100044, China

³ Aerospace Remote Sensing Intelligent Computing Joint Laboratory, Beijing 100044, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(12), 6259; https://doi.org/10.3390/app16126259 (registering DOI)

Submission received: 18 May 2026 / Revised: 14 June 2026 / Accepted: 18 June 2026 / Published: 22 June 2026

(This article belongs to the Section Environmental Sciences)

Abstract

Accurate and automated extraction of solid waste sites from remote-sensing imagery constitutes a pivotal demand for contemporary environmental regulation and risk mitigation. However, in high-resolution remote-sensing imagery, solid waste sites are typically represented as a single semantic image object (SIO), which is composed of multiple physical image parcels (PIPs) exhibiting significant variations in scale, morphology, and spectral properties. This intrinsic heterogeneity substantially increases the complexity and uncertainty of multi-class site identification. To address this challenge, this paper proposes WasteMOE Net (WMN), which is developed based on the core concept of modeling the SIO–PIP relationship. WMN adopts a heterogeneous expert selection mechanism combined with a nested mixture-of-experts architecture. It thus enables adaptive perception of complex PIPs across diverse scenarios and their integrated discrimination at the SIO level. In addition, by incorporating the explicit nonlinear representation capability of the KAN network, WMN effectively improves multi-class recognition accuracy while maintaining computational efficiency. Furthermore, this study constructs a high-resolution solid waste site dataset in accordance with the SIO–PIP-aware annotation principle, encompassing five representative categories: tailings ponds (TP), construction spoil sites (CSS), landfill sites (LS), garbage dump sites (GDS), and excavation sites (ES). Experimental results show that WMN achieves mAP50 values of 74.2% (GDS), 63.5% (CSS), 80.9% (ES), 85.4% (TP), and 83.1% (LS) in detection tasks, and 75.4%, 64.1%, 83.0%, 86.7%, and 84.1% for the corresponding categories in segmentation tasks. It achieves competitive performance compared with state-of-the-art methods in both tasks. Further, in a real-world application over Loudi City, China, WMN completed the processing of a 490.67 km² area within 1.34 h. 
The recognition accuracies for GDS and ES reached 54.8% and 65.3%, respectively. Finally, the proposed method has been successfully integrated into a GIS-based solid waste pollution risk prevention system, which markedly boosts the overall efficiency of environmental monitoring and on-site inspections.

Keywords: solid waste site extraction; mixture of experts (MOE); KAN; environmental monitoring; instance segmentation

1. Introduction

Inappropriate disposal of solid waste often leads to the release of toxic and hazardous substances, resulting in unregulated environmental pollution incidents. Under rainfall, pollutants may disperse with runoff and infiltrate groundwater systems, causing further ecological degradation. Moreover, under extreme weather conditions, large accumulations of solid waste without effective supervision are susceptible to secondary disasters, such as waste-pile collapse, drainage system blockage, and fire or toxic gas release caused by prolonged sun exposure. Therefore, timely detection, effective treatment, and standardized management of illegal solid waste dumping sites are of great significance for safeguarding ecological and environmental security [1]. China is one of the major producers of solid waste in the world, producing more than 10 billion tons annually, with a considerable portion indiscriminately discarded without standardized treatment. Solid waste sites pose dual environmental risks of pollution and slope instability, and have become a major latent threat to environmental protection [2].

In recent years, China has made consistent progress in the standardized disposal of municipal solid waste and hazardous waste. In June 2024, the Ministry of Ecology and Environment of China, in collaboration with several national agencies and with the approval of the State Council, initiated a three-year nationwide campaign targeting the illegal dumping and disposal of solid waste. The campaign aims to strengthen environmental supervision and curb the recurrent incidents of illegal solid waste transfer and disposal across the country [3]. Nevertheless, the standardized management of general industrial solid waste remains insufficient, and a large legacy inventory of various solid waste sites persists. Moreover, the spatial database of existing dumping sites is incomplete, making targeted supervision difficult. Motivated by economic gains, some enterprises engage in illegal dumping and extraction, even converting prime farmland into solid waste sites.

Traditional solid waste monitoring relies heavily on labor-intensive and time-consuming field inspections, making comprehensive assessment difficult under complex operating conditions [4]. With advances in remote-sensing interpretation, high-resolution satellite imagery combined with intelligent recognition models and GIS systems provides an efficient alternative for large-scale solid waste monitoring and management [5]. Compared with existing government and commercial monitoring systems that primarily focus on suspected-site screening and regulatory support, the proposed system emphasizes the automated detection and instance segmentation of solid waste sites from high-resolution remote-sensing imagery, while directly integrating model outputs into a GIS platform for one-click assignment of field verification tasks to volunteers. Although remote sensing offers advantages in accuracy, timeliness, and spatial coverage, the more fundamental challenge, and a key open research problem, is extracting multiple categories of solid waste sites efficiently and accurately from complex remote-sensing scenes.

This paper investigates intelligent detection methods for five common solid waste sites—tailings ponds (TP), construction spoil sites (CSS), landfill sites (LS), garbage dump sites (GDS), and excavation sites (ES)—based on practical regulatory requirements. Tailings ponds are artificial dam facilities for storing tailings slurry discharged after mineral processing; construction spoil sites refer to designated sites for temporary or long-term storage of solid waste such as construction debris and excavated soil generated during building projects; landfill sites denote engineered facilities specifically designed for the burial treatment of solid waste; garbage dump sites indicate temporary or permanent sites for centralized accumulation of solid waste, including household and industrial garbage; excavation sites refer to specific operational areas for mining mineral resources or excavating earth and rock. More detailed definitions and characteristic descriptions of each solid waste site category are provided in Appendix B, Table A7.

In recent years, deep learning has been widely applied to remote-sensing feature extraction owing to its strong feature representation capability. Representative models, including U-Net [6], Mask-RCNN [7], and Transformers [8], have achieved substantial progress in detection and segmentation tasks. However, practical engineering applications require a balance among accuracy, lightweight deployment, and detection efficiency; although Transformer-based models often perform well, they are generally less lightweight and efficient than convolution-based models.

Deep learning-based detection of solid waste sites in remote-sensing imagery has been widely investigated [1]. However, large-scale images usually need to be cropped into smaller patches before model inference due to computational and methodological constraints. In practical regional-scale applications, solid waste sites occupy only a small proportion of the imagery, causing most cropped patches to contain no target objects and thereby increasing the risk of extraction errors [9].

(1): Solid Waste Site Detection Using Multi-Source Spatiotemporal Data

Zhang et al. [10] and Lavender et al. [11] utilized manually extracted features (such as NDVI and NDWI) as auxiliary inputs for model branches, effectively enhancing the model’s discrimination capability for land cover types like water bodies, vegetation, and bare ground, and reducing false detection rates. Kruse et al. [12] designed pixel-level and block-level neural network classifiers to extract features from the spectral and spatial dimensions of multispectral images, respectively. By jointly training these models across the temporal dimension, they achieved more refined classification of solid waste sites. Yailymova et al. [13] combined historical remote-sensing data with vegetation indices and land surface temperature (LST) derived from multispectral imagery to analyze the temporal dynamics of landfills. Zhang et al. [14] employed a multi-branch decision tree machine learning voting method to filter deep learning prediction results using land use, road networks, building rooftops, and other information, thereby reducing misidentification. However, these approaches rely on auxiliary data sources or prior features (such as water body indices), significantly increasing their implementation costs. At the same time, environmental protection enterprises or organizations encounter practical obstacles in adopting this emerging technology due to a lack of specialized remote-sensing expertise, which hinders their ability to acquire imagery, screen for image quality, and perform image preprocessing. This study integrates practical engineering applications by utilizing widely accessible Jilin-1 satellite data and Google Earth imagery (or remote-sensing data of similar spatial resolution), aiming to reduce data complexity and acquisition costs.

(2): Solid Waste Site Detection in High-Resolution Remote-Sensing Imagery with Deep Learning Integration

Several researchers [15,16,17,18] have made preliminary attempts to apply visual domain networks to the detection of illegal solid waste sites in high-resolution remote-sensing imagery. Subsequently, to enhance the accuracy of solid waste site detection, two primary approaches have emerged: ① integration of multiple deep learning models; ② targeted modifications to the internal architecture of deep learning models.

①: Integration of multiple deep learning models

Wang et al. [19] incorporated a scene recognition model into the detection process. This model first determines whether the cropped image contains solid waste from a global perspective before deciding whether to employ the extraction model. This approach effectively eliminates interference from urban areas, forests, mountainous regions, and other non-target environments. Yu et al. [20] designed a two-stage model: the first stage extracts solid waste with high inter-class similarity, while the second stage further subclassifies the segmented solid waste. This effectively resolves the issue of low segmentation accuracy for small-area solid waste instances. However, the serial connection of multiple task models heavily relies on the accuracy of the first-stage model, and the cumulative error from dual-model serialization remains a significant issue. Yong et al. [21] employed a voting classification approach using multiple deep learning models for solid waste site detection. This method mitigates recognition biases and local misclassifications inherent in single models under specific conditions, enhancing overall detection stability and accuracy to achieve more robust solid waste identification. While parallel multi-task models effectively enhance final accuracy, they overlook the sharp growth in computational cost incurred by deploying multiple models.

②: Targeted modifications to the internal architecture of deep learning models

Sun et al. [22] designed a Block-Channel Attention (BCA) module based on the FasterRCNN model [23] to enhance the model’s spatial and channel attention to features, thereby improving its understanding of irregular solid waste regions. Zhou et al. [24] developed the Asymmetric Deep Aggregation (ADA) network and the Efficient Attention Fusion Pyramid Network (EAFPN) to strengthen the model’s representation capabilities in complex backgrounds and its ability to capture scattered features of solid waste. Li et al. [25] developed a Location-Guided Multi-Enhanced Key Point Network (LKN-ME). By predicting key points in potential solid waste sites, the model focuses on locations likely containing solid waste, thereby improving detection accuracy. These approaches effectively enhance the model’s feature extraction capabilities for solid waste sites and its discrimination ability in complex backgrounds without a significant increase in computational costs.

After comprehensively analyzing the aforementioned solid waste site detection methods, we conclude that when applying solid waste engineering solutions, selecting targeted improvements to the internal structure of deep learning models offers greater advantages in terms of data costs, time costs, and computational costs.

However, existing network architectures still largely rely on single-path feature extraction, which limits their ability to capture the heterogeneous characteristics of solid waste sites in remote-sensing imagery, including scale variations, irregular shapes, and complex backgrounds. Although the YOLO series [26] has achieved a favorable balance between lightweight deployment and detection efficiency, current YOLO models still produce frequent false detections in solid waste site detection tasks [27]. For example, reservoirs and terraced fields may be misidentified as tailings ponds; building facades, roofs, or earth-toned plots as construction spoil sites; building shadows, solar panels, or dark-textured areas as landfill sites; bright gray-white surfaces as garbage dump sites; and concave bright valleys in mountainous regions as excavation sites. These errors increase the need for manual screening and reduce the efficiency of practical monitoring.

In high-resolution remote-sensing imagery, semantic image objects (SIOs) of the same category are composed of multiple heterogeneous physical image parcels (PIPs) arranged according to specific spatial patterns [28]. SIOs typically comprise a group of PIPs sharing similar characteristics, exhibiting consistency in both geometric and spectral attributes. PIPs, in turn, refer to adjacent pixel clusters satisfying specific geometric and spectral feature patterns. Analysis of SIOs within solid waste sites and their constituent PIPs reveals two primary causes of misdetection issues:

I. Compared to rooftops, water bodies, roads, and other types of land cover, solid waste sites typically comprise more diverse and heterogeneous PIPs, as illustrated in Figure 1. The highly heterogeneous morphology of solid waste sites makes it difficult for models to extract stable intrinsic null-space correlation features from the complex and diverse PIPs, thereby increasing the difficulty of model interpretation for this type of SIO.

II. When distinguishing among multiple feature classes, the PIPs constituting the SIOs of solid waste sites may exhibit a high degree of similarity with the PIPs of other feature types. For instance, the SIOs of tailings ponds share similar water PIPs with reservoirs and ponds; excavation site SIOs exhibit similar stereoscopic shadow PIPs as mountainous gullies; landfill site SIOs resemble solar photovoltaic panels with similar dark block-like PIPs; construction spoil site SIOs display piled-structure PIPs similar to plowed fields; and garbage dump site SIOs share high-intensity PIPs with reflective surfaces. As shown in Figure 2, these cases, where different feature types possess PIPs with similar compositions, further increase the complexity and uncertainty of the model when identifying among SIOs.

Although existing methods have achieved some progress in solid waste target recognition by leveraging attention mechanisms, feature aggregation structures, and key point guidance, they encounter difficulties in precisely distinguishing the PIPs and the spatial composition features of SIO elements when objects in remote-sensing imagery display complex details at different scales. This leads to a high rate of false positives and missed detections. To address these issues, we introduce a Mixture of Experts (MOE) model [29,30,31] into our framework. Additionally, to improve both segmentation accuracy and efficiency, we incorporate a KAN linear layer [32] based on the design principles of YOLACT [33]. The main contributions of this study are as follows:

(1): We propose WMN, a novel network specifically designed for extracting solid waste sites from high-resolution remote-sensing imagery, which enables complex scene understanding through adaptive perception and expert collaboration. The design of WMN addresses a core challenge in this task: solid waste sites are represented as a single SIO at the semantic level, yet they are composed of multiple PIPs with significant variations in scale, morphology, and spectral characteristics. To address this issue, WMN introduces a MOE–based perception paradigm and incorporates two task-oriented modules: the Dynamic Adaptive Receptive-field Mixture of Experts (DARF-MOE) and the Nested Mixture of Experts (NST-MOE). Specifically, DARF-MOE dynamically adjusts receptive fields to accommodate scale and structural heterogeneity among PIPs within an SIO, whereas NST-MOE is designed to model higher-order semantic discrepancies between different SIOs under conditions where PIPs exhibit strong visual similarity. Together, these modules enable fine-grained perception and robust recognition of solid waste sites in complex backgrounds.
(2): We introduce a KAN linear layer to enhance the efficiency of the model in engineering applications. By learning the optimal mask coefficients, we can achieve accurate solid waste site segmentation with fewer parameters. Moreover, by leveraging the interpretability of the KAN linear layer, explicit formulas can be established to explain which features determine the generation of these “mask coefficients”, thereby providing a quantitative description of the interpretable relationship between PIPs and SIOs.
(3): We built a high-resolution remote-sensing dataset for solid waste site extraction, covering five categories: TP, CSS, LS, GDS, and ES. Annotations follow the SIO–PIP relationship: each waste site is an SIO, and its annotation boundary is defined by its core internal PIPs—specifically selected based on composition and spatial organization. For each SIO category, we explicitly define the essential PIPs and exclude non-representative ones that cause semantic drift. This minimizes ambiguity from background clutter and locally similar PIPs.
(4): A “GIS-based remote-sensing solid waste pollution risk prevention system” was developed to help practitioners supervise local solid waste sites more conveniently and efficiently. This system has been practically applied by a social environmental protection NGO in Changsha, China, to monitor solid waste sites across Hunan Province, and it has achieved promising results. In this study, a distinct empirical experiment was conducted in Loudi City, Hunan Province. The empirical outcomes can be obtained upon request by contacting the corresponding author.

The remainder of this paper is organized as follows. Section 2 presents the details of the proposed method and the construction of the dataset. Section 3 reports the experimental results and analysis. Section 4 provides a case study. Section 5 concludes the paper.

2. Methods

This study proposes an automated method that detects and segments solid waste sites accurately over large areas within a short period. Specifically, to better adapt to the complexity of ground objects in remote-sensing imagery, two plug-and-play expert groups were constructed.

(1) Inspired by previous studies on multi-scale feature fusion for enhancing object perception [34], we designed four expert structures with different receptive fields and feature extraction mechanisms. The model employs routing to select the two most suitable experts for feature fusion, extracting more discriminative multi-level contextual information—customized to the texture, morphology, and spatial distribution of solid waste sites—at the current scale.

(2) The second module group is designed as a nested expert structure based on a dual routing mechanism. It consists of $I$ independent expert groups, each containing $J$ identical convolutional experts, totaling $I \times J$ sub-experts. First, the external router selects the expert group most suitable for the current input, followed by the internal router selecting the optimal sub-experts within that group, implementing a coarse-to-fine, hierarchical feature modeling strategy. This nested design enhances the model’s ability to represent multi-dimensional features (including texture, spectral, and morphological characteristics) of different types of solid waste sites, allowing each expert to focus on learning its specialized feature dimension, thereby improving overall recognition accuracy and robustness. Under the hardware and dataset conditions of this study, a series of comprehensive experiments demonstrated that setting $I$ and $J$ to 4 each achieved favorable performance.

Next, for the task of instance segmentation of multiple types of solid waste sites in imagery, we draw on the design concept of YOLACT by formulating the segmentation process as a linear combination of prototype masks and mask coefficients. To enhance both the expressive power and interpretability of the mask coefficients, we introduced a KAN linear layer in the coefficient prediction branch. This nonlinear layer maintains a low parameter count while providing strong feature transformation capability, efficiently mapping features from the initial space to a feature space better aligned with task semantics. This enhances the accuracy of mask coefficient prediction and improves the interpretability of the coefficient generation process. Figure 3 illustrates the overall implementation pipeline of the proposed module embedded in the YOLOv11 network.

2.1. DARF-MOE Module

The receptive field refers to the effective input region perceived by a neuron in a neural network. By leveraging receptive fields of different sizes, the network can capture hierarchical contextual information, thereby improving its performance in segmentation tasks [35]. Both Spatial Pyramid Pooling (SPP) [36] and Atrous Spatial Pyramid Pooling (ASPP) [37,38] have long proven that feature extraction using convolutions with different receptive fields can enhance a model’s ability to comprehend contextual information. However, Kim B. J. et al. [39] experimentally confirmed that the ASPP module with fixed parameters produces a static receptive field size. Current receptive field structures are often relatively rigid, which restricts the ability to flexibly model global and local semantic information in complex scenes. This inadequacy makes it difficult to fully adapt to various PIPs within solid waste sites, leading to misdetections in SIO extraction tasks. This is precisely the underlying cause of Challenge I highlighted in the introduction.

To address this issue, this paper customizes the Experts module based on the MOE structure and proposes the DARF-MOE module, as shown in Figure 4. Specifically, the experts in DARF-MOE include the ASPP expert with multi-scale receptive field capability, which enhances adaptability to multi-scale targets through convolutions with different receptive fields; the Convolutional Block Attention Module (CBAM) [40] expert, which emphasizes the saliency of feature maps in both channel and spatial dimensions to improve key information extraction; the Morphological Edge Expert (MEE), which strengthens the focus on object edge regions; and the Stable Expert (SE), which provides stable features when other experts are engaged in complex structure perception.

This module integrates the four expert sub-modules that enhance the feature perception range. The “Router module” dynamically routes the input features to selectively fuse the most appropriate expert features, thereby facilitating the adaptive combination of receptive fields for targets of different sizes and contexts. This mechanism effectively breaks through the limitation of fixed receptive fields in traditional convolution, enhancing the model’s capability to represent diverse targets under various environments. In addition, the MOE structure reduces the computational cost required for actual forward propagation through sparse activation, thus improving inference efficiency while maintaining strong modeling capability.

As illustrated in Figure 3, the NECK component outputs feature maps at three different scales, which are then fed into the HEAD component. In this study, the collection of multi-scale features is uniformly denoted as $F_{\mathrm{NECK}}$. Since the operations within the HEAD are consistent across different scales, to avoid redundant representation, we use $F_{\mathrm{NECK}}$ to denote the feature at an arbitrary scale in the following analysis. Its mathematical formulation is given as follows:

$$\mathrm{Router}_{i} = \mathrm{Softmax}\left(\mathrm{Linear}\left(\mathrm{AvgPool}\left(\mathrm{Conv}_{3 \times 3}(F_{\mathrm{NECK}})\right)\right)_{i} + b_{i}\right) \tag{1}$$

Here, $\mathrm{AvgPool}$ denotes the global average pooling operation, which compresses the spatial dimensions and extracts global semantic features, thereby producing a global descriptor that characterizes the overall semantic distribution of the current scene. $\mathrm{Linear}$ represents a fully connected layer that maps the global semantic vector to routing scores corresponding to the number of experts. To enhance expert load balancing, a learnable expert-specific bias term $b_{i}$ is incorporated into the routing scores. The scores are then normalized using the $\mathrm{Softmax}$ function to obtain a probability-based fusion weight vector, denoted as $\mathrm{Router}$, with dimensionality equal to the number of experts $N$. This vector characterizes the relative preference of the current input feature for different experts.

In this study, a Hard Top-K strategy is adopted for sparse expert activation, where only the top-$k$ experts with the highest routing probabilities are retained (with $k = 2$ in this work), and the weights of the remaining experts are set to zero:

$$m_{i} = \begin{cases} 1, & \text{if } i \in \text{Top-}k(\mathrm{Router}) \\ 0, & \text{otherwise} \end{cases} \tag{2}$$

The output of DARF-MOE is obtained by the weighted summation of the activated expert features:

$$F_{\mathrm{DARF\text{-}MOE}} = \sum_{i = 1}^{N} m_{i} \cdot \mathrm{Router}_{i} \cdot F_{\mathrm{experts}\text{-}i} \tag{3}$$
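Equations (1) to (3) can be traced on a toy input. The sketch below is a minimal illustration, not the authors' implementation: it assumes the output of the Linear layer is already available as a list of per-expert scores, and it replaces the four real experts with small hypothetical feature vectors. The function names (`softmax`, `hard_top_k_mask`, `darf_moe_output`) are illustrative only.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def hard_top_k_mask(probs, k):
    """Eq. (2): m_i = 1 for the k largest routing probabilities, else 0."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [1 if i in top else 0 for i in range(len(probs))]

def darf_moe_output(logits, bias, expert_features, k=2):
    """Route, mask, and fuse expert features for one input.

    logits: assumed output of the Linear layer, one score per expert.
    bias: expert-specific bias terms b_i.
    expert_features: one toy feature vector per expert (hypothetical values).
    """
    router = softmax([s + b for s, b in zip(logits, bias)])    # Eq. (1)
    mask = hard_top_k_mask(router, k)                           # Eq. (2)
    fused = [0.0] * len(expert_features[0])
    for m_i, r_i, feat in zip(mask, router, expert_features):   # Eq. (3)
        if m_i:  # inactive experts contribute nothing
            for d, value in enumerate(feat):
                fused[d] += r_i * value
    return router, mask, fused

router, mask, fused = darf_moe_output(
    logits=[2.0, 0.5, 1.0, -1.0],
    bias=[0.0, 0.0, 0.0, 0.0],
    expert_features=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]],
)
```

With these scores, the first and third experts are selected. Following Equation (3) as written, the retained weights are not renormalized after masking, so the fused feature is scaled by the total probability mass of the two selected experts.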

In this study, no temperature parameter is introduced. Instead, an implicit load-balancing strategy based on dynamic expert-bias updating is adopted. For each training batch, the number of samples assigned to each expert, denoted as $\mathrm{count}_{i}$, is recorded, and the average load $\bar{c}$ is computed as

$$\bar{c} = \frac{1}{N} \sum_{i} \mathrm{count}_{i} \tag{4}$$

The expert bias terms are then updated according to the following rule:

$$b_{i} \leftarrow b_{i} + \eta \left(\bar{c} - \mathrm{count}_{i}\right) \tag{5}$$

where $\eta$ denotes a small-step update rate.

Through online adjustment of expert selection tendencies, this mechanism enables adaptive load balancing among experts. It improves expert utilization uniformity without introducing additional auxiliary loss terms, thereby constituting a lightweight dynamic regulation strategy.
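The bias update of Equations (4) and (5) amounts to a few lines of arithmetic. The following sketch assumes a hypothetical batch of eight samples with $k = 2$ (so the counts sum to 16) and an illustrative update rate of 0.01; neither value is taken from the paper.

```python
def update_expert_bias(bias, counts, eta=0.01):
    """Apply b_i <- b_i + eta * (c_bar - count_i) to every expert."""
    c_bar = sum(counts) / len(counts)                                # Eq. (4)
    return [b + eta * (c_bar - c) for b, c in zip(bias, counts)]    # Eq. (5)

bias = [0.0, 0.0, 0.0, 0.0]
counts = [10, 2, 3, 1]  # hypothetical per-batch assignment counts
new_bias = update_expert_bias(bias, counts, eta=0.01)
```

The overloaded first expert receives a negative bias, which lowers its routing score in the next batch, while the under-used experts are nudged upward. Because the deviations from the mean sum to zero, the update shifts preference among experts without changing the total bias.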

2.2. NST-MOE Module

Traditional detection heads with insufficient architectural design often struggle to accurately extract semantic information from rich features, which limits the model’s understanding and predictive capabilities for targets and consequently lowers overall detection accuracy [41]. The root cause of misdetection problem II mentioned in the introduction essentially lies in the classification head’s insufficient modeling capacity to capture fine-grained semantic differences among multiple target categories. In particular, when different target categories exhibit visually similar PIPs, the model fails to effectively extract discriminative contextual features for category differentiation, resulting in class confusion and misidentification.

MOE enables each expert module to specialize in distinct feature subspaces or task subdomains, improving representational capacity and prediction accuracy [42]. To this end, we propose an innovative NST-MOE module, which aims to strengthen the semantic modeling capability of the network by introducing deeper structures within the experts. As the second component of the classification head, this module further improves the model’s ability to represent and discriminate complex solid waste sites. Unlike the DARF-MOE module, the sub-experts in the NST-MOE module are not designed with differentiated receptive fields; instead, a uniform bottleneck structure [43] is adopted. Under this consistent architecture, the parameters are trained to automatically learn representations adapted to different semantic features. Specifically, a mixture-of-experts system composed of $I \times J$ expert modules is constructed, with each expert module maintaining the same structure. This design enables the model to perceive and model geospatial objects from multiple perspectives and scales, thereby enhancing its adaptability to semantic differences in complex scenarios.

On this basis, we introduce an external expert-group mechanism that partitions the $I \times J$ experts into $I$ functional expert subgroups. Through a two-level nested design (“external grouping + internal specialization”), each expert group is encouraged to focus on learning a specific type of feature representation within its designated functional domain. This structure enables both specialization and diversity in feature extraction. Furthermore, a gating-based routing mechanism is employed to dynamically schedule expert groups, allowing the model to adaptively select the optimal expert combination according to the feature distribution of the input data. This design enhances the robustness and generalization capability of the model under diverse remote-sensing environments. For the $i$-th external expert group, the feature extraction process of its $j$-th internal sub-expert can be formulated as follows:

$$e_{ij}(F) = \mathrm{Conv}_{1 \times 1}^{(2)}(F) \tag{6}$$

Here, $\mathrm{Conv}_{1 \times 1}^{(2)}$ denotes two successive $1 \times 1$ convolution operations.

Within each external expert group, a routing mechanism consistent with Equation (1) is employed to generate the weights of the internal sub-experts. The output of the internal expert group is computed as the weighted summation of the activated sub-expert features:

$$E_{i}(F) = \sum_{j = 1}^{J} p_{ij} \cdot \alpha_{ij} \cdot e_{ij}(F) \tag{7}$$

where $\alpha_{ij}$ denotes the probability-based weight generated by the internal router, and $p_{ij}$ represents the activation mask obtained under the Hard Top-K strategy.

Subsequently, a gating-based routing mechanism is introduced at the external group level to dynamically schedule the $I$ expert groups. The final output of NST-MOE is given by

$$F_{NST\text{-}MOE} = \sum_{i=1}^{I} P_{i} \cdot \beta_{i} \cdot E_{i}(F_{DARF\text{-}MOE}) \tag{8}$$

where $\beta_{i}$ denotes the weight of the $i$-th expert group generated by the external router, and $P_{i}$ represents the corresponding activation mask at the group level. The external routing mechanism follows the same structural formulation as the internal routing, employing Softmax normalization combined with Hard Top-K sparse selection.
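The two-level routing of Equations (7) and (8) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the toy scaling experts, and the fixed router logits are assumptions made for illustration, whereas in the actual network the logits would come from learned routers and the experts would be the bottleneck blocks of Equation (6). The sketch also assumes that the retained weights are not renormalized after Hard Top-K selection, which is how Equations (7) and (8) read.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def hard_top_k_mask(weights, k):
    """Return a 0/1 mask that keeps only the k largest weights."""
    keep = sorted(range(len(weights)), key=lambda n: weights[n], reverse=True)[:k]
    return [1.0 if n in keep else 0.0 for n in range(len(weights))]

def route(outputs, logits, k):
    """Masked weighted sum: sum_n mask_n * weight_n * output_n."""
    weights = softmax(logits)
    mask = hard_top_k_mask(weights, k)
    dim = len(outputs[0])
    return [
        sum(mask[n] * weights[n] * outputs[n][d] for n in range(len(outputs)))
        for d in range(dim)
    ]

def nested_moe(feature, experts, inner_logits, outer_logits, k_inner, k_outer):
    """experts[i][j] plays the role of e_ij; inner routing gives E_i, outer routing fuses them."""
    group_outputs = []
    for i, group in enumerate(experts):
        sub_outputs = [expert(feature) for expert in group]                 # e_ij(F)
        group_outputs.append(route(sub_outputs, inner_logits[i], k_inner))  # Eq. (7)
    return route(group_outputs, outer_logits, k_outer)                      # Eq. (8)

def scale(factor):
    """Hypothetical stand-in expert that scales the feature vector."""
    return lambda feature: [factor * v for v in feature]

# Toy configuration with I = 2 groups and J = 2 sub-experts per group.
experts = [[scale(1.0), scale(2.0)], [scale(3.0), scale(4.0)]]
inner_logits = [[0.0, math.log(3.0)], [math.log(3.0), 0.0]]
outer_logits = [0.0, math.log(3.0)]
fused = nested_moe([1.0, 2.0], experts, inner_logits, outer_logits, k_inner=1, k_outer=2)
print(fused)
```

With these toy logits, each softmax yields weights of about 0.25 and 0.75, so the fused vector is approximately [2.0625, 4.125].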

2.3. KAN Mask Coefficient Prediction Module

YOLACT is a structurally concise yet computationally efficient instance segmentation model. Its core idea is to learn prototype masks that can be shared across different target categories and to generate corresponding mask coefficients based on specific task requirements. The final segmentation result is constructed by linearly combining these two components. However, in remote-sensing images, targets are highly diverse, object features vary significantly with scale, and complex background interference is prevalent, which makes the semantic differences between objects more subtle.

To address this challenge, the proposed method enhances the modeling capability of target spatial structures and the ability to interpret semantic differences in complex scenes from multiple perspectives through the DARF-MOE and NST-MOE modules, effectively strengthening the representational power of the prototype masks. Moreover, the accuracy of the mask coefficients is a key factor in determining the final segmentation results. To this end, a KAN linear layer is further introduced, as shown in Figure 5, to infer the mask coefficients, leveraging its strong nonlinear mapping capability in shallow structures. Through this improvement, the expressive power of both the prototype masks and the mask coefficients is enhanced simultaneously, while the number of each is reduced by half, which significantly lowers the computational overhead. As a result, this optimization not only improves inference efficiency but also achieves higher segmentation accuracy, effectively mitigating the impact of error detection issue II described in the Introduction. The mathematical formulation is as follows:

$$y = \sum_{i=1}^{n} \sum_{j=1}^{m} \alpha_{i,j} B_{j}(x_{i}) + b \tag{9}$$

Here, $B_{j}(\cdot)$ denotes the basis function indexed by $j$, which in this study is a cubic spline function; $\alpha_{i,j}$ represents the learnable weight corresponding to the basis function indexed by $j$ for the input indexed by $i$; $m$ denotes the number of basis functions assigned to each feature; $n$ denotes the dimensionality of the input features; and $b$ is the additive bias term.
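Equation (9) can be evaluated directly with a short pure-Python sketch. The paper states only that the basis functions are cubic splines; the use of B-spline bases on a uniform knot grid (a common choice in KAN implementations), the single scalar output, and all function names below are assumptions made for illustration. A real mask-coefficient head would produce one such output per coefficient.

```python
def bspline_basis(j, degree, knots, x):
    """Cox-de Boor recursion for the j-th B-spline basis of the given degree."""
    if degree == 0:
        return 1.0 if knots[j] <= x < knots[j + 1] else 0.0
    left_den = knots[j + degree] - knots[j]
    right_den = knots[j + degree + 1] - knots[j + 1]
    left = 0.0
    right = 0.0
    if left_den != 0.0:
        left = (x - knots[j]) / left_den * bspline_basis(j, degree - 1, knots, x)
    if right_den != 0.0:
        right = (knots[j + degree + 1] - x) / right_den * bspline_basis(j + 1, degree - 1, knots, x)
    return left + right

def kan_output(x, alpha, bias, knots, degree=3):
    """Evaluate y = sum_i sum_j alpha[i][j] * B_j(x_i) + bias, as in Eq. (9)."""
    m = len(knots) - degree - 1  # number of basis functions per input feature
    total = bias
    for i, x_i in enumerate(x):
        for j in range(m):
            total += alpha[i][j] * bspline_basis(j, degree, knots, x_i)
    return total

# Assumed uniform knot grid: 8 knots give m = 4 cubic bases, valid for inputs in [0, 1).
knots = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0]
alpha = [[1.0] * 4, [1.0] * 4]  # n = 2 input features, all weights set to 1
y = kan_output([0.5, 0.25], alpha, 0.5, knots)
print(y)
```

Because the cubic B-spline bases sum to one inside the valid interval, setting every weight to 1 gives a value of approximately n + b = 2.5, which is a convenient sanity check before training the weights.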

2.4. BUCEA-SWS Dataset

The high-resolution remote-sensing images used in this study are primarily sourced from the Jilin-1 satellite, covering multiple geographic regions, including urban areas, urban–rural fringe zones, industrial parks, and mountainous regions. Based on the solid waste sites provided by the Shuguang Environmental Protection Public Welfare Development Center [44], images containing at least one type of solid waste were extracted as the raw dataset. The dataset collection areas mainly involve southern Chinese provinces such as Sichuan, Hunan, and Guangdong, and the distribution of different data collection sites is illustrated in Figure 6.

Subsequently, personnel with remote-sensing interpretation experience were organized to perform detailed annotations to improve label consistency and reliability. Remote-sensing-related professionals then conducted cross-checking according to the predefined annotation rules for the five types of solid waste sites listed in Table 1, and an expert review procedure was established to further examine samples with uncertain categories or ambiguous boundaries. After annotation, the images were cropped, with each image divided into several smaller images of size 512 × 512 pixels.
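The cropping step can be sketched as follows. The paper specifies only the 512 × 512 tile size; the non-overlapping stride, the shifting of the last tile in each row or column so that it stays inside the image, and the function names are assumptions made for illustration.

```python
def tile_origins(length, tile):
    """Start offsets along one axis; the last tile is shifted back to stay inside the image."""
    if length <= tile:
        return [0]
    origins = list(range(0, length - tile + 1, tile))
    if origins[-1] + tile < length:
        origins.append(length - tile)
    return origins

def tile_windows(width, height, tile=512):
    """Return (x0, y0, x1, y1) pixel windows covering the whole image."""
    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in tile_origins(height, tile)
        for x in tile_origins(width, tile)
    ]

windows = tile_windows(1100, 512)
print(windows)
```

Under this assumption the last tile overlaps its neighbor whenever the image size is not a multiple of 512, so instances near the image border are kept whole rather than cut at a padded edge. Images smaller than one tile yield a single, smaller window and would need padding.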

After screening, a total of 8613 image samples were obtained, containing 13,046 annotated instances: 1364 tailings pond instances (TP: 10.45%), 2757 construction spoil site instances (CSS: 21.13%), 3137 landfill site instances (LS: 24.05%), 3328 garbage dump site instances (GDS: 25.51%), and 2460 excavation site instances (ES: 18.86%). The construction results for the dataset of five types of solid waste sites are summarized in Table 2.

3. Experimental Results and Analysis

3.1. Datasets

To ensure the reliability and scientific rigor of the experiments, in addition to constructing the BUCEA-SWS dataset, this study further collected four open-source high-resolution remote-sensing datasets as data sources for comparative experiments. Detailed information is provided in Table 3. Unfortunately, no publicly available dataset related to landfill sites has been found to date.

In this experiment, “Propose+ (YOLOv11)” and “Propose+ (YOLOv12)” denote the integration of the proposed DARF-MOE module, NST-MOE module, and the KAN Mask Coefficient Prediction module (KAN-MCP) into YOLOv11 and YOLOv12, respectively. It should be noted that, as the utilized open-source datasets are all designed for object detection tasks, the KAN-MCP module is excluded from the detection experiments and is only activated in instance segmentation or other mask-related tasks.

(1): BUCEA-SWS Dataset: The dataset constructed in this study, covering five categories of solid waste sites, is described in detail in Section 2.4.
(2): Global Dumpsite Test Data [22]: The dataset covers multiple major cities in Africa and Asia, primarily sourced from the Google Earth platform. Based on the origin and distribution patterns of the waste, all solid waste sites are divided into six categories: agriculture forestry, construction waste, disposed garbage, domestic garbage, industry waste, and mining waste. These six types of waste sites have been labeled and annotated.
(3): Open Source Tailings Pond Dataset [45]: The dataset covers multiple cities in Anhui Province, China, comprising 352 positive samples and 430 negative samples of Google satellite imagery with a spatial resolution of 2.05 m. The original images were divided into 500 × 500-pixel patches, and the locations of tailings ponds were annotated based on the target positional information.
(4): Open Pit Mine Object Detection Dataset [46]: The dataset consists of 4617 remote-sensing image patches of open-pit mining areas, each sized 1024 × 1024 pixels, along with their corresponding object detection bounding boxes. All bounding boxes were meticulously manually annotated using the Labelme tool.
(5): Tailings Ponds in Henan Province [47]: This dataset is constructed based on multi-year Chinese high-resolution optical remote-sensing satellite imagery through data processing, manual interpretation and annotation, as well as image tiling. It is an open-access dataset designed for tailings pond detection in Henan Province, China. It contains 1183 image tiles and 1728 object instances, featuring multi-temporal coverage across four years: 2016, 2018, 2020, and 2021.

3.2. Training Details

All model training experiments in this study were conducted on an RTX 3070 GPU with 8 GB of memory. PyTorch version 1.11 and CUDA version 11.3 were used. The specific training parameter settings are listed in Table 4.

3.3. Comparison Methods and Evaluation Metrics

To validate the effectiveness of the proposed model, we compared it with state-of-the-art vision models such as YOLOv11, YOLOv12, RT-DETR [48], FBRT [49], PKI [50], and LEGNet [51], as well as some classical detection models including FasterRCNN, RetinaNet [52], Transformer [53], CSL [54], and R3Det [55].

Among these, PKI, RT-DETR, FBRT, LEGNet, FasterRCNN, RetinaNet, Transformer, CSL, and R3Det only support object detection tasks. In the experimental results, this study focuses on analyzing comparisons with a representative baseline model.

For model performance evaluation, this study adopts mAP₅₀ (mean Average Precision at an IoU threshold of 0.5) as the primary accuracy metric, complemented by Precision, Recall, F1-score, and mAP_50–95. Precision is the proportion of true positives among all predicted positives; a higher Precision indicates fewer false detections. Recall is the proportion of actual positives that are correctly identified, measuring the model’s ability to find all targets. The F1-score is the harmonic mean of Precision and Recall, combining both into a single balanced performance metric. mAP₅₀ provides a comprehensive measure of the model’s accuracy in detection tasks at a single, relatively lenient localization threshold. mAP_50–95 refers to the mean average precision averaged over multiple IoU thresholds ranging from 0.5 to 0.95 with a step size of 0.05. By requiring the model to maintain high performance across varying levels of localization precision, it serves as a more rigorous and comprehensive benchmark for evaluating overall model performance.
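The metric definitions above can be written out as a small pure-Python sketch. The function names are illustrative, and axis-aligned boxes in (x0, y0, x1, y1) form are assumed for the IoU computation. The full mAP calculation (ranking predictions by confidence and integrating the precision-recall curve) is omitted.

```python
def precision(tp, fp):
    """Share of predicted positives that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Share of actual positives that are found."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2.0 * p * r / (p + r) if (p + r) else 0.0

def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes (x0, y0, x1, y1)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

# The ten IoU thresholds averaged by mAP_50-95: 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [round(0.5 + 0.05 * k, 2) for k in range(10)]

p = precision(tp=6, fp=2)   # 0.75
r = recall(tp=6, fn=4)      # 0.6
print(p, r, f1_score(p, r), box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

A prediction counts as a true positive at a given threshold only if its IoU with a ground-truth instance of the same class reaches that threshold, which is why mAP_50–95 penalizes loosely localized boxes that still pass at 0.5.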

For evaluating model computational complexity, this study uses GFLOPs (Giga Floating Point Operations) as the metric, which represents the number of billions of floating-point operations required for the model to perform inference on a single image. A higher GFLOPs value indicates greater computational cost, which may limit the model’s deployment and application on low-power devices.

In addition, this study introduces FPS (Frames Per Second) as a metric for inference speed, which reflects the model’s real-time processing capability under a given hardware environment. A higher FPS indicates that the model can process more images in a shorter time, making it suitable for scenarios with high real-time requirements.
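As a rough illustration of how an FPS figure of this kind is obtained, the sketch below times repeated calls to an inference function after a short warm-up. The `infer` callable, the warm-up length, and the dummy workload are hypothetical, and the paper does not describe its exact timing protocol. On a GPU, kernels run asynchronously, so a real measurement would also need to synchronize the device before reading the clock.

```python
import time

def measure_fps(infer, inputs, warmup=5):
    """Average number of inputs processed per second, excluding warm-up calls."""
    for sample in inputs[:warmup]:
        infer(sample)                      # warm-up runs are not timed
    start = time.perf_counter()
    for sample in inputs:
        infer(sample)
    elapsed = time.perf_counter() - start
    return len(inputs) / elapsed if elapsed > 0 else float("inf")

def to_gflops(flop_count):
    """Convert a raw floating-point operation count to GFLOPs."""
    return flop_count / 1e9

def dummy_infer(sample):
    """Hypothetical stand-in for a model forward pass."""
    return sum(v * v for v in sample)

fps = measure_fps(dummy_infer, [list(range(1000))] * 50)
print(fps)
```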

3.4. Comparison Results and Analysis

(1) Results on the BUCEA-SWS Dataset

Considering that garbage dump sites and construction spoil sites are significantly smaller in scale than excavation sites, tailings ponds, and landfill sites, we grouped the first two categories together, while modeling and training the latter three separately. In this study, garbage dump sites (GDS), construction spoil sites (CSS), excavation sites (ES), tailings ponds (TP), and landfill sites (LS) are used as abbreviations for the five categories. The extraction accuracy comparison results based on the BUCEA-SWS dataset are shown in Table 5 and Table 6. For conciseness, this section reports only the evaluation metrics for the detection task, whereas the detailed quantitative results for the segmentation task are provided in Appendix A, Table A1 and Table A2.

On the medium-resolution solid waste dataset, the proposed method shows a clear overall performance advantage, as shown in Table 5. Propose + YOLOv11 achieves the best detection accuracy and overall stability, with mAP₅₀ and mAP_50–95 reaching 83.1% and 65.0%, respectively, outperforming representative recent methods such as RT-DETR, FBRT, and PKI. Notably, the proposed method shows a more pronounced improvement in mAP_50–95 over its corresponding baseline model, which indicates that it maintains stable and accurate object localization across different IoU thresholds. This characteristic provides adaptability and robustness for the detection of solid waste sites, which frequently display complex shapes and significant scale variations. In terms of recall, both Propose + YOLOv11 and Propose + YOLOv12 achieve an improvement of approximately 6% over their respective baseline models, mitigating missed detections caused by inconsistent object scales, irregular shapes, or image edge truncation. This highlights the advantage of the introduced multi-expert feature routing strategy in complex scenarios. Although the precision of the proposed method is slightly lower than that of some individual methods (e.g., PKI), it achieves a more reasonable and stable balance between precision and recall, and its F1-score consistently remains at a high level. This finding indicates that the model offers robust overall detection reliability and practical utility.

Figure 7 compares the detection results for the best-performing model in this study with various baseline models in typical complex scenarios prone to false positives, including bare land, buildings, mountain shadows, and suburban areas. In (a), within an ES scenario near a major road, RT-DETR, YOLOv11, and YOLOv12 incorrectly detect adjacent buildings as LS. In the mountainous ES scenario shown in (b), YOLOv11 and YOLOv12 fail to identify the ES due to internal shadows and even misclassify the shaded area as LS.

Figure 7c–e present LS scenarios in suburban areas, where LS, CSS, and ES exhibit similar bare-land PIP features in the imagery, leading to varying degrees of false detection by methods such as YOLOv11, R3Det, and PKI. In (d), an originally continuous large LS appears as two spatially separated small LS instances in the image due to soil coverage in its central region. Different baseline models either merge them into a single target—introducing interference from the covered area—or detect only one of them, resulting in missed detection.

In the TP scenario shown in Figure 7f, surrounding CSS features that are highly similar in color and shape to the tailings pond are also misidentified as TP by several models. In contrast, Propose + YOLOv11 leverages finer-grained PIP representations to comprehensively distinguish SIOs, effectively reducing misjudgments caused by local shadows, bare land, and similar-texture PIP features, and demonstrates more stable detection performance in complex backgrounds.

The high-resolution dataset primarily includes solid waste pile targets with relatively small scales and significant morphological variations. Such targets typically occupy only small areas in remote-sensing imagery and are often distributed in complex backgrounds such as suburban regions. This situation places higher demands on the model’s ability to perceive small objects and distinguish fine-grained features. As shown in the comparative experimental results in Table 6, YOLOv12 outperforms YOLOv11 and most traditional detection methods on this dataset, achieving 66.6% in mAP₅₀ and 46.1% in mAP_50–95, demonstrating relatively stable small-object detection capability. Building on this, the model performance is further enhanced through the introduction of a multi-branch feature representation that is specifically designed to model local structural differences and the incorporation of dynamic adjustment strategies during feature selection and fusion. Propose + YOLOv12 increases mAP₅₀ and mAP_50–95 to 68.8% and 47.2%, respectively, achieving the best results among all compared methods. Meanwhile, Propose + YOLOv11 also boosts the corresponding metrics to 68.5% and 46.0%, indicating that the proposed approach does not depend on a specific baseline architecture but can stably adapt to different detection frameworks.

In terms of Precision and Recall, the proposed method achieves a more significant improvement in Recall while maintaining Precision at a reasonable level. Compared with YOLOv11 and YOLOv12, the Recall of Propose + YOLOv11 and Propose + YOLOv12 increases to 65.6% and 66.9%, respectively, effectively reducing the risk of missed detection for small-scale and irregularly shaped solid waste targets in high-resolution remote-sensing imagery. This balance between Precision and Recall keeps the model’s F1-score consistently within the top tier, reflecting good overall detection reliability. It is worth noting that LEGNet achieves a relatively high F1-score on this dataset, mainly because of its balanced trade-off between Precision and Recall; however, its performance on multi-IoU metrics such as mAP_50–95 remains noticeably lower than that of the proposed method, indicating that there is still a considerable gap in localization accuracy and boundary consistency.

Figure 8 presents a comparison of detection results between the best-performing model in this study and various baseline methods in typical scenarios that are prone to false positives. These scenarios include complex backgrounds such as buildings, field ridges, and construction sites. In sub-figures (a) and (d), the bright white stripe-like building roofs and the white marked areas at road intersections display spectral and morphological features that are highly similar to the typical PIP contained in CSS. This similarity causes YOLOv12 to frequently mistake them for each other and generate false detections. Similarly, in the yellowish field ridge scene shown in (b) and the construction site area in (e), YOLOv12 also mistakenly identifies background objects as CSS.

For the elongated CSS scenario depicted in (c), methods such as FBRT can only detect partial regions of the target, while models like RetinaNet tend to split it into multiple discontinuous objects, thus failing to accurately capture its overall shape. In the large GDS scene shown in (f), YOLOv12 fails to detect the target completely; although RT-DETR covers the overall area, it generates false detections of regularly arranged white objects. Meanwhile, models like YOLOv11 and CSL even misclassify the target as CSS.

In contrast, Propose + YOLOv12 enhances the ability to distinguish objects with similar features to some extent through finer-grained PIP representation and discrimination. At the same time, its heterogeneous experts with different roles can better adapt to significant variations in shape and scale among targets of the same category, thereby achieving more complete and reliable target detection in complex scenarios.

(2) Results on Open-Source Datasets

The detection accuracy comparison results on the Global Dumpsite Test Data (GDTD), Open Pit Mine Object Detection Dataset (OPMOD), Open Source Tailings Pond Dataset (OSTPD), and Target Detection Dataset for Tailings Ponds in Henan Province (TPHPD) are shown in Table 7. It is important to note that one category in the GDTD validation set contains only three instances, which is significantly fewer than other categories. To avoid bias in metric calculation, this category has been excluded from the computation.

More detailed metric comparisons, including mAP_50–95, Precision, Recall, and F1-score, are provided in Appendix A, Table A3, Table A4, Table A5 and Table A6. These four datasets were selected because they correspond to four representative categories in the proposed BUCEA-SWS dataset: GDTD contains garbage dump sites and construction spoil sites (GDS + CSS), OPMOD represents excavation sites (ES), and OSTPD and TPHPD both represent tailings ponds (TP). Since the TP samples in OSTPD exhibit relatively strong intra-class consistency, TPHPD was additionally included as a supplementary TP dataset with lower intra-class consistency and more complex sample variations. To date, no suitable open-source dataset corresponding to landfill sites (LS) has been found; therefore, LS is not included in the open-source dataset comparison. This experimental design enables a more comprehensive evaluation of the proposed method across different solid waste categories, image qualities, target scales, and scene complexities.

Overall, Propose+ (YOLOv11) achieves the highest average mAP₅₀ across all datasets, reaching 62.6%, followed by Propose+ (YOLOv12) with 61.3%. On the GDTD dataset, although PKI obtains the best mAP₅₀, the proposed method still improves YOLOv11 and YOLOv12 by 2.3 and 5.1 percentage points, respectively, indicating enhanced discriminative capability in complex GDS and CSS scenarios. On the OPMOD dataset, the performance gaps among different methods are relatively small, while Propose+ (YOLOv12) achieves the best mAP₅₀ of 56.9%, suggesting that the proposed modules can provide moderate but consistent improvements for ES detection. On the OSTPD dataset, Propose+ (YOLOv11) achieves the highest mAP₅₀ of 52.0%, demonstrating better adaptability to TP samples with scale variations. On the more challenging TPHPD dataset, Propose+ (YOLOv11) and Propose+ (YOLOv12) rank first and second, respectively, further confirming the effectiveness of the proposed method under TP scenarios with weaker intra-class consistency. These results indicate that the proposed expert-based feature modeling strategy improves the robustness and generalization ability of YOLO-based detectors across datasets with different resolutions, target scales, and scene complexities.

Figure 9 presents a comparison of detection results between the proposed method and various baseline models across four collected datasets. In the scenes shown in (a) and (b), due to the low image resolution, GDS and CSS targets mostly appear as small-scale objects in the imagery. Meanwhile, the backgrounds encompass numerous complex land covers, which substantially elevates the overall difficulty of the detection task. Under these circumstances, methods such as PKI and RT-DETR are prone to misclassify background features—such as parking lots, reflective roof structures, and white-stacked goods in factory areas—as target objects, resulting in a higher number of false positives. In contrast, by enhancing target representation and refining feature discrimination, the proposed method empowers YOLOv11 to more reliably distinguish targets from background interference in such intricate, low-resolution scenes.

In (d), a type of mine pit target that appears relatively flat in remote-sensing imagery exists in the OPMOD. Because such pits exhibit a high degree of visual similarity to PIPs, which have distinct three-dimensional structural features and also semantically belong to SIOs, models are prone to category confusion during feature discrimination. This uncertainty, arising from both appearance similarity and semantic overlap, constitutes a major source of detection difficulty in subsequent complex scenarios.

When such targets appear in structurally more complex scenes, as shown in (c), the aforementioned category confusion is further amplified. Since the annotation process for OPMOD includes some ES targets lacking clear shadow-edge features, models may learn to treat local excavated areas inside mine pits as independent objects during training. Consequently, several methods (such as PKI) produce detection results where interior pit regions are incorrectly identified as separate targets, and YOLOv11 even struggles to detect the complete structure of the pit as a whole.

After incorporating the proposed module optimizations, the Propose + YOLOv11 model, which has stronger PIP discrimination and structural awareness, can accurately identify the overall spatial structure of the pit in complex scenes, thereby achieving effective detection of the complete target. However, it should be noted that because such PIPs also semantically belong to SIOs, the model still unavoidably produces a small number of false detections where interior pit PIPs are mistaken for independent targets in the scene shown in (c). In contrast, in (d), where the target structure is more well-defined, the model maintains high detection completeness while effectively suppressing false positives.

In (e)–(h), the experiment further examines the model’s ability to identify PIPs inside TPs under varying scale conditions, as well as its comprehensive PIP discrimination capability. After integrating the proposed method, Propose + YOLOv11 significantly enhances YOLOv11’s performance in these aspects, effectively reducing the occurrence of both missed detections and false positives.

(3) Comparison of Model Parameters

In this study, model complexity and FPS measurements were performed on an RTX 3070 GPU with 8 GB of memory, using PyTorch 1.11, CUDA 11.3, and thop 0.1.1. The comparison results are shown in Table 8.

Table 8 presents a comparison of different methods in terms of computational complexity (GFLOPs) and inference efficiency (FPS). The results show that while some recently proposed high-performance methods (e.g., PKI and LEGNet) achieve certain advantages in detection accuracy, their inference efficiency is relatively low. Among them, PKI achieves an inference speed of only 6.5 FPS, and the FPS of methods such as LEGNet and R3Det is also noticeably constrained, making it difficult to meet the practical demands for processing efficiency in large-scale remote-sensing image applications.

In contrast, the proposed method introduces only limited computational overhead, and its inference speed remains stable at over 80 FPS. When considering the experimental results on multiple datasets presented earlier, it becomes evident that the proposed method significantly enhances detection accuracy and result stability without a substantial compromise in inference efficiency, thereby demonstrating favorable efficiency-preservation characteristics.

Considering both accuracy performance and computational efficiency, the proposed model strikes a reasonable balance between performance gain and computational cost. This allows it not only to have strong target detection capabilities in complex remote-sensing scenarios but also to demonstrate good engineering feasibility and practical application potential.

3.5. Effectiveness Analysis of the Proposed Model Architecture

To further validate the rationality and effectiveness of the model architecture proposed in this paper, this section conducts a systematic analysis of the key constituent modules of the model from multiple structural perspectives.

(1) Ablation Study

To systematically validate the independent effectiveness and synergistic effects of the proposed modules, this paper conducts a module-level ablation study under a unified training strategy and hyperparameter configuration. The experiments involve enabling, removing, or replacing the DARF-MOE module, the NST-MOE module, and the KAN-MCP module separately, followed by a comprehensive comparison of detection and segmentation accuracy, GFLOPs, and FPS. The results are presented in Table 9.

First, the impact of each module introduced individually is analyzed. Compared with the baseline Model A, incorporating only DARF-MOE (Model G) or only NST-MOE (Model H) consistently improves performance in both detection and segmentation tasks. In contrast, when only KAN-MCP is introduced (Model I), the gain in detection performance is relatively modest, while the improvement in segmentation is more pronounced (segmentation mAP₅₀ increases by about 1.7 percentage points). This observation suggests that KAN-MCP is better suited to finely modeling the nonlinear relationships in mask coefficients than to directly boosting object localization accuracy, which aligns with the original design intention of KAN-MCP.

In the two-module combination experiments (Models E and F), it can be observed that combining KAN-MCP with either DARF-MOE or NST-MOE yields overall performance superior to that of the corresponding single-module configurations. This indicates that KAN-MCP does not operate in isolation; instead, it acts as a nonlinear enhancer in the mask-prediction stage, complementing the front-end MOE-based feature modeling modules to further improve instance segmentation quality.

To investigate the necessity of the KAN module, this paper further replaces KAN-MCP with a linear layer structure of similar computational complexity (Model C) and compares it with the full model (Model B). The experimental results show that, under nearly identical GFLOPs, the model with KAN-MCP achieves higher mAP₅₀ scores in both detection and segmentation tasks. Although the linear structure yields a slight advantage in inference speed, its accuracy is noticeably lower. Moreover, comparing Model C with Model D (which lacks any third module) reveals that introducing any form of mask-coefficient prediction module contributes to performance improvement, while KAN-MCP achieves a better trade-off between accuracy and efficiency. This demonstrates that KAN-MCP is not merely a substitute for linear mapping but introduces a more expressive nonlinear structure for modeling mask coefficients.

When all three modules are enabled (Model B), the model achieves optimal performance in both detection and segmentation while maintaining high inference efficiency and without any unacceptable increase in computational overhead. This outcome indicates that the proposed modules are not simply stacked; rather, they establish a complementary and synergistic relationship across three levels: multi-scale feature modeling, semantic structure modeling, and mask-coefficient prediction.

(2) Analysis of Expert Configuration in NST-MOE

This paper further analyzes the impact of the number of external expert groups (I) and the number of sub-experts per group (J) in the NST-MOE module on model performance and computational efficiency, as shown in Table 10. The experimental results indicate that the performance improvement of NST-MOE does not simply rely on increasing the total number of experts; it also depends on the rationality of the expert organization and hierarchical structure design. Under the same total number of experts (Experiments A and C), the nested expert structure achieves better performance in both detection and segmentation tasks. This demonstrates that the hierarchical modeling approach of “external grouping—internal refinement” can effectively promote semantic specialization among experts and enhance the model’s ability to express semantic differences among complex ground objects.

Furthermore, when the number of external expert groups $I$ is fixed, continuously increasing the number of sub-experts per group $J$ (Experiments B, C, and D) leads to saturated or even degraded model performance. This suggests that an excessive number of sub-experts may introduce redundant representations and increase routing uncertainty, thereby weakening expert focus. Similarly, with a fixed number of sub-experts per group $J$, excessively increasing the number of external expert groups $I$ (Experiments E, C, and F) also raises routing complexity, which hinders stable model learning under limited training samples.

Considering detection accuracy, segmentation accuracy, and computational overhead comprehensively, the NST-MOE configuration with $I = 4$ and $J = 4$ achieves the best balance between performance and efficiency, and is therefore selected as the default setting in this work.

(3) Scene-Aware Expert Routing Analysis of DARF-MOE

Based on the validation set of the BUCEA-SWS Dataset (a total of 1761 images for ES, LS, and TP, and 542 images for GDS and CSS), this paper further statistically analyzes the expert routing selection ratios of DARF-MOE across different types of solid waste sites, as illustrated in Figure 10. The aim is to explore the underlying patterns of expert specialization and preference in various SIO scenarios. The overall results show significant differences in expert routing distributions among different solid waste site types, which validates the design intent of DARF-MOE to dynamically select experts to adapt to scene semantics.
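A routing-ratio statistic of this kind can be tallied as in the sketch below. The record format (one entry per image, holding the site category and the names of the experts selected by the router) and the toy records are assumptions made for illustration; only the expert names ASPP, MEE, SE, and CBAM come from the paper.

```python
from collections import Counter, defaultdict

def routing_ratios(records):
    """Per-category share of router selections that went to each expert."""
    counts = defaultdict(Counter)
    for category, selected_experts in records:
        counts[category].update(selected_experts)
    ratios = {}
    for category, counter in counts.items():
        total = sum(counter.values())
        ratios[category] = {name: n / total for name, n in counter.items()}
    return ratios

# Hypothetical routing log, not data from the paper.
records = [
    ("LS", ["ASPP", "MEE"]),
    ("LS", ["ASPP", "SE"]),
    ("GDS", ["CBAM"]),
]
print(routing_ratios(records))
```

Normalizing within each category, rather than over the whole validation set, keeps the ratios comparable between the 1761-image and 542-image groups despite their different sizes.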

For solid waste sites with relatively large spatial scales and continuous spatial structures—such as ES, LS, and TP—the multi-scale ASPP expert consistently accounts for a high routing proportion. This indicates that in such SIOs, adaptive receptive field expansion and global context modeling play a dominant role in characterizing the overall morphology and spatial continuity of large-scale targets. Among them, the MEE expert’s routing proportion is notably higher in LS scenes than in other categories. This observation is highly consistent with the typical characteristics of LS targets, which often have clearly defined boundaries, relatively regular contours, and strong contrast with the background. The statistical result indirectly validates the design rationale of MEE: in solid waste types with well-defined boundaries, emphasizing edge and morphological structure information can enhance the stability and accuracy of instance contour modeling.

Unlike the above categories, CSS exhibits more heterogeneous characteristics in terms of target scale and morphology. Within this category, there exist both small-scale, discretely distributed pile-like structures and medium-scale targets that show certain local continuity, resulting in highly uneven scale distribution and structural patterns. Under these conditions, the expert routing distribution for CSS is characterized by the joint dominance of the SE expert and the multi-scale ASPP expert. On the one hand, the ASPP expert, through its multi-scale receptive field modeling, can adapt to the significant scale variations of CSS targets across different samples. On the other hand, the consistently high routing proportion of the SE expert in CSS may indicate the lack of a specialized expert within the current functional expert set that is precisely adapted to the complex morphology and texture features of CSS. Therefore, when encountering CSS scenarios characterized by high complexity and weak structural priors, the routing mechanism tends to select the SE expert to achieve steady feature representation through its continuous and stable convolutional modeling capability, thereby avoiding feature instability caused by expert response uncertainty or overly sparse selection.

For GDS scenarios, the proportion of the CBAM expert in the routing distribution increases significantly. This phenomenon can be reasonably explained by the typical spectral and visual characteristics of GDS targets. Compared with other solid waste types, GDS often exhibits more prominent high-reflectance features in bright white or gray-white tones, forming strong local contrast with surrounding natural backgrounds (such as bare soil, vegetation, or dark surfaces) in high-resolution remote-sensing imagery. In such scenes, the CBAM expert can adaptively enhance the focus on highly responsive channels and salient regions through its channel-wise and spatial attention mechanisms, thereby highlighting discriminative spectral and spatial features. The expert routing statistics show that DARF-MOE tends to invoke the CBAM expert more frequently in GDS scenarios, indicating that the model relies more on saliency enhancement and region-level attention mechanisms to enhance its ability to distinguish high-reflectance and structurally cluttered targets during the representation of this SIO.

From the overall routing distribution perspective, the SE expert maintains a relatively stable activation proportion across multiple scene types. This suggests that the SE expert does not serve as a specialized module for a particular structure or scale; instead, it plays a steady-state compensation and foundational representation role within DARF-MOE. When other functional experts (such as ASPP, CBAM, or MEE) fail to generate stable or consistent responses in specific scenarios, the SE expert provides continuous, smooth, and well-generalizable feature representations through its generic convolutional modeling capability. This helps mitigate uncertainty arising from excessive sparsity in expert selection or overly high scene heterogeneity.

(4) Visualization and analysis of spatial attention based on Grad-CAM

To further analyze the mechanism of the proposed modules at the feature modeling level, this study employs the Grad-CAM (Gradient-weighted Class Activation Mapping) method to visualize the feature responses of Propose + YOLOv11 across five types of solid waste site scenarios. Grad-CAM generates spatial heatmaps by back-propagating gradient weights of the target class with respect to high-level feature maps, where color intensity reflects the model’s attention at different spatial locations, thereby partially revealing the model’s decision process and spatial semantic modeling behavior.
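The Grad-CAM computation summarized above can be sketched in plain Python. This is a minimal illustration of the standard formulation (channel weights from spatially averaged gradients, a weighted sum of feature maps, then ReLU and normalization); the nested-list input format and the toy values are assumptions made for readability, not the pipeline used in the experiments.

```python
def grad_cam(activations, gradients):
    """Compute a normalized Grad-CAM map.

    activations, gradients: nested lists shaped [K][H][W] for the K channels
    of one layer (gradients of the class score w.r.t. the activations).
    """
    height, width = len(activations[0]), len(activations[0][0])
    # Channel weights: global average of the gradients over spatial positions.
    weights = [sum(map(sum, g)) / (height * width) for g in gradients]
    cam = [[0.0] * width for _ in range(height)]
    for w_k, a_k in zip(weights, activations):
        for i in range(height):
            for j in range(width):
                cam[i][j] += w_k * a_k[i][j]
    # ReLU keeps only locations that positively support the target class.
    cam = [[max(0.0, v) for v in row] for row in cam]
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam

# Toy 2-channel, 2 x 2 feature maps (illustrative values only).
acts = [[[1.0, 0.0], [0.0, 2.0]], [[0.0, 1.0], [1.0, 0.0]]]
grads = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
heatmap = grad_cam(acts, grads)
```

In practice the resulting map is upsampled to the input resolution and overlaid on the image as a color heatmap.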

As shown in Figure 11, YOLOv11 generally exhibits excessive focus on non-target regions in complex remote-sensing scenes. In subfigures (a), (c), and (f), it can be observed that the model generates widespread high responses in semantically irrelevant areas such as farmland, roads, and forests. Particularly in (a) and (f), the model shows pronounced focus on blue building roofs and bare land areas, which frequently precedes false detections or misclassification. The underlying reason is that the model struggles to effectively distinguish PIPs with similar spectral or textural features among different land-cover SIOs in high-resolution imagery.

In contrast, the proposed model demonstrates a more concentrated and semantically consistent attention pattern. In ES and TP scenes, the model does not rely on a single salient discriminative cue but jointly models multiple key internal PIPs of the target. For instance, in (b), the response of YOLOv11 is primarily concentrated on the water-accumulation area inside the ES; the proposed model maintains a high level of responsiveness to that region while also attending to non-water excavated pit areas and internal structures, which indicates its ability to comprehensively understand the heterogeneous composition within an ES. In (d), the proposed model not only sustains stable attention to the relatively flat black covering film but also forms continuous attention to its edge structures and adjacent feature areas. This difference is particularly evident in TP scenes. As shown in (e), YOLOv11 clearly overlooks key discriminative structures such as the interception dam and embankment in its CAM results, whereas the proposed model simultaneously covers multiple PIPs within the TP (including the water body, tailings accumulation area, embankment, and interception dam), reflecting stronger holistic structural perception.

Similarly, Figure 12 presents the Grad-CAM results for CSS and GDS scenarios. From subfigures (g), (h), (i), and (k), it can be seen that YOLOv11 still exhibits noticeable erroneous attention in these two types of scenes, while such issues are effectively alleviated after incorporating the proposed modules. Closer observation of (i), (j), (k), and (l) reveals that Propose + YOLOv11 displays more continuous and complete response distributions in its attention to CSS and GDS. For small-scale bare-soil PIPs formed inside the targets due to different stacking patterns, the proposed model can reasonably regard them as part of the whole object based on contextual semantic relations. In contrast, YOLOv11 often loses its attention due to inconsistencies in local texture arrangements and even completely misses an entire CSS area in (l). The root cause lies in its insufficient joint modeling capability for multi-source PIPs within the same SIO.

It should be noted that in certain complex scenes, such as those in Figure 11e,f, the proposed model still shows attention towards a small number of non-target land-cover classes in the CAM visualizations. This indicates that under conditions of extreme spectral or spatial similarity, the model may still be disturbed by local PIP resemblances. Nevertheless, compared with YOLOv11, the spatial scope of such erroneous attention is significantly reduced, demonstrating that the proposed model possesses more stable feature-focusing ability in suppressing interference from irrelevant regions.

4. Case Study of Solid Waste Site Detection in Hunan Province

This study selected Loudi City in Hunan Province, China, as the study area to conduct empirical experiments, aiming to validate the feasibility and applicability of the proposed model in real-world scenarios. As a typical resource-based city, Loudi hosts various industrial activities, including mining, metallurgy, and building materials production. The resultant accumulation and disposal of solid waste in the region have, to some extent, impacted the local ecological environment and land-use safety, providing an ideal testbed for evaluating the model’s extraction performance in large-scale and complex scenarios.

4.1. Empirical Workflow

The empirical workflow mainly consists of three steps: first, acquiring the empirical imagery; second, training the extraction model and performing inference based on the proposed method; and finally, post-processing the inference results. The overall workflow is illustrated in Figure 13.

(1) Data Preparation

In this study, the empirical experiment dataset was constructed using 2024 Jilin-1 high-resolution satellite imagery. Specifically, Level-16 imagery was used to cover the entirety of Loudi City as the study area, while Level-18 imagery focused on selected urban areas within Loudi City. Since the cropping process inevitably disrupts the overall semantic structure of the images, a 512 × 512 pixel cropping window with 50% overlap was applied to minimize this effect, producing the cropped images as the initial input for the experiments.
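The cropping step can be illustrated with a short sliding-window sketch. It assumes a 512 × 512 window with 50% overlap (stride of 256 pixels) as stated above; the handling of image borders, where a final window is placed flush with the edge, is an assumption of this example and is not described in the text.

```python
def tile_origins(length, tile=512, overlap=0.5):
    """Top-left offsets of tiles along one axis, covering [0, length)."""
    stride = int(tile * (1 - overlap))
    if length <= tile:
        # Image smaller than one tile: a single window (padding assumed).
        return [0]
    origins = list(range(0, length - tile + 1, stride))
    # Add a final tile flush with the border if the stride leaves a gap.
    if origins[-1] + tile < length:
        origins.append(length - tile)
    return origins

def tile_windows(width, height, tile=512, overlap=0.5):
    """Return (x0, y0, x1, y1) windows for a width x height image."""
    return [(x, y, x + tile, y + tile)
            for y in tile_origins(height, tile, overlap)
            for x in tile_origins(width, tile, overlap)]

# A 1024 x 1280 image yields 3 columns x 4 rows of overlapping windows.
windows = tile_windows(1024, 1280)
```

Keeping the window origins alongside each patch makes it straightforward to map predictions back to full-image coordinates during stitching.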

(2) Model Training and Inference

Extraction models were constructed for the different types of solid waste sites using the BUCEA-SWS dataset. Training and inference for garbage dump sites and construction spoil sites used Propose+ (YOLOv12), while excavation sites, tailings ponds, and landfill sites were processed with Propose+ (YOLOv11). Inference produced the spatial distribution of solid waste sites within the imagery.

(3) Post-Processing of Results

The output results of the model were stitched together, vectorized, and subjected to proximity-based deduplication to eliminate false positives. The processed data were then used to generate thematic maps in a GIS environment, enabling the spatial distribution visualization of solid waste sites.
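One possible form of the proximity-based deduplication is sketched below. The exact rule used in the experiments is not specified, so this example makes several assumptions: detections are axis-aligned boxes, duplicates are defined by centre distance within the same class, the highest-scoring detection is kept, and the 64-pixel threshold `max_dist` is a hypothetical value.

```python
import math

def to_global(box, tile_origin):
    """Shift a tile-local (x0, y0, x1, y1) box by the tile's top-left offset."""
    ox, oy = tile_origin
    x0, y0, x1, y1 = box
    return (x0 + ox, y0 + oy, x1 + ox, y1 + oy)

def center(box):
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)

def deduplicate(detections, max_dist=64.0):
    """Greedy proximity-based deduplication.

    detections: list of dicts with 'box' (global pixels), 'label', 'score'.
    max_dist is a hypothetical centre-distance threshold in pixels.
    """
    kept = []
    for det in sorted(detections, key=lambda d: d["score"], reverse=True):
        cx, cy = center(det["box"])
        duplicate = any(
            k["label"] == det["label"]
            and math.hypot(cx - center(k["box"])[0],
                           cy - center(k["box"])[1]) <= max_dist
            for k in kept
        )
        if not duplicate:
            kept.append(det)
    return kept

# The first two detections describe the same object seen in two overlapping tiles.
dets = [
    {"box": to_global((300, 300, 400, 400), (0, 0)), "label": "GDS", "score": 0.8},
    {"box": to_global((44, 44, 144, 144), (256, 256)), "label": "GDS", "score": 0.6},
    {"box": to_global((10, 10, 60, 60), (512, 0)), "label": "CSS", "score": 0.7},
]
unique = deduplicate(dets)
```

A polygon-overlap criterion on the vectorized masks would be an equally plausible alternative; centre distance is used here only because it keeps the example short.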

4.2. Empirical Results

The Level-18 empirical imagery was cropped into 25,607 patches, and the Level-16 imagery into 52,734 patches. The total time consumed during the empirical process is shown in Table 11.

Fortunately, the five types of solid waste sites in the imagery can be evaluated through visual interpretation to assess the accuracy of the model’s extraction results. This approach partially addresses the lack of 2024 field survey data for solid waste sites in Loudi City. In subsequent work, we will continue to cooperate with the Shuguang Environmental Protection Public Welfare Development Center to further collect and monitor the spatial distribution of the five types of solid waste sites throughout Loudi City and the broader Hunan Province. Through visual interpretation, we calculated the model’s precision in the empirical experiments, with the results presented in Table 12.

It should be noted that the deployment-stage Precision reported here is obtained through manual visual interpretation under fixed inference thresholds (confidence threshold = 0.25, NMS threshold = 0.7, mask threshold = 0.5). Unlike the validation-stage metrics (mAP₅₀, mAP_50–95, Precision, Recall, and F1-score), which are computed on fully annotated datasets and integrate performance across multiple IoU thresholds, the deployment Precision reflects single-threshold screening performance in real-world imagery and therefore is not directly comparable to the benchmark evaluation metrics.

Moreover, the objective of the deployment experiment differs from that of the validation-stage evaluation. While benchmark experiments aim to comprehensively assess detection and segmentation performance under standardized conditions, the deployment stage prioritizes practical screening effectiveness—favoring higher recall to reduce missed detections—even at the cost of reduced precision.

The precision for garbage dump sites and excavation sites reached 54.8% and 65.3%, respectively, which fall within a reasonable fluctuation range under complex environmental conditions. In contrast, the precision for construction spoil sites was only 29.8%, and for tailings ponds and landfill sites, it was even lower, at 6.6% and 9.7%, respectively. Further analysis reveals that the low precision for construction spoil sites is primarily due to their complex and diverse morphological features and high sensitivity to spatial resolution. Even in Level-18 high-resolution imagery, some spoil sites fail to fully display the general characteristics summarized in Table A7, thus increasing detection difficulty. The morphological and textural features of the sites exhibit a wider range of variation under different imaging conditions. This necessitates a larger number of positive samples for the model to comprehensively learn their general characteristics. In urban areas, the number of tailings ponds and landfill sites is relatively small, only around a dozen, and they are easily diluted among tens of thousands of cropped images. Additionally, due to overlapping cropping, a single false detection may appear in multiple patches, further inflating the error rate. To reduce this error rate, it is necessary to expand the negative sample set. Moreover, the imbalance in the proportions of training samples also has an impact on model performance. Developing a more rational sample distribution that accounts for the varying feature complexity across categories is a crucial issue to be investigated in future research.
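The effect of overlapping cropping on repeated detections can be made concrete. With a 512 × 512 window and 50% overlap (stride 256), an interior location is covered by up to four patches, so one confusing object can produce up to four false detections before deduplication. The helper below is a hypothetical illustration that counts covering patches on a regular grid and ignores image borders on the far side.

```python
def tiles_covering(x, y, tile=512, stride=256):
    """Count tiles on a regular grid (origins at multiples of stride)
    whose window contains the pixel (x, y)."""
    def along_axis(p):
        # An origin o covers p when o <= p < o + tile.
        first = max(0, (p - tile) // stride + 1)
        last = p // stride
        return last - first + 1
    return along_axis(x) * along_axis(y)

interior = tiles_covering(600, 600)  # covered in both directions by two tiles
corner = tiles_covering(100, 100)    # only the first tile reaches this pixel
edge = tiles_covering(600, 100)      # two tiles horizontally, one vertically
```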

Although the overall precision is relatively low, the results of the model can still serve as a preliminary foundation for rapid screening. When combined with visual interpretation, these results can effectively identify potential solid waste sites. In future work, the diversity of construction spoil site samples with different resolutions should be enhanced. Additionally, false extractions of tailings pond and landfill sites can be used as negative samples to further improve precision.

The empirical results for Loudi City are shown in Figure 14. Panel (a) presents the results for Propose+ (YOLOv11) on Level-16 imagery, while panels (b–f) show the results for Propose+ (YOLOv12) on Level-18 imagery. Analysis of panels (a–f) indicates that excavation sites, tailings ponds, and landfill sites are generally located in forested areas away from the city center, whereas garbage dump sites and construction spoil sites are typically distributed in suburban areas.

4.3. GIS-Based Remote-Sensing Solid Waste Pollution Risk Prevention System

This study developed a “GIS-based remote-sensing solid waste pollution risk prevention system”, as shown in Figure 15, depicting the system’s visualization interface. The system integrates multiple technologies, including remote sensing interpretation, GIS visualization, and big data analytics, to achieve automatic identification, classification, and spatial display of solid waste sites. It supports comprehensive monitoring of various types of solid waste locations, including excavation sites, landfill sites, tailings ponds, construction spoil sites, and garbage dump sites, enabling batch uploading, retrieval, classification management, and tracking of disposal status. Through interactive maps and statistical charts, the system visually presents the spatial distribution and category proportions of solid waste sites, assisting in risk assessment and trend analysis. Additionally, the system includes backend management and task assignment functionalities, facilitating coordination between online monitoring and offline verification, thereby providing robust support for refined supervision and informed decision-making in solid waste pollution risk management.

Figure 16 illustrates the outcomes of a precision environmental investigation conducted by Shuguang Environmental Protection Public Welfare Development Center using this system, along with the results of cross-departmental coordinated management.

5. Discussion

(1): It should also be noted that the representativeness of the dataset and the comparability of results across different datasets still have certain limitations. The BUCEA-SWS dataset constructed in this study is mainly collected from southern Chinese provinces, although it covers multiple scene types, including urban areas, urban–rural fringe zones, industrial parks, and mountainous regions. Therefore, it may not fully represent solid waste sites under different climatic, geological, and landscape conditions, such as arid regions, cold regions, northern mining areas, or areas with substantially different vegetation and soil backgrounds. In addition, the open-source datasets used for comparative experiments differ in spatial resolution, geographic background, annotation standards, object scale, and category composition. As a result, the performance values obtained on different datasets should not be directly compared as cross-dataset rankings. Instead, these experiments are intended to evaluate the relative performance of different models within each dataset under the same experimental settings, and to provide supplementary evidence for the adaptability of the proposed method under different data conditions. In future work, the dataset will be further expanded to broader geographic regions and more diverse environmental conditions, and more standardized cross-region evaluation protocols will be explored.
(2): Figure 17 shows the remaining challenging cases for the proposed method in complex remote-sensing scenes. In panel (a), the model produces false extractions when encountering areas that are highly similar to GDS, as highlighted by the red circle in the Image row. In panel (b), when GDS are irregularly dumped along roadsides, as indicated by the red circle, the model fails to accurately extract every individual waste pile. In panel (c), false extraction occurs when yellowish piled materials exhibit both color and stacking patterns similar to those of CSS. Panel (d) shows the failure cases for large-scale targets such as TP, LS, or ES, where incomplete recognition of internal PIPs leads to fragmented extraction results. Moreover, when the integrity of a target is disrupted at the image boundary, the model may fail to identify the object.

These cases indicate that although the proposed method alleviates many misdetections caused by PIP similarity, it does not completely eliminate such errors. The remaining failure cases mainly arise from three aspects: highly similar non-target PIPs, irregularly distributed small waste piles, and incomplete spatial structures caused by large target complexity or patch cropping. In particular, when the spectral, textural, or morphological features of background objects are close to those of GDS or CSS, the model may still generate false positives. For large targets such as TP, LS, and ES, incomplete perception of internal PIPs may further lead to fragmented extraction. These limitations suggest that future work should introduce more hard negative samples, improve the modeling of boundary-truncated objects, and incorporate stronger object-level spatial consistency constraints.

(3): Although the proposed MOE-based structure improves the adaptability of feature representation, its training process is more complex than that of a conventional single-path network. In particular, the load-balancing strategy encourages different experts to participate in feature learning, which helps alleviate the risk of expert collapse. However, this regulation may also introduce certain fluctuations during training. When the model temporarily relies more strongly on a subset of experts, the load-balancing mechanism may adjust the routing tendency to promote the participation of other experts, causing the model to temporarily deviate from a locally stable routing pattern and leading to short-term performance oscillations.
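To make the load-balancing behaviour concrete, the sketch below computes a widely used auxiliary loss for top-1 routing, in the style of the Switch Transformer: the number of experts multiplied by the sum, over experts, of the dispatch fraction times the mean gate probability. This is an assumed stand-in for illustration; the exact load-balancing term used in WMN may differ.

```python
def load_balance_loss(gate_probs):
    """Switch-Transformer-style auxiliary loss for top-1 routing.

    gate_probs: list of per-sample softmax probabilities over experts.
    Returns N * sum_i f_i * P_i, which equals 1.0 when both the dispatch
    fractions f_i and the mean probabilities P_i are uniform.
    """
    num_experts = len(gate_probs[0])
    num_samples = len(gate_probs)
    dispatch = [0] * num_experts
    mean_prob = [0.0] * num_experts
    for probs in gate_probs:
        # Top-1 routing: the sample is sent to its highest-probability expert.
        dispatch[max(range(num_experts), key=probs.__getitem__)] += 1
        for i, p in enumerate(probs):
            mean_prob[i] += p / num_samples
    fractions = [c / num_samples for c in dispatch]
    return num_experts * sum(f * p for f, p in zip(fractions, mean_prob))

# Toy gate outputs for two experts (illustrative values only).
balanced = load_balance_loss([[0.7, 0.3], [0.3, 0.7]])
collapsed = load_balance_loss([[0.9, 0.1], [0.8, 0.2]])
```

The loss grows when routing concentrates on one expert, so minimizing it pushes the router away from a temporarily preferred subset, which is consistent with the short-term oscillations described above.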

Therefore, MOE-based models generally require a longer and more sufficient training process to stabilize expert specialization and ensure that each expert is adequately optimized. In the current study, the expert-routing analysis on the validation set indicates that different experts are activated across different solid waste scenarios, suggesting that no obvious expert collapse occurs after convergence. Nevertheless, we did not fully record and analyze the epoch-wise expert utilization during training. Future work will further monitor expert load distributions across epochs and introduce more explicit routing-balance metrics to better evaluate routing stability and expert specialization.
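One simple routing-balance metric that could support such monitoring is the normalized entropy of the expert-usage distribution, which is 1 for perfectly even usage and 0 when a single expert receives all inputs. The sketch below is a suggestion under assumed inputs; the per-epoch counts are hypothetical and were not recorded in this study.

```python
import math

def normalized_entropy(expert_counts):
    """Entropy of the expert-usage distribution, scaled to [0, 1].

    expert_counts: activation counts per expert (at least two experts).
    """
    total = sum(expert_counts)
    probs = [c / total for c in expert_counts if c > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(expert_counts))

# Hypothetical per-epoch activation counts for four experts.
history = {1: [97, 1, 1, 1], 50: [40, 30, 20, 10], 100: [25, 25, 25, 25]}
balance_by_epoch = {epoch: normalized_entropy(c) for epoch, c in history.items()}
```

Plotting this value per epoch would show whether expert usage converges toward a stable distribution or keeps oscillating.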

(4): All models were trained and evaluated using fixed dataset splits, unified training settings, and consistent evaluation criteria. Although this design ensured the fairness and consistency of the model comparisons, the study did not further examine model performance from the perspective of statistical significance.

Therefore, the reported results mainly reflect the comparative performance of different methods under the adopted experimental protocol. In this study, the experimental analysis focused on model accuracy, segmentation performance, computational efficiency, ablation validation, and practical application feasibility, whereas statistical uncertainty analysis was not included as a separate experimental component. Future work will further examine the robustness and consistency of model performance through confidence interval estimation and uncertainty analysis.
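As an indication of what such an analysis might look like, the following sketch estimates a percentile bootstrap confidence interval for precision from per-detection review outcomes. The function name, the number of resamples, and the example counts are all hypothetical; no such interval was computed in this study.

```python
import random

def bootstrap_precision_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for precision.

    outcomes: list of 1 (detection confirmed as a true positive)
    or 0 (detection judged a false positive).
    """
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample detections with replacement and recompute precision each time.
    stats = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Hypothetical review outcome: 55 confirmed sites out of 100 detections.
outcomes = [1] * 55 + [0] * 45
low, high = bootstrap_precision_ci(outcomes)
```

Resampling whole images or whole sites, rather than individual detections, would be a more conservative choice when detections from the same site are correlated.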

6. Conclusions

This paper proposes WMN, a model for extracting solid waste sites from high-resolution remote-sensing imagery. While maintaining computational efficiency, the proposed model explicitly perceives variations among different PIPs in terms of scale, structural characteristics, and contextual information within the input scenes, and adaptively activates more appropriate expert feature extraction pathways. Through progressive feature-level aggregation, WMN forms stable representations of SIOs, thereby enhancing the model’s capability to understand and discriminate complex targets. Extensive experiments conducted on multiple public datasets and a self-constructed dataset demonstrate that the proposed method exhibits robust generalization ability across diverse remote-sensing scenarios. Compared with several classical approaches and recently proposed state-of-the-art models, WMN achieves more stable and competitive performance in multi-class detection and segmentation tasks. Furthermore, systematic ablation studies and multi-dimensional analyses validate the effectiveness of each core module in mitigating the interference caused by PIP heterogeneity in SIO recognition and in enhancing the overall ability to model semantic consistency.

In the practical application in Loudi City, Hunan Province, the proposed method demonstrated both high efficiency and practical applicability. For Level-18 imagery data (6.35 GB), the complete processing and filtering were accomplished within 1.34 h; for Level-16 imagery data (12.40 GB), they were completed within 3.21 h. In terms of accuracy, visual interpretation showed that the method achieved precision values of 54.8% and 65.3% for GDS and ES, respectively; even for the more complex CSS, it attained 29.8% precision. For TP and LS, where samples are relatively limited, the method still quickly provided location information for subsequent screening despite lower precision (6.6% and 9.7%, respectively). Collectively, these results indicate that the proposed approach can extract multiple categories of sites while maintaining efficiency, demonstrating considerable practical value. Furthermore, in collaboration with Shuguang Environmental Protection Public Welfare Development Center, a “GIS-based remote-sensing solid waste pollution risk prevention system” for solid waste sites was developed. This development further enhances monitoring efficiency and provides reliable support for subsequent field surveys and regulatory decision-making.

In terms of model improvement, we have not comprehensively controlled the specific tasks that each expert should undertake, which leaves scope for enhancement in interpretability and controllability. Regarding computational efficiency, the MOE structure can theoretically enhance performance through an expert parallelization strategy, in which multiple experts are distributed across different GPUs to share the computational and memory load. This approach can not only significantly enhance training efficiency but also further utilize the advantages of MOE in practical engineering applications.

Author Contributions

Conceptualization, J.L.; methodology, K.W.; software, B.Y.; validation, K.W.; formal analysis, J.L.; investigation, J.L.; resources, J.L.; data curation, C.L.; writing—original draft preparation, K.W.; writing—review and editing, J.L.; visualization, K.W.; supervision, J.L.; project administration, K.W.; funding acquisition, J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Key Laboratory of Water Safety for Beijing-Tianjin-Hebei Region of Ministry of Water Resources, under the project “Research on Intelligent Identification and Monitoring Models for Typical River–Lake Elements and Freshwater Fish Biodiversity Security in the Beijing–Tianjin–Hebei Region” [No. IWHR-JJJ-202404].

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset constructed in this study was developed in collaboration with the Shuguang Environmental Protection Public Welfare Development Center. This public welfare organization has long been engaged in solid waste pollution investigation activities. Its volunteer team provided a large number of verified real-world solid waste site locations. Based on these site records, we conducted image acquisition and manual annotation. As the location information involves long-term investigation outcomes accumulated by the partner organization, data sharing requires prior consent from the collaborating party. Therefore, the dataset is currently available upon request rather than through unrestricted public download. For environmental protection and academic research purposes, applications accompanied by a clear statement of research intent are typically supported by the partner organization.

Acknowledgments

We sincerely thank the Changsha Shuguang Environmental Protection Public Welfare Development Center for its support in the construction of the dataset and for providing valuable suggestions and feedback during the system trial. We also gratefully acknowledge Shuhua Chen for her assistance with manuscript proofreading and language polishing.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Based on the category-wise results presented in Table A1, this paper further analyzes the detection and segmentation performance for ES, TP, and LS. In the detection task, Propose + YOLOv11 achieves stable and competitive performance improvements across all three categories. Specifically, for the TP category, which exhibits substantial internal variations in PIP features, the proposed method achieves an mAP₅₀ of 85.4%, significantly outperforming methods such as FasterRCNN, RT-DETR, and the baseline YOLOv11. For the LS category, which involves substantial scale variations, it also reaches 83.1%. Regarding the ES category, which generally presents relatively lower localization difficulty but contains certain internal structural complexities, the proposed method does not exhibit performance degradation due to the introduced multi-expert modeling mechanism, demonstrating the model’s strong stability across different levels of target complexity.

Table A1. The results of the model for each class on the BUCEA-SWS Dataset.

Model	Box (mAP₅₀)				Seg (mAP₅₀)
Model	ES	TP	LS	All	ES	TP	LS	All
FasterRCNN (2017)	73.6	74.9	72.7	73.7	---	---	---	---
RetinaNet (2017)	73.5	68.2	71.1	70.9	---	---	---	---
Transformer (2019)	75.1	72.3	74.8	74.1	---	---	---	---
CSL (2020)	74.7	64.4	70.1	69.7	---	---	---	---
R3Det (2021)	74.3	69.5	73.0	72.3	---	---	---	---
PKI (2024)	76.8	76.8	76.4	76.7	---	---	---	---
RT-DETR (2024)	76.3	76.6	79.9	77.6	---	---	---	---
YOLOv11 (2024)	77.9	83.1	81.1	80.7	79.8	85.3	82.0	82.4
YOLOv12 (2025)	77.8	76.5	80.8	78.4	79.0	79.3	81.4	79.9
LEGNet (2025)	72.4	68.9	72.4	71.2	---	---	---	---
FBRT (2025)	77.8	78.5	81.7	79.3	---	---	---	---
Propose+ (YOLOv11)	80.9	85.4	83.1	83.1	83.0	86.7	84.1	84.6
Propose+ (YOLOv12)	77.9	84.6	83.1	81.9	80.1	85.5	83.4	83.0

The bold values indicate the best performance under each evaluation metric.

In the instance segmentation task, the category-wise results show a consistent trend with those in detection. Propose + YOLOv11 achieves higher segmentation accuracy on ES, TP, and LS targets compared to the corresponding baseline models YOLOv11 and YOLOv12, with more significant improvements observed for the TP and LS categories. This indicates that the multi-expert feature routing mechanism not only enhances object localization performance but also effectively strengthens the modeling capability for the internal regional structures of facilities. Furthermore, the introduction of the KAN linear layer further enhances the model’s ability to represent target boundaries and fine-grained contour information.

As shown in Table A2, this paper further analyzes the performance variations in detection and segmentation tasks for the two target categories, namely GDS and CSS, at the category level. The analysis primarily focuses on comparing the baseline models with their counterparts enhanced by the proposed method. In the detection task, YOLOv11 achieves mAP50 scores of 71.3% and 60.2% for GDS and CSS, respectively. With the integration of the proposed method, Propose + YOLOv11 improves these scores to 75.5% and 61.4%. In contrast, YOLOv12 already demonstrates a robust detection capability on the GDS category (74.1%), thus showing a relatively limited improvement; however, on the CSS category, its performance still increases from 59.2% to 63.5%, which reflects the stable enhancement provided by the proposed method for more challenging categories.

Table A2. The results of the model for each class on the BUCEA-SWS Dataset (Garbage Dump Sites, Construction Spoil Sites).

Model	Box (mAP₅₀)			Seg (mAP₅₀)
Model	GDS	CSS	All	GDS	CSS	All
FasterRCNN (2017)	62.4	54.7	58.6	-	-	-
RetinaNet (2017)	47.2	46.3	46.7	-	-	-
Transformer (2019)	59.4	51.2	55.3	-	-	-
CSL (2020)	58.8	53.9	56.3	-	-	-
R3Det (2021)	60.8	47.0	53.9	-	-	-
PKI (2024)	63.8	56.1	59.9	-	-	-
RT-DETR (2024)	70.5	59.4	64.9	-	-	-
YOLOv11 (2024)	71.3	60.2	65.8	74.6	57.7	66.1
YOLOv12 (2025)	74.1	59.2	66.6	78.6	61.6	70.1
LEGNet (2025)	68.8	61.0	64.9	-	-	-
FBRT (2025)	70.9	60.4	65.6	-	-	-
Propose+ (YOLOv11)	75.5	61.4	68.5	78.1	61.2	69.6
Propose+ (YOLOv12)	74.2	63.5	68.8	75.4	64.1	69.8

The bold values indicate the best performance under each evaluation metric.

In the instance segmentation task, the proposed method also yields remarkable improvements for YOLOv11. YOLOv11 achieves segmentation mAP50 scores of 74.6% and 57.7% for GDS and CSS, respectively, whereas Propose + YOLOv11 boosts these scores to 78.1% and 61.2%. For the YOLOv12 baseline, the overall segmentation result is essentially unchanged (70.1% versus 69.8%): CSS improves from 61.6% to 64.1%, while GDS decreases from 78.6% to 75.4%. The main reason lies in the fact that both GDS and CSS objects generally lack clear and stable boundary definitions in remote-sensing imagery, and the annotation process itself involves a certain degree of uncertainty. This makes pixel-wise segmentation metrics particularly sensitive to minor deviations along edges. Under such conditions, the advantage of the proposed method within the YOLOv12 framework is manifested more in the stability of segmentation results (being less sensitive to boundary perturbations and yielding more consistent region predictions) than in a significant quantitative boost. Such stability and robustness still contribute positively to the practical reliability of the model in complex scenarios.

Table A3. Results on GDTD.

Model	mAP₅₀	mAP_50–95	Precision	Recall	F1-Score
FasterRCNN (2017)	34.8	14.1	41.6	44.5	43.0
RetinaNet (2017)	25.2	10.2	30.4	46.0	36.6
Transformer (2019)	43.0	16.1	50.9	51.5	51.2
CSL (2020)	34.1	15.2	39.2	49.2	43.6
R3Det (2021)	41.1	18.8	45.4	50.9	48.0
PKI (2024)	59.0	35.2	71.8	56.9	63.5
RT-DETR (2024)	48.5	27.3	49.6	50.3	49.9
YOLOv11 (2024)	55.6	35.3	50.8	53.2	52.0
YOLOv12 (2025)	52.6	32.5	50.2	52.6	51.4
LEGNet (2025)	56.7	29.7	60.1	60.4	60.2
FBRT (2025)	56.4	33.6	52.9	54.8	53.8
Propose+ (YOLOv11)	57.9	35.6	54.6	53.8	54.2
Propose+ (YOLOv12)	57.7	36.7	66.4	46.4	54.6

The bold values indicate the best performance under each evaluation metric.

As shown in Table A3, the methods proposed in recent years have demonstrated a steady improving trend in overall performance on the GDTD. Among them, PKI and LEGNet achieve relatively better results in metrics such as mAP₅₀, Precision, and F1-score, reflecting their strong target discrimination capability in complex scenes. The YOLO series models exhibit stable performance overall. Specifically, YOLOv11 marginally outperforms YOLOv12 in comprehensive metrics, which indicates better detection consistency.

On the YOLOv11 baseline, after incorporating the proposed method, Propose + YOLOv11 improves mAP₅₀ from 55.6% to 57.9%, and raises Precision and F1-score by 3.8 and 2.2 percentage points, respectively, indicating that the method can effectively enhance the model’s discrimination reliability in GDTD scenarios. However, its mAP_50–95 increases only marginally, by 0.3 percentage points, mainly limited by factors such as the low image resolution and blurred target boundaries in GDTD. Given that YOLOv11 already possesses relatively sufficient localization ability, the performance under high IoU thresholds is more reliant on the details of the original image. As a result, there is limited scope for further boundary refinement based on fine-grained physical image parcel (PIP) features.

In contrast, on the YOLOv12 baseline, the proposed method leads to a more significant improvement in mAP_50–95 (32.5% → 36.7%), which demonstrates its effective enhancement of localization stability in the high-IoU range. Meanwhile, Precision increases from 50.2% to 66.4%, while Recall decreases from 52.6% to 46.4%, indicating that the model adopts a more conservative prediction strategy in complex backgrounds, reducing false positives at the cost of missing some low-quality targets.
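The Precision/Recall trade-off described above can be checked against the F1-Score column. The following is a minimal sketch, assuming the F1-Score in Table A3 is the standard harmonic mean of Precision and Recall (the function name `f1_score` is illustrative and not taken from the paper); under that assumption it reproduces the reported values to one decimal place.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (Precision, Recall) pairs copied from Table A3 (GDTD).
rows = {
    "YOLOv12 (2025)": (50.2, 52.6),
    "Propose+ (YOLOv12)": (66.4, 46.4),
}

for name, (p, r) in rows.items():
    print(f"{name}: F1 = {f1_score(p, r):.1f}")
```

Because the harmonic mean is dominated by the smaller of the two values, the large Precision gain of Propose+ (YOLOv12) is partly offset by its lower Recall.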

Overall, the performance gains of the proposed method on GDTD are closely related to the characteristics of the baseline models: on YOLOv11, improvements are constrained by image quality and target ambiguity, whereas on YOLOv12, the method effectively compensates for its localization shortcomings under high IoU thresholds, further validating the adaptability and robustness of the approach in complex, low-quality remote-sensing scenarios.

As shown in Table A4, the overall performance gap among different methods on the OPMOD is relatively limited, indicating significant challenges in target appearance consistency and discriminative feature representation within this dataset. Models such as PKI, RT-DETR, and the YOLO series exhibit relatively stable performance on metrics such as mAP_50–95.

Table A4. Results on OPMOD.

Model	mAP₅₀	mAP_50–95	Precision	Recall	F1-Score
FasterRCNN (2017)	51.7	27.0	60.0	51.8	55.6
RetinaNet (2017)	49.8	22.2	59.1	50.2	54.3
Transformer (2019)	50.2	22.3	56.2	52.8	54.4
CSL (2020)	50.3	24.8	58.7	50.0	54.0
R3Det (2021)	51.0	24.7	59.0	51.2	54.8
PKI (2024)	54.6	28.4	61.8	55.8	58.6
RT-DETR (2024)	54.1	31.4	52.6	55.3	53.9
YOLOv11 (2024)	55.7	33.7	57.0	52.6	54.7
YOLOv12 (2025)	55.7	33.9	56.6	54.2	55.4
LEGNet (2025)	50.2	28.7	58.7	54.2	56.4
FBRT (2025)	56.5	33.9	56.5	53.6	55.0
Propose+ (YOLOv11)	56.3	34.1	58.6	52.8	55.5
Propose+ (YOLOv12)	56.9	34.5	58.8	54.3	56.5

The bold values indicate the best performance under each evaluation metric.

After integrating the proposed method into the YOLOv11 and YOLOv12 baseline models, both show consistent and stable improvements in overall performance. Both Propose + YOLOv11 and Propose + YOLOv12 outperform their respective baselines in multiple metrics, such as mAP₅₀, mAP_50–95, and F1-score. Notably, Propose + YOLOv12 attains a value of 34.5% in mAP_50–95.

It should be noted that the overall improvement brought by the proposed method on OPMOD is relatively moderate, mainly due to the inherent characteristics of the dataset. The morphological variations among different excavation sites in OPMOD are substantial: some targets exhibit typical deep-pit structures, while others—due to the low spatial resolution—appear merely as exposed ground surfaces protruding within forested or barren areas, with indistinct pit-like features. Furthermore, the edges of some mining pits are directly adjacent to bare land, lacking clear structural boundaries. Additionally, the limited scale of training samples further exacerbates the learning difficulty for the models.

As shown in Table A5, the OSTPD is relatively small in overall sample size, yet the TP targets exhibit significant variations in scale and morphology, with individual targets occupying approximately 10% to 40% of the image area, and a small number of samples even exceeding 50%. This characteristic—large scale variation and weak intra-class consistency—poses considerable challenges for models in simultaneously achieving complete target coverage and suppressing background interference.

Table A5. Results on OSTPD.

Model	mAP₅₀	mAP_50–95	Precision	Recall	F1-Score
FasterRCNN (2017)	36.8	14.2	53.7	42.3	47.3
RetinaNet (2017)	50.9	18.2	69.3	50.0	58.1
Transformer (2019)	50.1	20.5	57.0	54.8	55.9
CSL (2020)	38.8	14.7	58.8	38.5	46.5
R3Det (2021)	46.0	16.9	60.0	49.0	53.9
PKI (2024)	46.5	17.5	59.6	53.8	56.6
RT-DETR (2024)	35.2	18.7	41.1	34.6	37.6
YOLOv11 (2024)	49.5	29.4	59.7	35.6	44.6
YOLOv12 (2025)	43.9	25.2	55.0	31.7	40.2
LEGNet (2025)	51.1	20.5	49.2	58.7	53.5
FBRT (2025)	48.5	30.3	57.4	37.5	45.4
Propose+ (YOLOv11)	52.0	31.1	52.5	50.0	51.2
Propose+ (YOLOv12)	47.4	27.9	65.2	28.8	40.0

The bold values indicate the best performance under each evaluation metric.

Among the compared methods, RetinaNet performs relatively well in metrics such as Precision and mAP₅₀, but its mAP_50–95 is notably lower. This indicates that the model has limited ability to delineate target boundaries and achieve precise localization under high IoU requirements, making it difficult to maintain stable performance in tailings pond scenarios with significant scale and morphological changes.

YOLOv11 and YOLOv12 demonstrate certain advantages over traditional methods in terms of mAP_50–95, but their baseline models generally suffer from low Recall. Building on this, after incorporating the proposed method, both Propose + YOLOv11 and Propose + YOLOv12 achieve steady improvements in both mAP₅₀ and mAP_50–95, with mAP₅₀ increasing to 52.0% and 47.4%, and mAP_50–95 rising to 31.1% and 27.9%, respectively.

Overall, the strength of the proposed method for OSTPD is not manifested as an extreme improvement in any single metric. Instead, it lies in its ability to effectively alleviate the trade-off between target coverage and false positive suppression for baseline models under the complex conditions of limited sample size and large target scale variation. This indicates the method’s strong robustness and adaptability.

As shown in Table A6, on TPHPD, the overall detection performance of various methods is significantly better than that in medium-low resolution and small-sample scenarios, indicating that the target features in this dataset are relatively clear and more discriminable.

Table A6. Results on TPHPD.

Model	mAP₅₀	mAP_50–95	Precision	Recall	F1-Score
FasterRCNN (2017)	44.4	9.8	61.7	55.7	58.5
RetinaNet (2017)	67.6	23.8	77.1	67.2	71.8
Transformer (2019)	64.5	21.0	70.8	73.9	72.3
CSL (2020)	59.5	21.2	69.3	63.2	66.1
R3Det (2021)	66.0	21.3	78.2	69.9	73.8
PKI (2024)	75.2	30.8	81.5	74.9	78.1
RT-DETR (2024)	76.7	55.1	84.1	68.4	75.4
YOLOv11 (2024)	80.6	47.2	80.7	74.1	77.3
YOLOv12 (2025)	82.1	53.3	82.0	72.7	77.1
LEGNet (2025)	76.2	32.1	80.2	80.1	80.1
FBRT (2025)	82.7	50.0	76.9	78.7	77.8
Propose+ (YOLOv11)	84.2	55.0	89.1	73.6	80.6
Propose+ (YOLOv12)	83.1	53.9	85.1	72.6	78.4

The bold values indicate the best performance under each evaluation metric.

According to the comparative results, recent methods such as PKI, RT-DETR, LEGNet, and FBRT achieve relatively strong performance in metrics including mAP₅₀, Precision, and F1-score, reflecting their strong feature modeling and target discrimination capabilities in complex scenes. Among them, RT-DETR performs particularly well in the mAP_50–95 metric, while LEGNet achieves a more balanced trade-off between Precision and Recall, leading to a high F1-score.

Both YOLOv11 and YOLOv12 rank among the top in terms of mAP₅₀ and mAP_50–95, which indicates their excellent target localization and scale adaptation abilities in this scenario. Based on this foundation, the model performance is further enhanced after integrating the proposed method. Propose + YOLOv11 increases mAP₅₀ and mAP_50–95 to 84.2% and 55.0%, respectively, raises Precision to 89.1%, and achieves an F1-score of 80.6%, attaining the best or second-best performance among all compared methods on these four metrics; its Recall (73.6%) remains below that of LEGNet, FBRT, and PKI.

In comparison, Propose + YOLOv12 raises mAP_50–95 slightly (53.3% → 53.9%) and Precision more clearly (82.0% → 85.1%). This indicates that within the YOLOv12 framework, the method tends to reduce false positives by enhancing feature selection and suppressing redundant responses. However, its Recall is essentially unchanged (72.7% → 72.6%), leading to a smaller overall increase in the F1-score than for Propose + YOLOv11. This suggests that, given the relatively clear target characteristics in this dataset, the YOLOv12 baseline already possesses strong coverage capability, and the proposed method mainly further optimizes the reliability and stability of the detection results.

Appendix B

Table A7. Geometric, spectral, and related features of five types of solid waste sites.

	TP	CSS	LS	GDS	ES
High-resolution remote-sensing imagery


Geometric Features	Irregular pond-like in shape, with natural edges	Individual units form irregularly stacked pyramids. The overall layout exhibits a spatial arrangement pattern.	Built to follow the contours of the terrain, it has an irregular shape.	Irregular shape, chaotic outline.	Features an irregularly pitted structure.
Spectral Features	The pools display diverse colors, such as grayish white, blue-green, and earthy yellow.	The surface of the pile appears in shades of earthy yellow, brown, or gray.	Membranes are black, dark gray, or grayish white in color.	Appear grayish-white or bright white, with high reflectivity.	Yellowish-brown, grayish-white, or rust-colored, with high reflectivity.
Texture Features	The pond surface is smooth and fine-textured, while the intercepting dam shows a distinct linear profile.	The surface of the construction debris is rough and uneven in color, exhibiting a mottled, blocky appearance.	The surface of the covering film is flat yet rough, exhibiting a distinct reticulated pattern.	Features strong local brightness contrast and coarse, disordered characteristics.	The surface is fractured and rough, often accompanied by distinct digging marks.
Spatial configuration	Comprising an intercepting dam, dam body, and tailings pond, it is often constructed on mountainous or forested terrain.	Structures, characterized by loose or dense accumulation, are commonly found near construction sites and exposed areas along suburban roadsides.	Enclosed by transport roads, with an internal covering membrane, it adheres to mountainous and forested terrain.	Unevenly distributed sites often appear scattered, predominantly located in urban–rural fringe areas or on open, bare land.	Pits formed by vertical cliff walls and construction structures are often surrounded by mountain forests or farmland.

References

Fraternali, P.; Morandini, L.; González, S.L.H. Solid Waste Detection, Monitoring and Mapping in Remote Sensing Images: A Survey. Waste Manag. 2024, 189, 88–102.
Lv, Y. Research on the Identification and Risk Assessment of Solid Waste Landfill Sites Based on Multi-Source Information from Satellite, Aerial, and Ground Observations. Master’s Thesis, Zhejiang University, Hangzhou, China, 2023.
Ministry of Ecology and Environment of the People’s Republic of China. The Ministry and Seven Departments Jointly Launched a Nationwide Special Campaign Against Illegal Dumping and Disposal of Solid Waste. Available online: https://www.mee.gov.cn/ywdt/xwfb/202506/t20250625_1121878.shtml (accessed on 10 June 2026).
He, S.; Li, Y. Application of UAV Remote Sensing Images in Solid Waste Monitoring: A Case Study of Nanhu District, Jiaxing City. Cent. South Agric. Sci. Technol. 2024, 45, 82–85+99.
Mao, P.; Yu, J.; Tian, Y. Application of Satellite Remote Sensing in Solid Waste Supervision. Renew. Resour. Circ. Econ. 2024, 17, 21–23.
Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Cham, Switzerland, 2015; pp. 234–241.
He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
Du, S.; Xing, J.; Wang, S.; Wei, L.; Zhang, Y. STMNet: Scene Classification-Assisted and Texture Feature-Enhanced Multi-Scale Network for Large-Scale Urban Informal Settlement Extraction from Remote Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 13169–13187.
Zhang, C.; Xing, J.; Li, J.; Du, S.; Qin, Q. A New Method for the Extraction of Tailing Ponds from Very High-Resolution Remotely Sensed Images: PSVED. Int. J. Digit. Earth 2023, 16, 2681–2703.
Lavender, S. Detection of Waste Plastics in the Environment: Application of Copernicus Earth Observation Data. Remote Sens. 2022, 14, 4772.
Kruse, C.; Boyda, E.; Chen, S.; Karra, K.; Bou-Nahra, T.; Hammer, D.; Mathis, J.; Maddalene, T.; Jambeck, J.; Laurier, F. Satellite Monitoring of Terrestrial Plastic Waste. PLoS ONE 2023, 18, e0278997.
Yailymova, H.; Mikava, P.; Kussul, N.; Krasilnikova, T.; Shelestov, A.; Yailymov, B.; Titkov, D. Neural Network Model for Monitoring of Landfills Using Remote Sensing Data. In Proceedings of the 2022 IEEE 3rd International Conference on System Analysis & Intelligent Computing (SAIC), Kyiv, Ukraine, 4–7 October 2022; pp. 1–4.
Zhang, S.; Ma, J. CascadeDumpNet: Enhancing Open Dumpsite Detection through Deep Learning and AutoML Integrated Dual-Stage Approach Using High-Resolution Satellite Imagery. Remote Sens. Environ. 2024, 313, 114349.
Devesa, M.R.; Brust, A.V. Mapping Illegal Waste Dumping Sites with Neural-Network Classification of Satellite Imagery. arXiv 2021, arXiv:2110.08599.
Rajkumar, A.; Kft, C.A.; Sziranyi, T.; Majdik, A. Detecting Landfills Using Multi-Spectral Satellite Images and Deep Learning Methods. In Proceedings of the 10th International Conference on Learning Representations (ICLR 2022), Online, 25–29 April 2022; pp. 1–9.
Torres, R.N.; Fraternali, P. Learning to Identify Illegal Landfills through Scene Classification in Aerial Images. Remote Sens. 2021, 13, 4520.
Yang, K.; Zhang, C.; Luo, T.; Hu, L. Automatic Identification Method of Construction and Demolition Waste Based on Deep Learning and GAOFEN-2 Data. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2022, 43, 1293–1299.
Wang, P.; Zhao, H.; Yang, Z.; Jin, Q.; Wu, Y.; Xia, P.; Meng, L. Fast Tailings Pond Mapping Exploiting Large Scene Remote Sensing Images by Coupling Scene Classification and Sematic Segmentation Models. Remote Sens. 2023, 15, 327.
Yu, J.; Mao, P.; Wu, W.; Wang, Q.; Shao, X.; Teng, J.; Wang, Y. TSNET: A Solid Waste Instance Segmentation Model in China Based on a Two-Step Detection Strategy and Satellite Remote Sensing Images. Int. J. Appl. Earth Obs. Geoinf. 2025, 136, 104366.
Yong, Q.; Wu, H.; Wang, J.; Chen, R.; Yu, B.; Zuo, J.; Du, L. Automatic Identification of Illegal Construction and Demolition Waste Landfills: A Computer Vision Approach. Waste Manag. 2023, 172, 267–277.
Sun, X.; Yin, D.; Qin, F.; Yu, H.; Lu, W.; Yao, F.; He, Q.; Huang, X.; Yan, Z.; Wang, P.; et al. Revealing Influencing Factors on Global Waste Distribution via Deep-Learning Based Dumpsite Detection from Satellite Imagery. Nat. Commun. 2023, 14, 1444.
Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149.
Zhou, L.; Rao, X.; Li, Y.; Zuo, X.; Liu, Y.; Lin, Y.; Yang, Y. SWDet: Anchor-Based Object Detector for Solid Waste Detection in Aerial Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 16, 306–320.
Li, H.; Hu, C.; Zhong, X.; Zeng, C.; Shen, H. Solid Waste Detection in Cities Using Remote Sensing Imagery Based on a Location-Guided Key Point Network with Multiple Enhancements. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 16, 191–201.
Hussain, M. YOLOv1 to v8: Unveiling Each Variant–a Comprehensive Review of YOLO. IEEE Access 2024, 12, 42816–42833.
Khanam, R.; Hussain, M. YOLOv11: An Overview of the Key Architectural Enhancements. arXiv 2024, arXiv:2410.17725.
Liu, J.; Du, M.; Mao, Z. Scale Computation on High Spatial Resolution Remotely Sensed Imagery Multi-Scale Segmentation. Int. J. Remote Sens. 2017, 38, 5186–5214.
Fedus, W.; Zoph, B.; Shazeer, N. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. J. Mach. Learn. Res. 2022, 23, 1–39.
Jacobs, R.A.; Jordan, M.I.; Nowlan, S.J.; Hinton, G.E. Adaptive Mixtures of Local Experts. Neural Comput. 1991, 3, 79–87.
Pavlitskaya, S.; Hubschneider, C.; Weber, M.; Moritz, R.; Huger, F.; Schlicht, P.; Zollner, M. Using Mixture of Expert Models to Gain Insights into Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 342–343.
Liu, Z.; Wang, Y.; Vaidya, S.; Ruehle, F.; Halverson, J.; Soljačić, M.; Hou, T.Y.; Tegmark, M. KAN: Kolmogorov-Arnold Networks. arXiv 2024, arXiv:2404.19756.
Bolya, D.; Zhou, C.; Xiao, F.; Lee, Y.J. YOLACT: Real-Time Instance Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9157–9166.
Gao, S.-H.; Cheng, M.-M.; Zhao, K.; Zhang, X.-Y.; Yang, M.-H.; Torr, P. Res2Net: A New Multi-Scale Backbone Architecture. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 652–662.
Li, B.; Yan, H.; Wu, M.; Zhang, C. Multi-Scale Receptive Field Rectification of Remote Sensing Images. In Proceedings of the IGARSS 2024—2024 IEEE International Geoscience and Remote Sensing Symposium, Athens, Greece, 7–12 July 2024; pp. 9835–9839.
He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916.
Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848.
Chen, L.-C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv 2017, arXiv:1706.05587.
Kim, B.J.; Choi, H.; Jang, H.; Kim, S.W. Resolution-Aware Design of Atrous Rates for Semantic Segmentation Networks. arXiv 2023, arXiv:2307.14179.
Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
Chen, Y.; Jiang, W.; Wang, Y. FAMHE-Net: Multi-Scale Feature Augmentation and Mixture of Heterogeneous Experts for Oriented Object Detection. Remote Sens. 2025, 17, 205.
Rossi, L.; Bernuzzi, V.; Fontanini, T.; Bertozzi, M.; Prati, A. Swin2-MoSE: A New Single Image Supersolution Model for Remote Sensing. IET Image Process. 2025, 19, e13303.
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
Chen, S. Official Weibo of Changsha Shuguang Environmental Protection Public Welfare Development Center. Available online: https://www.weibo.com/sghb201381 (accessed on 17 May 2026).
Lyu, J.; Hu, Y.; Ren, S.; Yao, Y.; Ding, D.; Guan, Q.; Tao, L. Extracting the Tailings Ponds from High Spatial Resolution Remote Sensing Images by Integrating a Deep Learning-Based Model. Remote Sens. 2021, 13, 743.
Lin, G. Open Pit Mine Object Detection Dataset. Figshare. 2024. Available online: https://figshare.com/articles/dataset/Open_Pit_Mine_Object_Detection_Dataset/27300960 (accessed on 17 May 2026).
Li, J.; Li, M.; Sui, Z.; Su, W.; Lian, Y.; Chen, S.; Yuan, Z. Target Detection Dataset of Tailings Ponds in Henan Province, China (2016–2021). Science Data Bank. 2022. Available online: https://www.scidb.cn/en/detail?dataSetId=720626420933296128 (accessed on 17 May 2026).
Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-Time Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974.
Xiao, Y.; Xu, T.; Xin, Y.; Li, J. FBRT-YOLO: Faster and Better for Real-Time Aerial Image Detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 8673–8681.
Cai, X.; Lai, Q.; Wang, Y.; Wang, W.; Sun, Z.; Yao, Y. Poly Kernel Inception Network for Remote Sensing Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 27706–27716.
Lu, W.; Chen, S.B.; Li, H.D.; Shu, Q.L.; Ding, C.H.Q.; Tang, J.; Luo, B. LEGNet: Lightweight Edge-Gaussian Driven Network for Low-Quality Remote Sensing Image Object Detection. arXiv 2025, arXiv:2503.14012.
Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
Ding, J.; Xue, N.; Long, Y.; Xia, G.S.; Lu, Q. Learning RoI Transformer for Oriented Object Detection in Aerial Images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2849–2858.
Yang, X.; Yan, J. Arbitrary-Oriented Object Detection with Circular Smooth Label. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 677–694.
Yang, X.; Yan, J.; Feng, Z.; He, T. R3Det: Refined Single-Stage Detector with Feature Refinement for Rotating Object. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021; Volume 35, pp. 3163–3171.

Figure 1. SIOs and PIPs of solid waste sites ((a) Landfill Sites; (b) Tailings Ponds; (c) Excavation Sites; (d) Garbage Dump Sites; (e) Construction Spoil Sites; (f) Example of the relationship between SIOs and PIPs in remote sensing imagery).

Figure 2. Examples of other types of land cover whose PIPs resemble those of solid waste site SIOs. (a) Landfill Sites; (b) Tailings Ponds; (c) Excavation Sites; (d) Garbage Dump Sites; (e) Construction Spoil Sites.

Figure 3. Overall network architecture.

Figure 4. The composition of four perception experts.

Figure 5. KAN Mask Coefficient Prediction Module.

Figure 6. Distribution of various data collection locations (Projection: WGS 84/Web Mercator (Auxiliary Sphere), EPSG:3857).

Figure 7. Results on the BUCEA-SWS Dataset ((a,b) Excavation Sites; (c,d) Landfill Sites; (e,f) Tailings Ponds).

Figure 8. Results on the BUCEA-SWS Dataset ((a–c) Construction Spoil Sites; (d–f) Garbage Dump Sites).

Figure 9. Results on Open-Source Datasets (Panels (a,b) are from the GDTD dataset, panels (c,d) are from the OPMOD dataset, panels (e,f) are from the OSTPD dataset, and panels (g,h) are from the TPHPD dataset).

Figure 10. DARF-MOE Expert Routing across Solid-Waste Scenarios.

Figure 11. Grad-CAM comparison for ES, LS and TP scenarios ((a,b) Excavation Sites; (c,d) Landfill Sites; (e,f) Tailings Ponds).

Figure 12. Grad-CAM comparison for CSS and GDS scenarios ((g–i) Construction Spoil Sites; (j–l) Garbage Dump Sites).

Figure 13. Empirical Workflow.

Figure 14. Empirical Results for Loudi City (Projection: WGS 84/Web Mercator (Auxiliary Sphere), EPSG:3857; (a) Propose+ (YOLOv11) results on Level-16 imagery, including Excavation Sites (ES), Tailings Ponds (TP), and Landfill Sites (LS); (b–f) Propose+ (YOLOv12) results on Level-18 imagery, including Construction Spoil Sites (CSS) and Garbage Dump Sites (GDS)).

Figure 15. GIS-based remote-sensing solid waste pollution risk prevention system (The Chinese labels displayed in the interface denote different functional modules of the system, including Layer Management, Risk Assessment, Data Import, and User Management).

Figure 16. Practical application.

Figure 17. Remaining challenging cases for the proposed method in complex scenes. Red circles indicate model failure regions: (a) confusion with GDS-like objects; (b) incomplete extraction of roadside GDS; (c) confusion with CSS-like materials; (d) fragmented extraction of large-scale targets (TP, LS, and ES).

Table 1. Annotation Specifications for Five Types of Solid Waste Sites.

Categories	Annotation Sample Specifications Based on PIPs in Solid Waste Site SIOs
TP	A typical tailings storage SIO mainly consists of three types of PIPs: dam body, tailings pond, and interception dam. During annotation, all three PIP areas should be fully covered, and any buildings should be excluded from the annotated image region whenever possible.
CSS	A typical spoil heap SIO mainly consists of conical mounds or densely arranged small mounds as PIPs. Only areas with clearly identifiable spoil characteristics should be annotated, while scattered heaps lacking typical features are not annotated.
LS	A typical landfill SIO mainly consists of black or white mesh-covered PIPs. Since the white mesh can be easily confused with overexposed areas in the imagery, potentially affecting recognition accuracy, only areas covered by black mesh should be annotated.
GDS	A typical garbage pile SIO mainly consists of bright white or grayish-white mounded PIPs, with the main area usually exhibiting a noticeable white transitional zone relative to surrounding land features. Due to blurred boundaries, only the bright white or grayish-white mounded main area should be annotated.
ES	A typical excavation site SIO mainly consists of prominent cliff edges, construction areas, and three-dimensional shadow PIPs. During annotation, boundaries should follow the cliff edges, avoiding construction buildings whenever possible, while retaining pit-like features such as excavation pits or exposed slopes.

Table 2. BUCEA-SWS Dataset.

	TP	CSS	LS	GDS	ES
Quantity	1364	2757	3137	3328	2460
Proportion	10.45%	21.13%	24.05%	25.51%	18.86%
Categories

Each column represents a different category of images, with corresponding individual instances marked in distinct colors.

Table 3. Overview of Four Open-Source Datasets.

Datasets	Location	Source	Resolution	Categories
Global Dumpsite Test Data	Cities in Africa and Asia	Google Earth	---	CSS, GDS
Open Source Tailings Pond Dataset	Cities in Anhui Province, China	Google Earth	2.05 m	TP
Open Pit Mine Object Detection Dataset	---	---	---	ES
Tailings Ponds in Henan Province	Henan Province, China	---	2.00 m	TP

Table 4. Training Parameter Settings.

Parameter	Descriptions	Value
GPU_COUNT	Number of GPUs used	1
Batch Size	Number of samples per training batch	16
Epochs	Number of training epochs	300
Image_size	Image size	512 × 512
Learning_Rate	Learning rate	0.01
Optimizer	Optimizer	SGD
Experts_num	Number of experts	4
Top-K_num	Number of active experts	2
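
The last two rows of Table 4 (four experts, two active) correspond to sparse top-k routing. The following is a minimal sketch of generic top-k gating in a mixture-of-experts layer, written in plain Python; it is not the authors' DARF-MOE or NST-MOE routing, and the names `top_k_gate`, `mix`, and the example router scores are hypothetical.

```python
import math


def top_k_gate(logits, k=2):
    """Return {expert_index: weight} for the k largest router logits.

    The weights are a softmax over the selected logits only, so they sum to 1.
    """
    if not 0 < k <= len(logits):
        raise ValueError("k must be between 1 and the number of experts")
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def mix(expert_outputs, gate):
    """Weighted sum of the selected experts' (scalar) outputs."""
    return sum(weight * expert_outputs[i] for i, weight in gate.items())


# Hypothetical router scores for Experts_num = 4 with Top-K_num = 2.
router_logits = [0.1, 2.0, -1.0, 1.0]
gate = top_k_gate(router_logits, k=2)  # selects experts 1 and 3
combined = mix([10.0, 20.0, 30.0, 40.0], gate)
```

Only the selected experts would need to be evaluated in a real layer, which is why activating two of four experts keeps the added compute bounded.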

Table 5. The detection results for the model on the BUCEA-SWS Dataset.

Model	mAP₅₀	mAP_50–95	Precision	Recall	F1-Score
FasterRCNN (2017)	73.7	28.2	80.5	74.1	77.2
RetinaNet (2017)	70.9	25.3	75.8	72.5	74.1
Transformer (2019)	74.1	28.3	77.3	75.2	76.2
CSL (2020)	69.7	26.5	75.5	72.0	73.7
R3Det (2021)	72.3	26.0	80.4	70.4	75.1
PKI (2024)	76.7	36.9	81.9	78.2	80.0
RT-DETR (2024)	77.6	57.8	78.7	72.9	75.7
YOLOv11 (2024)	80.7	61.9	80.3	72.6	76.3
YOLOv12 (2025)	78.4	60.4	77.6	72.5	75.0
LEGNet (2025)	71.2	32.5	76.4	72.1	74.2
FBRT (2025)	79.3	61.6	77.6	74.8	76.2
Propose+ (YOLOv11)	83.1	65.0	79.4	78.3	78.8
Propose+ (YOLOv12)	81.9	63.1	75.9	78.9	77.4

The bold values indicate the best performance under each evaluation metric.

Table 6. The detection results for the model on the BUCEA-SWS Dataset (Garbage Dump Sites, Construction Spoil Sites).

Model	mAP₅₀	mAP_50–95	Precision	Recall	F1-Score
FasterRCNN (2017)	58.6	22.4	61.7	68.3	64.8
RetinaNet (2017)	46.7	12.5	55.2	56.1	55.6
Transformer (2019)	55.3	20.2	64.5	58.9	61.6
CSL (2020)	56.3	20.2	60.7	61.8	61.2
R3Det (2021)	53.9	16.3	67.0	54.7	60.2
PKI (2024)	59.9	24.6	70.2	60.7	65.1
RT-DETR (2024)	64.9	44.9	67.5	59.1	63.0
YOLOv11 (2024)	65.8	41.7	71.0	59.5	64.7
YOLOv12 (2025)	66.6	46.1	67.6	60.3	63.7
LEGNet (2025)	64.9	29.3	70.2	64.4	67.2
FBRT (2025)	65.6	45.0	60.9	67.8	64.2
Propose+ (YOLOv11)	68.5	46.0	67.0	65.6	66.3
Propose+ (YOLOv12)	68.8	47.2	63.5	66.9	65.2

The bold values indicate the best performance under each evaluation metric.

Table 7. The detection results for the model on Open-Source Datasets.

Model	mAP₅₀ (GDTD)	mAP₅₀ (OPMOD)	mAP₅₀ (OSTPD)	mAP₅₀ (TPHPD)
FasterRCNN (2017)	34.8	51.7	36.8	44.4
RetinaNet (2017)	25.2	49.8	50.9	67.6
Transformer (2019)	43.0	50.2	50.1	64.5
CSL (2020)	34.1	50.3	38.8	59.5
R3Det (2021)	41.1	51.0	46.0	66.0
PKI (2024)	59.0	54.6	46.5	75.2
RT-DETR (2024)	48.5	54.1	35.2	76.7
YOLOv11 (2024)	55.6	55.7	49.5	80.6
YOLOv12 (2025)	52.6	55.7	43.9	82.1
LEGNet (2025)	56.7	50.2	51.1	76.2
FBRT (2025)	56.4	56.5	48.5	82.7
Propose+ (YOLOv11)	57.9	56.3	52.0	84.2
Propose+ (YOLOv12)	57.7	56.9	47.4	83.1

The bold values indicate the best performance under each evaluation metric.

Table 8. Comparison of Model GFLOPs and FPS.

Model	GFLOPs	FPS
FasterRCNN (2017)	63.3	48.2
RetinaNet (2017)	52.5	50.8
Transformer (2019)	77.2	42.4
CSL (2020)	36.2	44.5
R3Det (2021)	82.3	25
PKI (2024)	45.4	6.5
RT-DETR (2024)	40.83	57
YOLOv11 (2024)	3.26	116
YOLOv12 (2025)	3.27	113
LEGNet (2025)	56.8	29.6
FBRT (2025)	7.31	114
Propose+ (YOLOv11)	7.14	88
Propose+ (YOLOv12)	7.1	84
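
The FPS values in Table 8 can be read as per-image latency, which makes the cost of the added modules easier to compare. The following is a small sketch of that conversion, assuming FPS is measured as images per second on a single stream; the helper `latency_ms` is illustrative and the values are copied from Table 8.

```python
def latency_ms(fps: float) -> float:
    """Convert frames per second to milliseconds per image."""
    return 1000.0 / fps


# (GFLOPs, FPS) pairs copied from Table 8.
models = {
    "YOLOv11 (2024)": (3.26, 116),
    "Propose+ (YOLOv11)": (7.14, 88),
    "YOLOv12 (2025)": (3.27, 113),
    "Propose+ (YOLOv12)": (7.1, 84),
}

for name, (gflops, fps) in models.items():
    print(f"{name}: {gflops} GFLOPs, {latency_ms(fps):.2f} ms per image")

# Extra latency per image introduced on the YOLOv11 baseline.
overhead_v11 = latency_ms(88) - latency_ms(116)
```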

Table 9. Ablation Study on Individual Modules.

	GFLOPs	FPS	DARF-MOE	NST-MOE	KAN-MCP	Linear	Detection mAP₅₀	Segmentation mAP₅₀
A	3.26	116	-	-	-	-	80.5	81.4
B	7.18	85	√	√	√		83.3	84.5
C	7.26	92	√	√		√	82.5	83.7
D	7.24	94	√	√			81.9	83.4
E	5.35	98	√		√		82.1	82.7
F	5.17	92		√	√		81.9	83.1
G	5.25	107	√				81.7	82.3
H	5.11	97		√			81.5	82.4
I	3.19	105			√		80.9	83.1

Table 10. Ablation study of expert group and sub-expert configurations in NST-MOE.

	GFLOPs	FPS	I	J	Box mAP₅₀	Seg mAP₅₀	Number of Experts
A	6.39	91.2	1	16	81.8	82.6	16
B	7.13	89.1	4	3	82.1	82.9	12
C	7.14	88	4	4	83.1	84.6	16
D	7.16	86.3	4	6	82.1	83.1	24
E	7.11	84.1	3	4	82.6	83.3	12
F	7.19	83.4	6	4	82.3	82.8	24
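
In every row of Table 10, the Number of Experts column equals the product I × J. This is consistent with reading I as the number of expert groups and J as the number of sub-experts per group, although that reading is an assumption inferred from the table caption. The sketch below checks the relation and picks the configuration with the highest Box mAP₅₀; the class and field names are illustrative only.

```python
from typing import NamedTuple


class NstMoeConfig(NamedTuple):
    name: str
    gflops: float
    fps: float
    i: int          # assumed: number of expert groups
    j: int          # assumed: sub-experts per group
    box_map50: float
    seg_map50: float
    experts: int    # "Number of Experts" as reported


# Rows copied from Table 10.
configs = [
    NstMoeConfig("A", 6.39, 91.2, 1, 16, 81.8, 82.6, 16),
    NstMoeConfig("B", 7.13, 89.1, 4, 3, 82.1, 82.9, 12),
    NstMoeConfig("C", 7.14, 88.0, 4, 4, 83.1, 84.6, 16),
    NstMoeConfig("D", 7.16, 86.3, 4, 6, 82.1, 83.1, 24),
    NstMoeConfig("E", 7.11, 84.1, 3, 4, 82.6, 83.3, 12),
    NstMoeConfig("F", 7.19, 83.4, 6, 4, 82.3, 82.8, 24),
]

# Check that the reported expert count equals I * J in every row.
consistent = all(c.i * c.j == c.experts for c in configs)

# Configuration with the highest box mAP50.
best = max(configs, key=lambda c: c.box_map50)

print(f"I * J matches reported expert count: {consistent}")
print(f"Best configuration: {best.name} (I={best.i}, J={best.j})")
```

Configurations A and C both use 16 experts, but only C splits them into 4 groups of 4, and C scores higher on both Box and Seg mAP₅₀.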

Table 11. Time Consumption for the Loudi City Empirical Experiment.

Image Level	Data Size (GB)	Processing Time (h)	Screening Time (h)	Total Time (h)
Level-18 Empirical Data	6.35	0.88	0.46	1.34
Level-16 Empirical Data	12.40	2.2	1.01	3.21
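
The Total Time column in Table 11 is the sum of the processing and screening times, and the totals can be converted into throughput figures. The sketch below assumes that the 1.34 h total for the Level-18 data corresponds to the 490.67 km² Loudi City area quoted in the abstract. The variable names are illustrative.

```python
# (image level, data size in GB, processing h, screening h, total h), from Table 11.
table11_rows = [
    ("Level-18", 6.35, 0.88, 0.46, 1.34),
    ("Level-16", 12.40, 2.2, 1.01, 3.21),
]

# Assumption: the Level-18 run covers the 490.67 km^2 area cited in the abstract.
LOUDI_AREA_KM2 = 490.67

throughput_gb_per_h = {}
for level, size_gb, processing_h, screening_h, total_h in table11_rows:
    # Total time should equal processing plus screening (up to rounding).
    assert abs((processing_h + screening_h) - total_h) < 0.005
    throughput_gb_per_h[level] = size_gb / total_h
    print(f"{level}: {throughput_gb_per_h[level]:.2f} GB/h")

area_rate_km2_per_h = LOUDI_AREA_KM2 / table11_rows[0][4]
print(f"Level-18 areal throughput: {area_rate_km2_per_h:.1f} km^2/h")
```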

Table 12. Precision of the Empirical Experiment.

Categories	GDS	CSS	ES	TP	LS
Precision	54.8%	29.8%	65.3%	6.6%	9.7%

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

