Geometry-Adaptive Visual Measurement and Optimization for Anomaly Detection in Mining Conveyors

Peng, Pingan; Li, Xuhe; Cheng, Kaixuan; Gong, Shuangwei; Zhang, Haoyue

doi:10.3390/math14101611

by Pingan Peng, Xuhe Li *, Kaixuan Cheng, Shuangwei Gong and Haoyue Zhang

School of Resources and Safety Engineering, Central South University, Changsha 410017, China

* Author to whom correspondence should be addressed.

Mathematics 2026, 14(10), 1611; https://doi.org/10.3390/math14101611

Submission received: 25 March 2026 / Revised: 27 April 2026 / Accepted: 6 May 2026 / Published: 9 May 2026

(This article belongs to the Special Issue Mathematical Modeling and Analysis in Mining Engineering)

Abstract

This study demonstrates how structured algorithmic optimization can enhance intelligent visual measurement systems in mining engineering. Real-time visual measurement of mining conveyor belts is critical for operational safety, yet achieving high-precision anomaly detection under complex environmental conditions remains a significant challenge. Conventional approaches often struggle to balance detection accuracy with computational efficiency due to inefficient feature representation and optimization strategies. To address this, this study proposes FDSE-DETR, a lightweight end-to-end framework designed for real-time anomaly evaluation. The framework eliminates Non-Maximum Suppression (NMS) to streamline inference. Specifically, this study introduces a deformation-aware sampling mechanism to enhance feature representation of irregular hazards, alongside a cost-effective multi-scale aggregation strategy to preserve fine cues within strict device budgets. Furthermore, a reformulated loss objective is developed to rebalance hard samples under severe class imbalance, improving the detection confidence. Experimental results on mining conveyor belt foreign object datasets show a 4.5% improvement in mean average precision (mAP), a 3.9% improvement in overall recall and a 22.5% reduction in computational cost, achieving 120.7 FPS. This study aims to address the problems of insufficient accuracy and low efficiency in real-time material flow measurements on mining conveyor belts under high-dust and low-illumination conditions.

Keywords: visual measurement; algorithmic optimization; anomaly detection; mining engineering; computational efficiency

MSC: 68U10

1. Introduction

The degree of automation and intelligence in material logistics is a practical indicator of modernization in large-scale mining. As shown in Figure 1, belt conveyors enable long-distance, high-throughput, continuous transport of bulk materials, so their integrity and stable operation directly affect production continuity and costs [1]. In practice, hazardous foreign objects such as wooden blocks, discarded anchor rods, and metal mesh often intrude into conveyor lines; for instance, at a mine located in Wuxuan County, Guangxi, China, such intrusions occur frequently. Their irregular shapes and varying sizes, combined with dust, occlusions, and rapidly changing illumination underground, make them easy to miss and risky to ignore [2]. It should be noted that conveyor belt specifications and operating conditions may differ across mines, and the phenomena described here are based on observations at this specific site. If not detected and removed in time, these intrusions can cause belt tearing, accelerated abrasive wear, and severe mechanical failures, increasing maintenance costs, interrupting production, and creating safety hazards, especially when manual intervention is required in confined, poorly lit roadways [3].

Historically, non-destructive testing and evaluation of conveyor systems has relied on manual inspection, metal detectors, or radar-based sensing. These approaches tend to be inefficient, limited in scope, and costly over the system lifecycle. More importantly, many conventional scalar sensors cannot characterize non-metallic hazards, which limits their suitability for high-speed mining systems [4]. Automated visual testing based on deep learning offers richer anomaly representation than traditional measurement methods. With modern hardware, detectors such as YOLO [5], RetinaNet [6], Faster R-CNN [7], and the Transformer-based DETR [8] have achieved strong results, and mining-oriented adaptations have been reported for online condition monitoring. For example, Wang et al. [9] improved SSD algorithms for underground coal mines; Tian et al. [10] proposed a CNN-based method that utilizes time–frequency images derived from electromagnetic non-destructive testing signals to identify damage in mining wire ropes; and Saran et al. [11] developed multimodal imaging for steel mill conveyors. Other methods include Wang et al.’s [12] VMD-SVM-based feature extraction approach for gangue, Pu et al.’s [13] CNN-based recognition model that employs transfer learning to mitigate overfitting, and Sun et al.’s [14] AMAF-YOLO, which addresses the challenges of low-resolution and densely distributed targets in complex, cluttered environments. Lai et al. [15] combined an improved Mask R-CNN with multispectral imaging for gangue instance segmentation. Hong et al. [16] proposed a dual-model weak-light enhancement pipeline with a lightweight star-shaped attention convolutional detector (SARC-DETR) for foreign object detection under low illumination. To reduce the high computational cost and slow inference of many deep learning models, Lin et al. [17] developed a lightweight detector based on an improved YOLOv8n, facilitating deployment on embedded monitoring nodes. These studies demonstrate the feasibility of vision-based perception in complex industrial environments.

To address these challenges, this study proposes FDSE-DETR (Faster-DCN-Slim Neck-EMASVFL-DETR), a lightweight end-to-end visual measurement framework for real-time anomaly evaluation in mining operations. Built on RT-DETR, the framework retains efficient end-to-end inference and avoids Non-Maximum Suppression (NMS). FDSE-DETR follows a single design logic: maintain high sensitivity to small, irregular hazards in cluttered scenes while staying within strict edge-device budgets. To meet this constraint, the backbone emphasizes deformation-aware feature sampling so that thin or distorted objects remain distinguishable, the fusion stage prioritizes low-cost multi-scale aggregation to preserve fine cues without increasing latency, and the training objective rebalances hard samples under severe class imbalance to improve the detection confidence. In our implementation, these choices are realized with a dual-path backbone using FasterNet-style blocks with DCNv2-based deformable sampling [18,19], a compact Slim Neck based on GSConv and VoVGSCSP [20], and EMASlideVarifocalLoss (EMASVFL) using sliding-window reweighting with exponential moving averages [21,22]. Overall, FDSE-DETR is designed to balance accuracy, latency, and deployability for automated structural health monitoring in intelligent mining systems.

The remainder of this paper is organized as follows. Section 2 reviews object detection fundamentals. Section 3 presents FDSE-DETR, including the architecture, backbone, loss function, and low-light image enhancement methods. Section 4 describes the experimental setup, datasets, and results, including comparisons across backbones and neck networks and against existing approaches. Section 5 concludes this paper.

2. Related Work

2.1. Fundamentals of Object Detection

Deep learning has become the mainstream approach for industrial visual inspection, replacing much of the earlier reliance on manual checks and hand-crafted, rule-based vision pipelines. Most learning-based detection methods for industrial anomalies can be grouped into two families by backbone design: Convolutional Neural Network (CNN)-based models [23] and Transformer-based models [24]. Despite their successes, existing methods still struggle with four specific challenges when applied to mining conveyor belt anomaly detection: poor visibility, irregular shapes, small sizes, and severe class imbalance.

CNN-based detectors have matured into two widely used paradigms for industrial inspection: two-stage and one-stage detectors [23]. With respect to the challenge of poor visibility under low-light/high-dust conditions, these CNN-based detectors rely on standard convolutions that are not inherently robust to low-contrast, noisy images. Two-stage methods, represented by the Faster R-CNN family [7], generate candidate regions and then refine them with classification and localization. This staged refinement tends to deliver strong localization accuracy and is commonly used when precise defect positioning is required, for example in surface crack inspection of high-value components. Mask R-CNN [25] extends this line of work by adding an instance segmentation branch, which supports pixel-level characterization of defect shapes. However, none of these two-stage designs explicitly address texture degradation under poor illumination, limiting their ability to preserve fine details needed for small or partially obscured hazards.

One-stage detectors cast anomaly detection as direct regression and classification without an explicit proposal stage, which improves inference speed and suits real-time online monitoring. Representative models include SSD (Single Shot MultiBox Detector) [26] and the YOLO series [5]. YOLO-style detectors are widely used for in situ monitoring tasks such as conveyor belt inspection, where fast response matters and deployment constraints are strict. RetinaNet [6] further improved one-stage robustness by introducing Focal Loss to mitigate the severe foreground to background imbalance during training. This made one-stage detectors more reliable under clutter, dust, and other uncontrolled industrial conditions, while narrowing the accuracy gap with two-stage methods. Nevertheless, the backbone of one-stage detectors still uses fixed-kernel convolutions, which cannot adapt to irregular object shapes such as bent anchor rods, nor can they preserve weak signals from very small anomalies when aggressive downsampling is applied. As a result, YOLO-family models remain a common choice in real-time industrial measurement, largely because they offer a practical balance among speed, accuracy, and ease of embedded deployment, but they share the same geometric and multi-scale fusion limitations.

Transformer architectures have also become increasingly relevant for visual measurement tasks. Earlier sequence models such as Recurrent Neural Networks (RNNs) [27] can struggle with long sequences due to optimization difficulties and limited ability to capture long-range dependencies. In contrast, the Transformer [28] uses self-attention to aggregate information across positions, which helps model long-range interactions that often appear in complex anomaly patterns.

Transformer-based vision backbones, including Vision Transformer (ViT) [29] and Swin Transformer [30], have shown strong representation learning capability. Carion et al. [8] introduced DETR, which brought Transformers to object detection by removing hand-designed anchors and eliminating NMS [31], producing a simpler end-to-end detection pipeline. Building on this direction, Zhao et al. [32] proposed RT-DETR, which reduces computation by decoupling intra-scale feature interaction from cross-scale feature fusion through an efficient hybrid encoder. This design improves throughput while preserving the benefits of end-to-end detection, making RT-DETR a practical candidate for real-time industrial scenarios and for deployment on edge-based measurement instrumentation. Even with these improvements, Transformer-based detectors still lack an explicit mechanism to handle severe class imbalance, such as between normal belt surfaces and rare dangerous objects, and their self-attention can become weak when all image features are uniformly degraded under poor illumination.

To overcome the four identified limitations, we propose FDSE-DETR, a lightweight framework whose architecture integrally addresses texture loss, irregular geometry, small anomaly suppression, and class imbalance through a coordinated combination of enhanced feature extraction, adaptive sampling, efficient multi-scale fusion, and dynamic loss reweighting. Section 3 details the architecture and implementation.

2.2. Differentiation from Existing Detection Frameworks

Unlike generic object detectors such as YOLO and RT-DETR, which are designed for common vision tasks with regular objects and balanced class distributions, FDSE-DETR is specifically tailored for mining conveyor belt scenes under high dust, low illumination, small irregular hazards, and severe class imbalance. While existing methods rely on fixed geometric convolutions or standard attention mechanisms, FDSE-DETR introduces a deformable sampling backbone to adaptively handle irregularly shaped foreign objects. In contrast to RT-DETR’s ResNet-based feature extraction and convolutional cross-scale fusion module (CCFM) neck that tend to lose weak anomaly cues during multi-scale fusion, FDSE-DETR employs lightweight FasterNet blocks and a compact Slim Neck to preserve fine texture details and small hazard signals. Furthermore, whereas existing methods use conventional loss functions that struggle with extreme class imbalance, FDSE-DETR adopts an EMASVFL loss with exponential moving average reweighting to stabilize gradients and emphasize hard samples. Collectively, these differences enable FDSE-DETR to achieve higher detection accuracy and efficiency in challenging industrial environments, outperforming generic detectors by a clear margin.

3. FDSE-DETR Network Architecture

This study presents an enhanced visual measurement framework that balances evaluation accuracy and inference efficiency under three practical constraints, namely heavy environmental noise, large variations in anomaly scale, and limited computing on embedded monitoring instruments. The proposed architecture is derived from the lightweight RT-DETR-R18 baseline and is tailored for in situ deployment.

As shown in Figure 2, RT-DETR follows an end-to-end object detection paradigm comprising three key components: a backbone network for multi-scale feature extraction, a hybrid encoder that combines an attention-driven global modeling module (AIFI) with a lightweight CCFM, and a Transformer decoder with a detection head. The hybrid encoder decouples global context modeling from local feature fusion, reducing computational overhead while preserving rich spatial details. The decoder iteratively refines initial queries via multi-layer Transformer blocks and outputs category, confidence, and bounding box predictions using bipartite graph matching, eliminating the need for NMS during inference.

FDSE-DETR is organized around a single requirement: the system must preserve weak but safety-critical cues from small and irregular hazards in cluttered conveyor scenes, while remaining deployable on resource-constrained devices. To meet this requirement, this study adjusts feature extraction, feature fusion, and the training objective in a coordinated manner so that each part supports the same end goal.

Specifically, the mining conveyor environment presents four challenges that existing detectors fail to handle adequately.

One such challenge is poor visibility due to low light and high dust, which causes existing models to lose fine texture details. To address this, the backbone uses partial convolution (PConv) in FasterNet-style Faster Blocks (stages 2–4). PConv reduces redundant computation while preserving high-frequency details, enabling real-time inference on embedded devices under degraded illumination, a need that standard convolutions cannot satisfy without excessive cost.

Another issue is that hazards often have irregular shapes, such as bent anchor rods or tangled nets, whereas existing detectors rely on fixed convolution kernels that cannot adapt to such geometries. Therefore, stage 5 adopts DCNv2-based deformable sampling, which dynamically adjusts sampling locations according to object shape, directly overcoming the rigidity of conventional kernels.

Furthermore, the small size of anomalies is critical: lightweight fusion modules in existing models tend to suppress weak cues from small objects. To remedy this, the neck replaces the original CCFM with a compact Slim Neck that employs GSConv, which preserves small anomaly signals during multi-scale fusion while maintaining low computational cost.

Lastly, severe class imbalance (maogan:bang ≈ 2.8:1) biases standard loss functions toward the majority class, leading to poor recall for the rare but safety-critical bang class. Existing losses such as Focal Loss adjust weights only based on the current prediction of each sample. However, in mining scenarios the difficulty of ambiguous cases, such as a partially occluded anchor rod, can change as training progresses. A purely instantaneous weight may fail to track this shift. To overcome this limitation, this study proposes EMASVFL, a dynamically weighted loss built on Varifocal Loss with an exponential moving average mechanism. EMASVFL maintains a global indicator of training difficulty using the batch-wise mean Intersection-over-Union (IoU), which is updated after each batch via an exponential moving average. This smoothed signal reduces sensitivity to transient noise and stabilizes convergence. Based on this global IoU mean, a sliding modulation weight is defined: it assigns higher gradients to samples that remain difficult yet informative, while down-weighting easy or already well-detected samples. As a result, EMASVFL behaves like a dynamic gain control mechanism. Early in training, the model learns easier cues; later, after global performance becomes more stable, it places more emphasis on hard anomalies such as occluded bolts or rod-like objects. This design effectively handles the 2.8:1 class imbalance that existing losses cannot properly address, improving the discriminability of anomaly responses and increasing robustness to environmental interference.

In summary, each component of FDSE-DETR is motivated by a specific mining challenge and directly remedies a clear shortcoming of off-the-shelf solutions.

Figure 3 shows the overall FDSE-DETR architecture, which includes a backbone, an encoder, and a decoder. After preprocessing, images are fed into the backbone to extract multi-scale features. The encoder and decoder follow the end-to-end RT-DETR design. IoU-aware query selection retains high-quality object queries, which are then refined through stacked Transformer decoder layers. The detection head outputs class probabilities and bounding boxes jointly. This end-to-end formulation avoids Non-Maximum Suppression and maintains reliable evaluation while limiting algorithmic complexity, which suits real-time deployment in resource-constrained mining environments.

3.1. Lightweight Feature Extraction via Faster Block

To support automated visual testing on resource-constrained embedded instruments such as intrinsically safe cameras in mines, the network must keep computation low while maintaining real-time throughput. In this framework, this study adopts the Faster Block module [18] in the backbone to reduce the cost of spatial feature processing.

The key idea in the Faster Block is partial convolution (PConv), which improves the efficiency of spatial feature extraction. Standard convolution applies spatial filtering across all channels, which can waste computation on repetitive background patterns. In contrast, PConv applies spatial convolution to only a subset of channels and leaves the remaining channels unchanged. This is well matched to conveyor belt monitoring because large areas of the image are dominated by relatively uniform belt surfaces, while the foreign-object regions require richer spatial modeling.

In terms of computation, PConv substantially reduces complexity. As shown in Figure 4, PConv splits the input feature map into two branches. One branch is a spatial processing path that applies a standard 3 × 3 convolution to a subset of channels, denoted as $c_{p}$, to extract anomaly-related spatial features. The other branch is an identity path that passes the remaining channels through without spatial convolution so that the original signal content is preserved. The complexity is defined by Equation (1):

$$h \times w \times k^{2} \times c_{p}^{2} \tag{1}$$

where $h \times w$ denotes the spatial size and $k$ denotes the kernel size. With a typical ratio $r = \frac{c_{p}}{c} = \frac{1}{4}$, only 25% of the channels receive the more expensive spatial convolution. As a result, FLOPs are reduced to about 1/16 of standard convolution under the same channel setting.
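The 1/16 figure can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names and the feature-map size and channel count are hypothetical, and the dense-convolution cost is taken as h × w × k² × c² (equal input and output channels).

```python
def conv_flops(h, w, k, c_in, c_out):
    """Multiply-accumulate count of a dense k x k convolution (stride 1, same padding)."""
    return h * w * k * k * c_in * c_out


def pconv_flops(h, w, k, c, ratio=0.25):
    """Cost per Equation (1): only c_p = ratio * c channels are convolved."""
    c_p = int(c * ratio)
    return h * w * k * k * c_p * c_p


# Hypothetical feature-map size and channel count, chosen only for illustration.
h, w, k, c = 80, 80, 3, 128
full = conv_flops(h, w, k, c, c)
partial = pconv_flops(h, w, k, c)
print(partial / full)  # 0.0625, i.e. 1/16
```

The ratio depends only on r, since the cost scales with the square of the number of convolved channels.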

For in situ measurement instrumentation, PConv also reduces memory access by avoiding redundant read and write operations on the unchanged channels. This is reflected by the memory access approximation in Equation (2):

$$h \times w \times 2 c_{p} + k^{2} \times c_{p}^{2} \approx h \times w \times 2 c_{p} \tag{2}$$

This reduction lowers memory bandwidth demand to roughly one-quarter of the standard case, which helps relieve a common bottleneck on edge hardware. Subsequent 1 × 1 convolutions provide cross-channel communication, fusing the anomaly features extracted by the spatial path with the context preserved by the identity path. Overall, the Faster Block provides a practical trade-off between efficiency and anomaly representation, which supports deployment on lightweight mining monitoring nodes.
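To make the two-branch behavior concrete, the following is a minimal pure-Python sketch of a partial convolution forward pass, assuming a 3 × 3 kernel with zero padding. The helper names `conv3x3_same` and `pconv`, the toy input, and the summing kernel are illustrative assumptions rather than the authors' implementation, and the subsequent 1 × 1 convolutions are omitted.

```python
def conv3x3_same(channels, kernels):
    """Dense 3x3 convolution with zero padding.

    channels: list of feature maps, each a list of rows (h x w).
    kernels:  kernels[o][i] is the 3x3 kernel from input channel i to output channel o.
    """
    h, w = len(channels[0]), len(channels[0][0])
    out = []
    for per_out in kernels:
        fmap = [[0.0] * w for _ in range(h)]
        for ch, ker in zip(channels, per_out):
            for y in range(h):
                for x in range(w):
                    acc = 0.0
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            yy, xx = y + dy, x + dx
                            if 0 <= yy < h and 0 <= xx < w:
                                acc += ker[dy + 1][dx + 1] * ch[yy][xx]
                    fmap[y][x] += acc
        out.append(fmap)
    return out


def pconv(features, kernels, c_p):
    """Partial convolution: convolve the first c_p channels, pass the rest through unchanged."""
    spatial = conv3x3_same(features[:c_p], kernels)  # spatial processing path
    identity = features[c_p:]                        # identity path, no computation
    return spatial + identity


# Toy input: 4 channels of size 3x3, so c_p = 1 matches the ratio r = 1/4.
features = [[[float(c + 1)] * 3 for _ in range(3)] for c in range(4)]
box = [[1.0] * 3 for _ in range(3)]  # hypothetical 3x3 summing kernel
out = pconv(features, [[box]], c_p=1)
print(out[0])                    # only this channel was filtered
print(out[1:] == features[1:])   # True: remaining channels are untouched
```

Because the identity channels are never read or rewritten by the spatial filter, both the arithmetic and the memory traffic are confined to the first `c_p` channels.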

3.2. Adaptive Anomaly Characterization via DCNv2

Using Faster Blocks in shallow stages reduces computation, but this alone does not fully capture the diverse shapes of hazardous objects. To improve geometric modeling where high-level semantics are formed, this study places DCNv2 in the fifth stage of the backbone. Compared with standard convolution, which samples features on a fixed grid, DCNv2 learns sampling locations conditioned on the target structure. This makes the feature extractor more tolerant to non-rigid shapes and scale changes, which is important for irregular debris on mining conveyor belts under cluttered backgrounds.

A standard convolution with a 3 × 3 kernel can be viewed as a rigid template sliding over the feature map. For each output position $p_{0}$, it samples from a predefined regular grid. This fixed sampling pattern becomes less effective when targets exhibit strong geometric deformation, such as bent anchor rods or twisted metal mesh, because informative regions may fall outside the rigid grid while background regions are unnecessarily included.

DCN relaxes the fixed grid by introducing learnable offsets $\Delta p$ so that sampling points can move to better match the target topology [33]. As illustrated in Figure 5, deformable sampling provides position awareness and improves sensitivity to curvilinear object structures.

On 2D images, DCNv2 extends deformable convolution by learning both an offset vector $\Delta p_{n}$ and a modulation scalar $\Delta m_{n}$. The resulting feature computation is given in Equation (3):

$$y(p_{0}) = \sum_{n} \omega(p_{n}) \cdot x(p_{0} + p_{n} + \Delta p_{n}) \cdot \Delta m_{n} + b(p_{n}) \tag{3}$$

Here, $\Delta m_{n}$ acts as a learnable weight that controls the contribution of each sampling point. In our setting, this mechanism helps downweight samples that drift into non-target regions such as ore background textures or specular glare on the belt surface. As a result, downstream layers receive cleaner signals, and localization becomes more reliable in complex environments. Compared with earlier deformable variants, DCNv2 reduces the chance that sampling points fall on irrelevant background regions, which improves robustness under clutter.
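The sampling rule in Equation (3) can be sketched for a single output position as follows. This is a minimal illustration under stated assumptions: a single-channel feature map, a 3 × 3 kernel, offsets and modulation scalars supplied directly (in a DCNv2 layer they are predicted by an auxiliary convolution), and the bias treated as one scalar. The names `bilinear` and `dcnv2_point` and the toy feature map are hypothetical.

```python
import math


def bilinear(fmap, y, x):
    """Bilinearly sample a 2D map at fractional (y, x); points outside the map read as 0."""
    h, w = len(fmap), len(fmap[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
                value += weight * fmap[yy][xx]
    return value


# Regular 3x3 grid p_n, stored as (dy, dx).
GRID = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def dcnv2_point(fmap, p0, weights, offsets, masks, bias=0.0):
    """Output at p0: sum over n of w_n * x(p0 + p_n + dp_n) * dm_n, plus a scalar bias."""
    total = bias
    for (dy, dx), w_n, (oy, ox), m_n in zip(GRID, weights, offsets, masks):
        total += w_n * bilinear(fmap, p0[0] + dy + oy, p0[1] + dx + ox) * m_n
    return total


# Toy 5x5 map whose value is 5*y + x, so bilinear interpolation is exact.
fmap = [[float(5 * y + x) for x in range(5)] for y in range(5)]
ones = [1.0] * 9

plain = dcnv2_point(fmap, (2, 2), ones, [(0.0, 0.0)] * 9, ones)    # rigid grid
shifted = dcnv2_point(fmap, (2, 2), ones, [(0.5, 0.5)] * 9, ones)  # all points moved
muted = dcnv2_point(fmap, (2, 2), ones, [(0.5, 0.5)] * 9, [0.0] * 9)
print(plain, shifted, muted)  # 108.0 135.0 0.0
```

With zero offsets and unit modulation the result matches an ordinary 3 × 3 convolution, while a zero modulation scalar removes a sampling point's contribution entirely, which is the downweighting behavior described above.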

Figure 6 further illustrates why DCNv2 is beneficial for anomaly characterization. With objects such as anchor rods, fixed receptive fields often capture the target incompletely while admitting excessive background noise. DCNv2 mitigates this by aligning sampling locations with the object geometry rather than a predefined rectangular neighborhood. Adding DCNv2 in the deeper stage strengthens the network’s ability to represent the geometric deformations that are common in mining debris. This improves the signal-to-noise ratio of the feature map without additional computational overhead, and it helps preserve recognition accuracy after the model is slimmed down for embedded instruments.

3.3. Optimized Feature Aggregation via Slim Neck Architecture

In RT-DETR-R18, the hybrid encoder uses CCFM to support multi-scale feature interaction. While this design enables cross-scale communication, its reliance on dense standard convolution can create high memory bandwidth pressure on embedded measurement instruments, which makes low-latency deployment difficult. In addition, fixed-grid receptive fields in the fusion stage can be less effective for elongated targets and very small anomalies, where subtle cues are easily overwhelmed by background clutter. To better match these constraints, this study introduces a Slim Neck, a lightweight fusion module designed to reduce bandwidth demand while preserving the features needed for reliable anomaly discrimination.

As shown in Figure 7, Slim Neck applies GSConv to process high-level semantic features. GSConv reduces the cost of feature mixing by producing a compact intermediate representation and then reconstructing the full channel response through inexpensive operations. This lowers computing and memory traffic on edge devices. Slim Neck also keeps the AIFI module. As a Transformer-based component, AIFI captures global dependencies at the coarsest scale, which helps the system maintain structural context of the conveyor scene and reduces confusion between true hazards and random environmental patterns.

Along the fusion path, Slim Neck uses GSConv in lateral connections to merge upsampled semantic features with higher resolution texture details. The fused features are then processed by VoVGSCSP [20] to strengthen feature interaction with limited overhead. VoVGSCSP follows the cross-stage partial principle [34] and the one-shot aggregation design of VoVNet [35], which helps maintain gradient flow while keeping the fusion stage lightweight. Overall, this redesign replaces the more expensive fusion unit with a streamlined alternative that better fits the real-time and energy constraints of intrinsically safe monitoring equipment in mines.

To quantify the efficiency gain, this study compares GSConv with standard convolution. Let $W$ and $H$ be the spatial size, $K_{1} \times K_{2}$ be the kernel size, and $C_{1}$ and $C_{2}$ be the input and output channels. As shown in Equation (4), the computational cost ratio between GSConv and standard convolution is approximately the following:

$$\mathrm{ratio} = \frac{\frac{1}{2} \times W \times H \times K_{1} \times K_{2} \times C_{2} \times (C_{1} + 1)}{W \times H \times C_{1} \times K_{1} \times K_{2} \times C_{2}} = \frac{C_{1} + 1}{2 C_{1}} \approx \frac{1}{2} \tag{4}$$

This indicates that GSConv can reduce the convolution cost by about half under the same spatial and channel settings, while retaining the discriminative capacity needed for hazard identification. Part of this benefit comes from channel shuffle, which improves information exchange across channel groups and supports effective feature mixing at low cost.
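As a quick numerical check of Equation (4), the ratio can be evaluated for a few channel widths. The function name and the channel counts below are hypothetical and serve only to show how quickly the ratio approaches one-half.

```python
def gsconv_cost_ratio(c1):
    """Cost ratio of GSConv to standard convolution per Equation (4): (C1 + 1) / (2 * C1)."""
    return (c1 + 1) / (2 * c1)


# Hypothetical input channel counts; spatial size, kernel size, and C2 cancel out.
for c1 in (16, 64, 256):
    print(c1, gsconv_cost_ratio(c1))
```

The residual above 1/2 is 1/(2·C1), so it is already below 1% for 64 input channels.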

As illustrated in Figure 8, VoVGSCSP follows a split–transform–merge structure. The input is separated into two paths to reduce redundant computation and to support stable gradient propagation during training. In the main transform path, this study uses GSConv rather than standard convolution to keep the module efficient for edge deployment. Multiple GSConv layers form a deeper extraction path, denoted as GSbottleneck, which helps the network model richer anomaly patterns while keeping inference latency within the limits of real-time monitoring.

3.4. Dynamic Sensitivity Calibration via EMASVFL Loss

In conveyor belt monitoring, a key difficulty is extreme class imbalance. Most pixels belong to normal belt surfaces or ore, while hazardous foreign objects are rare. RT-DETR uses Focal Loss to reduce the influence of easy negatives, but its weighting is driven only by the current prediction. As training progresses, the difficulty of ambiguous cases can change, for example partially occluded anchor rods, and a purely instantaneous weight may not track this shift. To improve robustness under noise-dominant conditions, this study introduces EMASVFL, a dynamic objective that adjusts emphasis using a smoothed history of training quality.

Focal Loss applies a modulating factor $(1 - p_{t})^{\gamma}$ to focus optimization on hard samples, as defined in Equation (5):

$$L_{FL}(p_{t}) = -\alpha_{t} (1 - p_{t})^{\gamma} \log(p_{t}) \tag{5}$$

where $p_{t}$ is the predicted probability. This strategy is effective in general detection, but it can be less suitable for measurement settings where ambiguity changes across training stages. For example, separating a black rubber strip from a black rubber belt may require gradually tightening decision boundaries as feature representations improve.
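A short numerical sketch of Equation (5) shows how the modulating factor depends only on the current prediction. The values α_t = 0.25 and γ = 2 are the commonly used Focal Loss defaults and are assumed here; they are not reported in this section. The function name and the sample probabilities are illustrative.

```python
import math


def focal_loss(p_t, alpha_t=0.25, gamma=2.0):
    """Focal Loss per Equation (5) for one sample; alpha_t and gamma are assumed defaults."""
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(0.95)  # confidently correct prediction
hard = focal_loss(0.30)  # ambiguous prediction
print(easy, hard, hard / easy)
```

The weight attached to a sample is fixed by its instantaneous `p_t` alone, with no memory of how the overall training quality has evolved.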

EMASVFL uses the batch Intersection-over-Union (IoU) mean as a proxy for detection quality [22] and maintains an EMA-based global indicator of training difficulty, denoted as $\mu_{iou}$. This smoothed signal reduces sensitivity to transient noise and supports stable convergence. The global IoU mean is updated after each batch according to Equation (6):

$$\mu_{iou}^{(t)} = d^{(t)} \cdot \mu_{iou}^{(t-1)} + (1 - d^{(t)}) \cdot \overline{I}_{iou}^{(t)} \tag{6}$$

Here, the attenuation factor $d^{(t)}$ adjusts the balance between past and recent states. Based on $\mu_{iou}$, this study defines a sliding modulation weight $w(I_{iou})$ that separates confident detections from ambiguous ones. Using thresholds relative to the global mean, $T_{1} = \mu_{iou} - 0.1$ and $T_{2} = \mu_{iou}$, the loss assigns larger gradients to samples that remain difficult but informative, as shown in Equation (7):

$$w(I_{iou}) = \begin{cases} 1.0, & \text{if } I_{iou} \leq T_{1} \\ e^{(1 - \mu_{iou})}, & \text{if } T_{1} < I_{iou} < T_{2} \\ e^{-(I_{iou} - 1)}, & \text{if } I_{iou} \geq T_{2} \end{cases} \tag{7}$$

With this weighting, the EMASVFL loss is defined as Equation (8):

$$L_{EMASVFL}(p_{t}, q) = L_{VFL}(p_{t}, q) \cdot w(I_{iou}) \tag{8}$$

where $L_{VFL}$ is the Varifocal Loss, given in Equation (9):

$$L_{VFL}(p_{t}, q) = \begin{cases} -q \left( q \log(p_{t}) + (1 - q) \log(1 - p_{t}) \right), & q > 0 \\ -\alpha_{t} \, p_{t}^{\gamma} \log(1 - p_{t}), & q = 0 \end{cases} \tag{9}$$

Here, $q$ denotes the target value: for a positive sample, $q = \mathrm{IoU} \in (0, 1]$; for a negative sample, $q = 0$.

By multiplying Varifocal Loss with the EMA-based weight $w(I_{iou})$, EMASVFL behaves like a dynamic gain control mechanism. In mining scenarios, this allows training to emphasize different cases over time. Early in training, the model can learn easier cues, and later it can place more weight on hard anomalies such as occluded anchor bolts after global performance becomes more stable. This improves the signal-to-noise ratio of anomaly responses and increases robustness to environmental interference.
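Equations (6) to (9) can be combined into a small per-sample sketch. Several details are assumptions rather than values stated in this section: the attenuation factor is treated as a constant `decay` (the text defines a step-dependent $d^{(t)}$ without giving its form), `alpha = 0.75` and `gamma = 2.0` are the usual Varifocal Loss defaults, `eps` only guards the logarithm, and all function names are hypothetical.

```python
import math


def ema_update(mu_prev, batch_iou_mean, decay):
    """Equation (6): blend the previous global IoU mean with the current batch mean."""
    return decay * mu_prev + (1.0 - decay) * batch_iou_mean


def slide_weight(iou, mu):
    """Equation (7): sliding modulation weight with T1 = mu - 0.1 and T2 = mu."""
    t1, t2 = mu - 0.1, mu
    if iou <= t1:
        return 1.0
    if iou < t2:
        return math.exp(1.0 - mu)
    return math.exp(-(iou - 1.0))


def varifocal_loss(p, q, alpha=0.75, gamma=2.0, eps=1e-12):
    """Equation (9); eps only guards log(0) and is not part of the equation."""
    if q > 0:  # positive sample, q is its IoU target
        return -q * (q * math.log(p + eps) + (1.0 - q) * math.log(1.0 - p + eps))
    return -alpha * p ** gamma * math.log(1.0 - p + eps)  # negative sample


def emasvfl(p, q, iou, mu):
    """Equation (8): Varifocal Loss scaled by the sliding weight."""
    return varifocal_loss(p, q) * slide_weight(iou, mu)


# Hypothetical previous mean, batch mean, and decay.
mu = ema_update(0.50, 0.70, decay=0.9)
for iou in (0.30, 0.50, 0.90):  # below T1, between T1 and T2, above T2
    print(iou, slide_weight(iou, mu), emasvfl(0.6, iou, iou, mu))
```

Samples whose IoU sits just below the running mean receive the largest weight, and that band moves upward as `mu` rises during training, which is the sliding behavior the section describes.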

3.5. Low-Light Image Enhancement Method

Underground mining scenes are often captured under very low illumination, which leads to low contrast, blurred anomaly details, and spatially uneven lighting. To improve the visibility of hazardous objects, this study uses an image enhancement pipeline that applies Contrast-Limited Adaptive Histogram Equalization (CLAHE) followed by Gamma correction. As shown in Figure 9, this pipeline increases local contrast and expands the useful intensity range while controlling noise amplification, which improves anomaly visibility in shadowed regions.

It is worth noting that underground mining environments also suffer from camera lens contamination caused by dust, mud, or water droplets. Such contamination can further degrade image quality and affect detection accuracy. Our current enhancement pipeline does not explicitly simulate or correct for lens dirt patterns. Addressing this practical issue will be part of our future work, for instance by incorporating lens-degradation augmentations into the training process to improve real-world robustness.

3.5.1. Localized Contrast Enhancement via CLAHE

Standard histogram equalization can over-amplify background noise, which may introduce artifacts that resemble anomalies. To reduce this risk, this study applies CLAHE to enhance contrast locally on the luminance channel. The input RGB image is converted to the LAB color space so that the luminance component L can be processed without altering chromatic information. CLAHE divides the image into M × N non-overlapping tiles and computes a histogram for each tile independently. To limit noise spikes in homogeneous regions such as smooth belt surfaces, a clip limit β is applied. The clipping operation for each tile histogram H(i) is defined in Equation (10):

H_{clipped}(i) = \begin{cases} β, & \text{if } H(i) > β \\ H(i), & \text{otherwise} \end{cases}

(10)

The clipped counts are then redistributed uniformly to obtain the modified histogram H'(i). Next, the cumulative distribution function CDF(i) and the gray-level mapping T(i) are computed to remap pixel intensities, as shown in Equations (11) and (12):

CDF(i) = \sum_{k=0}^{i} \frac{H'(k)}{N_{pixels}}

(11)

T(i) = \lfloor CDF(i) \, (L_{max} - 1) \rfloor

(12)

Finally, bilinear interpolation is used to reconstruct the enhanced luminance component L'. In our implementation, the tile size is set to 8 × 8, and the clip limit is set to β = 3.0. These settings enhance fine structures such as anchor rods and nets while keeping the overall scene appearance natural.
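The per-tile steps in Equations (10) to (12) can be sketched in plain Python for a single tile histogram. This is an illustrative sketch, not the implementation used in this study: the function names are hypothetical, the clip limit is treated as an absolute bin count as in Equation (10), the excess is redistributed in a single pass, and the bilinear blending between neighboring tiles is omitted.

```python
import math


def clip_and_redistribute(hist, beta):
    """Clip each bin at beta (Equation (10)) and spread the excess uniformly.

    A single pass is used here; some implementations repeat the step because
    the redistribution can push bins slightly above beta again.
    """
    clipped = [min(h, beta) for h in hist]
    excess = sum(hist) - sum(clipped)
    share = excess / len(hist)
    return [c + share for c in clipped]


def tile_mapping(hist, beta, l_max=256):
    """Gray-level mapping T(i) of Equations (11) and (12) for one non-empty tile."""
    h_mod = clip_and_redistribute(hist, beta)
    n_pixels = sum(h_mod)  # redistribution preserves the pixel count
    mapping, running = [], 0.0
    for h in h_mod:
        running += h
        cdf = running / n_pixels
        # The small offset guards the floor against floating-point round-off.
        mapping.append(math.floor(cdf * (l_max - 1) + 1e-9))
    return mapping


# Toy tile with four gray levels and a dominant dark bin.
print(tile_mapping([10, 2, 2, 2], beta=4, l_max=4))
```

In the toy example, clipping the dominant bin from 10 to 4 and spreading the excess keeps the mapping from being dictated by one over-represented gray level, which is the noise-limiting effect described above.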

3.5.2. Global Dynamic Range Expansion via Gamma

CLAHE enhances local texture, but pixels in extremely dark regions can still remain too weak for reliable detection. To compensate, this study applies nonlinear Gamma correction as a global brightness adjustment. This step expands the dynamic range of low-intensity pixels through a power-law transform, defined in Equation (13):

O(x, y) = 255 \cdot \left( \frac{I(x, y)}{255} \right)^{\frac{1}{γ}}

(13)

With γ = 1.2, the exponent 1/γ ≈ 0.83 is less than 1, so the curve is concave and lies above the identity line. It lifts low intensities and increases the spread of gray levels in dark areas. This improves the visibility of hazards in shadowed regions while avoiding over-brightening, and it provides a better-conditioned input for the downstream detection network.
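Equation (13) is commonly applied through a 256-entry lookup table. The following is a minimal sketch in plain Python, assuming 8-bit input and rounding to the nearest integer; the function names are hypothetical.

```python
def gamma_lut(gamma=1.2):
    """Lookup table for Equation (13): O = 255 * (I / 255) ** (1 / gamma)."""
    return [round(255.0 * (i / 255.0) ** (1.0 / gamma)) for i in range(256)]


def apply_gamma(pixels, gamma=1.2):
    """Map a flat sequence of 8-bit intensities through the lookup table."""
    lut = gamma_lut(gamma)
    return [lut[p] for p in pixels]


lut = gamma_lut(1.2)
print(lut[0], lut[64], lut[128], lut[255])
```

With γ = 1.2, the end points 0 and 255 are unchanged, while a dark value such as 64 is lifted to roughly 81, so the adjustment acts mainly on the low-intensity range.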

3.5.3. Evaluation and Analysis of Data Augmentation Effects

To assess the effect of the enhancement pipeline, this study reports quantitative results on the dataset using metrics related to signal quality and feature separability. As shown in Figure 10, the enhanced images exhibit improved visibility and stronger cues for automated evaluation.

The results indicate clear recovery of details in low-light regions. Under the shadow-lifting metric, the fraction of underexposed pixels with intensity below 50 decreases substantially in both the training and validation sets, including an approximately 50% reduction in the validation set. This suggests that the enhancement pipeline brings previously obscured regions into a usable intensity range. At the same time, the information content of the image increases. Information entropy rises by about 0.6 to 0.7 bits on average, which indicates a broader distribution of texture details and helps separate subtle anomaly shapes from background patterns. Contrast improvement shows a consistent increase in the gray-level standard deviation by about 5 to 9%, which strengthens grayscale separation between foreign objects and the ore-dust background. The histogram redistribution curves provide a complementary view of this change. The intensity distribution shifts from a narrow, dark-skewed peak to a wider and more even spread. This expansion of the effective dynamic range makes better use of the imaging bit depth and provides a stronger data basis for high-precision foreign-object characterization.
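The three image statistics discussed above can be sketched as follows. The exact definitions used for Figure 10 are not given in the text, so this sketch assumes the common ones: the fraction of pixels below a threshold of 50, the Shannon entropy of the gray-level histogram in bits, and the population standard deviation of gray levels. Function names are hypothetical.

```python
import math
from collections import Counter


def underexposed_fraction(pixels, threshold=50):
    """Share of pixels whose intensity lies below the threshold."""
    return sum(1 for p in pixels if p < threshold) / len(pixels)


def entropy_bits(pixels):
    """Shannon entropy of the gray-level distribution, in bits."""
    n = len(pixels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(pixels).values())


def gray_std(pixels):
    """Population standard deviation of the gray levels."""
    n = len(pixels)
    mean = sum(pixels) / n
    return math.sqrt(sum((p - mean) ** 2 for p in pixels) / n)


# Toy comparison between a dark patch and a brightened version of it.
before = [10, 10, 20, 40]
after = [30, 45, 60, 120]
print(underexposed_fraction(before), underexposed_fraction(after))
print(entropy_bits(before), entropy_bits(after))
print(gray_std(before), gray_std(after))
```

Computing the same statistics before and after enhancement on each image, then averaging over the split, gives the kind of set-level comparison reported in this subsection.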

Overall, the joint enhancement strategy improves both signal strength and structural clarity, which supports more reliable FDSE-DETR detection in complex mining environments.

4. Experimental Results and Discussion

4.1. Dataset Construction and Partitioning

As shown in Figure 11, the experimental data were built from a combined dataset designed to reflect conveyor belt operating conditions in metal mines. Figure 11a presents the CUMT-BelT dataset, which forms the main body of our dataset. Irrelevant images, such as those of gangue and coal, were screened out so that the remaining samples matched the target scenario. Figure 11b shows additional images collected under metal mine conveyor belt conditions, including simulated foreign objects specific to metal mines. The final dataset was annotated with LabelImg using bounding boxes, with particular attention to safety-critical categories in metal mines, including maogan (anchor rods and nets) with 1363 instances and bang (rod-like foreign objects) with 483 instances, resulting in an approximate class ratio of 2.8:1. The dataset was split into training (3036 images), validation (416 images), and testing (445 images) sets, which corresponds approximately to an 8:1:1 ratio. All images were resized to 640 × 640 pixels. Although the data cover a range of typical operating conditions, they remain limited in scene diversity, including mine sites, conveyor configurations, lighting levels, camera viewpoints, and dust or moisture conditions. Therefore, the reported performance may not fully generalize to all real-world mining environments. Future work will involve collecting more diverse data from multiple mines and applying domain adaptation or data augmentation techniques to further enhance cross-site robustness.

4.2. Experimental Platform and Parameter Configuration

All experiments in this study were conducted under a unified hardware and software environment to ensure the fairness and reproducibility of the results. The hardware platform configurations used in the experiments are presented in Table 1.

During the model training phase, this study sets a series of hyperparameters as shown in Table 2.

4.3. Evaluation Metrics

To assess whether the proposed framework is suitable for real-time mine safety monitoring, this study uses an evaluation protocol that covers detection reliability, runtime efficiency, and feasibility on embedded instruments.

For reliability, this study reports precision (P), recall (R), and mean average precision (mAP). In conveyor belt safety monitoring, recall is especially important because it is closely related to the probability of detecting hazards. Higher recall means the system is more likely to find true hazardous objects such as partially occluded bolts or rods, which helps reduce missed detections (false negatives) that can lead to belt damage and safety incidents. Precision reflects the false alarm tendency. Low precision means normal ore or belt textures are incorrectly flagged as foreign objects (false positives), which can trigger unnecessary shutdowns and reduce operational efficiency. mAP averages the per-class AP values; mAP@0.5 uses an IoU threshold of 0.5, while mAP@0.5:0.95 additionally averages over IoU thresholds from 0.5 to 0.95.

To evaluate runtime efficiency, the inference speed of each model is measured in frames per second (FPS). All FPS measurements are performed on the same hardware platform (NVIDIA RTX 2060 SUPER) using native PyTorch (1.13) inference with FP16 precision and a batch size of four. The reported FPS values reflect only the model inference time; preprocessing time for CLAHE and Gamma correction is excluded, as these steps are applied offline before feeding images into the network.
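A generic timing harness of the kind implied by this protocol is sketched below in plain Python. It is an assumption-laden illustration rather than the benchmark script used here: `infer` is a hypothetical callable standing in for the model's forward pass, the warm-up and iteration counts are arbitrary, and FPS is assumed to mean images per second, that is, batches per second multiplied by the batch size.

```python
import time


def measure_fps(infer, batch, batch_size, warmup=5, iters=20):
    """Return images per second for a callable that processes one batch.

    With GPU inference the device must be synchronized before each timer
    read (for example with torch.cuda.synchronize() in PyTorch); that call
    is left out so the sketch runs without a GPU.
    """
    for _ in range(warmup):  # exclude one-off start-up costs from the timing
        infer(batch)
    start = time.perf_counter()
    for _ in range(iters):
        infer(batch)
    elapsed = time.perf_counter() - start
    return iters * batch_size / elapsed


# Stand-in workload: a dummy "model" that takes about 1 ms per batch of four.
fps = measure_fps(lambda b: time.sleep(0.001), [0, 0, 0, 0], batch_size=4)
print(f"{fps:.1f} images per second")
```

Because preprocessing is excluded from the reported values, a harness like this would wrap only the network call, with CLAHE and Gamma correction applied to the batch beforehand.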

To evaluate embedded deployment feasibility, this study reports floating point operations (FLOPs) and parameter count (Params). These two metrics are the direct indicators of a lightweight model. FLOPs describe the computation required per inference and therefore constrain the achievable frame rate on resource-constrained edge nodes. Params reflect the memory footprint and indicate whether the model can fit within the storage limits of industrial smart cameras.

During evaluation, true positives (TPs) are foreign objects that are correctly detected. False positives (FPs) are normal materials incorrectly labeled as anomalies, which reduces the effective signal-to-noise ratio of the alarm output. False negatives (FNs) are hazardous objects that are missed and represent the most critical failure mode. The metrics are defined as follows:

P = \frac{TP}{TP + FP} \times 100\%

(14)

R = \frac{TP}{TP + FN} \times 100\%

(15)

AP = \int_{0}^{1} P(R) \, dR

(16)

mAP = \frac{1}{n} \sum_{j=1}^{n} AP_{j}

(17)

F1\text{-}Score = 2 \cdot \frac{P \cdot R}{P + R}

(18)
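Equations (14) to (18) translate directly into code. The sketch below is illustrative: the function names are hypothetical, n in Equation (17) is taken to be the number of classes, and AP is approximated with the trapezoidal rule over sampled points of the precision-recall curve, whereas common evaluation toolkits use an interpolated precision envelope.

```python
def precision(tp, fp):
    """Equation (14), in percent."""
    return 100.0 * tp / (tp + fp)


def recall(tp, fn):
    """Equation (15), in percent."""
    return 100.0 * tp / (tp + fn)


def f1_score(p, r):
    """Equation (18); p and r must use the same scale."""
    return 2.0 * p * r / (p + r)


def average_precision(recalls, precisions):
    """Trapezoidal approximation of Equation (16).

    recalls must be ascending in [0, 1]; precisions are the matching values in [0, 1].
    """
    ap = 0.0
    for i in range(1, len(recalls)):
        width = recalls[i] - recalls[i - 1]
        ap += width * (precisions[i] + precisions[i - 1]) / 2.0
    return ap


def mean_average_precision(aps):
    """Equation (17): mean of the per-class AP values."""
    return sum(aps) / len(aps)


p, r = precision(tp=80, fp=20), recall(tp=80, fn=10)
print(p, r, f1_score(p, r))
```

In the example, 80 true positives with 20 false alarms and 10 misses give a precision of 80% and a recall of about 88.9%, so the F1-score lies between the two.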

4.4. Neck Network Comparative Experiments

To examine the capacity of the RT-DETR neck and improve feature fusion efficiency, this study performed controlled comparisons of several neck designs. The backbone and detection head were kept unchanged from the baseline, and the following neck variants were evaluated under identical training and inference conditions: AIFI-MSMHSA, DBBC3, gConvC3, PSConv, MaNet, and Slim Neck.

To make the comparison informative for mining conveyor scenes, this study selects neck variants that reflect different design philosophies for efficient multi-scale fusion. The key question is which mechanism converts computation into more useful anomaly cues under the same backbone and detection head. This study therefore compares several design choices: attention-centered fusion for multi-scale contexts; branch-based aggregation that strengthens representation during training while remaining lightweight at inference; gated separable convolution for efficient selectivity; padding-based operators that bias features toward small objects; and self-attention-based coordination of local and global information. Concretely, the evaluated variants include M2SA with MSMHSA [36], DBBC3 with Diverse Branch Blocks [37], gConvC3 [38], PSConv [39], and MaNet [40]. Table 3 reports the quantitative results.

Figure 12 shows that designs dominated by heavier attention-style components, including MSMHSA, DBBC3, and MaNet, increase computational overhead but do not deliver proportional improvements in anomaly detection. This indicates that the additional capacity is not efficiently converted into more discriminative features for mining conveyor scenes. At the other end of the spectrum, gConvC3 has the lowest theoretical computational load, but its detection reliability drops sharply. The mAP decreases to 50.6%, and the frame rate decreases to 98.8 FPS. This behavior suggests that aggressive compression weakens feature expressiveness, and critical anomaly cues are lost before decoding.

Across Figure 12 and Figure 13, Slim Neck provides the best overall balance for real-time condition monitoring. It improves mAP to 74.3% compared with 70.8% for the baseline, while also reducing computational cost by 6.5% and parameters by 2.5%. Although its inference speed is slightly lower than that of the unoptimized baseline, the accuracy gains and the reductions in computation and memory footprint make Slim Neck a practical choice for continuous safety monitoring on resource-constrained nodes.

4.5. Backbone Network Comparative Experiments

To evaluate the feature extraction capability of different backbones, this study conducts comparative experiments under a controlled setting. The neck and detection head are kept unchanged, and only the backbone is varied. This study compares Faster, StarNet, FasterNet, DySnake, Faster-DCNv2, and Faster-Rep to study the trade-off between inference speed and detection accuracy. The results are summarized in Table 4.

For the backbone study in mining conveyor scenes, this study aims to cover a broad set of efficiency strategies rather than a single architectural style. The comparison asks how different backbone-level mechanisms use a fixed computing budget to preserve the cues that matter for irregular foreign objects, with the neck and detection head held constant. The evaluated strategies span lightweight backbone construction, selective spatial processing to avoid redundant work, high-throughput scaling choices, deformation-sensitive convolution for elongated or curved structures, and reparameterization techniques that shift complexity to training while keeping inference efficient. Accordingly, this study tests StarNet [41], FasterNet-derived Faster Block variants and FasterNet_t0 [18], a DySnakeConv-based variant [42], and a Faster-Rep configuration.

As shown in Figure 14, very lightweight backbones such as StarNet and FasterNet reduce computational complexity, but their mAP values drop below the baseline. This indicates that their capacity is not sufficient to capture subtle texture variations that are important for high-precision anomaly identification. Although such backbones may fit low-power sensors, the associated loss in accuracy increases the risk of missed detections in safety-critical monitoring. Faster-Rep and Faster improve processing speed, but they do not improve discrimination accuracy relative to the baseline. The DySnake configuration reaches an mAP of 71.2%, but its throughput is limited to 87.3 FPS, which does not match high-speed conveyor scenarios that require fast response.

Figure 15 shows that Faster-DCNv2 provides the most favorable balance for measurement instrumentation in our comparisons. It achieves a detection accuracy of 74.3%, which is a 3.5% improvement in reliability over the baseline. This gain is consistent with the ability of DCNv2 to adapt the receptive field to non-rigid geometric deformation in irregular foreign objects, such as bent rods, which supports more informative feature sampling for structurally complex hazards. At the same time, the computation cost decreases to 47.8G FLOPs, which corresponds to a 16% reduction, and the parameter count is also lower. With a processing speed of 112.7 FPS, the Faster-DCNv2 backbone supports real-time deployment while maintaining the requirements of industrial safety monitoring.

4.6. Ablation Experiment of the FDSE-DETR

To assess the contribution of each design choice, this study conducts ablation experiments. Reliable foreign-object monitoring in mining scenes depends on retaining fine texture cues under a tight computing budget, maintaining geometric sensitivity for irregular hazards, and keeping multi-scale fusion and supervision stable under clutter. Table 5 summarizes the settings, where “√” indicates that a component is enabled, and “×” indicates that it is removed. It should be noted that CLAHE and Gamma preprocessing were applied uniformly to all configurations in this ablation study; therefore, their individual contributions were not quantified separately.

The results highlight four consistent patterns.

(1) When enabling A, the backbone becomes more efficient, yet mAP@0.5 increases to 72.2%. This indicates that PConv allocates spatial computation to informative channels and preserves high-frequency anomaly textures while reducing redundant processing on repetitive background regions. The recall reaches 74.6%, and mAP@0.5:0.95 reaches 38.4%, showing that lightweight feature extraction does not compromise detection sensitivity. The improved recall directly reduces the probability of missed hazards, which is critical for preventing belt damage and safety incidents.

(2) Enabling B in deeper stages further increases mAP@0.5 to 73.4%, which supports the role of adaptive sampling in describing irregular geometry such as twisted anchor rods, and the accuracy gain outweighs the modest parameter increase. Recall improves to 75.3% and mAP@0.5:0.95 to 38.5%, confirming that deformable sampling benefits both localization and recall of irregular hazards. The higher recall for irregular objects means that fewer bent anchor rods or tangled nets escape detection, directly enhancing operational safety.

(3) Enabling C reduces computation by 6.5% while improving mAP@0.5 to 74.4%, suggesting that the original CCFM fusion stage includes redundant operations and that a more direct multi-scale fusion path can retain fine target cues at lower cost. Correspondingly, recall reaches 75.0%, and mAP@0.5:0.95 is 38.3%, indicating stable multi-scale fusion without sacrificing recall or fine-grained localization. Maintaining recall under lightweight fusion ensures that small anomalies remain detectable, preventing false negatives that could otherwise lead to accumulated risks.

(4) Finally, enabling D improves inference stability and directly targets the class imbalance. Unlike standard losses that rely only on instantaneous predictions, EMASVFL introduces an EMA of batch-wise IoU to track global training difficulty. It then applies a sliding modulation weight that assigns higher gradients to ambiguous hard samples while down-weighting easy ones. This dynamic gain control mechanism stabilizes gradient updates and gradually emphasizes rare but safety-critical objects as training progresses. The loss adjustment yields a recall of 75.9% and mAP@0.5:0.95 of 38.9%, demonstrating that EMASVFL effectively handles class imbalance and hard samples while improving overall detection confidence. Specifically, the recall gain from EMASVFL is particularly valuable because it improves detection of the minority bang class without increasing false alarms on the majority maogan class, directly addressing the concern about class imbalance.

When A and B are combined (A + B), the mAP@0.5 reaches 74.3%, which is 2.1 percentage points higher than using A alone and 0.9 points higher than using B alone. This indicates that lightweight feature extraction (Faster Blocks) and deformable sampling (DCNv2) are complementary rather than redundant. A preserves high-frequency texture details at low computational cost, while B adaptively adjusts sampling locations to irregular geometries. Their combination allows the network to simultaneously capture fine texture and shape deformation, resulting in a performance gain that exceeds either individual contribution. The combined gain over the 70.8% baseline (3.5 points) is somewhat smaller than the sum of the individual gains (1.4 and 2.6 points), so the two components overlap in part while still operating on different aspects of feature representation.

Adding C to A + B (A + B + C) further increases mAP@0.5 to 74.8%, a gain of 0.5% over A + B. Although this gain is modest compared to the large jump from baseline to C alone, it must be interpreted in the context of bottleneck shifting. C already resolves the most severe limitation of the original design: inefficient multi-scale fusion and suppression of small anomaly signals. Once this major bottleneck is alleviated, the remaining headroom for A and B is naturally smaller. Nevertheless, the positive increment (0.5 points) confirms that A and B still provide useful features that C can further exploit, and there is no negative interaction.

Finally, incorporating D into A + B + C yields the full FDSE-DETR, which achieves the best overall performance: mAP@0.5 of 75.3%, mAP@0.5:0.95 of 40.3%, and recall of 79.1%, confirming that the full model not only improves standard accuracy but also significantly enhances the recall of rare safety-critical anomalies and maintains high localization quality across IoU thresholds. The 3.9% improvement in overall recall (from 75.2% to 79.1%) translates directly into a substantial reduction in missed hazards, which is the primary safety concern in mining conveyor operations. Compared with A + B + C, the full model improves recall by 1.8% and mAP@0.5:0.95 by 0.5%. This improvement is particularly significant because D operates on the loss function, directly affecting the optimization dynamics rather than the network architecture. While A, B, and C enhance representation capability, D changes how the model learns from imbalanced data. Consequently, D’s effect is largely orthogonal to the other components, leading to a clear boost in recall and localization quality without sacrificing precision. The fact that adding D after A + B + C still yields noticeable gains demonstrates that the architectural improvements have already provided a strong feature basis, and the loss reweighting further unleashes the model’s potential on hard and rare samples.

Figure 16 illustrates the mAP@0.5 progression from epoch 100 to epoch 200 during the ablation study, where the horizontal dashed line marks the 0.75 reference. Only the complete FDSE-DETR configuration reaches or exceeds this threshold, while all ablated variants fall below it. This further confirms that each added component contributes positively to detection accuracy and that the full model achieves a practically meaningful performance level. Under normal circumstances, the conveyor belt in an underground mine operates at a speed of 2.5 m/s. At this speed, an image processing rate of 120.7 FPS corresponds to one frame roughly every 8.3 ms (1/120.7 s). Within this short interval, the belt moves approximately 20.7 mm (2.5 m/s × 0.00829 s). Such a high frame rate ensures that even small hazards are sampled multiple times as they pass through the camera’s field of view, significantly reducing the risk of missed detection. This real-time capability is essential for reliable operation in high-speed mining conveyor belt environments.
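The relation between frame rate, belt speed, and belt travel per frame can be checked with a short calculation. In the sketch below, the belt speed and frame rate come from the text, while the field-of-view length of 1.0 m is a hypothetical value used only to illustrate how many frames cover a passing object.

```python
def frame_interval_ms(fps):
    """Time between consecutive frames, in milliseconds."""
    return 1000.0 / fps


def belt_travel_mm(belt_speed_m_s, fps):
    """Distance the belt moves during one frame interval, in millimetres."""
    return belt_speed_m_s * 1000.0 / fps


def frames_in_view(fov_length_m, belt_speed_m_s, fps):
    """Frames captured while a point on the belt crosses the field of view."""
    return fov_length_m / belt_speed_m_s * fps


print(round(frame_interval_ms(120.7), 2))      # interval per frame in ms
print(round(belt_travel_mm(2.5, 120.7), 1))    # belt travel per frame in mm
print(round(frames_in_view(1.0, 2.5, 120.7)))  # assumed 1.0 m field of view
```

Under the assumed 1.0 m field of view, a point on the belt stays visible for 0.4 s, which corresponds to about 48 frames at 120.7 FPS.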

Overall, the ablation results show that the proposed framework follows a coherent design logic rather than isolated tricks. Compared with the baseline, the final configuration reduces the parameter count by 5.56% and FLOPs by 22.5% and improves mAP@0.5 by 4.5 percentage points, which better matches the real-time and accuracy requirements of mining monitoring. Recall improves by 3.9 points and mAP@0.5:0.95 by 2.2 points, further demonstrating the effectiveness of the proposed components in addressing class imbalance and irregular hazard detection.

4.7. Comparison of Different Object Detection Models

To examine classification reliability, this study compares confusion matrices for the baseline and the enhanced system, as shown in Figure 17 and Figure 18. For each model, subfigure (a) corresponds to the maogan class, and subfigure (b) corresponds to the bang class. The matrices indicate that the enhanced model makes fewer confusions between hazardous categories and background-related classes.

FDSE-DETR exhibits stronger class discrimination than the baseline, particularly for safety-critical hazards. This study reports the F1-score (Equation (18)) as a summary metric that balances recall and precision. For the maogan class, FDSE-DETR reduces false positives by 115 and false negatives by 52 compared to RT-DETR, while increasing true positives by 52. Consequently, the F1-score improves from 75.3% to 83.3%. For the bang class, false positives drop from 58 to 24 (a 58.6% reduction), and the F1-score increases from 86.8% to 94.0%. These gains reflect fewer missed detections and fewer false alarms. In mining operations, reducing missed detections lowers the probability that hazardous objects pass the inspection point undetected, while reducing false alarms avoids unnecessary line stoppages and helps maintain production continuity.

Overall, the confusion matrix trends suggest that the proposed design improves separability between foreign object appearances and background textures. This leads to more stable predictions and supports autonomous deployment in industrial monitoring.

To place FDSE-DETR in the context of intelligent monitoring for mining conveyor scenes, this study compares it with fourteen representative detectors. For clarity, this study groups the baselines into three categories: CNN-based detectors (Faster R-CNN, RetinaNet, SSD, FCOS, EfficientNet, TIMM), lightweight real-time detectors (YOLOv5, YOLOv8, NanoDet, YOLO-NAS), and Transformer-based architectures (DETR, DAB-DETR, PVT, Swin Transformer). All methods are evaluated on the same composite mining dataset, and the results are reported in Table 6.

As summarized in the table, FDSE-DETR offers a strong balance between reliability, efficiency, and deployability. Compared with the two-stage Faster R-CNN baseline, FDSE-DETR reaches similar detection reliability while using 45.4% of the parameters and 48.5% of the FLOPs. This reduction supports deployment on embedded edge nodes without compromising monitoring quality. Compared with heavier Transformer models such as Swin Transformer and PVT, FDSE-DETR achieves better accuracy and higher efficiency, which suggests that a task-focused design is more suitable for this industrial setting than general-purpose large models. Methods such as SSD and FCOS offer lower computational costs but fall short in safety-critical accuracy. RetinaNet and EfficientNet achieve moderate accuracy with higher FLOPs, while TIMM shows even lower reliability. Similarly, lightweight detectors like NanoDet and YOLO-NAS are either less accurate or slower than FDSE-DETR. Specifically, NanoDet is much lighter but sacrifices both accuracy and speed; YOLO-NAS has a similar size to FDSE-DETR yet lags by 5.1% in mAP@0.5 and is 40 FPS slower. Relative to the YOLO family, FDSE-DETR improves mAP@0.5 by 3.3% over YOLOv5 and by 1.8% over YOLOv8, while maintaining the highest throughput at 120.7 FPS, providing a larger safety margin against missed detections. Although DETR has the benefit of end-to-end detection, it typically converges slowly and runs with higher latency. Compared with DAB-DETR, FDSE-DETR achieves higher accuracy with substantially fewer parameters, which is consistent with tailoring the model to measurement constraints.

As shown in Figure 19, FDSE-DETR demonstrates a well-balanced trade-off among key metrics, achieving 120.7 FPS with only 18.78M parameters while maintaining the highest detection accuracy. This real-time throughput matches the timing constraints of high-speed conveyor operation and helps ensure that hazardous foreign objects are detected early enough to trigger intervention before structural damage occurs. Overall, these results support FDSE-DETR as a deployable measurement solution for automated structural health monitoring, rather than purely algorithmic refinement.

4.8. Visualization and Analysis of Results

To understand why FDSE-DETR is more robust in mining scenes, this study uses GradCAM++ to visualize class-specific network responses. By backpropagating category gradients, this study generates activation heatmaps that show which regions contribute most to each prediction. This analysis targets difficult cases where hazards such as anchor rods and wooden sticks have weak contrast and are easily confused with conveyor belt textures.

Figure 20 compares the activation maps produced by RT-DETR and FDSE-DETR. RT-DETR shows diffuse responses that frequently spill into high-contrast background areas, such as glare at the edges of the conveyor belt. This reduces the effective signal-to-noise ratio and can prevent the model from concentrating on thin targets like anchor rods, which increases the chance of missed detections. For deformable objects such as metal mesh, the baseline responses also fail to follow the continuous contour, which suggests limited geometric characterization.

FDSE-DETR produces more concentrated responses on the hazardous objects and less activation on background regions. This trend is consistent with the use of adaptive sampling in DCNv2 and the history-aware weighting in EMASVFL, which together improve feature selectivity under clutter. The heatmaps align more closely with irregular object geometry, which supports more reliable localization and recognition, including under low illumination. Overall, the visualization results provide qualitative evidence that the proposed design improves extraction of foreign object cues from complex backgrounds, which is consistent with the accuracy gains reported earlier.

4.9. Edge Deployment Validation

While the preceding evaluations on a desktop RTX 2060 SUPER GPU establish the efficiency of FDSE-DETR, practical mining applications impose strict hardware limitations. Detection algorithms deployed in embedded monitoring nodes must perform reliably on severely resource-constrained devices. To verify deployability empirically, we benchmarked the model on a representative industrial edge platform, the NVIDIA Jetson Orin Nano. For a fair comparison, the edge setup used FP16 precision and a batch size of four, consistent with the desktop experiments. FDSE-DETR achieves a stable inference speed of 37.5 FPS on this embedded hardware, and the mAP@0.5 remains at 75.3%, with no accuracy degradation. Given that industrial real-time monitoring typically requires at least 25 FPS to ensure timely detection and response, 37.5 FPS exceeds this threshold by 50% and satisfies the real-time requirement for edge deployment. These hardware benchmarks confirm that the proposed framework is not only lightweight in terms of FLOPs and parameters but also practically viable for edge monitoring in underground mining environments.

5. Conclusions

This study presents a visual measurement framework for real-time, non-invasive anomaly detection on mining conveyor belts. FDSE-DETR is built around a single optimization objective: the detector must remain sensitive to small, irregular hazards under dust and illumination fluctuations, while operating within strict computational efficiency budgets. To meet this constraint, the framework integrates deformation-aware sampling with selective multi-scale fusion and a reformulated loss objective that emphasizes difficult samples as learning stabilizes. This end-to-end framework improves robustness to heterogeneous background noise and supports stable inspection in degraded visual conditions. The quantitative results support this design: FDSE-DETR improves mAP@0.5 by 4.5 percentage points and reaches 120.7 FPS, while reducing computational cost by 22.5% and the parameter count by 5.56% relative to the baseline. Crucially, hardware validation on an NVIDIA Jetson Orin Nano confirms its deployment feasibility. The model sustains real-time processing at 37.5 FPS on this resource-constrained edge device with zero accuracy degradation (75.3% mAP@0.5). Overall, the proposed framework provides a resource-efficient solution for structural health monitoring of continuous transport infrastructure in mining engineering.

Author Contributions

Conceptualization, P.P.; methodology, X.L.; software, X.L. and K.C.; validation, X.L. and S.G.; formal analysis, K.C.; investigation, H.Z. and S.G.; resources, X.L.; data curation, K.C.; writing—original draft preparation, X.L.; writing—review and editing, P.P.; visualization, K.C.; supervision, X.L. and P.P.; project administration, X.L.; funding acquisition, P.P. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China under Grant 52374168, the National Key Research and Development Program of China under Grants 2022YFC2904105 and 2023YFC2907403, the Science and Technology Innovation Program of Hunan Province under Grant 2023RC3069, and in part by the Fundamental Research Funds for the Central Universities of Central South University under Grant 2026ZZTS0215.

Data Availability Statement

The source code and datasets will be deposited in the project repository and made publicly available upon publication. The repositories can be accessed at https://github.com/lixuhe/FDSE-DETR-Mining-Detection (accessed on 5 May 2026) and https://github.com/lixuhe/foreign-data (accessed on 5 May 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Foreign objects scenario on the mining conveyor belt.

Figure 2. RT-DETR network structure diagram.

Figure 3. FDSE-DETR network structure diagram.

Figure 4. Partial convolution and Faster Block structure diagram.

Figure 5. Illustration of the sampling locations in (a) standard and (b) deformable convolutions.

Figure 6. Illustration of the fixed receptive field in standard convolution (a) and the adaptive receptive field in deformable convolution (b).

Figure 7. The structure of the GSConv module.

Figure 8. The structures of the (a) GSbottleneck module and the (b) VoVGSCSP module.

Figure 9. CLAHE with Gamma correction enhancement mechanism diagram.

Figure 10. Data analysis chart of the dataset before and after enhancement.

Figure 11. Visualized images of the consolidated dataset: (a) foreign objects and normal conveyor belt in CUMT-BelT; (b) foreign objects and normal conveyor belt in the self-built dataset.

Figure 12. Comparison bar chart of different neck network modules.

Figure 13. Comparison of mAP@0.5 for different neck network modules.

Figure 14. Comparison bar chart of different backbone network modules.

Figure 15. Comparison of mAP@0.5 for different backbone network modules.

Figure 16. Partial comparison of mAP@0.5 for different improvement modules.

Figure 17. Confusion matrix of FDSE-DETR: (a) represents the confusion matrix for the maogan class, while (b) corresponds to the bang class.

Figure 18. Confusion matrix of RT-DETR: (a) represents the confusion matrix for the maogan class, while (b) corresponds to the bang class.

Figure 19. Radar chart for comparison of different models.

Figure 20. Comparison of heatmaps between the FDSE-DETR and RT-DETR: (a) maogan; (b) bang; (c) maogan; (d) maogan; (e) bang.

Table 1. Hardware environment and software configuration.

Configuration Items	Parameters
CPU	Intel(R) Core(TM) i7-10700K CPU @ 3.80 GHz (Intel Corporation, Santa Clara, CA, USA)
GPU	NVIDIA GeForce RTX 2060 SUPER (NVIDIA Corporation, Santa Clara, CA, USA)
RAM	32.0 GB
Operating System	Windows 11
CUDA Version	11.7
PyTorch	1.13
Python	3.9.21

Table 2. Training hyperparameters.

Parameters	Setup
Epoch	200
Batch size	8
Image size	640 × 640
Workers	0
Optimizer	AdamW
Weight decay	0.0001
Conf	0.25
lrf	0.1
lr0	0.0001
IoU	0.7
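
As a reading aid, the settings in Table 2 can be collected into a single configuration object. The sketch below is illustrative only: the field names mirror the table, while the `TrainConfig` class and the reading of `lrf` as a multiplier on `lr0` (the convention in Ultralytics-style trainers, which use the same parameter names) are assumptions, since the table does not name the training toolkit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the Table 2 settings (names are illustrative)."""

    epochs: int = 200
    batch_size: int = 8
    image_size: tuple = (640, 640)
    workers: int = 0
    optimizer: str = "AdamW"
    weight_decay: float = 1e-4
    conf: float = 0.25  # confidence threshold, as listed in Table 2
    iou: float = 0.7    # IoU threshold, as listed in Table 2
    lr0: float = 1e-4   # initial learning rate
    lrf: float = 0.1    # assumed: final LR expressed as a fraction of lr0

    @property
    def final_lr(self) -> float:
        # Assumption: the schedule decays from lr0 down to lr0 * lrf.
        return self.lr0 * self.lrf


cfg = TrainConfig()
print(f"initial LR = {cfg.lr0:g}, assumed final LR = {cfg.final_lr:g}")
```

Under that assumption, the learning rate would fall by one order of magnitude over the 200 epochs. If the toolkit interprets `lrf` differently, only the `final_lr` property needs to change.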

Table 3. Comparison of different neck network modules.

Module	Param/M	FLOPs/G	FPS	mAP@0.5/%
AIFI-MSMHSA	19.9	57.1	106.4	73.0
DBBC3	22.2	68.2	106.1	70.3
gConvC3	18.8	51.8	98.8	50.6
PSConv	19.6	57.8	107.1	64.5
MAN	21.9	58.1	96.2	70.3
Slim Neck	19.3	53.2	108.5	74.3
Baseline	19.87	56.9	117.9	70.8

Table 4. Comparison of different backbone network modules.

Module	Param/M	FLOPs/G	FPS	mAP@0.5/%
StarNet	11.9	31.8	96.5	71.5
Faster	16.78	49.5	116.6	72.2
DySnake	27.8	60.8	87.3	71.2
Faster-DCNv2	19.36	47.8	112.7	74.3
Faster-Rep	16.7	49.5	104.0	71.4
FasterNet	10.8	28.5	126.3	70.7
Baseline	19.87	56.9	117.9	70.8

Table 5. Ablation experiment.

Model	Faster	DCNv2	Slim Neck	EMASVFL	Param/M	FLOPs/G	FPS	mAP@0.5/%	mAP@0.5:0.95/%	Recall/%
Baseline	×	×	×	×	19.87	56.9	117.9	70.8 (±0.68)	38.1 (±0.57)	74.2
A	√	×	×	×	16.78	49.5	116.6	72.2 (±0.22)	38.4 (±0.16)	74.6
B	×	√	×	×	20.12	53.4	120.5	73.4 (±0.20)	38.5 (±0.36)	75.3
C	×	×	√	×	19.3	53.2	108.5	74.4 (±0.54)	38.3 (±0.24)	75.0
D	×	×	×	√	19.87	56.9	120.9	72.6 (±0.34)	38.9 (±0.27)	75.9
A + B	√	√	×	×	19.36	47.8	112.7	74.3 (±0.42)	38.7 (±0.33)	74.5
A + B + C	√	√	√	×	18.78	44.1	98.7	74.8 (±0.58)	39.8 (±0.18)	77.3
FDSE-DETR	√	√	√	√	18.78	44.1	120.7	75.3 (±0.38)	40.3 (±0.49)	79.1
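
The headline figures can be cross-checked against the Baseline and final rows of Table 5. The short script below is a sketch of that arithmetic, with illustrative helper names. It separates an absolute gain in percentage points (used for mAP) from a relative reduction (used for FLOPs and parameters). Parameter count is only a proxy for on-disk model size, so the parameter reduction computed here need not equal a reported storage saving.

```python
# Values copied from the Baseline and final (full model) rows of Table 5.
baseline = {"params_m": 19.87, "flops_g": 56.9, "map50": 70.8}
fdse_detr = {"params_m": 18.78, "flops_g": 44.1, "map50": 75.3}


def point_gain(old: float, new: float) -> float:
    """Absolute change, in percentage points, between two percentages."""
    return new - old


def relative_reduction(old: float, new: float) -> float:
    """Relative decrease from old to new, as a percentage of old."""
    return (old - new) / old * 100.0


map_gain = point_gain(baseline["map50"], fdse_detr["map50"])
flops_cut = relative_reduction(baseline["flops_g"], fdse_detr["flops_g"])
params_cut = relative_reduction(baseline["params_m"], fdse_detr["params_m"])

print(f"mAP@0.5 gain:     {map_gain:.1f} points")
print(f"FLOPs reduction:  {flops_cut:.1f} %")
print(f"Params reduction: {params_cut:.2f} %")
```

With these table values, the mAP@0.5 gain and the FLOPs reduction round to 4.5 points and 22.5%, in line with the figures quoted in the text.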

Table 6. Comparison results of different models.

Model	Param/M	FLOPs/G	FPS	mAP@0.5/%	mAP@0.5:0.95/%
Faster R-CNN	41.35	90.9	108.6	74.7 (±0.30)	36.9 (±0.38)
RetinaNet	19.79	96.36	87.3	74.4 (±0.25)	39.1 (±0.58)
SSD	23.88	30.47	110.8	73.1 (±0.25)	37.7 (±0.59)
DETR	28.83	31.1	99.8	70.9 (±0.68)	37.8 (±0.65)
FCOS	19.1	60.7	102.1	73.7 (±0.61)	36.5 (±0.33)
YOLOv5-m	20.9	64.2	108.3	72.0 (±0.31)	38.8 (±0.6)
YOLOv8-m	30.06	78.9	118.2	73.5 (±0.15)	39.65 (±0.44)
PVT	21.35	67.04	118.5	72.7 (±0.53)	35.3 (±0.57)
Swin Transformer	36.84	84.86	110.5	72.9 (±0.76)	37.7 (±0.67)
TIMM	19.16	51.09	113.8	70.6 (±0.36)	37.3 (±0.12)
DAB-DETR	43.7	65.32	114.6	71.0 (±0.49)	37.9 (±0.47)
EfficientNet	18.86	85.14	91.2	70.5 (±0.76)	35.4 (±0.55)
NanoDet-m	1.17	1.44	84.13	72.1 (±0.40)	40.0 (±0.26)
YOLO-NAS	19.02	18.46	80.7	70.2 (±0.37)	38.4 (±0.24)
FDSE-DETR	18.78	44.1	120.7	75.3 (±0.38)	40.3 (±0.49)
