Article

Application of Wavelet Convolution and Scale-Based Dynamic Loss for Multi-Scale Damage Detection of Mining Conveyor Belt

by Fangwei Xie 1, Jianfei Wang 2, Sergey Alexandrovich Gordin 2, Aleksandr Nikolaevich Ermakov 2 and Kirill Aleksandrovich Varnavskiy 2,*
1 School of Mechatronic Engineering, China University of Mining and Technology, Xuzhou 221116, China
2 Mining Institute, T. F. Gorbachev Kuzbass State Technical University, Kemerovo 650000, Russia
* Author to whom correspondence should be addressed.
Submission received: 9 December 2025 / Revised: 14 January 2026 / Accepted: 26 January 2026 / Published: 30 January 2026

Abstract

Mining conveyor belts are critical components in bulk material transportation, but their operational safety is frequently threatened by diverse damages such as blocks, cracks, foreign objects, and holes. Existing detection methods, including traditional computer vision and convolutional neural networks, struggle to balance accuracy and efficiency in harsh mining environments—marked by high levels of dust, uneven lighting, and extreme scale variability (5–300 pixels). Our study proposes WTConv-YOLOv11, an improved model based on YOLOv11, integrating two core modules: (1) wavelet transform convolution (WTConv), which achieves a logarithmically expanding receptive field with linearly growing parameters, allowing for the concurrent capture of high-frequency local details and low-frequency global context; (2) Scale-based Dynamic Loss (SD Loss), which dynamically adjusts bounding box similarity and localization loss weights according to target scale, mitigating IoU fluctuation interference and enhancing small-target detection stability. Experiments on the Mining Industrial Conveyor Belt Dataset show that WTConv-YOLOv11 achieves a mean Average Precision (mAP@0.5) of 73.8%—a 3.5% improvement over the baseline YOLOv11. A Python-based software system is developed for end-to-end detection. This work provides a practical solution for reliable conveyor belt damage detection in mining scenarios.

Graphical Abstract

1. Introduction

1.1. Background

Conveyor belts [1,2] are critical to mining operations, enabling the continuous transportation of bulk materials such as coal and ores. Prolonged exposure to abrasion, impact, dust, and moisture inevitably leads to multi-scale damages (blocks, cracks, foreign objects, holes), which may cause production downtime, equipment failure, or safety accidents. Industry reports indicate that conveyor belt maintenance accounts for 30–50% of the total cost of conveyor systems [3], highlighting the urgency of developing efficient damage detection technologies.
The development of mining conveyor belt damage detection technology has undergone several stages. Early methods relied on traditional techniques such as mechanical scanning and X-ray inspection [4], which suffer from low efficiency and heavy reliance on manual analysis. In recent years, machine vision combined with deep learning has become the mainstream for object detection of conveyor belts. In terms of foreign object detection, Ling et al. [5] proposed a lightweight detection algorithm based on improved YOLOv8n, while Peng et al. [6] developed a camera-adaptive foreign object detection model for coal conveyor belts. For crack/tear detection, Huang et al. [7] constructed BeltCrack, the first sequential-image dataset for industrial conveyor belt crack detection, and Liu et al. [8] proposed YOLO-STOD, a tear detection model based on the YOLOv5 algorithm. For wear condition detection, Yang et al. [9] proposed a Retinex-YOLOv8-EfficientNet-NAM integrated framework for underground mine conveyor belts, which enhances environmental adaptability under harsh mining conditions. Wang et al. [10] developed a PLC-based laser scanning system for conveyor belt surface monitoring. For misalignment detection, Wang et al. [11] presented the lightweight GES-YOLO model, and Ni et al. [12] achieved high-precision misalignment detection by improving YOLOv8.

1.2. Limitations of Existing Methods

Traditional damage detection methods are inefficient and heavily dependent on manual analysis, making them unsuitable for real-time monitoring. Machine vision-based methods, which utilize image processing techniques like edge detection and thresholding, are easily disrupted by environmental noise and fail to generalize to diverse damage types.
In recent years, deep learning has stood out as a promising alternative. Conventional CNNs (e.g., Faster R-CNN, YOLOv11) [13,14] face two critical bottlenecks: (1) receptive field constraints—small-kernel convolutions fail to capture large-scale damages, while large-kernel convolutions (e.g., RepLK-31 [15]) cause quadratic parameter growth and overfitting [16]; (2) scale-insensitive loss functions—IoU, GIoU, and CIoU use fixed weights, leading to poor localization for small targets and unstable IoU for large targets under noise [17]. Lightweight models [18] prioritize speed but compromise accuracy for complex damages. Our study is motivated by these two limitations and aims to address them.
Recent advancements include wavelet convolutions (WTConv) for efficient receptive field expansion [19] and scale-aware losses for adaptive regression [20]. However, WTConv has primarily been applied to general computer vision tasks [21,22], with minimal adaptation to mining-specific noise (dust, shadows). Scale-aware losses [23] lack dynamic adaptation to feature map resolution and mining damage morphologies. Additionally, recent baselines (e.g., YOLOv8-LDH [24], EfficientDet [25]) for conveyor belt detection are underutilized in comparative studies.

1.3. Motivation and Contributions

This study adapts WTConv and SD Loss for mining conveyor belt damage detection, aiming to address the multi-scale and anti-interference requirements of safety monitoring. The key contributions are as follows:
  • A novel damage detection framework integrating large receptive field convolutions and a dynamically adjusted bounding box loss function to adapt to multi-scale damage in mining environments.
  • Quantitative validation of the proposed model’s enhanced accuracy and superior anti-disturbance capacity in comparison to existing detection models.
  • An integrated software visual display system is developed to enable end-to-end mining conveyor belt damage detection for practical industrial application.
The structure of the remaining sections in this paper is arranged as follows:
Section 2 provides an overview of the research advances in deep learning models, along with an analysis of their limitations when applied to damage detection for mining conveyor belts. Section 3 elaborates on the architecture of the lightweight neural network, its specific implementation approach, as well as the operating environment and relevant parameter configurations employed. Section 4 presents the experimental results, while Section 5 summarizes the key conclusions of this study.

2. Materials and Methods

2.1. Large Receptive Field Convolutions

Capturing multi-scale features is essential for detecting mining conveyor belt damages, driving research on large receptive field convolutions. Existing methods are categorized into three types:
Dilated convolutions [26]: These expand the receptive field by inserting zeros between kernel elements, avoiding parameter increases. However, they introduce “gridding artifacts”—discontinuities in feature maps that disrupt the detection of continuous damages (e.g., longitudinal cracks). This makes them ineffective for mining conveyors, where cracks often span the entire belt width.
Large-kernel convolutions: Models like RepLK-31 [15] use large kernels (e.g., 31 × 31) to directly expand the receptive field. While effective for large-scale features, parameters grow quadratically with kernel size (a 31 × 31 kernel has 961 parameters per channel versus 9 for a 3 × 3 kernel, roughly a 107× increase), leading to overfitting and high computational cost (≥10 GFLOPs).
Wavelet transform convolutions [19]: WTConv decomposes input features into multi-frequency sub-bands via Haar wavelet transform, achieving logarithmic receptive field expansion with linear parameter growth. For mining conveyors, WTConv’s anti-interference capability is particularly valuable: it distinguishes high-frequency damage details from low-frequency dust noise, outperforming dilated and large-kernel convolutions in balancing accuracy and efficiency.
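As a rough illustration of the trade-off described above, the following arithmetic (our own simplified accounting, not the exact parameter budgets of these models) contrasts quadratic per-channel parameter growth for a single large kernel with WTConv's linear growth in the number of cascade levels, while its effective receptive field grows exponentially with the level count:

```python
# Illustrative arithmetic only: simplified per-channel parameter counts,
# assuming one small kernel on the input plus one per cascade level
# applied to the four wavelet sub-bands.

def large_kernel_params(k: int) -> int:
    """Weights per channel for a single k x k depth-wise kernel."""
    return k * k

def wtconv_params(levels: int, small_k: int = 3) -> int:
    """Weights per channel for WTConv: one small kernel on the input
    plus one small kernel per decomposition level on 4 sub-bands."""
    return small_k * small_k * (1 + 4 * levels)

def wtconv_receptive_field(levels: int, small_k: int = 3) -> int:
    """Effective receptive field: the small kernel acts on features
    downsampled by 2 at every level."""
    return small_k * 2 ** levels

print(large_kernel_params(31))    # 961 weights for one 31 x 31 kernel
print(wtconv_params(3))           # 117 weights for a 3-level cascade
print(wtconv_receptive_field(3))  # 24-pixel effective receptive field
```

Under these assumptions, each extra cascade level adds a constant 36 weights per channel but doubles the effective receptive field, which is the "logarithmic expansion with linear parameters" property cited in the text.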

2.2. Scale-Sensitive Loss Functions

IoU, GIoU, and CIoU [17] are the most widely used loss functions, but they rely on fixed weights for all target scales. IoU quantifies the overlap between predicted and ground-truth bounding boxes. GIoU adds a penalty based on the smallest enclosing box to address IoU’s limitations. CIoU further incorporates center distance and aspect ratio. These losses perform poorly for mining conveyors, where damage scales range from 5 pixels (micro-holes) to 300 pixels (large blocks).
To address scale dependence, recent studies have proposed scale-aware losses. Scale-IoU Loss: adjusts weights based on target area, increasing the weight of small targets [23]. However, it uses empirical thresholds (e.g., defining “small” targets as <50 pixels) that lack adaptability to feature map resolution—a 50-pixel target in a 256 × 256 feature map is “large” relative to a 128 × 128 map, leading to mismatched supervision. Alpha-IoU [17] introduces a power parameter to adjust IoU sensitivity, but this parameter is static and cannot dynamically adapt to mining-specific damage morphologies. Scale-based Dynamic Loss [20] used in this study dynamically calculates weights using both target size and feature map resolution, achieving better alignment with the characteristics of large scale variations.

3. Methodology

3.1. Overview of the Proposed Model

The core task of this study focuses on multi-class damage detection for mining conveyor belts based on image and video data, where the target damage categories include block, crack, foreign, and hole [27]. The input data for this task consists of images and videos captured by on-site monitoring cameras in actual mining environments, and these visual data typically exhibit characteristics of real-world industrial scenarios such as noise interference, dust contamination, and uneven lighting, which pose natural challenges to the detection process. The expected output of the task is to generate, for each detected damage instance in the input visual data, corresponding bounding boxes that accurately localize the damage regions, along with the specific damage category assigned to each bounding box and the corresponding confidence scores that reflect the reliability of the category classification and localization results.
To address the multi-scale damage characteristics and complex environmental interference in mining conveyor belt detection, this study proposes a novel end-to-end damage detection framework (as illustrated in Figure 1), termed WTConv-YOLOv11. The framework integrates wavelet transform convolution (WTConv) and improved SD Loss into the YOLOv11 [14] architecture, following a six-step technical workflow: data collection, WTConv-based feature enhancement, damage detection network construction, SD Loss optimization, result visualization, and software development.
The core design leverages WTConv’s multi-frequency analysis to separate damage details from noise and SD Loss’s dynamic weighting to address scale-dependent regression errors. WTConv is selectively embedded in C3K2 Backbone and Neck modules, where multi-frequency feature extraction is most impactful for cluttered industrial backgrounds.

3.2. Wavelet Transform Convolution (WTConv)

The WTConv layer, as proposed in [19], leverages wavelet transform (WT) to decompose input features into multiple frequency-related components, enabling small-kernel convolutions to operate on enlarged receptive fields. For integration into YOLOv11:
Wavelet Decomposition: For each input feature map $X \in \mathbb{R}^{C \times H \times W}$ (where $C$ denotes the channel count, and $H$ and $W$ the spatial dimensions of the feature map), a cascade Haar wavelet transform is applied recursively.
For a given image X, the one-level Haar WT performed on one spatial dimension is realized through depth-wise convolution with the kernels [1, 1]/√2 and [1, −1]/√2, and is then followed by a conventional downsampling operator with a factor of 2. For the realization of 2D Haar WT, we combine the above operation across both dimensions, resulting in a depth-wise convolution with a stride of 2 that employs the subsequent set of four filters:
$$f_{LL} = \frac{1}{2}\begin{bmatrix}1 & 1\\ 1 & 1\end{bmatrix},\quad f_{LH} = \frac{1}{2}\begin{bmatrix}1 & -1\\ 1 & -1\end{bmatrix},\quad f_{HL} = \frac{1}{2}\begin{bmatrix}1 & 1\\ -1 & -1\end{bmatrix},\quad f_{HH} = \frac{1}{2}\begin{bmatrix}1 & -1\\ -1 & 1\end{bmatrix}$$
$$[X_{LL},\,X_{LH},\,X_{HL},\,X_{HH}] = \mathrm{Conv}([f_{LL}, f_{LH}, f_{HL}, f_{HH}],\,X)$$
It should be noted that $f_{LL}$ functions as a low-pass filter, whereas $f_{LH}$, $f_{HL}$, and $f_{HH}$ constitute a group of high-pass filters. For every input channel, the convolution output consists of four channels, each with half the spatial resolution of $X$ along each dimension. $X_{LL}$ corresponds to the low-frequency component of $X$, while $X_{LH}$, $X_{HL}$, and $X_{HH}$ represent its horizontal, vertical, and diagonal high-frequency components, respectively. Given that the kernels presented in Equation (1) form an orthonormal basis, the inverse wavelet transform (IWT) can be implemented by means of transposed convolution.
$$X = \mathrm{Conv\text{-}transposed}([f_{LL}, f_{LH}, f_{HL}, f_{HH}],\,[X_{LL},\,X_{LH},\,X_{HL},\,X_{HH}])$$
At each transform level $i$, the algorithm decomposes the low-frequency component $X_{LL}^{(i-1)}$ into four separate sub-bands:
$X_{LL}^{(i)}$: low-frequency component (downsampled by 2× in both dimensions).
$X_{LH}^{(i)}$, $X_{HL}^{(i)}$, $X_{HH}^{(i)}$: high-frequency components (horizontal, vertical, and diagonal details).
This decomposition emphasizes low-frequency features (e.g., large-scale damage contours) while preserving high-frequency details (e.g., fine cracks).
Multi-Frequency Convolutions: At each decomposition level $i$, a 3 × 3 depth-wise convolution with learnable weights $W^{(i)}$ is applied to all four sub-bands $X_{LL}^{(i)}$, $X_{LH}^{(i)}$, $X_{HL}^{(i)}$, $X_{HH}^{(i)}$. This produces output sub-bands $Y_{LL}^{(i)}$, $Y_{LH}^{(i)}$, $Y_{HL}^{(i)}$, $Y_{HH}^{(i)}$, where small kernels effectively operate on spatially expanded receptive fields (scaling as $2^i \times 2^i$ relative to the input).
The cascaded wavelet decomposition is subsequently implemented by recursively decomposing the low-frequency component. The operation corresponding to each decomposition level is defined as
$$X_{LL}^{(i)},\,X_{LH}^{(i)},\,X_{HL}^{(i)},\,X_{HH}^{(i)} = \mathrm{WT}(X_{LL}^{(i-1)})$$
where $X_{LL}^{(0)} = X$ and $i$ denotes the current decomposition level. This decomposition scheme yields enhanced frequency resolution while reducing the spatial resolution of the low-frequency components.
First, the wavelet transform (WT) is employed to filter and downscale both the low- and high-frequency components of the input data. Next, small-kernel depth-wise convolution is applied to the distinct frequency maps, after which the inverse wavelet transform (IWT) is utilized to reconstruct the final output. In other words, the process is given by
$$Y = \mathrm{IWT}(\mathrm{Conv}(W,\,\mathrm{WT}(X))),$$
where $X$ denotes the input tensor, and $W$ represents the weight tensor corresponding to a $k \times k$ depth-wise kernel, with the number of input channels being four times that of $X$. This operation not only decouples the convolution operations across different frequency components but also enables a smaller kernel to act on a broader region of the original input—thus expanding its receptive field relative to the input.
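The pipeline $Y = \mathrm{IWT}(\mathrm{Conv}(W, \mathrm{WT}(X)))$ can be sketched in NumPy. This is our own single-channel, one-level illustration (function names and the naive convolution are ours, not the authors' implementation): a stride-2 Haar filter bank, a 3 × 3 depth-wise convolution on each sub-band, and the inverse transform.

```python
import numpy as np

# Single-channel sketch of one WTConv level: Y = IWT(Conv(W, WT(X))).
# The four Haar kernels form an orthonormal basis, so IWT inverts WT exactly.

FILTERS = {
    "LL": np.array([[1, 1], [1, 1]]) / 2.0,    # low-pass
    "LH": np.array([[1, -1], [1, -1]]) / 2.0,  # horizontal detail
    "HL": np.array([[1, 1], [-1, -1]]) / 2.0,  # vertical detail
    "HH": np.array([[1, -1], [-1, 1]]) / 2.0,  # diagonal detail
}

def haar_wt(x):
    """One-level 2D Haar WT: (H, W), H and W even -> four (H/2, W/2) sub-bands."""
    h, w = x.shape
    blocks = x.reshape(h // 2, 2, w // 2, 2)  # non-overlapping 2x2 blocks (stride 2)
    return {k: np.einsum("iajb,ab->ij", blocks, f) for k, f in FILTERS.items()}

def haar_iwt(sub):
    """Inverse transform; exact because the four filters are orthonormal."""
    h2, w2 = sub["LL"].shape
    blocks = sum(np.einsum("ij,ab->iajb", sub[k], f) for k, f in FILTERS.items())
    return blocks.reshape(h2 * 2, w2 * 2)

def conv3_same(x, k):
    """Naive 3 x 3 same-padding convolution (cross-correlation) on (H, W)."""
    p = np.pad(x, 1)
    h, w = x.shape
    return sum(k[a, b] * p[a:a + h, b:b + w] for a in range(3) for b in range(3))

def wtconv(x, kernels):
    """Y = IWT(Conv(W, WT(X))) for one decomposition level."""
    sub = haar_wt(x)
    return haar_iwt({k: conv3_same(sub[k], kernels[k]) for k in sub})
```

With identity kernels (a single 1 at the kernel center), the pipeline reduces to $\mathrm{IWT}(\mathrm{WT}(X)) = X$, which makes the orthonormality claim easy to verify numerically.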

3.3. Integration of WTConv into YOLOv11

In the YOLOv11 framework [28], the C3K2 module participates in hierarchical feature extraction (Backbone) and feature fusion/refinement (Neck). We replace the Conv layers in four Bottleneck modules of the C3K2 Backbone (layers 3, 5, 7, and 9) and in two C3K2 Neck modules with WTConv (Figure 2), targeting scale-aware feature extraction:
  • Backbone WTConv: Captures low-frequency global context (large blocks) and high-frequency details (cracks) from multi-scale feature maps.
  • Neck WTConv: Refines fused features to enhance small-target (micro-holes) visibility.
To enhance the model’s receptive field and feature-extraction capability, we focus on its internal Bottleneck sub-structure. Traditionally, the standard Conv in the Bottleneck performs local feature aggregation via fixed-kernel sliding operations. While effective for basic feature extraction, its receptive field is limited by kernel size and stacking depth. As shown in Figure 2, we substitute the Conv layer in the Bottleneck with WTConv. Wavelet convolution leverages wavelet transform principles to decompose input feature maps into multi-scale sub-bands. This enables the model to concurrently capture fine-grained high-frequency details and low-frequency global contextual information, while inherently expanding the effective receptive field in the process.
As illustrated in Figure 3, the aforementioned two-level combined operation can be extended to additional levels. The calculation process is given by
$$X_{LL}^{(1)},\,X_{H}^{(1)} = \mathrm{WT}(X_{LL}^{(0)}),$$
$$X_{LL}^{(2)},\,X_{H}^{(2)} = \mathrm{WT}(X_{LL}^{(1)}),$$
$$Y_{LL}^{(1)},\,Y_{H}^{(1)} = \mathrm{Conv}(W^{(1)},\,(X_{LL}^{(1)},\,X_{H}^{(1)}))$$
$$Y_{LL}^{(2)},\,Y_{H}^{(2)} = \mathrm{Conv}(W^{(2)},\,(X_{LL}^{(2)},\,X_{H}^{(2)}))$$
where $X_{LL}^{(0)}$ denotes the input to the layer, and $X_{H}^{(i)}$ represents all three high-frequency components at level $i$.
To integrate the outputs from distinct frequency components, we leverage the linear nature inherent to both wavelet transform (WT) and its inverse operation (IWT). This linearity ensures that the inverse transform of a summed input equals the sum of individual inverse transforms, following the relation IWT(X + Y) = IWT(X) + IWT(Y). On this basis, we proceed to implement
$$Z^{(2)} = \mathrm{IWT}(Y_{LL}^{(2)},\,Y_{H}^{(2)})$$
$$Z^{(1)} = \mathrm{IWT}(Y_{LL}^{(1)} + Z^{(2)},\,Y_{H}^{(1)})$$
$$Z^{(0)} = Y^{(0)} + Z^{(1)}$$
The updated C3K2 module with WTConv in the Bottleneck is reintegrated into YOLOv11’s Backbone. By expanding the receptive field via WTConv, the model gains stronger capability to detect objects of varying sizes and complexities, aligning with the goal of boosting detection accuracy. This modification directly targets the feature-transformation core of C3K2, leveraging wavelet convolution to mitigate the inherent limitations of standard convolution operations in capturing long-range dependencies. The change is localized to the Bottleneck but propagates benefits across the entire YOLOv11 pipeline.

3.4. Scale-Based Dynamic Loss Function

Although integrating wavelet convolution already strengthens the YOLO model, the conveyor belt damage studied in this paper is highly specific. Owing to the difficulty of feature extraction and the variable scales of damage, existing loss functions do not fully account for differences in scale and position sensitivity across target scales; the instability of small targets is especially prominent, which limits detection performance for targets of different scales. To address these issues, this paper improves the loss function so that the model can accurately locate damage regions. Specifically, the loss function is replaced with SD Loss (Scale-based Dynamic Loss) [20], which enhances localization accuracy and accelerates model convergence. The loss function consists of the following two terms:
$$L_{BS} = 1 - IoU + \alpha v,\quad L_{BL} = \frac{\rho^2(b_p,\,b_{gt})}{c^2}$$
Referring to Figure 4, Intersection over Union ($IoU$) quantifies the overlap ratio between the predicted bounding box and the ground-truth bounding box, and $\alpha v$ evaluates the aspect-ratio consistency of the bounding box. $\rho(\cdot)$ denotes the Euclidean distance; $b_p$ and $b_{gt}$ represent the center points of the predicted bounding box $B_p$ and the ground-truth bounding box $B_{gt}$, respectively; and $c$ refers to the diagonal length of the minimum enclosing rectangle covering the two bounding boxes.
SD Loss dynamically tunes the influence coefficients of $L_{BS}$ and $L_{BL}$ based on the target scale. The influence coefficient $\beta_B$ is calculated as
$$R_{OC} = \frac{w_0 \times h_0}{w_c \times h_c}$$
$$\beta_B = \min\!\left(\frac{B_{gt}}{B_{gt\max}} \times R_{OC} \times \delta,\ \delta\right)$$
where $w_0$ and $h_0$ denote the width and height of the original image, respectively, $w_c$ and $h_c$ represent the corresponding dimensions of the current feature map, and $B_{gt\max}$ refers to the maximum size of the ground-truth bounding boxes. The impact coefficient of the loss is determined by the area of the current target box, with its value capped at $\delta$. In this work, $\delta$ is set to 0.5. The final Scale-based Dynamic Loss function for bounding box regression is formulated as follows:
$$\beta_{L_{BS}} = 1 - \delta + \beta_B,\quad \beta_{L_{BL}} = 1 + \delta - \beta_B$$
$$L_{SD\text{-}B} = \beta_{L_{BS}} \times L_{BS} + \beta_{L_{BL}} \times L_{BL}$$
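The weighting scheme above can be sketched in a few lines. This is a hedged illustration of the formulas as reconstructed here, with function and variable names of our own (not from the authors' code); it shows how a small target receives a larger localization weight while the largest target gets balanced weights:

```python
# Sketch of the SD Loss influence coefficients; delta = 0.5 as in the paper.
# area_gt / area_gt_max plays the role of B_gt / B_gt_max.

def sd_weights(area_gt, area_gt_max, img_wh, feat_wh, delta=0.5):
    """Return the pair (beta_LBS, beta_LBL) for one ground-truth box."""
    r_oc = (img_wh[0] * img_wh[1]) / (feat_wh[0] * feat_wh[1])  # R_OC
    beta_b = min(area_gt / area_gt_max * r_oc * delta, delta)   # capped at delta
    return 1 - delta + beta_b, 1 + delta - beta_b

# A 5 x 5-pixel micro-hole vs. a 300 x 300-pixel block, assuming a
# 640 x 640 input image and an 80 x 80 feature map (hypothetical sizes):
small = sd_weights(25, 90000, (640, 640), (80, 80))
large = sd_weights(90000, 90000, (640, 640), (80, 80))
print(small)  # localization term (beta_LBL) weighted more for the small target
print(large)  # (1.0, 1.0): balanced weights once beta_B saturates at delta
```

Note that the two coefficients always sum to 2, so SD Loss redistributes emphasis between the similarity and localization terms rather than rescaling the total loss.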
SD Loss complements WTConv by enhancing regression accuracy for scale-specific targets: while WTConv extracts multi-scale feature representations, SD Loss dynamically adjusts loss weights to mitigate scale-dependent localization errors—specifically, increasing the weight of $L_{BL}$ for small holes to reduce positional deviation, and prioritizing the weight of $L_{BS}$ for large blocks to improve IoU consistency.
SD Loss enhances model stability and detection accuracy by dynamically adjusting loss coefficients. For both small and large damages in mining conveyor belt detection, it precisely adjusts the loss calculation, reduces the interference of IoU fluctuations, and strengthens the regression ability of the detection model, enabling it to capture damage characteristics more stably. Additionally, regarding computational efficiency, SD Loss features a straightforward calculation procedure; it avoids introducing undue computational complexity and time expense, ensuring efficiency throughout the process.

4. Experiments

4.1. Dataset

The Mining Industrial Conveyor Belt Dataset utilized in this study is a specialized collection designed for object detection tasks in mining conveyor belt monitoring scenarios. This dataset is designed to enable the automated detection of anomalies on conveyor belts, a capability critical to maintaining operational efficiency and safety in coal and mineral mining operations.
The dataset comprises 2345 high-resolution images captured via high-definition cameras deployed in real mining environments. These images encapsulate diverse operational conditions, including varying lighting intensities, material flow rates, and conveyor belt surface states, thereby enhancing the generalizability of models trained on this data. To augment sample diversity and improve model robustness against environmental variations, data augmentation techniques (e.g., rotation, scaling, and brightness adjustment) were applied during dataset preparation. Figure 5 and Table 1 illustrate the details of the dataset.

4.2. Evaluation Metrics

For the quantitative evaluation of our method, we employ mean Average Precision (mAP) [29] for each damage category as the core evaluation metric. Specifically, Average Precision (AP) for an individual class is calculated as the area under the Precision–Recall (P-R) curve. We select mAP@0.5 as the key indicator, since a detection result is deemed a True Positive (TP) when the Intersection over Union (IoU) between the predicted region and the ground truth bounding box exceeds the predefined threshold of 0.5. Some definitions (cited from PASCAL VOC Challenge 2012 [30]) are as follows:
$$IoU = \frac{|A \cap B|}{|A \cup B|}$$
$$Precision = \frac{TP}{TP + FP}$$
$$Recall = \frac{TP}{TP + FN}$$
TP (True Positive): Correctly detected damage; TN (True Negative): Correctly identified undamaged region; FP (False Positive): False damage detection; FN (False Negative): Missed damage.
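The metric definitions above can be made concrete with a short worked example. The helper names are ours; boxes are assumed to be axis-aligned `(x1, y1, x2, y2)` tuples:

```python
# Worked sketch of the IoU / Precision / Recall definitions above.

def iou(a, b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP), Recall = TP/(TP+FN)."""
    return tp / (tp + fp), tp / (tp + fn)

# Two unit-offset 2 x 2 boxes: intersection area 1, union area 7.
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7, below the 0.5 TP threshold
```

Under the mAP@0.5 protocol, a prediction with this overlap would count as a False Positive, since its IoU with the ground truth does not exceed 0.5.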

4.3. Experimental Setup

  • Hardware: Intel Core i9-12900K CPU, RTX 3090 GPU (24 GB VRAM), 64 GB RAM.
  • Software: PyTorch 2.1, Python 3.9.
  • Training hyperparameters: Epochs = 100, batch size = 16, optimizer = AdamW (weight decay = 1 × 10−4), initial learning rate = 1 × 10−4 (cosine annealing to 1 × 10−5).
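The cosine-annealing schedule listed above can be sketched as follows. This is a hedged, epoch-level illustration assuming the standard cosine form from 1 × 10−4 down to 1 × 10−5 over 100 epochs; the training framework's exact schedule (warmup, per-step updates) may differ:

```python
import math

# Epoch-level cosine annealing from lr_max (epoch 0) to lr_min (final epoch).

def cosine_lr(epoch, total_epochs=100, lr_max=1e-4, lr_min=1e-5):
    """Standard cosine-annealing learning-rate schedule."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * epoch / total_epochs))

print(cosine_lr(0))    # 1e-4 at the start of training
print(cosine_lr(50))   # midpoint: (lr_max + lr_min) / 2 = 5.5e-5
print(cosine_lr(100))  # 1e-5 at the end
```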

4.4. Results and Analysis

4.4.1. Effect of Damage Detection

To comprehensively validate the superiority of the proposed WTConv-YOLOv11 model in mining conveyor belt damage detection, we conducted comparative experiments with seven representative object detection models, covering both two-stage and single-stage architectures. All models were trained and tested under the same experimental setup to ensure fairness.
Table 2 reports the results. Two-stage models (e.g., Faster R-CNN [13]) excel at localization but lack real-time performance; single-stage models (e.g., YOLOv11 [14], CenterNet [31]) are faster but struggle with multi-scale damages. WTConv-YOLOv11 achieves the highest mAP@0.5 (73.8%), outperforming recent baselines [24,25].

4.4.2. Ablation Study

Ablation studies were performed to evaluate the individual contribution of two core modules in WTConv-YOLOv11: (1) WTConv in the Backbone. (2) SD Loss.
Table 3 evaluates the impact of WTConv and SD Loss. Adding WTConv to the Backbone of the YOLOv11 structure alone improves mAP@0.5 by 2.7% (70.3% to 73.0%) over the baseline, with notable gains in block and crack detection. This is because WTConv’s multi-frequency decomposition expands the receptive field, enabling capture of large-scale block features without increasing parameters (GFLOPs [34] decrease from 6.6 to 6.4). SD Loss alone increases mAP@0.5 by 3.2% (70.3% to 73.5%), mainly enhancing the localization accuracy of small holes. When both modules are integrated into the framework, the proposed model attains a maximum mAP@0.5 of 73.8%—a notable 3.5% gain over the baseline model.
Table 4 evaluates the class-specific impact of WTConv and SD Loss. WTConv improves crack AP by 3% (87.1% to 90.1%), while SD Loss enhances hole AP by 5.1% (30.7% to 35.8%).
While WTConv and SD Loss bring clear improvements to their targeted categories, a slight performance trade-off appears when the two modules are combined, which is a reasonable balance for the overall multi-scale detection performance. Taking hole detection as an example, the hole mAP@0.5 of SD Loss alone is 0.358, and it decreases slightly to 0.342 when combined with WTConv. This stems from a temporary feature-emphasis conflict between WTConv’s high-frequency decomposition and SD Loss’s small-target weighting in the fusion stage, but the value is still 3.5% (30.7% to 34.2%) higher than the baseline, maintaining strong small-target detection capability.
Figure 6 illustrates the visualization outcomes of the experiment. The first column of each row is the original image with ground truth, where the boxes clearly mark the actual location and extent of damages in the image. The subsequent columns show the detection results of YOLOv11, CenterNet, and our WTConv-YOLOv11, respectively, with predicted bounding boxes marked in different colors. Our WTConv-YOLOv11 model exhibits the best performance, particularly in target localization accuracy and small-object detection capability. This synergistic effect confirms that WTConv provides multi-scale feature representation, while SD Loss optimizes regression accuracy—addressing the two core limitations of YOLOv11 for damage detection of mining conveyor belts.

4.5. Software Development

To achieve end-to-end real-time damage detection for mining conveyor belts, a Python-based integrated software system was developed (as shown in Figure 7), with its architecture structured into four core functional modules: data acquisition module, model inference module, result visualization module, and external device control module. The overall architecture follows a modular design paradigm, ensuring loose coupling between components to facilitate subsequent function expansion and maintenance. The detection performance can be visualized by loading different models, including or excluding the WTConv and SD Loss modules. From top to bottom, the interface components and their functions are elaborated as follows:
(1) Top Basic Information Area. This area displays the interface title and the current system timestamp, confirming that this tool is a dedicated monitoring system for damage inspection of mining conveyor belts.
(2) Model and Parameter Configuration Area. This section supports the configuration of detection-related parameters; camera settings can also be adjusted here.
(3) Real-Time Detection Display Area. This area consists of two sub-windows, live video streaming and detected object frame, both synchronously presenting the on-site view of the mining conveyor belt.
(4) Control and Status Prompt Area. This part controls the stopping and starting of the conveyor belt.

5. Conclusions

In summary, this study introduces large receptive field convolutions (WTConv) as a practical approach to the multi-scale damage detection challenge for mining conveyor belts, significantly enhancing detection accuracy. The core mechanisms underlying this improvement lie in WTConv’s dual capabilities: capturing global spatial relationships, which is crucial for identifying large-scale tears spanning multiple image regions, and multi-frequency decomposition, which enhances sensitivity to both low-frequency features and high-frequency details.
From a practical perspective, the proposed method demonstrates strong compatibility with existing monitoring systems, as it leverages on-site cameras without requiring additional sensors, thus minimizing deployment costs and complexity. Moreover, it offers a flexible trade-off between processing speed and detection accuracy through configurable wavelet transform (WT) levels, with practical configurations.
Specifically, compared to the baseline YOLOv11, the proposed WTConv-YOLOv11 achieves a 3.5% improvement in mAP@0.5; ablation study results show that the WTConv module brings a 3% increase in crack AP (benefiting from high-frequency feature capture), while the SD Loss module achieves a 5.1% improvement in hole AP (benefiting from dynamic localization weighting).
Despite these advancements, the method has inherent limitations. It exhibits higher computational overhead compared to lightweight models, though this can be mitigated through optimized WTConv implementation. Additionally, its performance is limited when detecting extremely small damage like holes, highlighting the need for further refinement.
Moving forward, our future research will focus on two primary directions. First, extending the framework to video sequences by integrating temporal convolutions (TCN) [25] to enable dynamic damage tracking, which will enhance its applicability to real-time, continuous monitoring. Second, deploying the model on edge devices (e.g., Raspberry Pi) through model compression techniques, aiming to facilitate on-site, low-latency detection in resource-constrained environments. These efforts will further solidify the method’s practical value in industrial conveyor belt maintenance and safety monitoring.

Author Contributions

F.X.: Conceptualization, original idea, methodology design, and first draft writing; J.W.: Software development, algorithm implementation, and experimental data processing; S.A.G.: Research framework optimization, technical route guidance, and manuscript revision; A.N.E.: Supervision, experimental design review, and academic direction guidance; K.A.V.: Project administration, funding acquisition, resource coordination, and final manuscript approval. All authors have read and agreed to the published version of the manuscript.

Funding

This research was carried out with financial support under the state assignment of the Ministry of Science and Higher Education of the Russian Federation (No. 075-03-2024-082-2).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

The authors gratefully acknowledge the Department of Mining Machines and Complexes of T. F. Gorbachev Kuzbass State Technical University for their support and constructive discussions that contributed to this research.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
CNN: Convolutional Neural Network
YOLO: You Only Look Once
WT: Wavelet Transform
WTConv: Wavelet Transform Convolution
SD Loss: Scale-based Dynamic Loss
IoU: Intersection over Union
GIoU: Generalized Intersection over Union
CIoU: Complete Intersection over Union
AP: Average Precision
mAP: mean Average Precision
GFLOPs: Giga Floating-Point Operations
TCN: Temporal Convolutional Network

References

  1. Rzeszowska, A.; Błażej, R.; Jurdziak, L. Non-Destructive Testing for Conveyor Belt Monitoring and Diagnostics: A Review. Appl. Sci. 2025, 15, 13272. [Google Scholar] [CrossRef]
  2. Sha, L.; Zhang, W.; Zhou, J.; Peng, C.; Yu, Z. Review of Non-Destructive Testing Techniques for Conveyor Belt Damage. NDT 2025, 3, 27. [Google Scholar] [CrossRef]
  3. Wang, Q.; Wang, M.; Sun, J.; Chen, D.; Shi, P. Review of Surface-defect Detection Methods for Industrial Products Based on Machine Vision. IEEE Access 2025, 13, 90668–90697. [Google Scholar] [CrossRef]
  4. Li, X.; Huang, X.; Zhang, C.; Zhang, L. Fault detection network for X-ray imaging steel cord conveyor belt based on improved YOLOv11. Nondestruct. Test. Eval. 2025, 1–20. [Google Scholar] [CrossRef]
  5. Ling, J.; Fu, Z.; Yuan, X. Lightweight coal mine conveyor belt foreign object detection based on improved Yolov8n. Sci. Rep. 2025, 15, 10361. [Google Scholar] [CrossRef]
  6. Peng, F.; Hao, K.; Lu, X. Camera-Adaptive Foreign Object Detection for Coal Conveyor Belts. Appl. Sci. 2025, 15, 4769. [Google Scholar] [CrossRef]
  7. Huang, J.; Ji, L.; Ma, X.; Ye, M. BeltCrack: The First Sequential-image Industrial Conveyor Belt Crack Detection Dataset and Its Baseline with Triple-domain Feature Learning. arXiv 2025, arXiv:2506.17892. [Google Scholar]
  8. Liu, W.; Tao, Q.; Wang, N.; Xiao, W.; Pan, C. YOLO-STOD: An industrial conveyor belt tear detection model based on Yolov5 algorithm. Sci. Rep. 2025, 15, 1659. [Google Scholar] [CrossRef]
  9. Yang, L.; Chen, G.; Liu, J.; Guo, J. Wear State Detection of Conveyor Belt in Underground Mine Based on Retinex-YOLOv8-EfficientNet-NAM. IEEE Access 2024, 12, 25309–25324. [Google Scholar] [CrossRef]
  10. Wang, R.; Li, Y.; Yang, F.; Wang, Z.; Dong, J.; Yuan, C.; Lu, X. PLC based laser scanning system for conveyor belt surface monitoring. Sci. Rep. 2024, 14, 27914. [Google Scholar] [CrossRef]
  11. Wang, H.; Kou, Z.; Wang, Y. GES-YOLO: A Light-Weight and Efficient Method for Conveyor Belt Deviation Detection in Mining Environments. Machines 2025, 13, 126. [Google Scholar] [CrossRef]
  12. Ni, Y.; Cheng, H.; Hou, Y.; Guo, P. Study of conveyor belt deviation detection based on improved YOLOv8 algorithm. Sci. Rep. 2024, 14, 26876. [Google Scholar] [CrossRef] [PubMed]
  13. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  14. Khanam, R.; Hussain, M. YOLOv11: An overview of the key architectural enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar] [CrossRef]
  15. Ding, X.; Zhang, X.; Han, J.; Ding, G. Scaling up your kernels to 31×31: Revisiting large kernel design in CNNs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11963–11975. [Google Scholar]
  16. Chen, H.; Chu, X.; Ren, Y.; Zhao, X.; Huang, K. Pelk: Parameter-efficient large kernel convnets with peripheral convolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 5557–5567. [Google Scholar]
  17. He, J.; Erfani, S.; Ma, X.; Bailey, J.; Chi, Y.; Hua, X.S. Alpha-IoU: A family of power Intersection over Union losses for bounding box regression. arXiv 2021, arXiv:2110.13675. [Google Scholar]
  18. Zhang, M.; Zhang, Y.; Zhou, M.; Jiang, K.; Shi, H.; Yu, Y.; Hao, N. Application of lightweight convolutional neural network for damage detection of conveyor belt. Appl. Sci. 2021, 11, 7282. [Google Scholar] [CrossRef]
  19. Finder, S.E.; Amoyal, R.; Treister, E.; Freifeld, O. Wavelet convolutions for large receptive fields. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2024; pp. 363–380. [Google Scholar]
  20. Yang, J.; Liu, S.; Wu, J.; Su, X.; Hai, N.; Huang, X. Pinwheel-shaped convolution and scale-based dynamic loss for infrared small target detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 9202–9210. [Google Scholar]
  21. Xin, Q.; Hu, S.; Liu, S.; Zhao, L.; Zhang, Y.-D. An attention-based wavelet convolution neural network for epilepsy EEG classification. IEEE Trans. Neural Syst. Rehabil. Eng. 2022, 30, 957–966. [Google Scholar] [CrossRef]
  22. Sun, Z.; Lin, Y.; Li, Y.; Lin, Z. Crossed wavelet convolution network for few-shot defect detection of industrial chips. Sensors 2025, 25, 4377. [Google Scholar] [CrossRef]
  23. Zakria, Z.; Deng, J.; Kumar, R.; Khokhar, M.S.; Cai, J.; Kumar, J. Multiscale and direction target detecting in remote sensing images via modified YOLO-v4. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 1039–1048. [Google Scholar] [CrossRef]
  24. Chen, Y.; Zhou, M.; Hu, F.; Gao, L.; Wang, K. YOLOv8-LDH: A lightweight model for detection of conveyor belt damage based on multispectral imaging. Measurement 2025, 245, 116675. [Google Scholar] [CrossRef]
  25. Liu, W.; Tao, Q.; Pei, H. Yolo-EV2: An Industrial Mining Conveyor Belt Tear Detection Model Based on Improved Yolov5 Algorithm for Efficient Backbone Networks. In Proceedings of the 2024 International Conference on Cyber-Physical Social Intelligence (ICCSI), Doha, Qatar, 8–12 November 2024; pp. 1–5. [Google Scholar]
  26. Khalfaoui-Hassani, I.; Pellegrini, T.; Masquelier, T. Dilated convolution with learnable spacings. arXiv 2021, arXiv:2112.03740. [Google Scholar]
  27. Wang, G.; Yang, Z.; Sun, H.; Zhou, Q.; Yang, Z. AC-SNGAN: Multi-class data augmentation for damage detection of conveyor belt surface using improved ACGAN. Measurement 2024, 224, 113814. [Google Scholar] [CrossRef]
  28. Zhou, K.; Jiang, S. Forest fire detection algorithm based on improved YOLOv11n. Sensors 2025, 25, 2989. [Google Scholar] [CrossRef]
  29. Hidayat, Z.S.; Wijaya, Y.A.; Kurniawan, R. Optimizing YOLOv8 for autonomous driving: Batch size for best mean average precision (mAP). J. Tek. Inform. (JUTIF) 2024, 5, 1147–1153. [Google Scholar] [CrossRef]
  30. Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes (VOC) Challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
  31. Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. CenterNet: Keypoint triplets for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 6569–6578. [Google Scholar]
  32. Zhang, S.; Wen, L.; Bian, X.; Lei, Z.; Li, S.Z. Single-shot refinement neural network for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4203–4212. [Google Scholar]
  33. Lu, X.; Kang, X.; Nishide, S.; Ren, F. Object detection based on SSD-ResNet. In Proceedings of the 2019 IEEE 6th International Conference on Cloud Computing and Intelligence Systems (CCIS), Singapore, 19–21 December 2019; pp. 89–92. [Google Scholar]
  34. Zhou, W.; Min, X.; Hu, R.; Long, Y.; Luo, H. FasterX: Real-time object detection based on edge GPUs for UAV applications. arXiv 2022, arXiv:2209.03157. [Google Scholar]
Figure 1. Overview of the proposed WTConv-YOLOv11 model architecture for mining conveyor belt damage detection.
Figure 2. Schematic diagram of replacing convolutional layers in the Bottleneck module with wavelet convolution.
Figure 3. Flowchart of the two-level combined operation of wavelet transform convolution in the Bottleneck of C3K2 module.
Figure 4. Visual illustration of the Scale-based Dynamic Loss (SD Loss) calculation mechanism.
Figure 5. Sample images of the mining industrial conveyor belt dataset covering four critical anomaly categories.
Figure 6. Visual comparison of damage detection results among YOLOv11, Centernet, and WTConv-YOLOv11 models.
Figure 7. Interface snapshot of the developed Python-based software for real-time mining conveyor belt damage detection.
Table 1. Dataset instance distribution.

Split        Images   Block   Crack   Foreign   Hole
Training     1876     6212    226     1596      234
Validation    234      748     24      212       31
Test          235      737     28      176       32

Values in the Block, Crack, Foreign, and Hole columns are instance counts.
Table 2. Performance comparison with state-of-the-art models.

Model                   Architecture   Block   Crack   Foreign   Hole    mAP@0.5
Faster R-CNN [13]       Two-stage      0.621   0.785   0.923     0.256   0.646
SSD [33]                Single-stage   0.638   0.801   0.937     0.269   0.661
CenterNet [31]          Single-stage   0.675   0.834   0.952     0.289   0.688
YOLOv11 [14]            Single-stage   0.692   0.871   0.971     0.307   0.703
YOLO-EV2 [25]           Single-stage   0.689   0.867   0.968     0.301   0.706
YOLOv8-LDH [24]         Single-stage   0.695   0.873   0.972     0.312   0.713
RefineDet [32]          Single-stage   0.653   0.812   0.945     0.278   0.672
WTConv-YOLOv11 (ours)   Single-stage   0.724   0.927   0.983     0.317   0.738
Table 3. Ablation study on WTConv and SD Loss modules.

Model     WTConv   SD Loss   mAP@0.5   mAP@0.5:0.95   P       R       GFLOPs
YOLOv11                      0.703     0.434          0.788   0.612   6.6
YOLOv11   ✓                  0.730     0.453          0.813   0.642   6.4
YOLOv11            ✓         0.735     0.478          0.841   0.642   6.6
YOLOv11   ✓        ✓         0.738     0.451          0.834   0.654   6.4
Table 4. Class-specific mAP@0.5 performance of two core modules.

Model     WTConv   SD Loss   Block   Crack   Foreign   Hole    Processing Time (ms/image)
YOLOv11                      0.692   0.871   0.971     0.307   28.5
YOLOv11   ✓                  0.712   0.901   0.981     0.327   30.2
YOLOv11            ✓         0.730   0.870   0.983     0.358   29.1
YOLOv11   ✓        ✓         0.721   0.927   0.982     0.342   30.8
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Xie, F.; Wang, J.; Gordin, S.A.; Ermakov, A.N.; Varnavskiy, K.A. Application of Wavelet Convolution and Scale-Based Dynamic Loss for Multi-Scale Damage Detection of Mining Conveyor Belt. Mining 2026, 6, 8. https://doi.org/10.3390/mining6010008

