Article

Mulch-YOLO: Improved YOLOv11 for Real-Time Detection of Mulch in Seed Cotton

School of Electrical and Electronic Engineering, Wuhan Polytechnic University, Wuhan 430048, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(21), 11604; https://doi.org/10.3390/app152111604
Submission received: 24 September 2025 / Revised: 28 October 2025 / Accepted: 28 October 2025 / Published: 30 October 2025
(This article belongs to the Section Agricultural Science and Technology)

Abstract

Machine harvesting of cotton in Xinjiang has significantly improved harvesting efficiency; however, it has also resulted in a considerable increase in residual mulch content within the cotton, which severely affects the quality and market value of cotton textiles. Existing mulch detection algorithms based on machine vision generally suffer from complex parameterization and insufficient real-time performance. To overcome these limitations, this study proposes a novel mulch detection algorithm, Mulch-YOLO, developed on the YOLOv11 framework. Specifically, an improved CBAM (Convolutional Block Attention Module) is incorporated into the BiFPN (Bidirectional Feature Pyramid Network) to achieve more effective fusion of multi-scale mulch features. To enhance the semantic representation of mulch features, a modified Content-Aware ReAssembly of Features module, CARAFE-Mulch, is designed to reorganize feature maps, resulting in stronger feature expressiveness compared with the original representations. Furthermore, the MobileOne module is optimized by integrating the DECA (Dilated Efficient Channel Attention) module, thereby reducing both the parameter count and computational load while improving real-time detection efficiency. To verify the effectiveness of the proposed approach, experiments were conducted on a real-world dataset containing 20,134 images of low-visual-saliency plastic mulch. The results indicate that Mulch-YOLO achieves a lightweight architecture and high detection accuracy. Compared with YOLOv11n, the proposed method improves mAP@0.5 by 4.7% and mAP@0.5:0.95 by 3.3%, with a 24% reduction in model parameters.

1. Introduction

Xinjiang cotton, known for its soft texture, comfortable feel, and excellent water absorption, is an indispensable raw material for the production of medium- and high-grade yarns [1]. During the machine harvesting process, residual plastic mulch often becomes mixed with the raw cotton. Such contaminants severely affect the quality of yarn products, leading to a decline in overall performance and, consequently, reducing market competitiveness [2,3]. Foreign fibers, including synthetic fibers, plastics, and animal hairs, are non-cotton impurities that can significantly compromise textile quality. Among them, plastic mulch from farmland is considered a highly hazardous type of foreign fiber. It is difficult to remove, exhibits poor dyeing properties, and tends to cause stains or holes, ultimately reducing fabric quality. Therefore, an effective method is urgently needed to remove plastic mulch and other impurities from raw materials, in order to preserve the high-quality characteristics of Xinjiang cotton.
In conventional cotton processing, the detection of plastic mulch still relies heavily on manual sorting in most textile enterprises. This approach involves high labor intensity and becomes less reliable as sorting time increases: after prolonged sorting, the similarity in color between plastic mulch and cotton makes the two difficult to distinguish, as shown in Figure 1. Operators may also experience “snow blindness,” which increases the likelihood of sorting errors and presents potential safety hazards. Hence, a more efficient and scientific solution is needed.
Machine vision-based foreign fiber detection technology has been widely applied in cotton spinning mills. Companies such as Uster in Switzerland, Trützschler in Germany, and domestic manufacturers like Beijing Jingwei, Beijing Daheng, and Wuhan Jiumu Technology have made significant advancements in this field. These machines typically capture cotton flow images using cameras and employ image processing algorithms to detect and separate various impurities mixed with the cotton fibers [4,5]. The successful application of this technology has provided new perspectives for cotton quality control.
In recent years, deep learning-based object detection technologies have experienced rapid development [6,7,8]. Qingxu Li [9] employed an improved YOLOv7 network for foreign fiber detection, achieving an accuracy of 96.14%. Xiangpeng Fan [10] used an enhanced YOLOv5s model for cotton field detection, improving accuracy while reducing the model size. Jinghuan Hu [11] applied a lightweight YOLOv8 version for detecting cotton weeds. Chao Ni [12] utilized short-wave near-infrared hyperspectral imaging and deep learning for seed cotton plastic mulch sorting, achieving a recognition rate of 95.5%. Xue Zhou [13] proposed a semi-supervised foreign fiber detection algorithm, Efficient YOLOv5-cotton, which improved mAP@0.5 by 1.6% compared to the original Efficient YOLOv5 while reducing model parameters by 10% compared to Efficient YOLOv5 with SPPFCSPC. Zhang Hang [14] introduced a hyperspectral image segmentation method for plastic mulch with a recognition rate of 91.07%. Chaoran Ma [15] proposed an improved YOLOv8-based model that substantially enhances corn and weed detection, achieving an mAP@50 of 0.751. Shuai et al. [16] proposed YOLO-SW, an improved version of YOLOv8 that achieves 92.3% mAP@50. Zhang et al. [17] proposed FreqDyn-YOLO, a YOLOv11-based model that improves mAP@50, precision, and recall by 5.37%, 1.97%, and 2.96%, respectively, over the baseline method. Fang et al. [18] proposed the RSE-YOLO-Seg model; experiments on a self-constructed dataset show that it improves object detection average precision (mAP50(B)) by 3% and mask segmentation average precision (mAP50(M)) by 2.7% compared with the baseline. Zhou et al. [19] proposed the SMA-YOLO algorithm; experimental results on the VisDrone2019 dataset show a 13% improvement in mAP over the baseline model, demonstrating strong adaptability in small object detection for remote sensing imagery. Giri [20] introduced SO-YOLOv8, an enhanced YOLO model focused on small object detection, which achieves a precision of 1.0 (a 6% increase) and an mAP of 0.79 (a 1% improvement over YOLOv8).
Discussions on the application of deep learning to plastic mulch detection in cotton remain relatively limited, primarily due to the lack of publicly available standardized mulch datasets for research. Based on existing studies, we constructed a cotton mulch dataset under real-world production conditions, referred to as the Low-Visual-Saliency Mulch Dataset, which contains 20,134 images collected from actual cotton processing lines in Xinjiang. The dataset is categorized into two types based on visual distinguishability: high-saliency mulch and low-visual-saliency mulch. High-saliency mulch refers to fragments that are easily detectable by the human eye due to strong contrast with the cotton background, such as large or dark-colored pieces, while low-visual-saliency mulch consists of fragments embedded within the cotton matrix, exhibiting color and texture highly similar to cotton fibers, making them difficult to identify even for experienced inspectors. Examples of both categories are shown in Figure 2.
Building on the successful application of YOLO series algorithms in cotton and agricultural fields, this study employs high-quality cotton plastic mulch images captured through factory image acquisition devices. This study proposes Mulch-YOLO, an improved YOLOv11-based architecture designed for real-time detection. To enhance multi-scale feature fusion for low-saliency targets, an improved BiFPN module is adopted. Unlike the conventional approach of sequentially applying BiFPN followed by a standalone attention module (e.g., CBAM), our method integrates a modified CBAM directly within the BiFPN’s cross-scale connections. This ‘attention-guided’ design allows the attention mechanism to dynamically recalibrate features at each fusion node, prioritizing information from scales most relevant to the small, fragmented mulch. This is in contrast to post-fusion attention, which operates on already fused features and may dilute scale-specific information. Furthermore, the modified CBAM employs a dilated convolution in its spatial attention branch to increase the receptive field without adding significant parameters, making it more effective for capturing the dispersed patterns of mulch fragments compared to standard CBAM, SE, or ECA modules. The CARAFE-Mulch module is employed in the decoder to mitigate semantic information loss during upsampling. Unlike fixed-kernel interpolation methods (e.g., bilinear) or transposed convolutions with learned but static kernels, CARAFE-Mulch dynamically generates content-aware assembly kernels based on the input feature map. This allows for adaptive reconstruction of object boundaries and internal structures, which is crucial for recovering fragmented mulch occluded by cotton. For the backbone network, we adopt MobileOne-DECA to achieve an optimal balance between model efficiency and detection accuracy. While MobileOne offers excellent inference speed through structural re-parameterization, its built-in attention mechanism is relatively basic. To enhance feature representation without compromising speed, we integrate the DECA (Dilated Efficient Channel Attention) module into the MobileOne architecture, forming MobileOne-DECA. The DECA module employs dilated convolutions within its channel weighting path, allowing it to capture long-range dependencies across channels—a crucial capability for distinguishing subtle texture differences between plastic and cotton fibers—compared to the local receptive fields of standard MobileOne or lightweight alternatives like MobileNetV3 and ShuffleNetV2. Crucially, this enhancement is applied only during training; at inference time, the DECA module is folded into the main convolutional branches via re-parameterization, adding zero extra latency. This results in a model that is significantly more accurate than MobileNetV3 and ShuffleNetV2 for our task, yet remains as fast as the original MobileOne.

2. Materials and Methods

2.1. Materials

The dataset used in the experiments consists of cotton images captured in real-world industrial production environments, specifically from machine-harvested cotton in Xinjiang, China. To withstand the harsh conditions of high temperature, humidity, and dust in the textile industry, this study selected the Hikvision MV-CS016-10UC industrial camera, as shown in Figure 3. This camera is equipped with a global shutter CMOS sensor, which helps to eliminate motion blur caused by high-speed cotton flow, making it ideal for capturing clear foreign fiber edge information. Additionally, the camera supports uncompressed image transmission via a USB interface, enabling end-to-end low-latency performance. It also operates stably in extreme temperature conditions and can withstand common cotton fiber pollution in textile workshops. The specific parameters of the camera are listed in Table 1.
The camera is installed within the pipeline of a specialized foreign fiber separation machine, as shown in Figure 4. This equipment is used to capture images of foreign fibers in cotton during industrial production. The cotton and foreign fibers used in this study primarily come from machine-harvested cotton from Xinjiang, hand-picked cotton from inland China, and state reserve cotton. Before processing, cotton bales mixed with foreign fibers are fed into a cotton opening machine. The mechanical components inside the opening machine loosen the cotton bales. Only after treatment in the opening machine can the cotton be fully loosened, facilitating the subsequent foreign fiber removal process.
The automatic plastic detection system is integrated into the cotton cleaning process at a critical stage following initial loosening. As the loosened seed cotton, intermixed with potential foreign fibers such as plastic mulch, is conveyed through a pipeline by controlled airflow, it passes through an inspection zone equipped with high-speed imaging devices. Here, synchronized cameras capture clear, real-time images of the airborne cotton stream under consistent illumination, ensuring optimal input for visual analysis. The acquired image data are immediately processed by the lightweight Mulch-YOLO model, enabling the rapid and accurate identification of foreign fiber contaminants. Upon detection, a control signal is transmitted to a precisely timed spray valve. This valve releases a burst of compressed air, generating a focused high-speed jet that selectively ejects the contaminated material from the continuous flow without disrupting the surrounding clean cotton. The ejected foreign fibers are then collected by a dedicated suction system driven by a high-power cleaning fan into a designated recovery bag. Subsequently, these collected materials are transported to the spinning preparation workshop, where operators perform detailed classification, inspection, and quality impact assessment. Insights gained from this analysis inform adjustments to upstream processing parameters, closing the loop on quality control and ensuring the final cotton product consistently meets stringent purity standards.
In this study, images were captured under real-world production conditions using a bale opening machine, resulting in a dataset of 20,134 high-resolution images. The dataset was preprocessed with random cropping (384 × 384 patches) and resized to 640 × 640 pixels for model input. Annotation was performed using LabelMe software (version 3.16.2), as shown in Figure 5, where bounding boxes were manually drawn around regions containing plastic mulch residues. The corresponding annotation files were subsequently stored in a directory named “labels”.
The annotated data was first converted into the YOLO format. The dataset was then divided into training, test, and validation sets in an 8:1:1 ratio. Table 2 lists the number of samples for each category in the dataset.
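To make the data preparation step concrete, the following is a minimal Python sketch of converting LabelMe rectangle annotations to YOLO-format labels and performing the 8:1:1 split; the directory names, image size, and single mulch class are placeholder assumptions rather than the authors' exact pipeline.

```python
import json
import random
from pathlib import Path

def labelme_to_yolo(json_path, img_w=640, img_h=640, class_id=0):
    """Convert one LabelMe rectangle annotation file to YOLO lines: class cx cy w h (normalized)."""
    shapes = json.loads(Path(json_path).read_text())["shapes"]
    lines = []
    for s in shapes:
        (x1, y1), (x2, y2) = s["points"][:2]
        cx, cy = (x1 + x2) / 2 / img_w, (y1 + y2) / 2 / img_h
        w, h = abs(x2 - x1) / img_w, abs(y2 - y1) / img_h
        lines.append(f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}")
    return lines

# 8:1:1 train/test/val split over image stems ("images" is a placeholder directory name).
stems = sorted(p.stem for p in Path("images").glob("*.jpg"))
random.seed(0)
random.shuffle(stems)
n = len(stems)
splits = {"train": stems[:int(0.8 * n)],
          "test":  stems[int(0.8 * n):int(0.9 * n)],
          "val":   stems[int(0.9 * n):]}
```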

2.2. Methods

2.2.1. YOLOv11

The YOLO11 algorithm integrates advanced feature extraction techniques, enabling it to capture finer details while maintaining a streamlined number of parameters. In this study, the objective is to detect plastic mulch film within seed cotton, a task that demands high detection accuracy. Therefore, YOLO11 was selected as the primary model for this detection task. As illustrated in Figure 6, the YOLO11 architecture comprises three main components: the backbone, the neck, and the head. The backbone employs an enhanced structure incorporating the C3K2 module, which utilizes multi-scale convolutional kernels to expand the receptive field, thereby improving the capture of contextual information in images. The neck integrates the C2PSA module, which combines channel-wise and spatial attention mechanisms to strengthen multi-scale feature representation, enhancing both detection accuracy and robustness. In the head component, a decoupled two-branch structure is adopted, separating the tasks of object classification and bounding box regression. This design facilitates task-specific feature learning, effectively reducing prediction conflicts between different tasks.

2.2.2. Improved YOLOv11

This study designs an efficient and lightweight network for high-precision and rapid detection of cotton plastic mulch in real-world production scenarios. First, the neck network’s FPN + PAN is replaced with BiFPN-CBAM [21,22], which optimizes cross-scale connections and weighted feature fusion, enabling better multi-scale feature integration and enhancing feature representation capabilities. In the neck network, the lightweight content-aware feature reorganization structure CARAFE [23] is employed, allowing the model to better utilize feature information during upsampling, thus enhancing the feature fusion ability of the plastic mulch. To reduce model complexity and computational overhead, facilitating deployment on edge devices, the C3k2 module in the backbone is replaced with MobileOne-DECA [24,25], which reduces the weight of the YOLOv11 backbone, decreases parameters, and improves computational efficiency. The following sections describe the working principles and related technologies of each module. The improved model network structure is shown in Figure 7.

2.2.3. BiFPN-CBAM

The Bidirectional Feature Pyramid Network (BiFPN) is integrated to mitigate low-resolution representation and feature attenuation in plastic mulch detection. As shown in Figure 8a, the standard Feature Pyramid Network (FPN) constructs multi-scale features through a top-down pathway, maintaining uniform channel dimensions while varying spatial resolution [26]. This enables effective recognition of objects across scales. The Path Aggregation Network (PANet), illustrated in Figure 8b, extends FPN with a bottom-up stream that propagates fine-grained details from lower to higher levels via iterative downsampling [27].
Conventional FPN+PAN fuses features by element-wise addition [28], treating all inputs equally regardless of their semantic significance. This unweighted fusion limits sensitivity to small objects. Increased depth also introduces additional parameters and computational cost, affecting model efficiency.
BiFPN addresses these limitations with a bidirectional fusion strategy that supports simultaneous up-sampling and down-sampling. Redundant nodes receiving only one input are removed to reduce duplication. Repeated bidirectional blocks enhance cross-scale integration and information flow across layers.
A learnable weighted fusion mechanism dynamically adjusts the contribution of each input feature:
O = \frac{\sum_{i} \omega_i I_i}{\varepsilon + \sum_{j} \omega_j} \quad (1)
This weighted fusion enables adaptive multi-scale feature integration and addresses the limitation of naive element-wise addition in conventional FPN+PAN structures. In Equation (1), I_i denotes the i-th input feature map, ω_i and ω_j are learnable scalar weights optimized during training to emphasize more informative features, ε is a small constant for numerical stability, and O is the normalized output feature. The operation is applied in both the top-down and bottom-up pathways of BiFPN, allowing the model to dynamically adjust the contribution of each scale during feature fusion.
This adaptive weighting enables selective enhancement of informative features during detection, improving robustness under complex backgrounds and variable lighting conditions.
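As an illustration, a minimal PyTorch sketch of the fast normalized fusion in Equation (1) is given below; the module name and the two-input example are ours, not part of the released Mulch-YOLO code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    """Fast normalized fusion at one BiFPN node: O = sum_i (w_i * I_i) / (eps + sum_j w_j)."""
    def __init__(self, num_inputs: int, eps: float = 1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))  # learnable scalar weight per input
        self.eps = eps

    def forward(self, inputs):
        w = F.relu(self.weights)            # keep weights non-negative
        w = w / (self.eps + w.sum())        # normalize contributions
        return sum(wi * x for wi, x in zip(w, inputs))

# Example: fuse two same-shaped feature maps at a single fusion node.
fuse = WeightedFusion(num_inputs=2)
out = fuse([torch.randn(1, 64, 40, 40), torch.randn(1, 64, 40, 40)])
```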
In this study, P1 to P7 denote the multi-scale feature levels within the model, forming a seven-level feature pyramid structure. Specifically, P1 has the highest spatial resolution, enabling the detection of extremely small plastic mulch fragments. Levels P3–P5 are derived from the standard outputs of the backbone network, while P6 and P7 are generated through additional downsampling layers to enlarge the receptive field for capturing large-scale contextual information. All levels are interconnected via the BiFPN module, which enables efficient bidirectional fusion and cross-scale feature enhancement.
The CBAM attention module is incorporated at the output stage of the enhanced BiFPN to emphasize plastic mulch features and suppress background clutter. By adaptively recalibrating channel and spatial responses, CBAM enhances discriminative feature learning for small defects in complex scenes, reduces noise propagation in high-level semantics, and improves contextual modeling. The modified BiFPN structure with CBAM is illustrated in Figure 9.
Traditional feature extraction approaches face challenges in recognizing plastic mulch due to multi-scale variations and spatial heterogeneity. The Convolutional Block Attention Module (CBAM) improves feature discriminability in complex scenes through dual-axis recalibration. It sequentially computes channel and spatial attention maps, which are applied to intermediate features via element-wise multiplication. This adaptive weighting mechanism enhances informative regions while suppressing irrelevant responses, refining feature representation without additional supervision. The architecture of CBAM is illustrated in Figure 10.
The Convolutional Block Attention Module (CBAM) enhances feature representation through a two-stage attention mechanism that operates sequentially on channel and spatial dimensions. The process starts with the Channel Attention Module (CAM), which aims to refine the importance of different feature channels by adaptively recalibrating their responses.
Given an input feature map F, CAM applies global average pooling and global max pooling across spatial dimensions to capture comprehensive contextual information. This generates two feature descriptors, F_avg^c and F_max^c, representing the average and maximum activation patterns for each channel, respectively. These pooled features are then passed through a shared two-layer multilayer perceptron (MLP), which consists of fully connected layers with nonlinear activation. The outputs from both pathways are summed element-wise to integrate complementary cues from the two pooling strategies. The resulting values are transformed via a sigmoid function to produce the channel attention weights:
M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big) = \sigma\big(W_1(W_0(F_{avg}^{c})) + W_1(W_0(F_{max}^{c}))\big) \quad (2)
where W_0 and W_1 denote the weight matrices of the first and second layers in the MLP, respectively, and σ is the sigmoid activation function. The original feature map F is then modulated by these weights through element-wise multiplication, yielding the channel-refined output:
F' = M_c(F) \otimes F \quad (3)
Here, ⊗ represents the Hadamard (element-wise) product, and F' becomes the input for the subsequent spatial attention stage.
Next, the Spatial Attention Module (SAM) is applied to emphasize informative spatial locations within the feature map. To achieve this, global average pooling and global max pooling are performed along the channel dimension of F', producing two 2D spatial maps, F_avg^s and F_max^s, which encode activation patterns at each spatial position. These two maps are concatenated along the channel axis to form a fused descriptor [F_avg^s; F_max^s], preserving both global statistical information and salient structural details.
This concatenated feature is then processed by a 7 × 7 convolutional layer, enabling contextual modeling over an extended spatial neighborhood. The output is activated using the sigmoid function to generate the spatial attention weight map:
M_s(F') = \sigma\big(f^{7 \times 7}([\mathrm{AvgPool}(F'); \mathrm{MaxPool}(F')])\big) = \sigma\big(f^{7 \times 7}([F_{avg}^{s}; F_{max}^{s}])\big) \quad (4)
Finally, the refined feature map F'' is obtained by applying the spatial attention weights to F' through element-wise multiplication:
F'' = M_s(F') \otimes F' \quad (5)
In this way, M_s(F') acts as a spatial gating mechanism that enhances relevant regions while suppressing background noise or irrelevant activations.
The integration of CBAM into the enhanced BiFPN architecture is designed to improve the detection accuracy of small and partially occluded plastic mulch defects in complex environments. After multi-scale feature fusion in BiFPN, subtle defect characteristics may still be weakened due to background clutter or information attenuation. By applying CBAM at the output stage, the network gains the ability to selectively emphasize critical features in both channel and spatial dimensions. This dual-axis recalibration strengthens the representation of weak defect signals, thereby enhancing model robustness under challenging conditions such as low contrast, variable illumination, and dense background interference.
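For reference, a compact PyTorch sketch of the standard CBAM block described by Equations (2)–(5) is shown below; the reduction ratio of 16 and the 7 × 7 spatial kernel follow the original CBAM paper, while the dilated-convolution variant used in BiFPN-CBAM is omitted for brevity.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention M_c: shared MLP over global average- and max-pooled descriptors."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))   # MLP(AvgPool(F))
        mx = self.mlp(x.amax(dim=(2, 3)))    # MLP(MaxPool(F))
        return torch.sigmoid(avg + mx).view(b, c, 1, 1)

class SpatialAttention(nn.Module):
    """Spatial attention M_s: 7x7 convolution over concatenated channel-wise avg/max maps."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        pooled = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return torch.sigmoid(self.conv(pooled))

class CBAM(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.ca, self.sa = ChannelAttention(channels), SpatialAttention()

    def forward(self, x):
        x = x * self.ca(x)       # F' = M_c(F) ⊗ F
        return x * self.sa(x)    # F'' = M_s(F') ⊗ F'
```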

2.2.4. CARAFE-Mulch

In YOLOv11, upsampling within the feature pyramid is performed using nearest-neighbor interpolation. While computationally efficient, this method relies only on pixel coordinates to determine the sampling kernel, limiting its ability to leverage semantic content during feature reconstruction. As a result, fine-grained details may be lost, especially for small or texture-weak objects like plastic mulch.
To overcome this limitation, we incorporate the Content-Aware ReAssembly of Features (CARAFE) module into the neck network. Unlike fixed-kernel interpolation, CARAFE dynamically generates position-sensitive upsampling kernels conditioned on the input feature map, enabling content-aware feature expansion. The module aggregates contextual information through a large receptive field, enhancing the richness of upsampled features by adaptively fusing surrounding patches.
CARAFE operates in two steps: first, a lightweight head predicts spatially varying kernels from the input features; second, these kernels are applied to reassemble the feature map at higher resolution. The kernel prediction is instance-aware, allowing adaptive upsampling tailored to local semantics.
Notably, CARAFE introduces negligible computational overhead and additional parameters, making it highly efficient and easily integrable into existing architectures. In this work, we replace the nearest-neighbor interpolation layers in YOLOv11’s neck with CARAFE modules. This substitution enhances plastic mulch feature fusion by preserving contextual coherence and structural details, without increasing model complexity or inference cost.
CARAFE consists of two key components, as shown in Figure 11: the kernel prediction module and the content-aware reassembly module. Given an input feature map X of size C × H × W and an upsampling factor σ, CARAFE generates a new feature map X′ of size C × σH × σW.
Kernel Prediction Module: This module generates the reassembly kernels in a content-aware manner. Each source position in X corresponds to σ² target positions in X′, and each target position requires a k_up × k_up reassembly kernel. The module therefore outputs reassembly kernels of size C_up × H × W, where C_up = σ²k_up². It consists of three submodules: the Channel Compressor, Content Encoder, and Kernel Normalizer.
Channel Compressor: A 1 × 1 convolutional layer compresses the input feature channels from C to C_m, reducing the number of channels in the feature map. This lowers the parameter count and computational cost of subsequent steps, making CARAFE more efficient.
Content Encoder: The content encoder generates reassembly kernels from the content of the input features using a convolutional layer with a kernel size of k_encoder, giving the encoder k_encoder × k_encoder × C_m × C_up parameters. Increasing k_encoder enlarges the encoder's receptive field, allowing it to leverage contextual information from a larger area, but also increases computational complexity. Through experimentation, k_encoder = k_up − 2 was found to offer a good balance between performance and efficiency.
Kernel Normalizer: Before each k_up × k_up reassembly kernel is applied to the input feature map, its values are spatially normalized with a Softmax function so that they sum to 1, acting as a soft selection over the local region.
Content-Aware Reassembly Module: The reassembly process is formulated around a target position l′ and the square neighborhood N(X_l, k_up) of its corresponding source position l = (i, j). For each upsampled location, the output feature is computed as:
X'_{l'} = \sum_{n=-r}^{r} \sum_{m=-r}^{r} W_{l'}(n, m) \cdot X(i+n, j+m) \quad (6)
where r = ⌊k_up/2⌋. Unlike conventional interpolation methods that assign weights based on fixed spatial distances, CARAFE employs content-dependent kernels. The weight W_{l'}(n, m) is dynamically predicted from the input feature map and varies across positions and instances.
This mechanism allows the contribution of each neighboring pixel to be adaptively determined by local semantic content rather than geometric proximity. As a result, the reassembled feature map captures richer contextual information and exhibits enhanced semantic coherence compared to the original features.
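The following is a self-contained PyTorch sketch of the CARAFE upsampler with the default settings reported in the CARAFE paper (C_m = 64, k_up = 5, k_encoder = 3); it illustrates the kernel prediction and reassembly steps only and is not the authors' CARAFE-Mulch implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CARAFE(nn.Module):
    """Content-aware upsampling: predict per-position k_up x k_up kernels, normalize them with
    Softmax, and reassemble each output pixel from the corresponding source neighborhood."""
    def __init__(self, channels: int, scale: int = 2, c_mid: int = 64, k_up: int = 5, k_enc: int = 3):
        super().__init__()
        self.scale, self.k_up = scale, k_up
        self.compressor = nn.Conv2d(channels, c_mid, 1)                      # channel compressor
        self.encoder = nn.Conv2d(c_mid, scale ** 2 * k_up ** 2, k_enc,
                                 padding=k_enc // 2)                         # content encoder
        self.shuffle = nn.PixelShuffle(scale)                                # spread kernels to target grid

    def forward(self, x):
        b, c, h, w = x.shape
        kernels = self.shuffle(self.encoder(self.compressor(x)))             # (b, k_up^2, sH, sW)
        kernels = F.softmax(kernels, dim=1)                                  # kernel normalizer
        patches = F.unfold(x, self.k_up, padding=self.k_up // 2)             # k_up x k_up neighborhoods
        patches = patches.view(b, c * self.k_up ** 2, h, w)
        patches = F.interpolate(patches, scale_factor=self.scale, mode="nearest")
        patches = patches.view(b, c, self.k_up ** 2, h * self.scale, w * self.scale)
        return (patches * kernels.unsqueeze(1)).sum(dim=2)                   # weighted reassembly

up = CARAFE(channels=128)
y = up(torch.randn(1, 128, 40, 40))   # -> (1, 128, 80, 80)
```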

2.2.5. MobileOne-DECA

To achieve efficient model deployment on resource-constrained mobile devices while maintaining high representational capacity, we integrate the lightweight MobileOne module into our architecture, as illustrated in Figure 12. MobileOne is designed to decouple training-time expressiveness from inference-time efficiency through a reparameterization strategy.
During training, the module comprises two primary components: a depthwise convolution block (top) and a pointwise convolution block (bottom). The depthwise block contains three parallel branches: a 1 × 1 convolution, a set of k over-parameterized 3 × 3 convolutions, and an identity shortcut with batch normalization. The number of groups in the depthwise convolution matches the number of input channels. The pointwise block similarly includes k over-parameterized 1 × 1 convolutions and a batch-normalized skip connection. In this work, k = 3.
At inference time, as shown on the right side of Figure 12, the module is reparameterized into an efficient equivalent structure. For the depthwise component, each branch is transformed into a 3 × 3 convolution kernel and fused into a single composite kernel. In the first branch, the 1 × 1 convolution is converted to a 3 × 3 kernel via zero-padding. This kernel is then merged with its subsequent batch normalization layer using the following transformations:
\omega' = \frac{\gamma \cdot \omega}{\sqrt{\sigma^2 + \varepsilon}} \quad (7)
b' = \beta + \frac{\gamma \cdot (b - \mu)}{\sqrt{\sigma^2 + \varepsilon}} \quad (8)
where ω and b denote the convolutional weights and biases, ω′ and b′ are the fused weights and biases, and γ, β, μ, and σ² are the scale, shift, running mean, and running variance from batch normalization, respectively; ε is a small constant for numerical stability.
The second branch applies the same fusion process to each of the k 3 × 3 convolutions. Their resulting kernels are summed to form a single aggregated 3 × 3 kernel. The third branch, which contains no convolutional layer, synthesizes a 3 × 3 kernel before batch normalization to ensure dimensional compatibility. This kernel is then fused with batch normalization using Equations (7) and (8).
Finally, all three transformed 3 × 3 kernels are summed to produce the reparameterized depthwise convolution. A similar procedure is applied to the pointwise block, where k 1 × 1 convolutions and the skip connection are merged into a single 1 × 1 convolution. The resulting compact structure enables high-speed inference with minimal computational overhead, making it well-suited for mobile deployment.
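A minimal PyTorch sketch of the Conv–BN folding in Equations (7) and (8) is given below; it covers a single convolution–BatchNorm pair, while branch padding and summation are handled as described above.

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold a BatchNorm layer into the preceding convolution:
    w' = gamma * w / sqrt(var + eps),  b' = beta + gamma * (b - mean) / sqrt(var + eps)."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding,
                      dilation=conv.dilation, groups=conv.groups, bias=True)
    std = torch.sqrt(bn.running_var + bn.eps)
    scale = bn.weight / std                                          # gamma / sqrt(var + eps)
    fused.weight.data = conv.weight * scale.reshape(-1, 1, 1, 1)
    bias = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
    fused.bias.data = bn.bias + scale * (bias - bn.running_mean)
    return fused

# A 1x1 kernel can be aligned with the 3x3 branches by zero-padding before the kernels are summed:
# w3x3 = torch.nn.functional.pad(w1x1, [1, 1, 1, 1])
```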
MobileOne uses depthwise separable convolutions, splitting channel processing into depthwise convolutions and pointwise convolutions. Depthwise convolutions cannot directly model global dependencies across channels, and while pointwise convolutions can fuse channels, their limited parameterization leads to reduced feature representation capability and a lack of explicit multi-scale feature extraction mechanisms. To address this, a DECA (Dilated Efficient Channel Attention) module is inserted between the depthwise and pointwise convolutions. This module extracts local features, preserves multi-scale information, dynamically enhances key channels, and suppresses redundant features. At the same time, through parameter sharing and an efficient structure, the model maintains its lightweight nature.
The DECA module process is shown in Figure 13. First, given an input image, it is passed through multiple convolutional and pooling operations to extract feature maps.
The multi-scale convolution is defined as:
F_{dilated}^{i} = \mathrm{Conv}_{dilated}(F_{in}, d_i), \quad i = 1, 2, \ldots, k \quad (9)
Here, Conv_dilated refers to the dilated convolution operation with dilation rate d_i.
Then, global average pooling (GAP) and global max pooling (GMP) operations are applied to compress the height and width dimensions of the feature map, resulting in channel-level global signals. Next, the encoding module processes the Max and Avg signals to generate channel correlation features at different scales. After concatenating along the second dimension, these features are projected to the same dimension as the Max and Avg signals using parallel fusion methods. Finally, the two branches are merged using element-wise multiplication and passed through a Sigmoid activation function to generate the final signal. This final signal is then multiplied by the original feature map to complete the enhancement.
Here:
\mathrm{GAP}(F_{dilated}) = \frac{1}{HW} \sum_{h,w} F_{dilated}(h, w) \in \mathbb{R}^{kC} \quad (10)
\mathrm{GMP}(F_{dilated}) = \max_{h,w} F_{dilated}(h, w) \in \mathbb{R}^{kC} \quad (11)
After concatenating the results of GAP and GMP, a fully connected (FC) layer is used to compress them into channel attention weights:
A = \sigma\big(W \cdot [\mathrm{GAP}(F_{dilated}),\ \mathrm{GMP}(F_{dilated})]\big) \quad (12)
where W ∈ R^{C×2kC} is the learnable weight matrix and σ is the sigmoid activation function, which generates the channel attention weights A ∈ R^{C}.
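To make the DECA computation concrete, a simplified PyTorch sketch following Equations (9)–(12) is shown below; the dilation rates (1, 2, 3) and the single fully connected projection are assumptions, and the full parallel-fusion encoding of the original DECA module is not reproduced.

```python
import torch
import torch.nn as nn

class DECA(nn.Module):
    """Simplified dilated efficient channel attention: k dilated depthwise branches,
    GAP/GMP channel descriptors, and one FC layer W in R^{C x 2kC} producing sigmoid weights."""
    def __init__(self, channels: int, dilations=(1, 2, 3)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d, groups=channels)
            for d in dilations)
        self.fc = nn.Linear(2 * len(dilations) * channels, channels)

    def forward(self, x):
        feats = [branch(x) for branch in self.branches]                 # F_dilated^i, i = 1..k
        gap = torch.cat([f.mean(dim=(2, 3)) for f in feats], dim=1)     # R^{kC}
        gmp = torch.cat([f.amax(dim=(2, 3)) for f in feats], dim=1)     # R^{kC}
        a = torch.sigmoid(self.fc(torch.cat([gap, gmp], dim=1)))        # A in R^C
        return x * a.unsqueeze(-1).unsqueeze(-1)                        # channel-wise enhancement
```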

3. Results

3.1. Environment and Configuration

Experiments were carried out on a system running Ubuntu 20.04, equipped with an Intel® Core™ i5-12490F processor (3.00 GHz) and an NVIDIA RTX 4070 Ti GPU (12 GB VRAM). The framework was implemented in Python 3.10.14 using PyTorch 2.3.1 with CUDA 12.6, enabling GPU-accelerated training. The model was trained using SGD with momentum, automatically selected via the optimizer = auto setting, with a momentum of 0.937 and a weight decay of 0.0005. The initial learning rate was set to 0.01 and decayed linearly to 1 × 10⁻⁴ over 150 epochs (controlled by lrf = 0.01 and cos_lr = False). A 3-epoch linear learning rate warmup was applied, starting from a base learning rate of 0.1 and warmup momentum of 0.8. Input images were resized to a fixed resolution of 640 × 640 pixels. Data augmentation included random horizontal flipping (fliplr = 0.5), HSV color jittering (hue: 0.015, saturation: 0.7, value: 0.4), and Mosaic augmentation (mosaic = 1.0), along with random scaling (scale = 0.5). All other augmentation methods such as MixUp and Copy-Paste were disabled. The complete set of training hyperparameters is detailed in Table 3.
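For reproducibility, the configuration above can be expressed with the Ultralytics training API roughly as follows; the model YAML and dataset YAML names are placeholders for the authors' Mulch-YOLO definition and dataset, and warmup_bias_lr = 0.1 corresponds to the 0.1 warmup base learning rate mentioned above.

```python
from ultralytics import YOLO

model = YOLO("yolo11n.yaml")        # placeholder; the modified Mulch-YOLO YAML would be used here
model.train(
    data="mulch.yaml",              # dataset configuration (placeholder path)
    epochs=150, imgsz=640, batch=16,
    optimizer="auto",               # resolves to SGD with momentum in this setup
    lr0=0.01, lrf=0.01, cos_lr=False,
    momentum=0.937, weight_decay=0.0005,
    warmup_epochs=3, warmup_momentum=0.8, warmup_bias_lr=0.1,
    hsv_h=0.015, hsv_s=0.7, hsv_v=0.4,
    fliplr=0.5, mosaic=1.0, scale=0.5,
    mixup=0.0, copy_paste=0.0,      # MixUp and Copy-Paste disabled
)
```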

3.2. Performance Evaluation

To comprehensively assess the detection performance and model efficiency, six key metrics are adopted: precision, recall, and mean average precision at IoU threshold 0.5 (mAP50) for evaluating detection accuracy, along with the number of model parameters, FLOPs, and model size to measure computational complexity and deployment feasibility [29]. Precision represents the ratio of correctly predicted positive plastic mulch samples to the total number of predicted samples. The precision is calculated as follows:
\mathrm{Precision} = \frac{TP}{TP + FP} \quad (13)
Recall, also known as the true positive rate, reflects the model’s ability to detect all actual mulch instances, calculated as the ratio of TP to the total number of actual positive samples (TP + FN), where FN (false negatives) refers to mulch samples that were missed by the detector:
\mathrm{Recall} = \frac{TP}{TP + FN} \quad (14)
Average precision (AP) summarizes the precision-recall curve as the area under it. In practice, due to the discrete nature of detection outputs, this integral is computed numerically over a finite set of operating points derived from model predictions. Let {(r_i, p_i)}_{i=1}^{k} denote the sequence of recall and precision values obtained by varying the confidence threshold. The discrete form of AP is given by:
AP = \sum_{i=1}^{k-1} (r_{i+1} - r_i) \cdot \max_{j \geq i} p_j \quad (15)
where max_{j≥i} p_j represents the interpolated precision at recall level r_i, ensuring monotonicity.
The mean average precision (mAP) is the mean of the AP values across all object categories, serving as a comprehensive metric for evaluating overall detection performance:
mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i \quad (16)
where AP_i denotes the average precision for the i-th class and N is the total number of classes in the dataset. Two widely used variants are mAP@0.5, which evaluates mAP at an IoU threshold of 0.5, and mAP@0.5:0.95, which reports the average mAP over ten uniformly spaced IoU thresholds from 0.5 to 0.95 (inclusive).
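A short NumPy sketch of the discrete AP computation in Equation (15) is given below; it assumes the recall values are sorted in ascending order, as obtained by sweeping the confidence threshold.

```python
import numpy as np

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """Discrete AP: sum over recall steps of (r_{i+1} - r_i) times the interpolated
    precision max_{j >= i} p_j, which enforces a monotonically decreasing envelope."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]     # interpolated precision
    idx = np.where(r[1:] != r[:-1])[0]           # recall change points
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# Example: mAP is then the mean of per-class AP values.
ap_mulch = average_precision(np.array([0.2, 0.5, 0.8]), np.array([1.0, 0.9, 0.7]))
```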

3.3. Ablation Experiments

To validate the enhancement effect of each improved module on plastic mulch detection in cotton images, ablation experiments were conducted on the low-visual-saliency plastic mulch dataset using the same dataset and experimental setup. The results of the ablation study are shown in Table 4.
After introducing the BiFPN-CBAM, the mAP@0.5 increased by 3.1%, and the mAP@0.5:0.95 improved by 0.4%. The parameter count increased by 0.12 M, and the computational load rose by 0.9 G. This demonstrates that the BiFPN-CBAM effectively preserves feature map resolution and minimizes information degradation.
Following the introduction of the CARAFE module, the parameter count increased by 0.3 M, and the computation load increased by 0.2 G. The mAP@0.5 improved by 0.5%, and the mAP@0.5:0.95 increased by 1.4%. This data indicates that the CARAFE module, with its larger receptive field, can better aggregate contextual information, while also leveraging feature information for efficient upsampling based on the input feature map.
After integrating the MobileOne-DECA module, the FLOPs decreased by 2.2 G, the parameter count dropped by 0.84 M, and the mAP@0.5 reached 95.5%, with the mAP@0.5:0.95 achieving 68.6%. Compared to YOLOv11n, the proposed Mulch-YOLO model improved the mAP@0.5 by 4.7% and mAP@0.5:0.95 by 3.3%, while reducing the parameter count by 0.62 M and decreasing the FLOPs by 1.1 G.
Therefore, the proposed method demonstrates high accuracy and lightweight characteristics, effectively meeting the real-time detection requirements for plastic mulch in cotton.
To intuitively and easily reflect the areas of the feature map that the model focuses on, a gradient-weighted class activation map (Grad-CAM) was introduced to generate heatmaps for YOLOv11n. In the heatmaps, the areas with higher model attention are shown in red, while regions with lower attention are depicted in blue. The heatmap results for three sets of images are shown in Figure 14.
For cotton mulch, while the YOLOv11n model can roughly detect the presence of mulch, it struggles to pinpoint its exact location. Furthermore, in the heatmap, the yellow regions cover not only the plastic mulch but also cotton, as YOLOv11n mistakenly identifies cotton as mulch. However, with the introduction of the BiFPN-CBAM, CARAFE module, and MobileOne-DECA module, the yellow regions in the heatmap progressively shrink to the plastic mulch area, turning red. This improvement demonstrates that the proposed method increasingly focuses its attention on the plastic mulch regions with higher precision, effectively reducing background interference.
Mulch-YOLO successfully addresses the background interference issue by focusing attention specifically on the target. The BiFPN-CBAM helps retain more feature map information, while the CARAFE module facilitates better aggregation of contextual information, providing a larger receptive field. MobileOne-DECA reduces the complexity of the network. Therefore, the proposed method can effectively improve the accurate detection and localization of plastic mulch in cotton.
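The heatmaps can be reproduced with a generic Grad-CAM routine such as the PyTorch sketch below; it assumes direct access to a raw detection module with differentiable outputs, and the score_fn that reduces the output to a single scalar (e.g., the top objectness score) is a placeholder.

```python
import torch

def grad_cam(model, layer, image, score_fn):
    """Minimal Grad-CAM: hook one layer, backpropagate a scalar detection score, and weight
    the layer's activations by the spatially averaged gradients."""
    acts, grads = {}, {}
    h1 = layer.register_forward_hook(lambda m, i, o: acts.update(v=o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))
    score_fn(model(image)).backward()
    h1.remove(); h2.remove()
    weights = grads["v"].mean(dim=(2, 3), keepdim=True)     # channel importance from gradients
    cam = torch.relu((weights * acts["v"]).sum(dim=1))      # weighted activation map
    return cam / (cam.max() + 1e-8)                         # normalized to [0, 1] for overlay
```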

3.4. Comparison Experiment

To further validate the superiority of the proposed algorithm, this paper compares it with current mainstream detection algorithms. We conducted comparison experiments with YOLOv10n [30], YOLOv9t [31], YOLOv8n [32], YOLOv11n [33], YOLOv12n [34], Faster-RCNN [35] and DETR [36]. Similarly, we used multiple metrics for comparison, including mAP@0.5, mAP@0.5:0.95, number of parameters, and computational load. The results of the comparison experiments are shown in Table 5:
On the low-visual-saliency plastic mulch dataset, although Faster-RCNN and DETR demonstrated good performance in terms of mAP@0.5 and mAP@0.5:0.95, they involve excessive model parameters, leading to large storage space requirements. In contrast, YOLO series networks offer the advantages of low FLOPs and smaller model sizes. Compared to the YOLOv10n, YOLOv9t, YOLOv8n, YOLOv11n and YOLOv12n models trained on the entire training set, Mulch-YOLO achieved improvements of 3.3%, 4.3%, 6%, 4.7% and 5.7% in mAP@0.5, respectively, reaching a performance of 95.5%. When compared with YOLOv11n, the proposed method reduced the number of parameters by 24% (to 1.96 M) and decreased the computational load by 17% (to 5.2 G). Thus, Mulch-YOLO achieves high-precision detection of hidden mulch in cotton with fewer parameters and reduced computational requirements. Its lightweight structure makes it suitable for deployment on mobile or embedded devices.
While Table 5 presents a quantitative performance comparison, Table 6 provides a qualitative analysis summarizing the key strengths, limitations, and typical application scenarios of each method. This analysis is based on architectural characteristics, training behavior, and practical deployment experience, offering deeper insights for real-world model selection.
To further demonstrate the effectiveness of the proposed algorithm, we conducted per-image detection on the test set. Representative images from complex backgrounds—characterized by significant scale differences and subtle distinctions from the background—were selected for visualization experiments. The mulch detection results are shown in Figure 15. In both single-target and multi-target scenarios, the proposed method not only detects the mulch individually but also accurately identifies mulch of varying sizes hidden within the cotton. These results substantiate the superior detection performance of the model.

3.5. Deployment

The experimental deployment phase takes place in a real-world production environment, where the mulch recognition equipment observes the cotton after it has been processed by the opening machine, identifying the mulch attached to the cotton surface. Figure 16 shows the mulch recognition equipment installed on the actual production line.
The device uses the NVIDIA Jetson Orin Nano as an embedded development platform for deploying mulch detection. The Jetson Orin Nano is a high-performance, low-power computing platform designed for embedded applications and AI IoT, meeting the critical requirements for real-time image processing and high-precision AI inference in foreign fiber detection in industrial settings. The physical device is shown in Figure 17. With its compact size (70 mm × 45 mm) and low power consumption (5–10 W), it is well suited to edge intelligence applications. This platform was selected for edge deployment primarily for the following reasons: the Jetson Orin Nano series is equipped with an NVIDIA Ampere architecture GPU offering up to 20 TOPS of computing power, which supports object detection algorithms such as YOLO as well as large model inference tasks; its 16 Tensor Cores significantly enhance the parallel computing efficiency of Convolutional Neural Networks (CNNs), meeting the real-time demands of multi-channel HD cameras for foreign fiber feature extraction in industrial scenarios; and its carrier board is compatible with industrial peripherals such as M.2 interfaces, USB 3.0, and Gigabit Ethernet, facilitating the integration of industrial cameras, PLC controllers, and other devices.
Prior to deployment, the model underwent inference optimization. The trained weight file was first converted into an ONNX model on the server, with careful consideration given to environment configuration and necessary modifications for special network layers. The model was then transformed into a TensorRT-compatible Engine format. Subsequently, the model and its dependencies were packaged and migrated to the Jetson Orin Nano development board, where the required software environment for model execution was configured. After the plastic film detection environment was successfully deployed, the virtual environment was activated and the inference program was executed to verify both the integrity of the installation and the successful deployment of the model. The deployed plastic film detection model runs stably on the Jetson Orin Nano and achieved a processing speed of 61 FPS, which is sufficient to match the typical cotton throughput on the production line.
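The export path described above can be sketched with the Ultralytics API as follows; the weight file name is a placeholder, and the FP16 engine build shown in the comment is a common Jetson choice that the paper does not explicitly specify.

```python
from ultralytics import YOLO

model = YOLO("mulch_yolo_best.pt")                       # trained Mulch-YOLO weights (placeholder name)
model.export(format="onnx", imgsz=640, simplify=True)    # server-side conversion to ONNX

# On the Jetson Orin Nano, the ONNX file is then built into a TensorRT engine, for example:
#   trtexec --onnx=mulch_yolo_best.onnx --saveEngine=mulch_yolo_best.engine --fp16
```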
Random sampling tests were conducted on the production line, and the system achieved a 90% detection rate. Figure 18 shows representative detection results, where the model successfully localizes mulch fragments under real-world conditions, including varying illumination and motion blur. This demonstrates the robustness and industrial readiness of our approach.

4. Conclusions

This study addresses the common issues in existing machine vision-based mulch detection algorithms, such as complex parameters and insufficient real-time performance. A lightweight cotton mulch detection method based on the improved YOLOv11 is proposed to effectively balance the trade-off between detection accuracy and model size. This paper introduces an efficient and lightweight feature fusion neck network, incorporating a multi-scale feature fusion network and recombining content-aware features. The improved CBAM is integrated into the BiFPN to better fuse multi-scale features and enhance feature expression capability. The modified CARAFE-Mulch module reorganizes the feature map, strengthening the semantics compared to the original feature map. Furthermore, the MobileOne module is improved by incorporating DECA for parameter optimization, reducing both the model parameters and computational load. Experimental results show that the proposed method outperforms advanced detection techniques in terms of mulch detection performance. The system has been deployed on a cotton processing line, where it achieved a detection rate of 90% under real-world operating conditions.
Although the proposed method achieves strong performance in real-world cotton mulch detection, it has certain limitations. The model’s generalization to diverse environments (e.g., varying lighting or cotton types) needs further validation, and detecting highly transparent or low-contrast plastic fragments remains challenging. Currently, the system relies on RGB images, which may limit discrimination between mulch and similar-looking materials.
For future work, we will explore multi-spectral imaging and 3D-aware architectures to improve robustness in complex industrial scenarios. These enhancements aim to increase detection accuracy for transparent foreign fibers and strengthen adaptability across different production conditions.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/app152111604/s1.

Author Contributions

Conceptualization, Z.S.; methodology, W.W.; software, W.W.; validation, Z.S.; formal analysis, Z.H.; investigation, R.Y.; resources, W.W.; data curation, Z.H.; writing—original draft preparation, Z.S.; writing—review and editing, Z.S.; visualization, Z.H.; supervision, Z.S.; project administration, R.Y.; funding acquisition, W.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Hubei Provincial Department of Education, grant number B2023051.

Data Availability Statement

The full raw image dataset is owned by China Wuhan Jiumu Technology Co., Ltd. and cannot be publicly shared due to confidentiality agreements related to their production line. A representative subset of anonymized images is provided as Supplementary Materials to support the reproducibility of the findings. Data access requests may be considered upon agreement from the data owner.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Yu, M.F. New Frontiers in Textiles: Developments in New Textile Materials and Application Fields; Tsinghua University Press Co., Ltd.: Beijing, China, 2002. (In Chinese) [Google Scholar]
  2. Ma, Z.R.; Liu, Y.S.; Zhang, Q.Q.; Ying, G.G. Current status of agricultural plastic film usage and analysis of environmental pollution. Asian J. Ecotoxicol. 2020, 15, 4. (In Chinese) [Google Scholar]
  3. Chen, Z.; Xing, M.J. Research Progress on Detection Methods of Foreign Fibers in Cotton. Cotton Text. Technol. 2016, 44, 77–81. [Google Scholar]
  4. Xu, W.F.; Hu, J.W.; Zhu, X.J.; Ye, J.J. Application Status of Machine Vision in Intelligent Spinning Production. Cotton Text. Technol. 2022, 50, 607. (In Chinese) [Google Scholar]
  5. Chang, J.Q.; Zhang, R.Y.; Pang, Y.J.; Zhang, M.Y.; Zha, Y. Classification and Detection of Impurities in Machine-Picked Seed Cotton Using Hyperspectral Imaging. Spectrosc. Spectr. Anal. 2021, 41, 3552–3558. (In Chinese) [Google Scholar]
  6. Liu, Q.; Zhang, Y.; Yang, G. Small unopened cotton boll counting by detection with MRF-YOLO in the wild. Comput. Electron. Agric. 2023, 204, 107576. [Google Scholar] [CrossRef]
  7. Lu, Z.; Han, B.; Dong, L.; Zhang, J. COTTON-YOLO: Enhancing Cotton Boll Detection and Counting in Complex Environmental Conditions Using an Advanced YOLO Model. Appl. Sci. 2024, 14, 6650. [Google Scholar] [CrossRef]
  8. Jiang, L.; Chen, W.; Shi, H.; Zhang, H.; Wang, L. Cotton-YOLO-Seg: An enhanced YOLOV8 model for impurity rate detection in machine-picked seed cotton. Agriculture 2024, 14, 1499. [Google Scholar] [CrossRef]
  9. Li, Q.; Ma, W.; Li, H.; Zhang, X.; Zhang, R.; Zhou, W. Cotton-YOLO: Improved YOLOV7 for rapid detection of foreign fibers in seed cotton. Comput. Electron. Agric. 2024, 219, 108752. [Google Scholar] [CrossRef]
  10. Fan, X.; Sun, T.; Chai, X.; Zhou, J. YOLO-WDNet: A lightweight and accurate model for weeds detection in cotton field. Comput. Electron. Agric. 2024, 225, 109317. [Google Scholar] [CrossRef]
  11. Hu, J.; Gong, H.; Li, S.; Mu, Y.; Guo, Y.; Sun, Y.; Hu, T.; Bao, Y. Cotton Weed-YOLO: A Lightweight and Highly Accurate Cotton Weed Identification Model for Precision Agriculture. Agronomy 2024, 14, 2911. [Google Scholar] [CrossRef]
  12. Ni, C.; Li, Z.Y.; Zhang, X.; Zhao, L.; Zhu, T.T.; Jiang, X.S. Sorting Algorithm for Seed Cotton and Plastic Film Residues Based on Short-Wave Near-Infrared Hyperspectral Imaging and Deep Learning. Trans. Chin. Soc. Agric. Mach. 2019, 50, 170–179. (In Chinese) [Google Scholar]
  13. Zhou, X.; Wei, W.; Huang, Z.; Su, Z. Study on the Detection Mechanism of Multi-Class Foreign Fiber under Semi-Supervised Learning. Appl. Sci. 2024, 14, 5246. [Google Scholar] [CrossRef]
  14. Zhang, H.; Qiao, X.; Li, Z.B.; Li, D.L. Hyperspectral Image Segmentation Method for Plastic Film in Ginned Cotton. Trans. Chin. Soc. Agric. Eng. 2016, 32, 13. [Google Scholar]
  15. Ma, C.; Chi, G.; Ju, X.; Zhang, J.; Yan, C. YOLO-CWD: A novel model for crop and weed detection based on improved YOLOv8. Crop. Prot. 2025, 192, 107169. [Google Scholar] [CrossRef]
  16. Shuai, Y.; Shi, J.; Li, Y.; Zhou, S.; Zhang, L.; Mu, J. YOLO-SW: A Real-Time Weed Detection Model for Soybean Fields Using Swin Transformer and RT-DETR. Agronomy 2025, 15, 1712. [Google Scholar] [CrossRef]
  17. Zhang, M.; Zhang, J.; Peng, Y.; Wang, Y. FreqDyn-YOLO: A High-Performance Multi-Scale Feature Fusion Algorithm for Detecting Plastic Film Residues in Farmland. Sensors 2025, 25, 4888. [Google Scholar] [CrossRef]
  18. Fang, H.; Xu, Q.; Chen, X.; Wang, X.; Yan, L.; Zhang, Q. An Instance Segmentation Method for Agricultural Plastic Residual Film on Cotton Fields Based on RSE-YOLO-Seg. Agriculture 2025, 15, 2025. [Google Scholar] [CrossRef]
  19. Zhou, S.; Zhou, H.; Qian, L. A multi-scale small object detection algorithm SMA-YOLO for UAV remote sensing images. Sci. Rep. 2025, 15, 9255. [Google Scholar] [CrossRef]
  20. Giri, K.J. SO-YOLOv8: A novel deep learning-based approach for small object detection with YOLO beyond COCO. Expert Syst. Appl. 2025, 280, 127447. [Google Scholar]
  21. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790. [Google Scholar]
  22. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  23. Wang, J.; Chen, K.; Xu, R.; Liu, Z.W.; Loy, C.C.; Li, D.H. Carafe: Content-aware reassembly of features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3007–3016. [Google Scholar]
  24. Vasu, P.K.A.; Gabriel, J.; Zhu, J.; Tuzel, O.; Ranjan, A. Mobileone: An improved one millisecond mobile backbone. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7907–7917. [Google Scholar]
  25. Wang, J.; Yu, J.; He, Z. DECA: A novel multi-scale efficient channel attention module for object detection in real-life fire images. Appl. Intell. 2022, 52, 1362–1375. [Google Scholar] [CrossRef]
  26. Gong, Y.; Yu, X.; Ding, Y.; Peng, X.K.; Zhao, J.; Han, Z.J. Effective fusion factor in FPN for tiny object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 6–8 January 2021; pp. 1160–1168. [Google Scholar]
  27. Wang, K.; Liew, J.H.; Zou, Y.; Zhou, D.; Feng, J. Panet: Few-shot image semantic segmentation with prototype alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019 ; pp. 9197–9206. [Google Scholar]
  28. Wan, G.; Fang, H.; Wang, D.; Yan, J.; Xie, B. Ceramic tile surface defect detection based on deep learning. Ceram. Int. 2022, 48, 11085–11093. [Google Scholar] [CrossRef]
  29. Padilla, R.; Netto, S.L.; Da Silva, E.A.B. A survey on performance metrics for object-detection algorithms. In Proceedings of the 2020 International Conference on Systems, Signals and Image Processing (IWSSIP), Niteroi, Brazil, 1–3 July 2020; pp. 237–242. [Google Scholar]
  30. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. Yolov10: Real-time end-to-end object detection. Adv. Neural Inf. Process. Syst. 2024, 37, 107984–108011. [Google Scholar]
  31. Wang, C.Y.; Yeh, I.H.; Liao, H.Y. Yolov9: Learning what you want to learn using programmable gradient information. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2024; pp. 1–21. [Google Scholar]
  32. Sohan, M.; Sai Ram, T.; Rami Reddy, C.V. A review on yolov8 and its advancements. In International Conference on Data Intelligence and Cognitive Informatics; Springer: Singapore, 2024; pp. 529–545. [Google Scholar]
  33. Khanam, R.; Hussain, M. Yolov11: An overview of the key architectural enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar] [CrossRef]
  34. Khanam, R.; Hussain, M. A Review of YOLOv12: Attention-Based Enhancements vs. Previous Versions. arXiv 2025, arXiv:2504.11995. [Google Scholar] [CrossRef]
  35. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  36. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In European Conference on Computer Vision; Springer International Publishing: Cham, Switzerland, 2020; pp. 213–229. [Google Scholar]
Figure 1. Plastic mulch hidden within cotton (a–c).
Figure 2. Examples of high-saliency and low-visual-saliency plastic mulch in cotton. (a) High-saliency mulch. (b) Low-visual-saliency mulch.
Figure 3. Hikvision (Hangzhou, China) MV-CS016-10UC Industrial Camera.
Figure 4. Foreign Fiber Splitting Machine.
Figure 5. Data samples.
Figure 6. Architecture of YOLO11.
Figure 7. Improved model architecture of YOLOv11.
Figure 8. Structure of the BiFPN module. (a) FPN; (b) PANet; (c) BiFPN.
Figure 9. Structure of BiFPN-CBAM.
Figure 10. Structure of the CBAM.
Figure 11. Structure of CARAFE.
Figure 12. Structure of the MobileOne module.
Figure 13. Structure of the DECA module.
Figure 14. Heatmap Visualization. (a) Input Image; (b) YOLOv11n; (c) YOLOv11 + BiFPN-CBAM; (d) YOLOv11 + BiFPN-CBAM + CARAFE; (e) Mulch-YOLO.
Figure 15. Detection results of the improved model. (a) YOLO11; (b) Mulch-YOLO.
Figure 16. Mulch detection system deployed on the production line.
Figure 17. Edge computing module.
Figure 18. Mulch detection results.
Table 1. Hikvision MV-CS016-10UC Camera Specifications.
Parameter Name | Parameter Value
Sensor Type | CMOS (Global Shutter)
Sensor Model | IMX273
Resolution | 1.6 Megapixels
Image Size | 1440 × 1080
Signal-to-Noise Ratio | 40 dB (excellent image noise control)
Dynamic Range | 71.1 dB (suitable for complex lighting environments)
Table 2. Data distribution.
Category | Train | Test | Val
Number | 16,107 | 2013 | 2014
Table 3. Hyperparameters.
Parameters | Values
size | 640
epochs | 150
batch size | 16
Initial learning rate | 0.01
cos_lr | False
fliplr | 0.5
mosaic | 1.0
scale | 0.5
Table 4. Detection Performance of the Proposed Method with Improvements.
YOLOv11n | +BiFPN-CBAM | +CARAFE | +MobileOne-DECA | mAP@0.5 | mAP@0.5:0.95 | Parameters/10⁶ | FLOPs/G
✓ | – | – | – | 90.8% | 65.3% | 2.58 | 6.3
✓ | ✓ | – | – | 93.9% | 65.7% | 2.7 | 7.2
✓ | ✓ | ✓ | – | 94.4% | 67.1% | 2.8 | 7.4
✓ | ✓ | ✓ | ✓ | 95.5% | 68.6% | 1.96 | 5.2
Table 5. Performance Comparison of Different Detection Methods on the Dataset.
Model | mAP@0.5 | mAP@0.5:0.95 | Parameters/10⁶ | FLOPs/G
YOLOv10n | 92.2% | 63.9% | 2.6 | 8.2
YOLOv9t | 91.2% | 64.1% | 2.1 | 8.5
YOLOv8n | 89.5% | 62.8% | 3.1 | 8.9
YOLOv11n | 90.8% | 65.3% | 2.58 | 6.3
YOLOv12n | 89.8% | 60.1% | 2.5 | 5.8
Faster RCNN | 85.1% | 65.2% | 125.1 | 47.9
DETR | 89.6% | 66.7% | 473.9 | 515.1
Mulch-YOLO | 95.5% | 68.6% | 1.96 | 5.2
Table 6. Qualitative Comparison: Advantages and Disadvantages of Different Detection Methods in Practical Scenarios.
Detection Method | Advantages | Disadvantages | Recommended Use Cases
YOLOv10n | NMS-free, stable inference | Moderate small-object detection capability | Real-time applications on edge devices
YOLOv9t | Lightweight design | Moderate robustness to low-quality images | Embedded devices with extreme resource constraints
YOLOv8n | Stable training, easy to use | Relatively high computational cost | Rapid prototyping
YOLOv11n | Rapid prototyping | Relatively new, limited community support and tutorials | Industrial applications requiring a balance between accuracy and speed
YOLOv12n | Small parameter count, suitable for model compression | Relatively low mAP@0.5:0.95, may underfit in complex scenes | Ultra-low-power devices
Faster RCNN | High localization accuracy, suitable for large objects | Huge parameter count, difficult to deploy | Tasks requiring extremely accurate localization
DETR | End-to-end design, no need for NMS or anchor boxes | Slow training convergence, requires large datasets | Large-scale, sparse-object scenarios
Mulch-YOLO | Highest accuracy, excellent lightweight design | Designed for specific scenarios | High-accuracy applications with moderate hardware resources
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
