Article

YOLOv11-4ConvNeXtV2: Enhancing Persimmon Ripeness Detection Under Visual Challenges

1  School of Computer Science and Engineering, University of New South Wales, Sydney 2052, Australia
2  Shaanxi Xiaodong Aid Robot Science and Technology Corporation, Anke Square, iHabour, Xianyang 712046, China
3  School of Mechanical Engineering, Xi’an Jiaotong University, Xi’an 710049, China
*  Author to whom correspondence should be addressed.
AI 2025, 6(11), 284; https://doi.org/10.3390/ai6110284
Submission received: 23 August 2025 / Revised: 12 October 2025 / Accepted: 21 October 2025 / Published: 1 November 2025

Abstract

Reliable and efficient detection of persimmons provides the foundation for precise maturity evaluation. Persimmon ripeness detection remains challenging due to small target sizes, frequent occlusion by foliage, and motion- or focus-induced blur that degrades edge information. This study proposes YOLOv11-4ConvNeXtV2, an enhanced detection framework that integrates a ConvNeXtV2 backbone with Fully Convolutional Masked Auto-Encoder (FCMAE) pretraining, Global Response Normalization (GRN), and Single-Head Self-Attention (SHSA) mechanisms. We present a comprehensive persimmon dataset featuring sub-block segmentation that preserves local structural integrity while expanding dataset diversity. The model was trained on 4921 annotated images (original 703 + 6 × 703 augmented) collected under diverse orchard conditions and optimized for 300 epochs using the Adam optimizer with early stopping. Comprehensive experiments demonstrate that YOLOv11-4ConvNeXtV2 achieves 95.9% precision and 83.7% recall, with mAP@0.5 of 88.4% and mAP@0.5:0.95 of 74.8%, outperforming state-of-the-art YOLO variants (YOLOv5n, YOLOv8n, YOLOv9t, YOLOv10n, YOLOv11n, YOLOv12n) by 3.8–6.3 percentage points in mAP@0.5:0.95. The model demonstrates superior robustness to blur, occlusion, and varying illumination conditions, making it suitable for deployment in challenging maturity detection environments.

1. Introduction

Persimmon ripeness detection is crucial for optimizing harvest timing, yet it remains difficult due to small target sizes, subtle color transitions during ripening, frequent occlusion by foliage, and motion- or focus-induced blur that degrades edge information. Recent studies have demonstrated the effectiveness of YOLO-based approaches for various fruit ripeness detection tasks, including pomegranate classification [1], apple ripeness estimation [2], and pineapple ripeness detection [3]. For persimmons specifically, Xu et al. [4] improved YOLOv5 through BiFPN neck fusion and attention mechanisms for handling complex field scenes, while Suzuki et al. [5] demonstrated deep learning’s potential for quality prediction from RGB images. In addition, Cao et al. [6] proposed an improved YOLOv5-based method for persimmon recognition and detection in natural environments, highlighting practical advances in field conditions. However, these visual degradation factors continue to significantly challenge the effectiveness of conventional object detection systems, particularly when occlusion and blur occur simultaneously.
In recent years, YOLO-series detectors have become the dominant approach for real-time fruit detection due to their advantageous balance between inference speed and localization accuracy. Contemporary research has focused on architectural enhancements to YOLO frameworks, with particular emphasis on developing specialized modules for handling occlusion patterns, multi-scale feature fusion, and environmental noise characteristic of orchard settings. Existing YOLO-based frameworks have shown promising improvements in fruit maturity detection but continue to demonstrate limited robustness under the simultaneous presence of small-scale targets, occlusion, and blurred backgrounds.
Several architectural modifications and attention mechanisms have been proposed to strengthen the feature extraction capabilities of YOLO models. Recent studies have demonstrated various improvements for maturity detection applications, including lightweight models for tomato ripeness detection [7,8], transformer-based approaches for apple ripeness identification [9], and multi-task models for tomato maturity and stem detection [10].
Within YOLO-based architectures, various adaptations have emerged to accommodate orchard conditions more effectively. Studies have shown that YOLO variants can achieve high accuracy for different fruit types, including pineapple ripeness detection with 98.26% accuracy [3], oil palm ripeness detection with near 99% accuracy [11], tomato ripeness detection with improved lightweight models [7], and melon ripeness detection with 97.4% mAP [12].
Despite these advances, current methodologies predominantly focus on occlusion or scale imbalance, with limited solutions addressing the compounded issue of small targets embedded in blurred or degraded backgrounds—a common condition in natural persimmon orchards due to shallow depth of field and dense canopy coverage. As highlighted in recent comprehensive reviews [13], unstructured orchard environments characterized by varying illumination conditions and degrees of occlusion present significant challenges for fruit detection systems, necessitating specialized optimization strategies. Traditional fusion and attention mechanisms, though beneficial, often lack the precision to recover fine-grained spatial details critical for identifying blurred, low-contrast objects.
To bridge this gap, we propose an advanced YOLOv11-based detection framework, referred to as YOLOv11-4ConvNeXtV2, that integrates a lightweight, multi-module ConvNeXtV2 backbone. This backbone streamlines feature propagation through the network, reducing computational complexity while preserving the robust feature learning and extraction capabilities essential for accurate object detection. Additionally, we propose a novel dataset structure based on sub-block segmentation, supported by a mathematically defined data augmentation scheme. This approach enhances data diversity while retaining local structural integrity, providing a high-quality and varied dataset for effective network training. Furthermore, we develop a detection head with a single-head self-attention mechanism that iteratively learns from multi-scale information in the feature pyramid [14]. This design significantly reduces computational redundancy and efficiently integrates global and local features, notably enhancing detection precision and robustness for blurred and partially occluded persimmons, similar to approaches used in other fruit detection studies [10,15].
To address the fundamental limitations, we present the YOLOv11-4ConvNeXtV2 detection framework that tackles the core challenge of representation learning under visual degradation. The key innovation lies in leveraging ConvNeXtV2’s Fully Convolutional Masked Auto-Encoder (FCMAE) pretraining paradigm, which forces the network to reconstruct missing visual information from sparse contextual cues—directly analogous to the fruit-behind-leaf scenario prevalent in orchards. This pretraining mechanism, combined with Global Response Normalization (GRN), establishes competitive dynamics among feature channels, where globally consistent evidence (fruit characteristics) receives amplified representation while locally scattered distractors (leaf textures) are systematically suppressed. Unlike conventional approaches that address occlusion and blur as separate problems, our framework treats them as manifestations of incomplete visual evidence, enabling a unified solution through context completion and channel competition mechanisms inherent in the ConvNeXtV2 architecture [16].
The primary contributions of this study are as follows:
  • An enhanced YOLOv11 architecture incorporating a ConvNeXtV2 backbone with FCMAE pretraining and Global Response Normalization is proposed. This integration leverages context completion mechanisms to address occlusion challenges, achieving 3.8–6.3-percentage-point improvements in mAP@0.5:0.95 compared to baseline YOLO variants through enhanced channel-wise feature discrimination for persimmon detection under adverse conditions.
  • A comprehensive persimmon dataset is constructed featuring diverse orchard conditions including occlusion patterns, varying illumination, and blur scenarios. The dataset encompasses 4921 annotated images with systematic sub-block organization that preserves spatial relationships and local structural integrity essential for robust model training.
  • A single-head self-attention (SHSA) detection head is constructed to efficiently capture and integrate multi-scale global and local features, significantly improving detection accuracy for blurred and occluded persimmons.

2. Materials and Methods

2.1. Dataset

To enhance dataset diversity and improve the generalization capability of the model under varying orchard conditions, a structured data augmentation pipeline was designed. All augmentation operations were performed after splitting the dataset into training, validation, and test subsets with a fixed ratio of 7:2:1. This approach preserves the intrinsic distribution of the original dataset and prevents sample overlap or label leakage between subsets.
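As a concrete illustration of this ordering, the Python sketch below splits the original images 7:2:1 first and only then generates augmented copies within each subset, so that no augmented version of an image can leak across subsets. The helper callables in `augmenters` are hypothetical placeholders, not code released with this article.

```python
import random

def split_then_augment(image_paths, augmenters, seed=42):
    """Split the original images 7:2:1 first, then augment each subset
    independently (a sketch of the leakage-free ordering described above;
    `augmenters` is a hypothetical list of callables, one per method)."""
    rng = random.Random(seed)
    paths = list(image_paths)
    rng.shuffle(paths)

    n = len(paths)
    n_train, n_val = int(0.7 * n), int(0.2 * n)
    subsets = {
        "train": paths[:n_train],
        "val": paths[n_train:n_train + n_val],
        "test": paths[n_train + n_val:],
    }

    augmented = {}
    for name, subset in subsets.items():
        # every augmentation method produces one extra copy per original image,
        # but only from images already assigned to this subset
        augmented[name] = list(subset)
        for aug in augmenters:
            augmented[name] += [aug(p) for p in subset]
    return augmented
```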
We constructed a self-built persimmon ripeness image dataset specifically designed for maturity detection applications. The dataset comprises 703 high-resolution RGB images (3840 × 2160 pixels) collected from the Aviani Persimmon Orchard in Sydney, Australia, using a Sony α7 III digital camera (Sony Corporation, Tokyo, Japan; 24.2 MP). Images were captured at various focal lengths and under diverse natural conditions, including direct sunlight, strong backlighting, shaded foliage, and oblique viewing angles, to better represent real-world acquisition scenarios and orchard complexity.
Each persimmon in the dataset was annotated using rectangular bounding boxes to maintain consistency with established fruit maturity detection datasets [4]. The maturity classification (mature vs. immature) was conducted under the guidance of experienced farmers who provided expertise on visual indicators of persimmon ripeness at different maturity stages. Data collection was performed across multiple maturity periods to capture the full spectrum of ripening characteristics.
Figure 1 demonstrates typical detection challenges such as motion blur, overlapping fruits, and illumination imbalance. The examples showcase the visual complexity encountered in real orchard environments, where persimmons often appear with color similarity to foliage, frequent partial occlusion, and varying scales under different lighting conditions.

2.1.1. The Algorithm Principle of Sub-Block Segmentation

In this study, a sub-block segmentation-based data augmentation method is proposed. The method divides the original image into several equal-sized, non-overlapping blocks and rearranges them according to a predefined mathematical rule, before reconstructing them into an image of the original size. The mathematical relationship for sub-block segmentation is formulated as
$$H = M \times h, \qquad W = N \times w, \qquad M, N \in \mathbb{N}^{+}$$
where $H$ and $W$ represent the original image height and width, respectively; $M$ is the number of sub-block rows and $N$ is the number of sub-block columns; and $h$ and $w$ denote the height and width of each individual sub-block, respectively. In this experiment, the image size is 5568 × 4872. Taking $M = 4$ gives $h = 5568/4 = 1392$; taking $N = 3$ gives $w = 4872/3 = 1624$. That is, twelve sub-blocks are segmented, each with a size of 1392 × 1624.
According to this method, the original image is cut into 4 × 3 sub-blocks. After cutting, some sub-blocks are randomly flipped or rotated, and the blocks are finally reassembled into a 5568 × 4872 image, as shown in Figure 2. The coloring in this visualization only highlights the randomized block selection and carries no semantic meaning. This method greatly enriches the dataset and lays a foundation for improving the robustness of the model’s target recognition. Beyond segmenting the sub-blocks, the random flipping and rotation operations also increase the number of small-target-feature images in the dataset, providing strong data support for the model to extract more target feature points.
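A minimal NumPy sketch of the procedure is given below. The 4 × 3 grid follows the example above; the per-block transformation probability, the optional position permutation, and the restriction to shape-preserving transforms (flips and 180° rotation, so that rectangular blocks still fit their cells) are implementation assumptions.

```python
import numpy as np

def sub_block_shuffle(image, M=4, N=3, p_transform=0.5, permute=True, seed=None):
    """Cut the image into an M x N grid of equal, non-overlapping sub-blocks,
    optionally permute their positions, randomly flip/rotate some of them, and
    reassemble to the original size. Only shape-preserving transforms (flips,
    180-degree rotation) are used so every block still fits its cell;
    p_transform and the permutation option are assumptions."""
    rng = np.random.default_rng(seed)
    H, W = image.shape[:2]
    h, w = H // M, W // N                      # H = M*h, W = N*w as above

    blocks = [image[i*h:(i+1)*h, j*w:(j+1)*w].copy()
              for i in range(M) for j in range(N)]
    order = rng.permutation(len(blocks)) if permute else np.arange(len(blocks))

    out = image.copy()
    for dst, src in enumerate(order):
        block = blocks[src]
        if rng.random() < p_transform:
            # horizontal flip, vertical flip, or 180-degree rotation
            block = [block[:, ::-1], block[::-1, :], block[::-1, ::-1]][rng.integers(3)]
        i, j = divmod(dst, N)
        out[i*h:(i+1)*h, j*w:(j+1)*w] = block
    return out
```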

2.1.2. Dataset Construction

In addition to sub-block segmentation, five further augmentation methods were used, each designed to simulate specific visual challenges frequently encountered in field harvesting scenarios.
Contrast enhancement applies a linear pixel transformation to increase contrast between fruits and background regions, particularly under varying lighting conditions. The transformation is defined as
$$P_{out} = A \cdot P_{in} + B$$
where $P_{in}$ and $P_{out}$ denote the input and output pixel values, respectively. The contrast gain factor $A$ was set to 1.5, and the brightness offset $B$ was fixed at 1. This adjustment simulates scenarios with strong lighting variation, improving the model’s maturity discrimination under different sunlight exposures.
Gaussian noise injection introduces additive Gaussian noise to simulate sensor noise and compression artifacts commonly observed in low-light or fast-capture conditions. This encourages the network to focus on semantic features rather than low-level texture noise, thereby increasing robustness under variable image quality.
Random rotation rotates images randomly within a fixed angular range to emulate changes in viewing direction caused by natural hand-held camera capture or drone instability. This enhances spatial invariance of the model with respect to fruit orientation.
Horizontal flipping applies a standard geometric transformation with 50% probability to mirror image content, diversifying object orientation while preserving semantic consistency. This augmentation is particularly effective in balancing the left–right appearance distribution of targets.
Random erasing involves replacing pixel values of randomly selected rectangular regions with zeros, simulating occlusions by leaves, branches, or other orchard equipment. By learning from partially visible fruit, the model becomes more robust to occlusion-induced degradation in maturity recognition.
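The sketch below illustrates how these five augmentations can be realized with NumPy/SciPy. The contrast parameters ($A = 1.5$, $B = 1$), the noise level ($\sigma = 0.01$ on a [0, 1] scale, roughly 2.55 on 8-bit pixels), the 50% flip probability, and the 20% erased area follow the text; the rotation range (±15°) and the square-ish erased rectangle are assumptions.

```python
import numpy as np
from scipy.ndimage import rotate

def augment_once(img, rng=None):
    """Apply each of the five augmentations described above to one 8-bit image
    and return the results as a dict keyed by method name (a sketch)."""
    if rng is None:
        rng = np.random.default_rng()
    out = {}

    # 1) contrast enhancement: P_out = A * P_in + B, with A = 1.5, B = 1
    out["contrast"] = np.clip(1.5 * img.astype(np.float32) + 1, 0, 255).astype(np.uint8)

    # 2) additive Gaussian noise (sigma = 0.01 on a [0, 1] scale ~ 2.55 on 8-bit pixels)
    noise = rng.normal(0.0, 2.55, img.shape)
    out["noise"] = np.clip(img + noise, 0, 255).astype(np.uint8)

    # 3) random rotation within a fixed angular range (assumed +/- 15 degrees)
    angle = rng.uniform(-15, 15)
    out["rotation"] = rotate(img, angle, reshape=False, order=1, mode="nearest")

    # 4) horizontal flip (applied with 50% probability during training)
    out["flip"] = img[:, ::-1].copy()

    # 5) random erasing: zero out a rectangle covering ~20% of the image area
    h, w = img.shape[:2]
    eh, ew = int(0.45 * h), int(0.45 * w)          # 0.45^2 ~ 0.20 of the area
    y, x = rng.integers(0, h - eh), rng.integers(0, w - ew)
    erased = img.copy()
    erased[y:y + eh, x:x + ew] = 0
    out["erasing"] = erased

    return out
```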
Figure 3 displays the comprehensive data augmentation strategies employed to enhance model robustness. The figure demonstrates the augmentation techniques, including sub-block segmentation for spatial structure variations, brightness enhancement/dimming (±30% adjustment) simulating exposure variations, additive Gaussian noise ($\sigma = 0.01$) mimicking imaging artifacts, random erasure obscuring 20% of the image area for occlusion simulation, and horizontal flipping with 50% probability for orientation diversity.
Table 1 summarizes the number of augmented images generated by each augmentation method. As shown, each technique generated 703 additional images, resulting in a total dataset of 4921 samples (original 703 + 6 × 703 augmented images), providing a substantial increase in sample size and diversity. Figure 3 illustrates representative examples of a single image processed through different augmentation techniques, including contrast adjustment, Gaussian noise injection, geometric transformation, and random erasing. These transformations were designed to simulate real-world variations such as illumination changes, sensor noise, occlusion, and viewpoint shifts, thereby improving the model’s robustness to visual perturbations commonly encountered in orchard environments.

2.2. Methods

2.2.1. The Algorithm Principle of YOLOv11

YOLOv11 is a streamlined and high-speed object detection framework optimized for deployment in resource-constrained environments such as embedded systems and edge devices. It preserves the classic YOLO-series structure composed of four main components—input, backbone, neck, and prediction head—and introduces several notable improvements to balance accuracy and computational efficiency, building upon the success of previous YOLO variants [17,18,19].
At the backbone level, YOLOv11 employs a lightweight residual block named C3k2, which compresses parameters through partial convolution while maintaining skip connections [20]. This allows the model to retain semantic depth without redundant feature propagation. Furthermore, the C2f attention module enhances early-stage spatial sensitivity by applying fine-grained attention to shallow layers, which is beneficial for detecting small or low-contrast targets.
In the neck, YOLOv11 employs a classic Feature Pyramid Network (FPN) to fuse features from different stages [21]. This enables multi-scale representation by combining low-level features (with high resolution and fine textures) and high-level features (with rich semantics but lower resolution). However, unlike more recent bi-directional fusion strategies, the FPN in YOLOv11 only provides a top-down flow, which may lead to insufficient information recovery for heavily occluded or blurred small targets.
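The purely top-down fusion described here can be summarized by the following generic FPN sketch. The channel widths and layer counts are illustrative, not the exact YOLOv11 configuration, which additionally uses concatenation and C3k2 blocks in its neck.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFPN(nn.Module):
    """Minimal top-down FPN fusion: deeper, semantically rich maps are upsampled
    and merged into higher-resolution maps (illustrative channel widths)."""
    def __init__(self, in_channels=(64, 128, 256), out_channels=64):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])
        self.smooth = nn.ModuleList([nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                     for _ in in_channels])

    def forward(self, feats):
        # feats: [P3 (80x80), P4 (40x40), P5 (20x20)] from shallow to deep
        laterals = [lat(f) for lat, f in zip(self.lateral, feats)]
        # top-down only: deeper maps are upsampled and added to shallower ones
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest")
        return [s(l) for s, l in zip(self.smooth, laterals)]
```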
The detection head outputs predictions at three feature map resolutions (80 × 80, 40 × 40, and 20 × 20), which correspond to small, medium, and large targets, respectively [21]. This design allows the model to make scale-adaptive predictions. The 80 × 80 branch captures fine structural cues for small targets, while the deeper 20 × 20 branch focuses on abstract object-level semantics. This combination is effective in general scenarios, but in highly degraded scenes (e.g., blurred fruits, occluded targets), YOLOv11 may still underperform due to its limited ability to retain low-level appearance cues and to differentiate subtle variations across channels.
These limitations become more pronounced in orchard environments, where persimmons often appear in clusters, are partially occluded by foliage, or are affected by defocus-induced blur. In such settings, YOLOv11 may fail to capture critical spatial nuances, leading to misclassification or missed detections. Therefore, enhancing the backbone’s ability to extract and normalize fine-grained features under degraded visual conditions becomes essential.
Figure 4 illustrates the traditional three-stage framework of YOLOv11, consisting of the backbone, neck, and head modules. The backbone employs multiple 3 × 3 convolutions for downsampling, intertwined with C3k2 modules to enhance feature extraction. The neck integrates SPPF, C2PSA, upsampling, and concatenation operations to achieve multi-scale feature fusion, while the head module generates multi-scale detection outputs. The arrows indicate the direction of feature propagation across modules, and the color shading is used only to visually differentiate the backbone, neck, and head regions. This figure demonstrates the standard YOLOv11 process, which involves hierarchical feature extraction, feature fusion, and multi-scale prediction.

2.2.2. Overall Architecture of the YOLOv11-4ConvNeXtV2 Model

Affected by complex environments and lighting, as well as the performance limitations of image acquisition equipment, persimmon targets collected in orchards often have problems with target blur and uneven imaging. This requires further enhancement of the YOLOv11 backbone network’s ability to extract target features. The ConvNeXtV2 model, with its advantages in network architecture design, can effectively improve the model’s feature capture capability and training efficiency for targets in complex scenes.
To address the limitations of YOLOv11 in preserving detailed features and capturing long-range dependencies, we replace its original backbone with ConvNeXtV2-Tiny. This modern convolutional backbone is co-designed with a self-supervised pretraining strategy and introduces two core innovations that directly target the challenges encountered in orchard-based small-object detection [22].
The first is the Fully Convolutional Masked Auto-Encoder (FCMAE) training scheme [23]. During pretraining, random spatial regions of the image are masked, and the network learns to reconstruct them using surrounding visible patches. This forces the backbone to infer missing information from limited context, thereby enhancing its ability to recover object boundaries and textures even under severe occlusion or blur. As a result, the ConvNeXt V2 backbone learns stronger global semantic priors and becomes robust to spatial corruption, which is critical when fruit visibility is compromised.
The second enhancement is the introduction of Global Response Normalization (GRN). Traditional normalization methods like Batch Norm or Layer Norm operate either locally or uniformly, potentially flattening important activation contrasts across channels [24]. In contrast, GRN normalizes each channel based on the global energy distribution of all channels, introducing competitive dynamics among feature maps. This enhances the inter-channel contrast, allowing the model to better separate subtle differences in fruit texture, maturity color, or blurred contours—especially when multiple fruits appear adjacent to one another.
From a design perspective, ConvNeXt V2 maintains a hierarchical structure similar to ResNets [20], but it integrates depthwise separable convolutions, large kernel sizes, and inverted bottlenecks [25,26], enabling it to capture both local edge details and global context efficiently. The shallow layers encode detailed shapes and textures, while deeper layers abstract higher-level semantics—mirroring the complementary strengths of YOLOv11’s multi-scale head. By aligning this rich feature representation with YOLOv11’s detection layers, the overall model achieves spatial precision for small fruits and semantic stability for occluded clusters.
In summary, by embedding ConvNeXt V2 into the YOLOv11 framework, the proposed architecture (YOLOv11-4ConvNeXtV2) inherits the fast inference capability of YOLOv11 while gaining a powerful, blur-tolerant, and detail-preserving backbone. This makes it particularly suited for real-time, fine-grained detection of persimmons in visually complex orchard environments.
Figure 5 shows the overall architecture of the proposed YOLOv11-4ConvNeXtV2 model, illustrating how the ConvNeXt V2 backbone is integrated with YOLOv11’s detection components. The four ConvNeXt V2 Tiny modules (highlighted by the red dashed box) are newly integrated into the YOLOv11 framework to enhance feature extraction. The remaining modules remain unchanged, with the same colors representing the same modules. The framework demonstrates the enhanced feature extraction capabilities while maintaining real-time performance requirements.

2.3. Description of Key Modules

2.3.1. ConvNeXtV2 Backbone with Global Response Normalization

The ConvNeXtV2 backbone provides deeper receptive fields through the use of larger kernel sizes (7 × 7) in early stages and smaller kernels (3 × 3) in later stages. The architecture employs depthwise separable convolutions to reduce computational complexity while maintaining feature extraction capability. The key innovation lies in the integration of Global Response Normalization (GRN), which enhances training stability and feature discriminability.
Figure 6 shows the ConvNeXt V2 block structure, where an expanded MLP layer is followed by a Global Response Normalization (GRN) layer, and the original LayerScale module from ConvNeXt V1 is removed. The GRN mechanism serves as the core innovation, significantly enhancing the model’s discriminative capability for blurred targets by suppressing feature collapse. The arrows indicate the forward propagation of features through the convolutional and normalization layers. The color shading is used solely to visually distinguish different processing stages within the block. The red font "+GRN" highlights the newly added operation in ConvNeXt V2 compared with ConvNeXt V1. Given an input feature tensor $X \in \mathbb{R}^{C \times H \times W}$, where $C$ is the number of channels and $H$ and $W$ denote the height and width of the feature map, respectively, the GRN layer operates through the following steps.
  • Global Response Aggregation:
For each spatial position $(i, j)$, compute the L2 norm across all feature channels to generate the global response map $G \in \mathbb{R}^{H \times W}$:
$$g(i,j) = \left\| X_{ij} \right\|_2 = \sqrt{\sum_{c=1}^{C} x(c,i,j)^2}$$
where $g(i,j)$ represents the global response intensity at position $(i,j)$, reflecting the comprehensive channel strength at that position, and $x(c,i,j)$ denotes the feature value of channel $c$ at position $(i,j)$.
  • Channel Competition Normalization:
Use global response to adaptively scale the original features:
$$\hat{x}(c,i,j) = \frac{x(c,i,j)^2}{\sum_{k=1}^{C} x(k,i,j)^2 + \epsilon}$$
where $\hat{x}(c,i,j)$ denotes the normalized feature response, $x(c,i,j)^2$ is the enhanced channel response, and $\sum_{k=1}^{C} x(k,i,j)^2 + \epsilon$ is the total feature response at position $(i,j)$, with $\epsilon$ a small constant that prevents division by zero. This operation enables sparse competition between channels: if a channel’s response is significantly higher than the others (such as structural residuals in blurred targets), its output features are enhanced; conversely, weakly responding channels are suppressed.
  • Feature Calibration:
Finally, use the calculated feature normalization values to calibrate the original input response:
$$\tilde{x}(c,i,j) = \gamma_c \times \hat{x}(c,i,j) + \beta_c$$
where $\tilde{x}(c,i,j)$ represents the final calibrated feature response, and $\gamma_c$ and $\beta_c$ are two additional learnable per-channel parameters, initialized to zero during task training.
This integrated design makes the ConvNeXtV2 backbone particularly effective for persimmon detection under challenging conditions including occlusion and blur, where globally consistent evidence (fruit characteristics) receives amplified representation while locally scattered distractors (leaf textures) are systematically suppressed.
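For clarity, the three GRN steps above can be written directly as a small PyTorch module. The sketch below follows the equations in this section literally; the reference ConvNeXt V2 implementation may differ in the exact aggregation axes and in how a residual connection is applied.

```python
import torch
import torch.nn as nn

class GRNAsDescribed(nn.Module):
    """Global Response Normalization implemented literally from the three
    steps above (a sketch; the reference ConvNeXt V2 code may differ in the
    exact aggregation axes and residual handling)."""

    def __init__(self, channels: int, eps: float = 1e-6):
        super().__init__()
        # per-channel calibration parameters gamma_c, beta_c, initialized to zero
        self.gamma = nn.Parameter(torch.zeros(channels))
        self.beta = nn.Parameter(torch.zeros(channels))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, H, W)
        # Step 1 - global response aggregation: L2 norm across channels at each (i, j)
        g = torch.sqrt((x ** 2).sum(dim=1, keepdim=True))      # (N, 1, H, W)
        # Step 2 - channel competition: squared response over the total response
        x_hat = (x ** 2) / (g ** 2 + self.eps)                  # (N, C, H, W)
        # Step 3 - feature calibration with learnable per-channel gamma and beta
        gamma = self.gamma.view(1, -1, 1, 1)
        beta = self.beta.view(1, -1, 1, 1)
        return gamma * x_hat + beta
```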

2.3.2. Fully Convolutional Masked Auto-Encoder

As shown in Figure 7, random regions are masked on the input blurred persimmon image, and the model attempts to restore these masked regions through a sparse convolutional encoding–decoding process. The yellow–green grids represent multi-scale feature maps at different encoding stages, where lighter colors correspond to higher-level feature representations. The black solid arrows indicate the forward propagation of feature extraction within the hierarchical encoder, while the gray arrows denote the reconstruction path in the plain decoder. The dotted arrow at the top illustrates the overall data flow from the masked input image through the sparse convolution process to the reconstructed output. This framework employs a fully convolutional structure rather than fully connected layers to generate masks and reconstruct image data, effectively reducing parameters and computational load while preserving spatial information. Moreover, the FCMAE adopts a multi-scale masking strategy instead of fixed-size masks, which greatly enhances the model’s perception capability for occluded and blurred regions.
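The masked-reconstruction objective can be illustrated with the short sketch below, which masks random patches and penalizes reconstruction error only on the masked regions. The patch size, mask ratio, and the dense encoder/decoder stand-ins are assumptions; the actual FCMAE uses sparse convolutions and a multi-scale masking strategy.

```python
import torch
import torch.nn.functional as F

def fcmae_step(image, encoder, decoder, patch=32, mask_ratio=0.6):
    """One masked-reconstruction step in the spirit of FCMAE (a sketch).
    image: (N, 3, H, W) with H and W divisible by `patch`; `encoder` and
    `decoder` are assumed to map the input back to full resolution."""
    n, _, h, w = image.shape
    gh, gw = h // patch, w // patch

    # random binary mask over the patch grid: 1 = masked, 0 = visible
    mask = (torch.rand(n, 1, gh, gw, device=image.device) < mask_ratio).float()
    mask_px = F.interpolate(mask, size=(h, w), mode="nearest")

    # the encoder only sees visible content; masked pixels are zeroed out
    visible = image * (1.0 - mask_px)
    recon = decoder(encoder(visible))          # reconstruction at input resolution

    # loss is measured only on the masked regions, forcing context completion
    loss = ((recon - image) ** 2 * mask_px).sum() / mask_px.sum().clamp(min=1.0)
    return loss
```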

2.3.3. SHSA Detection Head

The YOLOv11 detection algorithm adopts a decoupled design for the detection head, but only extracts object category and position features through multiple convolutions of different dimensions. However, this design has the following disadvantages: it entails high computational resource consumption, making it difficult to achieve optimal performance on resource-constrained devices; and each detection head processes feature maps independently, lacking information interaction, which affects detection effectiveness. To address these limitations, we integrate the Single-Head Self-Attention (SHSA) mechanism into our detection head design. SHSA is an efficient attention mechanism that demonstrates unique advantages in visual tasks by reducing computational redundancy while maintaining performance.
Figure 8 illustrates a comparison of different single-head self-attention (SHSA) designs. The SHSA module offers several key advantages: reduced computational redundancy by applying single-head attention to only a portion of the input channels, avoiding the overhead of multi-head mechanisms; lower memory access costs through partial channel processing, enabling more efficient operation on both GPU and CPU platforms; and enhanced performance by combining global and local information in parallel, allowing more blocks to be stacked under the same computational budget while improving model accuracy. The arrows indicate the direction of data flow through the contraction, attention, and expansion operations. The color shading is used solely to differentiate the input, intermediate, and output feature representations, where yellow denotes the input channels, green represents the processed or expanded channels, and blue blocks correspond to attention operations.
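A minimal PyTorch sketch of the SHSA idea is shown below: single-head attention is applied to only a fraction of the channels while the remaining channels pass through unchanged, and both parts are then recombined by a pointwise projection. The channel split ratio and head dimension are assumptions rather than the exact values used in this work.

```python
import torch
import torch.nn as nn

class SHSA(nn.Module):
    """Single-Head Self-Attention over a fraction of the input channels, with
    the rest passed through unchanged (a sketch; split ratio and head
    dimension are assumed values)."""
    def __init__(self, dim, attn_ratio=0.25, head_dim=32):
        super().__init__()
        self.attn_dim = max(head_dim, int(dim * attn_ratio))
        self.pass_dim = dim - self.attn_dim
        self.head_dim = head_dim
        self.scale = head_dim ** -0.5
        self.qkv = nn.Conv2d(self.attn_dim, head_dim * 2 + self.attn_dim, 1)
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x):
        # x: (N, C, H, W); split into an attended part and a pass-through part
        n, _, h, w = x.shape
        x_attn, x_pass = torch.split(x, [self.attn_dim, self.pass_dim], dim=1)

        q, k, v = torch.split(self.qkv(x_attn),
                              [self.head_dim, self.head_dim, self.attn_dim], dim=1)
        q, k, v = q.flatten(2), k.flatten(2), v.flatten(2)   # (N, *, H*W)

        attn = (q.transpose(1, 2) @ k) * self.scale          # (N, HW, HW)
        attn = attn.softmax(dim=-1)
        out = (v @ attn.transpose(1, 2)).reshape(n, self.attn_dim, h, w)

        # recombine attended and untouched channels, then project
        return self.proj(torch.cat([out, x_pass], dim=1))
```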

3. Experimental Setup and Evaluation Metrics

3.1. Experimental Environment

The training process was carried out using PyTorch 2.0 (Meta Platforms, Inc., Menlo Park, CA, USA) on a high-performance local workstation configured with an Intel i9-13900K CPU (Intel Corporation, Santa Clara, CA, USA), 64 GB of RAM, and an NVIDIA RTX 3090 GPU (NVIDIA Corporation, Santa Clara, CA, USA) with 24 GB of VRAM. The software environment included CUDA 11.8 (NVIDIA Corporation, Santa Clara, CA, USA) and Python 3.10 (Python Software Foundation, Wilmington, DE, USA). This setup ensured both computational stability and reproducibility across all training stages.
Following the data augmentation procedure detailed earlier, the dataset was expanded from the original 703 high-resolution images to a total of 4921 samples. Each image was annotated using the LabelMe toolkit (version 5.0.1, MIT CSAIL, Cambridge, MA, USA), with all persimmon instances manually labeled as either mature or immature. The complete dataset was randomly split into training, validation, and test sets in a 7:2:1 ratio to facilitate consistent model evaluation and avoid distributional bias.
Prior to training, all images were resized to 640 × 640 pixels to match standard input requirements of YOLO-based detectors while maintaining a balance between feature resolution and computational load. The training process spanned 300 epochs in total, allowing the model to reach stable convergence under full supervision.
The Adam optimizer was used throughout training, initialized with a learning rate of 0.001 [27]. Its adaptive gradient adjustment mechanism is particularly effective in addressing the sparse and localized gradient patterns that typically arise in small-object detection tasks. A cosine annealing schedule was adopted to gradually reduce the learning rate as training progressed, which promoted finer convergence during the later optimization phases [28].
A mini-batch size of 16 was selected to optimize GPU memory utilization while maintaining gradient stability. Early stopping was applied with patience of 50 epochs to prevent overfitting and ensure optimal model convergence across all experimental groups.
For the loss functions, Generalized Intersection over Union (GIoU) loss was used for bounding box regression [29]. GIoU provides a more robust optimization signal than traditional IoU-based losses, especially in scenes with occluded or closely packed targets. For the classification branch, binary cross-entropy loss was used to supervise the two-class maturity prediction task. This combination yielded stable training dynamics and strong discriminative performance, particularly under real-world orchard constraints involving blur, partial visibility, and complex lighting.
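For reference, the training setup described in this section maps onto an Ultralytics-style training call roughly as follows. The model YAML that swaps four ConvNeXtV2 modules into the YOLOv11 backbone and the dataset YAML are placeholders, not files distributed with this article.

```python
from ultralytics import YOLO

# "yolov11-4convnextv2.yaml" is a placeholder for a model definition that
# replaces the YOLOv11 backbone with four ConvNeXtV2 modules; "persimmon.yaml"
# is a placeholder dataset description (train/val/test paths, 2 classes).
model = YOLO("yolov11-4convnextv2.yaml")

model.train(
    data="persimmon.yaml",
    epochs=300,          # full training budget
    imgsz=640,           # images resized to 640 x 640
    batch=16,            # mini-batch size
    optimizer="Adam",    # Adam with initial learning rate 0.001
    lr0=0.001,
    cos_lr=True,         # cosine annealing schedule
    patience=50,         # early stopping patience
)
```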

3.2. Evaluation Metrics

We evaluated model performance using standard object detection metrics with corresponding mathematical formulations [21]. These metrics have been widely used in maturity detection applications, including fruit ripeness detection studies [2,3]. Precision measures the accuracy of positive predictions:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
where $TP$ represents true positives and $FP$ represents false positives. Recall measures the ability to find all positive instances:
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
where $FN$ represents false negatives, i.e., targets that are actually positive but were missed by the model. Recall (also called sensitivity) therefore reflects detection coverage: the proportion of all positive instances ($TP + FN$) in the dataset that the model correctly identifies. In object detection, where objects in the image are the positive instances, high recall means the model finds more of the objects that are present.
The mean average precision (mAP) at an IoU threshold of 0.5 is calculated as
$$\mathrm{mAP@0.5} = \frac{1}{N} \sum_{i=1}^{N} AP_i$$
where $N$ is the number of classes and $AP_i$ is the average precision for class $i$. The mAP@0.5:0.95 represents the mean AP over IoU thresholds from 0.5 to 0.95 in steps of 0.05:
$$\mathrm{mAP@0.5{:}0.95} = \frac{1}{10} \sum_{k=0}^{9} \mathrm{mAP@}(0.5 + 0.05k)$$
The Intersection over Union (IoU) threshold of 0.5 is particularly relevant for maturity detection applications where precise localization is critical for accurate assessment. Additionally, we report inference speed in Frames Per Second (FPS), parameter count, and computational complexity in FLOPs to assess deployment feasibility.
For small-object detection specifically, we focus on AP for small objects (objects with area smaller than $32^2$ pixels) as defined in the COCO evaluation protocol, which is particularly relevant for persimmon detection where fruits often occupy less than 5% of the total image area. Detailed small-object detection experiments under various illumination conditions are presented in Section 5.1.2.
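The metrics above can be computed from matched detections as in the sketch below, which shows box IoU, precision/recall from TP/FP/FN counts, and the averaging that yields mAP@0.5 and mAP@0.5:0.95 from per-class AP values. The precision–recall integration that produces each AP is assumed to be computed elsewhere and is omitted for brevity.

```python
import numpy as np

def iou(box_a, box_b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP), Recall = TP / (TP + FN)."""
    return tp / (tp + fp + 1e-9), tp / (tp + fn + 1e-9)

def map_50_95(ap_per_threshold_per_class):
    """ap_per_threshold_per_class: array of shape (10, n_classes) holding the
    AP of each class at IoU thresholds 0.5, 0.55, ..., 0.95. Returns
    (mAP@0.5, mAP@0.5:0.95): mean of the first row, mean over all rows."""
    ap = np.asarray(ap_per_threshold_per_class)
    return ap[0].mean(), ap.mean()
```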

4. Results and Analysis

4.1. Backbone Network Ablation Study

To evaluate the optimal architectural configuration for persimmon detection under challenging conditions, this experiment investigates the impact of varying numbers of ConvNeXtV2 modules integrated into the YOLOv11 backbone. Different module configurations were systematically tested to determine the optimal balance between feature representation capability and detection performance.
The comparison of detection results is shown in Table 2.
The experimental results demonstrate that the YOLOv11 model with four ConvNeXtV2 modules achieves optimal performance across all metrics. Compared with the baseline YOLOv11n (ConvNeXtV2 = 0), the four-module configuration shows substantial improvements: precision increases from 0.895 to 0.959 (+6.4%), recall from 0.655 to 0.837 (+18.2%), mAP@0.5 from 0.864 to 0.884 (+2.0%), and mAP@0.5:0.95 from 0.685 to 0.748 (+6.3%). Among all ConvNeXtV2 configurations tested, the four-module model consistently outperforms alternatives, demonstrating that this specific architecture provides the optimal balance for persimmon detection tasks.
On the core comprehensive metric mAP@0.5, the four-module model (0.884) leads the seven-module model (0.866) by 1.8 percentage points and the one-module (0.794), three-module (0.789), and five-module (0.782) models by roughly 9.0–10.2 percentage points. The advantage is even clearer on mAP@0.5:0.95: the four-module model (0.748) exceeds the second-best seven-module model (0.714) by 3.4 percentage points (a relative gain of about 4.8%) and the one-module (0.587), three-module (0.576), and five-module (0.561) models by 16.1–18.7 percentage points. This ablation confirms that four ConvNeXtV2 modules is the optimal choice among the tested configurations, achieving the best balance of recognition precision, target recall capability, and localization accuracy across IoU thresholds.

4.2. Comparative Experiments with Different Detection Models

To compare our model’s detection performance with current mainstream object detection models on this dataset, we conducted six comparison experiments with YOLOv5n, YOLOv8n [19], YOLOv9t [30], YOLOv10n, YOLOv11n, and YOLOv12n, and these models were evaluated based on the same dataset. The results are shown in Table 3.
Overall, in the comprehensive evaluation of blurred-target detection, the proposed YOLOv11-4ConvNeXtV2 model (with four ConvNeXtV2 modules) shows clear advantages. Its mAP@0.5:0.95 of 0.748 leads all comparison models, improving by 3.8 percentage points (a relative improvement of 5.4%) over the second-best model, YOLOv10n (0.710), and by 6.3 and 5.1 percentage points (relative improvements of 9.2% and 7.3%) over the baseline models YOLOv5n (0.685) and YOLOv8n (0.697), respectively.
Meanwhile, mAP@0.5 reaches 0.884, 1.4 percentage points higher than the best comparison model, YOLOv10n (0.870), confirming a sound precision–recall balance. This stability across IoU thresholds indicates that the model’s bounding box regression for blurred targets, small targets, and occluded scenes is consistently strong. The performance improvements demonstrate the effectiveness of the ConvNeXtV2 backbone integration, making the model especially suitable for maturity detection scenarios with strict requirements on persimmon detection robustness under challenging conditions.
Furthermore, we conducted comparative analysis of network training convergence and stability. The comparison networks were the same as those in Table 3, and we measured network training convergence and stability using the train/box loss metric. Box loss, short for bounding box loss, represents the loss between the predicted bounding box and the standard detection region after the neural network model is trained and learned. The results are shown in Figure 9.
As shown in the figure, our model demonstrates more robust recognition and detection of blurred targets during network training than the other models, with visibly smoother and more reliable loss curves. The converged box-loss values for YOLOv5, YOLOv8, YOLOv9, YOLOv10, YOLOv11, and YOLOv12 are 0.715, 0.644, 0.691, 1.635, 0.675, and 0.6629, respectively, with none of them dropping below 0.6, while our model ultimately converges to 0.488. This confirms that, within the same 300-epoch training budget, our model achieves at least a 15% improvement in the ability to extract blurred persimmon maturity features during training compared with the six baseline models; relative to YOLOv12 in particular, the improvement is 17.4%.
Although YOLOv10n achieves slightly better results among the standard YOLO series, YOLOv11n was selected in this study as the baseline due to its framework extensibility and architectural compatibility with further integration of ConvNeXtV2 modules. The modular design of YOLOv11n allows for seamless incorporation of ConvNeXtV2 blocks without requiring substantial architectural modifications, thereby facilitating the investigation of backbone replacement strategies.
Figure 9 illustrates the training and validation loss curves of the YOLOv11-4ConvNeXtV2 model. The solid blue lines represent the actual recorded results during each training epoch, while the dotted orange lines denote the smoothed trend of the corresponding metrics for clearer visualization. As shown in the figure, YOLOv11-4ConvNeXtV2 demonstrates a smooth and steady convergence trajectory, with the total loss approaching 0.5 during training. Notably, the validation distribution focal loss (val/dfl_loss) remains stable throughout the training process, suggesting that the model is robust to class imbalance and capable of maintaining discriminative learning even under blurred fruit boundaries and uneven sample distributions. This stability helps mitigate common failure cases such as missed detections and maturity misclassifications, particularly in occluded or low-contrast regions.

5. Discussion

Accurate maturity detection is crucial for persimmon crop harvesting. This study introduces YOLOv11-4ConvNeXtV2, an enhanced detection framework for persimmon ripeness identification. The model adopts ConvNeXtV2 as the backbone network, integrating the Fully Convolutional Masked Auto-Encoder (FCMAE) module for feature fusion and extraction [23]. This integration enhances blurred-target-feature extraction through context completion mechanisms. The Global Response Normalization (GRN) module suppresses feature extraction collapse while improving the model’s detection and discrimination capability for blurred targets. To address dataset limitations, a comprehensive persimmon dataset with sub-block segmentation organization is constructed, which enhances data diversity while preserving local structural information. Additionally, a Single-Head Self-Attention (SHSA) detection head mechanism is introduced, improving detection capability for small targets through enhanced feature representation. YOLOv11-4ConvNeXtV2 achieves 95.9% average recognition precision and an 83.7% recall rate, which compares favorably with other fruit ripeness detection approaches [3,11,12,15]. Compared to the YOLOv5n, YOLOv8n, YOLOv9t, YOLOv10n, YOLOv11n, and YOLOv12 models, it leads in detection precision, recall rate, and overall performance, making it suitable for challenging agricultural environments.

5.1. Quantitative Results and Robustness Analysis

5.1.1. Qualitative Comparison Under Challenging Orchard Scenarios

To more intuitively compare our model’s detection performance with current mainstream object detection models on this dataset, we conducted three image detection experiments with YOLOv8, YOLOv9, and YOLOv11.
Figure 10 presents qualitative comparisons under three representative test scenarios. The ground truth uses red bounding boxes, while light blue boxes represent model predictions for immature persimmons and dark blue boxes represent model predictions for mature persimmons. The sub-figures (d) through (h), (i) through (m), and (n) through (r) show enlarged views of the ground truth and each model’s predictions for the corresponding sections of scenarios (a), (b), and (c), respectively. In scenarios (a) and (b), it should be noted that images captured under a leaf canopy exhibit reddish coloration due to filtered lighting conditions, which may visually resemble mature persimmons. However, these fruits are actually immature, highlighting the challenge of maturity classification under non-uniform illumination. In Figure 10, scenario (a), the superior performance stems from FCMAE’s context completion mechanism operating at the feature level—when depth-of-field blur attenuates high-frequency spatial information, the masked auto-encoder pretraining enables ConvNeXt blocks to reconstruct missing edge details through learned spatial correlations between visible fruit patches and their surrounding context. Simultaneously, GRN’s global channel competition suppresses noise-dominated channels while amplifying those encoding consistent fruit signatures, ensuring that semantic fruit representations persist despite edge degradation.
In Figure 10, scenario (b), the enhanced discrimination capability results from GRN’s divisive normalization mechanism that establishes competitive channel dynamics—channels encoding coherent fruit characteristics (rounded contours, color consistency) receive higher global energy normalization weights, while those responding to fragmented leaf textures are systematically attenuated. This channel-level competition operates through L2-norm aggregation across spatial dimensions, enabling the model to distinguish globally consistent fruit evidence from locally scattered foliage distractors, fundamentally addressing the camouflage challenge through learned feature hierarchy rather than superficial pattern matching.
In Figure 10, scenario (c), the robustness under varying illumination conditions derives from ConvNeXt V2’s depthwise-MLP architecture combined with GRN’s global feature calibration—the depthwise convolutions preserve spatial locality, while the MLP expansion creates cross-channel interactions that learn illumination-invariant feature relationships. GRN then stabilizes these relationships by computing channel importance relative to the entire feature map’s energy distribution, preventing illumination-dependent channels from dominating the representation and ensuring that fruit detection remains consistent across lighting variations through normalized competitive feature dynamics.
These results reaffirm the practical impact of the architectural advantages detailed in Section 3: the enhanced feature diversity and representational depth of YOLOv11-4ConvNeXtV2 directly translate into improved detection reliability, even under non-ideal input conditions commonly encountered in real orchard environments.

5.1.2. Comparison of Small-Object Detection Results

Building upon the qualitative comparisons above, we next isolate and analyze the small-object case to ensure a clear assessment of performance under this specific challenge. To further evaluate the model’s performance on small objects under varying illumination conditions, we conducted comparative experiments with mainstream YOLO models. As shown in Figure 11, test images were captured under three representative lighting scenarios: (a) midday direct sunlight, (b) afternoon non-uniform illumination, and (c) morning uniform illumination. All targets in these scenarios are classified as small objects, with each object’s pixel area being less than 3% of the total image area.
In scenario (a), direct sunlight causes severe overexposure on fruit surfaces and backgrounds, weakening edge feature information of small persimmon targets. Traditional detection models (YOLOv8, YOLOv9, and YOLOv11) frequently exhibit missed detections in such scenarios, particularly for small targets in edge regions. In local region comparisons (e) through (h), YOLOv11-4ConvNeXtV2 accurately identifies these small targets with damaged edge features, demonstrating significantly superior detection performance compared to other models.
In scenario (b), although captured under non-uniform illumination conditions, the overall illumination intensity remains moderate, maintaining relatively high contrast between fruits and the background. All models achieve relatively good recognition performance under these conditions. However, in local regions (i) and (j), YOLOv11-4ConvNeXtV2 exhibits more accurate detection boxes with higher boundary fit. The confidence scores for the two small targets in this region are 0.89 and 0.83, respectively. In contrast, traditional models (YOLOv8, YOLOv9, YOLOv11), despite achieving high confidence on one target, show significantly lower confidence on the other target (0.91/0.29, 0.88/0.72, 0.88/0.83, respectively), revealing instability in their detection results.
In scenario (c), test images captured under uniform illumination conditions exhibit balanced overall brightness. However, the high similarity in color and texture between fruits and the background significantly increases detection difficulty. Traditional models are prone to multiple detection issues under these conditions. For instance, in local regions (p), (q), and (r), YOLOv8/YOLOv9/YOLOv11 all fail to identify small targets in the upper-left edge region. Additionally, region (q) shows misclassification of leaves as immature fruits. In contrast, YOLOv11-4ConvNeXtV2 (shown in (o)) effectively distinguishes fruits from background textures, producing more complete and accurate detection results.
From a quantitative result perspective, YOLOv11-4ConvNeXtV2 achieves detection counts closest to Ground Truth across all three illumination conditions, significantly outperforming other baseline models. These results demonstrate that YOLOv11-4ConvNeXtV2 exhibits stronger small-object recognition capability and superior generalization performance across three typical illumination conditions.

5.2. Real-Time Video Evaluation

To further assess the robustness of the proposed detection framework in real-world scenarios, a short video sequence of persimmons was captured under natural orchard conditions and directly fed into the trained model for inference (see Video S1). The results demonstrate that the model successfully detected persimmons partially occluded by foliage or other fruits, highlighting its capability to handle challenging visual conditions. Although the current work does not involve hardware deployment, this experiment indicates that the proposed approach can be readily extended to dynamic, real-time applications when integrated with an appropriate acquisition system.
As shown in Figure 12, the video-based experiment provides qualitative evidence of robust detection in dynamic orchard scenes. The green bounding boxes indicate immature persimmons detected by the model, with the associated numerical values representing their confidence scores. The red arrows point to occluded or partially visible fruits, and the red font annotations mark the corresponding visual conditions such as "Obscured by leaves" and "Hidden by fruit." These indicators highlight the model’s ability to correctly identify targets even when they are partially covered or located in visually complex regions.

6. Conclusions and Future Work

This study addresses the fundamental challenge of persimmon ripeness detection under adverse orchard conditions through the development of YOLOv11-4ConvNeXtV2, an enhanced detection framework that leverages context completion and channel competition mechanisms. Through a self-built persimmon target dataset, the performance of this network was trained and validated. The proposed YOLOv11-4ConvNeXtV2 model demonstrates significant improvements in detection accuracy, achieving 95.9% precision and 83.7% recall, while maintaining robust performance under challenging conditions such as motion blur, occlusion, and varying illumination. The integration of the ConvNeXtV2 backbone with FCMAE pretraining, Global Response Normalization, and Single-Head Self-Attention mechanisms provides a comprehensive solution for maturity detection applications.
Despite these promising results, several limitations remain. The relatively small dataset size limits the generalizability of the conclusions that can be drawn, as it may not fully demonstrate the generalization ability of the proposed model. Moreover, due to experimental constraints, additional data under diverse weather conditions were not collected, and no attempts were made to employ different color channels or optical filters, which could have further enriched the dataset. Although the proposed YOLOv11-4ConvNeXtV2 framework demonstrates effectiveness in orchard scenarios, a limitation of this study is that the evaluation was conducted only on the task-specific dataset. The absence of experiments on large-scale public benchmarks such as COCO restricts the extent to which the advantages of the proposed structure can be generalized. The integration of four ConvNeXt V2 modules into the YOLOv11 framework enhances feature extraction but inevitably adds computational overhead and increases the total parameter count, thereby raising model complexity without a dedicated parameter efficiency analysis.
Future research will focus on extending the framework to multi-modal sensing integration and developing domain adaptation strategies for broader maturity detection applications across different crop varieties and environmental conditions. Future work includes extending evaluation to large-scale public benchmarks such as COCO, to provide a broader and more rigorous comparison. We plan to explore more lightweight solutions to reduce computational requirements while maintaining detection performance, and investigate YOLOv10n integration as an alternative baseline for parameter efficiency optimization.

Supplementary Materials

The following are available online at https://www.mdpi.com/article/10.3390/ai6110284/s1, Video S1: PersimmonRipenessDetection.

Author Contributions

Conceptualization, B.Z. and X.Z.; methodology, B.Z. and X.Z.; software, B.Z.; validation, B.Z., Z.Z., and X.Z.; formal analysis, B.Z. and X.Z.; investigation, B.Z.; resources, X.Z.; data curation, B.Z.; writing—original draft preparation, B.Z.; writing—review and editing, B.Z. and X.Z.; visualization, B.Z.; supervision, X.Z.; project administration, X.Z.; funding acquisition, X.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Acknowledgments

The authors would like to thank the agricultural experts who contributed to the dataset annotation process and provided valuable insights into persimmon cultivation practices.

Conflicts of Interest

Zhaoyuan Zhang is the CEO of Shaanxi Xiaodong Aid Robot Science and Technology Corporation. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

AI    Artificial Intelligence
BCE    Binary Cross-Entropy
CIoU    Complete Intersection over Union
CNN    Convolutional Neural Network
ConvNeXt V2    Convolutional Next Version 2
FCMAE    Fully Convolutional Masked Auto-Encoder
FLOPs    Floating Point Operations
FPN    Feature Pyramid Network
FPS    Frames Per Second
GIoU    Generalized Intersection over Union
GRN    Global Response Normalization
IoU    Intersection over Union
mAP    Mean Average Precision
RGB    Red Green Blue
SHSA    Single-Head Self-Attention
YOLO    You Only Look Once

References

  1. Cuong, N.H.H. The Model for the Classification of the Ripeness Stage of Pomegranate Fruits in Orchards Using. Preprints 2021. [Google Scholar] [CrossRef]
  2. Hamza, R.; Chtourou, M. Comparative study on deep learning methods for apple ripeness estimation on tree. In Proceedings of the International Conference on Intelligent Systems Design and Applications, Online, 13–15 December 2021; Springer International Publishing: Cham, Switzerland, 2021; pp. 1325–1340. [Google Scholar]
  3. Cuong, N.H.H.; Trinh, T.H.; Meesad, P.; Nguyen, T.T. Improved YOLO object detection algorithm to detect ripe pineapple phase. Comput. Electron. Agric. 2023, 43, 1365–1381. [Google Scholar] [CrossRef]
  4. Xu, X.; Zhou, B.; Xu, Y.; Li, W. A target detection method for persimmon based on an improved fifth version of the you only look once algorithm. Eng. Appl. Artif. Intell. 2024, 137, 109139. [Google Scholar] [CrossRef]
  5. Suzuki, M.; Masuda, K.; Asakuma, H.; Takeshita, K.; Baba, K.; Kubo, Y.; Ushijima, K.; Uchida, S.; Akagi, T. Deep Learning Predicts Rapid Over-softening and Shelf Life in Persimmon Fruits. Hortic. J. 2022, 91, 408–415. [Google Scholar] [CrossRef]
  6. Cao, Z.; Mei, F.; Zhang, D.; Liu, B.; Wang, Y.; Hou, W. Recognition and Detection of Persimmon in a Natural Environment Based on an Improved YOLOv5 Model. Electronics 2023, 12, 785. [Google Scholar] [CrossRef]
  7. Hao, F.; Zhang, Z.; Ma, D.; Kong, H.; Li, Y.; Wang, J.; Chen, X.; Liu, S.; Wang, M.; Zhang, L.; et al. GSBF-YOLO: A lightweight model for tomato ripeness detection in natural environments. J.-Real-Time Image Process. 2025, 22, 47. [Google Scholar] [CrossRef]
  8. Dong, Y.; Qiao, J.; Liu, N.; He, Y.; Li, S.; Hu, X.; Zhang, C. GPC-YOLO: An Improved Lightweight YOLOv8n Network for the Detection of Tomato Maturity in Unstructured Natural Environments. Sensors 2025, 25, 1502. [Google Scholar] [CrossRef] [PubMed]
  9. Xiao, B.; Nguyen, M.; Yan, W.Q. Apple ripeness identification from digital images using transformers. Multimed. Tools Appl. 2024, 83, 7811–7825. [Google Scholar] [CrossRef]
  10. Wu, M.; Lin, H.; Shi, X.; Zhu, S.; Zheng, B. MTS-YOLO: A multi-task lightweight and efficient model for tomato fruit bunch maturity and stem detection. Horticulturae 2024, 10, 1006. [Google Scholar] [CrossRef]
  11. Tandion, P.; Chen, L.; Wang, M.; Liu, S.; Zhang, H.; Li, Y.; Chen, X.; Wang, J.; Liu, Y.; Chen, Z.; et al. Comparison of the Latest Version of Deep Learning YOLO Model in Automatically Detecting the Ripeness Level of Oil Palm Fruit. Agriculture 2024, 14, 1234. [Google Scholar]
  12. Wang, J.; Chen, L.; Liu, S.; Zhang, H.; Li, Y.; Chen, X.; Wang, M.; Liu, Y.; Chen, Z.; Wang, K.; et al. Melon ripeness detection by an improved object detection algorithm. Comput. Electron. Agric. 2023, 201, 107234. [Google Scholar]
  13. Tang, Y.; Qiu, J.; Zhang, Y.; Wu, D.; Cao, Y.; Zhao, K.; Zhu, L. Optimization strategies of fruit detection to overcome the challenge of unstructured background in field orchard environment: A review. Precis. Agric. 2023, 24, 1183–1219. [Google Scholar] [CrossRef]
  14. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
  15. Mutha, S.A.; Shah, A.M.; Ahmed, M.Z. Maturity detection of tomatoes using deep learning. SN Comput. Sci. 2021, 2, 441. [Google Scholar] [CrossRef]
  16. Woo, S.; Debnath, S.; Hu, R.; Chen, X.; Liu, Z.; Kweon, I.S.; Xie, S. ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 14376–14386. [Google Scholar]
  17. Bochkovskiy, A.; Wang, C.; Liao, H. YOLOv4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar] [CrossRef]
  18. Wang, C.; Bochkovskiy, A.; Liao, H. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  19. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLOv8. 2023. Available online: https://github.com/ultralytics/ultralytics [Google Scholar]
  20. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  21. Lin, T.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  22. Liu, Z.; Mao, H.; Wu, C.; Feichtenhofer, C.; Darrell, T.; Xie, S. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986. [Google Scholar]
  23. He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 16000–16009. [Google Scholar]
  24. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
  25. Howard, A.; Sandler, M.; Chu, G.; Chen, L.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
  26. Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. Int. Conf. Mach. Learn. 2019, 97, 6105–6114. [Google Scholar]
  27. Kingma, D.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  28. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
  29. Rezatofighi, S.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 658–666. [Google Scholar]
  30. Wang, C.; Liao, H.; Wu, Y.; Chen, P.; Hsieh, J.; Yeh, I. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv 2024, arXiv:2402.13616. [Google Scholar] [CrossRef]
Figure 1. Data examples demonstrating typical detection challenges in orchard scenes.
Figure 2. Sub-block segmentation augmentation process with mathematical description.
Figure 3. Comprehensive data augmentation techniques for model robustness enhancement.
Figure 4. Original YOLOv11 model framework.
Figure 5. YOLOv11-4ConvNeXtV2 detection model framework.
Figure 6. ConvNeXt V2 Block structure diagram.
Figure 7. Architecture of the FCMAE model.
Figure 8. Comparison of single-head attention designs. (a) Replaces convolution with single-head attention in ResNet’s bottleneck block. The contraction ratio is equal to the partial ratio in (c). (b) Uses full channels for single-head attention modules. All models are configured to have similar speeds. Our partial channel approach has the best speed–accuracy tradeoff.
Figure 9. Training and validation curves of the YOLOv11 + ConvNeXt model.
Figure 10. Qualitative comparison of detection performance under challenging orchard scenarios. (a) Blur tolerance assessment with motion blur and depth-of-field blur. (b) Camouflage and false positive mitigation in high-foliage-density regions. (c) Multi-target performance under dynamic lighting conditions.
Figure 11. Comparison chart of small-object detection results. Detection performance comparison under three typical illumination conditions: (a) midday direct sunlight, (b) afternoon non-uniform illumination, and (c) morning uniform illumination.
Figure 12. Detection results in video frames.
Table 1. Image quantities generated by different augmentation strategies.

Augmentation Method | Description | Images Generated
Contrast Enhancement | Linear pixel transform with A = 1.5, B = 1; simulates lighting variation | 703
Gaussian Noise Injection | Additive noise to mimic sensor/compression artifacts | 703
Random Rotation | Random-angle rotation to simulate camera view-angle changes | 703
Horizontal Flipping | Mirroring for object orientation diversity | 703
Random Erasing | Occlusion simulation by zeroing out random patches | 703
Sub-Block Segmentation Method | Original image sub-block segmentation and recombination | 703
Total | Original dataset + 6 augmentation methods | 4921
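To make the augmentation settings in Table 1 easier to reproduce, the following Python sketch (using OpenCV and NumPy) illustrates the six transformations. It is a minimal illustration rather than the authors' released pipeline: only the contrast parameters A = 1.5 and B = 1 come from Table 1, while the noise level, rotation range, erasing fraction, 2 × 2 grid, and random recombination rule for the sub-block method are assumptions, and the corresponding bounding-box label updates are omitted.

# Minimal sketch of the Table 1 augmentations (not the authors' released code).
# Only A = 1.5 and B = 1 are taken from Table 1; all other parameters are
# illustrative assumptions, and label/box handling is omitted.
import cv2
import numpy as np

def contrast_enhancement(img, a=1.5, b=1):
    """Linear pixel transform g = a * f + b (Table 1: A = 1.5, B = 1)."""
    return cv2.convertScaleAbs(img, alpha=a, beta=b)

def gaussian_noise(img, sigma=10.0):
    """Additive Gaussian noise to mimic sensor/compression artifacts."""
    noise = np.random.normal(0.0, sigma, img.shape).astype(np.float32)
    return np.clip(img.astype(np.float32) + noise, 0, 255).astype(np.uint8)

def random_rotation(img, max_deg=15):
    """Rotate about the image centre by a random angle (range assumed)."""
    h, w = img.shape[:2]
    angle = np.random.uniform(-max_deg, max_deg)
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(img, m, (w, h), borderMode=cv2.BORDER_REFLECT)

def horizontal_flip(img):
    """Mirror the image left to right."""
    return cv2.flip(img, 1)

def random_erasing(img, frac=0.1):
    """Zero out a random patch to simulate occlusion (patch size assumed)."""
    out = img.copy()
    h, w = out.shape[:2]
    eh, ew = int(h * frac), int(w * frac)
    y = np.random.randint(0, h - eh)
    x = np.random.randint(0, w - ew)
    out[y:y + eh, x:x + ew] = 0
    return out

def sub_block_recombination(img, grid=2):
    """Split into a grid of sub-blocks and recombine in a permuted order
    (illustrative stand-in for the paper's sub-block segmentation method)."""
    h, w = img.shape[:2]
    bh, bw = h // grid, w // grid
    blocks = [img[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw]
              for r in range(grid) for c in range(grid)]
    order = np.random.permutation(len(blocks))
    rows = [np.hstack([blocks[order[r * grid + c]] for c in range(grid)])
            for r in range(grid)]
    return np.vstack(rows)

# Example usage (file name is hypothetical): one augmented copy per method,
# matching the "703 images per method" counts in Table 1 when applied per image.
# img = cv2.imread("persimmon.jpg")
# augmented = [contrast_enhancement(img), gaussian_noise(img), random_rotation(img),
#              horizontal_flip(img), random_erasing(img), sub_block_recombination(img)]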
Table 2. Ablation study on the number of ConvNeXt V2 modules.

Number of ConvNeXt V2 Modules | Precision | Recall | mAP@0.5 | mAP@0.5:0.95
0 | 0.895 | 0.655 | 0.864 | 0.685
1 | 0.896 | 0.677 | 0.794 | 0.587
3 | 0.848 | 0.710 | 0.789 | 0.576
4 | 0.959 | 0.837 | 0.884 | 0.748
5 | 0.871 | 0.698 | 0.782 | 0.561
7 | 0.957 | 0.806 | 0.866 | 0.714
Table 3. Comparison of our approach with mainstream models.

Methods | mAP@0.5 | mAP@0.5:0.95
YOLOv5n | 0.864 | 0.685
YOLOv8n | 0.857 | 0.697
YOLOv9t | 0.851 | 0.693
YOLOv10n | 0.870 | 0.710
YOLOv11n | 0.855 | 0.689
YOLOv12n | 0.862 | 0.698
YOLOv11-4ConvNeXtV2 (ours) | 0.884 | 0.748
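As a quick arithmetic check of Table 3, the mAP@0.5:0.95 margin of YOLOv11-4ConvNeXtV2 over the strongest (YOLOv10n) and weakest (YOLOv5n) baselines in the table works out as

\Delta_{\min} = 0.748 - 0.710 = 0.038 \ \text{(vs. YOLOv10n)}, \qquad \Delta_{\max} = 0.748 - 0.685 = 0.063 \ \text{(vs. YOLOv5n)},

that is, an advantage of 3.8 to 6.3 percentage points over the compared YOLO variants.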
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

