1. Introduction
Jixin fruit, also known as Jinxiu Begonia, Dian thorn olive, Manban tree, and Stenosepalous Pittosporum (Latin name
Xantolis stenosepala), is a type of pocket apple and an improved variety of Hanfu apple. It is named for the conical, chicken-heart shape of its fruit and combines ornamental and edible value. The
Jixin fruit variety exhibits excellent yield characteristics, with yields reaching 2000–2500 kg per mu (approximately 30,000–37,500 kg/ha) during peak production periods. At wholesale market prices of approximately 10–15 CNY/kg, the per-mu economic benefit can exceed 10,000 CNY, making it an important income-generating industry for rural communities in northern China. It is one of the regional specialty fruit crops in Northeast and Northwest China. Currently, the harvesting of
Jixin fruit mainly relies on traditional manual methods. This production mode, which is highly dependent on manpower, has become a common bottleneck restricting the development of the specialty fruit industry [
1]. With the continuous growth of market demand,
Jixin fruit production faces multiple challenges, such as uneven fruit maturity, inefficient picking, and rising labor costs, which lead to increased post-harvest losses and unstable product quality and seriously restrict the economic benefits and sustainable development of the industry. In recent years, machine vision technology has been widely applied in fruit and vegetable harvesting, and various fruit-picking robots have been introduced to the market [
2]. However, existing picking robot systems generally suffer from high equipment costs, high system complexity, and poor adaptability to specific orchard environments, making it difficult to promote their application in small and medium-scale orchards. Therefore, this study aims to develop an efficient, accurate, and lightweight
Jixin fruit maturity recognition model to achieve automatic grading of
Jixin fruit maturity through deep learning methods, verify the recognition performance of the lightweight model in actual orchard scenarios, and provide theoretical support and technical reference for algorithm optimization and mobile application of intelligent picking systems in the future.
In recent years, domestic and foreign scholars have conducted extensive research on fruit segmentation and maturity assessment, including traditional methods and currently popular deep learning segmentation methods. In the field of fruit segmentation, traditional methods [
3,
4] primarily rely on image processing and feature extraction algorithms. Feng et al. [
5] pointed out that although traditional machine learning-based technologies have seen improvements in speed, accuracy, and robustness, they remain sensitive to abnormal data inputs, require pre-set parameters, and the final classification performance is closely tied to parameter configurations. Moreover, current mainstream image segmentation and classifier solutions based on traditional machine learning are often tailored to specific scenarios, lacking universality and performing poorly in multi-class classification tasks. Payne et al. [
6] conducted detailed research on the assessment of mango maturity and found that the color space threshold segmentation technology is extremely sensitive to ambient light changes. They specifically demonstrated that the method based on color space threshold would experience a sharp decline in performance under outdoor non-uniform lighting conditions, which confirmed the fragility of traditional methods when dealing with “color gradients” and “lighting variations.” Tian et al. [
7], in their review of apple recognition, concluded that although algorithms relying on contour analysis and template matching can identify individual apples, they fundamentally lack the ability to distinguish the complex boundaries of densely clustered apples. Traditional computer vision cannot correctly separate touching fruits, which is considered a major obstacle to deploying robust automation systems in dense orchards. These studies highlight the inherent fragility of traditional image processing methods, necessitating more adaptive and robust solutions for reliable fruit segmentation.
With the rapid development of deep learning technology, semantic segmentation methods based on Convolutional Neural Networks (CNNs) [
8,
9,
10] have achieved remarkable progress in the field of image processing, and fruit maturity detection is no longer limited to traditional methods. Xie et al. [
11] proposed ECD-DeepLabv3+, a lightweight semantic segmentation model for postharvest maturity detection of sugar apples. The model employs MobileNetV2 as the backbone network and integrates ECA and CA attention modules with a Dense ASPP structure, achieving an mIoU of 89.95%. However, this method was validated only on a self-constructed dataset with three maturity levels, and its generalization capability requires further verification. Nuanmeesri et al. [
12] proposed a Hybrid Attention Convolutional Neural Network (HACNN) for avocado maturity classification, combining spatial, channel, and self-attention modules. The model achieved a test accuracy of 91.25% with a memory footprint of 59.81 MB and an inference time of 280.67 ms. However, this method was validated on a single fruit species only, exhibiting limited cross-variety generalization capability and presenting trade-offs between high accuracy and lightweight design. To improve the model’s lightweight and real-time performance, Chen et al. [
13] proposed a lightweight semantic segmentation model based on an improved DeepLabv3+ for young plum stem recognition and picking point localization. The model adopts MobileNetV2 as the backbone network and integrates CBAM attention modules with a DenseASPP structure, achieving an MIoU of 86.13% and a picking point localization success rate of 88.8%. However, this method was validated on a specific fruit only, and its performance under extreme occlusion conditions requires improvement. Hou et al. [
14] proposed LM-DeepLabV3+ on this basis, which combines multi-scale feature interaction modules with lightweight attention mechanisms, achieving a significant reduction in model complexity while maintaining high accuracy. Liu et al. [
15] proposed MFA-DeepLabV3+, which introduces SE modules in the decoder structure. This method improves the feature expression capability to some extent, but its spatial attention capability is relatively limited. In terms of attention mechanisms, Cao et al. [
16] replaced the YOLOv5n backbone network with FasterNet and integrated MobileViT, CBAM attention mechanisms, and the SPPELAN module, achieving a detection precision of 98.94% and an mAP of 99.43%. However, the detection frame rate was only 16.61 FPS, indicating poor real-time performance, and the model size increased to 53.22 MB, limiting the lightweight effect. Wang et al. [
17] proposed the ECA-Net module, which achieves channel feature weighting at minimal parameter cost, providing an efficient alternative for lightweight models. In terms of feature fusion, Chen et al. [
18] proposed FAFNet, which adopts cross-layer feature fusion concepts and achieves efficient integration of multi-modal data, validating the advantages of feature hierarchical fusion in complex semantic segmentation.
In summary, existing research in fruit maturity segmentation still exhibits several limitations. Although current lightweight models reduce computational requirements, their segmentation accuracy degrades in complex scenarios, making it difficult to balance real-time performance with accuracy. Furthermore, traditional ASPP modules employ dilated convolutions with fixed dilation rates, which cannot adaptively capture semantic information across different fruit scales, particularly limiting performance in scenarios with significant fruit size variations. Additionally, existing decoder architectures suffer from information loss during high-level and low-level feature fusion, resulting in difficulties in precisely segmenting ambiguous boundaries between semi-ripe and ripe fruits caused by color gradients. To address these issues, the main objectives of this study are as follows: (1) to develop a lightweight fruit maturity segmentation model with fewer than 10 M parameters capable of real-time operation on resource-constrained embedded devices; (2) to design a multi-scale feature perception module to enhance the model’s recognition capability for fruits of different scales and maturity stages; (3) to optimize high-level and low-level feature fusion strategies to improve segmentation accuracy in color gradient regions and ambiguous boundary scenarios.
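The fixed-dilation-rate limitation of conventional ASPP noted above can be seen in a minimal 1-D sketch of atrous convolution (illustrative only; the actual ASPP applies parallel 2-D branches over feature maps):

```python
import numpy as np

# Atrous (dilated) 1-D convolution: a kernel of size k with dilation d
# samples inputs at stride d, giving a receptive field of (k-1)*d + 1.
def dilated_conv1d(x, kernel, dilation):
    k = len(kernel)
    span = (k - 1) * dilation + 1          # receptive field of one output
    out = np.empty(len(x) - span + 1)
    for i in range(len(out)):
        out[i] = sum(kernel[j] * x[i + j * dilation] for j in range(k))
    return out

x = np.arange(16, dtype=float)
kernel = np.array([1.0, 1.0, 1.0])

# The same 3-tap kernel covers 3, 7, and 13 inputs at rates 1, 3, and 6,
# but each rate is fixed at build time -- it cannot adapt to fruit scale.
for d in (1, 3, 6):
    print(d, (len(kernel) - 1) * d + 1, dilated_conv1d(x, kernel, d)[:3])
```

Because the rates are hyperparameters chosen in advance, a scale that falls between them is covered by none of the branches, which is the gap an adaptive multi-scale module aims to close.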
Based on this, this study proposed the MA-DeepLabV3+ model. The specific improvements are as follows: the backbone of DeepLabV3+ is replaced with MobileNetV2 [
19] for lightweighting; the Atrous Spatial Pyramid Pooling (ASPP) module [
20] is replaced with a Multiscale Self-Attention Module (MSAM) [
21] to enable cross-scale semantic interaction and target perception; and an Attention and Convolution Fusion Module (ACFM) [
22] is introduced in the decoding stage, strengthening the connection between high- and low-level features through attention-guided cross-layer fusion and thereby improving boundary detail and small-target recognition while remaining lightweight. This research provides technical support for the intelligent harvesting of
Jixin fruit. The lightweight design of the proposed MA-DeepLabV3+ model greatly reduces the hardware cost and energy consumption of intelligent harvesting systems, providing a feasible solution for promoting the digital transformation of niche specialty fruit industries. Through sufficient verification on the self-built
Jixin fruit dataset, the model achieves an mIoU of 86.13% while reducing the number of parameters to 5.58 M and computational cost to 74.64 GFLOPs, achieving an optimal balance between accuracy and efficiency. This achievement lays an algorithmic foundation for the future development of
Jixin fruit intelligent picking robot systems and can provide a reference for maturity recognition tasks of other niche fruits.
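As a rough illustration of the attention-guided cross-layer fusion idea behind the ACFM decoding step, the NumPy sketch below reweights concatenated low- and high-level features with a channel gate. The shapes, pooling, and sigmoid gating are illustrative assumptions, not the paper's exact module:

```python
import numpy as np

# Sketch of attention-guided cross-layer feature fusion: a channel-attention
# vector computed from the concatenated features reweights each channel
# before fusion. All names and shapes here are hypothetical.
rng = np.random.default_rng(0)

low  = rng.standard_normal((24, 32, 32))   # low-level encoder features (C1,H,W)
high = rng.standard_normal((40, 32, 32))   # upsampled high-level features (C2,H,W)

concat  = np.concatenate([low, high], axis=0)   # (C1+C2, H, W)
pooled  = concat.mean(axis=(1, 2))              # global average pool -> (C1+C2,)
weights = 1.0 / (1.0 + np.exp(-pooled))         # sigmoid gate per channel
fused   = concat * weights[:, None, None]       # channel reweighting

print(fused.shape)  # (64, 32, 32)
```

The gate lets the decoder emphasize whichever channels (fine edges from the low level, semantics from the high level) are informative for the current image, rather than fusing them with fixed weights.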
3. Results
3.1. Performance Analysis of the Improved Model
Figure 8 and
Figure 9 show the training and validation performance curves of the MA-DeepLabV3+ model. As can be seen from
Figure 8, as the number of iterations increases, the loss values of both the training set and validation set gradually decrease and tend to stabilize, with the fastest decline rate in the range of 0–25 iterations. When the number of iterations is in the range of 150–200, the loss value changes little and basically tends to stabilize, reaching a convergent state.
At the same time, as shown in
Figure 9, the mIoU on the validation set increases steadily. Although segmentation accuracy is relatively low in the early stage, continued training improves the model's feature extraction and target recognition capabilities and, with them, its segmentation performance, indicating that the MA-DeepLabV3+ model generalizes well.
3.2. Comparison Experiments of Different Backbone Networks
When performing
Jixin fruit maturity segmentation tasks, the backbone network’s ability to extract multi-scale features such as texture, color, and morphology of fruits is key to model performance and efficiency. Since this study has high requirements for lightweight and real-time performance, robustness under complex lighting, occlusion, and background interference must also be taken into consideration. Therefore, it is necessary to conduct backbone network comparison experiments based on the DeepLabV3+ framework. To verify the effectiveness of the backbone network, this study compared the performance of different networks in terms of segmentation accuracy and lightweight metrics. The selected networks included ResNet50, VGG [
27], Xception, ShuffleNetV2 [
28], and MobileNetV2, and experimental control groups were constructed to compare their actual performance in
Jixin fruit maturity segmentation. Comparison metrics include mIoU, mPA, GFLOPs, number of parameters, and F1-Score. The experimental results are shown in
Table 3.
As can be seen from
Table 3, although VGG16 achieved the highest mIoU (86.19%) and the highest mPA (91.67%), its computational cost of 152.71 G far exceeds the other networks, with memory usage as high as 128.32 MB, an inference time of 32.68 ms, and only 30.6 FPS. These serious efficiency problems make VGG16 unsuitable for practical
Jixin fruit maturity segmentation and deployment in resource-constrained scenarios such as mobile devices and drones. ResNet50, a deep residual network, achieved an mIoU of 85.87% with 39.64 M parameters, memory usage of 151.19 MB, an inference time of 18.52 ms, and 54.0 FPS; its performance is relatively strong, but its large model size limits practical application. Xception, the backbone of the original DeepLabV3+, uses depthwise separable convolutions and achieved an mIoU of 85.45% with 22.86 M parameters, memory usage of 87.20 MB, an inference time of 24.31 ms, and 41.1 FPS, striking a reasonable balance between accuracy and efficiency. ShuffleNetV2 was the lightest network, with only 5.37 M parameters, memory usage of only 20.49 MB, an inference time of 9.76 ms, and 102.4 FPS, demonstrating excellent real-time capability; however, its mIoU dropped to 82.68% and its mPA to 87.92%, which may not meet accuracy requirements in practical applications. In contrast, the model with MobileNetV2 as the backbone achieves lightweighting while maintaining relatively high segmentation accuracy. Through its inverted residual structure and linear bottleneck layers, it requires only 5.81 M parameters and 22.17 MB of memory, with an inference time of 10.58 ms, 94.5 FPS, and an mIoU of 85.32%, achieving a good balance between lightweight design and accuracy; its inference speed improved by 56.5% compared to Xception, making it the best choice of lightweight backbone.
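The reported 56.5% speedup can be reproduced directly from the inference times in Table 3, interpreting it as the relative reduction in per-frame latency:

```python
# Relative latency reduction of the MobileNetV2 backbone vs Xception,
# using the inference times reported in Table 3.
xception_ms  = 24.31
mobilenet_ms = 10.58

speedup = (xception_ms - mobilenet_ms) / xception_ms
print(f"{speedup:.1%}")  # 56.5%
```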
3.3. Comparison Experiments of Different Semantic Segmentation Models
To further verify the superior performance of MA-DeepLabV3+, this study compared it with U-Net [
29], PSPNet [
30], SegFormer [
31], and DeepLabV3+. Evaluation metrics included mIoU, mPA, F1-Score, number of parameters, and GFLOPs. Experimental results are shown in
Table 4.
From
Table 4, the semantic segmentation models differ markedly in performance on the
Jixin fruit maturity segmentation task. U-Net, a classic encoder-decoder architecture, achieved an mIoU of 88.09%, an mPA of 93.55%, and an F1-Score of 88.10%, demonstrating strong performance; however, its 24.89 M parameters, 451.74 GFLOPs, memory usage of 94.97 MB, inference time of 38.24 ms, and only 26.1 FPS impose a computational overhead that makes it difficult to deploy on resource-constrained devices. PSPNet, although a representative classic model, had an excessively large parameter count of 46.71 M, memory usage of 178.24 MB, an inference time of 31.47 ms, and 31.7 FPS, and its mIoU of 84.1% was also relatively low; neither its overall performance nor its size meets the requirements of this study. SegFormer adopted a lightweight design with only 3.71 M parameters and 13.55 GFLOPs, memory usage of only 14.15 MB, an inference time of 7.82 ms, and 127.8 FPS, giving it excellent real-time capability, but its segmentation accuracy was seriously insufficient, with an mIoU of only 75.97%, making it difficult to meet actual requirements. The original DeepLabV3+ achieved an mIoU of 85.45% and an mPA of 90.60%, demonstrating good segmentation performance, but its 54.71 M parameters, computational cost of 166.86 G, memory usage of 208.76 MB, inference time of 24.31 ms, and 41.1 FPS limit its deployment on mobile devices.
In contrast, the MA-DeepLabV3+ model proposed in this study achieved an mIoU of 86.13%, second only to U-Net's 88.09%, while its mPA of 91.29% and F1-Score of 90.05% were the best among all compared models. While maintaining high accuracy, its parameter count was reduced to 5.58 M and its computational cost to 74.64 GFLOPs, with memory usage of only 21.29 MB, an inference time of 12.36 ms, and 80.9 FPS; its inference speed improved by 49.2% compared to the original DeepLabV3+, meeting the lightweight requirements. These results demonstrate that MA-DeepLabV3+ can segment Jixin fruit maturity with high precision while remaining lightweight, is capable of real-time deployment on mobile and embedded platforms, and achieves a balance among accuracy, speed, and model size, giving it good engineering application value.
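A quick consistency check, under the assumption that the reported FPS values are simply 1000 divided by the per-frame latency in milliseconds, confirms the Table 4 figures and the 49.2% latency reduction:

```python
# Verifying that FPS = 1000 / latency_ms reproduces the reported values,
# and recomputing the relative latency reduction vs the original DeepLabV3+.
latencies_ms = {"U-Net": 38.24, "PSPNet": 31.47, "SegFormer": 7.82,
                "DeepLabV3+": 24.31, "MA-DeepLabV3+": 12.36}
reported_fps = {"U-Net": 26.1, "PSPNet": 31.7, "SegFormer": 127.8,
                "DeepLabV3+": 41.1, "MA-DeepLabV3+": 80.9}

for name, ms in latencies_ms.items():
    assert abs(1000.0 / ms - reported_fps[name]) < 0.15, name

reduction = (24.31 - 12.36) / 24.31
print(f"{reduction:.1%}")  # 49.2%
```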
3.4. Ablation Study
To verify the feasibility of MA-DeepLabV3+, this study conducted ablation experiments to evaluate the effect of each individual component. Taking DeepLabV3+ as the baseline model, improvements were introduced step by step. The experimental settings are as follows: the first group used the original DeepLabV3+; the second group replaced the backbone with MobileNetV2, with the rest unchanged; the third group replaced the ASPP part with the MSAM module (the backbone remained MobileNetV2, as in all subsequent groups); the fourth group introduced only the ACFM module in the feature fusion stage; and the fifth group introduced both modules. Evaluation metrics were mIoU, mPA, F1-Score, number of parameters, and GFLOPs. The experimental results are summarized in
Table 5.
From
Table 5, after replacing the backbone with MobileNetV2 in the second group of experiments, the parameter count of the DeepLabV3+ model decreased from 54.709 M to 5.814 M, and GFLOPs decreased from 166.858 G to 52.884 G. However, the performance loss of the model is almost negligible, with mIoU, mPA, and F1-Score of 84.95%, 90.40%, and 89.02%, respectively, proving that MobileNetV2 is a reasonable backbone network choice.
Compared with the MobileNetV2-backbone DeepLabV3+, the third group of experiments (MSAM only) further reduced the parameter count to 5.112 M, although GFLOPs rose slightly to 59.025 G. Notably, the mIoU dropped to 82.95%, a decrease of 2.0 percentage points relative to the MobileNetV2 baseline. This indicates that although the MSAM module's multi-branch, multi-scale feature extraction and self-attention mechanism can enhance the multi-scale expression of features, using it alone may slightly degrade performance for lack of an effective feature fusion mechanism. MSAM's lightweight design remains an advantage for model efficiency, but it must be combined with an appropriate fusion strategy to realize its full benefit. The fourth group evaluated the independent contribution of the ACFM module: adding ACFM for feature fusion on top of MobileNetV2 yielded 6.28 M parameters and 68.485 GFLOPs, and the mIoU rose to 85.70%, an improvement of 0.75 percentage points over the second group and even above the original baseline's 85.45%, with an mPA of 90.94% and an F1-Score of 89.66%. These results show that ACFM's adaptive channel attention effectively enhances feature fusion between the encoder and decoder, improves the model's ability to distinguish Jixin fruits of different maturities, and delivers a clear accuracy gain for a limited parameter increase, demonstrating good cost-effectiveness.
The fifth group of experiments was the complete MA-DeepLabV3+ model proposed in this paper, which integrated all three improved components: MobileNetV2, MSAM, and ACFM. MA-DeepLabV3+ achieved 5.581 M parameters and 74.635 GFLOPs, representing reductions of 89.8% and 55.3%, respectively, compared to the baseline. The mIoU reached 86.13%, the highest value among all configurations, with mPA and F1-Score of 91.29% and 90.05% respectively, indicating that the model achieves a good balance between precision and recall. The above results show that there is an obvious synergistic effect between the MSAM and ACFM modules. MSAM enhances the feature expression capability of the encoder through multi-scale feature extraction, while ACFM optimizes the feature fusion process of the decoder through adaptive channel attention. When used together, they not only compensate for the performance degradation when MSAM is used alone but also achieve better results than when ACFM is used alone, while maintaining low model complexity.
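The reported reductions follow directly from the Table 5 values:

```python
# Parameter and computation reductions of MA-DeepLabV3+ vs the baseline
# DeepLabV3+ (Table 5): 54.709 M -> 5.581 M parameters,
# 166.858 G -> 74.635 GFLOPs.
base_params, ours_params = 54.709, 5.581
base_flops,  ours_flops  = 166.858, 74.635

param_cut = (base_params - ours_params) / base_params
flops_cut = (base_flops - ours_flops) / base_flops
print(f"{param_cut:.1%} {flops_cut:.1%}")  # 89.8% 55.3%
```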
The ablation experiments systematically verified the effectiveness of each improvement measure proposed in this paper. The replacement of the backbone network significantly reduces model parameters and computational cost, meeting the lightweight requirements. At the same time, the multi-scale feature extraction of the MSAM module and the adaptive feature fusion mechanism of the ACFM module effectively compensate for the accuracy loss caused by lightweight, indicating that the improved architecture design in this paper can effectively integrate the advantages of multiple modules. The final MA-DeepLabV3+ shows significant advantages in the Jixin fruit maturity segmentation task.
3.5. Segmentation Performance Analysis of Different Maturity Categories
To deeply evaluate the recognition ability of each model on fruits of different maturities, this section separately analyzed the IoU metric for three categories: unripe, semi-ripe, and ripe, and revealed the confusion patterns between categories through confusion matrices.
3.5.1. IoU Comparison Analysis of Each Category
To comprehensively evaluate the segmentation performance of each model at different fruit maturity stages, this study analyzed the IoU [
32] metric separately for the unripe, semi-ripe, and ripe categories.
Table 6 shows the segmentation performance of five models across different maturity categories.
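For reference, the per-class IoU used throughout this section can be computed from label maps as below (the toy masks are illustrative, not the evaluation data):

```python
import numpy as np

# Per-class IoU: intersection over union of predicted and ground-truth
# pixel masks, computed independently for each class label.
def per_class_iou(pred, gt, num_classes):
    ious = []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        inter = int(np.logical_and(p, g).sum())
        union = int(np.logical_or(p, g).sum())
        ious.append(inter / union if union else float("nan"))
    return ious

# Toy 4x4 label maps: 0 = background, 1 = unripe, 2 = ripe
gt   = np.array([[0, 1, 1, 0], [0, 1, 1, 0], [2, 2, 0, 0], [2, 2, 0, 0]])
pred = np.array([[0, 1, 1, 0], [0, 1, 0, 0], [2, 2, 0, 0], [2, 0, 0, 0]])

print(per_class_iou(pred, gt, 3))  # [0.8, 0.75, 0.75]
```

The mIoU reported in the tables is the mean of these per-class values, which is why a single hard category (here, semi-ripe) can pull the overall score down noticeably.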
Unripe fruits usually have colors similar to the background, but their edge contours are relatively clear and their shapes relatively regular, so all models showed high recognition accuracy on this category. The classic U-Net reached 91%, a good baseline. PSPNet obtained 88% IoU, while the Transformer-based SegFormer reached only 75%, performing relatively weakly here. MA-DeepLabV3+ achieved the best IoU of 93%, an improvement of 1 percentage point over the original model; this benefit comes from its enhanced feature extraction and multi-scale information fusion, which better capture the subtle texture and edge information of unripe fruits.
Fruits in the semi-ripe stage show color-gradient characteristics, blurred boundaries, and reduced contrast with the background, which makes segmentation challenging; all models declined to varying degrees in this category. PSPNet's IoU fell sharply to 76%. SegFormer performed worst, obtaining only 63% IoU, a drop of 12 percentage points from the unripe category. MA-DeepLabV3+ maintained its lead: by introducing the attention mechanism and optimizing the feature fusion strategy, it obtained 81% IoU, itself a drop of 12 percentage points from the unripe category, while DeepLabV3+ and U-Net both reached a stable 79%. MA-DeepLabV3+ thus handles such complex visual features more effectively, maintaining high segmentation accuracy in blurred boundary regions.
Although ripe fruits have bright colors and high contrast with the background, they are often affected by irregular shapes, surface reflections, and occlusion overlaps, which degraded all models in this category. DeepLabV3+ and U-Net both obtained 77% IoU, 2 percentage points below MA-DeepLabV3+. PSPNet declined further to 75%, its worst result across the three categories. SegFormer recovered somewhat, reaching 71%, but remained significantly below the other models. MA-DeepLabV3+ again achieved the best performance, with an IoU of 79%, indicating that its improved decoder structure and multi-level feature fusion better handle segmentation in complex scenarios.
3.5.2. Confusion Matrix Analysis
Figure 10 presents the normalized confusion matrices of the original DeepLabV3+ and our MA-DeepLabV3+ on the test set, providing a detailed view of the inter-class confusions. The diagonal and off-diagonal elements respectively denote the per-class pixel accuracy and the misclassification rates between categories.
It can be seen that the original DeepLabV3+ shows good overall classification ability, with high diagonal values indicating that most pixels are correctly classified. However, the confusion rate between the semi-ripe and unripe categories is high, likely because the two share similar color and texture features; a small amount of misclassification also remains between the background and each category, showing that the model's feature extraction and boundary judgment still have room for optimization. In contrast, by introducing the MSAM and ACFM modules, MA-DeepLabV3+ achieves a marked improvement on the most challenging semi-ripe category, raising its accuracy from 93.6% to 96.5% (an improvement of 2.9 percentage points) and reducing the semi-ripe/unripe confusion rate from 4.3% to 1.7% (a decrease of 2.6 percentage points). Although the unripe and ripe categories decline slightly, the overall performance is more balanced, particularly in resolving inter-category confusion, and the clearer diagonal structure of the MA-DeepLabV3+ confusion matrix confirms the effectiveness of the new modules.
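The normalization used in Figure 10 divides each row of the pixel-count confusion matrix by its class total, so diagonals become per-class pixel accuracy and off-diagonals become misclassification rates. A sketch with illustrative counts (not the paper's data):

```python
import numpy as np

# Row-normalised confusion matrix: rows = true class, columns = predicted.
# The counts below are illustrative placeholders.
counts = np.array([[950,  30,  20],
                   [ 43, 936,  21],
                   [ 10,  25, 965]], dtype=float)

normalized = counts / counts.sum(axis=1, keepdims=True)
print(normalized.round(3))
```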
3.5.3. Segmentation Result Visualization Analysis
To more intuitively evaluate the actual performance of different models on the
Jixin fruit maturity segmentation task, this section conducts a visual comparison analysis of the segmentation results of typical samples. Representative images containing
Jixin fruits of different maturity categories were selected to show the segmentation effects of each model. Through visual comparison, the differences between different models in boundary accuracy, small target detection, and complex scene processing can be clearly observed.
Figure 11 shows the segmentation results of each model, highlighting their differences. DeepLabV3+ maintains high segmentation accuracy in mixed scenarios and can accurately distinguish different maturity categories, but it still confuses semi-ripe and ripe fruits in Image 1 and produces blurred edge segmentation in Image 3. U-Net handles multi-category mixed scenarios well but falls short in fine boundary segmentation; like DeepLabV3+, it confuses maturity categories, and its segmentation continuity for densely distributed fruits is not ideal. PSPNet's performance declines in mixed scenarios, especially in Images 3 and 4, where leaf occlusion or uneven maturity distribution causes misclassification. SegFormer performs poorly in complex mixed scenarios, with serious category confusion and clearly fragmented segmentation results across all images. In contrast, MA-DeepLabV3+ demonstrates excellent overall performance in complex mixed scenarios: it accurately identifies and segments fruits of all three maturity levels, with clear category distinction and precise boundaries, and it maintains high segmentation quality even when fruits are densely packed and mutually occluded. Although some fragmentation occurs in Image 1, its lightweight design and high inference speed offset this deficiency, enabling it to meet real-time processing requirements and fully verifying the superiority of this method in practical applications.