1. Introduction
Palm fruit is one of the most important oil-producing cash crops in the world [1,2]. Global palm fruit output has risen steadily in recent years, growing from 410.697 million tons in 2019 to 424.587 million tons in 2022, according to the statistical yearbook issued by the Food and Agriculture Organization of the United Nations (FAO) in 2024 [3]. Indonesia and Malaysia are the principal global producers of palm fruit; in 2022, Indonesia produced 257 million tons, representing 60.49% of the global total. Owing to relatively low degrees of industrialization, the primary production regions of Malaysia and Indonesia still rely predominantly on labor-intensive methods, and the adoption of advanced technology there remains limited. Accurate identification of palm fruit ripeness strongly influences oil quality and harvesting efficiency [4]. Ripeness evaluation has traditionally relied on manual visual inspection, which is inefficient and highly subjective [5]. Despite advances in computer vision for fruit detection [6,7], the complex scene characteristics of palm fruits on trees continue to pose significant recognition challenges. Moreover, agricultural scenarios require lightweight models to accommodate the computational limitations of edge devices.
The precise determination of palm fruit ripeness has gained significance as smart farming techniques in palm plantations have evolved rapidly in response to the rising demand for palm oil [8,9,10,11]. Some researchers have determined palm fruit ripeness via image processing techniques grounded in conventional machine learning. Taparugssanagorn et al. [12] proposed an image processing method based on relative entropy for non-destructive prediction of palm fruit ripeness; by applying the Kullback–Leibler distance to oil palm classification, their experiments demonstrated that the algorithm achieves high accuracy and rapid computation. Septiarini et al. [13] employed a machine vision approach to classify oil palm fruits into unripe, ripe, and partially ripe categories: color and texture features were first extracted from the palm fruit images, and an ANN was then applied to predict the ripeness class, achieving an accuracy of 98.3%. Alfatni et al. [14] acquired images under controlled LED lighting using a CCD camera; with a texture-based model combining BGLAM and ANN, they attained 93% accuracy in identifying palm fruit ripeness within the ROI2/ROI3 regions, with a processing time of only 0.4 s. Other researchers have employed deep learning techniques to evaluate palm fruit ripeness. Chang et al. [15] proposed a hybrid color-correction approach: ground-truth images are created from concurrently recorded spectral data and photographs of palm fruit, a color constancy model is trained by mapping the palm fruit images onto various ambient spectra, and a YOLOv8 model for ripeness identification is finally trained on the corrected images. Elwirehardja et al. [16] developed a lightweight CNN model capable of classifying palm fruit ripeness in Android-based applications; by unfreezing three convolutional blocks and employing a specialized data augmentation technique termed “Nine-Corner Cropping”, they improved the network’s ripeness classification accuracy. To advance research and application in deep learning-based palm fruit ripeness recognition, Suharjito et al. [17] released an open-source dataset specifically curated for this task. Collected at a palm oil mill in South Kalimantan, Indonesia, the dataset comprises six ripeness classes: unripe, underripe, ripe, overripe, empty bunches, and abnormal fruits; the dataset was also validated using deep learning methods. Salim et al. [18] employed a genetic algorithm to automatically search for the optimal learning rate, thereby optimizing YOLOv4-tiny and enhancing its accuracy in palm fruit ripeness detection. Still other researchers have used specific physical sensing techniques to explore the correlation between the intrinsic physical properties of palm fruit and its ripeness. Ali et al. [19] proposed an approach for detecting palm fruit ripeness by integrating computer vision with laser backscattering imaging, and notably established a strong correlation between palm fruit color values and oil content. Zolfagharnassab et al. [20] achieved high-precision classification of palm fruits into unripe, ripe, and overripe categories by capturing surface thermal variations via thermal imaging and combining these data with machine learning algorithms. Goh et al. [21] collected reflectance spectra from different regions of palm fruits using an optical spectrometer and observed superior ripeness classification performance for the anterior equatorial region; however, their study was limited by a small sample size and did not account for the effects of field variables such as lighting conditions and humidity on the spectral data.
Traditional machine learning often requires researchers to manually design and select discriminative color and texture features, based on domain knowledge or experience, to distinguish between the various stages of palm fruit ripeness. While integrating specific physical sensing technologies for recognizing palm fruit ripeness is a noteworthy approach, its adoption is constrained by prohibitive costs, portability limitations, and pronounced susceptibility to environmental interference [22,23]. Deep learning-based approaches have emerged as the predominant choice in smart agriculture [24,25,26,27], owing to their operational flexibility, scalability, and cost-effectiveness [28,29,30]. Deep learning demonstrates marked superiority for tasks that involve significant visual variability, such as identifying palm fruit ripeness. Nonetheless, attaining both robustness and lightweight recognition in complex environments, such as palm fruits on trees, remains challenging.
Palm fruit grows on trees enveloped by branches and foliage, and thus frequently appears in complex environments characterized by occlusion and uneven illumination. Static, stop-and-capture recognition is inefficient, while rapid acquisition for swift ripeness assessment introduces image motion blur due to camera movement. The “complex environment” described in this paper therefore mainly involves high-frequency interference under the palm tree canopy, including occlusion by branches and leaves, image motion blur, and uneven lighting. Moreover, the constrained computational resources of agricultural hardware require a more streamlined recognition network. This research presents an efficient network for recognizing palm fruit ripeness, enhanced principally through the integration of StarNet, to overcome the challenge of balancing a lightweight model with robust recognition. The main contributions of this paper are summarized as follows:
This study proposes a novel StarNet embedding-based efficient network for recognizing the ripeness of palm fruit in complex environments. By adopting the StarNet backbone architecture and introducing an optimized C2F-Star module in the neck structure, the network’s feature representation capability is effectively improved while model complexity is significantly reduced.
This research employs a combinatorial optimization approach to enhance the network for identifying palm fruit ripeness. First, StarNet and the LSCD detection head are employed to achieve a lightweight design. The C2F-Star module and DIoU loss are then employed to enhance and stabilize the network, delivering both light weight and robustness.
Evaluation under complex environmental conditions, including uneven lighting, motion blur, and occlusion, confirmed the model’s robustness. The model for recognizing the ripeness of on-tree palm fruit attains 4.5 GFLOPs, 1.37 M parameters, and a size of 2.85 MB.
The remainder of this paper is organized as follows: Section 2 introduces the palm fruit ripeness dataset and the designed on-tree palm fruit ripeness identification network. Section 3 compares and analyzes the results. Section 4 discusses the limitations of the research and possible directions for future work. Finally, Section 5 concludes the article.
3. Results
Training was performed on an NVIDIA GeForce RTX 3060 GPU and a 12th Gen Intel(R) Core(TM) i5-12400F CPU, under Windows 10 with torch 2.1.2+cu118. The computing equipment and environment configuration used in this study are summarized in Table 1. All models are trained in the same environment and on the same equipment. Each model is trained for 300 epochs with a batch size of 8 and an image size of 640. Using the SGD optimizer, the initial learning rate, momentum, and weight decay are set to 0.01, 0.937, and 0.0005, respectively. The remaining experiments use the same hyperparameters, except for the comparison experiments of different models, which use their default hyperparameters. It is important to note that all images used for evaluation in this section differ from those in the training and validation sets employed during model training.
3.1. Evaluation of Model
This study employs five metrics to evaluate the model’s performance. The first metric is the F1-score, the harmonic mean of precision and recall. Precision and recall are calculated using Equations (12) and (13):

$$\text{Precision} = \frac{TP}{TP + FP} \tag{12}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{13}$$

where TP (True Positive) denotes the number of correctly detected labeled palm fruits, FP (False Positive) denotes the number of predictions that do not correspond to a labeled palm fruit, and FN (False Negative) denotes the number of labeled palm fruit targets that go undetected. The F1-score is then calculated as

$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
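As a quick illustration of Equations (12) and (13) and the F1-score, the three metrics can be computed directly from detection counts. This is a minimal sketch; the function and variable names are ours, not taken from the paper’s codebase:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1-score from detection counts (Eqs. 12-13)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Example: 80 correct detections, 20 false detections, 20 missed fruits
p, r, f1 = precision_recall_f1(80, 20, 20)
```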
The second metric is GFLOPs (giga floating-point operations), which measures the computational cost of a single forward pass and hence the computational efficiency of the model. GFLOPs is the sum of the computational effort of all layers of the network, and its calculation principle can be expressed in the general form

$$\text{FLOPs} = \sum_{l=1}^{L} 2 \cdot C_{in}^{(l)} \cdot K_l^2 \cdot C_{out}^{(l)} \cdot H_l \cdot W_l$$

where the factor 2 accounts for the paired multiplication and addition operations, $C_{in}$ is the number of input channels, $K$ represents the convolution kernel size, $C_{out}$ is the number of output channels, and $H \times W$ represents the output feature map size. The summation traverses the $L$ convolutional layers in the network.
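The per-layer accounting above can be sketched in a few lines. This is an illustrative calculation with made-up layer shapes, not the profiling code used in the paper:

```python
def conv_flops(c_in, k, c_out, h_out, w_out):
    """FLOPs of one convolutional layer: 2 ops (multiply + add) per MAC,
    c_in * k^2 MACs per output element, c_out * h_out * w_out output elements."""
    return 2 * c_in * k * k * c_out * h_out * w_out

# Example: a 3x3 conv from 3 to 16 channels producing a 320x320 feature map
layer_flops = conv_flops(3, 3, 16, 320, 320)
# Network GFLOPs would sum conv_flops over all L layers, then divide by 1e9
gflops = layer_flops / 1e9
```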
The third metric is the number of parameters, which determines the model’s storage and memory footprint. The parameter count is the sum of the parameters of all network layers, and its general expression is

$$\text{Params} = \sum_{l=1}^{L} \left( W_l + b_l + BN_l \right)$$

where $W_l$ represents the number of convolution kernel weights, $b_l$ is the number of bias terms, and $BN_l$ is the number of BN layer parameters. The summation traverses the $L$ learnable layers in the network.
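The parameter count per layer follows the same pattern. A minimal sketch, under the assumption that BN contributes two learnable parameters (γ and β) per output channel and that running statistics are not counted:

```python
def conv_layer_params(c_in, k, c_out, bias=True, bn=True):
    """Parameters of one conv layer: kernel weights + biases + BN parameters."""
    weights = c_in * k * k * c_out        # W_l: convolution kernel weights
    biases = c_out if bias else 0         # b_l: one bias per output channel
    bn_params = 2 * c_out if bn else 0    # BN_l: gamma and beta per channel
    return weights + biases + bn_params

# Example: a 3x3 conv from 3 to 16 channels with bias and BN
total = conv_layer_params(3, 3, 16)
```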
The fourth evaluation metric is mAP@0.5, which represents the average of the per-class AP values across all categories, providing a more comprehensive reflection of the effectiveness of the object detection algorithm. mAP@0.5 can be calculated using the following formula:

$$\text{mAP@0.5} = \frac{1}{N} \sum_{i=1}^{N} AP_i$$

where $AP_i$ represents the area under the precision–recall curve for class $i$ at an IoU threshold of 0.5, and $N$ represents the number of detection categories.
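The AP term is the area under the per-class precision–recall curve. A simplified sketch using rectangle integration (production implementations such as the COCO evaluator additionally interpolate the precision envelope first):

```python
def average_precision(recalls, precisions):
    """Area under the PR curve; recalls must be sorted in ascending order."""
    ap, prev_recall = 0.0, 0.0
    for recall, precision in zip(recalls, precisions):
        ap += (recall - prev_recall) * precision  # rectangle of width delta-recall
        prev_recall = recall
    return ap

def mean_ap(per_class_aps):
    """mAP@0.5 = mean of per-class APs, each computed at IoU threshold 0.5."""
    return sum(per_class_aps) / len(per_class_aps)
```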
The fifth metric is FPS (frames per second), which measures the real-time capability of the model. The formula for calculating FPS is

$$\text{FPS} = \frac{1000}{t}$$

where $t$ represents the time the model takes to infer a single frame, measured in milliseconds. Together, these metrics provide a comprehensive assessment of the model’s performance.
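FPS follows directly from the measured per-frame latency. A small sketch, where the `infer` callable is a stand-in for a model’s forward pass:

```python
import time

def measure_fps(infer, frames=50):
    """Average frames per second over a fixed number of inference calls."""
    start = time.perf_counter()
    for _ in range(frames):
        infer()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    t = elapsed_ms / frames  # milliseconds per frame
    return 1000.0 / t        # FPS = 1000 / t

# Example with a dummy workload standing in for model inference
fps = measure_fps(lambda: sum(range(1000)))
```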
3.2. Ablation Experiments
In this study, we enhance the YOLOv8 architecture through four key modifications: (1) replacement of the detection head with LSCD, (2) substitution of the original backbone network with StarNet, (3) integration of C2F-Star modules, and (4) adoption of DIoU loss. These refinements collectively establish a network structure optimized for palm fruit ripeness recognition tasks. To systematically assess the individual contributions of each architectural improvement, we conduct ablation studies under identical computational hardware and hyperparameter settings.
As presented in Table 2, replacing the detection head with LSCD reduces model complexity to 80.2% of the original GFLOPs, 78.4% of the original parameters, and 79.0% of the original model size compared to the baseline, while maintaining equivalent recognition accuracy (mAP@0.5). Substituting the original backbone with StarNet further compresses computational requirements to 60.5% of GFLOPs, 52.2% of parameters, and 54.0% of baseline model size. Although this modification incurs a marginal precision reduction, StarNet achieves substantial model streamlining. Subsequently, the C2F-Star module incorporating Star Blocks replaces the original C2F modules. Relative to the network featuring the LSCD detection head and the StarNet backbone, this refinement yields additional compression, with further reductions in GFLOPs, parameters, and model size, while increasing detection precision by 1.4 percentage points. Following these cumulative modifications, the model’s GFLOPs, parameters, and size are reduced to 56.0%, 46.0%, and 48.0% of the baseline, respectively; however, this configuration exhibits a 1.2 percentage point reduction in mAP@0.5 compared to the original model. To mitigate this precision degradation, we replace the default CIoU loss function with DIoU, elevating the final mAP@0.5 to 76.0% and surpassing the baseline model’s performance. Taken together, these improvements raise recognition accuracy while reducing the model’s complexity by roughly half, demonstrating the efficacy of our methodological improvements.
3.3. Recognition Performance at Different Ripeness Stages
This section illustrates the efficacy of our method for recognizing palm fruit across five maturation stages: unripe, underripe, ripe, flower, and abnormal.
Table 3 shows that ripe fruit possesses the most distinctive attributes, attaining the best detection accuracy (mAP@0.5 = 88.5%, F1-score = 82.0%). Unripe fruits also exhibited strong performance (mAP@0.5 = 82.8%, F1-score = 78.1%). Flowers are generally smaller and sometimes concealed by foliage, leading to moderate performance for the flowering stage (mAP@0.5 = 72.1%, F1-score = 71.4%). It is worth noting that the underripe stage is a transitional state between unripe and ripe; its F1-score of 69.9% indicates cases of accurate localization but incorrect classification, caused by the instability of the transitional-state characteristics. The model exhibits limited efficacy in detecting the abnormal stage, which encompasses various scenarios, including disease, deformity, and rodent damage, so its features are highly discrete; the model’s recognition figures for this class fall below the average. Overall, the results indicate that the model prioritizes the reliable identification of harvest-ready ripe fruits, which meets the engineering requirements of orchard operations.
3.4. Comparison of Experimental Results for Different IoU Losses
To validate the effectiveness of our DIoU implementation, this study employs the original CIoU in YOLOv8 as the baseline for comparative experiments with alternative loss functions: EIoU, PIoU, SIoU, and DIoU. CIoU improves localization accuracy by constraining aspect ratios and is suitable for general object detection; it can achieve better mAP in high-resource tasks, but its additional computational overhead makes it less suitable for resource-constrained scenarios. The benefit of EIoU lies in decomposing the aspect ratio loss and attributing greater weight to low-quality predictions, thereby accelerating the convergence of hard samples and rectifying the imbalance between easy and hard samples; it is more appropriate for densely clustered, small objects and strongly distorted shapes. PIoU’s advantage lies in its non-monotonic focusing mechanism, which improves learning efficiency for difficult samples and simplifies computation, a clear advantage in tasks with imbalanced samples. SIoU attains gains in conventional detection tasks via direction awareness and suits contexts with directional consistency; nonetheless, the stringent constraints imposed by its numerous penalty terms conflict with the varied deformations that occur under occlusion, leading to excessive penalization of reasonable deformations. The primary difficulty of our task, however, is environmental interference such as occlusion, blur, and uneven lighting, rather than target density or the inherent difficulty of the objects themselves.
Palm fruit grows on trees, often surrounded by branches and leaves, and is therefore frequently occluded. DIoU does not enforce aspect ratio matching, making it more tolerant to variations in object width and height under occlusion, and it maintains localization capability through its center-point distance term, enhancing positional robustness. DIoU is therefore particularly suitable for recognizing the ripeness of palm fruits on trees. Comparative results for the different loss functions are detailed in Table 4. PIoU, SIoU, and DIoU all yielded positive improvements, whereas EIoU reduced recognition accuracy by 1.2 percentage points. Among the three beneficial loss functions, our adopted DIoU demonstrated the best efficacy, with a 1.7 percentage point increase in recognition accuracy.
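The behavior described above follows from the DIoU definition: IoU minus the squared center-point distance normalized by the squared diagonal of the smallest enclosing box. A minimal sketch for axis-aligned boxes in (x1, y1, x2, y2) format (the training loss would then be 1 − DIoU):

```python
def diou(box_a, box_b):
    """DIoU of two boxes: IoU minus the normalized squared center distance."""
    # Intersection area
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    iou = inter / (area_a + area_b - inter)
    # Squared distance between box centers (the positional robustness term)
    d2 = ((box_a[0] + box_a[2]) / 2 - (box_b[0] + box_b[2]) / 2) ** 2 + \
         ((box_a[1] + box_a[3]) / 2 - (box_b[1] + box_b[3]) / 2) ** 2
    # Squared diagonal of the smallest box enclosing both
    cx1, cy1 = min(box_a[0], box_b[0]), min(box_a[1], box_b[1])
    cx2, cy2 = max(box_a[2], box_b[2]), max(box_a[3], box_b[3])
    c2 = (cx2 - cx1) ** 2 + (cy2 - cy1) ** 2
    return iou - d2 / c2
```

Note that, unlike CIoU, no aspect-ratio term appears, which is why width/height changes under occlusion are not additionally penalized.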
3.5. Performance Comparison of Different Algorithms
To further demonstrate the performance of our model, we conduct comparative experiments against several state-of-the-art detection algorithms, including YOLOv5n, YOLOv8n, YOLOv8s, RT-DETR, and YOLOv10n. As presented in Table 5, our palm fruit ripeness recognition model achieves the lowest computational footprint among all methods, establishing the most compact architecture. Compared to YOLOv10n, the most compact of the other methods, our solution achieves further reductions to 69.2% of its GFLOPs, 60.4% of its parameters, and 52.7% of its model size while maintaining competitive accuracy. Comparative analysis reveals that YOLOv8n and YOLOv8s achieve the highest F1-score (73.7%), marginally exceeding our method by 0.5 percentage points. Meanwhile, our approach attains 76.0% mAP@0.5, slightly below YOLOv10n yet superior to the other algorithms. In inference speed, our model demonstrates competitive FPS, surpassed only by YOLOv5n and YOLOv8n while outperforming all other benchmarked models. Overall, our method considerably reduces GFLOPs and parameters while keeping accuracy satisfactory and the F1-score and FPS metrics within acceptable limits; its performance is generally superior.
3.6. Ripeness Recognition Performance in Complex Environments
The dataset employed in this study primarily consists of close-range shots of palm fruits. While effective for detailed ripeness assessment, exclusive reliance on such near-field imagery is operationally inefficient for large-scale ripeness screening. To further validate the generalizability of the proposed model, we conducted additional tests using newly captured palm fruit images acquired from operationally efficient vantage points. Walking around palm trees in the PT. PALMINA UTAMA palm plantation in Kalimantan, Indonesia, we captured images of palm fruits on location with ordinary smartphones rather than professional equipment; any smartphone with basic photographic functionality is suitable. The ripeness identification model developed in this paper was then applied to the acquired images. As shown in Figure 4, our model effectively identifies the ripeness of palm fruit under these operational conditions, with the palm fruits spanning one-fifth to one-third of the image height. The scaling in Figure 4 demonstrates that the ripeness recognition model remains proficient for palm fruit targets as small as 50 × 50 pixels.
To demonstrate the practical efficacy of our proposed palm fruit ripeness recognition model, we conducted ripeness assessment tests on palm fruits under diverse complex agricultural environments. The inherent growth characteristics of palm fruits—being enveloped by dense foliage canopies on trees—frequently subject them to challenging environments involving occlusion and uneven illumination. Concurrently, static recognition approaches prove inefficient for rapid ripeness assessment, while accelerated data acquisition inevitably introduces motion blur artifacts during camera movement.
Figure 5, Figure 6, and Figure 7, respectively, show the recognition results for palm fruits under uneven lighting, motion blur, and occlusion. The on-tree palm fruit ripeness recognition model proposed in this paper shows excellent recognition performance under all three conditions.
3.7. Model Visualization Analysis
While deep learning proves highly effective for object detection, its opaque internal processes often lack interpretability. To address this, our research employs heatmap visualization techniques that transform “black-box” decision-making into human-comprehensible visual signals. These heatmaps use gradient color mapping to represent the contribution of each pixel location to the model’s predictions, with warmer hues indicating higher contributions and cooler tones lower relevance, thereby visually revealing the regions of visual saliency prioritized by the model. The Grad-CAM++ method is used to generate the heatmaps in this section. This study compares recognition heatmaps between the baseline YOLOv8 model and our proposed palm fruit ripeness recognition model. As depicted in Figure 8, specular reflection occurs at the apex of palm fruit granules under direct illumination, a phenomenon prominently visible in the original imagery due to the fruit’s smooth pericarp. The heatmap analysis reveals that the baseline YOLOv8 model over-attends to granule apex features, rendering it vulnerable to robustness degradation under specular reflection interference. In contrast, our proposed model extracts features holistically across the entire fruit morphology, as evidenced by its heatmap distribution, yielding superior robustness through comprehensive feature extraction.
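For readers unfamiliar with the mechanics, a class-activation heatmap of this kind can be sketched with plain Grad-CAM; this is a simplification of the Grad-CAM++ method used in the paper, which additionally weights gradients with higher-order terms. The activations and gradients here are toy nested lists standing in for tensors from a real network:

```python
def grad_cam(activations, gradients):
    """Simplified Grad-CAM: each channel's weight is its spatially averaged
    gradient; the heatmap is the ReLU of the weighted sum of activation maps.
    activations, gradients: [K][H][W] nested lists for one image."""
    k = len(activations)
    h, w = len(activations[0]), len(activations[0][0])
    heatmap = [[0.0] * w for _ in range(h)]
    for ch in range(k):
        # Global-average-pool the gradient of this channel -> importance weight
        weight = sum(sum(row) for row in gradients[ch]) / (h * w)
        for i in range(h):
            for j in range(w):
                heatmap[i][j] += weight * activations[ch][i][j]
    # ReLU keeps only regions contributing positively to the prediction
    return [[max(0.0, v) for v in row] for row in heatmap]
```

In practice the heatmap is then upsampled to the input resolution and overlaid on the image with a warm-to-cool color map.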
4. Discussion
This lightweight method attains an average mAP@0.5 of 76.0% in complex on-tree scenes characterized by uneven lighting, motion blur, and occlusion, substantially reducing network size while preserving recognition reliability in such conditions. This paper enhances the YOLOv8 network by substituting the LSCD detection head, replacing the original backbone with StarNet, incorporating the C2F-Star module, and applying DIoU in a progressive combinatorial optimization, thereby rendering it better suited to palm fruit ripeness recognition.
Following the introduction of StarNet, GFLOPs, parameters, and model size are reduced by 24.7%, 33.5%, and 31.9%, respectively, whereas mAP@0.5 declines by 2.6%. StarNet employs an exponential feature reuse strategy, sharing weights via a recursive structure that does not fully align with the feature scale of the initial convolutional layers. This attenuates shallow texture features during the recursion and diminishes the capacity of the shared weights to optimize targets across different scales. StarNet thus attains its lightweight character via weight reuse, at the expense of representational flexibility. This adverse impact is mitigated by the C2F-Star module and the DIoU loss. The C2F-Star module employs cross-layer skip connections to reintegrate shallow details, counteracting StarNet’s smoothing effect; after its introduction, mAP@0.5 increases by 1.4%. The DIoU loss enhances the center alignment of the bounding box to mitigate positioning drift arising from feature ambiguity and provides a gradient-stable localization loss for correction; after its introduction, mAP@0.5 rises by a further 1.7%. StarNet itself is not a defective module, but its recursive nature requires other parts of the network to compensate for feature diversity: C2F-Star supplies fine detail, and DIoU corrects localization deviations. Through this combination, the final accuracy exceeds that of the baseline model.
The proposed method for assessing palm fruit ripeness balances lightweight characteristics and recognition robustness; nonetheless, it has certain limitations. The present study does not address interference such as precipitation, fog, and extreme times of day (late at night, intense direct sunlight): the model demonstrates resilience under moderate weather conditions, but precipitation, fog, and nighttime conditions require supplementary data and careful tuning. The model currently learns ripe palm fruits thoroughly, satisfying harvesting priorities, but there is room for improvement in identifying underripe and abnormal samples. In the future, the small-object detection layer can be upgraded to adapt it to flower and lesion recognition, and multispectral imaging can be deployed to resolve color confusion through multimodal fusion. With its ultra-small model (2.85 MB) and high frame rate (253 FPS), the current work lays the foundation for mobile deployment: the system could be deployed on mobile devices, allowing on-site workers to obtain ripeness identification results by taking photos with their phones. Furthermore, integrating SLAM technology with the current ripeness identification model to create ripeness maps for palm plantations is a valuable direction for development.
5. Conclusions
This paper presents an efficient network for recognizing palm fruit ripeness, built around StarNet, to reconcile a lightweight design with robustness in complex conditions. The network, built upon YOLOv8, incorporates enhancements to the backbone, neck, detection head, and loss function. The LSCD detection head and StarNet were introduced to diminish the model’s computational complexity and parameter count. The original C2F module is replaced with C2F-Star, enhancing detection accuracy, and the more efficient DIoU supplants the original CIoU, improving model convergence and considerably augmenting accuracy. Following the introduction of StarNet, C2F-Star supplied detailed information while DIoU rectified positional inaccuracies; the integration of these modules yields a network that is both lightweight and highly effective in recognition. Through this progressive combinatorial optimization, the mAP@0.5 of the palm fruit ripeness recognition model developed in this research reaches 76.0%. The model attains 4.5 GFLOPs, 1.37 M parameters, and a model size of 2.85 MB, constituting 56.0%, 46.0%, and 48.0% of the original model, respectively. Robustness evaluation under complex environments, including uneven illumination, motion blur, and occlusion, demonstrates the resilience of our model for recognizing palm ripeness. Complementing these robustness tests, we further validated generalization using newly captured palm fruit images from operationally efficient vantage points distinct from the original dataset; recognition performance on this independent test set confirms the model’s robust generalization.
We employ heatmap visualization to demystify the model’s internal decision-making through intuitive, pixel-level color mapping, which validates the enhanced feature saliency and superior focus of our proposed method. This study also discussed the limitations of the method and pointed out that plantation-level ripeness maps and real-time viewing of fruit ripeness on mobile phones are highly valuable directions for development.