A Lightweight Deep Learning Semantic Segmentation Model for Optical-Image-Based Post-Harvest Fruit Ripeness Analysis of Sugar Apples (Annona squamosa)

Abstract: The sugar apple (Annona squamosa) is valued for its taste, nutritional richness, and versatility, making it suitable for fresh consumption and medicinal use with significant commercial potential. Widely found in the tropical Americas and Asia's tropical or subtropical regions, it faces challenges in post-harvest ripeness assessment, predominantly reliant on manual inspection, leading to inefficiency and high labor costs. This paper explores the application of computer vision techniques in detecting ripeness levels of harvested sugar apples and proposes an improved deep learning model (ECD-DeepLabv3+) specifically designed for ripeness detection tasks. Firstly, the proposed model adopts a lightweight backbone (MobileNetV2), reducing complexity while maintaining performance through MobileNetV2's unique design. Secondly, it incorporates the efficient channel attention (ECA) module to enhance focus on the input image and capture crucial feature information. Additionally, a Dense ASPP module is introduced, which enhances the model's perceptual ability and expands the receptive field by stacking feature maps processed with different dilation rates. Lastly, the proposed model emphasizes the spatial information of sugar apples at different ripeness levels via the coordinate attention (CA) module. Model performance is validated using a self-made dataset of harvested optical images categorized into three ripeness levels. The proposed model (ECD-DeepLabv3+) achieves 89.95% for MIoU, 94.58% for MPA, 96.60% for PA, and 94.61% for MF1. Compared to the original DeepLabv3+, it greatly reduces the number of model parameters (Params) and floating-point operations (Flops) by 89.20% and 69.09%, respectively. Moreover, the proposed method can be directly applied to optical images obtained from the surface of the sugar apple, which provides a potential solution for the detection of post-harvest fruit ripeness.


Introduction
Fruits are crucial for human health; multiple studies indicate that consuming fresh fruits can promote human physical well-being [1][2][3]. However, eating stale or spoiled fruits can trigger outbreaks of foodborne diseases, potentially leading to serious public health issues [4]. Among numerous fruits, the sugar apple (Annona squamosa) [5], which belongs to the genus Annona of the family Annonaceae, is often cultivated in tropical and subtropical regions and is also known as the custard apple and sweetsop. Due to its sweet and distinctive taste, it has gained favor among a vast number of consumers [6]. Furthermore, it can also be used for anti-cancer, anti-obesity, and lipid-lowering purposes and as an insecticidal agent [7][8][9]. However, fruits like sugar apples are highly prone to decay, making them challenging to consume fresh and resulting in significant waste [6,10]. Therefore, ripeness determination is widely considered crucial for both producers and processors [11]. In modern agriculture, ensuring the quality of fruits is of paramount importance for the entire agricultural sector. However, previous studies indicate that determining the ripeness of fruits like watermelon solely from surface characteristics, such as size or external color, through manual observation is quite challenging; comprehensive consideration of various factors is necessary unless experienced individuals assist [12]. Therefore, proposing methods to detect the ripeness of easily perishable fruits (such as sugar apples) can reduce post-harvest losses and lower costs.
In recent decades, in order to reduce the cost of manually differentiating post-harvest fruit ripeness, scientists have proposed numerous methods for the detection of fruit ripeness. Elektrik [13] employed near-infrared spectroscopy to assess the ripeness of a watermelon by calculating the reflectance on the surface of the watermelon; the gathered data underwent statistical analysis for the purpose of grading and assessing watermelon ripeness. Hasanuddin et al. [14] designed a 0.5 µm thick zinc oxide sensitive layer on a LiNbO3 piezoelectric substrate specifically for sensing ethylene (C2H4) gas, aiming to discern the ripeness of fruits. Arrázola et al. [15] evaluated five maturity levels of Tainong papaya fruits through an in-depth examination of mechanical resistance and the application of finite element analysis (FEA). Phoophuangpairoj [16] created an acoustic model of a knocking sound based on the Hidden Markov model of syllables and proposed a new approach for recognizing ripe and raw durian impact signals, with an average ripening recognition rate of more than 90.0%. González-Araiza et al. [17] designed a non-destructive device based on electrical bioimpedance measurements to obtain the impedance spectrum of the whole fruit and, thus, analyze the ripeness of strawberry fruits.
With the development of artificial intelligence, deep learning approaches have received a lot of attention from scientists in the fields of post-harvest fruit ripeness detection, quality inspection, cultivation, and production [18][19][20][21]. Xiao et al. [22] employed a hybrid approach combining the Transformer model from natural language processing with a deep learning model to classify apples of different ripeness levels, which makes it easier to combine multimodal data and provides greater flexibility in modeling. Appe et al. [23] enhanced YOLOv5 by incorporating the Convolutional Block Attention module for the automatic classification of multiple tomato classes, with an average accuracy of 88.1%. Kim et al. [24] proposed a dual-path semantic segmentation model, which achieves 90.33% accuracy for strawberry ripeness and 71.15% for fruit stalk coordinate detection. Zhao et al. [25] proposed a new single-stage instance segmentation model; in total, 72.12% average precision (AP) was achieved on a home-made peach ripeness classification dataset. However, there is currently limited research on using computer vision techniques to detect post-harvest sugar apple ripeness, and the performance is still to be further improved. Sanchez et al. [26] used the YOLO model to classify sugar apple ripeness and achieved 86.84% in terms of average accuracy. On the other hand, object detection algorithms primarily identify and locate objects using rectangular bounding boxes. When applied to fruit ripeness detection, simple bounding boxes face challenges in accuracy and detail due to complex color and texture variations [27]. Furthermore, object detection lacks detailed semantic information at the pixel level, which may make it challenging to accurately assess fruit ripeness. In addition, existing models often have large numbers of parameters and floating-point operations, demanding high hardware requirements and lacking practicality on embedded devices.
Compared to object detection, semantic segmentation offers detailed pixel-level annotations and assigns semantic labels to each pixel in an image. This enables a more precise segmentation of sugar apples, providing accurate and comprehensive information for ripeness detection [27]. This advantage, especially in analyzing color and texture features, makes it a promising direction for advancing sugar apple ripeness detection. Therefore, this paper introduces an improved semantic segmentation method (ECD-DeepLabv3+) for post-harvest sugar apple ripeness segmentation. In order to enhance the model's performance and streamline its complexity, we substituted the initial backbone network in DeepLabv3+ with MobileNetV2 and integrated the efficient channel attention (ECA) module to establish connectivity between the encoding and decoding regions. The ASPP module in DeepLabv3+ was replaced with a densely connected variant (Dense ASPP) to reduce the loss of the sugar apple's image feature information. Finally, the Dense ASPP module incorporated a coordinate attention (CA) module to enhance the model's understanding of the sugar apple's coordinate information. To evaluate the performance of the ECD-DeepLabv3+ model, a self-made dataset was built, which includes 1600 optical images focusing on the ripeness of sugar apples after harvest. Experimental results show that the proposed model achieved better results in terms of MIoU, MPA, PA, MF1, model Params, and Flops. Using the improved segmentation algorithm, we conducted ripeness detection on harvested sugar apples, aiming to provide a possible method for automatically screening the ripeness of sugar apples and other fruits. The primary contributions of this paper can be summarized as follows:
1. This paper explored, for the first time, the feasibility of applying semantic segmentation techniques to the detection of sugar apple (Annona squamosa) ripeness;
2. This paper proposed an improved semantic segmentation model (ECD-DeepLabv3+) which, while significantly reducing model complexity (model Params and Flops), achieves enhancements in performance metrics, such as MIoU, MPA, MF1, and PA;
3. This paper created a semantic segmentation dataset of post-harvest sugar apple optical images to evaluate the performance of the ECD-DeepLabv3+.

Dataset
The aim of this paper is to explore the utilization of semantic segmentation techniques to detect ripeness in post-harvest sugar apples and to achieve automatic classification of sugar apples at different ripeness levels using artificial intelligence. However, to our knowledge, there is currently no publicly available semantic segmentation dataset specifically designed for assessing sugar apple ripeness. Therefore, in this paper, a self-made semantic segmentation dataset comprising 30 sugar apples, 4 kiwis, and 3 pineapples has been created for sugar apple ripeness assessment to validate the performance of the proposed model (all of these fruits were obtained from Guangzhou, China). A total of 1000 images were collected, each containing sugar apples at different levels of ripeness (unripe, ripe, and bad). To ensure the robustness of the model, 935 of these 1000 images also include kiwis and pineapples, which have features similar to those of sugar apples (each image containing multiple categories, with kiwis and pineapples labeled as "other"). Additionally, background interference caused by the peeling of sugar apple skin during ripening or impact processes was introduced (the "background" category is labeled in each image). We used Labelme for image annotation and, after careful inspection by experienced biologists, the dataset was randomly and evenly divided into three groups in a ratio of 6:2:2, namely, a training set containing 600 images, a validation set containing 200 images, and a test set containing 200 images. The dataset was labeled into five categories: unripe, ripe, bad, other, and background. After the dataset division, this paper increased the number of optical images within the training set through horizontal flipping, vertical flipping, and random changes in brightness. The ultimate training set comprises 1200 images, supplemented by 200 images each for the validation and test sets, culminating in a total dataset size of 1600 images. Compared to the scale of the semantic segmentation datasets used in references [28][29][30][31], our dataset, consisting of 1600 images, is reasonable and reliable.
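The 6:2:2 split described above can be sketched in a few lines of Python; the file-name pattern and the fixed seed below are hypothetical illustrations, not details taken from the paper.

```python
import random

def split_dataset(filenames, ratios=(0.6, 0.2, 0.2), seed=42):
    """Randomly split file names into train/val/test by a 6:2:2 ratio."""
    names = list(filenames)
    random.Random(seed).shuffle(names)  # deterministic shuffle for reproducibility
    n_train = int(len(names) * ratios[0])
    n_val = int(len(names) * ratios[1])
    return names[:n_train], names[n_train:n_train + n_val], names[n_train + n_val:]

# hypothetical file names standing in for the 1000 collected images
images = [f"sugar_apple_{i:04d}.jpg" for i in range(1000)]
train, val, test_split = split_dataset(images)
print(len(train), len(val), len(test_split))  # 600 200 200
```

With 1000 images this yields exactly the 600/200/200 partition used before augmentation.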

Image Preprocessing
As these images were captured by different smartphones (iPhone 11, HUAWEI Mate20, and HUAWEI nova9), they had varying resolutions (3024 × 3024, 2976 × 2976, and 3072 × 3072 pixels). It is worth mentioning that all the equipment used to collect the dataset was manufactured in China. To standardize the dataset, we used OpenCV (version 4.7.0) to resize them to a uniform resolution of 512 × 512 pixels. Some images from the self-made dataset are shown in Figure 1.


Data Augmentation
The technique of data augmentation [32] enhances the performance of deep learning models by introducing diversity through transformations and expansions of the original data. By incorporating various changes, such as rotation, flipping, and brightness adjustments, data augmentation creates a more diverse set of samples, expanding the training dataset and enabling the model to comprehensively learn features. The strength of this approach lies in its ability to improve the generalization of the proposed model, reduce the risk of overfitting, and provide solutions when faced with limited data or a lack of diversity. By simulating variations present in real-world scenarios, data augmentation helps the model adapt to different environments and conditions, thus improving its robustness. This paper primarily employs horizontal flipping and vertical flipping to enhance the recognition accuracy of flipped targets, as well as brightness adjustments to enhance the model's robustness to different lighting environments. Specifically, after considering the trade-off between training time cost and model performance, the training set (containing 600 images) in the original dataset is randomly augmented using the three forms of image enhancement mentioned above, expanding the training set to 1200 images. The images for each category in the dataset are detailed in Table 1, with multiple category annotations per image. Following the completion of data augmentation, the dataset comprises 1600 images, including 1217 images containing "Unripe" fruit, 1581 images containing "Ripe" fruit, 1372 images containing "Bad" fruit, and 1516 images containing "Other" fruit (among which 1297 contain kiwifruit, 627 contain pineapple, and 408 contain both kiwifruit and pineapple). Each image includes annotations for the background. Figure 2 illustrates the effects of horizontal flipping, vertical flipping, and brightness adjustments.
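The three augmentations can be sketched with NumPy as below; the brightness range of 0.7 to 1.3 is a hypothetical choice, as the paper does not state the exact adjustment interval it used.

```python
import numpy as np

rng = np.random.default_rng(0)

def horizontal_flip(img):
    """Mirror an (H, W, C) image left-to-right."""
    return img[:, ::-1]

def vertical_flip(img):
    """Mirror an (H, W, C) image top-to-bottom."""
    return img[::-1, :]

def random_brightness(img, low=0.7, high=1.3):
    """Scale pixel intensities by a random factor (range is an assumption)."""
    factor = rng.uniform(low, high)
    return np.clip(img.astype(np.float32) * factor, 0, 255).astype(np.uint8)
```

Applying a random choice of these transforms to each of the 600 training images once would double the training set to 1200 images, as described above.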

Four Well-Known Models Being Compared
As artificial intelligence (AI) advances, computer vision technology finds extensive applications in diverse fields, including agriculture, engineering, medicine, and beyond. Within the realm of computer vision, three main tasks prevail: object detection, image classification, and semantic segmentation. Compared to the former two tasks, semantic segmentation has attracted considerable attention from scholars due to its finer target localization, comprehensive understanding of global scenes, and advantages in handling multiple objects. Four well-known semantic segmentation models (HRNet, U-Net, PSPNet, and DeepLabv3+) are illustrated in Figure 3.


U-Net
U-Net is a fully convolutional neural network proposed by Ronneberger et al. [33], distinguished by the integration of feature maps across the channel dimension at the same level between the encoder and decoder through skip connections (as shown in Figure 3a). This design facilitates the fusion of contextual information from deep network features with shallow network images, promoting multiscale feature integration and mitigating the loss of image information.


HRNet
HRNet, or High-Resolution Network, is a deep learning architecture proposed by Wang et al. [34]. This network places a significant emphasis on high-resolution images, utilizing a high-resolution feature pyramid network structure to effectively capture fine features in images (as shown in Figure 3b). HRNet achieves heightened sensitivity to details by preserving the flow of high-resolution information, avoiding resolution loss. Unlike traditional networks, HRNet is capable of simultaneously handling feature maps of different resolutions, enabling comprehensive feature learning while maintaining high resolution. This design has led to outstanding performances for HRNet in image processing tasks.

PSPNet
PSPNet, short for Pyramid Scene Parsing Network, is a deep learning network proposed by Zhao et al. [35]. This network adopts a structure known as pyramid spatial pooling, enhancing image segmentation by incorporating multiscale global information (as shown in Figure 3c). In PSPNet, different-sized pooling kernels are introduced to capture contextual information at various scales. This design contributes to an improved understanding of scenes, enabling the network to better adapt to objects and structures of varying scales. Ultimately, PSPNet achieves enhanced accuracy and effectiveness in image segmentation through the fusion of multiscale contextual information.

DeepLabv3+
DeepLabv3+ was proposed by Chen et al. [36]. The network adopts an encoder-decoder structure, utilizing Xception as the backbone (as shown in Figure 3d). The design includes an atrous spatial pyramid pooling (ASPP) module, where atrous convolutions with different atrous rates are employed to extract features at various resolutions, enhancing the richness of contextual information. After up-sampling, the deep feature maps are fused once again with low-level features, resulting in higher segmentation accuracy.
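The ASPP idea just described (parallel atrous convolutions plus an image-level pooling branch, concatenated and projected) can be sketched in PyTorch as follows; the rates (6, 12, 18) follow the common DeepLab configuration, and the omission of batch normalization is a simplification rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Sketch of atrous spatial pyramid pooling as used in DeepLabv3+."""
    def __init__(self, in_ch, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        # one 1x1 branch plus one 3x3 atrous branch per dilation rate
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1)]
            + [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates]
        )
        # image-level pooling branch for global context
        self.pool = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(in_ch, out_ch, 1))
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1)

    def forward(self, x):
        feats = [branch(x) for branch in self.branches]
        pooled = nn.functional.interpolate(
            self.pool(x), size=x.shape[2:], mode="bilinear", align_corners=False
        )
        # concatenate all branches along channels, then fuse with a 1x1 conv
        return self.project(torch.cat(feats + [pooled], dim=1))
```

Because each 3 × 3 branch pads by its dilation rate, every branch preserves the spatial size, so the outputs can be concatenated directly.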

Architecture of the Proposed ECD-DeepLabv3+
This paper proposes an enhanced semantic segmentation model based on DeepLabv3+ (as shown in Figure 4), improving the detection performance of ripeness in post-harvest sugar apples while reducing the model's complexity. The improvements made to the model architecture encompass three key aspects:
i. Replacing the backbone with MobileNetV2 and introducing an efficient channel attention (ECA) module at the junction between the encoding and decoding regions, which substantially decreases the complexity of the model, including parameters (Params) and floating-point operations (Flops), while simultaneously boosting its capabilities;
ii. Adding the coordinate attention (CA) module after the feature maps output by the ASPP in DeepLabv3+ to improve attention towards the positional and long-range dependency information of post-harvest sugar apple images;
iii. Merging the densely connected atrous spatial pyramid pooling (Dense ASPP), which minimizes overlooked pixel features, preserving the completeness of feature information and achieving an enlarged receptive field.
Agriculture 2024, 14, x FOR PEER REVIEW

MobileNetv2
MobileNetV2, proposed by Sandler et al. [37], is a neural network known for its lightweight design, intended to address the demand for efficient image recognition in resource-constrained environments. Emphasizing light weight and efficiency, the model incorporates innovative designs, such as depth-wise separable convolutions, residual connections, and inverted residual structures, to enable real-time image processing on small devices. Figure 5 illustrates the inverted residual block of MobileNetV2. Compared to Xception, MobileNetV2 focuses more on light weight and efficiency, achieving satisfactory performance with relatively few parameters and low computational complexity. This makes MobileNetV2 an ideal choice for practical deployment in scenarios with limited resources, particularly on mobile devices.
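The inverted residual block of Figure 5 can be sketched in PyTorch roughly as follows: expand with a pointwise convolution, filter with a depth-wise convolution, then linearly project back down. The default expansion factor of 6 matches the MobileNetV2 paper, but this is a simplified sketch rather than the exact implementation.

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Sketch of a MobileNetV2 inverted residual block."""
    def __init__(self, in_ch, out_ch, stride=1, expand=6):
        super().__init__()
        mid = in_ch * expand
        # the skip connection applies only when the shape is preserved
        self.use_skip = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid, 1, bias=False),              # pointwise expansion
            nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
            nn.Conv2d(mid, mid, 3, stride, 1, groups=mid, bias=False),  # depth-wise
            nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
            nn.Conv2d(mid, out_ch, 1, bias=False),             # linear projection
            nn.BatchNorm2d(out_ch),                            # no activation here
        )

    def forward(self, x):
        y = self.block(x)
        return x + y if self.use_skip else y
```

The depth-wise convolution (`groups=mid`) and the activation-free projection are what keep the parameter count and Flops low compared to a standard residual block.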

The creators of MobileNetV2 took practical application requirements into consideration, leading to widespread adoption in edge computing and mobile devices. In comparison to DeepLabv3+ with Xception as the backbone, DeepLabv3+ with MobileNetV2 as the backbone demonstrates better performance in the segmentation task of ripeness detection in post-harvest sugar apples. Moreover, it significantly reduces the number of parameters and floating-point operations. Hence, in this paper, we adopt MobileNetV2 as the backbone for our proposed model (ECD-DeepLabv3+).


Efficient Channel Attention
Efficient channel attention (ECA) is a lightweight attention mechanism designed for image processing and computer vision tasks [38]. The approach aims to enhance the model's focus on crucial features in the channel dimension while reducing computational and parameter complexity. ECA achieves this by introducing an adaptive and lightweight attention weight for each channel, enabling the model to capture key information in images more accurately without introducing significant computational overhead. Figure 6 illustrates the specific structure of the ECA.
As depicted in the illustration, the ideal span for the exchange of channel data, denoted as k, corresponds to the dimension of the one-dimensional convolution kernel. This is determined using the following formula:

k = ψ(C) = |log₂(C)/γ + b/γ|_odd

where C signifies the count of feature channels, |·|_odd denotes the nearest odd number, and b and γ are typically assigned the values of 1 and 2, respectively. Additionally, w denotes the ultimate channel attention and it is computed according to the subsequent equation:

w = σ(C1D_k(F))

where F denotes the incoming feature, C1D_k symbolizes the one-dimensional convolution, employing a convolution kernel of size k, and σ represents the Sigmoid function.
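A minimal PyTorch sketch of the ECA module as described above: k is computed from C with γ = 2 and b = 1, and a one-dimensional convolution of size k is applied across the globally pooled channel descriptors. This is an illustrative re-implementation, not the authors' code.

```python
import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Sketch of efficient channel attention with an adaptive kernel size."""
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        # k = |log2(C)/gamma + b/gamma|, rounded up to the nearest odd number
        t = int(abs(math.log2(channels) / gamma + b / gamma))
        k = t if t % 2 else t + 1
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        y = self.pool(x)                                   # (N, C, 1, 1) descriptors
        y = self.conv(y.squeeze(-1).transpose(1, 2))       # 1-D conv across channels
        y = self.sigmoid(y.transpose(1, 2).unsqueeze(-1))  # per-channel weight w
        return x * y                                       # reweight input channels
```

For C = 64, γ = 2, and b = 1, the formula gives k = 3, so each channel weight depends only on itself and its two neighbors, which is what keeps ECA so cheap.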
In comparison to traditional attention mechanisms, ECA effectively reduces the demand for computational resources while maintaining model performance, making it an ideal choice for image processing in resource-constrained environments. The design philosophy of ECA aligns with the objectives of this paper, which focuses on designing a lightweight semantic segmentation network for assessing the ripeness of post-harvest sugar apples. Consequently, ECA is incorporated at the connections linking the encoder and decoder in this paper to enhance the model's ability to capture critical information in images with greater precision.

Coordinate Attention
Traditional attention mechanisms, constrained by convolutional computations, primarily focus on capturing local relationships and exhibit limitations in modeling distant dependencies. To overcome this challenge, the coordinate attention mechanism was proposed by Hou et al. [39]. This mechanism conducts feature-aware operations across spatial coordinates and encompasses two pivotal stages: coordinate attention embedding and coordinate attention generation. The architecture is illustrated in Figure 7.
During the coordinate attention embedding process, the attention mechanism applied to individual channels within the input feature map is divided into two one-dimensional processes for feature encoding. These dual feature encodings execute feature consolidation along both the x and y directions, encompassing horizontal and vertical orientations. Subsequently, encoding is carried out for each channel along the horizontal and vertical coordinates using pooling operations:

z_c^h(h) = (1/W) Σ_{0 ≤ i < W} x_c(h, i)

where h stands for the height parameter of the pooling kernels, z_c^h signifies the output of the c-th channel at the height h, and the input of the c-th channel is represented by x_c.

z_c^w(w) = (1/H) Σ_{0 ≤ j < H} x_c(j, w)

where w is the width parameter of the pooling kernels and z_c^w denotes the output of the c-th channel at width w.
During the generation of coordinate attention, the feature map tensors obtained from two distinct directions, horizontal and vertical, undergo convolutional operations to adapt their channel dimensions, aligning with the channel count in the input. Ultimately, an activation function, which can modify the output of the attention module, is employed. The formula for the coordinate attention process is outlined as follows:

u = δ(Conv([z^h, z^w]))

where [·,·] denotes the connectivity operation from the spatial dimension, δ stands for the non-linear activation function, and u is the mid-level feature mapping that is achieved by merging the features both vertically and horizontally. Conv is the convolutional operation.

g^h = σ(Conv(u^h)), g^w = σ(Conv(u^w))

where g^w and g^h are tensors with the same channel number as the input, obtained by transforming u^w and u^h, respectively. The sigmoid function is represented by σ, while u^h and u^w are the two tensors obtained by decomposing u along the spatial dimension.

y_c(i, j) = x_c(i, j) × g_c^h(i) × g_c^w(j)

where the output of the c-th channel in the equation is represented by y_c. In order to improve the model's localization and feature extraction capabilities in the sugar apple ripeness segmentation task, we introduced the coordinate attention mechanism into the feature map after the fusion of the ASPP module. This allows the model to adaptively learn feature information and details from post-harvest sugar apple images of different ripeness levels, accurately capturing the precise location of the target object and long-range dependency information.
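The two CA stages can be sketched in PyTorch as follows: directional pooling produces z^h and z^w, a shared convolution plus activation produces u, and two sigmoid-gated convolutions produce g^h and g^w. The reduction ratio of 32 and the plain ReLU standing in for δ are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Sketch of coordinate attention: pool along H and W, then reweight."""
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # z^h: (N, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # z^w: (N, C, 1, W)
        self.conv1 = nn.Conv2d(channels, mid, 1)       # shared transform for u
        self.act = nn.ReLU()                           # delta (paper uses hard-swish)
        self.conv_h = nn.Conv2d(mid, channels, 1)
        self.conv_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x):
        n, c, h, w = x.shape
        zh = self.pool_h(x)                            # average over the width axis
        zw = self.pool_w(x).permute(0, 1, 3, 2)        # average over the height axis
        u = self.act(self.conv1(torch.cat([zh, zw], dim=2)))
        uh, uw = torch.split(u, [h, w], dim=2)         # decompose u spatially
        gh = torch.sigmoid(self.conv_h(uh))                          # g^h
        gw = torch.sigmoid(self.conv_w(uw.permute(0, 1, 3, 2)))      # g^w
        return x * gh * gw                             # y_c(i,j) = x_c * g^h * g^w
```

The final broadcasted product mirrors the last equation above: each output pixel is the input scaled by its row gate g^h(i) and column gate g^w(j).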

Dense ASPP
In the ASPP module of the DeepLabv3+ model (Figure 8a), the parallel-connected atrous convolutions are discrete in the image space, leading to the loss of feature information and a reduction in image segmentation accuracy. We therefore introduced the Dense ASPP [40] into DeepLabv3+, as illustrated in Figure 8b. This improvement allows DeepLabv3+ to sample pixels more densely, obtaining a broader receptive field and enhancing the segmentation performance of the model on post-harvest sugar apple images.

Without dense connections, the pixel sampling of atrous convolution is notably sparse. As illustrated in Figure 9a, in a one-dimensional atrous convolution with an atrous rate of 6, merely 3 pixels are engaged in the computation of such a large convolution kernel at any given moment. Despite achieving a broader receptive field, this approach omits considerable information. After integrating the densely linked ASPP module, pixel sampling is intensified, preserving the integrity of the feature information. As depicted in Figure 9b, the atrous rate escalates layer by layer, and seven pixels are involved in the convolutional computation of the resulting output. This configuration is richer in pixel information than the atrous convolution depicted in Figure 9a. As demonstrated in Figure 9c, in two-dimensional atrous convolution, the number of pixels engaged in feature extraction reaches 49 when the densely connected atrous convolution layers are employed, whereas a solitary atrous convolution layer involves merely 9 pixels. Convolution layers with a higher atrous rate tend to neglect adjacent pixel features. Consequently, by combining convolution layers with varying atrous rates, the integrity of the feature information can be preserved and the receptive field can be expanded.
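The pixel counts above can be reproduced with a short script. The sketch below counts the distinct input positions (taps) touched by a 3-tap dilated convolution and by a stack of two such layers; the specific rate pair (3 followed by 6) for the stacked case is an assumption chosen to reproduce the counts of 7 and 49, since the exact rates in Figure 9b are not stated in the text.

```python
def taps_1d(dilations):
    """Count distinct input offsets sampled by a stack of 1-D, 3-tap dilated convolutions."""
    offsets = {0}
    for d in dilations:
        # each layer reaches one kernel step (-d, 0, +d) from every already-reachable offset
        offsets = {o + a * d for o in offsets for a in (-1, 0, 1)}
    return len(offsets)

single = taps_1d([6])      # one layer, atrous rate 6
stacked = taps_1d([3, 6])  # densely stacked layers, rates 3 then 6 (assumed)
print(single, stacked, stacked ** 2)  # prints: 3 7 49  (2-D counts are squares of the 1-D ones)
```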

The size of the receptive field of a single atrous convolution can be calculated using the following formula:

$R = (k - 1) \times r + 1$

where r represents the atrous rate of the atrous convolution, k represents the kernel size of the atrous convolution, and R represents the size of the receptive field.
If two atrous convolutional layers are stacked together, a larger receptive field can be obtained. The formula for calculating the size of the stacked receptive field is as follows:

$R = R_1 + R_2 - 1$ (10)

where $R_1$ and $R_2$ represent the receptive field sizes provided by the two adjacent atrous convolutional layers and R represents the stacked receptive field.
Table 2 illustrates the receptive fields obtained through both the original ASPP configuration and the densely linked stacked atrous convolutions. In the initial DeepLabv3+ model, the branches of the ASPP module operate independently, with no information exchange between them; after passing through each atrous convolutional layer, the feature maps are directly concatenated. As detailed in Table 2, the receptive fields are 13, 25, and 37 for atrous rates of 6, 12, and 18, respectively. In contrast, the densely linked ASPP structure enhances the reuse of feature information across layers, consequently enlarging the receptive field. Notably, the receptive field is 37 when connecting the branches with atrous rates of 6 and 12, and 61 when connecting the branches with atrous rates of 12 and 18. When all three branches with atrous rates of 6, 12, and 18 are connected, the receptive field expands to 73.
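As a quick check, the two receptive-field formulas above reproduce every value quoted from Table 2. This is a minimal sketch; `kernel=3` assumes the 3×3 atrous convolutions conventionally used in ASPP.

```python
def receptive_field(rate, kernel=3):
    # R = (k - 1) * r + 1 for a single atrous convolution
    return (kernel - 1) * rate + 1

def stacked_field(*fields):
    # R = R1 + R2 - 1, applied pairwise for each additional stacked layer
    total = fields[0]
    for f in fields[1:]:
        total = total + f - 1
    return total

r6, r12, r18 = (receptive_field(r) for r in (6, 12, 18))
print(r6, r12, r18)                 # prints: 13 25 37
print(stacked_field(r6, r12))       # prints: 37
print(stacked_field(r12, r18))      # prints: 61
print(stacked_field(r6, r12, r18))  # prints: 73
```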

Evaluation Indicators
In this paper, we comprehensively evaluate the performance of the proposed model in the sugar apple ripeness classification task from two aspects: model performance and model complexity. For the validation of model performance, we employ four evaluation metrics: Mean Intersection over Union (MIoU), Mean Pixel Accuracy (MPA), Pixel Accuracy (PA), and Mean F1 Score (MF1). MIoU primarily assesses the segmentation performance of models, measuring the accuracy of pixel segmentation for each class. MPA gauges the classification accuracy at the pixel level, averaged over classes. PA evaluates the overall accuracy of the model at the pixel level, i.e., the proportion of correctly classified pixels to the total number of pixels. MF1 serves as a comprehensive metric, offering a balanced evaluation: it assesses both the accuracy of the predicted positives and the model's capability to correctly identify true positives, thereby achieving equilibrium between different aspects of model performance. Four basic quantities can be obtained by comparing the predictions of the model with the dataset labels: False Negative (FN), True Negative (TN), True Positive (TP), and False Positive (FP). These allow the calculation of the aforementioned metrics as follows:

$MIoU = \frac{1}{N}\sum_{i=1}^{N}\frac{TP_i}{TP_i + FP_i + FN_i}$

$MPA = \frac{1}{N}\sum_{i=1}^{N}\frac{TP_i}{TP_i + FN_i}$

$PA = \frac{\sum_{i=1}^{N} TP_i}{\sum_{i=1}^{N} (TP_i + FN_i)}$

$MF1 = \frac{1}{N}\sum_{i=1}^{N}\frac{2 \times Precision_i \times Recall_i}{Precision_i + Recall_i}, \quad Precision_i = \frac{TP_i}{TP_i + FP_i}, \quad Recall_i = \frac{TP_i}{TP_i + FN_i}$

where N is the number of classes and the subscript i denotes the quantities of the i-th class.

On the other hand, for the validation of model complexity, we primarily focus on the number of model parameters (Params) and floating-point operations (Flops). The number of parameters reflects the model's demands on storage and computational resources, while its Flops help evaluate its computational efficiency during actual runtime. Flops are commonly used to measure the speed of a system in handling tasks
involving a large number of numerical calculations, particularly in scenarios, such as scientific computing and deep learning, that require extensive floating-point operations. For a single convolutional layer, the number of parameters is calculated as follows:

$Params = C_0 \times (k_w \times k_h \times C_i + 1)$

where $C_0$ stands for the output channel number, the bracketed term stands for the parameters of one convolution kernel, $k_w$ represents the convolution kernel width, $k_h$ represents the convolution kernel height, $C_i$ represents the input channel number, and $k_w \times k_h \times C_i$ represents the number of weights in a convolution kernel.
If the convolution kernel is square, i.e., $k_w = k_h = k$, then the above formula becomes

$Params = C_0 \times (k^2 \times C_i + 1)$.

Additionally, since batch normalization (BN) is employed in the model design, the +1 bias term in the calculation is removed. Considering the use of square convolution kernels, the final formula for the parameters is

$Params = C_0 \times k^2 \times C_i$.

The Flops can be calculated as follows:

$Flops = [C_i \times k_w \times k_h + (C_i \times k_w \times k_h - 1) + 1] \times C_0 \times W \times H$

where $[\cdot]$ represents the computational cost, including both multiplication and addition, required to compute a single point in the feature map through one convolution operation. The term $C_i \times k_w \times k_h$ accounts for the multiplications in a single convolution operation, $C_i \times k_w \times k_h - 1$ represents the additions, and the +1 term corresponds to the bias. H and W stand for the height and width of the feature map, respectively, and $C_0 \times W \times H$ denotes the total number of elements in the output feature map.
If the convolutional kernel is square, meaning $k_w = k_h = k$, and batch normalization (BN) is applied (removing the bias), the calculation formula is modified to:

$Flops = (2 \times C_i \times k^2 - 1) \times C_0 \times W \times H$.
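The metric and complexity formulas above can be sketched as a small NumPy implementation working from a confusion matrix; the toy 2-class confusion matrix and the layer sizes used below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def segmentation_metrics(conf):
    """MIoU, MPA, PA, and MF1 from a confusion matrix:
    conf[i, j] = number of pixels of true class i predicted as class j."""
    conf = np.asarray(conf, dtype=float)
    tp = np.diag(conf)                   # true positives per class
    fp = conf.sum(axis=0) - tp           # false positives per class
    fn = conf.sum(axis=1) - tp           # false negatives per class
    iou = tp / (tp + fp + fn)            # per-class intersection over union
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)              # per-class pixel accuracy
    f1 = 2 * precision * recall / (precision + recall)
    return {"MIoU": iou.mean(),
            "MPA": recall.mean(),
            "PA": tp.sum() / conf.sum(),  # correctly classified pixels / all pixels
            "MF1": f1.mean()}

def conv_params(c_in, c_out, k):
    # Params = C0 * k^2 * Ci  (square kernel, bias removed due to BN)
    return c_out * k * k * c_in

def conv_flops(c_in, c_out, k, h, w):
    # Flops = (2 * Ci * k^2 - 1) * C0 * H * W  (square kernel, no bias)
    return (2 * c_in * k * k - 1) * c_out * h * w

m = segmentation_metrics([[8, 2],
                          [1, 9]])       # toy 2-class confusion matrix
```

For this toy matrix, PA and MPA both come out to 0.85, while MIoU and MF1 average the per-class ratios 8/11 with 9/12 and 16/19 with 18/21, respectively.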

Experiment Details
The hardware and software parameters of the experimental platform can be seen in Table 3. In addition, the loss and accuracy curves of the training process are presented in Figure 10. As depicted in Figure 10a, the training loss and validation loss converge simultaneously to low values, with the validation loss slightly higher than the training loss. Moreover, as illustrated in Figure 10b, the training and validation accuracy converge simultaneously to high values above 90%. The training accuracy is slightly higher than the validation accuracy, suggesting that the model generalizes close to its training-set performance rather than deviating significantly. Therefore, it can be concluded that no overfitting is observed in the models presented in this paper.

Quantitative Analysis
Within this section, an evaluation is conducted on existing semantic segmentation methods, encompassing DeepLabv3+, PSPNet, HRNet, and U-Net models utilizing various backbones (Xception, MobileNetV2), and the newly introduced model (ECD-DeepLabv3+).
In the experiments, a consistent dataset (the self-made post-harvest sugar apple dataset) is employed for both training and testing to maintain comparability and consistency in the independent variable. The quantitative comparison results of these models are presented in Table 5. It can be seen that DeepLabv3+ has the best comprehensive performance among the four well-known semantic segmentation models, with the best results on MIoU, PA, and MF1; however, its model complexity is high, so this paper makes a series of improvements to DeepLabv3+ while maintaining its performance. As can be seen in Table 5, after replacing the backbone of the original DeepLabv3+ with MobileNetV2 (MobileNetV2-DeepLabv3+), the model exhibits superior performance in the comparisons of Params and Flops, with values of 5.81 M and 52.89 G, respectively. Notably, when compared with Xception-DeepLabv3+ (the baseline), MobileNetV2-DeepLabv3+ shows significant reductions of 89.33% and 68.30% in Params and Flops, respectively, and a slight increase in the other metrics (MIoU, MPA, PA, MF1).
After accomplishing the task of reducing model complexity, further exploration was conducted to enhance the performance of the model in the post-harvest sugar apple ripeness detection task while maintaining low model complexity. As evident from Table 5, compared to DeepLabv3+ (Xception), the proposed model (ECD-DeepLabv3+) exhibits reductions of 89.20% and 68.09% in Params and Flops, respectively. Additionally, there are improvements of 2.13%, 1.77%, 0.62%, and 1.27% in MIoU, MPA, PA, and MF1, respectively. These results show that the proposed method improves the detection of different ripeness levels of post-harvest sugar apples by better extracting the sugar apple image features and enhancing the focus on the coordinate information of where these features are located. Figure 11 plots the results of the ECD-DeepLabv3+ model for each of the evaluation metrics in the different categories, providing a clearer and easier comparison between the categorization results and the actual predictions, demonstrating the performance of our model. Combining Figure 11 and Table 5, it can be seen that the proposed model exhibits excellent performance in terms of the overall average performance (MPA, MIoU, and MF1), the overall performance (PA, Params, and Flops), and the performance on the individual categories in the dataset.

Qualitative Analysis
In order to better verify the performance of the proposed model, we conducted a qualitative analysis of six models (U-Net, PSP-Net, HR-Net, DeepLabv3+ with two different backbones, and ECD-DeepLabv3+). On our self-made dataset, which includes clean backgrounds and backgrounds with interference (such as ripe sugar apples or skin peeling caused by impact), the experimental results show that our proposed ECD-DeepLabv3+ model demonstrates better performance in terms of segmentation accuracy, segmentation detail, and robustness under the two different background situations. It is worth mentioning that, to ensure the rigor of the experiment, none of the images used to evaluate model performance are included in the model training.
Figure 12 shows the segmentation results of each model for different ripeness levels of the sugar apple under a clean background (non-interference situation). As can be seen in Figure 12, compared with the five other models, our proposed model (ECD-DeepLabv3+) produces the best segmentation results. Figure 12a,b indicate that, for ripe and bad sugar apples, the five compared models all have serious segmentation errors, while the proposed model (ECD-DeepLabv3+) still maintains high accuracy, performing best overall. Figure 12c shows that the U-Net, PSP-Net, and HR-Net models all exhibit segmentation errors or under-segmentation for the ripe sugar apple or other fruits (pineapple), while DeepLabv3+ (Xception) and DeepLabv3+ (MobileNetV2) perform better but still have some segmentation errors. In contrast, our proposed model accurately segments each type. Figure 12d shows that the five compared models all have different degrees of segmentation errors when classifying the ripe and bad sugar apples, whereas our proposed model has relatively high prediction accuracy. Figure 12e presents that four models, U-Net, HR-Net, DeepLabv3+ (Xception), and DeepLabv3+ (MobileNetV2), all have serious segmentation errors when identifying the unripe and ripe sugar apples; PSP-Net has slight segmentation errors; in contrast, our proposed model still maintains stable segmentation accuracy.

Figure 13 highlights the ripeness detection results of the various models when the background has noise interference caused by sugar apple skin peeling. As shown in Figure 13a, except for U-Net and PSP-Net, which have slight segmentation errors, the other models all achieve considerable accuracy; however, in some details, the segmentation results of the other three models (HR-Net, DeepLabv3+ (MobileNetV2), DeepLabv3+ (Xception)) are affected by the proximity of the fruit, which causes some adhesion. Compared to the above models, our proposed model performs best in detail. In Figure 13b, the U-Net and
PSP-Net model results have segmentation errors for the ripe and unripe sugar apples, while HR-Net, DeepLabv3+ (Xception), and DeepLabv3+ (MobileNetV2) all show different degrees of under-segmentation; our proposed model performs relatively well. Figure 13c shows that U-Net and PSP-Net are disturbed by the background and wrongly segment the peeled outer skin; HR-Net, DeepLabv3+ (Xception), and DeepLabv3+ (MobileNetV2) all have serious segmentation errors, while our proposed model performs better, though a slight under-segmentation phenomenon remains. In Figure 13d, all five compared models show under-segmentation and segmentation errors for the bad sugar apple target with severe skin peeling; in contrast, our proposed model performs better. Figure 13e fully reflects how the background noise caused by sugar apple skin peeling affects the segmentation results: in addition to the five compared models all exhibiting different degrees of segmentation errors for the bad sugar apple, U-Net and HR-Net are affected by the peeled skin and both wrongly segment the background skin, while PSP-Net, DeepLabv3+ (Xception), and DeepLabv3+ (MobileNetV2) are not affected by the background but still show different degrees of segmentation errors for the bad sugar apple target; our proposed model shows better accuracy and completeness.

Ablation Analysis
Ablation experiments with module removal are performed to assess the influence of each component on the efficacy of the proposed model. Table 6 presents the outcomes of semantic segmentation applied to post-harvest sugar apple images, considering various combinations of methodologies. Contrasting (1) and (2), notable reductions of 89.38% in Params and 68.30% in Flops are evident when transitioning from Xception to MobileNetV2 as the backbone. This highlights the substantial simplification of the model's complexity achieved by MobileNetV2. Moreover, there is a positive impact on the four evaluation indicators: MIoU, MPA, PA, and MF1. Comparing the results of (1) and (3), it can be observed that incorporating ECA between the encoding and decoding regions to adjust channel weights slightly improves the overall performance of the model: the values of MIoU, MPA, PA, and MF1 increase by 0.87%, 0.91%, 0.22%, and 0.54%, respectively. Based on the comparison of (1) and (4), it can be seen that after introducing the coordinate attention branch in the feature map after the ASPP output, the performance of the model is upgraded by 1.16%, 0.98%, 0.27%, and 0.70% on MIoU, MPA, PA, and MF1, respectively, which shows that this method effectively improves attention to the target objects. Comparing (1) and (5), after substituting the initial ASPP module with the Dense ASPP module, improvements in model performance are observed, with increases of 1.23%, 1.03%, 0.29%, and 0.76% in the MIoU, MPA, PA, and MF1 values, respectively. According to the results of Method (6), it is evident that incorporating all the improvements simultaneously significantly reduces the model's complexity and enhances overall performance. The values of MIoU, MPA, PA, and MF1 are 89.95%, 94.58%, 96.60%, and 94.61%, respectively, while the Params and Flops are 5.91 M and 53.24 G, respectively. Compared to the original model (Method (1)), the proposed model
achieves improvements of 2.13%, 1.77%, 0.62%, and 1.27% in MIoU, MPA, PA, and MF1, respectively, while reducing Params and Flops by 89.20% and 69.09%.

Generalizability Analysis
To investigate the generalizability of the proposed model to other tasks, an open-source dataset focusing on apples was selected, considering its relevance to the topic of fruits and the number of images available. This dataset, obtained from Baidu's Paddle deep learning platform, is a semantic segmentation dataset with five categories, including three types of apples, pears, and peaches, totaling 758 images. It can be obtained from https://aistudio.baidu.com/datasetdetail/114414 (accessed on 16 March 2024). A portion of the images is illustrated in Figure 14.

In the same experimental settings, the DeepLabv3+ and ECD-DeepLabv3+ models were further compared. Upon analysis of Table 7 and Figure 14, it is evident that, owing to the lower background noise and the simpler nature of this open-source dataset, both DeepLabv3+ and ECD-DeepLabv3+ demonstrate high performance, and the proposed ECD-DeepLabv3+ still slightly surpasses DeepLabv3+ on this task. Furthermore, the training processes of the two models were explored, with Figure 15 illustrating the training curves of Pixel Accuracy, MPA, MIoU, and MF1 on the validation set for both models. It is apparent from the figure that ECD-DeepLabv3+ not only improves faster across all evaluation metrics but also converges earlier, further demonstrating its superiority in optimization and generalization capabilities. Additionally, it is worth mentioning that the proposed model has significantly reduced complexity (Params and Flops), facilitating easier deployment to embedded systems.

Discussion
The task of accurately and automatically classifying the ripeness of fruits is meaningful because the sorting of fruit ripeness is still predominantly manual. This reliance on manual labor may lead to sorting errors due to insufficient human experience and results in low efficiency [41,42]. To address this inefficiency, numerous computer-vision-based methods for fruit ripeness detection have been proposed. Although these methods have demonstrated good performance (with average accuracy exceeding 85% for sugar apple [26] and strawberry [24] ripeness tasks using YOLO object detection and semantic segmentation algorithms, respectively), there is still significant room for improvement. Moreover, object detection methods are limited in the information they can express, while semantic segmentation methods can provide more detailed and specific information [27]. It is worth mentioning that the approach proposed in this paper not only improves performance but also reduces model complexity, avoiding the limitations associated with deploying models to embedded devices caused by large parameter counts and floating-point computational requirements [43,44].
Nowadays, semantic segmentation techniques have been applied in many fields of agriculture, for example, plant leaf disease segmentation [45], plant flower segmentation [46], specific whole plants [47], and so on [48,49].Inspired by the aforementioned research, this paper explores the feasibility of applying semantic segmentation techniques to the task of sugar apple ripeness detection and proposes an improved semantic segmentation model (ECD-DeepLabv3+).To validate the performance of the proposed model, we created a dataset with 1600 optical images of sugar apples at different ripeness levels on various backgrounds.Detailed quantitative and qualitative analysis experiments were conducted on each model.The experimental results indicate that, under consistent experimental hardware parameters, model training hyperparameters, and datasets, the proposed model exhibits better performance and lower model complexity.The values for

The model Params and Flops are 5.91 M and 53.24 G, respectively. Compared to the original DeepLabv3+ model, this approach achieves improvements of 2.13%, 1.77%, 0.62%, and 1.27% in MIoU, MPA, PA, and MF1 and reductions of 89.20% and 68.09% in Params and Flops, respectively. These results indicate that the proposed semantic segmentation model may make it more convenient to deploy deep learning models on embedded devices. Additionally, through computer vision technology, it offers a potential method for better assessing the ripeness of fruits such as sugar apples, which could contribute to their automated classification.
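As a quick sanity check on the reported reductions, the original model's complexity implied by these figures can be back-computed from original = proposed / (1 - reduction). The implied values below are derived from the numbers in this section, not restated from the paper's tables.

```python
proposed_params_m = 5.91   # M parameters (reported)
proposed_flops_g = 53.24   # GFlops (reported)
params_cut = 0.8920        # reported reduction in Params
flops_cut = 0.6809         # reported reduction in Flops

# original = proposed / (1 - reduction)
implied_params_m = proposed_params_m / (1 - params_cut)  # ~54.7 M
implied_flops_g = proposed_flops_g / (1 - flops_cut)     # ~166.8 G
```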
Furthermore, compared to other imaging methods, optical images not only have real-time characteristics but are also more easily integrated with embedded devices, allowing the model to output the ripeness of sugar apples directly from their visual appearance. In addition, we employed transfer learning to initialize the model weights [50], adopting a training strategy that combines freeze training and non-freeze training (30% freeze, 70% non-freeze). This maximizes the use of generic features learned on a large-scale dataset (vocdevkit2007) by freezing the lower-level weights to preserve these features; fine-tuning is then performed on the target task to adapt to specific domain requirements. This training strategy not only expedites model convergence and mitigates overfitting risks but also enhances the model's performance in real applications [51,52].
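The 30%/70% freeze/unfreeze split can be expressed as a simple epoch schedule; in a framework such as PyTorch, the "freeze" phase would correspond to setting `requires_grad=False` on the backbone parameters. The helper below is a hypothetical illustration of the scheduling logic, not the authors' training code.

```python
def phase_schedule(total_epochs, freeze_ratio=0.3):
    """Label each epoch as 'freeze' (backbone weights fixed) or
    'unfreeze' (whole network fine-tuned)."""
    freeze_epochs = int(round(freeze_ratio * total_epochs))
    return ["freeze" if e < freeze_epochs else "unfreeze"
            for e in range(total_epochs)]

schedule = phase_schedule(100)  # 30 freeze epochs, then 70 unfreeze
```

A training loop would consult this schedule at the start of each epoch and toggle backbone gradient updates when the phase changes.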
Compared to existing computer vision methods for detecting the ripeness of sugar apples, we have, for the first time, applied semantic segmentation technology to this problem, and the proposed enhanced model demonstrates superior performance. Moreover, by assigning a semantic label to each pixel in the image, we achieve pixel-level segmentation, which provides more detailed information. However, it is worth noting that creating a dataset for semantic segmentation is more challenging and time-consuming than creating one for object detection.
In the future, we will explore how to reduce the model's reliance on data scale through few-shot learning, aiming to decrease the workload of data collection and annotation. Additionally, we plan to deploy the proposed model on embedded devices to validate its performance in such environments. We also intend to investigate the model's performance on ripeness detection tasks for other fruits, such as bananas and apples.

Conclusions
This paper delves into the feasibility of using deep-learning-based computer vision methods for detecting different ripeness levels of harvested sugar apples. To meet the lightweight deployment requirements of embedded models and further enhance performance, a lightweight deep learning model named ECD-DeepLabv3+ is proposed, based on the best-performing of four well-known semantic segmentation models (U-Net, PSPNet, HRNet, and DeepLabv3+). By replacing the backbone with MobileNetV2, the complexity of the model is greatly reduced. In addition, the combination of the ECA, CA, and Dense ASPP modules improves the model's performance. The MIoU, MPA, PA, and MF1 values of the proposed model on the custom dataset are 89.95%, 94.58%, 96.60%, and 94.61%, and the Params and Flops are 5.91 M and 53.24 G, respectively. Compared to the original DeepLabv3+ model, this approach achieves reductions of 89.20% and 68.09% in Params and Flops while improving MIoU, MPA, PA, and MF1 by 2.13%, 1.77%, 0.62%, and 1.27%, respectively. Moreover, the ablation experiments show that each module in the proposed method is effective. Finally, further experiments with DeepLabv3+ and ECD-DeepLabv3+ on another publicly available dataset affirm the superiority of the proposed model in both performance and complexity, making it better suited to embedded devices and offering a potential solution for digital agriculture.

Figure 1. Some images from the self-made dataset.


Figure 11. Results of various evaluation indicators in each category of ECD-DeepLabv3+.


Figure 12. Segmentation results on the clean background.

Figure 13. Segmentation results on the complex background.

Figure 14. Images for each category in the open-source dataset.


Table 1. Number of images for each category in the dataset (except background).

The training process is divided into two phases, freeze training and unfreeze training, to accelerate model training. Further details of the experiment can be found in Table 4.

Table 5. Performance of the six models.

Table 6. Results of the ablation experiments.
