A Study of Kale Recognition Based on Semantic Segmentation

: The kale crop is an important bulk vegetable, and automatic segmentation to recognize kale is fundamental for effective field management. However, complex backgrounds and texture-rich edge details make fine segmentation of kale difficult. To this end, we constructed a kale dataset in a real field scenario and proposed an UperNet semantic segmentation model with a Swin transformer as the backbone network and improved the model according to the growth characteristics of kale. Firstly, a channel attention module (CAM) is introduced into the Swin transformer module to improve the representation ability of the network and enhance the extraction of kale outer leaf and leaf bulb information; secondly, the extraction accuracy of kale target edges is improved in the decoding part by designing an attention refinement module (ARM); lastly, the uneven distribution of classes is solved by modifying the optimizer and loss function to solve the class distribution problem. The experimental results show that the improved model in this paper has excellent performance in feature extraction, and the average intersection and merger ratio (mIOU) of the improved kale segmentation can be up to 91.2%, and the average pixel accuracy (mPA) can be up to 95.2%, which is 2.1 percentage points and 4.7 percentage points higher than the original UperNet model, respectively, and it effectively improves the segmentation recognition of kale.


Introduction
Kale is an important bulk vegetable species in China and is well liked by consumers because it is rich in various nutrients required by the human body, and it is very beneficial to human health.The main growth characteristics of kale are its outer leaves and leaf bulb, whose growth traits affect the field water and fertilizer management measures, and ultimately affect the overall yield of kale.Therefore, it is important to keep the growth status of the outer leaves and leaf bulbs of kale abreast to reduce the risk of damage to kale.
In recent years, with the rapid development of deep learning, semantic segmentation [1], target detection [2], and image classification [3] have also made significant progress.Among them, in computer vision, semantic segmentation is a very important direction, and the main method used is to judge the category that this image belongs to by the pixels that have been labelled in the image.Field kale images have color, texture, and spatial structure information.Traditional image processing algorithms, such as pixel-level cluster-based segmentation, pixel-level threshold-based segmentation, and pixel-level decision tree-based classification, usually use the underlying features of the image for segmentation.These traditional algorithms may face the problem of poor accuracy in image segmentation tasks because they mainly focus on the underlying features and ignore higher-level semantic information.With the rise of deep learning techniques, modern image segmentation methods are increasingly favoring the use of deep learning models, such as convolutional neural networks, which are capable of learning higher level abstract features to achieve better performance in image segmentation tasks.Deep learning methods are able to improve segmentation accuracy by automatically learning feature representations in images through end-to-end learning.Since the introduction of fully convolutional neural networks, many classical semantic segmentation networks have emerged, including FCN [4], Unet [5], PSPnet [6], DeepLab [7], DeepLabv3+ [8], and so on.These classical networks have had a profound impact on later semantic segmentation research, and their applications on the agricultural field have gradually increased.Junrui Xue et al. [9] proposed an image segmentation method based on improved FCN-8s and obtained a date class intersection and merger ratio of 93.50% and a segmentation speed of 16.20 frames/s.Song et al. [10] used DeepLabV3 + based on Resnet-101 to enable the best segmentation of fruit calyx, branches, and filaments in kiwifruit canopy images.Hongjie et al. [11] proposed a method to segment crops in UAV remote sensing images by replacing the ordinary convolution in the ASPP module with a depth-separable convolution and by adding a double attention mechanism to improve the DeepLabv3+ model; the results show that the pixel accuracy of this method reached 93.9%, and the average intersection and merger ratio was 83.3%.Compared with convolutional neural networks, deep neural networks based on the transformer architecture [12] with self-attention mechanism have achieved excellent results in natural language processing [13] problems facing time-series data in recent years.The self-attention mechanism is able to increase the feature weights in the input linear data so that it can better extract the feature information; thus, scholars have begun to use it in the field of image processing.However, the Swin transformer [14] is a deep neural network based on the self-attention mechanism.Swin transformer [14] architecture is thus proposed; the window self-attention mechanism of a Swin transformer enables the transformer architecture to be well applied to the task of processing two-dimensional and above images and videos with large input data.Liu et al. [14] proposed the Swin transformer model by restricting windows to control the interaction between different slices, and the method also greatly reduced the computational effort of the network.
Semantic segmentation of vegetable crops, such as kale, is a critical task for efficient field management and yield optimization.However, current methodologies face significant challenges when segmenting the leaf bulb and outer leaves of kale against complex backgrounds, often resulting in low accuracy and compromised real-time performance.Traditional machine learning techniques have struggled to capture the intricate details and variations present in kale images, leading to suboptimal segmentation results.
In this context, the following limitations of existing research are summarized by this study: (1) insufficient datasets: the lack of comprehensive datasets capable of capturing kale diversity under real-world conditions hinders the training and validation of robust segmentation models [15]; (2) limited feature representation: existing methods often cannot effectively represent and extract the unique features of kale bulbs and outer leaves, especially in scenes with complex backgrounds [11]; (3) edge detection accuracy: the accuracy of edge detection of kale leaves often falls short of the requirements, which are crucial for accurate segmentation [10]; (4) class imbalance: the class imbalance problem during training has not been fully resolved, which may lead to biased model performance [14].
To overcome these limitations, the main contributions of this work are summarized as follows [16]:

•
Building upon the UperNet architecture, a tailored semantic segmentation framework is developed that considers the specific growth characteristics of kale leaves.

•
By adopting the Swin transformer as our backbone network, the channel attention module (CAM) is introduced to significantly improve the network's ability to represent and extract information about the kale's outer leaves and leaf bulb.

•
An ARM within the decoding part of the framework is designed to refine the edge detection accuracy of kale's target areas.

•
The issue of class imbalance is addressed by modifying the optimizer and loss function during the network training phase, ensuring a more balanced class distribution and improved learning dynamics.
The remainder of this article is organized as follows.Section 2 briefly summarizes related research.Section 3 provides technical details of model training.Section 4 provides a comprehensive discussion of the experiments.Finally, Section 5 outlines some concluding remarks and suggestions for future research.

Experimental Data
The image data in this study were obtained from the Precision Agriculture Experimental Base of the National Agricultural Informatization Engineering and Technology Research Centre in Xiaotangshan, Beijing, and the selected kale variety was Zhonggan-21, which was annotated according to the Pascal Voc dataset format using the open-source image annotation tool Labelme (v4.5.6), which was then preprocessed synchronously with the annotated images and their original images.In order to further expand the data samples, the randomly cropped images were subjected to data enhancement [17] operations, such as horizontal flipping, vertical flipping, brightness adjustment, and adding Gaussian noise.Finally, a dataset with a size of 5000 images was obtained.The images were randomly divided into training and test sets in an 8:2 ratio.The combination of these image enhancement methods simulates the changes in the shooting angle and light intensity during image acquisition, increases the diversity of training samples, and improves the robustness and generalization ability of the model.An example of using the LabelMe annotation is shown in Figure 1, where the black part is the background, the red part is the outer leaf, and the green part is the leaf bulb.


The issue of class imbalance is addressed by modifying the optimizer and loss function during the network training phase, ensuring a more balanced class distribution and improved learning dynamics.
The remainder of this article is organized as follows.Section 2 briefly summarizes related research.Section 3 provides technical details of model training.Section 4 provides a comprehensive discussion of the experiments.Finally, Section 5 outlines some concluding remarks and suggestions for future research.

Experimental Data
The image data in this study were obtained from the Precision Agriculture Experimental Base of the National Agricultural Informatization Engineering and Technology Research Centre in Xiaotangshan, Beijing, and the selected kale variety was Zhonggan-21, which was annotated according to the Pascal Voc dataset format using the open-source image annotation tool Labelme (v4.5.6), which was then preprocessed synchronously with the annotated images and their original images.In order to further expand the data samples, the randomly cropped images were subjected to data enhancement [17] operations, such as horizontal flipping, vertical flipping, brightness adjustment, and adding Gaussian noise.Finally, a dataset with a size of 5000 images was obtained.The images were randomly divided into training and test sets in an 8:2 ratio.The combination of these image enhancement methods simulates the changes in the shooting angle and light intensity during image acquisition, increases the diversity of training samples, and improves the robustness and generalization ability of the model.An example of using the LabelMe annotation is shown in Figure 1, where the black part is the background, the red part is the outer leaf, and the green part is the leaf bulb.

An Improved Semantic Segmentation Model
In this paper, UperNet is used as an implementation framework for semantic segmentation.The feature extractor of this semantic segmentation architecture is set as a Feature Pyramid Network (FPN) based on the Swin transformer backbone.It utilizes the multi-level feature representation obtained by the Swin transformer to represent the corresponding pyramid hierarchies, using a top-down FPN architecture with horizontal connectivity and downsampling ratios that are consistent with those of the Swin transformer.The Pyramid Pooling Module [18] (PPM) is located before the top-down branches of the FPN and is connected to stage 4 of the Swin transformer network, and the PPM is capable of delivering effective global a priori feature representations that are highly compatible with the FPN architecture.This form of architecture can effectively cooperate with the hierarchical feature expression obtained by the Swin transformer to achieve a better semantic segmentation effect based on the fusion of high-and low-level semantic infor-

An Improved Semantic Segmentation Model
In this paper, UperNet is used as an implementation framework for semantic segmentation.The feature extractor of this semantic segmentation architecture is set as a Feature Pyramid Network (FPN) based on the Swin transformer backbone.It utilizes the multi-level feature representation obtained by the Swin transformer to represent the corresponding pyramid hierarchies, using a top-down FPN architecture with horizontal connectivity and downsampling ratios that are consistent with those of the Swin transformer.The Pyramid Pooling Module [18] (PPM) is located before the top-down branches of the FPN and is connected to stage 4 of the Swin transformer network, and the PPM is capable of delivering effective global a priori feature representations that are highly compatible with the FPN architecture.This form of architecture can effectively cooperate with the hierarchical feature expression obtained by the Swin transformer to achieve a better semantic segmentation effect based on the fusion of high-and low-level semantic information.The feature extraction capability of the network is further improved by incorporating EAM on the Swin transformer module.In the feature fusion module, all the hierarchical features output from the FPN are adjusted to the same size by bilinear interpolation, and then a convolutional layer is applied to fuse the features from different levels.The target segmentation header is appended to the fused feature map with a separate convolutional layer in front of each classifier.All additional non-classifier convolutional layers have a batch normalization of 512 channel outputs [19], and ReLU [20] activation functions are applied.STM is added to the encoder part to avoid omission of the extracted information.The model output is a category mask generated from the pixel classification prediction labels, which in turn yields a segmentation map.This results in early recognition of kale while obtaining a fine segmentation of its target region.The improved UperNet semantic segmentation framework is shown in Figure 2.
mation.The feature extraction capability of the network is further improved by incorporating EAM on the Swin transformer module.In the feature fusion module, all the hierarchical features output from the FPN are adjusted to the same size by bilinear interpolation, and then a convolutional layer is applied to fuse the features from different levels.The target segmentation header is appended to the fused feature map with a separate convolutional layer in front of each classifier.All additional non-classifier convolutional layers have a batch normalization of 512 channel outputs [19], and ReLU [20] activation functions are applied.STM is added to the encoder part to avoid omission of the extracted information.The model output is a category mask generated from the pixel classification prediction labels, which in turn yields a segmentation map.This results in early recognition of kale while obtaining a fine segmentation of its target region.The improved UperNet semantic segmentation framework is shown in Figure 2.

Swin Transformer
The Swin transformer serves as the backbone network for our visual task, leveraging a fully self-attentive mechanism.It surpasses traditional convolutional neural network architectures in semantic segmentation tasks.Unlike standard transformer architectures, the Swin transformer excels in constructing hierarchical feature representations essential for pixel-dense prediction.Its modeling capability is notably enhanced by the shift-windowbased self-attention mechanism.
Key features of a Swin transformer: (1) hierarchical feature representation: a Swin transformer constructs hierarchical feature representations crucial for precise pixel-level prediction, contributing to superior segmentation outcomes.(2) Shift-window-based selfattention: this mechanism computes self-attention locally within non-overlapping windows of the segmented image, enabling efficient cross-window connectivity and expediting model inference.The structure of the Swin transformer block is shown in Figure 3.The first Swin transformer block maintains a constant number of input and output tokens at [H/4, W/4], which is designated as Stage 1 alongside the linear embedding layer.As data flows into the Swin transformer block, multi-head self-attention computation for the window commences.

Swin Transformer
The Swin transformer serves as the backbone network for our visual task, leveraging a fully self-attentive mechanism.It surpasses traditional convolutional neural network architectures in semantic segmentation tasks.Unlike standard transformer architectures, the Swin transformer excels in constructing hierarchical feature representations essential for pixel-dense prediction.Its modeling capability is notably enhanced by the shift-windowbased self-attention mechanism.
Key features of a Swin transformer: (1) hierarchical feature representation: a Swin transformer constructs hierarchical feature representations crucial for precise pixel-level prediction, contributing to superior segmentation outcomes.(2) Shift-window-based selfattention: this mechanism computes self-attention locally within non-overlapping windows of the segmented image, enabling efficient cross-window connectivity and expediting model inference.The structure of the Swin transformer block is shown in Figure 3.The first Swin transformer block maintains a constant number of input and output tokens at [H/4, W/4], which is designated as Stage 1 alongside the linear embedding layer.As data flows into the Swin transformer block, multi-head self-attention computation for the window commences.Attentional mechanisms play a crucial role in semantic segmentation tasks by enhancing the focus on important features while suppressing distracting ones.However, in the case of kale, which grows in complex outdoor environments influenced by various factors such as soil, light changes, weeds, and leaf texture similarity, accurately capturing Attentional mechanisms play a crucial role in semantic segmentation tasks by enhancing the focus on important features while suppressing distracting ones.However, in the case of kale, which grows in complex outdoor environments influenced by various factors such as soil, light changes, weeds, and leaf texture similarity, accurately capturing the intricate relationship between kale and its surroundings is challenging.While a Swin transformer excels in integrating spatial context information, it may not accurately capture the complex interactions and dependencies between channels.
To address this limitation, we propose the channel attention module (CAM), which is inserted after the W-MSA and SW-MSA stages of the Swin transformer.CAM is used to assign different weights to each channel of the input feature map based on their importance in representing relevant information about the kale's morphology.This mechanism allows the network to emphasize key features associated with the outer leaves and leaf bulb, such as texture, shape, and color variations, while attenuating distractions from the background or other non-essential elements in the image.By adaptively adjusting the weights of different channels, the CAM module effectively enhances the discriminative power of the network, enabling it to capture and highlight crucial details about the kale's structure and appearance, ultimately improving segmentation performance, particularly in delineating the boundaries and contours of the outer leaves and leaf bulb with greater accuracy and precision.The structure of the CAM module is illustrated in Figure 4. Initially, CAM performs a squeeze operation on the feature map obtained after self-attention to extract global features at the channel level.Subsequently, an excitation operation is performed on these global features to learn the relationship between each channel and obtain the weights of different channels.Finally, the original feature map is multiplied by these weights to generate the final features.This module preserves the channel dimension while compressing the spatial dimension, thereby enhancing the interaction of information between input channels.By incorporating CAM into both W-MSA and SW-MSA stages, the correlation between channels can be better explored and utilized at different levels of the model.Moreover, since W-MSA mainly focuses on global spatial relations and SW-MSA on local spatial relations, adding channel attention to both stages allow for better integration of global and local spatial information with channel correlation.The Swin transformer block enhanced with CAM is depicted in Figure 5.By incorporating CAM into both W-MSA and SW-MSA stages, the correlation between channels can be better explored and utilized at different levels of the model.Moreover, since W-MSA mainly focuses on global spatial relations and SW-MSA on local spatial relations, adding channel attention to both stages allow for better integration of global and local spatial information with channel correlation.The Swin transformer block enhanced with CAM is depicted in Figure 5.
By incorporating CAM into both W-MSA and SW-MSA stages, the correlation between channels can be better explored and utilized at different levels of the model.Moreover, since W-MSA mainly focuses on global spatial relations and SW-MSA on local spatial relations, adding channel attention to both stages allow for better integration of global and local spatial information with channel correlation.The Swin transformer block enhanced with CAM is depicted in Figure 5.

Attention Refinement Module (ARM)
To extract detailed information effectively from kale images, we introduce an attention refinement module (ARM) [21,22], as depicted in Figure 6.This module plays a crucial role in improving the model's ability to capture intricate details and semantic context relevant to kale leaf segmentation.The ARM takes high-order features extracted from earlier layers of the network and refines them using channel attention mechanisms.This step allows the model to focus on important features specific to kale leaves, enhancing their representation in the feature space.In parallel, the ARM processes low-order features, which typically capture fine details and edges in the image.Through spatial attention mechanisms, these features are recalibrated to better align with the semantic information

Attention Refinement Module (ARM)
To extract detailed information effectively from kale images, we introduce an attention refinement module (ARM) [21,22], as depicted in Figure 6.This module plays a crucial role in improving the model's ability to capture intricate details and semantic context relevant to kale leaf segmentation.The ARM takes high-order features extracted from earlier layers of the network and refines them using channel attention mechanisms.This step allows the model to focus on important features specific to kale leaves, enhancing their representation in the feature space.In parallel, the ARM processes low-order features, which typically capture fine details and edges in the image.Through spatial attention mechanisms, these features are recalibrated to better align with the semantic information of high-order features.This refinement process ensures that important details related to kale leaves are preserved and accurately represented in the final feature maps.The refined high-order and low-order features are then fused together to create a comprehensive representation of the kale leaves.This fusion process leverages both the semantic context captured by high-order features and the fine details captured by low-order features, resulting in a more robust and accurate representation of kale leaves in the dataset.By refining both high-order and low-order features, the ARM enhances the model's ability to segment kale leaves accurately.It ensures that important details and semantic context are effectively captured, leading to improved segmentation performance, especially in challenging scenarios with complex backgrounds or occlusions.
Agronomy 2024, 14, x FOR PEER REVIEW 7 o of high-order features.This refinement process ensures that important details related kale leaves are preserved and accurately represented in the final feature maps.The refin high-order and low-order features are then fused together to create a comprehensive r resentation of the kale leaves.This fusion process leverages both the semantic context c tured by high-order features and the fine details captured by low-order features, result in a more robust and accurate representation of kale leaves in the dataset.By refining b high-order and low-order features, the ARM enhances the model's ability to segment k leaves accurately.It ensures that important details and semantic context are effectiv captured, leading to improved segmentation performance, especially in challenging s narios with complex backgrounds or occlusions.Firstly, the lower-order features are averaged and pooled across the entire chan axis to compute the average value of each pixel along the channel axis.Subsequently, × 7 convolutional operation followed by a sigmoid function is applied to capture lo spatial dependencies.On the other hand, the high-order features are maximally poo along a single channel to generate a feature vector, ensuring that each channel of the hi order features contributes its representative response to the feature vector.The outp from the spatial and channel attention branches are then fused, and the number of ch nels is adjusted using a 3 × 3 convolution operation to obtain the final output of the m ule.
This multi-step process enables the ARM to refine feature representations effectiv capturing both spatial and channel-wise dependencies to recover fine details while p serving segmentation accuracy.

Te st Platform Configuration
The experimental environment of this study is based on a computer with a Windo 10 operating system, an AVX2 central processing unit (CPU), and an NVIDIA GeFo RTX 2080 Ti graphics processing unit (GPU).The programming language is Python the general-purpose parallel computing architecture is CUDA 10.2, the deep neural n work (GPU) acceleration library is cuDNN 7.6.5, the computer vision library is Open 4.7.0, and the Pytorch 1.7.1 deep learning framework is used to build and adjust the rameters of the segmentation model in this paper.All the training and testing proces of the models were completed on the test platform built in this study to ensure the c sistency of the comparison conditions.Firstly, the lower-order features are averaged and pooled across the entire channel axis to compute the average value of each pixel along the channel axis.Subsequently, a 7 × 7 convolutional operation followed by a sigmoid function is applied to capture local spatial dependencies.On the other hand, the high-order features are maximally pooled along a single channel to generate a feature vector, ensuring that each channel of the high-order features contributes its representative response to the feature vector.The outputs from the spatial and attention branches are then fused, and the number of channels is adjusted using a 3 × 3 convolution operation to obtain the final output of the module.
This multi-step process enables the ARM to refine feature representations effectively, capturing both spatial and channel-wise dependencies to recover fine details while preserving segmentation accuracy.

Test Platform Configuration
The experimental environment of this study is based on a computer with a Windows 10 operating system, an AVX2 central processing unit (CPU), and an NVIDIA GeForce RTX 2080 Ti graphics processing unit (GPU).The programming language is Python 3.8, the general-purpose parallel computing architecture is CUDA 10.2, the deep neural network (GPU) acceleration library is cuDNN 7.6.5, the computer vision library is OpenCV 4.7.0, and the Pytorch 1.7.1 deep learning framework is used to build and adjust the parameters of the segmentation model in this paper.All the training and testing processes of the models were completed on the test platform built in this study to ensure the consistency of the comparison conditions.

Model Training Strategies
The model training is conducted end-to-end, with the inputs being the original images and the outputs being the corresponding recognition segmentation maps, with no human intervention in the process.The Swin transformer backbone network and the semantic segmentation framework are combined into a holistic model through a decoder-encoder structure and are trained at the same time.The backbone network acts as a decoder responsible for feature transformation and extraction, while the semantic segmentation framework acts as an encoder to reconstruct and fuse the output features of the backbone network and generate classification predictions based on them.During training, the optimizer is used to update and compute the network parameters that affect the model's training and model's output, so that they approach or reach the optimal values, thus minimizing the loss function.In the improved UperNet model, the AdamWarmup, Adam and Adadelta optimizers are used to fit the data, as shown in Figure 7, which shows a comparison of the change curve of loss during the training process of the model using different optimizers; it can be seen that the Adam Adadelta optimizers have a general effect, the fluctuation of the value of the loss function is relatively large, and it is obvious that the AdamWarmup optimizer is more effective.It can be seen that the Adam and Adadelta optimizers are generally effective, and the value of the loss function fluctuates during the training process.
Agronomy 2024, 14, x FOR PEER REVIEW 8 of 12 backbone network and generate classification predictions based on them.During training, the optimizer is used to update and compute the network parameters that affect the model's training and model's output, so that they approach or reach the optimal values, thus minimizing the loss function.In the improved UperNet model, the AdamWarmup, Adam and Adadelta optimizers are used to fit the data, as shown in Figure 7, which shows a comparison of the change curve of loss during the training process of the model using different optimizers; it can be seen that the Adam and Adadelta optimizers have a general effect, the fluctuation of the value of the loss function is relatively large, and it is obvious that the AdamWarmup optimizer is more effective.It can be seen that the Adam and Adadelta optimizers are generally effective, and the value of the loss function fluctuates during the training process.In this paper, we choose the exponential decay learning rate strategy with Warmup [23] using AdamWarmup as the model optimizer, with an initial learning rate of 1 × 10 −3 and the minimum value of the learning rate set to 1 × 10 −5 .During the training process, this learning rate strategy can automatically adjust the learning rate of each Epoch; during In this paper, we choose the exponential learning rate strategy with Warmup [23] using AdamWarmup as the model optimizer, with an initial learning rate of 1 × 10 −3 and the minimum value of the learning rate set to 1 × 10 −5 .During the training process, this learning rate strategy can automatically adjust the learning rate of each Epoch; during the first training Epoch, the model can quickly correct the data distribution; after the first training Epoch, the learning rate is set to a smaller value than the initial learning rate to ensure the model has good convergence.After the first Epoch, the learning rate is set to a smaller value than the initial learning rate to ensure good convergence of the model.After the model is relatively stable, the learning rate of the model is gradually increased to the preset initial learning rate to accelerate the convergence speed of the model, so that the model training effect is better in the late stage of model training, when the model can learn less new knowledge, and a larger learning rate will destroy the stability of the existing one.Therefore, in the subsequent training Epoch, the learning rate of the model is gradually reduced to approach the minimum value.The cross-entropy loss function is used to measure the distance between the predicted probability distribution of pixel categories and the probability distribution of real label categories during the training process, which is calculated as in Equation ( 1): where M is the number of pixels; N is the number of categories; i is the current pixel; c is the current category; b i is the true labelling category of pixel I; h is the 0-1 probability distribution function, and it is 1 if b i = c and 0 otherwise; pic is the predicted probability of pixel i belonging to the class c, which is obtained from the computation of the predicted category score by the Sigmoid function.The training performance of the model is measured by the calculation of the loss function during the iteration process, and the weights are adjusted by back propagation so that the error distance represented by the loss values is gradually reduced to achieve the training goal.

Indicators for Model Evaluation
The performance evaluation metrics of the model in this paper mainly use the mean pixel accuracy (mPA), mean intersection, and merger ratio (mIOU) to assess image segmentation model performance.Among them, mPA measures the average accuracy of the model in correctly predicting the pixels of each category-the higher the mPA value indicates, the better the pixel prediction accuracy of the model-which is calculated as shown in Equation (2).mIOU measures the segmentation accuracy of the model by calculating the ratio of the intersection and concatenation of the predicted segmentation results to the real segmentation results, with higher metrics indicating that the predicted results overlap with the real results and the model's segmentation effect is better.Its definition is shown in Equation where N is the number of target categories segmented (in the case of no background), n ii denotes the number of correctly categorized pixels, denotes the number of pixels in target category i, n ji denotes the number of pixels in target category i predicted to be category j, and n ji denotes the number of pixels in target category j predicted to be category i.

Ablation Experiment
In order to verify the effectiveness of the improved semantic segmentation model for the kale segmentation recognition algorithm, this paper designs ablation experiments for the UperNet model based on the original model UperNet.Table 1 shows a data comparison of kale segmentation recognition performance.In order to do so, the following steps were followed: (1) use the original model UperNet-Resnet network structure; (2) replace the backbone network of the UperNet model with the Swin transformer; (3) add the CAM on the basis of UperNet-Swin transformer; (4) add ARM to the UperNet-Swin transformer; (5) add CAM and ARM to the UperNet-Swin transformer.The comparison of the data in Table 1 illustrates that the network model in this paper will pay more attention to the detail information after the improvement of the network model, and mPA and mIOU are both improved, which proves that the modules are effective for network improvement.

Comparison of Segmentation Recognition Effect of Different Models
In order to further validate the effectiveness of the model in this paper, a comparative analysis of the segmentation performance of the model in this paper with FCN, Unet, PSPNet, Deeplabv3+, UperNet (Resnet), and UperNet (Swin transformer) was performed on a homemade kale dataset, and the results of the experiments are shown in Table 2.The improved UperNet-Swin transformer model achieves 91.2% mIOU for kale, which is higher than the FCN, Unet, PSPNet, Deeplabv3+, UperNet (Resnet), and UperNet (Swin transformer) models in the following order of mIOU for kale: 11%, 6.9%, 12.1%, 4.7%, 5.3%, and 2.1%, respectively.In this paper, mPA reached 95.2%, as above, and improved by 13.6%, 9.3%, 12.9%, 6.8%, 7.4%, and 4.7%, respectively.It can be concluded that the model segmentation effect of this paper has shown a significant improvement, and the main reason for the better segmentation effect of the model in this paper is that this model introduces the attention module and displays improvement in the encoder site, which strengthens the judgement of the target features at each stage and obtains the effective global contextual information.

Visualization and Analysis
To verify the effectiveness and interpretability of this paper's method in the field kale image segmentation task, this paper combines Table 2 to select the three models of UNet, DeepLabv3+, and UperNet (Swin transformer)-which have better segmentation and recognition effects-for visual comparative analysis (shown in Figure 8).

Visualization and Analysis
To the effectiveness and interpretability of this paper's method in the field kale image segmentation task, this paper combines Table 2 to select the three models of UNet, DeepLabv3+, and UperNet (Swin transformer)-which have better segmentation and recognition effects-for visual comparative analysis (shown in Figure 8).According to Figure 8, the following conclusions can be drawn.Segmentation details: (1) The proposed model demonstrates superior segmentation details compared to the other models.Specifically, UNet exhibits a noticeable pixel area missing in the outer leaf region, indicating some information loss.DeepLabv3+ misidentifies the leaf sphere, resulting in information misjudgment.Although UperNet (Swin transformer) also displays misidentification in the leaf sphere area with slight information loss, our model achieves segmentation results that closely resemble the ground truth labels.
(2) Precision in leaf sphere identification: While the UperNet (Swin transformer) model may not be precise enough in identifying the leaf sphere area, our model's segmentation effect is remarkably similar to the ground truth when observed by the human eye.We note that mis-segmentation typically occurs near the edges of the target, where only a few pixels are misclassified.However, these small discrepancies have minimal impact on the overall segmentation quality and are challenging to detect without close inspection.
(3) Generalization ability: Combining the visual analysis with the quantitative results presented in Table 2, the proposed method exhibits stronger practical generalization ability in terms of pixel recognition accuracy across different growth states of kale.The model's ability to accurately segment kale leaves under varying conditions highlights its robustness and effectiveness for real-world applications.
In summary, the detailed visual analysis of the model confirms the efficacy of the proposed method for kale image segmentation.The close resemblance of the segmentation results to the ground truth labels underscores the model's accuracy and reliability.Additionally, the model demonstrates strong generalization ability across diverse kale growth states, further validating its practical utility.

Conclusions
This paper presents an enhanced semantic segmentation model aimed at accurately segmenting kale leaves in complex environments.Leveraging the UperNet semantic segmentation framework, our approach integrates a Swin transformer as the backbone network and introduces the channel attention module (CAM) to enhance feature extraction from kale leaves.Additionally, an attention refinement module (ARM) is designed to refine target features and improve edge extraction.Finally, optimization techniques are employed to mitigate the issue of an uneven class distribution.Through extensive comparative experiments, the proposed method outperforms other semantic segmentation models, achieving mIOU and MPA scores of 91.2% and 95.2%, respectively.The successful implementation of our method enables accurate segmentation recognition of kale in diverse field conditions,

Figure 1 .
Figure 1.Comparison of kale images before and after labelling.

Figure 1 .
Figure 1.Comparison of kale images before and after labelling.

Figure 4 .
Figure 4. Structure of the channel attention module.

Figure 4 .
Figure 4. Structure of the channel attention module.

Figure 5 .
Figure 5. Structure of the improved Swin transformer block.

Figure 5 .
Figure 5. Structure of the improved Swin transformer block.

Figure 6 .
Figure 6.Structure of the attention refinement module.

Figure 6 .
Figure 6.Structure of the attention refinement module.

Figure 7 .
Figure 7. Loss function during training of different optimizers.

Figure 7 .
Figure 7. Loss function during training of different optimizers.

Figure 8 .
Figure 8. Visual analysis of the model.According to Figure8, the following conclusions can be drawn.Segmentation details:

Figure 8 .
Figure 8. Visual analysis of the model.

Table 2 .
Comparative experiments with different models.