Article

Optimized Watermelon Scion Leaf Segmentation Model Based on Hungarian Algorithm and Information Theory

Department of Computer Science and Technology, Zhejiang Sci-Tech University, Hangzhou 310018, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(3), 620; https://doi.org/10.3390/electronics14030620
Submission received: 11 January 2025 / Revised: 3 February 2025 / Accepted: 4 February 2025 / Published: 5 February 2025

Abstract

In the fully automated grafting process of watermelon seedlings, it is crucial to ensure that the scion’s cotyledons maintain a perpendicular orientation with the rootstock cotyledons. To achieve precise segmentation of watermelon scion cotyledons and accurately extract parameters, such as cotyledon orientation angles, this study introduces enhancements to the Mask2Former network, aiming to improve segmentation accuracy for watermelon scion cotyledons. Specifically, two innovative modules are designed. Taking Swin-Transformer as the backbone, an Optimal Feature Re-ranking (OFR) module based on the Hungarian Algorithm is devised to re-rank the feature maps obtained from the feature extraction process. Grounded in information theory, the amount of information in semantic segmentation tasks is quantified as Shannon entropy, enabling the model to perceive the information distribution of the feature maps and dynamically adjust the output features. Experimental results demonstrate that the improved model achieves mIoU, mDice, mPrecision, and mRecall scores of 97.44%, 98.70%, 98.20%, and 99.21%, respectively, greatly outperforming Mask2Former, FCNN, and DeepLabv3. Furthermore, the enhanced network exhibits superior accuracy in low signal-to-noise ratio environments, highlighting its robustness in complex scenarios. This study provides a high-precision solution for agricultural automation in the watermelon industry, contributing to the development of fully automated grafting machines.

1. Introduction

Watermelon is an important fruit that is widely cultivated in both tropical and temperate regions worldwide, particularly in China, where its production and consumption are among the highest globally. Renowned for its rich nutritional content, watermelon is beloved by people around the world [1]. Despite its broad appeal, watermelon cultivation faces significant challenges from both biotic and abiotic stresses, which can severely impact yield and fruit quality.
Grafting plays a crucial role in enhancing crop disease resistance, growth rate, yield per unit, and reducing the need for fertilizers and pesticides, making it a key process in the large-scale production of various vegetables and fruit. Therefore, grafting watermelon seedlings as scions onto cucurbitaceous rootstocks, such as pumpkin, squash, and zucchini, can improve watermelon’s resistance to diseases, abiotic stresses, and environmental adaptability, ultimately leading to increased yield and improved fruit quality [2,3,4,5].
With the continuous improvement of living standards worldwide, the demand for fruits and vegetables has shown a sustained growth trend. Against this backdrop, grafting technology has undergone rapid development, transitioning from traditional manual operations to semi-automation and even full automation, thereby providing critical technological support for the efficient production and quality enhancement of the fruit and vegetable industry. However, the technology for fully automated grafting is still in an immature stage, with grafting work in the fruit and vegetable industry primarily relying on manual labor, supplemented by semi-automatic grafting machines [6,7,8]. Faced with the growing demand for grafting, fruit and vegetable production enterprises are under significant production pressure, making the research and optimization of fully automatic grafting machines a focal point in the industry. At present, fully automatic grafting machines have primarily been used for demonstrating advanced technology and have not yet been fully applied in practical fruit and vegetable grafting production [9,10,11,12]. The main reason for this is the insufficient accuracy during the scion placement process, which still requires manual adjustment of the leaf position. Enhancing the visual detection accuracy of the grafting machine and equipping it with a control system that can automatically adjust the scion leaf position based on visual feedback can greatly reduce labor costs. Therefore, the development of a visual algorithm capable of quickly and accurately segmenting a target leaf’s posture in complex environments is of great significance for achieving fully automated grafting.
Semantic segmentation methods can generally be divided into two categories: traditional methods and deep learning methods. Among the traditional semantic segmentation methods, there are many outstanding studies, such as the following: Valliammal et al. [13] proposed a plant leaf segmentation method using a nonlinear K-means algorithm combined with Sobel edge detection, enhancing processing efficiency in complex backgrounds through precise morphological feature extraction. Li et al. [14] developed a point-cloud-based leaf segmentation algorithm integrating 3D filtering with facet region growing, effectively resolving overlapping leaf issues. Xia et al. [15] achieved 3D segmentation of occluded leaves in natural scenes using RGB-D depth data and active contour models, reporting an overall segmentation accuracy of 87.97%.
Many researchers have also used deep learning methods for leaf segmentation, including the following: Yang et al. [16] utilized Mask R-CNN for leaf segmentation (1.15% misclassification rate) and VGG16 for classifying 15 species (91.5% accuracy) in complex backgrounds. Guo et al. [17] designed LeafMask, a single-stage instance segmentation model incorporating multi-scale attention and mask refinement modules, achieving a 90.1% Dice score on the CVPPA LSC dataset. Bhagat et al. [18] employed an EfficientNet-B4 encoder with redesigned skip connections to balance computational efficiency and feature fusion performance.
The aforementioned semantic segmentation studies are of great significance, but there is still substantial room for improvement in segmentation accuracy. This paper improves the Mask2Former network model based on the requirements of practical application scenarios, further optimizing the model’s performance. First, a feature reordering module is introduced, utilizing the Hungarian Matching Algorithm to minimize the matching cost of feature reordering operations, aiming to find the globally optimal feature arrangement and ensure that each pixel aligns with its most relevant feature. Then, a dynamic information modulation module is incorporated, where the entropy value of the feature map is quantified using divergence metrics. This allows the model to adaptively regulate the feature fusion process, dynamically highlighting key features while suppressing redundant information. These improvements greatly enhance the segmentation accuracy of watermelon scion seedlings in the fully automated grafting process, which is of great importance for the development of fully automated grafting machines.

2. Materials and Methods

2.1. Image Dataset

Given the lack of publicly available datasets for watermelon seedlings, this study constructs a novel dataset, referred to as the Watermelon Seedling Cotyledon Dataset, to support the training and performance evaluation of the proposed algorithm model.
The dataset was created by simulating a greenhouse environment to cultivate watermelon seedlings in plug trays (as shown in Figure 1).
Image acquisition was performed using an NVIDIA Jetson Orin Nano development board as the embedded computing device, paired with a HikVision industrial camera featuring a resolution of 1.3 megapixels (as shown in Figure 2).
To ensure data diversity and enhance the robustness of the model, images were captured under varying lighting conditions and against diverse background environments. Ultimately, a dataset comprising 1000 images was constructed, capturing various poses and states of scion seedlings during the grafting process. All input images were resized to 512 × 512 pixels during preprocessing. The dataset was then divided into training and testing sets in a ratio of 8:2 for model training and validation.
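The 8:2 train/test split described above can be sketched as follows; the function name and the seeded shuffle are illustrative assumptions, not the authors' published code:

```python
import random

def split_dataset(paths, train_ratio=0.8, seed=42):
    """Shuffle image paths and split them into training and testing
    subsets (8:2 by default); seeding keeps the split reproducible."""
    paths = list(paths)
    random.Random(seed).shuffle(paths)
    k = int(len(paths) * train_ratio)
    return paths[:k], paths[k:]
```

For the 1000-image dataset described here, this yields 800 training and 200 testing images.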
During the data preprocessing stage, the open source annotation tool Labelme (operating in a Python 3.11 environment) was employed to manually annotate the scion seedling images using polygons. The annotations were stored in JSON format and subsequently converted into corresponding visualized PNG files (as shown in Figure 3). This dataset provides a critical foundation for subsequent research on grafting-related target detection and segmentation tasks.

2.2. Experimental Setup and Environment

This experiment was conducted on a Windows 11 operating system with a computing device equipped with an NVIDIA GeForce RTX 4090D GPU featuring 24 GB of VRAM. The experimental environment included CUDA 11.8, Python 3.11.0, PyTorch 2.3.0, and Torchvision 0.15.1. The model’s initial learning rate was set to 0.001 and dynamically adjusted during training using a cosine annealing strategy to effectively mitigate the risk of the model converging to a local optimum. The training process spanned 200 epochs, and the AdamW optimizer, known for its adaptive capabilities, was employed to enhance optimization efficiency. The pre-trained model utilized was a Swin-Transformer trained on the ImageNet dataset.
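The cosine annealing strategy mentioned above follows the standard schedule $\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})(1 + \cos(\pi t / T))$. A minimal sketch with the stated initial learning rate of 0.001 and 200 epochs (the `lr_min` value of 0 is an assumption, as the paper does not state it):

```python
import math

def cosine_annealing_lr(epoch, total_epochs=200, lr_max=0.001, lr_min=0.0):
    """Cosine-annealed learning rate at a given epoch (0-indexed):
    starts at lr_max and decays smoothly toward lr_min."""
    return lr_min + 0.5 * (lr_max - lr_min) * (
        1 + math.cos(math.pi * epoch / total_epochs))
```

The smooth decay avoids the abrupt drops of step schedules, which is the property cited here for escaping local optima.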

2.3. Mask2Former Model

In the field of computer vision, particularly in image segmentation tasks, Transformer-based models [19] have emerged as a major research focus in recent years, especially following the introduction of Vision Transformer (ViT) [20]. Mask2Former [21], a versatile image segmentation model based on the Transformer architecture, has garnered significant attention for its unified approach to instance segmentation, panoptic segmentation, and semantic segmentation tasks, achieving substantial performance improvements across these domains. The network architecture of Mask2Former is illustrated in Figure 4.
Mask2Former redefines segmentation tasks as mask prediction problems within a unified Transformer framework, showcasing superior performance across various segmentation tasks compared to traditional methods. Its workflow primarily consists of three stages: feature extraction, mask prediction, and segmentation result generation. Initially, input images are processed by a feature extraction network, where a Swin-Transformer backbone is employed to extract multi-scale feature maps. These feature maps are then fed into a Transformer-based mask prediction module, which leverages self-attention and cross-attention mechanisms to integrate information across different scales, enabling the generation of latent masks and corresponding class labels. Finally, the generated masks are applied to the input image to produce high-precision segmentation results.
Compared to its predecessor—MaskFormer [22]—Mask2Former incorporates several optimizations and enhancements. Firstly, it introduces a query grouping strategy, greatly improving the efficiency of query utilization and the quality of mask generation. Secondly, the model employs a multi-scale mask prediction strategy, allowing it to capture more fine-grained object features at different scales. Additionally, during training, Mask2Former adopts an improved optimization method to further enhance its segmentation performance.
This study selects Mask2Former primarily for its excellent scalability and memory-efficient training strategies, which strike a balance between detection speed and accuracy. These advantages make Mask2Former a robust choice for advancing segmentation tasks in diverse application scenarios.

2.4. Model Improvement

To improve the segmentation accuracy of Mask2Former for watermelon grafted seedling leaves, this paper proposes an Enhanced-Mask2Former network model that integrates Optimal Feature Re-ranking (OFR) and Dynamic Information Modulation (DIM). The network structure is illustrated in Figure 5.

2.4.1. Optimal Feature Re-Ranking (OFR)

In semantic segmentation tasks, the primary objective of the model is to accurately assign the corresponding semantic label to each pixel. The Feature Re-ranking method enhances the correlation between features, thereby optimizing the model’s ability to focus on critical features, which in turn improves segmentation accuracy. Specifically, this paper proposes a Feature Re-ranking method based on optimal transport to achieve reasonable adjustment of features and to enhance essential semantic information. The following provides a detailed description of the core process and mathematical modeling:
First, the input feature map $F \in \mathbb{R}^{N \times C \times H \times W}$ is flattened to obtain $F_{\text{flat}} \in \mathbb{R}^{N \times C \times (H \cdot W)}$. Then, normalization is applied, with the following normalization formula:

$$F'_{i,j} = \frac{F_{i,j}}{\|F_i\|}, \qquad \|F_i\| = \sqrt{\sum_{j=1}^{C} F_{i,j}^2}$$

In the formula, $F'_{i,j}$ represents a normalized feature, where $i$ is the sample index and $j$ is the channel index; the features are organized in a matrix of dimensions $N \times C$.
Based on the normalized features, a similarity matrix S is constructed to quantify the correlations between different features. The similarity matrix is computed using a dot product formulation, with its specific calculation expressed as follows:
$$S_{i,j} = F'_i \cdot {F'_j}^{T}$$

Each element $S_{i,j}$ of the similarity matrix reflects the cosine similarity between feature $i$ and feature $j$. Higher similarity values indicate greater semantic consistency between features, while lower values suggest more significant differences between them. To further quantify the differences between features, a cost matrix $C$ is constructed based on the similarity matrix $S$. The calculation is expressed with the following formula:
$$C_{i,j} = 1 - S_{i,j}$$
The cost matrix C represents the “matching cost” of feature alignment, where smaller values indicate higher feature similarity and lower reordering cost. By minimizing the cost matrix, an optimal arrangement of features can be achieved, thereby preserving more significant semantic information. To minimize the “matching cost”, this paper introduces the Hungarian Matching Algorithm [23] to solve the assignment problem. The objective function is formulated as follows:
$$\min \sum_{i} \sum_{j} C_{i,j} X_{i,j}$$

Here, $X_{i,j}$ is a binary variable indicating whether feature $i$ is reassigned to position $j$. The constraints are as follows:

$$\sum_{j} X_{i,j} = 1 \;\; \forall i, \qquad \sum_{i} X_{i,j} = 1 \;\; \forall j$$
The general process of the Hungarian Algorithm is as follows:
  • Row-wise minimum subtraction: Subtract the minimum value of each row from all elements in that row:
    $$C'_{i,j} = C_{i,j} - \min_{j} C_{i,j}$$
  • Column-wise minimum subtraction: Subtract the minimum value of each column from all elements in that column:
    $$C''_{i,j} = C'_{i,j} - \min_{i} C'_{i,j}$$
  • Find and mark zero elements: Identify the zero elements and mark them:
    $$Z_{i,j} = \begin{cases} 1, & \text{if } C_{i,j} = 0 \\ 0, & \text{otherwise} \end{cases}$$
  • Cover zero elements: Draw the minimum number of lines needed to cover all zero elements in the matrix. If the number of covering lines $L$ is less than $N$, adjust the matrix as follows:
    • Compute the minimum value $m$ among the uncovered elements:
      $$m = \min \{ C_{i,j} \mid Z_{i,j} = 0 \}$$
    • Then, adjust the matrix entries $C_{i,j}$ using the following strategy:
      $$C_{i,j} \leftarrow \begin{cases} C_{i,j} - m, & \text{if } Z_{i,j} = 0 \\ C_{i,j}, & \text{otherwise} \end{cases}$$
The optimal matching results obtained through the above process are used to rearrange the features. The final updated feature representation is as follows:

$$F' = F[\,:, \mathrm{col\_ind}\,]$$

Here, $\mathrm{col\_ind}$ represents the reordered channel indices. Finally, the rearranged features are restored to their original shape:

$$F' \in \mathbb{R}^{N \times C \times H \times W}$$
In semantic segmentation tasks, the significance of the Hungarian Algorithm lies in achieving optimal feature reordering by aligning the pixels in the current feature map with those in the preceding feature map. This process is based on feature similarity and utilizes the Hungarian Algorithm to identify the globally optimal arrangement, ensuring that each pixel aligns with its most relevant feature. Such reordering reduces the interference of redundant information during segmentation, enhancing the regional representation capability of features. This process improves the model’s segmentation accuracy and robustness, making it particularly effective for semantic segmentation tasks in complex scenarios.
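The OFR pipeline above (normalize, compute cosine similarity, build the cost matrix $C = 1 - S$, solve the assignment, reorder channels) can be sketched as follows. This uses SciPy's `linear_sum_assignment`, which implements the Hungarian algorithm; the function name, the pairing against a previous feature map, and the epsilon guard are illustrative assumptions rather than the authors' released code:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def optimal_feature_rerank(F_curr, F_prev):
    """Re-rank the channels of F_curr (C, H, W) so that each channel
    aligns with its most similar channel position in F_prev."""
    C = F_curr.shape[0]
    a = F_curr.reshape(C, -1)                       # flatten spatial dims
    b = F_prev.reshape(C, -1)
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-8)  # L2-normalize
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-8)
    S = a @ b.T                                     # cosine similarity matrix
    cost = 1.0 - S                                  # matching cost C = 1 - S
    row_ind, col_ind = linear_sum_assignment(cost)  # Hungarian algorithm
    perm = np.empty(C, dtype=int)
    perm[col_ind] = row_ind                         # channel row_ind[k] -> slot col_ind[k]
    return F_curr[perm], perm
```

With this convention, if `F_curr` is a channel permutation of `F_prev`, the re-ranked output recovers the original channel order, which is the alignment property the OFR module relies on.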

2.4.2. Dynamic Information Modulation (DIM)

In information theory, entropy [24] is the average amount of information contained in each received message. In semantic segmentation tasks, the quantification of information can be defined using Shannon entropy [25]. Given a probability distribution P ( X ) for a random variable X, its entropy H ( X ) is defined as follows:
$$H(X) = -\sum_{i} P(x_i) \log P(x_i)$$
In the semantic segmentation task, this paper treats each pixel as a random variable. The distributions of the output feature maps are denoted as P current and P prev , where P current represents the feature map distribution of the current layer and P prev represents the feature map distribution of the previous layer. By analyzing the information discrepancy between the two, the model can dynamically adjust the output features, thereby highlighting important information while suppressing redundant content. This provides a quantifiable basis for feature selection and output modulation.
To compare the information difference between $P_{\text{current}}$ and $P_{\text{prev}}$, both need to be adjusted to the same dimensions. Let the dimensions of $P_{\text{current}}$ be $(N, C, H_1, W_1)$ and those of $P_{\text{prev}}$ be $(N, C, H_2, W_2)$. In this paper, bilinear interpolation is used to upsample $P_{\text{prev}}$ to the same dimensions as $P_{\text{current}}$. The calculation formula for this is as follows:

$$P_{\text{output}}(x, y) = \sum_{i=0}^{H_1 - 1} \sum_{j=0}^{W_1 - 1} w_i w_j \cdot P_{\text{prev}}(x_i, y_j)$$
Here, w i and w j represent the interpolation weights.
The use of bilinear interpolation for upsampling in this paper offers the following two advantages:
  • Dimensional Consistency: The interpolated feature map has the same dimensions as P current , ensuring consistency in size and facilitating the subsequent information discrepancy calculation. This alignment simplifies the comparison of information between the two feature maps and enables seamless integration in the model’s processing pipeline.
  • Information Retention and Detail Reconstruction: Compared to other interpolation methods, bilinear interpolation better preserves surrounding information during the upsampling process, minimizing information loss. For semantic segmentation, maintaining the smoothness of spatial information aids in more accurately segmenting edges and fine details. When upsampling feature maps, bilinear interpolation can reconstruct details without distortion, which is crucial for segmentation tasks, as they require precise delineation of object boundaries and subtle features.
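A minimal pure-NumPy sketch of the bilinear upsampling step for a single 2-D map (the align-corners-style sampling grid is an assumption; in practice a library routine would be applied over the full $(N, C, H, W)$ tensor):

```python
import numpy as np

def bilinear_upsample(P, H_out, W_out):
    """Upsample a 2-D map P to (H_out, W_out) by bilinear interpolation:
    each output pixel is a weighted blend of its four nearest inputs."""
    H_in, W_in = P.shape
    ys = np.linspace(0, H_in - 1, H_out)            # sampling grid (rows)
    xs = np.linspace(0, W_in - 1, W_out)            # sampling grid (cols)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, H_in - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, W_in - 1)
    wy = (ys - y0)[:, None]                         # vertical weights
    wx = (xs - x0)[None, :]                         # horizontal weights
    top = P[y0][:, x0] * (1 - wx) + P[y0][:, x1] * wx
    bot = P[y1][:, x0] * (1 - wx) + P[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy
```

Interior output values blend the four surrounding inputs smoothly, which is the detail-preserving behavior cited above.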
After obtaining P prev with the same dimensions as P current , the entropy for each feature map is calculated using the following formulas:
$$H_{\text{current}} = -\sum_{i} P_{\text{current}}(x_i) \log P_{\text{current}}(x_i)$$

$$H_{\text{prev}} = -\sum_{j} P_{\text{prev}}(y_j) \log P_{\text{prev}}(y_j)$$
The magnitude of entropy reflects the randomness and uncertainty of the distribution. A higher entropy value for the feature map indicates that the model has a greater perception of the image information, meaning there is more uncertainty and variability in the distribution of features. This suggests that the model is capturing more detailed and diverse information from the input, which is crucial for accurately understanding and segmenting complex images.
Building upon this, Kullback–Leibler (KL) divergence is further used to quantify the information difference between the current feature map and the previous feature map. The definition of KL divergence is as follows:
$$D_{\text{KL}}(P_{\text{current}} \,\|\, P_{\text{prev}}) = \sum_{i} P_{\text{current}}(x_i) \log \frac{P_{\text{current}}(x_i)}{P_{\text{prev}}(x_i)}$$
Kullback–Leibler (KL) [26] divergence provides an intuitive way to characterize the information difference between two probability distributions. Furthermore, a scaling factor can be computed based on KL divergence, which is defined as follows:
$$\delta(D_{\text{KL}}) = \frac{1}{1 + e^{-D_{\text{KL}}}}$$
By calculating the scaling factor, the information difference between feature maps is quantified into a weight that reflects the importance of the current feature map. When the difference is large, the current feature map contains more new information; the output of the Sigmoid function approaches 1, signaling that more attention should be given to the current feature map during feature fusion. When the difference is small, the current feature map contributes less to the overall semantics; the Sigmoid output is correspondingly smaller, and attention to this feature map is reduced. Through this mechanism, the model can perform adaptive control during feature fusion, dynamically highlighting critical features and suppressing redundant information, thereby improving the overall performance of the semantic segmentation task.
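The DIM quantities above (Shannon entropy, KL divergence, and the sigmoid scaling factor) can be sketched for discrete distributions as follows; the epsilon clipping is an assumption added for numerical stability, not part of the paper's formulation:

```python
import numpy as np

def shannon_entropy(p, eps=1e-12):
    """Shannon entropy H(X) = -sum p log p of a discrete distribution."""
    p = np.clip(p, eps, 1.0)
    return -np.sum(p * np.log(p))

def kl_divergence(p, q, eps=1e-12):
    """KL divergence D_KL(p || q) between two discrete distributions."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * np.log(p / q))

def modulation_weight(p_curr, p_prev):
    """Sigmoid of the KL divergence between current and previous
    feature-map distributions; larger divergence -> weight closer to 1."""
    d = kl_divergence(p_curr, p_prev)
    return 1.0 / (1.0 + np.exp(-d))
```

Identical distributions give a divergence of 0 (the minimum weight), while increasingly different distributions push the weight toward 1, emphasizing the current feature map during fusion.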

3. Experiments and Results

3.1. Evaluation Metrics

This study employs four key evaluation metrics to rigorously assess the segmentation performance of the proposed model: mean Intersection over Union (mIoU), mean Precision (mPrecision), mean Dice Similarity Coefficient (mDICE), and mean Recall (mRecall). The mIoU serves as an indicator of the overlap and alignment between the predicted segmentation and the ground truth, capturing the degree of matching. mPrecision quantifies the accuracy of the segmentation process, specifically focusing on the proportion of correctly identified positive pixels among the total predicted positives. mDICE measures the similarity between the predicted segmentation and the ground truth, with particular sensitivity to performance in scenarios involving imbalanced datasets. Lastly, mRecall evaluates the model’s capability to detect and capture the pixels belonging to the target class comprehensively [27,28,29].
The mathematical formulations of mIoU, mPrecision, mDICE, and mRecall are outlined below:
$$mIoU = \frac{1}{N} \sum_{i=1}^{N} \frac{TP_i}{TP_i + FP_i + FN_i}$$

$$mPrecision = \frac{1}{N} \sum_{i=1}^{N} \frac{TP_i}{TP_i + FP_i}$$

$$mDICE = \frac{1}{N} \sum_{i=1}^{N} \frac{2 \times TP_i}{2 \times TP_i + FP_i + FN_i}$$

$$mRecall = \frac{1}{N} \sum_{i=1}^{N} \frac{TP_i}{TP_i + FN_i}$$
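The four metrics can be computed from per-class TP/FP/FN counts as follows; this sketch assumes integer label maps and that every class appears in the image (no zero-denominator guard):

```python
import numpy as np

def segmentation_metrics(pred, gt, num_classes):
    """Compute mIoU, mDICE, mPrecision, and mRecall by averaging
    per-class scores derived from TP/FP/FN pixel counts."""
    ious, dices, precs, recs = [], [], [], []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))   # true positives for class c
        fp = np.sum((pred == c) & (gt != c))   # false positives
        fn = np.sum((pred != c) & (gt == c))   # false negatives
        ious.append(tp / (tp + fp + fn))
        dices.append(2 * tp / (2 * tp + fp + fn))
        precs.append(tp / (tp + fp))
        recs.append(tp / (tp + fn))
    return (float(np.mean(ious)), float(np.mean(dices)),
            float(np.mean(precs)), float(np.mean(recs)))
```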

3.2. Ablation Experiments

To validate the segmentation performance of the proposed improved algorithm on the self-constructed cotyledon dataset of watermelon seedlings, a series of ablation experiments were conducted. Based on the original Mask2Former model, three variants (Mask2Former + OFR, Mask2Former + DIM, and Mask2Former + OFR + DIM) were developed by integrating the proposed improvements. These variants were designed to quantify the contribution of the two proposed enhancement strategies to the overall model performance. The experimental results are presented in Table 1.
The experimental results demonstrate that the original Mask2Former model achieves mIoU, mDice, mPrecision, and mRecall values of 92.88%, 96.25%, 94.25%, and 98.58%, respectively. By integrating the Optimal Feature Re-ranking (OFR) module, which utilizes optimal transport methods for feature adjustment, the model’s performance on these metrics improved by 3.34%, 1.81%, 2.93%, and 0.4%, respectively. When the Dynamic Information Modulation (DIM) module was added by itself, the model’s performance increased by 2.27%, 1.24%, 2.03%, and 0.23%.
When both modules were incorporated simultaneously, the model’s mIoU, mDice, mPrecision, and mRecall improved by 4.56%, 2.45%, 3.95%, and 0.63%, respectively, compared to the original Mask2Former. This combined enhancement outperforms the individual application of each module.
These findings suggest that, starting with Mask2Former as the baseline network, adding either of the two modules leads to an improvement in all performance metrics. Moreover, the concurrent integration of both the Optimal Feature Re-ranking and Dynamic Information Modulation modules results in a more significant overall performance boost. This incremental enhancement illustrates the positive and complementary effects of the two modules, providing compelling evidence for the effectiveness of the proposed Enhanced-Mask2Former model.

3.3. Comparison Experiments

To objectively evaluate the performance of the proposed method, a comparative experiment was conducted under the same experimental environment and dataset. The proposed Enhanced-Mask2Former model was compared with three classic semantic segmentation algorithms—DeepLabV3, FCNN [30,31], and Mask2Former—on the self-constructed Watermelon Seedling Cotyledon Dataset. The segmentation results of each network are visually presented in Figure 6.
From the experimental results displayed in the figure, the differences in segmentation performance of the four models on the same dataset of grafted seedling leaves are evident. For the DeepLabV3 model, the segmentation performance is suboptimal, with significant background noise, especially in the regions around the leaf. The leaf contours are unclear, with considerable mis-segmentation, and DeepLabV3 struggles with details and edge handling, particularly when dealing with complex backgrounds, leading to frequent mis-segmentation. For the FCNN model, the segmentation performance is relatively average. There are noticeable errors along the edges, and the leaf contours are somewhat incomplete. Some background areas are misclassified as a leaf, and FCNN exhibits poor resistance to interference when handling complex backgrounds, making it challenging to accurately segment the leaf region. The Mask2Former model provides a generally good segmentation result with clear leaf contours. A small amount of background is misclassified as a leaf (such as small noise at the top of some images). Compared to Enhanced-Mask2Former, Mask2Former’s segmentation is slightly inferior, especially at finer edges, where there are minor flaws. In contrast, the Enhanced-Mask2Former model achieves more accurate segmentation, with clear edge details and better preservation of leaf region integrity. There is almost no background noise interference. Compared to other models, Enhanced-Mask2Former performs better in retaining the contours of the leaf, demonstrating superior segmentation accuracy and edge handling capability.
Based on the comparison of evaluation metrics (mDICE, mIoU, mPrecision, and mRecall) for different models across each training epoch, as shown in Figure 7, similar learning curve trends can be observed for all four evaluation metrics. All models show rapid improvement during the early stages of training, stabilizing around the 75–100 epoch mark. Among them, the proposed model outperforms all others on every metric, with final values exceeding 95% and the fastest convergence rate. The Mask2Former model follows, with metrics stabilizing around 95%. The FCNN model shows moderate performance, with metric values ranging between 85% and 90%. The DeepLabv3 model performs the worst, with metric values around 70–75%. After convergence, all models demonstrate good stability, particularly the proposed model and Mask2Former, which exhibit minimal fluctuations. These results convincingly demonstrate the significant advantages of the proposed model across all evaluation metrics, highlighting its exceptional performance on this task.
As shown in Table 2, the experimental results demonstrate that Enhanced-Mask2Former outperforms all other models on four segmentation evaluation metrics. The values for mIoU, mDICE, mPrecision, and mRecall reach 97.44%, 98.70%, 98.20%, and 99.21%, respectively, representing improvements of approximately 4.9%, 2.5%, 4.2%, and 0.6% compared to the Mask2Former model. Additionally, Mask2Former outperforms DeepLabv3 and FCNN on all metrics, exhibiting superior segmentation performance. FCNN shows moderate performance with mIoU, mDICE, mPrecision, and mRecall values of 86.14%, 88.73%, 87.32%, and 88.46%, respectively, performing better than DeepLabv3. DeepLabv3 achieves the lowest scores across all metrics, with an mIoU of 76.52% and an mDICE of 69.61%, indicating limited segmentation accuracy. Overall, Enhanced-Mask2Former not only greatly improves the segmentation accuracy but also ensures a high recall rate while maintaining a low false detection rate. Its performance surpasses that of other models, making it highly suitable for applications where precision is critical.

4. Discussion

This study presents an improved segmentation model, Enhanced-Mask2Former, specifically designed for the segmentation of watermelon scion cotyledons. Compared to traditional methods (e.g., the nonlinear K-means and Sobel edge detection of Valliammal et al. [13], which showed limited robustness in complex backgrounds), our model achieves a significant leap in accuracy (a 97.44% mIoU vs. the 87.97% segmentation accuracy reported by Xia et al. [15] for occluded leaf segmentation). This improvement stems from two key innovations:
Global Feature Optimization: By integrating the Hungarian Algorithm into the feature reordering module, we minimize matching costs and ensure globally optimal pixel feature alignment. This addresses the limitations of region-growing strategies (e.g., Li et al. [14] struggled with dense overlapping leaves due to local optimization).
Dynamic Information Modulation: Introducing entropy-based adaptive feature fusion allows the model to suppress noise (common in natural environments) while enhancing critical details, outperforming traditional morphological feature extraction [13] and single-scale attention mechanisms [17].
The proposed model also surpasses other mentioned deep learning approaches. For instance, it achieves a 98.70% mDICE score, which is significantly higher than LeafMask’s 90.1% [17], owing to its ability to resolve edge ambiguity through entropy-guided feature modulation.
Our future work will extend this framework to other crops (e.g., grapes and tomatoes), leveraging lessons from EfficientNet-B4’s efficient feature fusion [18] to pursue lightweight deployment without sacrificing accuracy.

5. Conclusions

This paper presents an improved model, Enhanced-Mask2Former, specifically designed to address the task of cotyledon contour segmentation during the grafting process of watermelon scions. By refining the feature reordering module within the Mask2Former framework and introducing a dynamic information modulation module, the proposed model greatly enhances segmentation accuracy.
Specifically, the model leverages the Hungarian Matching Algorithm to minimize the matching cost in feature reordering operations, enabling the discovery of a globally optimal feature arrangement that ensures each pixel is aligned with its most relevant feature. Additionally, the entropy of feature maps is quantified via divergence, allowing the model to achieve adaptive regulation during feature fusion, dynamically emphasizing key features while suppressing redundant information.
Experimental results demonstrate that the proposed model achieves mIoU, mDICE, mPrecision, and mRecall scores of 97.44%, 98.70%, 98.20%, and 99.21%, respectively. Compared to existing models, such as Mask2Former, FCNN, and DeepLabV3, the Enhanced-Mask2Former exhibits superior performance in handling edge details, effectively preventing mis-segmentation and greatly improving the segmentation precision required for fully automated grafting vision tasks.
Our future research will focus on the lightweight optimization of the model to reduce its parameter count and computational complexity. Additionally, visual algorithms tailored to various popular fruits and vegetables will be developed, further advancing the capabilities of fully automated grafting machines and promoting their widespread application.

Author Contributions

Conceptualization, Y.Z.; methodology, Y.Z. and Q.Y.; software, Y.Z.; validation, Z.X. and Q.Y.; formal analysis, Y.Z.; investigation, Y.Z.; writing—original draft preparation, Y.Z. and Z.X.; writing—review and editing, Y.Z., Q.Y. and Z.X.; project administration, Q.Y.; funding acquisition, Q.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 51375460.

Data Availability Statement

The datasets generated during this study can be made available from the corresponding author upon reasonable request.

Acknowledgments

The authors would like to thank all contributors and reviewers for their insightful comments and suggestions that helped to improve this manuscript.

Conflicts of Interest

All authors declare no conflicts of interest.

References

  1. Huang, H.; Fu, D.G.; Li, W.X.; Rong, W.; Li, C.Y.; Feng, Z.Y.; Min, D.X. A Comparative Analysis of Rootstock, Seedling Age, and Anatomical and Physiological Studies for Watermelon Grafting. Preprints 2024, 2024120885. [Google Scholar] [CrossRef]
  2. Sahoo, B.; Lenka, S.; Meher, S.; Satpathy, B.; Munshi, R.; Mishra, N.; Sahoo, K. The art of grafting: Elevating vegetable production to new height. e-planet 2024, 22, 24–39. [Google Scholar]
  3. Yadav, S.K.; Singh, A. Vegetable Grafting: A New Approach to Increase Yield and Quality in Vegetables. Pharma Innov. J. 2023, 12, 407–411. [Google Scholar]
  4. Li, J. Effect of different types of rootstock grafting on watermelon fruit quality. Agric. Eng. 2024, 14, 50–54. [Google Scholar]
  5. Ilakiya, T.; Parameswari, E.; Davamani, V.; Yazhini, G.; Singh, S. Grafting Mechanism in Vegetable Crops. Res. J. Chem. Environ. Sci. 2021, 9, 1–9. [Google Scholar]
  6. Raza, M. Grafting in Vegetables: Transforming Crop Production with Cutting-Edge Techniques. Kashmir J. Sci. 2024, 3, 1–9. [Google Scholar]
  7. Liang, H.; Zhu, J.; Ge, M.; Wang, D.; Liu, K.; Zhou, M.; Sun, Y.; Zhang, Q.; Jiang, K.; Shi, X. A Comparative Analysis of the Grafting Efficiency of Watermelon with a Grafting Machine. Horticulturae 2023, 9, 600. [Google Scholar] [CrossRef]
  8. Abbasi, R.; Martinez, P.; Ahmad, R. The digitization of agricultural industry–a systematic literature review on agriculture 4.0. Smart Agric. Technol. 2022, 2, 100042. [Google Scholar] [CrossRef]
  9. Zhang, K.L.; Chu, J.; Zhang, T.Z.; Yin, Q.; Kong, Y.S.; Liu, Z. Development Status and Analysis of Automatic Grafting Technology for Vegetables. Nongye Jixie Xuebao/Transactions Chin. Soc. Agric. Mach. 2017, 48, 1–13. [Google Scholar]
  10. Yu, Q.; Zhang, J.; Xia, C. Design and Experiment of Automatic Grafting Device for Grafting Machine Based on Vision Driven. In Proceedings of the 2017 International Conference on Computer Technology, Electronics and Communication (ICCTEC), Dalian, China, 19–21 December 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1169–1172. [Google Scholar]
  11. Olivar-Jiménez, C.V.; Aguilar-Orduña, M.A.; Sira-Ramírez, H.J. Semi-automatic grafting machine prototype for tomato seedlings. In Proceedings of the 2023 20th International Conference on Electrical Engineering, Computing Science and Automatic Control (CCE), Mexico City, Mexico, 25–27 October 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–6. [Google Scholar]
  12. Yan, G.; Feng, M.; Lin, W.; Huang, Y.; Tong, R.; Cheng, Y. Review and prospect for vegetable grafting robot and relevant key technologies. Agriculture 2022, 12, 1578. [Google Scholar] [CrossRef]
  13. Valliammal, N.; Geethalakshmi, S. Plant leaf segmentation using non linear K means clustering. Int. J. Comput. Sci. Issues (IJCSI) 2012, 9, 212. [Google Scholar]
  14. Li, D.; Cao, Y.; Shi, G.; Cai, X.; Chen, Y.; Wang, S.; Yan, S. An overlapping-free leaf segmentation method for plant point clouds. IEEE Access 2019, 7, 129054–129070. [Google Scholar] [CrossRef]
  15. Xia, C.; Wang, L.; Chung, B.K.; Lee, J.M. In situ 3D segmentation of individual plant leaves using a RGB-D camera for agricultural automation. Sensors 2015, 15, 20463–20479. [Google Scholar] [CrossRef] [PubMed]
  16. Yang, K.; Zhong, W.; Li, F. Leaf segmentation and classification with a complicated background using deep learning. Agronomy 2020, 10, 1721. [Google Scholar] [CrossRef]
  17. Guo, R.; Qu, L.; Niu, D.; Li, Z.; Yue, J. LeafMask: Towards greater accuracy on leaf segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Virtual, 11–17 October 2021; pp. 1249–1258. [Google Scholar]
  18. Bhagat, S.; Kokare, M.; Haswani, V.; Hambarde, P.; Kamble, R. Eff-UNet++: A novel architecture for plant leaf segmentation and counting. Ecol. Inform. 2022, 68, 101583. [Google Scholar] [CrossRef]
  19. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762v7. [Google Scholar]
  20. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; Uszkoreit, J.; Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  21. Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1290–1299. [Google Scholar]
  22. Cheng, B.; Schwing, A.; Kirillov, A. Per-pixel classification is not all you need for semantic segmentation. Adv. Neural Inf. Process. Syst. 2021, 34, 17864–17875. [Google Scholar]
  23. Mills-Tettey, G.A.; Stentz, A.; Dias, M.B. The Dynamic Hungarian Algorithm for the Assignment Problem with Changing Costs; Technical Report CMU-RI-TR-07-27; Robotics Institute: Pittsburgh, PA, USA, 2007. [Google Scholar]
  24. Rényi, A. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, Berkeley, CA, USA, 20 June–30 July 1960; University of California Press: Berkeley, CA, USA, 1961; pp. 547–562. [Google Scholar]
  25. Lin, J. Divergence measures based on the Shannon entropy. IEEE Trans. Inf. Theory 1991, 37, 145–151. [Google Scholar] [CrossRef]
  26. Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86. [Google Scholar] [CrossRef]
  27. Garcia-Garcia, A.; Orts-Escolano, S.; Oprea, S.; Villena-Martinez, V.; Garcia-Rodriguez, J. A review on deep learning techniques applied to semantic segmentation. arXiv 2017, arXiv:1704.06857. [Google Scholar]
  28. Powers, D.M. Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation. arXiv 2020, arXiv:2010.16061. [Google Scholar]
  29. Bertels, J.; Eelbode, T.; Berman, M.; Vandermeulen, D.; Maes, F.; Bisschops, R.; Blaschko, M.B. Optimizing the dice score and jaccard index for medical image segmentation: Theory and practice. In Proceedings of the Medical Image Computing and Computer Assisted Intervention–MICCAI 2019: 22nd International Conference, Shenzhen, China, 13–17 October 2019; Proceedings, Part II 22. Springer: Berlin/Heidelberg, Germany, 2019; pp. 92–100. [Google Scholar]
  30. Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar]
  31. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
Figure 1. Watermelon seedlings grown in a simulated greenhouse.
Figure 2. Image acquisition equipment.
Figure 3. Example of the grafted seedling cotyledon dataset.
Figure 4. Network structure of Mask2Former [21].
Figure 5. Network structure of Enhanced-Mask2Former.
Figure 6. Comparison of segmentation performance across different models.
Figure 7. Segmentation performance comparison curve across training epochs.
Table 1. Comparison of segmentation performance in ablation experiments.

Model                   mIoU (%)   mDICE (%)   mPrecision (%)   mRecall (%)
Mask2Former             92.88      96.25       94.25            98.58
Mask2Former + OFR       96.22      98.06       97.18            98.98
Mask2Former + DIM       96.22      98.06       97.18            98.98
Enhanced-Mask2Former    97.44      98.70       98.20            99.21
Table 2. Comparison of segmentation performance data across different models.

Model                   mIoU (%)   mDICE (%)   mPrecision (%)   mRecall (%)
DeepLabv3               76.52      69.61       74.55            78.93
FCNN                    86.14      88.73       87.32            88.46
Mask2Former             92.88      96.25       94.25            98.58
Enhanced-Mask2Former    97.44      98.70       98.20            99.21