Article

MSFUnet: A Semantic Segmentation Network for Crop Leaf Growth Status Monitoring

College of Information Science and Technology & Artificial Intelligence, Nanjing Forestry University, Nanjing 210037, China
*
Author to whom correspondence should be addressed.
AgriEngineering 2025, 7(7), 238; https://doi.org/10.3390/agriengineering7070238
Submission received: 22 April 2025 / Revised: 3 July 2025 / Accepted: 7 July 2025 / Published: 15 July 2025

Abstract

Monitoring the growth status of crop leaves is an integral part of agricultural management and involves important tasks such as leaf shape analysis and area calculation. To achieve this goal, accurate leaf segmentation is a critical step. However, this task presents a challenge, as crop leaf images often feature substantial overlap, obstructing the precise differentiation of individual leaf edges. Moreover, existing segmentation methods fail to preserve fine edge details, a deficiency that compromises precise morphological analysis. To overcome these challenges, we introduce MSFUnet, an innovative network for semantic segmentation. MSFUnet integrates a multi-path feature fusion (MFF) mechanism and an edge-detail focus (EDF) module. The MFF module integrates multi-scale features to improve the model’s capacity for distinguishing overlapping leaf areas, while the EDF module employs extended convolution to accurately capture fine edge details. Collectively, these modules enable MSFUnet to achieve high-precision individual leaf segmentation. In addition, standard image augmentations (e.g., contrast/brightness adjustments) were applied to mitigate the impact of variable lighting conditions on leaf appearance in the input images, thereby improving model robustness. Experimental results indicate that MSFUnet attains an MIoU of 93.35%, outperforming conventional segmentation methods and highlighting its effectiveness in crop leaf growth monitoring.

1. Introduction

Crop growth status monitoring is crucial for modern agricultural management, especially under the development trend of precision agriculture. Understanding crop health, nutrient requirements, and growth dynamics has become a key factor in improving yields and optimizing resource utilization [1]. Leaves directly reflect crop growth, and analyzing leaf shape, area, and other characteristics helps farmers accurately assess growth status [2]. Accurately segmenting and recognizing crop leaves has therefore become an important issue in agricultural management [3].
Traditional manual segmentation is labor-intensive, which makes it difficult to meet the demand for efficient leaf growth monitoring in agriculture, especially in complex application scenarios [4,5]. With the rapid development of deep learning, semantic segmentation techniques have made significant progress in medicine [6,7], bionics [8], food science [9], and information technology [10]. They have also improved the accuracy and efficiency of crop leaf segmentation in agricultural research [11]. Semantic segmentation automatically learns and extracts high-dimensional features from images through deep learning models [12], without depending on manual feature selection [13]. Compared with traditional methods, it greatly reduces the time and labor cost of manual segmentation [14]. The Fully Convolutional Network (FCN) [15,16] is a foundational model in this field, replacing fully connected layers with convolutional ones to enable end-to-end pixel-level prediction [17]. This breakthrough established a robust baseline for subsequent segmentation models. Building on this foundation, U-Net [18] introduced an encoder–decoder architecture with skip connections that effectively fuses high-level semantic information with low-level spatial details [19]. This design has greatly enhanced the extraction of fine features and has been widely adopted for both medical image segmentation and crop leaf segmentation tasks. To further refine segmentation performance—particularly in accurately delineating leaf boundaries—the DeepLab model [20] extends the receptive field through atrous (dilated) convolutions and incorporates Conditional Random Fields to enhance edge detail refinement. In a complementary approach, PSPNet [21] employs a pyramid pooling module to aggregate multi-scale contextual information, improving segmentation performance across leaves of varying sizes [22]. Mask R-CNN [23] integrates a segmentation branch into the Faster R-CNN framework to perform instance-level segmentation [24]. This model delivers precise segmentation masks for overlapping instances, effectively addressing the need for a detailed delineation of individual leaf instances. Collectively, these models illustrate a clear progression from early pixel-level prediction methods to sophisticated architectures that integrate multi-scale, contextual, and instance-level information. Recent work [25] further demonstrates that disentangling parameters into latent sub-representations can enhance feature diversity while maintaining inference efficiency, providing a promising direction for balancing model capacity and computational cost. In parallel, multimodal fusion techniques have shown significant potential in addressing data scarcity [26]. For example, large language models have been leveraged to expand textual semantics for few-shot action recognition, demonstrating that cross-modal alignment can effectively compensate for limited visual samples [27,28,29]. Such strategies improve segmentation accuracy while providing a robust framework for addressing the diverse challenges inherent in crop leaf segmentation.
Building upon these foundational architectures, several U-Net variants and hybrid models have been proposed to tackle the specific challenges of crop leaf segmentation and monitoring. For example, UNet++ [30] is an improved U-Net model that enhances multi-scale feature handling through a nested skip connection structure and an optimized feature extraction mechanism; its multiple feature fusion paths and dense connectivity help it handle complex backgrounds, reduce false segmentation of background impurities, and improve segmentation robustness. A network named MFBP-UNet was developed for the segmentation and disease detection of pear tree leaves in a natural agricultural environment; it integrates a multi-scale feature extraction module and a tokenized multilayer perceptron module with dynamic sparse attention, improving the performance and generalization ability of the model [31]. AWUNet [32] incorporates an attention gating mechanism to reduce feature map size and a wavelet pooling technique to emphasize semantic content within the feature maps, and is specifically designed for crop monitoring in agricultural settings [33]. Eff-UNet++ [34] combines EfficientNet and U-Net++ to enhance plant leaf segmentation accuracy and improve leaf count estimation through efficient feature extraction and fine-grained segmentation capabilities. AC-UNet [35] is an enhanced U-Net architecture that more accurately captures and localizes stem and leaf regions by incorporating self-attention and coordinate attention mechanisms for boxwood tree segmentation.
Residual paths, as used in ResNet [36], allow input features to bypass intermediate layers and pass directly to deeper layers via skip connections. This "shortcut" in the information flow alleviates gradient vanishing in deep networks and improves feature transfer efficiency [37], allowing the network to learn richer detail information. Residual paths are particularly suitable for image segmentation tasks with complex structures, as they effectively fuse shallow information with deep features to improve the segmentation accuracy and robustness of the model [38]. Similar to ResNet, ConvNeXt V2 [39] employs residual connections in its network structure, adding inputs directly to the outputs of intermediate convolutional layers, and combines residual paths with modern convolutional operations (e.g., depthwise and pointwise convolution) and architectural optimization strategies such as LayerNorm to enhance performance on visual tasks. The residual path serves as an important structural component that helps preserve low-level information and improves the flexibility of feature representation and the generalization ability of the model [40].
Residual paths are widely used in the field of crop leaf growth status monitoring [41]. Researchers have found that the complex shapes and edge details of leaves require rich multi-scale features for accurate segmentation [42]. Segmentation models that introduce residual paths perform well in capturing leaf edges and textures and can better distinguish the leaf from the background. MU-Net [43] is an enhanced U-Net architecture designed for plant disease detection, incorporating fast and residual paths to address gradient vanishing and explosion and to enhance the network's expressive power. MDAM-DRNet [44], a multi-directional attention mechanism combined with a two-channel residual network, was developed for the detection of strawberry leaf diseases; its two-channel residual paths effectively extract and integrate information from different feature layers, while the multi-directional attention mechanism improves detection accuracy by focusing on complex lesions. Because residual paths enhance a model's feature transfer capability, the EDF-Block in MSFUnet was designed by borrowing this idea.
Multi-path feature fusion is a commonly employed technique in semantic segmentation tasks. The atrous spatial pyramid pooling (ASPP) module [45], originally designed for general semantic segmentation, offers valuable insights for crop leaf segmentation by effectively fusing multi-path features to capture multi-scale contextual information. With this multi-path feature fusion strategy, DeepLabV3+ is able to better capture the edge details and texture information of leaves at different scales and thus improve segmentation accuracy [46]. Architectures that maintain multi-path fusion of high-resolution and low-resolution features [47] progressively refine segmentation edges; they perform multi-scale feature fusion and propagation at each layer, which helps in the accurate segmentation of plant leaves and is especially suitable for leaf images with complex structures. DenseASPP [48] is a densely connected atrous spatial pyramid pooling network that utilizes multi-scale atrous convolutions to capture feature information across different scales [49]. The model performs multi-path feature fusion to adapt to the size and shape variations of plant leaves in an image [50], thus enhancing leaf edge detection and detail retention. U-Net has also been enhanced with multi-scale paths [51] to improve segmentation through feature fusion at different resolutions; its multi-path feature fusion mechanism performs well in leaf segmentation, especially when dealing with leaves of different sizes and with overlap, better separating the target region. Although the above network models improve segmentation accuracy, these multi-path feature fusion methods tend to combine features under a fixed structure and lack the ability to weigh the importance of different features. Therefore, we use the EDF-Block to fine-tune features and extract and exploit key information more effectively.
Although the aforementioned methods have yielded notable advances in crop leaf segmentation, they continue to encounter significant challenges, particularly in addressing two critical gaps: leaf overlap [52,53,54] and the loss of fine leaf edge details [55]. Overlapping leaves often lead to inaccurate segmentation boundaries, while the inadequate preservation of edge details compromises the precision and consistency of the segmentation results. These limitations [56] ultimately hinder their effectiveness in crop growth monitoring. In response, we propose that explicitly reinforcing boundary information through a dedicated edge detail module, combined with a learnable, multi-path fusion mechanism, will enable more precise and robust segmentation. To test this, we introduce MSFUnet, an advanced semantic segmentation model that integrates the edge-detail focus (EDF) module and the multi-path feature fusion (MFF) mechanism, and evaluate its performance on both apple and grape leaf datasets.

2. Materials and Methods

2.1. Experimental Setting

The experiments were conducted on Windows 11 with an Intel Core i9-13900K CPU, 64 GB of RAM, and an NVIDIA RTX 4090 GPU (24 GB VRAM); the software environment used Python 3.8, PyTorch 1.8.0, Torchvision 0.9.0, and CUDA 11.1 (Table 1). Training spanned 100 epochs, a configuration refined through preliminary tests to balance efficiency and model performance. The learning rate followed a dynamic schedule, starting at 0.0001 and decaying to 0.000001 via linear warmup and cosine annealing. The Adam optimizer (momentum factor = 0.9) was used to stabilize training dynamics, while a batch size of 8 optimized GPU memory utilization (Table 2).
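For concreteness, the following is a minimal PyTorch sketch of this training configuration (Adam with a momentum factor of 0.9, linear warmup followed by cosine annealing from 1e-4 down to 1e-6 over 100 epochs); the placeholder model and the five-epoch warmup length are assumptions, since the paper does not specify them.

```python
import math
import torch

# Hyperparameters from Table 2; WARMUP_EPOCHS is a hypothetical value (not stated in the paper)
EPOCHS, BATCH_SIZE = 100, 8
LR_INIT, LR_MIN = 1e-4, 1e-6
WARMUP_EPOCHS = 5

model = torch.nn.Conv2d(3, 2, kernel_size=1)  # placeholder; substitute the MSFUnet model here
optimizer = torch.optim.Adam(model.parameters(), lr=LR_INIT, betas=(0.9, 0.999))

def lr_factor(epoch: int) -> float:
    """Linear warmup, then cosine annealing from LR_INIT down to LR_MIN."""
    if epoch < WARMUP_EPOCHS:
        return (epoch + 1) / WARMUP_EPOCHS
    progress = (epoch - WARMUP_EPOCHS) / max(1, EPOCHS - WARMUP_EPOCHS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return (LR_MIN / LR_INIT) + (1.0 - LR_MIN / LR_INIT) * cosine

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_factor)

for epoch in range(EPOCHS):
    # ... run one training epoch over the DataLoader (batch size 8), then step the schedule
    optimizer.step()   # stand-in for the per-batch update loop
    scheduler.step()
```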
To ensure statistical reliability, all experiments were conducted with three independent runs using distinct random seeds (seeds = 2024, 2025, 2026). Performance metrics (MIoU, MPA, MPrecision, Dice) are reported as μ ± σ, where μ represents the mean and σ the standard deviation across runs. Statistical significance of performance differences was assessed using paired t-tests with Bonferroni correction for multiple comparisons (α = 0.01).
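The significance protocol described above could be reproduced with SciPy along the following lines; the per-run scores are placeholder values, and treating α = 0.01 as the family-wise level (divided across comparisons) is an assumption, since the paper does not state the convention.

```python
from itertools import combinations
from scipy import stats

# Placeholder per-seed MIoU values; substitute the actual three-run results per model
scores = {
    "MSFUnet": [0.933, 0.935, 0.932],
    "U2Net":   [0.911, 0.910, 0.909],
    "U-Net":   [0.846, 0.848, 0.844],
}
pairs = list(combinations(scores, 2))
alpha_per_test = 0.01 / len(pairs)  # Bonferroni: 0.01 treated as the family-wise level (assumption)

for a, b in pairs:
    t_stat, p_val = stats.ttest_rel(scores[a], scores[b])  # paired t-test over the three seeds
    verdict = "significant" if p_val < alpha_per_test else "not significant"
    print(f"{a} vs {b}: t = {t_stat:.2f}, p = {p_val:.4f} ({verdict})")
```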

2.2. Data Acquisition and Expansion

To achieve accurate crop leaf segmentation, selecting an appropriate image acquisition method and ensuring an adequate number of images are essential for optimizing the segmentation model's performance. Most of our experiments use the ATTLDD public dataset [57], which consists of 2970 apple leaf images (available at https://github.com/KingslayerCris/MSFUnet, accessed on 15 June 2025). The apple leaf images were collected from the Apple Experimental Demonstration Station of Northwest Agriculture and Forestry University (NWAFSU) located in Baishui County, Weinan City; Ganyang County, Baoji City; Luochuan County, Yan'an City; Ganyang County, Xianyang City; and Qingcheng County, Qingyang City, Gansu Province. An ABM-500GE/BB-500GE color digital camera was used to capture apple leaf images across different growth periods and growth segments in the morning [58], afternoon, and evening, under cloudy and sunny conditions, and before and after rain, with an image resolution of 2448 × 3264.
Due to the limited number of leaf images collected, a small sample dataset is prone to overfitting. Expanding the dataset enhances the variety of training samples, which helps mitigate overfitting, strengthens the model's generalization capability, and maintains a balanced distribution of images across categories. Commonly used augmentation methods include flipping, contrast enhancement, rotation, and brightness enhancement. In the following experiments, each apple leaf image was augmented into four images, yielding an augmented set of 11,880 preprocessed images (Figure 1).
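As an illustration, the sketch below generates the four augmented variants per image described above using torchvision; the rotation angle and the contrast/brightness factors are assumptions, since the exact parameters are not reported.

```python
from PIL import Image
import torchvision.transforms.functional as TF

def augment(image: Image.Image, mask: Image.Image):
    """Return four augmented (image, mask) pairs from one original pair."""
    pairs = []
    # 1) Horizontal flip (applied to image and mask so labels stay aligned)
    pairs.append((TF.hflip(image), TF.hflip(mask)))
    # 2) Rotation (angle of 15 degrees is an illustrative choice)
    pairs.append((TF.rotate(image, 15), TF.rotate(mask, 15)))
    # 3) Contrast enhancement (photometric only, so the mask is unchanged)
    pairs.append((TF.adjust_contrast(image, 1.5), mask))
    # 4) Brightness enhancement
    pairs.append((TF.adjust_brightness(image, 1.3), mask))
    return pairs

# Tiny usage example with blank placeholder images
img = Image.new("RGB", (512, 512))
msk = Image.new("L", (512, 512))
print(len(augment(img, msk)))  # 4 augmented pairs per original image
```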
To verify the effectiveness of the data enhancement strategy, independent experiments were conducted on both the original and the augmented datasets. MIoU values for MSFUnet trained on the original ATTLDD apple leaf dataset (2970 images) versus its augmented version (11,880 images; flips, rotations, contrast/brightness adjustments) were compared (Figure 2). Training on the augmented data increased MIoU from 86.71% to 93.35%, demonstrating that sample diversification substantially improves segmentation accuracy.
In addition, to further evaluate MSFUnet's generalizability beyond apple leaves, we also tested on the Grape400 dataset (available at https://github.com/KingslayerCris/MSFUnet, accessed on 15 June 2025), which comprises 518 high-resolution grape leaf images (Figure 3). These images were taken from different vineyards, with varying leaf orientations and background textures, making Grape400 a challenging benchmark for cross-crop segmentation. All grape leaf images were resized to 512 × 512 and annotated using the same two-class (leaf and background) scheme.

2.3. MSFUnet

The frequent occurrence of overlapping leaves in crop images often leads to inaccurate segmentation boundaries, while the inadequate preservation of edge details compromises the precision and consistency of the segmentation results (Figure 4). Accurately segmenting the boundary of each individual leaf is crucial because it preserves the leaf's structural integrity, enabling precise measurements of shape, area, and edge health that are essential for the reliable monitoring of growth status and the early detection of physiological changes. The proposed MSFUnet model addresses these challenges of leaf overlap and edge-detail segmentation in crop leaf growth monitoring. The network follows an encoder–decoder structure, where the encoder gradually decreases the spatial resolution of feature maps while extracting hierarchical features, and the decoder reconstructs spatial details by applying upsampling and integrating features (Figure 5A). The key innovations of MSFUnet include the multi-path feature fusion (MFF) module, which enhances segmentation in overlapping regions by fusing multi-scale features, and the edge-detail focus (EDF) module, which improves edge localization and feature robustness by capturing both local details and global contexts. These modules work together to improve segmentation performance, particularly in scenarios with overlapping leaves and fine edge details. The MSFUnet architecture consists of four encoder–decoder stages. Encoder layers progressively halve the spatial resolution while doubling channels; decoder layers mirror this with upsampling and feature fusion (Table 3).

2.4. Multi-Path Feature Fusion

The growth of crop leaves is often accompanied by overlapping, which presents a significant challenge in segmentation. To tackle this problem, we propose the multi-path feature fusion (MFF) module, which is specifically designed to extract and integrate multi-scale features, thereby improving the segmentation of overlapping regions. Leaf overlap complicates the segmentation process by causing ambiguities in boundary delineation and incomplete separation of individual leaves, making it one of the most difficult aspects of crop leaf segmentation. The MFF module improves the model’s capacity to interpret intricate leaf structures by effectively capturing contextual information across multiple scales. Specifically, the MFF module addresses the challenge of overlapping leaves by explicitly disentangling and recombining feature information at three complementary scales. After upsampling the deeper feature map (Feat 5) to match the resolution of the shallower map (Feat 4) and concatenating them, this fused tensor is passed through three parallel branches: one branch applies a 3 × 3 convolution to preserve fine spatial details, a second branch applies a separate 3 × 3 convolution to capture mid-level semantic cues such as leaf shape and orientation, and a third branch upsamples the original deep features by a factor of four before convolving, thereby injecting broad contextual information about the relative positions of the leaves. During training, learned weights adaptively balance these branches so that, in regions of occlusion, the model can lean on global contexts to decide which leaf lies in front, use mid-level cues to enforce coherent leaf contours, and still preserve local edge information for precise boundary alignment. Its structure (Figure 5B) implements a multi-path fusion strategy with the following specific operations:
First, an upsampling operation with a scale of 2 is performed on the underlying output feature map (e.g., Feat 5 in Figure 5B) to make its spatial resolution consistent with that of the previous-layer encoder's feature map (e.g., Feat 4 in Figure 5B). The upsampled feature map is concatenated with Feat 4 to form a preliminary fused feature map:

$$ F_{\mathrm{concat}} = \mathrm{Concat}\big(\mathrm{Upsample}_{2}(F_{\mathrm{low}}),\ F_{\mathrm{high}}\big) \quad (1) $$

where $\mathrm{Upsample}_{2}(\cdot)$ is the interpolation upsampling function with scale 2, $\mathrm{Concat}(\cdot)$ denotes concatenation along the channel dimension, and $F_{\mathrm{low}}$ and $F_{\mathrm{high}}$ denote the low-level and high-level feature maps, respectively. Features are further extracted from the concatenated map by 3 × 3 convolutional blocks, and a multi-path structure processes the features at various scales, thereby improving the exchange of high-level semantic information with low-level spatial information. The multi-path processing is as follows:

$$ F_{\mathrm{path}_1} = \sigma\big(W_1 * F_{\mathrm{concat}}\big) \quad (2) $$

$$ F_{\mathrm{path}_2} = \sigma\big(W_2 * F_{\mathrm{concat}}\big) \quad (3) $$

$$ F_{\mathrm{path}_3} = \sigma\big(W_3 * \mathrm{Upsample}_{4}(F_{\mathrm{low}})\big) \quad (4) $$

where $W_1$, $W_2$, and $W_3$ are the convolution kernels of the multi-path branches, $\mathrm{Upsample}_{4}(\cdot)$ denotes the upsampling operation with a scale of 4, and $\sigma(\cdot)$ is the ReLU activation function. The feature extraction results of the paths are finally fused by weighted summation to generate the multi-scale feature map $F_{\mathrm{fused}}$:

$$ F_{\mathrm{fused}} = \sum_{i=1}^{3} \alpha_i\, F_{\mathrm{path}_i} \quad (5) $$

where $\alpha_i$ denotes the weight of each path. The fused feature map $F_{\mathrm{fused}}$ is further refined by a 3 × 3 convolution and gradually passed to the next decoder layer, where it is concatenated with higher-resolution feature maps (e.g., Feat 3 in Figure 5B), forming a recursive multi-path fusion mechanism:

$$ F_{\mathrm{decoder}} = \mathrm{Concat}\big(F_{\mathrm{fused}},\ F_{\mathrm{high\text{-}next}}\big) \quad (6) $$

where $F_{\mathrm{high\text{-}next}}$ denotes the next layer of high-resolution feature maps. With this design, the MFF module establishes a closer connection between features of different resolutions, which substantially enhances the model's capacity to segment overlapping leaf regions.
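To make the fusion procedure concrete, the following is a minimal PyTorch sketch of Equations (1)–(6); channel sizes follow Figure 5B, while the interpolation used to bring the three paths to a common resolution before the weighted sum is an assumption, since the text does not specify how the spatial sizes are reconciled.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFF(nn.Module):
    """Sketch of one multi-path feature fusion step (Equations (1)-(6))."""

    def __init__(self, low_ch: int, high_ch: int, out_ch: int):
        super().__init__()
        cat_ch = low_ch + high_ch
        self.path1 = nn.Conv2d(cat_ch, out_ch, 3, padding=1)   # fine spatial details
        self.path2 = nn.Conv2d(cat_ch, out_ch, 3, padding=1)   # mid-level semantic cues
        self.path3 = nn.Conv2d(low_ch, out_ch, 3, padding=1)   # broad context from deep features
        self.alpha = nn.Parameter(torch.ones(3))                # learnable path weights (Eq. (5))
        self.refine = nn.Conv2d(out_ch, out_ch, 3, padding=1)   # 3x3 refinement after fusion

    def forward(self, f_low: torch.Tensor, f_high: torch.Tensor) -> torch.Tensor:
        # Eq. (1): upsample the deep map by 2 and concatenate with the shallower map
        f_cat = torch.cat([F.interpolate(f_low, scale_factor=2, mode="bilinear",
                                         align_corners=False), f_high], dim=1)
        p1 = F.relu(self.path1(f_cat))                           # Eq. (2)
        p2 = F.relu(self.path2(f_cat))                           # Eq. (3)
        p3 = F.relu(self.path3(F.interpolate(f_low, scale_factor=4, mode="bilinear",
                                             align_corners=False)))  # Eq. (4)
        # Assumption: resize every path to the resolution of f_cat before the weighted sum
        target = f_cat.shape[-2:]
        paths = [F.interpolate(p, size=target, mode="bilinear", align_corners=False)
                 for p in (p1, p2, p3)]
        fused = sum(a * p for a, p in zip(self.alpha, paths))    # Eq. (5)
        return self.refine(fused)

# Shape check with Feat 5 (deep) and Feat 4 (shallower) as in Figure 5B
f5, f4 = torch.randn(1, 1024, 32, 32), torch.randn(1, 512, 64, 64)
print(MFF(1024, 512, 512)(f5, f4).shape)  # torch.Size([1, 512, 64, 64])
```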

2.5. Edge-Detail Focus

Precise edge localization is essential but often lost when contours blur. The EDF module overcomes this by processing features along two complementary paths. In EDF-1, a 2× upsampling restores high spatial resolution, then a 5 × 5 depthwise convolution enlarges the local receptive field to capture fine edge patterns. Pointwise convolutions and global response normalization sharpen channel activations around edge regions, while residual connections preserve the original feature context—together, this path amplifies subtle contour cues without overwhelming them with noise. In EDF-2, a 4× upsampling followed by pointwise convolution and normalization aggregates broader contextual information, allowing the network to distinguish true edges from background artifacts. By adding EDF-1’s detail-rich output to the decoder and concatenating EDF-2’s context-aware features downstream, MSFUnet maintains both crisp boundaries and stable feature representations. The EDF module processes input features through dual-path operations (Figure 5C) with distinct receptive fields.
In the EDF-1 path, the feature map undergoes an upsampling operation with a scale of 2, increasing its spatial resolution to match the high-resolution features in the decoder. After upsampling, the features are refined by the following steps:

$$ F_{\mathrm{edf}1} = \mathrm{GRN}\Big(\sigma\big(W_4 * \sigma\big(W_3 * \mathrm{Norm}\big(W_2 * \mathrm{DepthConv}(W_1 * F_{\mathrm{in}})\big)\big)\big)\Big) \quad (7) $$

where the convolution kernels operate on distinct feature channels as annotated in Figure 5C:
  • $W_1$: a 1 × 1 pointwise convolution that expands the input channels by a factor of 4;
  • $\mathrm{DepthConv}(\cdot)$: a 5 × 5 depthwise convolution that preserves the channel dimension;
  • $W_2$: a 1 × 1 pointwise convolution that compresses the 4×-expanded channels;
  • $W_3$ / $W_4$: sequential 1 × 1 convolutions that preserve the channel dimension.
BatchNorm ($\mathrm{Norm}$) stabilizes channel distributions, while LeakyReLU ($\sigma$) introduces nonlinearity. Global response normalization (GRN) recalibrates channel-wise responses before residual fusion. The DropPath operation stochastically masks residual connections during training to prevent overfitting.
The EDF-2 path further expands the receptive field of the feature map through an upsampling operation of scale 4 to capture a larger range of contextual information. Subsequently, the upsampled feature map is processed by a 1 × 1 convolution kernel and batch normalization:

$$ F_{\mathrm{edf}2} = \mathrm{Norm}\Big(\sigma\big(W_6 * \mathrm{Upsample}_{4}(F_{\mathrm{in}})\big)\Big) \quad (8) $$

where $\mathrm{Upsample}_{4}(\cdot)$ denotes the upsampling function with a scale of 4, $W_6$ is a 1 × 1 convolutional kernel, and $\sigma(\cdot)$ and $\mathrm{Norm}(\cdot)$ function as in EDF-1. The output features from the EDF-2 path are integrated with the feature maps in the decoder, further improving the segmentation model's sensitivity to target edges.
After processing by the EDF module, the input feature map (e.g., Feat 5) is partitioned into two parts, $F_{\mathrm{edf}1}$ and $F_{\mathrm{edf}2}$, where $F_{\mathrm{edf}1}$ mainly enhances local details and $F_{\mathrm{edf}2}$ captures global context information. These two feature streams are respectively added to, or concatenated with, the feature maps in the decoder and passed to the subsequent decoder layers in turn, finally generating the refined segmentation feature maps. Through this design, the EDF module effectively enhances the model's ability to detect edges and details while ensuring the stability and robustness of feature extraction, thereby significantly improving the overall performance of the segmentation model.
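A compact PyTorch sketch of the two EDF paths (Equations (7) and (8)) follows; channel sizes take the values given in Section 2.6, GRN follows the ConvNeXt V2 definition, and the residual/DropPath handling is omitted, so the exact placement of normalization and residual fusion is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GRN(nn.Module):
    """Global response normalization (ConvNeXt V2 style, NCHW layout)."""
    def __init__(self, channels: int):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, x):
        gx = torch.norm(x, p=2, dim=(2, 3), keepdim=True)       # per-channel global response
        nx = gx / (gx.mean(dim=1, keepdim=True) + 1e-6)          # divisive normalization
        return self.gamma * (x * nx) + self.beta + x

class EDF(nn.Module):
    """Sketch of the dual-path edge-detail focus block (Equations (7)-(8))."""
    def __init__(self, in_ch: int = 1024, out_ch: int = 512, expand: int = 4):
        super().__init__()
        mid = in_ch * expand
        # EDF-1: local edge details
        self.w1 = nn.Conv2d(in_ch, mid, 1)                          # pointwise expansion
        self.depth = nn.Conv2d(mid, mid, 5, padding=2, groups=mid)  # 5x5 depthwise convolution
        self.w2 = nn.Conv2d(mid, out_ch, 1)                         # pointwise compression
        self.norm1 = nn.BatchNorm2d(out_ch)
        self.w3 = nn.Conv2d(out_ch, out_ch, 1)
        self.w4 = nn.Conv2d(out_ch, out_ch, 1)
        self.grn = GRN(out_ch)
        # EDF-2: broader context
        self.w6 = nn.Conv2d(in_ch, out_ch, 1)
        self.norm2 = nn.BatchNorm2d(out_ch)

    def forward(self, f_in):
        x = F.interpolate(f_in, scale_factor=2, mode="bilinear", align_corners=False)
        x = self.norm1(self.w2(self.depth(self.w1(x))))             # W2 * DepthConv(W1 * F_in), Norm
        edf1 = self.grn(F.leaky_relu(self.w4(F.leaky_relu(self.w3(x)))))  # Eq. (7)
        y = F.interpolate(f_in, scale_factor=4, mode="bilinear", align_corners=False)
        edf2 = self.norm2(F.leaky_relu(self.w6(y)))                  # Eq. (8)
        return edf1, edf2

# Shape check with Feat 5 (32 x 32 x 1024) as input
e1, e2 = EDF()(torch.randn(1, 1024, 32, 32))
print(e1.shape, e2.shape)  # (1, 512, 64, 64) and (1, 512, 128, 128)
```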

2.6. Network Architecture

The proposed MSFUnet network structure is composed of two parts: a contraction network (left side) and an expansion network (right side) for extracting shallow and deep features, respectively. The contraction network gradually abstracts and compresses the input information by extracting feature maps layer by layer, while the expansion network gradually recovers spatial details through upsampling and feature fusion. The two parts fuse shallow and deep features through four skip connections, thereby integrating more semantic information and spatial details.
The contraction backbone was initialized with ImageNet’s pre-trained weights. During training, deep semantic layers were fine-tuned to adapt to leaf morphology, while earlier layers remained frozen to preserve generic feature representations. The entire expansion network with its specialized modules was randomly initialized.
In the contraction network, each encoder module comprises two 3 × 3 convolutional layers followed by a 2 × 2 max-pooling operation with a stride of 2. This design progressively abstracts and compresses the input image by extracting hierarchical feature maps at successive layers. In the decoder, a multi-path feature fusion process is implemented in conjunction with an integrated edge-detail focus (EDF) module. In the initial fusion stage, the feature map from the bottom layer (Feat 5) is upsampled by a factor of 2 to align with the resolution of the corresponding encoder feature map (Feat 4). The upsampled feature map is concatenated with Feat 4 and subsequently passed through a 3 × 3 convolutional block (with padding set to 1) followed by ReLU activation. An additional 3 × 3 convolutional layer with ReLU activation further refines the fused features, efficiently combining high-resolution and low-resolution information and gradually recovering spatial details for subsequent layers.
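A minimal sketch of one such contraction stage is given below; the channel widths are illustrative, and the routing of pre-pooling activations to the skip connections is omitted for brevity.

```python
import torch
import torch.nn as nn

def encoder_block(in_ch: int, out_ch: int) -> nn.Sequential:
    """One contraction stage: two 3x3 conv + ReLU layers, then 2x2 max pooling (stride 2)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.MaxPool2d(2, stride=2),
    )

# A 512x512 input is halved at each of four stages (512 -> 256 -> 128 -> 64 -> 32)
x = torch.randn(1, 3, 512, 512)
for stage in [encoder_block(3, 64), encoder_block(64, 128),
              encoder_block(128, 256), encoder_block(256, 512)]:
    x = stage(x)
print(x.shape)  # torch.Size([1, 512, 32, 32])
```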
Parallel to the multi-scale feature fusion pathway, the bottom encoder features (Feat 5, 32 × 32 × 1024) are processed by the edge-detail focus (EDF) module (Figure 5C). This dual-stream design enables simultaneous refinement of contextual and edge information.
The EDF-1 module begins by upsampling the feature maps by a factor of 2, thereby doubling their spatial dimensions. It then applies a 1 × 1 convolution that expands the channels fourfold (1024 → 4096), followed by batch normalization to mitigate the risk of gradient vanishing. A 5 × 5 convolutional layer paired with layer normalization is used to enhance the stability of feature representations and accelerate convergence. The expanded channels are then compressed back to 512 dimensions via a 1 × 1 convolution, and nonlinearity is introduced using the Leaky ReLU (LReLU) activation function, which enables the network to capture intricate feature relationships. Global response normalization (GRN) is then applied to balance activation values across channels and further harmonize the channel responses. Finally, a concluding 1 × 1 convolution adjusts the channels to 512, and a residual connection with DropPath-based stochastic masking is employed to suppress overfitting, thereby enhancing the stability and generalization of the module.
The EDF-2 module employs upsampling by a factor of 4, followed by a 1 × 1 convolution, batch normalization, and LReLU activation to further expand the feature sensing field. After processing through the EDF module, Feat 5 is partitioned into two components, designated as Feat 5-1 and Feat 5-2. Summing Feat 5-1 with the corresponding decoder feature maps enhances the network’s sensitivity to boundaries and fine details. These enhanced feature maps are then concatenated with higher-level decoder feature maps (e.g., Feat 3 in Figure 5B) in subsequent fusion stages, where further convolution and ReLU activation operations refine the decoder outputs. This recursive process continues until the final feature map (Feat 0 in Figure 5B) is generated, which is then processed by a concluding 1 × 1 convolutional layer and SoftMax activation to produce the segmentation probability map and final segmentation result. By coordinating the EDF and MFF modules at each decoder stage, the overall architecture improves segmentation accuracy in overlapping leaf regions and preserves very fine leaf edge details.

3. Results

3.1. Dataset Processing

We applied data augmentation to 2970 original apple leaf images from the ATTLDD public dataset using techniques such as flipping, contrast enhancement, rotation, and brightness adjustment. These operations increased the diversity of training samples and reduced the likelihood of model overfitting, resulting in a final dataset of 11,880 augmented images with a balanced distribution of samples across different leaf shapes, lighting conditions, and growth stages.
We adopted a hold-out evaluation protocol, dividing the dataset into training, validation, and test sets in an 8:1:1 ratio, resulting in 9504 images for training, 1188 for validation, and 1188 for testing. To maintain consistency in evaluation, images in the validation set were selected from the original dataset. Additionally, the image resolution was adjusted to 512 × 512 pixels prior to labeling, in order to save computational resources and reduce manual processing time. Subsequently, labels were created for the two semantic categories in the dataset: foreground (target leaves) and background (soil, weeds, other leaves). All images were manually annotated by two independent expert annotators using the LabelMe tool, following a standardized pixel-level protocol for delineating leaf boundaries. To assess annotation consistency, 300 randomly selected images were labeled by both annotators: we computed Cohen's kappa coefficient on the binary pixel labels (κ = 0.93) and the average pixel-level IoU (0.91 ± 0.02) between their masks. Any image with a disagreement greater than 5% IoU was jointly reviewed and adjudicated by a third expert before inclusion. These measures ensured that our ground-truth labels were both accurate and highly consistent, providing a reliable basis for training, validation, and testing.
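The annotation-consistency check described above can be expressed compactly as follows; this is a generic sketch for binary masks, not the authors' exact tooling.

```python
import numpy as np

def agreement(mask_a: np.ndarray, mask_b: np.ndarray):
    """Cohen's kappa and IoU between two binary (0/1) annotation masks of equal shape."""
    a, b = mask_a.astype(bool).ravel(), mask_b.astype(bool).ravel()
    po = np.mean(a == b)                                            # observed pixel agreement
    pe = a.mean() * b.mean() + (1 - a.mean()) * (1 - b.mean())      # chance agreement
    kappa = (po - pe) / (1 - pe + 1e-12)
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    iou = inter / (union + 1e-12)
    return kappa, iou

# Toy usage with two random masks (real use would load the two annotators' label masks)
rng = np.random.default_rng(0)
m1, m2 = rng.integers(0, 2, (512, 512)), rng.integers(0, 2, (512, 512))
print(agreement(m1, m2))
```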

3.2. Evaluation Indicators

To quantitatively assess the performance of the proposed method and the comparative methods, this experiment focused on the mean intersection over union (MIoU), the mean pixel accuracy (MPA), the mean precision (MPrecision), and the Dice similarity coefficient (Dice) for a thorough assessment of model performance. Among these metrics, MIoU is a crucial measure of image segmentation effectiveness, representing the average ratio of intersection to union between the predicted results and the ground-truth labels. MPA represents the average pixel accuracy across all classes, where the accuracy for each class is defined as the ratio of correctly classified pixels of that class to the total number of pixels in the class. MPrecision indicates how many of all the positive-class (i.e., foreground) pixels predicted by the model are true positive pixels, which helps reduce the number of false positives (i.e., background misclassified as foreground). Dice, which quantifies the extent of overlap between the predicted region and the true region, is a metric designed specifically for evaluating segmentation quality. These metrics were calculated as shown in Equations (9)–(12):

$$ \mathrm{MIoU} = \frac{1}{n+1} \sum_{i=0}^{n} \frac{p_{ii}}{\sum_{j=0}^{n} p_{ij} + \sum_{j=0}^{n} p_{ji} - p_{ii}} \quad (9) $$

$$ \mathrm{MPA} = \frac{1}{n+1} \sum_{i=0}^{n} \frac{p_{ii}}{\sum_{j=0}^{n} p_{ij}} \quad (10) $$

$$ \mathrm{MPrecision} = \frac{1}{n} \sum_{i=1}^{n} \frac{TP_i}{TP_i + FP_i} \quad (11) $$

$$ \mathrm{Dice} = \frac{2\,TP}{2\,TP + FP + FN} \quad (12) $$

where $n + 1$ denotes the number of classes, $i$ denotes the true class, $j$ denotes the predicted class, $TP$ represents the number of pixels correctly classified as foreground by the model, $FP$ denotes the number of background pixels mistakenly predicted as foreground, and $FN$ refers to the number of foreground pixels incorrectly identified as background. $p_{ii}$ denotes pixels of class $i$ correctly recognized as class $i$ (true positives for class $i$); $p_{ij}$ ($j \neq i$) denotes pixels of class $i$ recognized as class $j$ (false negatives for class $i$); and $p_{ji}$ denotes pixels of class $j$ recognized as class $i$ (false positives for class $i$).
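Equivalently, all four metrics can be computed from a pixel-level confusion matrix, as in the following sketch; the foreground class index, the toy counts, and the averaging of precision over every class are illustrative assumptions.

```python
import numpy as np

def segmentation_metrics(conf: np.ndarray):
    """MIoU, MPA, MPrecision and Dice from an (n+1) x (n+1) confusion matrix,
    where conf[i, j] counts pixels of true class i predicted as class j.
    Dice is reported for the foreground class (index 1), as in Equation (12)."""
    tp = np.diag(conf).astype(float)
    fn = conf.sum(axis=1) - tp            # true class i predicted as some other class
    fp = conf.sum(axis=0) - tp            # other classes predicted as class i
    miou = np.mean(tp / (tp + fp + fn + 1e-12))
    mpa = np.mean(tp / (conf.sum(axis=1) + 1e-12))
    mprecision = np.mean(tp / (tp + fp + 1e-12))   # averaged over all classes (simplification)
    dice_fg = 2 * tp[1] / (2 * tp[1] + fp[1] + fn[1] + 1e-12)
    return miou, mpa, mprecision, dice_fg

# Toy two-class example (rows = true class, columns = predicted class)
toy = np.array([[90, 10],
                [5, 95]])
print(segmentation_metrics(toy))
```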

3.3. Comparative Analysis with Other Methods

In this study, MSFUnet was evaluated against several widely used semantic segmentation models. Classical U-Net variants (U-Net and U2Net) were included as baselines due to their proven efficacy in leaf segmentation tasks. Transformer-based models (TransUNet and SwinUNet) were included as representatives of prominent contemporary architectures, validating MSFUnet’s superiority over both conventional and transformer-driven approaches. Additionally, two recent agricultural segmentation networks (CUDU-Net and CS-Net) were evaluated to benchmark against specialized designs. Over 100 training epochs and for seven segmentation models, MSFUnet’s loss decreased most rapidly and settled at the lowest value, indicating faster convergence and better fit (Figure 6). Moreover, the training and validation losses for MSFUnet remained closely aligned throughout, with no significant gap emerging, which confirmed that the model converges stably without overfitting. To assess variability, each model was trained five times with different random seeds under identical settings (Table 4). MSFUnet achieved higher MIoU, MPA, MPrecision, and Dice scores than all the other models. These findings indicated that MSFUnet’s combination of MFF and EDF modules was particularly well suited to low-data settings, as it fused multi-scale and edge-focused features without relying on the large training volumes required by transformer-only models. To assess model complexity alongside segmentation accuracy, we summarized the number of parameters, FLOPs, and average inference time per 512 × 512 image for each compared model (Table 5). MSFUnet maintained a moderate parameter count and FLOPs while delivering superior accuracy. The average inference time per image on the RTX 4090 was 37 ms. This level of throughput suggested that, on similarly capable hardware, MSFUnet could operate near real-time in typical field scenarios.
Concretely, MSFUnet achieved the highest MIoU of 93.35%, exceeding the best comparators U2Net (91.04%) and CUDU-Net (90.78%) (Table 4). This 2.31–2.57% improvement in overlap accuracy primarily stemmed from the EDF-Block’s ability to preserve fine leaf margins through edge-differential convolution, and the multi-path feature fusion mechanism that hierarchically integrates global context with local details.
Furthermore, MSFUnet demonstrated superior performance across all metrics:
  • MPA: 95.35% (↑1.18% vs. U2Net’s 94.17%);
  • MPrecision: 94.93% (↑1.42% vs. CUDU-Net’s 93.51%);
  • Dice: 0.937 (↑0.016 vs. CUDU-Net’s 0.921).
These results confirmed MSFUnet’s exceptional capability in pixel-level classification (MPA), positive class differentiation (MPrecision), and boundary alignment (Dice), validating its design for complex leaf segmentation tasks.
In contrast, models like SwinUNet and TransUNet performed relatively poorly on these metrics (Table 4). For example, TransUNet achieved an MIoU of 80.23% and a Dice of 0.818, values that were substantially lower than those of MSFUnet. This discrepancy may be attributed to TransUNet’s architectural design, which did not sufficiently incorporate multi-path feature fusion, and its larger parameter count, which imposed higher computational demands for leaf segmentation.
Over 100 epochs, the MIoU convergence curves visualize the training dynamics of these models (Figure 7). MSFUnet exhibited not only superior final accuracy but also accelerated convergence compared to the other architectures. Two key observations emerged:
Rapid Early Convergence: MSFUnet achieved >80% MIoU by epoch 30, whereas U-Net and CS-Net required 50 epochs to reach comparable performance. This indicated more efficient feature learning in early training stages.
Training Stability: MSFUnet maintained a smooth, monotonically increasing curve with minimal oscillations, contrasting sharply with the volatile trajectories of CUDU-Net. This stability suggested better optimization behavior and reduced susceptibility to local minima.
This convergence behavior directly validated the efficacy of our architectural innovations: The MFF module’s multi-scale feature fusion accelerated context learning in early stages, while the EDF module’s edge refinement contributed to late-stage stability by mitigating boundary ambiguity. The consistent performance gap after epoch 35 correlated with the emergence of complex leaf boundary features that conventional architectures struggled to capture.

3.4. Ablation Experiment

To systematically evaluate the contributions of individual components, we conducted a controlled ablation study measuring MIoU improvement across four configurations under identical training protocols (Figure 8):
  • (1) Baseline U-Net;
  • (2) U-Net + MFF (multi-path feature fusion only);
  • (3) U-Net + EDF (edge-detail focus only);
  • (4) Full MSFUnet (MFF + EDF).
Three critical findings emerged:
  • MFF Dominated Global Accuracy: The MFF module contributed a 5.65% MIoU gain (90.26% vs. baseline 84.61%), which is substantially higher than EDF’s 4.17% gain (88.78%). This demonstrated MFF’s superior capacity for resolving semantic ambiguities in overlapping regions through multi-scale context integration.
  • EDF Specialized in Boundary Refinement: While EDF showed lower overall MIoU (88.78%), a detailed analysis of segmentation boundaries revealed it significantly reduced edge segmentation errors compared to MFF. This specialization stemmed from its hierarchical processing: depthwise convolution captured high-frequency details, while GRN normalization preserved structural coherence.
  • Synergistic Interaction: The combined MSFUnet achieved 93.35% MIoU, exceeding the best single-module configuration (MFF) by 3.09% and approaching the theoretical additive gain of the two modules (5.65% + 4.17% = 9.82%) with a combined gain of 8.74% over the baseline. This enhancement originated from complementary function allocation:
    (1) MFF’s global context-guided EDF’s local refinement.
    (2) EDF’s edge-aware features refined MFF’s fusion boundaries.
The performance hierarchy (MSFUnet > MFF-only > EDF-only > Baseline) confirmed that both modules are essential for optimal segmentation. Specifically, the 8.74% net improvement over the baseline demonstrated our architecture’s efficacy in addressing the core challenges of leaf overlap and edge ambiguity identified in the Introduction.

3.5. Visualization Experiment and Analysis

To rigorously assess the operational efficacy of MSFUnet in real-world agricultural scenarios, we performed multi-dimensional visual analysis on the ATTLDD dataset under three challenging conditions: irregular leaf margins, overlapping leaves, and folded leaf structures. Comparative evaluations included six leading contemporary segmentation architectures alongside our proposed model. The visualization outcome revealed critical performance disparities across different models (Figure 9).
Under irregular leaf boundary conditions (Figure 9A), conventional encoder–decoder architectures (e.g., U-Net, U2Net) exhibited edge fragmentation in their segmentation masks, particularly at serrated leaf margins. Transformer-based models (TransUNet, SwinUNet) showed boundary displacement errors due to their limited spatial resolution preservation in high-frequency regions. MSFUnet achieved sub-pixel alignment fidelity, a direct consequence of its EDF module’s hierarchical feature refinement. The EDF-1 submodule’s 5 × 5 convolutional layers combined with global response normalization (GRN) amplify high-frequency edge signatures, while EDF-2’s quadruple upsampling and residual concatenation enhance boundary continuity. This dual-stage processing preserved marginal trichomes and serrations critical for identification.
In overlapping leaf scenarios, conventional models erroneously merged overlapping instances due to inadequate feature disambiguation (Figure 9B). MSFUnet correctly segmented overlapping leaves through a synergistic operation of its contraction–expansion architecture. The contraction network’s deep feature abstraction provided semantic context about leaf count and orientation, while the expansion network’s MFF module resolves spatial ambiguities. This multi-path feature fusion strategy reduced overlap errors compared to U-Net’s single-path reconstruction.
Folded leaf analysis revealed MSFUnet’s superiority in maintaining venation patterns (Figure 9C). MSFUnet preserved midrib continuity through its EDF-2 module’s 4× upsampling and DropPath-regularized residuals. The cross-layer connections of the MFF module enabled an accurate segmentation of folded regions by fusing shallow features with deep semantic context.
Meanwhile, MSFUnet accurately outlined the contours of grape leaves in images from the Grape400 dataset (Figure 10). This performance demonstrated that the model’s architectural advantages—EDF’s hierarchical edge refinement and MFF’s multi-scale feature fusion—effectively translated to a different crop species, confirming MSFUnet’s broad applicability across diverse agricultural crops.
These visual results (Figure 9 and Figure 10) validated the architectural superiority of MSFUnet, especially the edge refinement of the EDF module and the contextual feature fusion of the MFF, and established a new performance benchmark for the real-world monitoring of crop growth status in the field under complex situations.

3.6. Per-Class Performance and Error Analysis

The pixel-level confusion matrix for MSFUnet on the ATTLDD test set further quantified misclassification patterns (Table 6). Among the 311,296,512 test pixels, MSFUnet correctly classified 183,042,349 background pixels and 118,915,268 leaf pixels. It produced 3,735,558 background-to-leaf false positives (1.20%) and 5,603,337 leaf-to-background false negatives (1.80%). These results indicated that most errors occur near leaf–background boundaries, with class-wise IoUs of 93.95% for background and 92.75% for leaf.
To assess specificity in the complete absence of target objects, MSFUnet was also evaluated on leaf-free images. Only 1.20% of pixels were incorrectly labeled as “leaf,” confirming that background misclassification remains minimal even when no target objects are present.
In class-wise segmentation, MSFUnet demonstrated superior performance over all comparative models, based on ATTLDD test set results (Table 7). MSFUnet achieved the highest leaf IoU (0.92), background IoU (0.94), leaf precision (0.94), background precision (0.96), leaf recall (0.94), and background recall (0.95). In contrast, U-Net recorded the lowest leaf IoU (0.81) and a lower background IoU (0.88), while transformer-based methods (TransUNet, SwinUNet) similarly underperformed (leaf IoUs around 0.79–0.80). U2Net and CUDU-Net narrowed this gap (leaf IoUs of 0.90 and 0.89, respectively) but still trailed MSFUnet by 2–3 points.
These quantitative gains (Table 7) directly reflect our architectural contributions. The EDF module’s dual-path design, with depthwise convolutions and global response normalization, sharpens edge localization, driving MSFUnet’s superior precision and recall on serrated margins. Meanwhile, the MFF module’s multi-scale fusion paths disambiguate occlusions in overlapping regions, lifting leaf IoU by 2–11 points relative to the compared models. Collectively, these results confirm that MSFUnet not only excels in overall segmentation accuracy but also maintains robust class-specific performance under real-world conditions.

4. Discussion

This study introduced MSFUnet as an advanced semantic segmentation framework for crop leaf monitoring. The model achieved an MIoU of 93.35% on the apple leaf dataset and demonstrated robust generalization on the grape leaf dataset. The following discussion compares MSFUnet’s methodology and outcomes against established and emerging alternatives, and candidly evaluates its limitations.

4.1. Comparative Analysis of Methodological Approaches and Outcomes

The persistent challenges in crop leaf segmentation—severe occlusion leading to boundary ambiguity and insufficient preservation of morphologically critical edge details—were consistently identified as major constraints in prior research [59]. MSFUnet addressed these challenges more effectively than contemporary models, primarily through its integrated MFF and EDF modules. To contextualize these improvements within the broader evolution of semantic segmentation techniques, a comparative analysis is presented below:
Versus Foundational and Enhanced Segmentation Architectures: Semantic segmentation’s progression, driven by architectures like FCN [15,16] for pixel-level prediction, U-Net [18] for multi-scale fusion via skip connections, and DeepLab [20] leveraging atrous convolutions, established critical capabilities for agricultural applications [11]. Subsequent innovations focused on enhancing feature representation and efficiency—such as disentangling parameters into latent sub-representations [25] and leveraging multimodal fusion to combat data scarcity [26]. U-Net variants like UNet++ [30] and MFBP-UNet [31] improved multi-scale handling and disease detection, while attention mechanisms (e.g., AWUNet [32], AC-UNet [35]) boosted localization. Despite these advances, challenges in boundary ambiguity under occlusion [52,53,54] and fine detail preservation [55] persisted. MSFUnet directly targets these gaps through its integrated EDF and MFF modules, positioning itself as an evolution beyond these foundational approaches.
Versus Transformer-Based Architectures (TransUNet, SwinUNet): Transformer models were recognized for capturing long-range dependencies effectively. However, their global attention mechanisms frequently resulted in spatially inconsistent predictions, particularly in regions demanding high edge fidelity [60]. This limitation manifested in lower MIoU scores (TransUNet: 80.23%; SwinUNet: 79.87%) and poorer boundary alignment (Dice: 0.818–0.834) compared to MSFUnet, aligning with observations that global attention could blur fine structures in botanical imaging. MSFUnet mitigated this by prioritizing local detail preservation alongside contextual integration, offering a solution to the spatial imprecision noted in transformer-based segmentation.
Versus Specialized Agricultural Networks (CUDU-Net, CS-Net): Models tailored for agriculture, such as CUDU-Net [61] and CS-Net [62], represented significant efforts to address domain-specific challenges. While CUDU-Net achieved strong performance (90.78% MIoU), it remained outperformed by MSFUnet. Analysis suggested that sequential refinement architectures, common in such models, risked homogenizing fine biological features during processing, potentially explaining their difficulty in preserving intricate margins under complex overlap. To overcome this, MSFUnet’s EDF module adapts the proven principle of residual paths—successfully used for gradient mitigation and feature transfer in ResNet [36], ConvNeXt V2 [39], and crop monitoring models like MDAM-DRNet [37,38,41,44]—via parallel processing: EDF-1 captured localized edge gradients early, preventing detail dilution, while EDF-2 provided complementary context. This design, inspired by the efficacy of residual connectivity for complex structures [42], resonated with calls for dedicated edge pathways in plant phenotyping.
Versus Multi-Scale Fusion Frameworks (U2Net): Foundational techniques like ASPP [45] and pyramid pooling [21] established the value of multi-scale features. U2Net effectively leveraged this concept, achieving high accuracy (91.04% MIoU). However, studies indicated that fixed-scale fusion could lack adaptability in complex scenes, struggling to balance local detail and global context optimally [63]. Building upon these concepts but addressing their rigidity, MSFUnet’s MFF implemented learnable weighting of its multiple distinct paths—extending beyond dense connections (DenseASPP [48]) and multi-scale path integrations [51] used for leaf variation adaptation [49,50]. This yielded a more nuanced and contextually adaptive feature representation, contributing significantly to the observed gains over U2Net and superior handling of occlusions, thus aligning with the trend towards dynamic feature fusion emphasized in the recent literature [64].
Collectively, MSFUnet represented an integrated approach tackling three interconnected challenges critical for practical crop monitoring [65]: (1) enhanced edge localization under morphological variation, (2) preservation of structural integrity for fine features, and (3) contextually adaptive fusion for occluded objects. The performance advantages observed over other leading models validated this integrated design philosophy.

4.2. Practical Implications, Acknowledged Constraints and Future Trajectories

The high accuracy and practical inference speed of MSFUnet indicated significant potential for enhancing precision agriculture workflows. Reliable pixel-level segmentation facilitated the extraction of key growth metrics such as individual leaf area, shape description, and precise boundary health assessment—core objectives established in foundational agricultural vision research [2,3]. Integration into field scouting platforms or phenotyping systems could substantially improve monitoring objectivity and resolution compared to manual methods [4,5] or less accurate automated segmentation [14], ultimately supporting resource optimization [1]. The model’s efficacy across different broad-leaf crops suggested wider applicability potential.
However, two primary limitations warrant consideration:
Taxonomic Specificity: While MSFUnet generalized well between certain crop types, performance on plants with significantly divergent leaf morphologies (e.g., grasses or highly dissected leaves) remained untested. Preliminary indications suggested potential performance variations outside its primary training domain, reflecting the broader challenge of model generalization in monitoring crop growth status [66].
Computational Demand: MSFUnet’s parameter count and computational cost (58.6M Params, 38.4 GFLOPs), while favorable compared to large transformers, exceeded lightweight alternatives like CS-Net (29.2M Params, 14.8 GFLOPs). This overhead, noted as a key constraint for agricultural robotics and UAVs [67], could impact deployment on resource-constrained edge devices [13].
These constraints delineated focused future research avenues:
Model Compression: Exploring knowledge distillation or efficient architecture design (e.g., inspired by [25,34]) to develop a lightweight variant targeting under 15 GFLOPs, while maintaining high accuracy, extending successes in agricultural model optimization [68].
Cross-Domain Robustness: Investigating synthetic-to-real transfer learning or unsupervised domain adaptation to enhance generalization across diverse crop species and field conditions, mitigating the data scarcity challenge. Leveraging generative models for synthetic data creation represented a promising approach [69].
Embedded Deployment Validation: Rigorous testing and optimization of MSFUnet or its derivatives on actual agricultural edge hardware platforms to assess real-world viability under operational constraints [70].

5. Conclusions

In this study, our experimental results, spanning ablation studies, quantitative comparisons, and cross-crop validation, confirm the central hypothesis: MSFUnet’s EDF block markedly improves edge localization, and its MFF strategy resolves complex overlap ambiguities. Together, these modules yielded a 93.35% MIoU on apple leaves and similarly high performance on grape leaves. These findings validate our architectural choices and demonstrate MSFUnet’s potential for accurate and reliable crop monitoring in diverse agricultural settings. Future work will explore lightweight implementations and domain adaptation strategies to further extend its applicability.

Author Contributions

Writing—original draft, Methodology, Supervision and Investigation, Z.C.; Visualization, Writing—review and editing and Validation, H.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data used to support the findings of this study are available from the corresponding authors upon request.

Acknowledgments

We would like to thank those who contributed to this research and manuscript, providing valuable feedback and technical support throughout the study.

Conflicts of Interest

The authors declare that they have no conflicts of interest to report regarding the present study.

References

  1. Huang, J.; Gómez-Dans, J.L.; Huang, H.; Ma, H.; Wu, Q.; Lewis, P.E.; Liang, S.; Chen, Z.; Xue, J.H.; Wu, Y.; et al. Assimilation of remote sensing into crop growth models: Current status and perspectives. Agric. For. Meteorol. 2019, 276, 107609. [Google Scholar] [CrossRef]
  2. Kataoka, T.; Kaneko, T.; Okamoto, H.; Hata, S. Crop growth estimation system using machine vision. In Proceedings of the Proceedings 2003 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM 2003), Kobe, Japan, 20–24 July 2003; Volume 2, pp. b1079–b1083. [Google Scholar]
  3. Ashok Kumar, D.; Prema, P. A review on crop and weed segmentation based on digital images. In Multimedia Processing, Communication and Computing Applications, Proceedings of the First International Conference, ICMCCA, Kochi, Kerala, India, 13–15 December 2012; Springer: New Delhi, India, 2013; pp. 279–291. [Google Scholar]
  4. Ramesh, S.; Hebbar, R.; Niveditha, M.; Pooja, R.; Shashank, N.; Vinod, P. Plant disease detection using machine learning. In Proceedings of the 2018 International Conference on Design Innovations for 3Cs Compute Communicate Control (ICDI3C), Bangalore, India, 25–28 April 2018; pp. 41–45. [Google Scholar]
  5. Yang, X.; LiShen, J.; Zhang, L.; Fan, X.; Ye, Q.; Fu, L. Non-rigid object detection via fast one-class model. Pattern Recognit. 2025, 168, 111821. [Google Scholar] [CrossRef]
  6. Varoquaux, G.; Cheplygina, V. Machine learning for medical imaging: Methodological failures and recommendations for the future. npj Digit. Med. 2022, 5, 48. [Google Scholar] [CrossRef] [PubMed]
  7. Wang, P.; Li, Y.; Tan, B.F.; Zhou, Y.C.; Li, Y.; Wei, X.S. Multibranch co-training to mine venomous feature representation: A solution to snakeclef2024. In Working Notes of CLEF, Proceedings of the CLEF 2024: Conference and Labs of the Evaluation Forum, Grenoble, France, 9–12 September 2024; CEUR-WS.org: Aachen, Germany, 2024. [Google Scholar]
  8. Zheng, Y.; Song, Q.; Liu, J.; Song, Q.; Yue, Q. Research on motion pattern recognition of exoskeleton robot based on multimodal machine learning model. Neural Comput. Appl. 2020, 32, 1869–1877. [Google Scholar] [CrossRef]
  9. Saha, D.; Manickavasagan, A. Machine learning techniques for analysis of hyperspectral images to determine quality of food products: A review. Curr. Res. Food Sci. 2021, 4, 28–44. [Google Scholar] [CrossRef]
  10. Wang, H.; Ding, J.; He, S.; Feng, C.; Zhang, C.; Fan, G.; Wu, Y.; Zhang, Y. MFBP-UNet: A network for pear leaf disease segmentation in natural agricultural environments. Plants 2023, 12, 3209. [Google Scholar] [CrossRef]
  11. Rehman, T.U.; Mahmud, M.S.; Chang, Y.K.; Jin, J.; Shin, J. Current and future applications of statistical machine learning algorithms for agricultural machine vision systems. Comput. Electron. Agric. 2019, 156, 585–605. [Google Scholar] [CrossRef]
  12. Bisht, B. Yield Prediction Using Spatial and Temporal Deep Learning Algorithms and Data Fusion. Ph.D. Thesis, Université d’Ottawa/University of Ottawa, Ottawa, ON, Canada, 2023. [Google Scholar]
  13. Asaf, M.Z.; Rasul, H.; Akram, M.U.; Hina, T.; Rashid, T.; Shaukat, A. A modified deep semantic segmentation model for analysis of whole slide skin images. Sci. Rep. 2024, 14, 23489. [Google Scholar] [CrossRef]
  14. Lee, J.; Ilyas, T.; Jin, H.; Lee, J.; Won, O.; Kim, H.; Lee, S.J. A pixel-level coarse-to-fine image segmentation labelling algorithm. Sci. Rep. 2022, 12, 8672. [Google Scholar] [CrossRef]
  15. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  16. Tan, J.; Huang, L.; Chen, Z.; Qu, R.; Li, C. DarkSegNet: Low-light semantic segmentation network based on image pyramid. Signal Process. Image Commun. 2025, 135, 117265. [Google Scholar] [CrossRef]
  17. Shi, H.; Li, H.; Meng, F.; Wu, Q.; Xu, L.; Ngan, K.N. Hierarchical parsing net: Semantic scene parsing from global scene to objects. IEEE Trans. Multimed. 2018, 20, 2670–2682. [Google Scholar] [CrossRef]
  18. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015, Proceedings of the 18th International Conference, Munich, Germany, 5–9 October 2015, Proceedings, Part III 18; Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  19. Biswas, R. Multi-Task Multi-Scale Contrastive Knowledge Distillation for Efficient Medical Image Segmentation. arXiv 2024, arXiv:2406.03173. [Google Scholar]
  20. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
  21. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  22. Zhang, Y.; Gao, D.; Du, Y.; Li, B.; Qin, L. Efficient multi-scale network for semantic segmentation of fine-resolution remotely sensed images. Meas. Sci. Technol. 2024, 35, 096005. [Google Scholar] [CrossRef]
  23. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  24. Gakhar, I.; Guha, A.; Gupta, A.; Agarwal, A.; Toshniwal, D.; Verma, U. TLDR: Traffic Light Detection using Fourier Domain Adaptation in Hostile WeatheR. arXiv 2024, arXiv:2411.07901. [Google Scholar]
  25. Wang, J.; Guo, J.; Wang, R.; Zhang, Z.; Fu, L.; Ye, Q. Parameter Disentanglement for Diverse Representations. Big Data Min. Anal. 2025, 8, 606–623. [Google Scholar] [CrossRef]
  26. Shen, Y.; Xiao, L.; Chen, J.; Du, Q.; Ye, Q. Learning Cross-Task Features With Mamba for Remote Sensing Image Multitask Prediction. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5612116. [Google Scholar] [CrossRef]
  27. Wei, R.; Yan, R.; Qu, H.; Li, X.; Ye, Q.; Fu, L. SVMFN-FSAR: Semantic-Guided Video Multimodal Fusion Network for Few-Shot Action Recognition. Big Data Min. Anal. 2025, 8, 534–550. [Google Scholar] [CrossRef]
  28. Sun, L.; Zhou, J.; Ye, Q.; Wu, Z.; Chen, Q.; Xu, Z.; Fu, L. MDC-FusFormer: Multiscale Deep Cross-Fusion Transformer Network for Hyperspectral and Multispectral Image Fusion. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5528914. [Google Scholar] [CrossRef]
  29. Jin, X.; Wu, X.; Weng, L.; Ye, Q. Lightweight binary convolutional-transformers fusion network for facial expression recognition. Eng. Appl. Artif. Intell. 2025, 158, 111315. [Google Scholar] [CrossRef]
  30. Zhou, Z.; Rahman Siddiquee, M.M.; Tajbakhsh, N.; Liang, J. Unet++: A nested u-net architecture for medical image segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, Proceedings of the 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, 20 September 2018, Proceedings 4; Springer: Cham, Switzerland, 2018; pp. 3–11. [Google Scholar]
  31. Fu, J.; Li, X.x.; Chen, F.h.; Wu, G. Pear leaf disease segmentation method based on improved DeepLabv3+. Cogent Food Agric. 2024, 10, 2310805. [Google Scholar] [CrossRef]
  32. Banu, A.S.; Deivalakshmi, S. AWUNet: Leaf area segmentation based on attention gate and wavelet pooling mechanism. Signal, Image Video Process. 2023, 17, 1915–1924. [Google Scholar] [CrossRef]
  33. Duhan, S.; Gulia, P.; Gill, N.S.; Shukla, P.K.; Khan, S.B.; Almusharraf, A.; Alkhaldi, N. Investigating attention mechanisms for plant disease identification in challenging environments. Heliyon 2024, 10, e29802. [Google Scholar] [CrossRef] [PubMed]
  34. Bhagat, S.; Kokare, M.; Haswani, V.; Hambarde, P.; Kamble, R. Eff-UNet++: A novel architecture for plant leaf segmentation and counting. Ecol. Inform. 2022, 68, 101583. [Google Scholar] [CrossRef]
  35. Yi, X.; Wang, J.; Wu, P.; Wang, G.; Mo, L.; Lou, X.; Liang, H.; Huang, H.; Lin, E.; Maponde, B.T.; et al. AC-UNet: An improved UNet-based method for stem and leaf segmentation in Betula luminifera. Front. Plant Sci. 2023, 14, 1268098. [Google Scholar] [CrossRef]
  36. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  37. Fang, X.; Yan, P. Multi-organ segmentation over partially labeled datasets with multi-scale feature abstraction. IEEE Trans. Med Imaging 2020, 39, 3619–3629. [Google Scholar] [CrossRef]
  38. Li, L.; Wang, B.; Li, Y.; Yang, H. Diagnosis and mobile application of apple leaf disease degree based on a small-sample dataset. Plants 2023, 12, 786. [Google Scholar] [CrossRef]
  39. Woo, S.; Debnath, S.; Hu, R.; Chen, X.; Liu, Z.; Kweon, I.S.; Xie, S. Convnext v2: Co-designing and scaling convnets with masked autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 16133–16142. [Google Scholar]
  40. Deng, J.; Wang, Y.; Zhang, C. FSCformernet: A Fourier-Transformer UNet for Efficient Semantic Segmentation of Plant Leaf. In Advanced Intelligent Computing Technology and Applications, Proceedings of the 20th International Conference, ICIC 2024, Tianjin, China, 5–8 August 2024, Proceedings, Part XI; Springer: Singapore, 2024; pp. 266–277. [Google Scholar]
  41. Omia, E.; Bae, H.; Park, E.; Kim, M.S.; Baek, I.; Kabenge, I.; Cho, B.K. Remote sensing in field crop monitoring: A comprehensive review of sensor systems, data analyses and recent advances. Remote Sens. 2023, 15, 354. [Google Scholar] [CrossRef]
  42. Lagergren, J.; Pavicic, M.; Chhetri, H.B.; York, L.M.; Hyatt, D.; Kainer, D.; Rutter, E.M.; Flores, K.; Bailey-Bale, J.; Klein, M.; et al. Few-Shot Learning Enables Population-Scale Analysis of Leaf Traits in Populus trichocarpa. Plant Phenomics 2023, 5, 0072. [Google Scholar] [CrossRef]
  43. Zhang, S.; Zhang, C. Modified U-Net for plant diseased leaf image segmentation. Comput. Electron. Agric. 2023, 204, 107511. [Google Scholar] [CrossRef]
  44. Liao, T.; Yang, R.; Zhao, P.; Zhou, W.; He, M.; Li, L. MDAM-DRNet: Dual channel residual network with multi-directional attention mechanism in strawberry leaf diseases detection. Front. Plant Sci. 2022, 13, 869524. [Google Scholar] [CrossRef] [PubMed]
  45. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
  46. Yi, X.; Zhou, Y.; Wu, P.; Wang, G.; Mo, L.; Chola, M.; Fu, X.; Qian, P. U-Net with Coordinate Attention and VGGNet: A Grape Image Segmentation Algorithm Based on Fusion Pyramid Pooling and the Dual-Attention Mechanism. Agronomy 2024, 14, 925. [Google Scholar] [CrossRef]
  47. Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5693–5703. [Google Scholar]
  48. Yang, M.; Yu, K.; Zhang, C.; Li, Z.; Yang, K. Denseaspp for semantic segmentation in street scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3684–3692. [Google Scholar]
  49. Luo, X.; Wei, Y.; Chen, Y.; Chen, Z.; Fang, Y. Automatic reading method for pointer meters based on improved Deeplabv3+. In Proceedings of the 2023 IEEE 7th Information Technology and Mechatronics Engineering Conference (ITOEC), Chongqing, China, 15–17 September 2023; Volume 7, pp. 1958–1963. [Google Scholar]
  50. Wang, J.; Han, L.; Ran, D. Architectures and applications of U-net in medical image segmentation: A review. In Proceedings of the 2023 9th International Symposium on System Security, Safety, and Reliability (ISSSR), Hangzhou, China, 10–11 June 2023; pp. 84–94. [Google Scholar]
  51. Su, R.; Zhang, D.; Liu, J.; Cheng, C. Msu-net: Multi-scale u-net for 2d medical image segmentation. Front. Genet. 2021, 12, 639930. [Google Scholar] [CrossRef] [PubMed]
  52. Wang, X.; Liu, J.; Liu, G. Diseases detection of occlusion and overlapping tomato leaves based on deep learning. Front. Plant Sci. 2021, 12, 792244. [Google Scholar] [CrossRef]
  53. Wang, Y.; Zhang, Z. Segment Any Leaf 3D: A Zero-Shot 3D Leaf Instance Segmentation Method Based on Multi-View Images. Sensors 2025, 25, 526. [Google Scholar] [CrossRef]
  54. Rana, S.; Gerbino, S.; Carillo, P. Study of spectral overlap and heterogeneity in agriculture based on soft classification techniques. MethodsX 2025, 14, 103114. [Google Scholar] [CrossRef]
  55. Chen, Y.; Baireddy, S.; Cai, E.; Yang, C.; Delp, E.J. Leaf segmentation by functional modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–20 June 2019. [Google Scholar]
  56. Xu, Y.; Ma, Y.; Ye, Q.; Fu, L.; Yang, X. Interpretable deep one-class model for forest fire detection. Expert Syst. Appl. 2025, 281, 127657. [Google Scholar]
  57. Chao, X.; Hu, X.; Feng, J.; Zhang, Z.; Wang, M.; He, D. Construction of apple leaf diseases identification networks based on xception fused by SE module. Appl. Sci. 2021, 11, 4614. [Google Scholar] [CrossRef]
  58. Saha, D.; Mangukia, M.P.; Manickavasagan, A. Real-time deployment of MobileNetV3 model in edge computing devices using RGB color images for varietal classification of chickpea. Appl. Sci. 2023, 13, 7804. [Google Scholar] [CrossRef]
  59. Yang, K.; Zhong, W.; Li, F. Leaf segmentation and classification with a complicated background using deep learning. Agronomy 2020, 10, 1721. [Google Scholar] [CrossRef]
  60. Ma, Y.; Lan, X. Semantic segmentation using cross-stage feature reweighting and efficient self-attention. Image Vis. Comput. 2024, 145, 104996. [Google Scholar] [CrossRef]
  61. Cai, W.; Wang, B.; Zeng, F. CUDU-Net: Collaborative up-sampling decoder U-Net for leaf vein segmentation. Digit. Signal Process. 2024, 144, 104287. [Google Scholar] [CrossRef]
  62. Liu, L.; Li, G.; Du, Y.; Li, X.; Wu, X.; Qiao, Z.; Wang, T. CS-net: Conv-simpleformer network for agricultural image segmentation. Pattern Recognit. 2024, 147, 110140. [Google Scholar] [CrossRef]
  63. Huang, S.; Huang, H. AMFFNet: Adaptive Multi-Scale Feature Fusion Network for Urban Image Semantic Segmentation. Electronics 2025, 14, 2344. [Google Scholar] [CrossRef]
  64. Hu, Y.; Chen, Y.; Li, X.; Feng, J. Dynamic feature fusion for semantic edge detection. arXiv 2019, arXiv:1902.09104. [Google Scholar]
  65. Donapati, R.R.; Cheruku, R.; Kodali, P. Real-time seed detection and germination analysis in precision agriculture: A fusion model with u-net and cnn on jetson nano. IEEE Trans. AgriFood Electron. 2023, 1, 145–155. [Google Scholar] [CrossRef]
  66. Ilyas, T.; Lee, J.; Won, O.; Jeong, Y.; Kim, H. Overcoming field variability: Unsupervised domain adaptation for enhanced crop-weed recognition in diverse farmlands. Front. Plant Sci. 2023, 14, 1234616. [Google Scholar] [CrossRef]
  67. Zhang, M.; Zhang, F.; Lane, N.D.; Shu, Y.; Zeng, X.; Fang, B.; Yan, S.; Xu, H. Deep learning in the era of edge computing: Challenges and opportunities. In Fog Computing: Theory and Practice; John Wiley & Sons: Hoboken, NJ, USA, 2020; pp. 67–78. [Google Scholar]
  68. Alba, A.; Villaverde, J.; Lacara, A.; Domingo, J.; Aguirre, D. Optimized FLOPs-Aware Knowledge Distillation for TinyML Applications in Agriculture. In Proceedings of the 2025 International Conference on Advancement in Data Science, E-Learning and Information System (ICADEIS), Bandung, Indonesia, 3–4 February 2025; pp. 1–6. [Google Scholar]
  69. Cieslak, M.; Govindarajan, U.; Garcia, A.; Chandrashekar, A.; Hadrich, T.; Mendoza-Drosik, A.; Michels, D.L.; Pirk, S.; Fu, C.C.; Palubicki, W. Generating diverse agricultural data for vision-based farming applications. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 5422–5431. [Google Scholar]
  70. Posso, J.; Kieffer, H.; Menga, N.; Hlimi, O.; Tarris, S.; Guerard, H.; Bois, G.; Couderc, M.; Jenn, E. Real-Time Semantic Segmentation of Aerial Images Using an Embedded U-Net: A Comparison of CPU, GPU, and FPGA Workflows. arXiv 2025, arXiv:2503.08700. [Google Scholar]
Figure 1. Apple leaf data images. (a) Multiple flat leaves with clear venation under natural daylight. (b) Irregular leaves showing folded shape and serrated edges. (c) Complex clusters of overlapping leaves. (d) Leaves photographed at night with artificial illumination. (e) Example of horizontal and vertical flipping applied to diversify leaf morphology. (f) Contrast and brightness adjustments performed to simulate lighting variations and enhance edge visibility.
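The flip-based and photometric augmentations illustrated in panels (e) and (f) of Figure 1 can be expressed as a short image-side transform pipeline. The sketch below uses torchvision for illustration only; the flip probabilities and jitter ranges are our assumptions rather than the exact settings used in this study, and for segmentation the geometric transforms would additionally have to be applied to the ground-truth masks.

```python
import torchvision.transforms as T

# Illustrative image-side augmentation pipeline (assumed parameters).
# Geometric transforms must be mirrored on the segmentation masks as well;
# only the image branch is sketched here.
train_transforms = T.Compose([
    T.RandomHorizontalFlip(p=0.5),                # panel (e): horizontal flipping
    T.RandomVerticalFlip(p=0.5),                  # panel (e): vertical flipping
    T.ColorJitter(brightness=0.3, contrast=0.3),  # panel (f): brightness/contrast variation
    T.ToTensor(),
])
```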
Figure 2. Impact of data augmentation on MSFUnet segmentation performance.
Figure 3. Display of grape leaf images from Grape400.
Figure 4. Crop leaf images.
Figure 5. MSFUnet structure diagram. (A) The overall structure of MSFUnet. (B) The multi-path feature fusion module. (C) The edge-detail focus module.
Figure 6. Loss curves of MSFUnet and the compared methods.
Figure 7. MIoU curves across 100 training epochs for MSFUnet and six segmentation models.
Figure 8. Ablation results. The baseline U-Net achieved 84.61% MIoU. Adding only the MFF module raised the MIoU to 90.26%, while using only the EDF module increased it to 88.78%. Combining both modules in MSFUnet further improved the MIoU to 93.35%, indicating that the two modules provide complementary gains.
Figure 9. Leaf segmentation visualization results. (A) Irregular margins: MSFUnet achieves precise contouring, whereas comparison models produce fragmented or displaced edges. (B) Overlapping leaves: MSFUnet maintains distinct leaf instances, whereas conventional models erroneously merge overlapping regions. (C) Folded structures: MSFUnet preserves venation continuity, while other architectures exhibit midrib fragmentation.
Figure 10. MSFUnet segmentation results on grape leaf images.
Table 1. Software and hardware information.
Hardware                   | Software
CPU: Intel Core i9-13900K  | OS: Windows 11 x64
RAM: 64 GB                 | CUDA Toolkit: 11.1
GPU: RTX 4090              | Python: 3.8
Video memory: 24 GB        | PyTorch: 1.8.0
                           | Torchvision: 0.9.0
Table 2. Experimental settings.
Parameter              | Value
Size of input images   | 512 × 512 pixels
Batch size             | 8
Maximum learning rate  | 0.0001
Minimum learning rate  | 0.000001
Scheduler              | Cosine annealing
Optimizer              | Adam
Loss function          | Dice + BCE (weighted)
Class weights          | Leaf: 1.5, Background: 0.7
Momentum               | 0.9
Number of iterations   | 100 epochs
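For reference, the configuration in Table 2 can be wired up in PyTorch roughly as follows. This is a minimal sketch under stated assumptions: the weighted Dice + BCE loss is written for a single-channel leaf/background formulation, the momentum of 0.9 is interpreted as Adam's first-moment coefficient (its default), and the model is a placeholder rather than the authors' implementation of MSFUnet.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def weighted_dice_bce_loss(logits, targets, w_leaf=1.5, w_bg=0.7, smooth=1.0):
    """Weighted BCE + soft Dice for a binary leaf/background mask (assumed formulation).
    `logits` and `targets` have shape (N, 1, H, W); class weights follow Table 2."""
    probs = torch.sigmoid(logits)
    # Pixel-wise BCE, re-weighted so leaf pixels count 1.5x and background pixels 0.7x.
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    pixel_weights = targets * w_leaf + (1.0 - targets) * w_bg
    bce = (bce * pixel_weights).mean()
    # Soft Dice on the leaf (foreground) channel.
    intersection = (probs * targets).sum()
    dice = 1.0 - (2.0 * intersection + smooth) / (probs.sum() + targets.sum() + smooth)
    return bce + dice

# Optimizer and scheduler per Table 2: Adam (beta1 = 0.9), max LR 1e-4,
# cosine annealing down to 1e-6 over 100 epochs.
model = nn.Conv2d(3, 1, kernel_size=3, padding=1)   # placeholder stand-in for MSFUnet
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-6)

# Smoke test with random 512 x 512 inputs (batch reduced to 2 for speed).
x = torch.randn(2, 3, 512, 512)
y = (torch.rand(2, 1, 512, 512) > 0.5).float()
loss = weighted_dice_bce_loss(model(x), y)
loss.backward()
optimizer.step()
scheduler.step()
print(float(loss))
```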
Table 3. MSFUnet layer configuration and output dimensions.
Module    | Contracting      | Size, Stride | Output Size     | Module    | Decoding         | Size, Stride | Output Size
Encoder 1 | Conv 1           | 3 × 3, 1     | 512 × 512 × 64  | Decoder 4 | Upsample, Concat | 2 × 2, 2     | 128 × 128 × 256
          | Conv 2           | 3 × 3, 1     | 512 × 512 × 64  |           | EDF-1            |              | 128 × 128 × 256
          | Maxpooling       | 2 × 2, 2     | 256 × 256 × 64  |           | EDF-2            |              | 128 × 128 × 128
Encoder 2 | Conv 3           | 3 × 3, 1     | 256 × 256 × 128 | Decoder 3 | Upsample, Concat | 2 × 2, 2     | 256 × 256 × 128
          | Conv 4           | 3 × 3, 1     | 256 × 256 × 128 |           | EDF-1            |              | 256 × 256 × 128
          | Maxpooling       | 2 × 2, 2     | 128 × 128 × 128 |           | EDF-2            |              | 256 × 256 × 64
Encoder 3 | Conv 5           | 3 × 3, 1     | 128 × 128 × 256 | Decoder 2 | Upsample, Concat | 2 × 2, 2     | 512 × 512 × 64
          | Conv 6           | 3 × 3, 1     | 128 × 128 × 256 |           | EDF-1            |              | 512 × 512 × 64
          | Maxpooling       | 2 × 2, 2     | 64 × 64 × 256   |           |                  |              |
Encoder 4 | Conv 7           | 3 × 3, 1     | 64 × 64 × 512   | Decoder 1 | Concat, Addition |              | 512 × 512 × 64
          | Conv 8           | 3 × 3, 1     | 64 × 64 × 512   |           |                  |              |
          | Maxpooling       | 2 × 2, 2     | 32 × 32 × 512   |           |                  |              |
Bottom    | Conv 9           | 3 × 3, 1     | 32 × 32 × 1024  | Final     | Conv 11          | 1 × 1, 1     | 512 × 512 × 3
          | Conv 10          | 3 × 3, 1     | 32 × 32 × 1024  |           | SoftMax          |              | 512 × 512 × 3
          | Upsample, Concat | 2 × 2, 2     | 64 × 64 × 512   |           |                  |              |
          | EDF-1            |              | 64 × 64 × 512   |           |                  |              |
          | EDF-2            |              | 64 × 64 × 256   |           |                  |              |
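The contracting path in Table 3 follows the familiar pattern of two 3 × 3 convolutions followed by 2 × 2 max pooling at each stage. The sketch below is a hypothetical re-implementation of one such stage for dimension checking only; the activation (ReLU) and padding choices are assumptions, and the MFF and EDF modules of the decoder are not reproduced here.

```python
import torch
import torch.nn as nn

class EncoderStage(nn.Module):
    """One contracting stage from Table 3: two 3x3 convolutions (stride 1, padding 1)
    followed by 2x2 max pooling. Illustrative only."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.double_conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
        )
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

    def forward(self, x):
        skip = self.double_conv(x)   # feature map kept for the skip connection
        return self.pool(skip), skip

# Dimension check against Table 3 (Encoder 1: 512 x 512 x 3 input -> 256 x 256 x 64 after pooling).
x = torch.randn(1, 3, 512, 512)
pooled, skip = EncoderStage(3, 64)(x)
print(skip.shape, pooled.shape)  # (1, 64, 512, 512) and (1, 64, 256, 256)
```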
Table 4. Segmentation performance comparison on the ATTLDD apple leaf dataset.
Model     | MIoU (%)     | MPA (%)      | MPrecision (%) | Dice
U-Net     | 84.61 ± 0.32 | 91.39 ± 0.28 | 89.55 ± 0.41   | 0.912 ± 0.004
U2Net     | 91.04 ± 0.19 | 94.17 ± 0.15 | 93.96 ± 0.22   | 0.916 ± 0.003
TransUNet | 80.23 ± 0.47 | 85.78 ± 0.36 | 88.81 ± 0.52   | 0.818 ± 0.006
SwinUNet  | 79.87 ± 0.51 | 86.12 ± 0.42 | 84.35 ± 0.49   | 0.834 ± 0.007
CUDU-Net  | 90.78 ± 0.21 | 94.07 ± 0.18 | 93.51 ± 0.25   | 0.921 ± 0.003
CS-net    | 87.50 ± 0.35 | 93.77 ± 0.29 | 92.16 ± 0.38   | 0.905 ± 0.005
MSFUnet   | 93.35 ± 0.12 | 95.35 ± 0.10 | 94.93 ± 0.14   | 0.937 ± 0.002
p-value   | <0.001       | <0.001       | <0.001         | <0.001
Table 5. Comparison of model complexity and inference efficiency.
Model     | Params (M) | FLOPs (G) | Inference Time (ms/Image)
U-Net     | 31.0       | 15.2      | 12.4 ± 0.5
U2Net     | 174.7      | 127.8     | 60.3 ± 1.2
TransUNet | 105.7      | 90.5      | 51.6 ± 2.0
SwinUNet  | 88.2       | 81.3      | 45.1 ± 2.3
CUDU-Net  | 40.5       | 22.7      | 18.7 ± 0.6
CS-net    | 29.2       | 14.8      | 13.0 ± 0.5
MSFUnet   | 58.6       | 38.4      | 37.0 ± 0.5
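Complexity figures of the kind reported in Table 5 can be obtained with a simple measurement routine along the following lines. The helper names and the timing protocol (input size, warm-up, number of runs) are our assumptions; FLOPs are typically obtained with an external profiler such as thop or ptflops and are not computed here.

```python
import time
import torch

def count_parameters(model: torch.nn.Module) -> float:
    """Trainable parameters in millions (the 'Params (M)' column)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

@torch.no_grad()
def mean_latency_ms(model, input_size=(1, 3, 512, 512), runs=100, warmup=10):
    """Average forward-pass time per image in milliseconds on the model's device."""
    device = next(model.parameters()).device
    x = torch.randn(*input_size, device=device)
    model.eval()
    for _ in range(warmup):                 # warm-up passes are excluded from timing
        model(x)
    if device.type == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    if device.type == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs * 1000.0

# Demo on a stand-in model (not MSFUnet itself).
demo = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1)
print(count_parameters(demo), mean_latency_ms(demo, runs=10))
```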
Table 6. Pixel-level confusion matrix for MSFUnet. Entries are pixel counts (percentage of all test pixels).
Actual \ Predicted | Background           | Leaf                 | Total (Actual)
Background         | 183,042,349 (58.78%) | 3,735,558 (1.20%)    | 186,777,907 (60.0%)
Leaf               | 5,603,337 (1.80%)    | 118,915,268 (38.22%) | 124,518,605 (40.0%)
Total Predicted    | 188,645,686 (60.58%) | 122,650,826 (39.42%) | 311,296,512 (100%)
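As a worked example, per-class IoU, precision, and recall can be recovered directly from the pixel counts in Table 6; the short script below illustrates the computation. Because these are aggregate counts over the whole test set, the resulting values need not coincide exactly with the averaged figures reported in Tables 4 and 7.

```python
import numpy as np

# Pixel counts from Table 6; rows = actual class, columns = predicted class
# (order: background, leaf).
conf = np.array([
    [183_042_349,   3_735_558],   # actual background
    [  5_603_337, 118_915_268],   # actual leaf
], dtype=np.float64)

tp = np.diag(conf)               # correctly classified pixels per class
fp = conf.sum(axis=0) - tp       # predicted as the class but actually the other
fn = conf.sum(axis=1) - tp       # actually the class but predicted as the other

iou = tp / (tp + fp + fn)        # per-class IoU
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print("IoU (background, leaf):", iou.round(4))
print("Mean IoU over classes:", iou.mean().round(4))
print("Precision:", precision.round(4), "Recall:", recall.round(4))
```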
Table 7. Per-class segmentation metrics on the ATTLDD test set.
Model     | Leaf IoU | Bg IoU | Leaf Precision | Bg Precision | Leaf Recall | Bg Recall
U-Net     | 0.81     | 0.88   | 0.84           | 0.95         | 0.93        | 0.90
U2Net     | 0.90     | 0.93   | 0.91           | 0.96         | 0.95        | 0.94
TransUNet | 0.79     | 0.82   | 0.88           | 0.89         | 0.88        | 0.89
SwinUNet  | 0.80     | 0.83   | 0.83           | 0.86         | 0.85        | 0.85
CUDU-Net  | 0.89     | 0.93   | 0.93           | 0.94         | 0.90        | 0.95
CS-net    | 0.86     | 0.89   | 0.91           | 0.95         | 0.94        | 0.93
MSFUnet   | 0.92     | 0.94   | 0.94           | 0.96         | 0.94        | 0.95
