Article

Automatic Scribble Annotations Based Semantic Segmentation Model for Seedling-Stage Maize Images

College of Information and Electrical Engineering, Shenyang Agricultural University, Shenyang 110866, China
*
Author to whom correspondence should be addressed.
Agronomy 2025, 15(8), 1972; https://doi.org/10.3390/agronomy15081972
Submission received: 26 June 2025 / Revised: 9 August 2025 / Accepted: 13 August 2025 / Published: 15 August 2025
(This article belongs to the Section Precision and Digital Agriculture)

Abstract

Canopy coverage at the seedling stage is a key indicator for assessing maize growth and predicting yield. Deep learning methods are commonly used to estimate canopy coverage from maize images, but fully supervised models require pixel-level annotations, which demand substantial manual labor. To overcome this problem, we propose ASLNet (Automatic Scribble Labeling-based Semantic Segmentation Network), a weakly supervised model for image semantic segmentation. We designed a module that self-generates scribble labels for maize plants in an image. ASLNet is built on a collaborative mechanism composed of scribble label generation, pseudo-label guided training, and double-loss joint optimization, and its cross-scale contrastive regularization enables semantic segmentation without manual labels. We evaluated the model in terms of label quality and segmentation accuracy. The results showed that ASLNet generated high-quality scribble labels and maintained stable segmentation performance across different scribble densities. Compared to Scribble4All, ASLNet improved mIoU by 3.15%, and it outperformed fully and weakly supervised models by 6.6% and 15.28% in segmentation accuracy, respectively. Our work demonstrates that ASLNet can be trained with pseudo-labels and offers a cost-effective approach for canopy coverage estimation at the maize seedling stage, enabling early assessment of maize growth conditions and yield prediction.

1. Introduction

With the integration of information technology and crop breeding, computer vision has become increasingly vital in plant phenotyping. High-throughput phenotyping facilitates efficient acquisition of crop traits, enabling timely growth monitoring and early yield prediction. The seedling stage is particularly important, as maize canopy coverage at this phase indicates plant vigor, health, and potential yield [1]. However, traditional manual measurements remain inefficient and susceptible to operator bias.
Classical computer vision techniques have been used to extract canopy traits. Methods like Otsu’s thresholding [2], Lab color space segmentation [3], and K-means clustering [4] show robustness under controlled settings but degrade in complex field conditions. While Naive Bayes-based segmentation [5] and multi-threshold segmentation using the multi-stage Cauchy grey wolf optimizer [6] improve accuracy, they remain computationally demanding. Despite reducing subjectivity, these traditional methods struggle with image variability and lack consistency across diverse scenarios.
Recent advances in deep learning have introduced powerful tools for semantic segmentation in crop phenotyping by enabling automatic extraction of high-level features through end-to-end training. Since Hariharan et al. [7] proposed a unified framework integrating object detection and segmentation, numerous studies have adopted deep learning in plant analysis. Zhang et al. [8] incorporated morphological priors to segment weeds in maize fields. Hong et al. [9] used lightweight SSDLite-MobileDet for fast and accurate canopy detection. Zenkl et al. [10] applied DeepLab V3+ to winter wheat segmentation under natural conditions, and Turgut et al. [11] proposed an attention-based architecture for hierarchical organ segmentation. Fan et al. [12] introduced a multi-scale two-stream model for leaf detection and counting, while DeepShoot [13] demonstrated over 90% segmentation accuracy across various plant types and growth stages. These approaches have collectively advanced the scalability, robustness, and precision of trait analysis in plant phenotyping.
Despite their effectiveness, most deep learning models rely heavily on pixel-level labels, which are costly and time-consuming to obtain, especially under complex field conditions. Public datasets often fail to accommodate specific crop types and imaging scenarios, forcing researchers to construct custom datasets. To mitigate labeling costs, weakly supervised approaches using coarse annotations such as image-level labels [14,15,16], bounding boxes [17,18,19], and scribbles [20,21,22] have been proposed. Zhao et al. [23], Xia et al. [24], and Chaturvedi et al. [25] developed methods leveraging pseudo-labels, autoencoders, and GANs for maize segmentation. While these methods reduce annotation burdens, bounding box annotations may introduce background noise, and unsupervised methods often lack accuracy due to the absence of clearly defined target regions.
To address these issues, we propose a novel weakly supervised semantic segmentation method, termed ASLNet (Automatic Scribble Labeling Network), designed to achieve accurate segmentation of maize seedlings without human-annotated labels. ASLNet integrates a self-generating scribble label module and uses the resulting pseudo-labels to guide training. A dual-loss optimization strategy enhances learning across the backbone and auxiliary branches, while the dual-path feature interaction and resolution-aware contrastive regularization modules help extract structured mid-level features and improve generalization. Experimental results show that ASLNet outperforms both classical fully supervised and state-of-the-art weakly supervised methods. It reduces annotation costs by 80% while maintaining competitive segmentation accuracy through lightweight processing.

2. Materials and Methods

2.1. Data Acquisition

The maize variety selected for this study is “Xianyu 335” (Tieling Pioneer Seed Research Co., Ltd., Tieling, China), known for its vigorous growth during the seedling stage. This hybrid exhibits superior leaf morphology and plant architecture compared to other maize cultivars, ensuring robust health and physiological stability throughout the growth cycle. These characteristics make it an ideal plant model for phenotypic data acquisition in field experiments. At the seedling stage, individual maize plants typically reach a height of 20–30 cm and develop 3 to 4 fully expanded leaves. Aerial data collection was conducted using an unmanned aerial vehicle (UAV) platform. The UAV surveys were conducted over a one-week period, with the main data acquisition occurring on 7 September 2023 to guarantee phenological consistency. A DJI Phantom 4-RTK drone (SZ DJI Technology Co., Ltd., Shenzhen, China) was employed to capture high-resolution top-view images using an overhead imaging protocol, as illustrated in Figure 1. To ensure representative sampling and enhance the diversity of the dataset, the flight altitude was set between 2 and 5 m, with a constant cruising speed of 50 km/h, enabling full coverage of the maize seedling population within the experimental plots. To guarantee image quality and minimize the effects of environmental variability, aerial surveys were conducted daily between 9:00 a.m. and 11:00 a.m. under windless, clear weather conditions, with ambient temperatures ranging from 15 °C to 25 °C. The captured RGB images have a native resolution of 2720 × 1350 pixels, providing sufficient spatial detail for downstream image analysis and phenotypic feature extraction.

2.2. Datasets Construction

The original images captured by the drones were manually screened, and 1500 top-view images of maize seedlings were finally selected. Each image contains 7 to 9 columns of maize seedlings to ensure the diversity and representativeness of the data. For image preprocessing, the original dataset was augmented to avoid overfitting and improve the model’s generalization ability, as shown in Figure 2. The augmentation methods included random cropping, image flipping, Gaussian noise addition, and brightness adjustment; two methods were randomly selected and applied to each image. The dataset was ultimately expanded to 2000 images. The augmented images retained the diversity of the original dataset while providing more samples to improve the model’s generalization ability.
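The augmentation scheme described above can be sketched as follows; this is a minimal illustration assuming OpenCV and NumPy, and the crop scale, noise level, and brightness range are assumed values rather than the exact settings used in this study.

```python
import random
import numpy as np
import cv2

def random_crop(img, scale=0.8):
    """Crop a random window covering `scale` of each dimension, then resize back."""
    h, w = img.shape[:2]
    ch, cw = int(h * scale), int(w * scale)
    y, x = random.randint(0, h - ch), random.randint(0, w - cw)
    return cv2.resize(img[y:y + ch, x:x + cw], (w, h))

def random_flip(img):
    # flip code: 1 = horizontal, 0 = vertical, -1 = both
    return cv2.flip(img, random.choice([-1, 0, 1]))

def gaussian_noise(img, sigma=10):
    noise = np.random.normal(0, sigma, img.shape)
    return np.clip(img.astype(np.float32) + noise, 0, 255).astype(np.uint8)

def adjust_brightness(img, low=0.7, high=1.3):
    return np.clip(img.astype(np.float32) * random.uniform(low, high), 0, 255).astype(np.uint8)

AUGMENTATIONS = [random_crop, random_flip, gaussian_noise, adjust_brightness]

def augment(img):
    """Apply two randomly chosen augmentations per image, as described for the Maize-real dataset."""
    for op in random.sample(AUGMENTATIONS, 2):
        img = op(img)
    return img
```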
To ensure a reasonable data distribution and effective model training, the images were divided into training, validation, and test sets. The training and validation sets were split at a ratio of 4:1, and the remaining images were used as the test set. These data constituted the basic input of the model, ensuring authenticity and representativeness and providing reliable input for the subsequent training of the weakly supervised learning model.

2.3. Semantic Segmentation Model Based on Self-Generated Labels

2.3.1. Research Objectives and Model Framework

This study was oriented toward the semantic segmentation task of corn seedling images in the field environment, focusing on solving problems such as the ambiguity of leaf overlap caused by plant morphological diversity, complex background interference from drone aerial perspectives, and the high cost of fully supervised labeling. To overcome these limitations, we propose a weakly supervised deep learning framework based on self-generated scribble annotations, which aims to eliminate dependence on large-scale fine-grained annotated data, significantly improve the labeling efficiency on the premise of ensuring segmentation accuracy, and promote the development of agricultural phenotypic parameter acquisition toward low cost and high automation. The proposed framework enables semantic segmentation in complex farmland scenes with sparse supervision by constructing a multi-stage pseudo-label generation and optimization mechanism. As illustrated in Figure 3, the overall architecture is composed of three key modules:
Graffiti label initialization module: In the HSV color space, K-means clustering is employed to quantify the dominant color distribution of each image (a code sketch of this clustering step is given after this module list). To suppress fine-grained noise caused by minor branches and preserve the primary structural features of the maize plants, morphological skeleton extraction is combined with Harris corner detection. Subsequently, contour tracking is applied to achieve region coding, enabling the generation of a sparse weak annotation set by randomly selecting foreground pixels. To improve the model’s generalization under complex chromatic backgrounds, stratified sampling is integrated with K-fold cross-validation (n = 5) during this phase.
Self-generated pseudo-supervised optimization module: Following the initialization of graffiti labels, a pseudo-supervised optimization strategy based on cross-scale contrastive learning is introduced to refine the semantic reliability of the generated pseudo-labels. This module extracts latent feature representations from the intermediate layers of the decoder and utilizes the pseudo-labels to embed semantic supervision. Feature consistency-based contrastive learning is then performed to strengthen category boundary discrimination and representation, thus effectively optimizing the training of unlabeled regions within the pseudo-supervised paradigm.
Semantic consistency-driven segmentation module: To further enhance the model’s capacity to interpret complex field images under weak supervision, a semantic alignment constraint mechanism is proposed. This mechanism leverages the pseudo-labels to construct a semantic consistency mapping between labeled and unlabeled regions, guiding the network toward a better understanding of cross-regional semantic structures. It effectively mitigates semantic drift arising from sparse annotations and significantly improves the model’s regional awareness and boundary consistency. Refer to Figure 3 for detailed architecture.
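As referenced in the first module above, the HSV-space K-means step can be sketched as follows; the cluster count and the green-hue test used to pick vegetation clusters are illustrative assumptions, not the exact parameters of this study.

```python
import cv2
import numpy as np

def hsv_kmeans_foreground(img_bgr, k=4):
    """Cluster pixels in HSV space and return a rough vegetation (foreground) mask.

    The choice of k and the green-hue range are illustrative assumptions.
    """
    hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV)
    pixels = hsv.reshape(-1, 3).astype(np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
    _, labels, centers = cv2.kmeans(pixels, k, None, criteria, 5, cv2.KMEANS_PP_CENTERS)
    # Treat clusters whose centre hue falls in the green range (OpenCV hue 35-85)
    # and with sufficient saturation as candidate maize foreground.
    green = [i for i, c in enumerate(centers) if 35 <= c[0] <= 85 and c[1] > 40]
    mask = np.isin(labels.reshape(hsv.shape[:2]), green).astype(np.uint8) * 255
    return mask
```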

2.3.2. Semantic Segmentation Network

In terms of model architecture, an improved fully convolutional neural network structure was adopted as the basic framework. Considering the relatively limited scale of the experimental dataset—approximately 70% smaller than standard datasets such as COCO—and the presence of a single category, ResNet50 was selected as the core feature extractor to achieve an optimal trade-off between model complexity and representational capacity. The encoder–decoder architecture of U-Net [26] has demonstrated strong performance in semantic segmentation tasks. However, its default use of two stacked 3 × 3 convolutional layers for feature extraction may not be sufficient to capture comprehensive image features. To address this limitation, the standard convolutional layers were replaced with residual blocks to enhance feature extraction capabilities. The residual learning framework proposed by He et al. [27] effectively alleviates the gradient vanishing/explosion problem and enriches hierarchical feature representation as the network depth increases. Inspired by this, Zhang et al. [28] proposed a residual U-Net architecture incorporating a single residual connection before each downsampling operation for road extraction tasks. In comparison, the segmentation of seedling maize images requires higher accuracy and more precise delineation of boundary regions. To meet this demand, this study adopts the BasicBlock structure from [27], enabling multiple residual connections before each downsampling layer to facilitate the extraction of discriminative features across scales. By preserving skip connections, features at different scales can be integrated while maintaining low-scale feature information. As the depth of the network increases, the computational burden increases. In computer vision, the attention mechanism is regarded as a dynamic selection process that reduces computational complexity and maintains performance by adaptively weighting the input according to feature importance. The ECA module proposed by Wang et al. [29] overcomes the contradiction between performance and complexity by involving only a small number of parameters, which significantly improves the network performance.
Consequently, this study proposes an ECA–U-Net architecture, which integrates residual blocks in the encoder and incorporates the ECA module for channel-wise attention, thereby forming a redesigned, lightweight yet effective encoder. The overall network comprises an improved encoder, high-fidelity upsampling in the decoder via transposed convolution, and a skip connection mechanism for fusing multi-level features. This design enhances segmentation performance in terms of boundary accuracy and structural completeness. The detailed ECA–U-Net architecture is illustrated in Figure 4, where the numbers in the green boxes indicate n repeated combinations of BasicBlocks and ECA modules at different encoding stages. These are connected through residual links, as visualized in Figure 5. The convolution operations employ 1D convolutional kernels of varying sizes across layers, denoted as k1, k2, k3, and k4, respectively.
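A minimal sketch of one encoding stage of the ECA–U-Net encoder is given below, assuming torchvision's BasicBlock and a simplified ECA layer with a fixed kernel size (the adaptive kernel-size rule appears in Section 2.3.5); the channel progression and block counts are illustrative, not the exact configuration of Figure 4.

```python
import torch
import torch.nn as nn
from torchvision.models.resnet import BasicBlock

class SimpleECA(nn.Module):
    """Channel attention via global average pooling + 1-D convolution.
    Fixed kernel size here; the adaptive rule is covered in Section 2.3.5."""
    def __init__(self, k=3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):
        w = x.mean((2, 3))                        # global average pooling -> (B, C)
        w = self.conv(w.unsqueeze(1)).squeeze(1)  # local cross-channel interaction
        return x * torch.sigmoid(w)[:, :, None, None]

def encoder_stage(in_ch, out_ch, n_blocks, stride=2):
    """One encoding stage: n repeated BasicBlock + ECA combinations
    (the green boxes in Figure 4), connected through residual links."""
    down = nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
                         nn.BatchNorm2d(out_ch))
    layers = [BasicBlock(in_ch, out_ch, stride=stride, downsample=down), SimpleECA()]
    for _ in range(n_blocks - 1):
        layers += [BasicBlock(out_ch, out_ch), SimpleECA()]
    return nn.Sequential(*layers)

# Illustrative four-stage encoder with a ResNet-style channel progression.
stages = nn.ModuleList([
    encoder_stage(64, 64, 3, stride=1),
    encoder_stage(64, 128, 4),
    encoder_stage(128, 256, 6),
    encoder_stage(256, 512, 3),
])
```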

2.3.3. Automatic Scribble Labeling Module

To streamline the annotation process and reduce manual labeling costs, this study first utilized the LabelMe software tool (version 5.4.1) for annotating the constructed Maize–real dataset. When users become proficient in operating the LabelMe tool, the time required for generating scribble annotations is approximately one-third of that required for fully supervised pixel-wise labeling. While graffiti-style (scribble) labeling significantly reduces annotation time compared to full supervision, the cumulative time cost remains substantial as the dataset size increases, especially in large-scale phenotyping studies. To address this limitation, we propose an automatic label generation strategy that combines human annotation with a pseudo-label self-generation mechanism.
This hybrid approach enables the model to autonomously generate high-quality weak labels based on sparse scribble annotations, thereby further reducing manual labor and enhancing annotation efficiency. By combining scribble annotation with the pseudo-label self-generation mechanism, scribble labels (weak labels) can be generated, effectively reducing the time cost of manual annotation. The overall framework is shown in Figure 6: it is divided into three major stages: scribble label generation, pseudo-label guidance training, and double-loss joint optimization.
Scribble Label Generation
The proposed method utilizes a fully pixel-wise labeled mask as input and simulates manually drawn foreground scribbles by refining the target region’s skeleton structure through morphological structural extraction techniques. To generate a supervisory signal for the background class, the inverse mask of the target region is similarly processed to extract its skeletal contour, thereby constructing sparse background annotations. Given the structural complexity and overlapping boundaries commonly present in real-world canopy images, such pseudo-labels may exhibit semantic ambiguity. To mitigate this, a corner detection mechanism is integrated to identify abrupt changes in the skeleton structure. Combined with local neighborhood analysis, this step filters out high-curvature points that are prone to introducing noise or inconsistencies during model learning. This refinement enhances the geometric coherence and semantic reliability of the resulting sparse labels, ensuring that the supervisory signals remain both representative and informative.
Furthermore, to control the sparsity level of the generated annotations, a sampling scale parameter is introduced during the annotation generation process. This parameter randomly selects a proportion of connected skeleton segments from the target region. Lower sampling ratios yield more sparse annotations by retaining only a small set of key structural pixels, thereby emulating the minimal supervision typical of manually drawn scribble labels. For instance, when the sampling ratio is set to 10%, the resulting sparse annotation comprises approximately 10% of the total skeleton pixels in the original mask. This effectively mirrors the annotation sparsity encountered in weakly supervised learning scenarios and significantly reduces labeling overhead.
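A minimal sketch of the scribble-generation procedure just described, assuming scikit-image and OpenCV; the Harris-corner threshold, dilation radius, and segment-level sampling are illustrative choices rather than the study's exact parameters.

```python
import cv2
import numpy as np
from skimage.morphology import skeletonize
from skimage.measure import label

def scribble_from_mask(mask, sample_ratio=0.1, corner_radius=2):
    """Generate sparse foreground/background scribbles from a full binary mask.

    1) skeletonise the mask, 2) remove high-curvature points found by Harris
    corner detection, 3) randomly keep `sample_ratio` of the connected skeleton
    segments; the inverted mask provides the background scribbles.
    """
    skel = skeletonize(mask > 0)
    bg_skel = skeletonize(mask == 0).astype(np.uint8)
    # Harris corners on the skeleton image mark abrupt direction changes.
    corners = cv2.cornerHarris(skel.astype(np.float32), blockSize=2, ksize=3, k=0.04)
    noisy = (corners > 0.01 * corners.max()).astype(np.uint8)
    noisy = cv2.dilate(noisy, np.ones((2 * corner_radius + 1,) * 2, np.uint8)) > 0
    skel = skel & ~noisy
    # Keep a random subset of connected skeleton segments.
    segments = label(skel, connectivity=2)
    ids = np.unique(segments)[1:]
    if len(ids) == 0:
        return np.zeros_like(mask, dtype=np.uint8), bg_skel
    keep = np.random.choice(ids, size=max(1, int(len(ids) * sample_ratio)), replace=False)
    scribble = np.isin(segments, keep).astype(np.uint8)
    return scribble, bg_skel
```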
Pseudo-Label Guided Training
Guided by the initial graffiti annotations, the model learns to localize the primary structural regions of maize seedlings. To further expand the supervisory signal beyond sparse labels, this study introduces a structure-consistent pseudo-label generation mechanism. Specifically, color clustering is first applied to the original image to segment it into multiple homogeneous regions, thereby simplifying the local texture space. The initial pseudo-labels are constructed by identifying overlapping regions between the graffiti annotations and the clustered segments. During iterative training, the model’s predictions are continuously compared with the clustered region assignments. Regions exhibiting low semantic confidence are progressively filtered out, while label propagation is achieved by leveraging the structural consistency and the model’s self-supervised learning capability. This results in a dynamic pseudo-label updating process that progressively refines label quality throughout training.
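A minimal sketch of this initial pseudo-label construction, assuming scikit-learn's KMeans; the cluster count and the ignore-label convention (255) are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def initial_pseudo_labels(img, fg_scribble, bg_scribble, n_clusters=8):
    """Propagate sparse scribbles to color-clustered regions.

    A cluster is marked foreground (1) if it overlaps foreground scribbles,
    background (0) if it overlaps background scribbles, and ignore (255)
    otherwise or when the two scribble types conflict.
    """
    h, w = img.shape[:2]
    clusters = KMeans(n_clusters=n_clusters, n_init=5).fit_predict(
        img.reshape(-1, 3).astype(np.float32)).reshape(h, w)
    pseudo = np.full((h, w), 255, dtype=np.uint8)   # 255 = ignore
    for c in range(n_clusters):
        region = clusters == c
        has_fg = (fg_scribble > 0)[region].any()
        has_bg = (bg_scribble > 0)[region].any()
        if has_fg and not has_bg:
            pseudo[region] = 1
        elif has_bg and not has_fg:
            pseudo[region] = 0
    return pseudo
```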
In terms of training data organization, a K-fold cross-validation strategy is employed to partition the dataset into training and validation subsets for each image type, thereby enhancing the model’s generalization ability under diverse conditions. To simulate weakly supervised environments of varying intensities, multiple levels of graffiti label sparsity were designed for each image, including sparsity ratios of 10%, 30%, 50%, and 100%. These variants are stored in separate directories, enabling staged access during different training phases.
This data organization strategy not only facilitates initial pre-training on sparsely annotated data (e.g., 10%) to establish fundamental perception capabilities with minimal supervision, but also supports the progressive incorporation of denser pseudo-labels in later stages, which incrementally strengthens the training signal. The integration of dynamic pseudo-label generation and phased data scheduling ensures structured and efficient convergence of the model under weak supervision, contributing to both robustness and accuracy in field-scale semantic segmentation.
Double-Loss Joint Optimization
To fully integrate the weakly labeled information with the extended pseudo-label information, this paper adopts a dual-supervision training mechanism.
1.
Initialization phase: sparse annotation supervision and prediction integration
In the initial phase of training, we rely only on a small number of graffiti annotations as supervision signals. Because these annotations cover only a limited area, only the pixels within the scribbled regions contribute to the loss during training, and the optimization objective is the standard cross-entropy loss, defined as follows:
$$L_s(x, s) = -\sum_{i \in \Omega_s} s_i \log \hat{y}_i$$
where $x$ denotes the input image, $s$ is the graffiti annotation, $\Omega_s$ is the set of graffiti-labeled pixels, and $\hat{y}_i$ is the model’s predicted probability at pixel $i$. This phase lasts until the end of the preset initial cycle to ensure that the model acquires basic semantic information about the foreground and background.
To alleviate the lack of supervision caused by the sparsity of scribble labels, a Smoothed Prediction Update (SPE) strategy is introduced during training, which generates a stable prediction mean via an exponential moving average of the model’s historical outputs, given by the following formula:
$$\bar{y}_t = \alpha \bar{y}_{t-1} + (1 - \alpha)\, \hat{y}_t$$
where $\alpha$ is the smoothing coefficient and $\bar{y}_t$ is the smoothed prediction mean at round $t$. To improve consistency, this strategy uses the non-augmented original image as the prediction input and updates the mean every $\gamma$ rounds, which reduces the computational burden and improves the quality of the pseudo-labels.
2.
Pseudo-label-driven reinforcement learning phase
At the end of the initialization phase, the model can generate stable pseudo-labels based on the cumulative predicted mean.
To ensure their reliability, only pixels with a prediction confidence higher than the threshold $\tau$ are selected, and their one-hot encodings are used as labels to supervise the unlabeled regions. The corresponding loss function is as follows:
$$L_u(x, \tilde{y}) = -\sum_{i \in \Omega_g} \mathbb{1}(p_i > \tau) \cdot \tilde{y}_i \log \hat{y}_i$$
Here, $\Omega_g = \{\, g \mid \max(\bar{y}_g, 1 - \bar{y}_g) > \tau,\ g \notin \Omega_s \,\}$ denotes the set of pixels covered by the pseudo-labels, $\tau$ is the confidence threshold, and $\mathbb{1}(\cdot)$ is the indicator function used to keep only high-confidence pixels. This strategy combines progressive learning (PL) with an entropy minimization mechanism to guide the model to optimize the unlabeled regions as the pseudo-label quality gradually improves. The final training target is a weighted combination of the two losses:
$$L_{t1}(x, s, \hat{y}) = L_s(x, s) + \lambda\, L_u(x, \tilde{y})$$
where $\lambda$ is an adjustable hyperparameter that controls the influence of the pseudo-label supervision at different stages. This dual-supervision fusion mechanism allows the model to make full use of the limited high-confidence annotations in the early stage of training and then gradually introduce self-generated pseudo-labels to construct training supervision covering a wider area, thus achieving sustained performance gains and robust convergence under weakly supervised conditions. A minimal code sketch of this dual-loss scheme is given below.
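The following PyTorch sketch illustrates the dual-loss scheme referenced above: partial cross-entropy on scribbled pixels ($L_s$), the SPE moving average, confidence-filtered pseudo-label supervision ($L_u$), and their weighted sum ($L_{t1}$). The label convention (255 = unlabeled), the threshold $\tau$, and the weighting $\lambda$ are assumed values, not the study's exact settings.

```python
import torch
import torch.nn.functional as F

def scribble_loss(logits, scribble, ignore_index=255):
    """L_s: cross-entropy evaluated only on scribbled pixels (others carry ignore_index)."""
    return F.cross_entropy(logits, scribble, ignore_index=ignore_index)

@torch.no_grad()
def update_smoothed_prediction(y_avg, logits, alpha=0.9):
    """SPE: exponential moving average of the model's softmax outputs."""
    prob = logits.softmax(dim=1)
    return prob if y_avg is None else alpha * y_avg + (1 - alpha) * prob

def pseudo_label_loss(logits, y_avg, scribble, tau=0.95, ignore_index=255):
    """L_u: cross-entropy against high-confidence pseudo-labels outside the scribbles."""
    conf, pseudo = y_avg.max(dim=1)            # per-pixel confidence and hard pseudo-label
    target = pseudo.clone()
    target[(conf <= tau) | (scribble != ignore_index)] = ignore_index
    return F.cross_entropy(logits, target, ignore_index=ignore_index)

def total_loss_t1(logits, scribble, y_avg, lam=0.5, tau=0.95):
    """L_t1 = L_s + lambda * L_u (lambda is an assumed value here)."""
    return scribble_loss(logits, scribble) + lam * pseudo_label_loss(logits, y_avg, scribble, tau)
```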

2.3.4. Cross-Scale Contrast Regularization Module

In the aforementioned dual-supervision training mechanism, the model relies on the joint supervision of graffiti labels and self-generated pseudo-labels to progressively learn the semantics of the target region. Features at a single scale cover a localized region of fixed size, whereas features at multiple scales represent different regions of the original image. A single pixel-level loss function may not be sufficient to constrain the model’s high-dimensional feature space, which limits its generalization ability, especially for complex structures and boundaries. To improve the model’s discriminative ability and structural consistency, Cross-scale Contrast Regularization (CCR) is introduced in this paper as a regularization of the feature space, jointly optimized with the main supervised loss, as shown in Figure 7.
In traditional convolutional neural networks (such as U-Net), the feature hierarchy is gradually refined as the depth of the network increases. Our approach is based on the assumption of semantic consistency in the latent space, and it encourages the model to learn feature representations that are discriminative and well clustered at different scales. To introduce cross-scale information, features are extracted at the intermediate layers of the model (especially the decoder and bottleneck layers) and rescaled to different resolutions. In this way, the model not only learns fine local features at high resolution but also captures broader global semantic information at low resolution. Specifically, the intermediate feature representations output by the network at each coding layer are mapped to a latent contrast space. To this end, a lightweight projector module, composed of a series of 1 × 1 convolutional layers, normalization layers, and ReLU activation functions, is introduced at the intermediate layers of the U-Net encoder–decoder and is used to project multi-dimensional features to a unified dimension. Subsequently, we select high-confidence foreground and background pixels from the model-generated pseudo-label set and map them to the latent space to construct anchor points and positive/negative sample pairs. To improve the stability of the pseudo-labels, we apply SPE during training and perform consistency filtering of pseudo-labels at different scales, keeping only pixels with consistent predictions across multiple scales for contrastive supervision.
Let $f_\theta(x)$ denote a series of transformations $f_{n-1} \circ \cdots \circ f_1 \circ f_0(x)$, in which $f_\theta$ consists of multiple stages. We define a latent feature vector $V_\gamma = f_\gamma \circ \cdots \circ f_0(x)$, where $\gamma$ is a specific transformation layer satisfying $0 < \gamma \le n$. This vector is then processed through a projection module, typically consisting of a convolutional layer, a batch normalization layer, and an activation function, that maps the latent features from the $\mathbb{R}^{C \times HW/\delta^2}$ space to the $\mathbb{R}^{C' \times HW/\delta^2}$ space.
Next, we define a pseudo-supervised cross-scale contrast regularized loss function involving the parameters γ and δ with the formula:
$$L_c(\hat{y}, v; f_\phi, \gamma, \delta, \tau) = \frac{\delta^4}{W^2 H^2} \sum_i \sum_j \mathrm{BCE}\!\left( \hat{y}_\delta^i \vee \hat{y}_\delta^j,\ \sigma\!\left( \frac{z_\gamma^i}{\lVert z_\gamma^i \rVert_2} \cdot \frac{z_\gamma^j}{\lVert z_\gamma^j \rVert_2} \Big/ \tau \right) \right)$$
where $\sigma(\cdot)$ is the sigmoid function and $\vee$ denotes the element-wise logical OR operation. $\hat{y}_\delta^i$ denotes the downsampled version of the pseudo-label vector, the scale factors are $\delta \in \{2^i \mid i \in \mathbb{N}_0\}$, and $v_1$ and $v_2$ are the consistency thresholds for contrastive regularization at different resolutions.
Specifically, $\hat{y}_\delta^i$ is defined as follows:
$$\hat{y}_\delta^i = \begin{cases} 0, & \text{if } \frac{1}{\delta^2} \sum \hat{y}^{\,i:i+\delta^2-1} \le v_1 \ \text{or}\ i \in \Omega^- \\ 1, & \text{if } \frac{1}{\delta^2} \sum \hat{y}^{\,i:i+\delta^2-1} \ge v_2 \ \text{or}\ i \in \Omega^+ \\ \text{Ignore}, & \text{otherwise} \end{cases}$$
where $\Omega^-$ and $\Omega^+$ denote the annotated background and foreground regions, respectively.
The downsampling factor $\delta$ determines the spatial resolution at which the contrast regularization operates, while the consistency thresholds $v_1$ and $v_2$ ensure that only task-relevant feature regions are considered in the contrastive learning process.
In this method, we use the sigmoid function instead of the Softmax function to handle multiple positive samples that may occur in each sample. This enables the model to perform regularization more efficiently in the segmentation task because the number of positive samples in each sample is usually large. By introducing cross-scale contrast regularization, we are able to apply supervised contrast regularization at multiple feature resolutions. This allows the model to utilize features from the middle layer of the encoder–decoder architecture, thus enhancing its feature representation capabilities.
Finally, we combine cross-scale contrast regularization with the main loss function of the supervised learning task (denoted as L t 1 ) to obtain the following total loss function:
$$L_{total}(x, s, \hat{y}, v; f_\phi, \gamma, \delta, \tau) = L_{t1}(x, s, \hat{y}) + \lambda_1 L_c(\hat{y}, v; f_{\phi_1}, \gamma_1, \delta_1, \tau_1) + \lambda_2 L_c(\hat{y}, v; f_{\phi_2}, \gamma_2, \delta_2, \tau_2)$$
where λ 1 and λ 2 are weight parameters that control the effect of the contrast regularization terms at different scales and γ 1 and γ 2 denote different layers in the model. Experiments show that when the values of δ are 1 and 4, a better balance can be achieved between the computational cost and the model’s representational ability. In this way, the model can not only be effectively trained by using sparse graffiti labels and high-confidence pseudo-labels, but also improve its generalization ability in the feature space through cross-scale contrast regularization, enabling the model to better adapt to complex image structures and boundary information.
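The following is a heavily simplified PyTorch sketch of the cross-scale contrast regularization: intermediate features are projected by a 1 × 1 convolutional head, pseudo-labels are pooled to the feature resolution and thresholded by $v_1$/$v_2$, and pairwise feature similarity (cosine similarity with temperature $\tau$, passed through a sigmoid) is matched to pairwise label agreement via binary cross-entropy. Pooling to the feature resolution in place of the explicit scale factor $\delta$, and label agreement in place of the element-wise OR formulation, are simplifying assumptions of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Projector(nn.Module):
    """1x1 conv + BN + ReLU projection head for intermediate features."""
    def __init__(self, in_ch, out_ch=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 1))

    def forward(self, x):
        return self.net(x)

def ccr_loss(feat, pseudo, projector, tau=0.1, v1=0.2, v2=0.8, n_samples=256):
    """Cross-scale contrast regularization on one feature map (sketch)."""
    z = F.normalize(projector(feat), dim=1)                        # (B, D, h, w)
    # Pool soft pseudo-labels (foreground probability) to the feature resolution.
    y = F.adaptive_avg_pool2d(pseudo.unsqueeze(1).float(), z.shape[-2:]).squeeze(1)
    y_ds = torch.full_like(y, -1.0)                                # -1 means "ignore"
    y_ds[y <= v1] = 0.0
    y_ds[y >= v2] = 1.0
    b, d = z.shape[0], z.shape[1]
    z = z.permute(0, 2, 3, 1).reshape(b, -1, d)
    y_ds = y_ds.reshape(b, -1)
    loss, count = 0.0, 0
    for zb, yb in zip(z, y_ds):
        keep = yb >= 0
        if keep.sum() < 2:
            continue
        idx = torch.randperm(int(keep.sum()), device=zb.device)[:n_samples]
        zs, ys = zb[keep][idx], yb[keep][idx]
        sim = torch.sigmoid(zs @ zs.t() / tau)                     # pairwise cosine / tau
        agree = (ys[:, None] == ys[None, :]).float()               # label-agreement target
        loss = loss + F.binary_cross_entropy(sim, agree)
        count += 1
    return loss / max(count, 1)
```

In practice this loss would be evaluated on two intermediate layers ($\gamma_1$, $\gamma_2$) and combined with $L_{t1}$ using the weights $\lambda_1$ and $\lambda_2$ as in the total loss above.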

2.3.5. Efficient Channel Attention Module

To address challenges such as background interference, plant morphological diversity, and irregular leaf distribution in the farmland environment, we designed a multi-scale segmentation framework integrating dynamic convolution operators. The method achieves pixel-level localization of the seedling maize canopy with the help of a global feature perception module, which especially improves the recognition of irregular structures. In feature extraction, convolutional neural networks enhance the perception of key features by integrating spatial and channel information. Compared to the traditional Squeeze-and-Excitation (SE) module, the ECA (Efficient Channel Attention) module avoids the information loss caused by dimensionality reduction by replacing the fully connected layers with a one-dimensional convolution. While improving prediction accuracy and computational efficiency, ECA also significantly reduces the number of model parameters, making the model lighter and more efficient. The structure of the Efficient Channel Attention module is shown in Figure 8.
In the ECA module, the input feature map is first passed through a Global Average Pooling (GAP) layer, and a one-dimensional convolution is then used to perform local cross-channel interaction. The one-dimensional convolution with a kernel size of k generates the channel attention predictions, where k determines how many neighboring channels participate in the attention prediction of a given channel, i.e., the coverage of local cross-channel interaction. The ECA module thus captures local cross-channel interaction by considering each channel together with its k neighbors. Experimental results show that this method is both efficient and effective. Accordingly, the local channel weights can be calculated as follows:
$$\omega_i = \sigma\!\left( \sum_{j=1}^{k} a^j\, y_i^j \right), \quad y_i = \mathrm{G}(d_i),\ d_i \in P,\ y_i^j \in \Omega_i^j$$
where $P$ is the set of feature-vector channels to be weighted, $d_i$ is the $i$-th feature-vector channel, $\mathrm{G}(\cdot)$ denotes the global average pooling operation, $y_i$ is the aggregated feature after pooling, $\Omega_i^j$ is the set of $k$ adjacent feature channels, $a^j$ is the original value of the $j$-th channel, $y_i^j$ is the output value of the $j$-th channel adjacent to the $i$-th feature-vector channel, and $\sigma$ is the sigmoid activation function.
The above operation can be realized by one-dimensional convolution with a kernel size of k. The channel attention weights ω can be calculated according to the following equation:
$$\omega = \sigma\!\left( \mathrm{C1D}_k(y) \right)$$
where $\mathrm{C1D}_k$ denotes a one-dimensional convolution with kernel size $k$ and $y$ is the aggregated feature.
As mentioned earlier, the ECA module adaptively computes the convolutional kernel size $k$. In this way, the ECA module can dynamically adjust the kernel size according to the input, improving the network’s ability to capture input features. The kernel size $k$ of the one-dimensional convolution is proportional to the channel dimension $C$; that is, there exists the following mapping relationship between $k$ and $C$:
$$C = \phi(k)$$
Obviously, the simplest mapping is a linear function, $\phi(k) = \gamma \times k - b$, but linear mapping has certain limitations. Since the channel dimension $C$ (the number of convolution kernels) is usually set to a power of 2, the linear function $\phi(k) = \gamma \times k - b$ can be extended to a nonlinear form:
$$C = \phi(k) = 2^{(\gamma \times k - b)}$$
According to the given channel dimension, the size of k (the one-dimensional convolution kernel) can be adaptively determined, and its formula is as follows:
$$k = \psi(C) = \left| \frac{\log_2(C)}{\gamma} + \frac{b}{\gamma} \right|_{odd}$$
where $|\cdot|_{odd}$ denotes the nearest odd number. In all experiments, we set $\gamma$ and $b$ to fixed values of 2 and 1, respectively.
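A minimal PyTorch sketch of the ECA layer with the adaptive kernel size defined above ($\gamma$ = 2, b = 1) is given below; it follows the publicly described ECA design and is an illustration rather than the exact implementation used in this study.

```python
import math
import torch
import torch.nn as nn

class ECALayer(nn.Module):
    """Efficient Channel Attention: GAP -> 1-D conv over channels -> sigmoid gating.
    The kernel size k is derived from the channel dimension C via
    k = |log2(C)/gamma + b/gamma|_odd with gamma = 2, b = 1."""
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        k = int(abs(math.log2(channels) / gamma + b / gamma))
        k = k if k % 2 else k + 1                      # nearest odd number
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):
        # x: (B, C, H, W) -> channel descriptor via global average pooling
        y = x.mean(dim=(2, 3))                         # (B, C)
        y = self.conv(y.unsqueeze(1)).squeeze(1)       # local cross-channel interaction
        return x * torch.sigmoid(y)[:, :, None, None]  # re-weight the input channels

# Example: attention over a 256-channel feature map (k resolves to 5 here).
attn = ECALayer(256)
out = attn(torch.randn(2, 256, 32, 32))
```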

2.4. Environment Setup

2.4.1. Experimental Environment and Parameters

The model was implemented in PyTorch 1.10.0 on the Ubuntu 20.04 operating system, using an Intel Core i9-10980XE CPU (3.00 GHz) and an NVIDIA A6000 GPU. Optimization was performed with the Adam optimizer using a batch size of 16 and a momentum of 0.9, together with a cosine annealing learning rate schedule that starts at a maximum value of 3 × 10−4 and gradually decreases to 1 × 10−5. To improve the generalization ability of the model, transfer learning was used to initialize the ResNet50 backbone with weights pre-trained on the PASCAL VOC 2012 dataset (Everingham et al., 2012). The total training schedule was set to 1000 epochs; the backbone was frozen for the first 100 epochs so that only the two segmentation branches were fine-tuned, after which the backbone was unfrozen and trained jointly with the segmentation branches. The model weights with the lowest validation loss were selected as the final model for evaluation on the test set, as shown in Table 1.
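The training schedule described above can be sketched as follows; `model` (with a `backbone` attribute), the data loaders, and the `train_one_epoch`/`evaluate` helpers are assumed to exist and are named here for illustration only.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR

# Hyper-parameters as reported in Section 2.4.1 / Table 1.
EPOCHS, FREEZE_EPOCHS, BATCH_SIZE = 1000, 100, 16

optimizer = Adam(model.parameters(), lr=3e-4, betas=(0.9, 0.999))
scheduler = CosineAnnealingLR(optimizer, T_max=EPOCHS, eta_min=1e-5)

for p in model.backbone.parameters():          # freeze the pre-trained ResNet50 backbone
    p.requires_grad = False

best_val = float("inf")
for epoch in range(EPOCHS):
    if epoch == FREEZE_EPOCHS:                 # unfreeze after the warm-up phase
        for p in model.backbone.parameters():
            p.requires_grad = True
    train_one_epoch(model, train_loader, optimizer)   # assumed helper
    scheduler.step()
    val_loss = evaluate(model, val_loader)            # assumed helper
    if val_loss < best_val:                           # keep the lowest-validation-loss weights
        best_val = val_loss
        torch.save(model.state_dict(), "aslnet_best.pth")
```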

2.4.2. Metrics of Evaluation

To evaluate the performance of the model, the mean intersection over union (mIoU) and the F1-Score were used as quantitative metrics of segmentation quality on the seedling-stage maize images.
The mIoU evaluates the segmentation accuracy over the target regions of the model and is defined in Equation (13):
$$mIoU = \frac{1}{k+1} \sum_{i=0}^{k} \frac{P_{ii}}{\sum_{j=0}^{k} P_{ij} + \sum_{j=0}^{k} P_{ji} - P_{ii}}$$
where $k + 1$ is the total number of categories, $i$ denotes the ground-truth category, $j$ denotes the predicted category, and $P_{ij}$ is the number of pixels belonging to class $i$ that are predicted as class $j$.
The F1-Score combines precision and recall as their harmonic mean. It is calculated as follows:
$$Precision = \frac{TP}{TP + FP}$$
$$Recall = \frac{TP}{TP + FN}$$
$$F1\text{-}Score = \frac{2 \times Precision \times Recall}{Precision + Recall}$$
where $TP$ (true positives) is the number of foreground pixels correctly predicted as foreground, $FP$ (false positives) is the number of background pixels incorrectly predicted as foreground, $FN$ (false negatives) is the number of foreground pixels incorrectly predicted as background, and $TN$ (true negatives) is the number of background pixels correctly predicted as background.
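A small sketch of how these metrics can be computed from a confusion matrix for the binary maize/background case follows; the class indexing (1 = maize) is an assumption for illustration.

```python
import numpy as np

def confusion_matrix(pred, gt, num_classes=2):
    """Accumulate a (num_classes x num_classes) confusion matrix P, where
    P[i, j] counts pixels of true class i predicted as class j."""
    mask = (gt >= 0) & (gt < num_classes)
    return np.bincount(num_classes * gt[mask] + pred[mask],
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)

def miou_and_f1(cm):
    """mIoU and foreground F1-Score from the confusion matrix (Equations (13)-(16))."""
    tp = np.diag(cm).astype(float)
    iou = tp / (cm.sum(1) + cm.sum(0) - tp + 1e-10)
    miou = iou.mean()
    # Binary foreground/background case: class 1 is maize.
    TP, FP, FN = cm[1, 1], cm[0, 1], cm[1, 0]
    precision, recall = TP / (TP + FP + 1e-10), TP / (TP + FN + 1e-10)
    f1 = 2 * precision * recall / (precision + recall + 1e-10)
    return miou, f1

# Usage: cm = sum(confusion_matrix(p, g) for p, g in zip(preds, gts)); miou, f1 = miou_and_f1(cm)
```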

3. Results

3.1. Ablation Experiment

To evaluate the contribution of key modules in ASLNet, ablation experiments were conducted on the Maize–real dataset using U-Net with ResNet-50 [27] as the encoder. As shown in Table 2, the baseline ResNet-50 (Experiment a) achieved an mIoU of 59.58% and an F1-Score of 76.36%, while replacing it with ResNet-101 (Experiment b) resulted in reduced performance (mIoU 46.72%). Adding the Cross-scale Contrast Regularization (CCR) module (Experiment c) improved the mIoU to 63.14%, indicating enhanced context understanding. Incorporating the Efficient Channel Attention (ECA) module (Experiment d) further increased the mIoU by 12.72% and the F1-Score by 7.11%, demonstrating its effectiveness in feature enhancement. Although similar trends were observed with ResNet-101 (Experiments e, f), its overall performance remained inferior. The best performance was achieved by combining ResNet-50 with both CCR and ECA modules (Experiment g), yielding an mIoU of 74.86% and an F1-Score of 85.76%, validating the effectiveness of the ASLNet architecture.

3.2. Self-Generated Label Quality Assessment

In scribble-based supervised learning, annotation quality is often influenced by annotator subjectivity and experience. Given the lack of a unified standard for evaluating scribble quality, it is challenging to assess its impact objectively. To address this, we designed a set of controlled experiments by varying the sampling ratio of scribble annotations, thereby simulating different quality levels. This allows us to evaluate the robustness of ASLNet under varying annotation sparsity. The results are shown in Table 3.
Based on a comparative experiment with the graffiti samples generated using the Scribble4All method [30], this study automatically extracts skeleton pixels from full segmentation masks on the Maize–real dataset. Subsequently, 10%, 30%, 50%, and 100% of the skeleton points are randomly sampled to construct scribble annotations with varying levels of sparsity. These labels are then used to train the proposed ASLNet model. The sampling strategy and resulting annotations are illustrated in Figure 9. Specifically, compared to the annotations generated by the Scribble4All method, ASLNet achieved a 3.15% improvement in mIoU. Notably, even when trained using sparse scribble labels composed of only 10% of the skeleton pixels from the original segmentation masks, ASLNet still reached an mIoU of 63.54%, demonstrating strong adaptability to low-quality annotations. These results indicate that ASLNet maintains high stability and robustness across varying levels of scribble density. The integrated self-generated label module effectively complements sparse annotations by generating high-quality pseudo-labels suitable for weakly supervised training. This enables high-precision automatic annotation of field images with minimal human intervention. Overall, ASLNet not only enhances the practicality of weakly supervised semantic segmentation but also offers a viable solution for reducing annotation costs, highlighting its promising potential for large-scale agricultural applications.

3.3. Comparison Among Different Segmentation Models

To demonstrate the robustness and superiority of ASLNet, both quantitative experiments and qualitative evaluations were conducted, and its performance was compared with state-of-the-art (SOTA) baseline methods. To ensure the objectivity and fairness of the comparison, all methods were trained and evaluated under identical hardware and software environments.

3.3.1. Quantitative Comparison

The quantitative comparison results presented in Table 4 clearly demonstrate that ASLNet, using automatically generated graffiti labels, surpasses other scribble-supervision methods across multiple evaluation metrics. To comprehensively assess the performance of ASLNet in relation to existing weakly supervised segmentation approaches, five representative methods were selected for comparison in this study: ScribbleSup [31], Scribble2Label [32], ScRoadExtractor [33], ScribbleCont [34], and ScribCompNet [35], a structural component-based weak supervision method. Through comparative analysis, the superior segmentation performance of ASLNet in weakly supervised settings is further substantiated.
Specifically, ASLNet was pre-trained with self-generated pseudo-labels and achieved notable gains in both key metrics, reaching an mIoU of 74.86% and an F1-Score of 85.76%. Compared with the ScribbleSup method based on the Xception backbone network, ASLNet improved the mIoU by 7.06 percentage points; compared with the ScribbleCont and Scribble2Label methods using the ResNet50 backbone network, ASLNet achieved improvements of 7.36% to 15.28% in mIoU and 4.18% to 9.4% in F1-Score; and compared with the ScRoadExtractor and ScribCompNet methods, ASLNet improved the mIoU by 5.58% and 4.41%, respectively. These results demonstrate that, through self-generated scribble labels, the ASLNet architecture can effectively improve segmentation accuracy for group-level maize seedling-stage images, particularly in field environments.
To further analyze the accuracy performance of the ASLNet model in the task of image semantic segmentation, this paper selects four classic fully supervised semantic segmentation models as the comparison benchmarks. Specifically, it includes U-Net, DeepLabv3+ [36], HRNet [37], and SegFormer [38] based on the Transformer architecture. All of the above fully supervised models are trained with the standard cross-entropy loss function, and their feature extraction is based on the backbone network. When compared with these models, ASLNet shows strong competitiveness in several evaluation metrics, and some metrics are even better than those of fully supervised methods.
The specific comparison results are shown in Table 4: Compared with the U-Net models based on ResNet34 and ResNet50, ASLNet improves the mIoU accuracy by 2.47 percentage points and 0.54 percentage points, respectively. Compared with the DeepLabv3+ model, ASLNet improves the mIoU accuracy by 1.11 percentage points. In the SegFormer model with different backbone networks, ASLNet slightly outperforms SegFormer-B0, SegFormer-B3, SegFormer-B4, SegFormer-B2, and is comparable to SegFormer-B1. It even outperforms the novel network HRNet in some cases.

3.3.2. Qualitative Comparison

Figure 10 shows eight segmentation examples to further validate the effectiveness of ASLNet in weakly supervised semantic segmentation tasks. The experiments compare ASLNet with four representative fully supervised methods (U-Net, DeepLabv3+, SegFormer, and HRNet) and five graffiti-supervised models (Scribble2Label, ScribbleCont, ScribbleSup, ScRoadExtractor, and ScribCompNet). The experimental results show that ASLNet performs excellently in segmentation accuracy. In particular, in terms of agreement with the ground-truth labels, ASLNet is significantly better than HRNet, Scribble2Label, and ScribbleCont.
ASLNet demonstrates strong segmentation ability and robustness in a variety of complex agricultural scenarios. In the case of blurred boundaries caused by plant occlusion (Figure 10b), the model can effectively distinguish the overlapping areas and achieve accurate boundary recognition. In images with weed interference (Figure 10f), ASLNet can still accurately identify maize plants, demonstrating good generalization ability. In scenes where illumination changes cause unstable features (Figure 10c,d), its segmentation at weed edges is better than that of the fully supervised model HRNet. However, ASLNet still has limitations in some specific situations. For example, segmentation quality decreases for smaller targets (e.g., maize plants whose connected regions contain fewer than 100 pixels in Figure 10h), which may stem from the limited fine-feature extraction ability of the ResNet50 backbone. Meanwhile, there are also breaks in the prediction of long-range structures (e.g., Figure 10e), which may be related to the limitation of ResNet50 in modeling global contextual relationships.
Despite some limitations in specific tasks, ASLNet still shows good overall performance in most agricultural image segmentation scenarios, effectively verifying its ability to recognize and segment seedling-stage maize populations in large field environments. The method makes full use of the advantages of graffiti annotation in annotation efficiency and structural information expression, and plays an active role in alleviating binary classification ambiguity and improving the model’s learning ability, showing clear advantages especially when dealing with regions with irregular structures or small target sizes. Although slightly inferior to U-Net in some evaluation indicators, ASLNet is generally superior to other scribble-supervision methods in the segmentation of maize populations and approaches or even reaches the performance level of the fully supervised method HRNet in several respects. In summary, ASLNet not only shows good generalization and robustness under weak supervision, but also has high practical value in real agricultural applications, providing an efficient and cost-controllable solution for crop image segmentation in complex environments.

4. Discussion

ASLNet surpasses both fully supervised models (e.g., HRNet [37]) and representative scribble-supervised approaches [31,39] primarily due to two key innovations. First, the dual-path supervision-aware feature interaction module facilitates effective integration of visual features and sparse scribble annotations, enabling the extraction of more discriminative representations under limited supervision. This aligns with prior studies showing that improved feature quality and reliable pseudo-labels can significantly close the gap to fully supervised performance [39]. Second, the resolution-aware contrastive learning strategy effectively tackles domain-specific challenges such as occlusion and boundary ambiguity by promoting class separation in high-resolution feature space—offering a notable advantage over HRNet’s resolution-preserving architecture [37]—and consistent with findings from Schnell et al. [40] regarding the efficacy of contrastive learning in sparse annotation settings. Furthermore, contrastive regularization enhances the model’s robustness to variations in illumination and background textures, mitigating the overfitting frequently observed in scribble-supervised frameworks [31], similar to the stability gains reported in multi-scale remote sensing segmentation tasks [41].
Despite its strong overall performance, ASLNet exhibits certain limitations. The performance degradation in small object segmentation likely stems from the inherent sparsity of scribble annotations [40], which provide insufficient spatial constraints for fine-grained structures, as well as the limited capacity of the ResNet50 backbone in capturing subtle details [41]. In addition, the observed fragmentation in long-range structural predictions can be attributed to the restricted receptive field of ResNet50, which hinders the modeling of global contextual dependencies. This issue is particularly pronounced in agricultural scenes, where crop arrangements and field patterns often exhibit long-range spatial correlations. Prior studies have shown that Vision Transformers, through their self-attention mechanism, are better suited for capturing such global dependencies [42], suggesting a potential direction for future model refinement.
Future research can extend ASLNet in three promising directions. First, incorporating self-supervised or active learning strategies [39,40] could further reduce supervision requirements, alleviating reliance on labor-intensive manual annotations, especially for small objects and structurally complex regions. Second, integrating cross-domain adaptation techniques and enhanced multi-scale feature aggregation [41] may improve robustness across diverse agricultural environments, mitigating performance degradation due to variations in soil background, crop species, or climatic conditions. Finally, deploying ASLNet on real-time edge computing platforms such as UAV-based field monitoring systems could enable efficient and scalable crop health assessment in practical agricultural production. These advancements would further enhance ASLNet’s applicability, enabling it to deliver high segmentation accuracy and robustness under weak supervision in large-scale, real-world agricultural scenarios.

5. Conclusions

We propose ASLNet, a weakly supervised semantic segmentation model designed to address the challenges of complex background interference and high annotation costs in seedling-stage maize imagery. By leveraging automatically generated scribble annotations and a dual-loss optimization strategy, ASLNet eliminates the need for manual labeling. Under sparse supervision—using only 10% of skeleton pixels—it achieves a mean Intersection over Union (mIoU) of 63.54%, demonstrating strong robustness and segmentation accuracy.
In terms of architecture, ASLNet incorporates a dual-path feature interaction module along with an ECA-based adaptive context aggregation mechanism, which enhances the representation of fine-grained morphological features. Compared to the baseline Scribble2Label, ASLNet achieves a 15.28% improvement in mIoU, and it significantly outperforms other scribble-based models such as ScribbleSup and ScribCompNet. In challenging scenarios involving high leaf occlusion (leaf overlap ratios exceeding 50%), ASLNet maintains stable segmentation performance with an F1-Score of 85.76%.
Furthermore, ASLNet performs competitively with fully supervised models. It reaches an mIoU of 74.86% and an F1-Score of 85.76%, surpassing DeepLabv3+ by 1.11% in mIoU and demonstrating consistent advantages over several SegFormer variants (B0–B4) and HRNet. When using ResNet-50 as the backbone, ASLNet reduces model complexity by 32% compared to standard fully supervised counterparts, without compromising segmentation accuracy.
In summary, ASLNet enables efficient and high-precision semantic segmentation of maize seedlings without requiring human-annotated labels. It provides a cost-effective solution for large-scale field phenotyping and offers substantial potential for advancing intelligent agricultural applications.

Author Contributions

Methodology, Z.L.; Investigation, Z.L.; Resources, X.L.; Writing—original draft preparation, Z.L.; Writing—review and editing, H.D.; funding acquisition, H.D., Y.Z. and T.M.; All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Sub-project of the National Key R&D Plan (Grant No. 2022YFD2002303-01 and No. 2024YFD1501205-01) and the Liaoning Province Innovation Capability Enhancement Joint Fund Project (Grant No. JYTMS20231303).

Data Availability Statement

Data are contained within the article.

Acknowledgments

The authors thank the editor and anonymous reviewers for providing helpful suggestions for improving the quality of this manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Ninomiya, S. High-throughput field crop phenotyping: Current status and challenges. Breed. Sci. 2022, 72, 3–18. [Google Scholar] [CrossRef]
  2. Jia, F.; Tao, Z.; Wang, F. Wooden pallet image segmentation based on Otsu and marker watershed. Proc. J. Phys. Conf. Ser. 2021, 1976, 012005. [Google Scholar] [CrossRef]
  3. Zheng, X.; Lei, Q.; Yao, R.; Gong, Y.; Yin, Q. Image segmentation based on adaptive K-means algorithm. J. Image Video Proc. 2018, 68. [Google Scholar] [CrossRef]
  4. Tongbram, S.; Shimray, B.A.; Singh, L.S. Segmentation of image based on k-means and modified subtractive clustering. Indones. J. Electr. Eng. Comput. Sci. 2021, 22, 1396–1403. [Google Scholar] [CrossRef]
  5. Zhang, Z.; Zhu, Q.; Xie, Y. A novel image matting approach based on naive bayes classifier. In Intelligent Computing Technology, Proceedings of the 8th International Conference, ICIC 2012, Huangshan, China, 25–29 July 2012; Springer: Berlin/Heidelberg, Germany, 2012; pp. 433–441. [Google Scholar] [CrossRef]
  6. Yu, H.; Song, J.; Chen, C.; Heidari, A.A.; Liu, J.; Chen, H.; Mafarja, M. Image segmentation of leaf spot diseases on maize using multi-stage Cauchy-enabled grey wolf algorithm. Eng. Appl. Artif. Intell. 2022, 109, 104653. [Google Scholar] [CrossRef]
  7. Hariharan, B.; Arbeláez, P.; Girshick, R.; Malik, J. Simultaneous detection and segmentation. In Computer Vision–ECCV 2014, Proceedings of the 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Springer International Publishing: Berlin/Heidelberg, Germany, 2014; Part VII, pp. 297–312. [Google Scholar] [CrossRef]
  8. Zhang, J.; Gong, J.; Zhang, Y.; Mostafa, K.; Yuan, G. Weed identification in maize fields based on improved Swin-Unet. Agronomy 2023, 13, 1846. [Google Scholar] [CrossRef]
  9. Hong, S.; Jinbo, Q.; Song, L. Recognition of the maize canopy at the jointing stage based on deep learning. Trans. Chin. Soc. Agric. Eng. (Trans. CSAE) 2021, 37, 53–61. [Google Scholar] [CrossRef]
  10. Zenkl, R.; Timofte, R.; Kirchgessner, N.; Roth, L.; Hund, A.; Van Gool, L.; Aasen, H. Outdoor plant segmentation with deep learning for high-throughput field phenotyping on a diverse wheat dataset. Front. Plant Sci. 2022, 12, 774068. [Google Scholar] [CrossRef]
  11. Turgut, K.; Dutagaci, H.; Rousseau, D. RoseSegNet: An attention-based deep learning architecture for organ segmentation of plants. Biosyst. Eng. 2022, 221, 138–153. [Google Scholar] [CrossRef]
  12. Fan, X.; Zhou, R.; Tjahjadi, T.; Das Choudhury, S.; Ye, Q. A segmentation-guided deep learning framework for leaf counting. Front. Plant Sci. 2022, 13, 844522. [Google Scholar] [CrossRef]
  13. Narisetti, N.; Henke, M.; Neumann, K.; Stolzenburg, F.; Altmann, T.; Gladilin, E. Deep learning based greenhouse image segmentation and shoot phenotyping (deepshoot). Front. Plant Sci. 2022, 13, 906410. [Google Scholar] [CrossRef] [PubMed]
  14. Ahn, J.; Kwak, S. Learning pixel-level semantic affinity with image-level supervision for weakly supervised semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4981–4990. [Google Scholar] [CrossRef]
  15. Ahn, J.; Cho, S.; Kwak, S. Weakly supervised learning of instance segmentation with inter-pixel relations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2209–2218. [Google Scholar] [CrossRef]
  16. Lee, J.; Kim, E.; Lee, S.; Lee, J.; Yoon, S. Ficklenet: Weakly and semi-supervised semantic image segmentation using stochastic inference. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5267–5276. [Google Scholar] [CrossRef]
  17. Dai, J.; He, K.; Sun, J. Boxsup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision 2015, Santiago, Chile, 7–13 December 2015; pp. 1635–1643. [Google Scholar] [CrossRef]
  18. Song, C.; Huang, Y.; Ouyang, W.; Wang, L. Box-driven class-wise region masking and filling rate guided loss for weakly supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3136–3145. [Google Scholar] [CrossRef]
  19. Lan, S.; Yu, Z.; Choy, C.; Radhakrishnan, S.; Liu, G.; Zhu, Y.; Anandkumar, A. Discobox: Weakly supervised instance segmentation and semantic correspondence from box supervision. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 3406–3416. [Google Scholar] [CrossRef]
  20. Zhang, K.; Zhuang, X. Cyclemix: A holistic strategy for medical image segmentation from scribble supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11656–11665. [Google Scholar] [CrossRef]
  21. Pan, Z.; Sun, H.; Jiang, P.; Li, G.; Tu, C.; Ling, H. CC4S: Encouraging certainty and consistency in scribble-supervised semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 8918–8935. [Google Scholar] [CrossRef] [PubMed]
  22. Li, Z.; Zheng, Y.; Shan, D.; Yang, S.; Li, Q.; Wang, B.; Shen, D. ScribFormer: Transformer makes CNN work better for scribble-based medical image segmentation. IEEE Trans. Med. Imaging 2024, 43, 2254–2265. [Google Scholar] [CrossRef] [PubMed]
  23. Zhao, L.; Zhao, Y.; Liu, T.; Deng, H. A weakly supervised semantic segmentation model of maize seedlings and weed images based on scrawl labels. Sensors 2023, 23, 9846. [Google Scholar] [CrossRef]
  24. Xia, X.; Kulis, B. W-Net: A deep model for fully unsupervised image segmentation. arXiv 2017, arXiv:1711.08506. [Google Scholar] [CrossRef]
  25. Chaturvedi, K.; Braytee, A.; Li, J.; Prasad, M. SS-CPGAN: Self-supervised cut-and-pasting generative adversarial network for object segmentation. Sensors 2023, 23, 3649. [Google Scholar] [CrossRef]
  26. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Munich, Germany, 5–9 October 2015; Part III, Volume 9351, pp. 234–241. [Google Scholar] [CrossRef]
  27. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  28. Zhang, Z.; Liu, Q.; Wang, Y. Road extraction by deep residual U-Net. IEEE Geosci. Remote Sens. Lett. 2018, 15, 749–753. [Google Scholar] [CrossRef]
  29. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 11534–11542. [Google Scholar] [CrossRef]
  30. Boettcher, W.; Hoyer, L.; Unal, O.; Lenssen, J.E.; Schiele, B. Scribbles for All: Benchmarking Scribble Supervised Segmentation Across Datasets. arXiv 2024, arXiv:2408.12489. [Google Scholar] [CrossRef]
  31. Lin, D.; Dai, J.; Jia, J.; He, K.; Sun, J. ScribbleSup: Scribble-supervised convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 3159–3167. [Google Scholar] [CrossRef]
  32. Lee, H.; Jeong, W.K. Scribble2Label: Scribble-supervised cell segmentation via self-generating pseudo-labels with consistency. In Proceedings of the 23rd International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Lima, Peru, 4–8 October 2020; Part I, Volume 12261, pp. 14–23. [Google Scholar] [CrossRef]
  33. Wei, Y.; Ji, S. Scribble-based weakly supervised deep learning for road surface extraction from remote sensing images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–12. [Google Scholar] [CrossRef]
  34. Oh, H.J.; Lee, K.; Jeong, W.K. Scribble-supervised cell segmentation using multiscale contrastive regularization. In Proceedings of the 2022 IEEE 19th International Symposium on Biomedical Imaging (ISBI), Kolkata, India, 28–31 March 2022; pp. 1–5. [Google Scholar] [CrossRef]
  35. Zhang, C.; Li, K.; Yin, Z.; Qin, R. Weakly-supervised structural component segmentation via scribble annotations. Comput.-Aided Civ. Inf. 2025, 40, 561–578. [Google Scholar] [CrossRef]
  36. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar] [CrossRef]
  37. Sun, K.; Xiao, B.; Liu, D.; Wang, J.; Wang, X. Deep High-Resolution Representation Learning for Human Pose Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 5693–5703. [Google Scholar] [CrossRef]
  38. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar] [CrossRef]
  39. Chen, Z.; Sun, Q. Weakly-supervised Semantic Segmentation with Image-level Labels: From Traditional Models to Foundation Models. ACM Comput. Surv. 2025, 57, 111. [Google Scholar] [CrossRef]
  40. Schnell, J.; Wang, J.; Qi, L.; Hu, V.T.; Tang, M. ScribbleGen: Generative Data Augmentation Improves Scribble-Supervised Semantic Segmentation. arXiv 2023, arXiv:2311.17121. [Google Scholar] [CrossRef]
  41. Wang, Y.; Yang, L.; Liu, X.; Yan, P. An Improved Semantic Segmentation Algorithm for High-Resolution Remote Sensing Images Based on DeepLabv3+. Sci. Rep. 2024, 14, 9716. [Google Scholar] [CrossRef]
  42. Zhang, L.; Lu, J.; Zheng, S.; Zhao, X.; Zhu, X.; Fu, Y.; Wang, L.; Huang, G.; Torr, P.H.S. Vision Transformers: From Semantic Segmentation to Dense Prediction. Int. J. Comput. Vis. 2024, 132, 6142–6162. [Google Scholar] [CrossRef]
Figure 1. Schematic diagram of data collection. (Arrows indicate direction only; the images were acquired by the UAV.)
Figure 2. Data augmentation example diagram.
Figure 3. Overall model framework.
Figure 4. Architecture of the proposed ECA–U-Net.
Figure 5. (a) BasicBlock, (b) Encoder of ECA–U-Net.
Figure 6. Automatic Scribble Labeling Framework.
Figure 7. Cross-scale contrastive regularization (CCR) module.
Figure 8. Structure of ECA module.
Figure 9. Scribble labels of different proportions automatically generated by ASLNet.
Figure 10. Qualitative comparison of our method with SOTA fully supervised and scribble-supervised methods. "F" and "S" denote fully supervised and scribble-supervised methods, respectively. Scenarios (a,c,d) show unstable plant appearance caused by changes in illumination; scenario (b) shows boundary blurring caused by occluded plants; scenario (e) is a special case of long-range structure prediction; scenario (f) illustrates the difficulty of recognizing plants under weed interference; scenarios (g,h) address the special case in which the target's connected region is smaller than 100.
Table 1. Parameters and initial values of model training.
Parameter Name           Parameter Value
Image size (pixels)      256 × 256
Batch size               8
Learning rate            1 × 10⁻⁵
Max training epochs      1000
Weight decay             0.00005
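The hyperparameters in Table 1 translate directly into an ordinary training configuration. The snippet below is a minimal, hypothetical PyTorch sketch built only from the Table 1 values; the choice of Adam as the optimizer and the model, train_loader, and loss_fn objects are assumptions for illustration, not details reported by the paper.

    import torch
    from torch.optim import Adam

    # Values from Table 1; everything else below is an assumed placeholder.
    IMAGE_SIZE = 256        # input images of 256 x 256 pixels
    BATCH_SIZE = 8
    LEARNING_RATE = 1e-5
    MAX_EPOCHS = 1000
    WEIGHT_DECAY = 5e-5     # i.e., 0.00005

    def train(model, train_loader, loss_fn, device="cuda"):
        """Minimal training loop using the Table 1 settings (optimizer choice is assumed)."""
        optimizer = Adam(model.parameters(), lr=LEARNING_RATE, weight_decay=WEIGHT_DECAY)
        model.to(device).train()
        for epoch in range(MAX_EPOCHS):
            for images, labels in train_loader:   # batches of BATCH_SIZE images
                images, labels = images.to(device), labels.to(device)
                optimizer.zero_grad()
                loss = loss_fn(model(images), labels)
                loss.backward()
                optimizer.step()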
Table 2. Key component ablation experiments for different networks.
Index    ResNet50    ResNet101    CCR    ECA    mIoU (%)    F1-Score (%)
(a)                                             59.58       76.36
(b)                                             46.72       67.51
(c)                                             63.14       77.34
(d)                                             72.30       83.47
(e)                                             40.34       60.67
(f)                                             46.72       67.51
(g)                                             74.86       85.76
Table 3. Comparison of the Scribble4All method with ASLNet at different scribble scales.
Method                  mIoU (%)    F1-Score (%)
Scribble4All            71.71       80.69
ASLNet (scale 1)        74.86       85.76
ASLNet (scale 0.5)      68.36       77.59
ASLNet (scale 0.3)      66.80       76.03
ASLNet (scale 0.1)      63.54       68.81
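The scale values in Table 3 (1, 0.5, 0.3, and 0.1) correspond to the proportion of automatically generated scribble pixels retained for training. The paper's exact thinning procedure is not reproduced here; the function below is only an illustrative NumPy sketch of one plausible way to subsample a scribble mask to a target proportion.

    import numpy as np

    def thin_scribble(scribble_mask: np.ndarray, keep_ratio: float, seed: int = 0) -> np.ndarray:
        """Randomly retain `keep_ratio` of the scribble pixels (hypothetical thinning scheme).

        scribble_mask: boolean H x W array, True where a scribble pixel is present.
        keep_ratio: fraction of scribble pixels to keep, e.g., 1.0, 0.5, 0.3, or 0.1 as in Table 3.
        """
        rng = np.random.default_rng(seed)
        ys, xs = np.nonzero(scribble_mask)
        if len(ys) == 0:
            return np.zeros_like(scribble_mask)
        n_keep = max(1, int(round(keep_ratio * len(ys))))
        idx = rng.choice(len(ys), size=n_keep, replace=False)
        thinned = np.zeros_like(scribble_mask)
        thinned[ys[idx], xs[idx]] = True
        return thinned

Random per-pixel subsampling is only one option; a real pipeline might instead shorten or drop whole scribble strokes to reach the target density.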
Table 4. Quantitative comparison results with SOTA methods. "F" and "S" denote fully supervised and scribble-supervised methods, respectively.
Type    Method             Backbone        mIoU (%)    F1-Score (%)
F       U-Net              ResNet34        72.39       83.90
F       U-Net              ResNet50        74.32       85.21
F       DeepLabv3+         MobileNetV2     73.75       82.79
F       SegFormer          SegFormer-B0    71.35       80.57
F       SegFormer          SegFormer-B1    74.89       83.72
F       SegFormer          SegFormer-B2    74.81       83.62
F       SegFormer          SegFormer-B3    72.56       81.93
F       SegFormer          SegFormer-B4    72.71       81.77
F       HRNet              HRNetV2-W18     68.26       77.55
S       ScribbleSup        Xception        67.80       81.74
S       Scribble2Label     ResNet50        59.58       76.36
S       ScribbleCont       ResNet50        67.50       81.58
S       ScRoadExtractor    ResNet34        69.32       97
S       ScribCompNet       HRNetV2-W18     70.45       84.95
S       ASLNet             ResNet50        74.86       85.76
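For context, the mIoU and F1-Score columns in Tables 2–4 reduce, for a two-class maize/background task, to simple confusion-matrix arithmetic. The function below is an illustrative NumPy implementation of these metrics under that binary assumption; it is a sketch for readers, not the authors' evaluation code.

    import numpy as np

    def binary_miou_f1(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-9):
        """Mean IoU over {background, maize} and foreground F1-score for boolean masks."""
        tp = np.logical_and(pred, gt).sum()
        fp = np.logical_and(pred, ~gt).sum()
        fn = np.logical_and(~pred, gt).sum()
        tn = np.logical_and(~pred, ~gt).sum()

        iou_fg = tp / (tp + fp + fn + eps)   # maize (foreground) class
        iou_bg = tn / (tn + fp + fn + eps)   # background class
        miou = (iou_fg + iou_bg) / 2.0

        precision = tp / (tp + fp + eps)
        recall = tp / (tp + fn + eps)
        f1 = 2 * precision * recall / (precision + recall + eps)
        return float(miou), float(f1)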
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
