1. Introduction
Maize is one of the world’s three major cereal crops, widely cultivated across the globe and providing 19.5% of the global caloric intake from all sources [1,2]. It plays a diverse and dynamic role in the global agricultural food system and food/nutrition security, serving not only as an important human food crop but also as a crucial livestock feed, industrial, and energy crop [3]. The tassel, located at the top of the plant, is a key phenotypic trait indicating maize growth and reproductive stages. The characteristics of maize tassels provide important insights for improving agricultural practices to enhance yield [4,5,6]. Therefore, effective monitoring of tassel growth during the tasseling stage is important not only for maize yield estimation and variety selection, but also for guiding agricultural personnel in implementing appropriate field management.
Currently, the monitoring of maize tassel growth in the field mainly relies on manual observation, which is time-consuming, labor-intensive, and inefficient. Against the backdrop of continuously rising labor costs and increasing difficulties in labor recruitment, large-scale monitoring of maize tassels through manual methods has become unsustainable. In recent years, the rapid development of low-altitude UAV remote sensing technology and deep learning has provided new solutions for maize tassel monitoring. UAVs can capture high-resolution images at low altitudes and offer advantages including portability, high mobility, and low cost. Meanwhile, deep learning methods have significantly enhanced the accuracy and processing efficiency of image recognition. The collaborative integration of these two technologies has been widely applied across various domains of smart agriculture, including farmland crop classification [7], crop growth status monitoring [8], and estimation of crop leaf area index [9].
Existing studies on maize tassels using deep learning primarily focus on counting and object detection. Zheng et al. [10] proposed a multiscale lite attention enhancement network (MLAENet), which uses point-level annotations for maize tassel counting. Qi et al. [11] developed the MT-YOLO model, which integrates the ECANet attention mechanism into YOLOv5s to enhance both detection accuracy and speed. In our previous work [12], we introduced the CA-YOLO model, achieving an average precision of 96% and effectively detecting early-stage, leaf-obscured, mutually obscured, and complex-background tassels.
However, current mainstream counting and object detection methods provide only coarse-grained recognition at the point or bounding-box level, failing to fully characterize the growth status and key characteristics of maize tassels. This limitation restricts their utility in accurately assessing developmental stages and guiding precision field management. In contrast, semantic segmentation methods generate pixel-level annotations of target objects, enabling more detailed and actionable insights for agricultural decision-making, and have been increasingly applied to diverse and data-rich agricultural problems [13]. In weed identification, Genze et al. [14] proposed a model named DeBlurWeedSeg, which achieved accurate segmentation of weeds in sorghum fields. In leaf disease identification, Wang et al. [15] proposed a network named MFBP-UNet for identifying multiple diseases on pear leaves, and Megersa et al. [16] achieved detection of common rust disease in maize based on ResNet50.
Currently, the application of deep learning-based semantic segmentation methods in maize tassel analysis remains relatively limited. Lu et al. [17] proposed a region-based color modeling approach for maize tassel segmentation, which integrates region proposal generation and ensemble neural networks to achieve an average precision (AP) of 74.3%. Yu et al. [18] constructed feature extraction networks using VGG16 and MobileNet, respectively, to explore the potential of the U-Net model for segmenting maize tassels from UAV images.
While these studies have made progress, limitations remain. First, the scenes in maize field images captured by UAVs are extremely complex, especially in high-density planting environments. Interference factors such as leaf veins and female ears, which are morphologically similar to tassels, together with varying degrees of light reflection on leaves under different weather conditions, pose significant challenges to accurate tassel segmentation. Second, tassels at different developmental stages exhibit significant variations in morphology and size. Existing segmentation models struggle to distinguish tassel features from morphologically similar interference factors and are not robust to tassel targets at different scales, leading to insufficient robustness in complex scenes and across tasseling stages. To address these limitations, at the dataset level, this study constructed a maize tassel segmentation dataset covering different developmental stages of tassels and various weather conditions, based on three consecutive years of field planting and image collection. At the network level, we propose DECC-Net, a novel model for maize tassel segmentation that accurately identifies maize tassels in complex scenes and outputs the corresponding segmentation masks. The main contributions of this paper can be summarized in three aspects:
Dynamic Kernel Feature Extraction (DKE): This module is based on dynamic convolution, effectively capturing multi-scale features of maize tassels with varying morphologies and sizes. It enhances the model’s capacity to extract discriminative features from complex images.
Lightweight Channel Cross Transformer (LCCT) and Adaptive Feature Channel Enhancement (AFE): These two modules leverage cross-attention mechanisms to capture channel-wise dependencies between multi-scale features. They guide the efficient fusion of multi-scale features and suppress interference from irrelevant information.
Validation on Diverse Maize Tassel Dataset and Robustness Analysis: We validate the efficacy of DECC-Net on the constructed diverse maize tassel dataset and conduct an in-depth analysis of the model’s robustness across different scenarios.
2. Materials and Methods
2.1. Description of Experimental Sites
The maize tassel images used in this study were collected from the Xiangyang Base of Northeast Agricultural University (126°55′39″ E, 45°45′48″ N) and the Acheng Base of Northeast Agricultural University (127°2′58″ E, 45°31′18″ N) in Harbin, Heilongjiang Province, China. The planting area used in the Xiangyang Base was approximately 8000 m², and that in the Acheng Base was approximately 2680 m². The geographical information of the experimental sites is shown in Figure 1. The experiment was conducted continuously over three years from 2022 to 2024, with the detailed experimental scheme, which applies to both bases, presented in Table 1.
2.2. UAV-Based Remote Sensing Image Acquisition
The time period of UAV-based remote sensing image acquisition covers the entire tasseling process of maize. The image acquisition equipment was a DJI Phantom 4 RTK (DJI-Innovations, Inc., Shenzhen, Guangdong, China) equipped with a 20-megapixel RGB sensor, producing images with a resolution of 5472 × 3648 pixels. During image acquisition, the UAV operated at a flight altitude of 10 m with the camera oriented vertically downward, flying at a constant speed along an S-shaped route.
The tasseling stage of maize coincides with midsummer, characterized by complex and variable weather conditions that can cause varying degrees of reflection on maize leaves. To ensure the dataset meets the robustness requirements of the model, the collected maize field images include both sunny and cloudy weather conditions. The morphology of maize tassels also changes significantly throughout the tasseling process. We collected tassel images covering the entire tasseling process and divided it into three stages (early, middle, and late) based on tassel morphology. Specifically, the early tasseling stage refers to when most of a maize tassel remains enclosed in leaves; the middle stage is when most of the tassel has emerged from the leaves but has not yet started pollen shedding; and the late stage is characterized by full tassel extension and initiation of pollen shedding, which corresponds to the traditionally defined VT (vegetative tasseling stage) [19]. Ultimately, a total of over 8000 maize tassel images covering different weather conditions and developmental stages were collected, with examples of tassels shown in Figure 2.
2.3. Dataset Processing
From the collected images, 360 were selected to construct a maize tassel semantic segmentation dataset. The maize tassels in the images were meticulously annotated using the data annotation software Labelme to generate corresponding mask images. As the high resolution of the original images made them unsuitable for direct input into the model for training, the maize tassel images and their mask images were split into smaller sub-images. This resulted in 2880 sub-images with a resolution of 512 × 512. To ensure effective validation of the model’s robustness and generalization ability and reduce the risk of overfitting, images collected from the Xiangyang Base were used to construct the training and validation sets, while images from the Acheng Base were employed for the test set. The ratio of the training set, validation set, and test set is 6:2:2. In the constructed dataset, the ratio of tassel images at different developmental stages is 1:1:1, and the ratio of images under cloudy and sunny weather conditions is 1:1, ensuring the balance of samples in each category.
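The paper’s exact cropping scheme (which yields eight 512 × 512 sub-images per annotated image) is not specified; the sketch below illustrates one generic way to split a high-resolution image or mask into non-overlapping 512 × 512 tiles, with file naming and border handling as illustrative assumptions.

```python
from pathlib import Path
from PIL import Image

TILE = 512  # sub-image resolution used in this study

def split_into_tiles(image_path: Path, out_dir: Path, tile: int = TILE) -> int:
    """Split one image (or mask) into non-overlapping tile x tile sub-images.

    Border regions that do not fill a whole tile are discarded; the paper's
    exact cropping strategy is not specified, so this grid layout is an
    illustrative assumption.
    """
    img = Image.open(image_path)
    w, h = img.size
    out_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for top in range(0, h - tile + 1, tile):
        for left in range(0, w - tile + 1, tile):
            crop = img.crop((left, top, left + tile, top + tile))
            crop.save(out_dir / f"{image_path.stem}_{top}_{left}.png")
            count += 1
    return count
```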
To further enrich the training data and reduce the risk of overfitting, data augmentation methods including random rotation, flipping, random scaling, and random color jittering were performed on the training set. Through data augmentation, the number of training set images was expanded from 1728 to 3456. After preprocessing and data augmentation, the final constructed dataset is presented in Table 2.
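The listed augmentation operations can be reproduced with standard torchvision functional transforms. In the sketch below, geometric transforms are applied jointly to the image and mask (so the annotation stays aligned) while color jittering touches the image only; all parameter ranges are assumptions not reported in the paper.

```python
import random
import torchvision.transforms.functional as TF
from torchvision.transforms import InterpolationMode

def augment_pair(image, mask):
    """Jointly augment an image/mask pair. Geometric transforms are shared;
    color jittering is applied to the image only. Ranges are assumptions."""
    # Random rotation
    angle = random.uniform(-30.0, 30.0)
    image, mask = TF.rotate(image, angle), TF.rotate(mask, angle)
    # Random horizontal / vertical flips
    if random.random() < 0.5:
        image, mask = TF.hflip(image), TF.hflip(mask)
    if random.random() < 0.5:
        image, mask = TF.vflip(image), TF.vflip(mask)
    # Random scaling, then center crop/pad back to 512 x 512
    scale = random.uniform(0.8, 1.2)
    size = int(512 * scale)
    image = TF.resize(image, [size, size])
    mask = TF.resize(mask, [size, size],
                     interpolation=InterpolationMode.NEAREST)
    image = TF.center_crop(image, [512, 512])
    mask = TF.center_crop(mask, [512, 512])
    # Random color jittering (image only)
    image = TF.adjust_brightness(image, random.uniform(0.8, 1.2))
    image = TF.adjust_contrast(image, random.uniform(0.8, 1.2))
    return image, mask
```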
2.4. DECC-Net for Tassel Segmentation
2.4.1. Overall Structure of DECC-Net
For semantic segmentation of maize tassels in complex field environments, we propose DECC-Net (as shown in Figure 3). DECC-Net is mainly composed of Encoder Blocks, Decoder Blocks, and a Feature Fusion Block. The novelty of this model lies in two aspects. First, our proposed Dynamic Kernel Feature Extraction (DKE) modules are integrated into the Encoder and Decoder Blocks to enhance the capture of multi-scale maize tassel features in complex scenarios. Second, the proposed Lightweight Channel Cross Transformer (LCCT) and Adaptive Feature Channel Enhancement (AFE) modules are integrated into the Feature Fusion Block; their combination jointly guides the adaptive fusion of multi-scale features, alleviates semantic discrepancies between features at different levels, and suppresses interference from irrelevant information such as leaf veins and leaf reflections.
Specifically, each Encoder Block consists of a DKE module and a downsampling component. Herein, the DKE module is used to extract multi-scale features from the input image and adaptively emphasize meaningful semantic features; the downsampling component performs downsampling on the feature map via a convolutional layer with a kernel size of 2 × 2 and a stride of 2, while doubling the number of channels. The Feature Fusion Block is composed of the LCCT module and AFE module, which deeply fuse features from Encoder Blocks at different levels, highlight key information therein, and suppress interfering information. The Bottleneck Block comprises a single DKE module, which maintains the scale of the feature map unchanged and transmits it to the lowest-level Decoder Block. Within the Decoder Block, the upsampling component utilizes transposed convolution to gradually restore the size of the feature map and halve the number of channels; subsequently, the feature map from the Feature Fusion Block is concatenated with the newly upsampled feature map, enabling the fused feature map to contain rich low-level and high-level semantic information. Following the concatenation operation, the DKE module is used to mitigate gradient vanishing and further capture effective information. Finally, DECC-Net outputs the final segmentation mask through a 1 × 1 convolutional layer.
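For readers who wish to reproduce the block structure, a minimal PyTorch sketch of the Encoder and Decoder Blocks is given below. It assumes the DKE sketch provided with Section 2.4.2 and treats all layer details not stated in the text as illustrative.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Encoder Block sketch: a DKE module followed by a 2x2, stride-2
    convolution that downsamples and doubles the channels. DKE refers to
    the sketch given with Section 2.4.2 below."""
    def __init__(self, channels):
        super().__init__()
        self.dke = DKE(channels)
        self.down = nn.Conv2d(channels, channels * 2, kernel_size=2, stride=2)

    def forward(self, x):
        skip = self.dke(x)            # sent to the Feature Fusion Block
        return skip, self.down(skip)

class DecoderBlock(nn.Module):
    """Decoder Block sketch: transposed-conv upsampling that halves the
    channels, concatenation with the fused skip feature, then a DKE module."""
    def __init__(self, channels):
        super().__init__()
        self.up = nn.ConvTranspose2d(channels, channels // 2,
                                     kernel_size=2, stride=2)
        self.dke = DKE(channels)      # concatenation restores 'channels'

    def forward(self, x, fused_skip):
        x = self.up(x)
        return self.dke(torch.cat([fused_skip, x], dim=1))
```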
2.4.2. Dynamic Kernel Feature Extraction
Feature extractors serve as critical components in semantic segmentation models, directly influencing the quality of segmentation outcomes. In maize field scenarios, maize tassels at different developmental stages exhibit significant morphological and size variations. Existing studies predominantly employ CNNs such as MobileNet [20] and VGG [21] as feature extractors for segmentation models. However, constrained by fixed kernel sizes and limited receptive fields, these networks struggle to effectively model long-range pixel dependencies, resulting in suboptimal performance in adaptively capturing multi-scale features from tassels.
In contrast to these methods, dynamic convolution kernels [22] enhance model representational capacity by aggregating multiple kernels through attention mechanisms without increasing network depth or width. Inspired by multi-scale feature fusion principles and dynamic convolution kernels, this study proposes the Dynamic Kernel Feature Extraction (DKE) module, as illustrated in Figure 4. This module employs a dynamic selection mechanism that, guided by global contextual information, adaptively emphasizes the most critical features among the multi-scale spatial features extracted by different kernel sizes.
The operations of the DKE module are as follows: for an input feature map, we first employ three depthwise separable convolution layers, DW3 × 3, DW5 × 5, and DW5 × 5, and cascade them. The shallow DW3 × 3 kernel extracts features with rich local details, and cascading these kernels enables the deep DW layer to have a receptive field equivalent to an 11 × 11 convolution kernel, facilitating the capture of long-range global features. Moreover, compared to standard convolutions, using depthwise separable convolutions effectively reduces computational costs and the number of parameters.
Subsequently, the multi-scale features $F_1$, $F_2$, and $F_3$ produced by the three cascaded layers are concatenated. Average pooling (AvgPool) and maximum pooling (MaxPool) are then applied along the channel dimension of the concatenated features to effectively model the global spatial relationship of these local features:

$$F_{\mathrm{avg}} = \mathrm{AvgPool}\big(\mathrm{Concat}(F_1, F_2, F_3)\big), \quad F_{\mathrm{max}} = \mathrm{MaxPool}\big(\mathrm{Concat}(F_1, F_2, F_3)\big)$$

Following that, a series of convolutional layers processes $F_{\mathrm{avg}}$ and $F_{\mathrm{max}}$, allowing full interaction of information across different spatial dimensions. The sigmoid activation function $\sigma$ is then applied to generate three sets of dynamic selection values $\alpha_1$, $\alpha_2$, and $\alpha_3$:

$$[\alpha_1, \alpha_2, \alpha_3] = \sigma\big(\mathrm{Conv}(\mathrm{Concat}(F_{\mathrm{avg}}, F_{\mathrm{max}}))\big)$$

These three sets of dynamic selection values adaptively select features from the three feature maps at different scales to generate weighted feature maps, which are then summed to achieve feature fusion. A residual connection integrates the original feature map $F_{\mathrm{in}}$ with the weighted feature maps, producing the integrated feature map $F'$:

$$F' = F_{\mathrm{in}} + \sum_{i=1}^{3} \alpha_i \odot F_i$$

Finally, batch normalization (BN) and the ReLU activation function are applied to the integrated feature map to generate the output feature $F_{\mathrm{out}} = \mathrm{ReLU}(\mathrm{BN}(F'))$, where BN helps reduce the model’s reliance on specific features, thereby lowering the risk of overfitting.
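A minimal PyTorch sketch of the DKE computation described above follows; the 7 × 7 kernel of the selection convolution and the single-conv realization of the “series of convolutional layers” are assumptions.

```python
import torch
import torch.nn as nn

class DWConv(nn.Module):
    """Depthwise separable convolution: depthwise followed by pointwise."""
    def __init__(self, channels, kernel_size):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class DKE(nn.Module):
    """Sketch of Dynamic Kernel Feature Extraction (Section 2.4.2)."""
    def __init__(self, channels):
        super().__init__()
        self.dw3 = DWConv(channels, 3)    # receptive field 3x3
        self.dw5a = DWConv(channels, 5)   # cascaded: receptive field 7x7
        self.dw5b = DWConv(channels, 5)   # cascaded: receptive field 11x11
        self.select = nn.Conv2d(2, 3, kernel_size=7, padding=3)  # assumption
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        f1 = self.dw3(x)
        f2 = self.dw5a(f1)
        f3 = self.dw5b(f2)
        cat = torch.cat([f1, f2, f3], dim=1)
        # AvgPool / MaxPool along the channel dimension -> two H x W maps
        avg = cat.mean(dim=1, keepdim=True)
        mx = cat.amax(dim=1, keepdim=True)
        # Sigmoid yields the three dynamic selection maps a1, a2, a3
        a = torch.sigmoid(self.select(torch.cat([avg, mx], dim=1)))
        fused = a[:, 0:1] * f1 + a[:, 1:2] * f2 + a[:, 2:3] * f3
        # Residual connection, then BN + ReLU
        return torch.relu(self.bn(x + fused))
```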
2.4.3. Lightweight Channel Cross Transformer
In U-shaped network architectures, features from the encoder are considered low-level features, while those from the decoder are treated as more abstract high-level features. Semantic gaps exist between these two sets of features, and failure to bridge them during the encoding–decoding process can adversely affect segmentation results. To restore fine-grained features of target objects and improve segmentation performance, numerous improvement schemes have been proposed, including the convolution-based UNet++ [23] and MultiResUnet [24], and the Vision Transformer [25]-based UCTransNet [26]. While these models have demonstrated encouraging performance in certain domains (e.g., medical image segmentation), their applicability to maize tassel segmentation tasks still requires further validation. Convolution-based approaches struggle to model long-range semantic dependencies effectively, limiting their improvement efficacy. Transformer-based solutions are more conducive to capturing global contextual and long-distance dependencies, but their complex architectures cause a surge in computational complexity and parameter count, imposing strict demands on hardware resources.
Inspired by the aforementioned methods and based on the requirements of the maize tassel segmentation task, we designed the LCCT module, whose structure is depicted in Figure 5. The LCCT module is a lightweight channel cross-attention module designed to replace traditional skip connections, mitigating semantic gaps between different features and improving feature fusion.
The LCCT module operates as follows: features originating from the encoder at different scales, denoted as $E_i$ ($i$ = 1, 2, 3, 4), are initially fed into the patch embedding sub-module. Here, an average pooling operation compresses these features, unifying the height (H) and width (W) dimensions of the input feature maps across different scales while preserving their original channel dimensions. Subsequently, a 1 × 1 depthwise separable convolution generates the token sequence:

$$T_i = \mathrm{DWConv}_{1\times1}\big(\mathrm{AvgPool}(E_i)\big), \quad i = 1, 2, 3, 4$$

where $T_i$ represents the tokens generated from the encoder features $E_i$ after tokenization. In this manner, the channel dimensions of the tokenized multi-scale features remain consistent with their original counterparts, while the number of patches in the different $T_i$ is uniform. This uniformity is a prerequisite for enabling cross attention among tokens generated from features at various stages of the encoder.

Following the patch embedding sub-module, the four tokens $T_i$ ($i$ = 1, 2, 3, 4) are input into the cross-attention sub-module as queries. Concurrently, these tokens undergo a concatenation operation to form $T_\Sigma$:

$$T_\Sigma = \mathrm{Concat}(T_1, T_2, T_3, T_4)$$

where $T_\Sigma$ serves as the source of the keys and values for the subsequent cross-attention process. Following this, we employ a 1 × 1 depthwise separable convolution, instead of the traditional linear projection, to generate the matrices $Q_i$, $K$, and $V$:

$$Q_i = \mathrm{DWConv}_{1\times1}(T_i), \quad K = \mathrm{DWConv}_{1\times1}(T_\Sigma), \quad V = \mathrm{DWConv}_{1\times1}(T_\Sigma)$$

Here, $Q_i \in \mathbb{R}^{d \times C_i}$, $K \in \mathbb{R}^{d \times C_\Sigma}$, and $V \in \mathbb{R}^{d \times C_\Sigma}$ represent the projected queries, keys, and values, respectively; $C_i$ ($i$ = 1, 2, 3, 4) denotes the number of channels in $T_i$, $C_\Sigma$ represents the sum of the channel numbers, and $d$ indicates the sequence length. Compared to linear projection, depthwise separable convolution significantly reduces computational complexity while effectively extracting local features. Subsequently, cross attention is performed along the channel dimension:

$$\mathrm{CA}_i = \mathrm{Softmax}\!\left(\frac{Q_i^{\top} K}{\sqrt{d}}\right) V^{\top}$$

where $Q_i$, $K$, and $V$ are the matrices of queries, keys, and values, respectively, and $\sqrt{d}$ denotes the scaling factor. During the cross-attention process, the weights of the values are computed from the similarity between the queries and keys, followed by the application of the Softmax function. The cross-attention result, denoted as $\mathrm{CA}_i$, is projected using a 1 × 1 depthwise convolution and then added to the initial $T_i$ to achieve feature fusion. Layer normalization and the GeLU activation function are applied to the fused features, yielding $O_i$ ($i$ = 1, 2, 3, 4).

Finally, the upsampling layer restores the dimensions of $O_i$ to those of the corresponding encoder features $E_i$, followed by batch normalization and the ReLU activation function. This process yields the four sets of outputs of the LCCT module.
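The following PyTorch sketch summarizes the LCCT computation under stated assumptions: the pooled token size, the channel widths, and the exact placement of the projections are not fixed by the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def dwsep1x1(c):
    """1x1 depthwise separable convolution: depthwise then pointwise."""
    return nn.Sequential(nn.Conv2d(c, c, 1, groups=c), nn.Conv2d(c, c, 1))

class LCCT(nn.Module):
    """Sketch of the Lightweight Channel Cross Transformer (Section 2.4.3)."""
    def __init__(self, channels=(64, 128, 256, 512), pooled=16):
        super().__init__()
        self.pooled = pooled                 # unified H = W after AvgPool
        c_total = sum(channels)
        self.embed = nn.ModuleList(dwsep1x1(c) for c in channels)
        self.proj_q = nn.ModuleList(dwsep1x1(c) for c in channels)
        self.proj_k = dwsep1x1(c_total)
        self.proj_v = dwsep1x1(c_total)
        self.proj_out = nn.ModuleList(dwsep1x1(c) for c in channels)
        self.ln = nn.ModuleList(nn.LayerNorm(c) for c in channels)
        self.bn = nn.ModuleList(nn.BatchNorm2d(c) for c in channels)

    def forward(self, feats):
        # Patch embedding: AvgPool unifies H and W; 1x1 DW sep conv tokenizes
        tokens = [emb(F.adaptive_avg_pool2d(e, self.pooled))
                  for emb, e in zip(self.embed, feats)]
        t_sigma = torch.cat(tokens, dim=1)   # keys/values source
        k = self.proj_k(t_sigma).flatten(2)  # (B, C_total, d)
        v = self.proj_v(t_sigma).flatten(2)
        outs = []
        for i, (t, e) in enumerate(zip(tokens, feats)):
            q = self.proj_q[i](t).flatten(2)  # (B, C_i, d)
            d = q.shape[-1]
            # Cross attention along the channel dimension
            attn = torch.softmax(q @ k.transpose(1, 2) / d ** 0.5, dim=-1)
            o = (attn @ v).view_as(t)         # back to (B, C_i, p, p)
            o = self.proj_out[i](o) + t       # projection + residual
            o = F.gelu(self.ln[i](o.permute(0, 2, 3, 1)).permute(0, 3, 1, 2))
            # Upsample back to the encoder resolution, then BN + ReLU
            o = F.interpolate(o, size=e.shape[2:], mode="bilinear",
                              align_corners=False)
            outs.append(F.relu(self.bn[i](o)))
        return outs
```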
2.4.4. Adaptive Feature Channel Enhancement
While channel-wise transformer architectures can mitigate the semantic gap between the encoder and decoder, recent studies have shown that self-attention mechanisms operating along the channel dimension may overly focus on localized positions, potentially impairing the model’s ability to capture globally relevant information [27]. Furthermore, the application of dense attention mechanisms risks introducing interference from extraneous semantic information outside the feature regions of interest [28]. To address these limitations, this study proposes the AFE module (as shown in Figure 6), which suppresses interference from irrelevant semantics introduced by self-attention mechanisms while enhancing discriminative features within the model’s target regions.
For the encoder input features $E_i$ ($i$ = 1, 2, 3, 4), their channel dimensions are first unified using a convolutional layer to obtain the processed features $\hat{E}_i$:

$$\hat{E}_i = \mathrm{Conv}(E_i), \quad i = 1, 2, 3, 4$$

Each $\hat{E}_i$ then undergoes global average pooling to generate a feature descriptor vector $v_i$, which encodes channel-wise information across multi-scale features:

$$v_i = \mathrm{GAP}(\hat{E}_i)$$

where GAP denotes the global average pooling operation. These feature descriptor vectors are then concatenated and fed into a linear layer to facilitate comprehensive cross-scale and cross-channel feature interaction, generating the selection values $s_i$:

$$[s_1, s_2, s_3, s_4] = \mathrm{Linear}\big(\mathrm{Concat}(v_1, v_2, v_3, v_4)\big)$$

The selection values $s_i$ adaptively highlight critical information while suppressing irrelevant components across the multi-scale features $\hat{E}_i$, generating weighted feature maps. A convolutional layer is then applied to the weighted feature maps, restoring their channel dimensions to match the original input. Finally, the weighted feature maps are integrated with the original input features through residual connections, yielding the output features $A_i$:

$$A_i = E_i + \mathrm{Conv}(s_i \odot \hat{E}_i)$$
The multi-scale features output by the AFE module are summed with the corresponding hierarchical features from the LCCT module and subsequently passed to the decoder.
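A minimal sketch of the AFE computation is given below, assuming a unified channel width and a sigmoid on the selection values (neither is fixed by the text).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AFE(nn.Module):
    """Sketch of Adaptive Feature Channel Enhancement (Section 2.4.4)."""
    def __init__(self, channels=(64, 128, 256, 512), unified=64):
        super().__init__()
        self.unify = nn.ModuleList(
            nn.Conv2d(c, unified, kernel_size=1) for c in channels)
        self.restore = nn.ModuleList(
            nn.Conv2d(unified, c, kernel_size=1) for c in channels)
        self.n, self.unified = len(channels), unified
        # Linear layer mixing the concatenated descriptors across scales
        self.fc = nn.Linear(self.n * unified, self.n * unified)

    def forward(self, feats):
        # Unify channel dimensions, then describe each scale with GAP
        hats = [u(e) for u, e in zip(self.unify, feats)]
        descs = [F.adaptive_avg_pool2d(h, 1).flatten(1) for h in hats]
        s = torch.sigmoid(self.fc(torch.cat(descs, dim=1)))  # assumption
        s = s.view(-1, self.n, self.unified)
        outs = []
        for i, (h, e) in enumerate(zip(hats, feats)):
            w = s[:, i].unsqueeze(-1).unsqueeze(-1)  # (B, unified, 1, 1)
            # Weight, restore channels, and add the residual connection
            outs.append(e + self.restore[i](h * w))
        return outs
```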
2.5. Hardware and Software Configuration
To ensure the fairness of experimental comparisons, all experiments were conducted under a unified hardware and software configuration. For hardware, an NVIDIA GeForce RTX 3090 GPU (NVIDIA Corporation, Santa Clara, CA, USA) was used. For software, all network models were built on PyTorch 1.10.0 with Python 3.8. The details of the configuration are presented in Table 3.
2.6. Training Parameters
In maize field images, most areas consist of background pixels, with maize tassel pixels accounting for a low proportion. To more effectively reduce the negative impact of this class imbalance and to better sensitize the model to the prediction of foreground pixels, the loss function used in the experiments combines cross-entropy loss and Dice loss. The expressions for the cross-entropy loss and Dice loss are, respectively:

$$L_{\mathrm{CE}} = -\frac{1}{N}\sum_{i=1}^{N}\big[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\big]$$

$$L_{\mathrm{Dice}} = 1 - \frac{2\sum_{i=1}^{N} y_i \hat{y}_i + \epsilon}{\sum_{i=1}^{N} y_i + \sum_{i=1}^{N} \hat{y}_i + \epsilon}$$

where $N$ represents the total number of pixels, $y_i$ represents the true label of the current sample, $\hat{y}_i$ represents the model’s predicted output, and $\epsilon$ is a smoothing term used to avoid a zero denominator. The combined loss function is:

$$L = L_{\mathrm{CE}} + L_{\mathrm{Dice}}$$
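For a binary tassel/background mask, the combined loss can be sketched in PyTorch as follows; the epsilon value and the unweighted sum of the two terms are assumptions.

```python
import torch
import torch.nn as nn

class CombinedLoss(nn.Module):
    """Cross-entropy + Dice loss for binary tassel segmentation, following
    the combined formulation above; the epsilon value is an assumption."""
    def __init__(self, eps: float = 1e-6):
        super().__init__()
        self.bce = nn.BCEWithLogitsLoss()
        self.eps = eps

    def forward(self, logits: torch.Tensor, target: torch.Tensor):
        # logits, target: (B, 1, H, W); target values in {0, 1}
        ce = self.bce(logits, target.float())
        prob = torch.sigmoid(logits)
        inter = (prob * target).sum(dim=(1, 2, 3))
        denom = prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
        dice = 1.0 - (2.0 * inter + self.eps) / (denom + self.eps)
        return ce + dice.mean()
```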
During training, the Adam optimizer was used to optimize the model parameters via back propagation. The initial learning rate was set to 1 × , and the weight decay was set to 5 × to accommodate the model requirements at different training stages. The batch size was set to 8, and all models were trained for 100 epochs.
2.7. Evaluation Indicators
To objectively and comprehensively evaluate the model in the maize tassel segmentation task, this study uses precision, recall, intersection over union (IoU), and the Dice similarity coefficient (Dice). The confusion matrix in Table 4 provides the basis for calculating these evaluation metrics.
Precision and recall are two fundamental metrics in deep learning task evaluation. Precision reflects the proportion of correctly predicted tassel samples (TP) among all predicted tassel samples (TP + FP), and recall reflects the proportion of correctly predicted tassel samples (TP) among the actual tassel samples in the true labels (TP + FN). The formulas for these two metrics are as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}$$

IoU is a key metric for evaluating image segmentation performance in semantic segmentation tasks, representing the ratio of the intersection to the union of the predicted results and the true labels:

$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}$$

The Dice coefficient is used to measure the similarity between the predicted results and the ground truth. Its value ranges from 0 to 1, with larger values indicating greater similarity:

$$\mathrm{Dice} = \frac{2TP}{2TP + FP + FN}$$
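All four metrics can be computed directly from binary prediction and label masks; a small NumPy sketch follows (the epsilon guard against empty masks is an implementation detail, not from the paper).

```python
import numpy as np

def segmentation_metrics(pred: np.ndarray, label: np.ndarray) -> dict:
    """Compute precision, recall, IoU, and Dice from binary masks,
    following the confusion-matrix definitions above."""
    pred, label = pred.astype(bool), label.astype(bool)
    tp = np.logical_and(pred, label).sum()
    fp = np.logical_and(pred, ~label).sum()
    fn = np.logical_and(~pred, label).sum()
    eps = 1e-12  # guards against empty masks
    return {
        "precision": tp / (tp + fp + eps),
        "recall": tp / (tp + fn + eps),
        "iou": tp / (tp + fp + fn + eps),
        "dice": 2 * tp / (2 * tp + fp + fn + eps),
    }
```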
3. Results and Discussion
3.1. Experiment Comparing with Different Models
3.1.1. Performance Comparison of DECC-Net and Advanced U-Shaped Networks
To thoroughly evaluate the performance of DECC-Net, we compared it with a variety of U-shaped networks on the maize tassel dataset. The comparative models include U-Net with VGG16 as the encoder, which has been established as a baseline in maize tassel segmentation studies [18], and advanced U-shaped networks with applications in the agricultural domain, including the convolution-based MU-Net [29], ResUnet++ [30], and MultiResUnet [24], and the transformer-based TransUNet [31], SwinUNet [32], and UCTransNet [26]. The experimental results are presented in Table 5.
In the maize tassel segmentation task, IoU and Dice are key metrics for evaluating the segmentation accuracy of models. Experimental results show that the DECC-Net proposed in this study achieves superior performance in these metrics. Specifically, DECC-Net achieves an IoU of 83.3% and a Dice of 90.9%—outperforming the next-best model (UCTransNet) by 2.7% and 1.6%, respectively.
Figure 7 illustrates the IoU curves on the validation set during training. DECC-Net’s IoU rapidly reaches a high level and converges to a better final result than those of the other network models.
Figure 8 presents the qualitative segmentation results on the test set: DECC-Net can accurately distinguish tassels from interfering factors such as leaf veins and leaf reflections, and its segmentation boundaries are closer to the actual contours of tassels than those of other models. This indicates that DECC-Net has excellent multi-scale feature extraction capability and the ability to distinguish targets from similar interfering factors. In contrast, models such as U-Net, ResUnet++, and UCTransNet have shortcomings in feature extraction capability. They struggle to capture the subtle differences in morphology and texture between leaf veins and tassel branches and thus tend to misidentify leaf veins as tassel branches. TransUNet and SwinUNet, which use transformers as encoders, have limitations in capturing local detailed features, leading to a large number of false positive results.
Benefiting from its streamlined architecture and effective utilization of depthwise separable convolutions, DECC-Net achieves a compact parameter count of 3.67 M, significantly outperforming transformer-based networks such as TransUNet (67.87 M) and SwinUNet (27.17 M) in parameter efficiency. It also outperforms convolution-based networks such as ResUnet++ (13.09 M) and MultiResUnet (7.24 M). This parameter efficiency enables DECC-Net to be effectively deployed on hardware-constrained devices.
Figure 9a visually demonstrates the IoU values and parameter counts of multiple models, where DECC-Net achieves favorable segmentation performance while significantly reducing the parameter count.
3.1.2. Performance Comparison of DECC-Net and Classic Segmentation Networks
We compared DECC-Net with classic networks previously applied in the fields of maize tassel segmentation and maize canopy segmentation [33], including FCN [34], SegNet [35], and DeepLabV3+ [36]. The experimental results are presented in Table 6. Compared to these classic models, DECC-Net performs best across all defined metrics, outperforming the second-best FCN by 7.9% in IoU and 4.9% in Dice. FCN and SegNet are constrained by fixed-size convolution kernels and thus have inherent limitations in feature extraction. DeepLabV3+ fails to achieve deep fusion of multi-scale features and similarly struggles to adapt to maize tassel targets with significant morphological variations. In contrast, DECC-Net benefits from the synergistic effect of dynamic convolution and cross-attention mechanisms: it can accurately identify tassel targets of varying sizes and morphologies, thereby achieving superior segmentation accuracy. Furthermore, DECC-Net maintains a parameter count of 3.67 M, significantly lower than SegNet’s 15.27 M, FCN’s 35.31 M, and DeepLabV3+’s 42.0 M.
Figure 9b visually demonstrates the IoU values and parameter counts of DECC-Net compared with other classic semantic segmentation models. The experimental results demonstrate that DECC-Net not only achieves superior segmentation performance but also exhibits significantly lower parameter counts, making it well-suited for maize tassel segmentation tasks.
3.2. Ablation Study
To further validate the effectiveness of the proposed modules, ablation experiments were conducted on the constructed dataset. U-Net with VGG16 as the encoder was used as the baseline; it has a CNN-based encoder and decoder and uses skip connections for feature fusion. Three primary evaluation metrics were adopted: IoU, Dice, and parameter count. Table 7 summarizes the quantitative outcomes of these ablation experiments.
DKE modules constitute the core components of DECC-Net’s encoder and decoder. By integrating multi-scale feature extraction and a dynamic selection mechanism, the DKE module enriches semantic information capture while adaptively emphasizing the most discriminative features. This design enhances the precision of segmentation masks and improves the model’s ability to distinguish maize tassels from complex backgrounds such as leaf veins. Additionally, the application of batch normalization reduces the model’s reliance on local features in the training data, effectively mitigating the risk of overfitting. When the baseline model was augmented with DKE modules, its IoU increased by 1.0% and the Dice increased by 0.6%, demonstrating the effectiveness of the DKE module.
The LCCT module acts as a lightweight transformer-based structure designed to enhance conventional skip connections in U-shaped networks. Compared to the baseline, the LCCT module improved IoU by 2.4% and Dice by 1.5%. Further, integrating the AFE module into the baseline + LCCT architecture improved IoU to 82.2% and Dice to 90.2%. This improvement primarily stems from two key mechanisms: The LCCT module implements channel-wise attention to effectively capture interdependencies among multi-scale features from the encoder, thereby mitigating semantic gaps between different feature layers. Concurrently, the AFE module adaptively learns channel-wise weights for multi-scale features, enabling the model to focus on regions of interest while suppressing interference from irrelevant areas. The synergistic integration of these two modules facilitates effective feature fusion, thereby further enhancing the model’s representational capacity.
Compared with the individual introduction of DKE modules or the combined introduction of the LCCT and AFE modules, the joint integration of all three modules leads to more substantial performance improvements. DECC-Net further increases the IoU to 83.3% and the Dice coefficient to 90.9%.
This study further conducted ablation experiments on the fusion strategies of the LCCT and AFE modules. The experimental results are shown in Table 8, where “+” and “−” denote summation fusion and sequential fusion of the output features from the two modules, respectively. All three feature fusion strategies improve segmentation performance, but summation fusion enables the model to achieve the best results.
In summary, our improvement strategy, which constructs the encoder and decoder using DKE modules and replaces skip connections with the LCCT and AFE modules to perform summation fusion of input features, simultaneously improves segmentation performance and maintains a relatively small parameter count. The experimental results fully validate the effectiveness of the proposed improvements.
3.3. Evaluation of Model Performance with Different Data Augmentation Levels
To investigate the segmentation accuracy of the models under different levels of data augmentation, we conducted multiple sets of experiments based on the original non-augmented data, varying the proportion of augmented samples relative to the original training set. Specifically, one set of experiments used the original tassel images without data augmentation as the training set, while the other four sets constructed training sets by expanding the original tassel images at ratios of 25%, 50%, 75%, and 100%, respectively. These five training sets were then used to train both the proposed DECC-Net and U-Net. The experimental results are shown in Figure 10. In the original group without data augmentation, DECC-Net achieved an IoU of 77.4% and a Dice of 87.3%, both higher than those of U-Net. As the level of data augmentation increased from 0% to 100%, both DECC-Net and U-Net showed varying degrees of performance improvement. However, at every augmentation level, DECC-Net achieved higher IoU and Dice scores than U-Net, demonstrating the superiority of its segmentation performance under different data conditions.
3.4. Robustness Analysis of DECC-Net
To further analyze the performance of DECC-Net and evaluate its robustness in different scenarios, we subdivided the test set into five categories based on weather conditions (sunny, cloudy) and tasseling stages (early, middle, late). Comparative tests were conducted between DECC-Net and U-Net, the mainstream model for maize tassel segmentation [18], with results presented in Table 9.
Experimental results showed that model performance varies across tasseling stages. Both DECC-Net and U-Net achieved the lowest performance at the early tasseling stage. In contrast, both models demonstrated improved segmentation performance during the middle and late tasseling stages. At the early tasseling stage, tassels are smaller in size and highly similar in morphology to leaf veins, while being easily occluded by canopy leaves, leading to lower segmentation accuracy. In the middle and late tasseling stages, the main bodies of tassels emerge from the canopy leaves, with more distinct morphological characteristics, thus facilitating accurate recognition and segmentation by the model.
Although segmentation performance varies across different tasseling stages for both models, DECC-Net outperforms U-Net in all stages. Notably, at the early tasseling stage, which imposes higher demands on the model’s feature extraction capability, DECC-Net achieved an IoU 5.1% higher than U-Net. The experimental results fully validate the robustness of the proposed DECC-Net and its effectiveness in tassel segmentation across the entire developmental cycle.
Under cloudy conditions, DECC-Net and U-Net achieved IoU values of 85.3% and 83.2%, respectively. In contrast, under sunny conditions, the IoU values of both models decreased to 81.4% and 74.6%. This performance decline under sunny conditions is attributed to intense light reflections on maize leaves, which introduce complex interference in the images. Specifically, strip-shaped light reflections may be misidentified as tassels, thereby reducing segmentation accuracy. Conversely, cloudy conditions minimize such reflection interference, facilitating more accurate tassel segmentation. The results further demonstrate that DECC-Net outperforms U-Net under both weather conditions, confirming its robustness in handling tassel segmentation under diverse environmental challenges.
Figure 11 shows the visualized tassel segmentation results of DECC-Net and U-Net across different tasseling stages and weather conditions. It was observed that U-Net exhibits significant limitations when performing the maize tassel segmentation task. Particularly in areas where tassels and leaf veins overlap and in situations with large light reflections on the leaves, U-Net, limited by its relatively weak feature extraction and fusion capabilities, may incorrectly identify leaf veins and reflected light on leaves as parts of the tassel, or conversely, misidentify tassels as background. This results in many false positives and negatives that affect the segmentation outcome. In comparison, DECC-Net significantly improves segmentation accuracy in complex environments by effectively capturing multi-scale tassel features and mitigating the semantic gap between the encoder and decoder through enhanced feature fusion capabilities.
Figure 12 illustrates the segmentation results of DECC-Net in complex field environments. When handling maize tassels under different weather conditions and at varying developmental stages, DECC-Net demonstrates strong generalizability.
4. Conclusions
In this paper, we focused on the semantic segmentation of maize tassels in cropland scenes. Using UAV-collected RGB images of field maize as the research object, we constructed a maize tassel dataset covering diverse tasseling stages and weather conditions. To address the challenges faced by existing network models in maize tassel segmentation, including low accuracy, insufficient generalization ability, and large parameter count, we proposed DECC-Net. The model employs DKE modules to construct the encoder and decoder, while enhancing feature fusion capabilities through the LCCT and AFE modules. The DKE modules improve the ability to extract critical features from images while reducing the parameter count. The LCCT and AFE modules collaboratively capture long-range contextual dependencies and enable effective fusion of multi-scale channel features, mitigating semantic gaps between encoder and decoder features. The performance of DECC-Net was validated and analyzed through extensive experiments, with the main conclusions as follows:
DECC-Net exhibits excellent segmentation performance, achieving 85.6% precision, 96.9% recall, 83.3% IoU, and 90.9% Dice scores, surpassing the baseline U-Net (with VGG16 as the encoder) adopted in existing maize tassel segmentation studies, as well as a series of representative advanced segmentation models in agricultural applications.
DECC-Net has a parameter count of only 3.67 M, which is lower than that of other mainstream semantic segmentation models. This indicates that DECC-Net can be more effectively applied to hardware-limited devices.
DECC-Net demonstrates better robustness and generalization ability, effectively performing maize tassel segmentation tasks across different tasseling stages and weather conditions. The segmentation results more closely resemble the actual morphology of maize tassels, thereby providing more precise guidance for field management during the tasseling period of maize.
Compared with existing studies, the DECC-Net proposed in this study has achieved promising segmentation results, though there remains room for improvement. Future research will focus on introducing semi-supervised learning methods to reduce the cost of data annotation in semantic segmentation tasks. Furthermore, we plan to collect data in more experimental regions and further enhance the diversity of maize varieties, aiming to further verify and improve the model’s generalization ability and robustness. Meanwhile, we will carry out tassel area estimation in actual agricultural production environments to enhance the model’s practical applicability in real-world scenarios.