1. Introduction
Maize is one of the world’s three major cereal crops, widely cultivated across the globe and providing 19.5% of the global caloric intake from all sources [1,2]. It plays a diverse and dynamic role in the global agricultural food system and food/nutrition security, serving not only as an important human food crop but also as a crucial livestock feed, industrial, and energy crop [3]. The tassel, located at the top of the plant, is a key phenotypic trait indicating maize growth and reproductive stages. The characteristics of maize tassels provide important insights for improving agricultural practices to enhance yield [4,5,6]. Therefore, effective monitoring of tassel growth during the tasseling stage is important not only for maize yield estimation and variety selection, but also for guiding agricultural personnel in implementing appropriate field management.
Currently, the monitoring of maize tassel growth in the field mainly relies on manual observation, which is time-consuming, labor-intensive, and inefficient. Against the backdrop of continuously rising labor costs and increasing difficulties in labor recruitment, large-scale monitoring of maize tassels through manual methods has become unsustainable. In recent years, the rapid development of low-altitude UAV remote sensing technology and deep learning has provided new solutions for maize tassel monitoring. UAVs can capture high-resolution images at low altitudes and offer advantages including portability, high mobility, and low cost. Meanwhile, deep learning methods have significantly enhanced the accuracy and processing efficiency of image recognition. The collaborative integration of these two technologies has been widely applied across various domains of smart agriculture, including farmland crop classification [7], crop growth status monitoring [8], and estimation of crop leaf area index [9].
Existing studies on maize tassels using deep learning primarily focus on counting and object detection. Zheng et al. [10] proposed a multiscale lite attention enhancement network (MLAENet), which uses point-level annotations for maize tassel counting. Qi et al. [11] developed the MT-YOLO model, which integrates the ECANet attention mechanism into YOLOv5s to enhance both detection accuracy and speed. In our previous work [12], we introduced the CA-YOLO model, achieving an average precision of 96% and effectively detecting early-stage, leaf-obscured, mutually obscured, and complex-background tassels.
However, current mainstream counting and object detection methods provide only coarse-grained recognition at the point or bounding-box level, failing to fully characterize the growth status and key characteristics of maize tassels. This limitation restricts their utility in accurately assessing developmental stages and guiding precision field management. In contrast, semantic segmentation methods generate pixel-level annotations of target objects, enabling more detailed and actionable insights for agricultural decision-making, and have been increasingly applied to diverse and data-rich agricultural problems [13]. In weed identification, Genze et al. [14] proposed a model named DeBlurWeedSeg, which achieved accurate segmentation of weeds in sorghum fields. In leaf disease identification, Wang et al. [15] proposed a network named MFBP-UNet for identifying multiple diseases on pear leaves, and Megersa et al. [16] achieved detection of common rust disease in maize based on ResNet50.
Currently, the application of deep learning-based semantic segmentation methods in maize tassel analysis remains relatively limited. Lu et al. [17] proposed a region-based color modeling approach for maize tassel segmentation, which integrates region proposal generation and ensemble neural networks to achieve an average precision (AP) of 74.3%. Yu et al. [18] constructed feature extraction networks using VGG16 and MobileNet, respectively, to explore the potential of the U-Net model for segmenting maize tassels from UAV images.
While these studies have made progress, limitations remain. First, the scenes in maize field images captured by UAVs are extremely complex, especially in high-density planting environments. Interference factors such as leaf veins and female ears, which are morphologically similar to tassels, together with varying degrees of light reflection on leaves under different weather conditions, pose significant challenges to accurate tassel segmentation. Second, tassels at different developmental stages exhibit significant variations in morphology and size. Existing segmentation models struggle to distinguish tassel features from morphologically similar interference factors and are not robust to tassel targets at different scales, leading to insufficient robustness in complex scenes and across tasseling stages. To address these limitations, at the dataset level, this study constructed a maize tassel segmentation dataset covering different developmental stages of tassels and various weather conditions, based on three consecutive years of field planting and image collection. At the network level, we propose DECC-Net, a novel model for maize tassel segmentation that accurately identifies maize tassels in complex scenes and outputs the corresponding segmentation masks. The main contributions of this paper can be summarized in three aspects:
Dynamic Kernel Feature Extraction (DKE): This module is based on dynamic convolution, effectively capturing multi-scale features of maize tassels with varying morphologies and sizes. It enhances the model’s capacity to extract discriminative features from complex images.
Lightweight Channel Cross Transformer (LCCT) and Adaptive Feature Channel Enhancement (AFE): These two modules leverage cross-attention mechanisms to capture channel-wise dependencies between multi-scale features. They guide the efficient fusion of multi-scale features and suppress interference from irrelevant information.
Validation on Diverse Maize Tassel Dataset and Robustness Analysis: We validate the efficacy of DECC-Net on the constructed diverse maize tassel dataset and conduct an in-depth analysis of the model’s robustness across different scenarios.
2. Materials and Methods
2.1. Description of Experimental Sites
The maize tassel images used in this study were collected from the Xiangyang Base of Northeast Agricultural University (126°55′39″ E, 45°45′48″ N) and the Acheng Base of Northeast Agricultural University (127°2′58″ E, 45°31′18″ N) in Harbin, Heilongjiang Province, China. The planting area used in the Xiangyang Base was approximately 8000 m², and that in the Acheng Base was approximately 2680 m². The geographical information of the experimental sites is shown in Figure 1. The experiment was conducted continuously over three years from 2022 to 2024, with the detailed experimental scheme, which applies to both bases, presented in Table 1.
2.2. UAV-Based Remote Sensing Image Acquisition
The time period of UAV-based remote sensing image acquisition covers the entire tasseling process of maize. The image acquisition equipment was a DJI Phantom 4 RTK (DJI-Innovations, Inc., Shenzhen, Guangdong, China) equipped with a 20-megapixel RGB sensor, producing images with a resolution of 5472 × 3648 pixels. During image acquisition, the UAV operated at a flight altitude of 10 m with the camera oriented vertically downward, flying at a constant speed along an S-shaped route.
The tasseling stage of maize coincides with midsummer, characterized by complex and variable weather conditions that can cause varying degrees of reflection on maize leaves. To ensure the dataset meets the robustness requirements of the model, the collected maize field images include both sunny and cloudy weather conditions. The morphology of maize tassels also changes significantly throughout the tasseling process. We collected tassel images covering the entire tasseling process and divided it into three stages (early, middle, and late) based on tassel morphology. Specifically, the early tasseling stage refers to when most of a maize tassel remains enclosed in leaves; the middle stage is when most of the tassel has emerged from the leaves but has not yet started pollen shedding; and the late stage is characterized by full tassel extension and initiation of pollen shedding, which corresponds to the traditionally defined VT (vegetative tasseling stage) [19]. Ultimately, a total of over 8000 maize tassel images covering different weather conditions and developmental stages were collected, with examples of tassels shown in Figure 2.
2.3. Dataset Processing
From the collected images, 360 were selected to construct a maize tassel semantic segmentation dataset. The maize tassels in the images were meticulously annotated using the data annotation software Labelme to generate corresponding mask images. As the high resolution of the original images made them unsuitable for direct input into the model for training, the maize tassel images and their mask images were split into smaller sub-images. This resulted in 2880 sub-images with a resolution of 512 × 512. To ensure effective validation of the model’s robustness and generalization ability and reduce the risk of overfitting, images collected from the Xiangyang Base were used to construct the training and validation sets, while images from the Acheng Base were employed for the test set. The ratio of the training set, validation set, and test set is 6:2:2. In the constructed dataset, the ratio of tassel images at different developmental stages is 1:1:1, and the ratio of images under cloudy and sunny weather conditions is 1:1, ensuring the balance of samples in each category.
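The paper’s exact cropping scheme (which yields eight 512 × 512 sub-images per annotated image) is not specified; the sketch below illustrates one generic way to split a high-resolution image or mask into non-overlapping 512 × 512 tiles, with file naming and border handling as illustrative assumptions.

```python
from pathlib import Path
from PIL import Image

TILE = 512  # sub-image resolution used in this study

def split_into_tiles(image_path: Path, out_dir: Path, tile: int = TILE) -> int:
    """Split one image (or mask) into non-overlapping tile x tile sub-images.

    Border regions that do not fill a whole tile are discarded; the paper's
    exact cropping strategy is not specified, so this grid layout is an
    illustrative assumption.
    """
    img = Image.open(image_path)
    w, h = img.size
    out_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for top in range(0, h - tile + 1, tile):
        for left in range(0, w - tile + 1, tile):
            crop = img.crop((left, top, left + tile, top + tile))
            crop.save(out_dir / f"{image_path.stem}_{top}_{left}.png")
            count += 1
    return count
```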
To further enrich the training data and reduce the risk of overfitting, data augmentation methods including random rotation, flipping, random scaling, and random color jittering were performed on the training set. Through data augmentation, the number of training set images was expanded from 1728 to 3456. After preprocessing and data augmentation, the final constructed dataset is presented in Table 2.
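The listed augmentation operations can be reproduced with standard torchvision functional transforms. In the sketch below, geometric transforms are applied jointly to the image and mask (so the annotation stays aligned) while color jittering touches the image only; all parameter ranges are assumptions not reported in the paper.

```python
import random
import torchvision.transforms.functional as TF
from torchvision.transforms import InterpolationMode

def augment_pair(image, mask):
    """Jointly augment an image/mask pair. Geometric transforms are shared;
    color jittering is applied to the image only. Ranges are assumptions."""
    # Random rotation
    angle = random.uniform(-30.0, 30.0)
    image, mask = TF.rotate(image, angle), TF.rotate(mask, angle)
    # Random horizontal / vertical flips
    if random.random() < 0.5:
        image, mask = TF.hflip(image), TF.hflip(mask)
    if random.random() < 0.5:
        image, mask = TF.vflip(image), TF.vflip(mask)
    # Random scaling, then center crop/pad back to 512 x 512
    scale = random.uniform(0.8, 1.2)
    size = int(512 * scale)
    image = TF.resize(image, [size, size])
    mask = TF.resize(mask, [size, size],
                     interpolation=InterpolationMode.NEAREST)
    image = TF.center_crop(image, [512, 512])
    mask = TF.center_crop(mask, [512, 512])
    # Random color jittering (image only)
    image = TF.adjust_brightness(image, random.uniform(0.8, 1.2))
    image = TF.adjust_contrast(image, random.uniform(0.8, 1.2))
    return image, mask
```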
2.4. DECC-Net for Tassel Segmentation
2.4.1. Overall Structure of DECC-Net
For semantic segmentation of maize tassels in complex field environments, we propose DECC-Net (as shown in Figure 3). DECC-Net is mainly composed of Encoder Blocks, Decoder Blocks, and a Feature Fusion Block. The novelty of this model lies in two aspects. First, our proposed Dynamic Kernel Feature Extraction (DKE) modules are integrated into the Encoder and Decoder Blocks to enhance the capture of multi-scale maize tassel features in complex scenarios. Second, the proposed Lightweight Channel Cross Transformer (LCCT) and Adaptive Feature Channel Enhancement (AFE) modules are integrated into the Feature Fusion Block; their combination jointly guides the adaptive fusion of multi-scale features, alleviates semantic discrepancies between features at different levels, and suppresses interference from irrelevant information such as leaf veins and leaf reflections.
Specifically, each Encoder Block consists of a DKE module and a downsampling component. Herein, the DKE module is used to extract multi-scale features from the input image and adaptively emphasize meaningful semantic features; the downsampling component performs downsampling on the feature map via a convolutional layer with a kernel size of 2 × 2 and a stride of 2, while doubling the number of channels. The Feature Fusion Block is composed of the LCCT module and AFE module, which deeply fuse features from Encoder Blocks at different levels, highlight key information therein, and suppress interfering information. The Bottleneck Block comprises a single DKE module, which maintains the scale of the feature map unchanged and transmits it to the lowest-level Decoder Block. Within the Decoder Block, the upsampling component utilizes transposed convolution to gradually restore the size of the feature map and halve the number of channels; subsequently, the feature map from the Feature Fusion Block is concatenated with the newly upsampled feature map, enabling the fused feature map to contain rich low-level and high-level semantic information. Following the concatenation operation, the DKE module is used to mitigate gradient vanishing and further capture effective information. Finally, DECC-Net outputs the final segmentation mask through a 1 × 1 convolutional layer.
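For readers who wish to reproduce the block structure, a minimal PyTorch sketch of the Encoder and Decoder Blocks is given below. It assumes the DKE sketch provided with Section 2.4.2 and treats all layer details not stated in the text as illustrative.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Encoder Block sketch: a DKE module followed by a 2x2, stride-2
    convolution that downsamples and doubles the channels. DKE refers to
    the sketch given with Section 2.4.2 below."""
    def __init__(self, channels):
        super().__init__()
        self.dke = DKE(channels)
        self.down = nn.Conv2d(channels, channels * 2, kernel_size=2, stride=2)

    def forward(self, x):
        skip = self.dke(x)            # sent to the Feature Fusion Block
        return skip, self.down(skip)

class DecoderBlock(nn.Module):
    """Decoder Block sketch: transposed-conv upsampling that halves the
    channels, concatenation with the fused skip feature, then a DKE module."""
    def __init__(self, channels):
        super().__init__()
        self.up = nn.ConvTranspose2d(channels, channels // 2,
                                     kernel_size=2, stride=2)
        self.dke = DKE(channels)      # concatenation restores 'channels'

    def forward(self, x, fused_skip):
        x = self.up(x)
        return self.dke(torch.cat([fused_skip, x], dim=1))
```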
2.4.2. Dynamic Kernel Feature Extraction
Feature extractors serve as critical components in semantic segmentation models, directly influencing the quality of segmentation outcomes. In maize field scenarios, maize tassels at different developmental stages exhibit significant morphological and size variations. Existing studies predominantly employ CNNs such as MobileNet [20] and VGG [21] as feature extractors for segmentation models. However, constrained by fixed kernel sizes and limited receptive fields, these networks struggle to effectively model long-range pixel dependencies, resulting in suboptimal performance in adaptively capturing multi-scale features from tassels.
In contrast to these methods, dynamic convolution kernels [22] enhance model representational capacity by aggregating multiple kernels through attention mechanisms without increasing network depth or width. Inspired by multi-scale feature fusion principles and dynamic convolution kernels, this study proposes the Dynamic Kernel Feature Extraction (DKE) module, as illustrated in Figure 4. This module employs a dynamic selection mechanism that, guided by global contextual information, adaptively emphasizes the most critical features among the multi-scale spatial features extracted by different kernel sizes.
The operations of the DKE module are as follows: for an input feature map, we first employ three depthwise separable convolution layers, DW3 × 3, DW5 × 5, and DW5 × 5, and cascade them. The shallow DW3 × 3 kernel extracts features with rich local details, and cascading these kernels enables the deep DW layer to have a receptive field equivalent to an 11 × 11 convolution kernel, facilitating the capture of long-range global features. Moreover, compared to standard convolutions, using depthwise separable convolutions effectively reduces computational costs and the number of parameters.
Subsequently, the multi-scale features $F_1$, $F_2$, and $F_3$ produced by the three cascaded layers are concatenated. Average pooling (AvgPool) and maximum pooling (MaxPool) are then applied along the channel dimension of the concatenated features to effectively model the global spatial relationship of these local features:

$$F_{\mathrm{avg}} = \mathrm{AvgPool}\big(\mathrm{Concat}(F_1, F_2, F_3)\big), \quad F_{\mathrm{max}} = \mathrm{MaxPool}\big(\mathrm{Concat}(F_1, F_2, F_3)\big)$$

Following that, a series of convolutional layers processes $F_{\mathrm{avg}}$ and $F_{\mathrm{max}}$, allowing full interaction of information across different spatial dimensions. The sigmoid activation function $\sigma$ is then applied to generate three sets of dynamic selection values $\alpha_1$, $\alpha_2$, and $\alpha_3$:

$$[\alpha_1, \alpha_2, \alpha_3] = \sigma\big(\mathrm{Conv}(\mathrm{Concat}(F_{\mathrm{avg}}, F_{\mathrm{max}}))\big)$$

These three sets of dynamic selection values adaptively select features from the three feature maps at different scales to generate weighted feature maps, which are then summed to achieve feature fusion. A residual connection integrates the original feature map $F_{\mathrm{in}}$ with the weighted feature maps, producing the integrated feature map $F'$:

$$F' = F_{\mathrm{in}} + \sum_{i=1}^{3} \alpha_i \odot F_i$$

Finally, batch normalization (BN) and the ReLU activation function are applied to the integrated feature map to generate the output feature $F_{\mathrm{out}} = \mathrm{ReLU}(\mathrm{BN}(F'))$, where BN helps reduce the model’s reliance on specific features, thereby lowering the risk of overfitting.
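A minimal PyTorch sketch of the DKE computation described above follows; the 7 × 7 kernel of the selection convolution and the single-conv realization of the “series of convolutional layers” are assumptions.

```python
import torch
import torch.nn as nn

class DWConv(nn.Module):
    """Depthwise separable convolution: depthwise followed by pointwise."""
    def __init__(self, channels, kernel_size):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class DKE(nn.Module):
    """Sketch of Dynamic Kernel Feature Extraction (Section 2.4.2)."""
    def __init__(self, channels):
        super().__init__()
        self.dw3 = DWConv(channels, 3)    # receptive field 3x3
        self.dw5a = DWConv(channels, 5)   # cascaded: receptive field 7x7
        self.dw5b = DWConv(channels, 5)   # cascaded: receptive field 11x11
        self.select = nn.Conv2d(2, 3, kernel_size=7, padding=3)  # assumption
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        f1 = self.dw3(x)
        f2 = self.dw5a(f1)
        f3 = self.dw5b(f2)
        cat = torch.cat([f1, f2, f3], dim=1)
        # AvgPool / MaxPool along the channel dimension -> two H x W maps
        avg = cat.mean(dim=1, keepdim=True)
        mx = cat.amax(dim=1, keepdim=True)
        # Sigmoid yields the three dynamic selection maps a1, a2, a3
        a = torch.sigmoid(self.select(torch.cat([avg, mx], dim=1)))
        fused = a[:, 0:1] * f1 + a[:, 1:2] * f2 + a[:, 2:3] * f3
        # Residual connection, then BN + ReLU
        return torch.relu(self.bn(x + fused))
```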
2.4.3. Lightweight Channel Cross Transformer
In U-shaped network architectures, features from the encoder are considered low-level features, while those from the decoder are treated as more abstract high-level features. Semantic gaps exist between these two sets of features, and failure to bridge them during the encoding–decoding process can adversely affect segmentation results. To restore fine-grained features of target objects and improve segmentation performance, numerous improvement schemes have been proposed, including the convolution-based UNet++ [23] and MultiResUnet [24], and the Vision Transformer [25]-based UCTransNet [26]. While these models have demonstrated encouraging performance in certain domains (e.g., medical image segmentation), their applicability to maize tassel segmentation tasks still requires further validation. Convolution-based approaches struggle to model long-range semantic dependencies effectively, limiting their improvement efficacy. Transformer-based solutions are more conducive to capturing global contextual and long-distance dependencies, but their complex architectures cause a surge in computational complexity and parameter count, imposing strict demands on hardware resources.
Inspired by the aforementioned methods and based on the requirements of the maize tassel segmentation task, we designed the LCCT module, whose structure is depicted in Figure 5. The LCCT module is a lightweight channel cross-attention module designed to replace traditional skip connections, mitigating semantic gaps between different features and improving feature fusion.
The LCCT module operates as follows: features originating from the encoder at different scales, denoted as $E_i$ ($i$ = 1, 2, 3, 4), are initially fed into the patch embedding sub-module. Here, an average pooling operation compresses these features, unifying the height (H) and width (W) dimensions of the input feature maps across different scales while preserving their original channel dimensions. Subsequently, a 1 × 1 depthwise separable convolution generates the token sequence:

$$T_i = \mathrm{DWConv}_{1\times1}\big(\mathrm{AvgPool}(E_i)\big), \quad i = 1, 2, 3, 4$$

where $T_i$ represents the tokens generated from the encoder features $E_i$ after tokenization. In this manner, the channel dimensions of the tokenized multi-scale features remain consistent with their original counterparts, while the number of patches in the different $T_i$ is uniform. This uniformity is a prerequisite for enabling cross attention among tokens generated from features at various stages of the encoder.

Following the patch embedding sub-module, the four tokens $T_i$ ($i$ = 1, 2, 3, 4) are input into the cross-attention sub-module as queries. Concurrently, these tokens undergo a concatenation operation to form $T_\Sigma$:

$$T_\Sigma = \mathrm{Concat}(T_1, T_2, T_3, T_4)$$

where $T_\Sigma$ serves as the source of the keys and values for the subsequent cross-attention process. Following this, we employ a 1 × 1 depthwise separable convolution, instead of the traditional linear projection, to generate the matrices $Q_i$, $K$, and $V$:

$$Q_i = \mathrm{DWConv}_{1\times1}(T_i), \quad K = \mathrm{DWConv}_{1\times1}(T_\Sigma), \quad V = \mathrm{DWConv}_{1\times1}(T_\Sigma)$$

Here, $Q_i \in \mathbb{R}^{d \times C_i}$, $K \in \mathbb{R}^{d \times C_\Sigma}$, and $V \in \mathbb{R}^{d \times C_\Sigma}$ represent the projected queries, keys, and values, respectively; $C_i$ ($i$ = 1, 2, 3, 4) denotes the number of channels in $T_i$, $C_\Sigma$ represents the sum of the channel numbers, and $d$ indicates the sequence length. Compared to linear projection, depthwise separable convolution significantly reduces computational complexity while effectively extracting local features. Subsequently, cross attention is performed along the channel dimension:

$$\mathrm{CA}_i = \mathrm{Softmax}\!\left(\frac{Q_i^{\top} K}{\sqrt{d}}\right) V^{\top}$$

where $Q_i$, $K$, and $V$ are the matrices of queries, keys, and values, respectively, and $\sqrt{d}$ denotes the scaling factor. During the cross-attention process, the weights of the values are computed from the similarity between the queries and keys, followed by the application of the Softmax function. The cross-attention result, denoted as $\mathrm{CA}_i$, is projected using a 1 × 1 depthwise convolution and then added to the initial $T_i$ to achieve feature fusion. Layer normalization and the GeLU activation function are applied to the fused features, yielding $O_i$ ($i$ = 1, 2, 3, 4).

Finally, the upsampling layer restores the dimensions of $O_i$ to those of the corresponding encoder features $E_i$, followed by batch normalization and the ReLU activation function. This process yields the four sets of outputs of the LCCT module.
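The following PyTorch sketch summarizes the LCCT computation under stated assumptions: the pooled token size, the channel widths, and the exact placement of the projections are not fixed by the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def dwsep1x1(c):
    """1x1 depthwise separable convolution: depthwise then pointwise."""
    return nn.Sequential(nn.Conv2d(c, c, 1, groups=c), nn.Conv2d(c, c, 1))

class LCCT(nn.Module):
    """Sketch of the Lightweight Channel Cross Transformer (Section 2.4.3)."""
    def __init__(self, channels=(64, 128, 256, 512), pooled=16):
        super().__init__()
        self.pooled = pooled                 # unified H = W after AvgPool
        c_total = sum(channels)
        self.embed = nn.ModuleList(dwsep1x1(c) for c in channels)
        self.proj_q = nn.ModuleList(dwsep1x1(c) for c in channels)
        self.proj_k = dwsep1x1(c_total)
        self.proj_v = dwsep1x1(c_total)
        self.proj_out = nn.ModuleList(dwsep1x1(c) for c in channels)
        self.ln = nn.ModuleList(nn.LayerNorm(c) for c in channels)
        self.bn = nn.ModuleList(nn.BatchNorm2d(c) for c in channels)

    def forward(self, feats):
        # Patch embedding: AvgPool unifies H and W; 1x1 DW sep conv tokenizes
        tokens = [emb(F.adaptive_avg_pool2d(e, self.pooled))
                  for emb, e in zip(self.embed, feats)]
        t_sigma = torch.cat(tokens, dim=1)   # keys/values source
        k = self.proj_k(t_sigma).flatten(2)  # (B, C_total, d)
        v = self.proj_v(t_sigma).flatten(2)
        outs = []
        for i, (t, e) in enumerate(zip(tokens, feats)):
            q = self.proj_q[i](t).flatten(2)  # (B, C_i, d)
            d = q.shape[-1]
            # Cross attention along the channel dimension
            attn = torch.softmax(q @ k.transpose(1, 2) / d ** 0.5, dim=-1)
            o = (attn @ v).view_as(t)         # back to (B, C_i, p, p)
            o = self.proj_out[i](o) + t       # projection + residual
            o = F.gelu(self.ln[i](o.permute(0, 2, 3, 1)).permute(0, 3, 1, 2))
            # Upsample back to the encoder resolution, then BN + ReLU
            o = F.interpolate(o, size=e.shape[2:], mode="bilinear",
                              align_corners=False)
            outs.append(F.relu(self.bn[i](o)))
        return outs
```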
2.4.4. Adaptive Feature Channel Enhancement
While channel-wise transformer architectures can mitigate the semantic gap between the encoder and decoder, recent studies have shown that self-attention mechanisms operating along the channel dimension may overly focus on localized positions, potentially impairing the model’s ability to capture globally relevant information [27]. Furthermore, the application of dense attention mechanisms risks introducing interference from extraneous semantic information outside the feature regions of interest [28]. To address these limitations, this study proposes the AFE module (as shown in Figure 6), which suppresses interference from irrelevant semantics introduced by self-attention mechanisms while enhancing discriminative features within the model’s target regions.
For the encoder input features $E_i$ ($i$ = 1, 2, 3, 4), their channel dimensions are first unified using a convolutional layer to obtain the processed features $\hat{E}_i$:

$$\hat{E}_i = \mathrm{Conv}(E_i), \quad i = 1, 2, 3, 4$$

Each $\hat{E}_i$ then undergoes global average pooling to generate a feature descriptor vector $v_i$, which encodes channel-wise information across multi-scale features:

$$v_i = \mathrm{GAP}(\hat{E}_i)$$

where GAP denotes the global average pooling operation. These feature descriptor vectors are then concatenated and fed into a linear layer to facilitate comprehensive cross-scale and cross-channel feature interaction, generating the selection values $s_i$:

$$[s_1, s_2, s_3, s_4] = \mathrm{Linear}\big(\mathrm{Concat}(v_1, v_2, v_3, v_4)\big)$$

The selection values $s_i$ adaptively highlight critical information while suppressing irrelevant components across the multi-scale features $\hat{E}_i$, generating weighted feature maps. A convolutional layer is then applied to the weighted feature maps, restoring their channel dimensions to match the original input. Finally, the weighted feature maps are integrated with the original input features through residual connections, yielding the output features $A_i$:

$$A_i = E_i + \mathrm{Conv}(s_i \odot \hat{E}_i)$$
The multi-scale features output by the AFE module are summed with the corresponding hierarchical features from the LCCT module and subsequently passed to the decoder.
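A minimal sketch of the AFE computation is given below, assuming a unified channel width and a sigmoid on the selection values (neither is fixed by the text).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AFE(nn.Module):
    """Sketch of Adaptive Feature Channel Enhancement (Section 2.4.4)."""
    def __init__(self, channels=(64, 128, 256, 512), unified=64):
        super().__init__()
        self.unify = nn.ModuleList(
            nn.Conv2d(c, unified, kernel_size=1) for c in channels)
        self.restore = nn.ModuleList(
            nn.Conv2d(unified, c, kernel_size=1) for c in channels)
        self.n, self.unified = len(channels), unified
        # Linear layer mixing the concatenated descriptors across scales
        self.fc = nn.Linear(self.n * unified, self.n * unified)

    def forward(self, feats):
        # Unify channel dimensions, then describe each scale with GAP
        hats = [u(e) for u, e in zip(self.unify, feats)]
        descs = [F.adaptive_avg_pool2d(h, 1).flatten(1) for h in hats]
        s = torch.sigmoid(self.fc(torch.cat(descs, dim=1)))  # assumption
        s = s.view(-1, self.n, self.unified)
        outs = []
        for i, (h, e) in enumerate(zip(hats, feats)):
            w = s[:, i].unsqueeze(-1).unsqueeze(-1)  # (B, unified, 1, 1)
            # Weight, restore channels, and add the residual connection
            outs.append(e + self.restore[i](h * w))
        return outs
```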
2.5. Hardware and Software Configuration
To ensure the fairness of experimental comparisons, all experiments were conducted under a unified hardware and software configuration. For hardware, an NVIDIA GeForce RTX 3090 GPU (NVIDIA Corporation, Santa Clara, CA, USA) was used. For software, all network models were built on PyTorch 1.10.0 with Python 3.8. The details of the configuration are presented in Table 3.
2.6. Training Parameters
In maize field images, most areas consist of background pixels, with maize tassel pixels accounting for a low proportion. To more effectively reduce the negative impact of this class imbalance and to better sensitize the model to the prediction of foreground pixels, the loss function used in the experiments combines cross-entropy loss and Dice loss. The expressions for the cross-entropy loss and Dice loss are, respectively:

$$L_{\mathrm{CE}} = -\frac{1}{N}\sum_{i=1}^{N}\big[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\big]$$

$$L_{\mathrm{Dice}} = 1 - \frac{2\sum_{i=1}^{N} y_i \hat{y}_i + \epsilon}{\sum_{i=1}^{N} y_i + \sum_{i=1}^{N} \hat{y}_i + \epsilon}$$

where $N$ represents the total number of pixels, $y_i$ represents the true label of the current sample, $\hat{y}_i$ represents the model’s predicted output, and $\epsilon$ is a smoothing term used to avoid a zero denominator. The combined loss function is:

$$L = L_{\mathrm{CE}} + L_{\mathrm{Dice}}$$
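For a binary tassel/background mask, the combined loss can be sketched in PyTorch as follows; the epsilon value and the unweighted sum of the two terms are assumptions.

```python
import torch
import torch.nn as nn

class CombinedLoss(nn.Module):
    """Cross-entropy + Dice loss for binary tassel segmentation, following
    the combined formulation above; the epsilon value is an assumption."""
    def __init__(self, eps: float = 1e-6):
        super().__init__()
        self.bce = nn.BCEWithLogitsLoss()
        self.eps = eps

    def forward(self, logits: torch.Tensor, target: torch.Tensor):
        # logits, target: (B, 1, H, W); target values in {0, 1}
        ce = self.bce(logits, target.float())
        prob = torch.sigmoid(logits)
        inter = (prob * target).sum(dim=(1, 2, 3))
        denom = prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
        dice = 1.0 - (2.0 * inter + self.eps) / (denom + self.eps)
        return ce + dice.mean()
```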
During training, the Adam optimizer was used to optimize the model parameters via back propagation. The initial learning rate was set to 1 × , and the weight decay was set to 5 × to accommodate the model requirements at different training stages. The batch size was set to 8, and all models were trained for 100 epochs.
2.7. Evaluation Indicators
To objectively and comprehensively evaluate the model in the maize tassel segmentation task, this study uses precision, recall, intersection over union (IoU), and the Dice similarity coefficient (Dice). The confusion matrix in Table 4 provides the basis for calculating these evaluation metrics.
Precision and recall are two fundamental metrics in deep learning task evaluation. Precision reflects the proportion of correctly predicted tassel samples (TP) among all predicted tassel samples (TP + FP), and recall reflects the proportion of correctly predicted tassel samples (TP) among the actual tassel samples in the true labels (TP + FN). The formulas for these two metrics are as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}$$

IoU is a key metric for evaluating image segmentation performance in semantic segmentation tasks, representing the ratio of the intersection to the union of the predicted results and the true labels:

$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}$$

The Dice coefficient is used to measure the similarity between the predicted results and the ground truth. Its value ranges from 0 to 1, with larger values indicating greater similarity:

$$\mathrm{Dice} = \frac{2TP}{2TP + FP + FN}$$
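All four metrics can be computed directly from binary prediction and label masks; a small NumPy sketch follows (the epsilon guard against empty masks is an implementation detail, not from the paper).

```python
import numpy as np

def segmentation_metrics(pred: np.ndarray, label: np.ndarray) -> dict:
    """Compute precision, recall, IoU, and Dice from binary masks,
    following the confusion-matrix definitions above."""
    pred, label = pred.astype(bool), label.astype(bool)
    tp = np.logical_and(pred, label).sum()
    fp = np.logical_and(pred, ~label).sum()
    fn = np.logical_and(~pred, label).sum()
    eps = 1e-12  # guards against empty masks
    return {
        "precision": tp / (tp + fp + eps),
        "recall": tp / (tp + fn + eps),
        "iou": tp / (tp + fp + fn + eps),
        "dice": 2 * tp / (2 * tp + fp + fn + eps),
    }
```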
3. Results and Discussion
3.1. Experiment Comparing with Different Models
3.1.1. Performance Comparison of DECC-Net and Advanced U-Shaped Networks
To thoroughly evaluate the performance of DECC-Net, we compared it with a variety of U-shaped networks on the maize tassel dataset. The comparative models include U-Net with VGG16 as the encoder, which has been established as a baseline in maize tassel segmentation studies [18], and advanced U-shaped networks with applications in the agricultural domain, including the convolution-based MU-Net [29], ResUnet++ [30], and MultiResUnet [24], and the transformer-based TransUNet [31], SwinUNet [32], and UCTransNet [26]. The experimental results are presented in Table 5.
In the maize tassel segmentation task, IoU and Dice are key metrics for evaluating the segmentation accuracy of models. Experimental results show that the DECC-Net proposed in this study achieves superior performance in these metrics. Specifically, DECC-Net achieves an IoU of 83.3% and a Dice of 90.9%—outperforming the next-best model (UCTransNet) by 2.7% and 1.6%, respectively.
Figure 7 illustrates the IoU curves on the validation set during training. DECC-Net’s IoU rapidly reaches a high level and converges to a better final result than those of the other network models.
Figure 8 presents the qualitative segmentation results on the test set: DECC-Net can accurately distinguish tassels from interfering factors such as leaf veins and leaf reflections, and its segmentation boundaries are closer to the actual contours of tassels than those of other models. This indicates that DECC-Net has excellent multi-scale feature extraction capability and the ability to distinguish targets from similar interfering factors. In contrast, models such as U-Net, ResUnet++, and UCTransNet have shortcomings in feature extraction capability. They struggle to capture the subtle differences in morphology and texture between leaf veins and tassel branches and thus tend to misidentify leaf veins as tassel branches. TransUNet and SwinUNet, which use transformers as encoders, have limitations in capturing local detailed features, leading to a large number of false positive results.
Benefiting from its streamlined architecture and effective utilization of depthwise separable convolutions, DECC-Net achieves a compact parameter count of 3.67 M, significantly outperforming transformer-based networks such as TransUNet (67.87 M) and SwinUNet (27.17 M) in parameter efficiency. It also outperforms convolution-based networks such as ResUnet++ (13.09 M) and MultiResUnet (7.24 M). This parameter efficiency enables DECC-Net to be effectively deployed on hardware-constrained devices.
Figure 9a visually demonstrates the IoU values and parameter counts of multiple models, where DECC-Net achieves favorable segmentation performance while significantly reducing the parameter count.
3.1.2. Performance Comparison of DECC-Net and Classic Segmentation Networks
We compared DECC-Net with classic networks previously applied in the fields of maize tassel segmentation and maize canopy segmentation [33], including FCN [34], SegNet [35], and DeepLabV3+ [36]. The experimental results are presented in Table 6. Compared to these classic models, DECC-Net performs best across all defined metrics, outperforming the second-best FCN by 7.9% in IoU and 4.9% in Dice. FCN and SegNet are constrained by fixed-size convolution kernels and thus have inherent limitations in feature extraction. DeepLabV3+ fails to achieve deep fusion of multi-scale features and similarly struggles to adapt to maize tassel targets with significant morphological variations. In contrast, DECC-Net benefits from the synergistic effect of dynamic convolution and cross-attention mechanisms: it can accurately identify tassel targets of varying sizes and morphologies, thereby achieving superior segmentation accuracy. Furthermore, DECC-Net maintains a parameter count of 3.67 M, significantly lower than SegNet’s 15.27 M, FCN’s 35.31 M, and DeepLabV3+’s 42.0 M.
Figure 9b visually demonstrates the IoU values and parameter counts of DECC-Net compared with other classic semantic segmentation models. The experimental results demonstrate that DECC-Net not only achieves superior segmentation performance but also exhibits significantly lower parameter counts, making it well-suited for maize tassel segmentation tasks.
3.2. Ablation Study
To further validate the effectiveness of the proposed modules, ablation experiments were conducted on the constructed dataset. U-Net with VGG16 as the encoder was used as the baseline; it has a CNN-based encoder and decoder and uses skip connections for feature fusion. Three primary evaluation metrics were adopted: IoU, Dice, and parameter count. Table 7 summarizes the quantitative outcomes of these ablation experiments.
DKE modules constitute the core components of DECC-Net’s encoder and decoder. By integrating multi-scale feature extraction and a dynamic selection mechanism, the DKE module enriches semantic information capture while adaptively emphasizing the most discriminative features. This design enhances the precision of segmentation masks and improves the model’s ability to distinguish maize tassels from complex backgrounds such as leaf veins. Additionally, the application of batch normalization reduces the model’s reliance on local features in the training data, effectively mitigating the risk of overfitting. When the baseline model was augmented with DKE modules, its IoU increased by 1.0% and the Dice increased by 0.6%, demonstrating the effectiveness of the DKE module.
The LCCT module acts as a lightweight transformer-based structure designed to enhance conventional skip connections in U-shaped networks. Compared to the baseline, the LCCT module improved IoU by 2.4% and Dice by 1.5%. Further, integrating the AFE module into the baseline + LCCT architecture improved IoU to 82.2% and Dice to 90.2%. This improvement primarily stems from two key mechanisms: The LCCT module implements channel-wise attention to effectively capture interdependencies among multi-scale features from the encoder, thereby mitigating semantic gaps between different feature layers. Concurrently, the AFE module adaptively learns channel-wise weights for multi-scale features, enabling the model to focus on regions of interest while suppressing interference from irrelevant areas. The synergistic integration of these two modules facilitates effective feature fusion, thereby further enhancing the model’s representational capacity.
Compared with the individual introduction of DKE modules or the combined introduction of the LCCT and AFE modules, the joint integration of all three modules leads to more substantial performance improvements. DECC-Net further increases the IoU to 83.3% and the Dice coefficient to 90.9%.
This study further conducted ablation experiments on the fusion strategies of the LCCT and AFE modules. The experimental results are shown in Table 8, where “+” and “−” denote summation fusion and sequential fusion of the output features from the two modules, respectively. All three feature fusion strategies improve segmentation performance, but summation fusion enables the model to achieve the best results.
In summary, our improvement strategy, which constructs the encoder and decoder using DKE modules and replaces skip connections with the LCCT and AFE modules to perform summation fusion of input features, simultaneously improves segmentation performance and maintains a relatively small parameter count. The experimental results fully validate the effectiveness of the proposed improvements.
3.3. Evaluation of Model Performance with Different Data Augmentation Levels
To investigate the segmentation accuracy of the models under different levels of data augmentation, we conducted multiple sets of experiments based on the original non-augmented data, varying the proportion of augmented samples relative to the original training set. Specifically, one set of experiments used the original tassel images without data augmentation as the training set, while the other four sets constructed training sets by expanding the original tassel images at ratios of 25%, 50%, 75%, and 100%, respectively. These five training sets were then used to train both the proposed DECC-Net and U-Net. The experimental results are shown in Figure 10. In the original group without data augmentation, DECC-Net achieved an IoU of 77.4% and a Dice of 87.3%, both higher than those of U-Net. As the level of data augmentation increased from 0% to 100%, both DECC-Net and U-Net showed varying degrees of performance improvement. However, at every augmentation level, DECC-Net achieved higher IoU and Dice scores than U-Net, demonstrating the superiority of its segmentation performance under different data conditions.
3.4. Robustness Analysis of DECC-Net
To further analyze the performance of DECC-Net and evaluate its robustness in different scenarios, we subdivided the test set into five categories based on weather conditions (sunny, cloudy) and tasseling stages (early, middle, late). Comparative tests were conducted between DECC-Net and U-Net, the mainstream model for maize tassel segmentation [18], with results presented in Table 9.
Experimental results showed that model performance varies across tasseling stages. Both DECC-Net and U-Net achieved the lowest performance at the early tasseling stage. In contrast, both models demonstrated improved segmentation performance during the middle and late tasseling stages. At the early tasseling stage, tassels are smaller in size and highly similar in morphology to leaf veins, while being easily occluded by canopy leaves, leading to lower segmentation accuracy. In the middle and late tasseling stages, the main bodies of tassels emerge from the canopy leaves, with more distinct morphological characteristics, thus facilitating accurate recognition and segmentation by the model.
Although segmentation performance varies across different tasseling stages for both models, DECC-Net outperforms U-Net in all stages. Notably, at the early tasseling stage, which imposes higher demands on the model’s feature extraction capability, DECC-Net achieved an IoU 5.1% higher than U-Net. The experimental results fully validate the robustness of the proposed DECC-Net and its effectiveness in tassel segmentation across the entire developmental cycle.
Under cloudy conditions, DECC-Net and U-Net achieved IoU values of 85.3% and 83.2%, respectively. In contrast, under sunny conditions, the IoU values of both models decreased to 81.4% and 74.6%. This performance decline under sunny conditions is attributed to intense light reflections on maize leaves, which introduce complex interference in the images. Specifically, strip-shaped light reflections may be misidentified as tassels, thereby reducing segmentation accuracy. Conversely, cloudy conditions minimize such reflection interference, facilitating more accurate tassel segmentation. The results further demonstrate that DECC-Net outperforms U-Net under both weather conditions, confirming its robustness in handling tassel segmentation under diverse environmental challenges.
Figure 11 shows the visualized tassel segmentation results of DECC-Net and U-Net across different tasseling stages and weather conditions. It was observed that U-Net exhibits significant limitations when performing the maize tassel segmentation task. Particularly in areas where tassels and leaf veins overlap and in situations with large light reflections on the leaves, U-Net, limited by its relatively weak feature extraction and fusion capabilities, may incorrectly identify leaf veins and reflected light on leaves as parts of the tassel, or conversely, misidentify tassels as background. This results in many false positives and negatives that affect the segmentation outcome. In comparison, DECC-Net significantly improves segmentation accuracy in complex environments by effectively capturing multi-scale tassel features and mitigating the semantic gap between the encoder and decoder through enhanced feature fusion capabilities.
Figure 12 illustrates the segmentation results of DECC-Net in complex field environments. When handling maize tassels under different weather conditions and at varying developmental stages, DECC-Net demonstrates strong generalizability.
4. Conclusions
In this paper, we focused on the semantic segmentation of maize tassels in cropland scenes. Using UAV-collected RGB images of field maize as the research object, we constructed a maize tassel dataset covering diverse tasseling stages and weather conditions. To address the challenges faced by existing network models in maize tassel segmentation, including low accuracy, insufficient generalization ability, and large parameter count, we proposed DECC-Net. The model employs DKE modules to construct the encoder and decoder, while enhancing feature fusion capabilities through the LCCT and AFE modules. The DKE modules improve the ability to extract critical features from images while reducing the parameter count. The LCCT and AFE modules collaboratively capture long-range contextual dependencies and enable effective fusion of multi-scale channel features, mitigating semantic gaps between encoder and decoder features. The performance of DECC-Net was validated and analyzed through extensive experiments, with the main conclusions as follows:
DECC-Net exhibits excellent segmentation performance, achieving 85.6% precision, 96.9% recall, 83.3% IoU, and 90.9% Dice scores, surpassing the baseline U-Net (with VGG16 as the encoder) adopted in existing maize tassel segmentation studies, as well as a series of representative advanced segmentation models in agricultural applications.
DECC-Net has a parameter count of only 3.67 M, which is lower than that of other mainstream semantic segmentation models. This indicates that DECC-Net can be more effectively applied to hardware-limited devices.
DECC-Net demonstrates better robustness and generalization ability, effectively performing maize tassel segmentation tasks across different tasseling stages and weather conditions. The segmentation results more closely resemble the actual morphology of maize tassels, thereby providing more precise guidance for field management during the tasseling period of maize.
Compared with existing studies, the DECC-Net proposed in this study has achieved promising segmentation results, though there remains room for improvement. Future research will focus on introducing semi-supervised learning methods to reduce the cost of data annotation in semantic segmentation tasks. Furthermore, we plan to collect data in more experimental regions and further enhance the diversity of maize varieties, aiming to further verify and improve the model’s generalization ability and robustness. Meanwhile, we will carry out tassel area estimation in actual agricultural production environments to enhance the model’s practical applicability in real-world scenarios.