Article

DSTANet: A Lightweight and High-Precision Network for Fine-Grained and Early Identification of Maize Leaf Diseases in Field Environments

1 College of Electrical Engineering and Information, Northeast Agricultural University, Harbin 150030, China
2 Key Laboratory of Northeast Smart Agricultural Technology, Ministry of Agriculture and Rural Affairs, Harbin 150030, China
3 Department of Academic Theory Research, Northeast Agricultural University, Harbin 150030, China
* Authors to whom correspondence should be addressed.
Sensors 2025, 25(16), 4954; https://doi.org/10.3390/s25164954
Submission received: 24 June 2025 / Revised: 30 July 2025 / Accepted: 6 August 2025 / Published: 10 August 2025
(This article belongs to the Section Smart Agriculture)

Abstract

Early and accurate identification of maize diseases is crucial for ensuring sustainable agricultural development. However, existing maize disease identification models face challenges including high inter-class similarity, intra-class variability, and limited capability in identifying early-stage symptoms. To address these limitations, we proposed DSTANet (decomposed spatial token aggregation network), a lightweight and high-performance model for maize leaf disease identification. In this study, we constructed a comprehensive maize leaf image dataset comprising six common disease types and healthy samples, with early and late stages of northern leaf blight and eyespot specifically differentiated. DSTANet employed MobileViT as the backbone architecture, combining the advantages of CNNs for local feature extraction with transformers for global feature modeling. To enhance lesion localization and mitigate interference from complex field backgrounds, DSFM (decomposed spatial fusion module) was introduced. Additionally, the MSTA (multi-scale token aggregator) was designed to leverage hidden-layer feature channels more effectively, improving information flow and preventing gradient vanishing. Experimental results showed that DSTANet achieved an accuracy of 96.11%, precision of 96.17%, recall of 96.11%, and F1-score of 96.14%. With only 1.9M parameters, 0.6 GFLOPs (floating point operations), and an inference speed of 170 images per second, the model meets real-time deployment requirements on edge devices. This study provided a novel and practical approach for fine-grained and early-stage maize disease identification, offering technical support for smart agriculture and precision crop management.

1. Introduction

Maize is a widely cultivated crop, serving not only as food but also as an industrial raw material with applications in many fields. Current maize cultivation practices often involve extensive and continuous monoculture, which can result in the rapid spread of diseases and lead to substantial economic losses [1]. In addition, the overuse of pesticides pollutes the field environment, reduces soil fertility, increases pathogen resistance, and threatens the sustainability of maize cultivation.
Traditional methods rely on manual identification based on the experience of agricultural experts. Jurado et al. (2006) [2] proposed a polymerase chain reaction detection method to identify pathogenic Fusarium in maize. However, these traditional methods are subjective and can identify diseases only when symptoms are pronounced, making the timely identification of early-stage diseases challenging. Kusumo et al. (2018) [3] and Panigrahi et al. (2020) [4] used machine learning algorithms to evaluate the effectiveness of various image features for maize disease detection. The introduction of machine learning enabled automatic disease identification, but disease features still could not be extracted automatically, and feature extractors had to be designed by hand.
With the development of deep learning and computer vision, fine-grained image identification technology has achieved significant breakthroughs. Representative algorithms include ResNet50 [5] and vision transformer [6]. Building upon these algorithms, researchers have made improvements to meet the needs of the automatic identification of crop diseases. Yang et al. (2023) [7] proposed re-GoogLeNet for the accurate identification of rice leaf diseases in field environments, enhancing the feature extraction capability for small, irregularly shaped spots on diseased leaves. Zhang et al. (2023) [8] introduced a progressive non-local means algorithm to solve the challenge of inter-class similarity and intra-class variability in tomato leaf diseases, and introduced a multi-channel automatic directional recurrent attention network. Liu et al. (2024) [9] constructed the GLDCNet for identifying grapevine leaf roll diseases based on UAV RGB images, achieving an accuracy of 99.57%. Liu et al. (2024) [10] constructed a Multi-scale constrained MCDCNet based on multi-branch convolution and deformable convolution to identify apple leaf diseases. Compared to SOTA models, it improved accuracy by 3.85% and could accurately classify five common apple leaf diseases. Zhang et al. (2024) [11] combined leaf vein features with other textural features to generate high-quality semantic features. They designed a Multi-Attention IBN Anti-aliasing Network based on Fourier analysis for identifying cassava leaf diseases.
At the same time, deep learning algorithms have also been widely applied to maize leaf disease identification in field environments. Zeng et al. (2022) [12] employed a lightweight dense-scale network to improve the identification accuracy of maize diseases. Addressing the challenges of small sample sizes and complex backgrounds, Li et al. (2023) [13] applied ACGAN to augment the maize disease dataset. By combining this with transfer learning, they established a practical method for identifying maize leaf diseases under field conditions. Xu et al. (2023) [14], building on ResNet50, introduced the ECA attention mechanism and the Adam optimizer to identify six types of maize diseases and pests, achieving an identification accuracy of 93.95%. Bai et al. (2024) [15] found that hyperspectral images could better capture the response of maize plants to leaf spot disease infection, which was conducive to improving early detection strategies. Wang et al. (2024) [16] proposed a texture-color dual-branch multi-scale residual shrinkage network (TC-MRSN). One branch of this network uses an improved LBP algorithm to extract texture features, and the other branch utilizes the RGB features of the convolutional neural network, retaining the feature information of small lesions. The results showed that the optimized model achieved an identification accuracy of 94.88%. Zhang et al. (2024) [17] introduced separable convolutions and attention mechanisms to enhance the extraction of maize disease features, proposing LSANNet for maize leaf disease identification and achieving an accuracy of 94.35%. To reduce unnecessary redundant spatial information, Wang et al. (2024) [18] used octave convolution to accelerate training. They established OSCRNet to achieve the interaction of different feature information within images. Li et al. (2025) [19] integrated high-frequency detail information into multiple layers of MobileNetV2, using HFFE (high-frequency feature extraction) to enhance the network’s ability to learn detail information, achieving an accuracy of 95.7%. To reduce the semantic ambiguity of RGB single-modal data and strengthen the connection between images and text, Wang et al. (2025) [20] proposed WCG-VMamba, which utilized multi-modal text and image data to identify four common maize diseases, improving recognition accuracy in complex environments and achieving an accuracy of 99.23%. Based on hyperspectral maize disease imagery, Liu et al. (2024) [21] proposed an attention-based spatial-spectral joint network, which enhanced the model’s identification capability by extracting features from both spatial and spectral dimensions. Compared to traditional disease identification methods based on image classification, object detection algorithms such as YOLO enable the precise detection and localization of disease lesions [22]. For instance, studies by Yang et al. (2024) [23] and Li et al. (2024) [24] both achieved high-precision identification of maize diseases by improving YOLOv8. However, this approach is highly dependent on accurately annotated datasets. Furthermore, the bounding box regression process incurs significant computational overhead, which increases training costs and poses challenges for deployment on edge devices.
However, in field environments, the inter-class similarity and intra-class variability of maize diseases pose a major challenge to precise identification. Different types of maize diseases can exhibit similar symptoms, which may lead to errors if identification relies only on the color and shape of the lesions. For instance, both gray leaf spot [25] and late-stage northern leaf blight [26] manifest as tan-brown stripes. Similarly, common rust [27] and eyespot [28] often present as small, yellowish-brown, spot-like lesions. In addition, the morphology of the same disease varies across stages: eyespot shows only sporadic spots in the early stage, while the infected leaves wither in the late stage. Therefore, identifying different maize leaf diseases correctly and in time is of vital importance in agricultural production to avoid the economic losses caused by widespread disease outbreaks.
In addition, most studies relied on public datasets with simple backgrounds and lacked early-stage disease identification. First, early-stage lesions are extremely small, often only a few pixels in size, making them easy to overlook. Second, early symptoms show only subtle color differences from healthy leaf tissue, typically slight yellow-green variations that are difficult to distinguish from natural color variation in leaves. Additionally, early symptoms have irregular morphology with blurred boundaries and lack distinct characteristic markers, posing significant challenges for automated recognition algorithms. Environmental factors also affect the accuracy of early detection, as changes in lighting conditions can alter the reflective properties of leaf surfaces, further increasing identification difficulty. Moreover, identifying early-stage diseases in time can reduce pesticide use, enable targeted spraying, avoid environmental pollution, and promote sustainable agricultural development. These challenges underline the practical importance of developing identification methods designed specifically for early-stage diseases.
To deal with the above-mentioned challenges, the main contributions of this study are as follows:
(1)
To address the inter-class similarity and intra-class variability of maize leaf diseases, the decomposed spatial fusion module (DSFM) was constructed, which could accurately locate the lesions on the leaves and overcome the influence of redundant environmental noise.
(2)
The multi-scale token aggregator (MSTA) was introduced, which utilized depthwise separable convolutions at different scales to fuse and complement feature information, thereby obtaining more complete and comprehensive representations.
(3)
Combining DSFM and MSTA, this study proposed a decomposed spatial token aggregation network (DSTANet) that integrated the advantages of local feature extraction of CNN and global feature extraction of transformer and had the ability to identify early-stage maize diseases.
(4)
A dataset comprising six different types of maize leaf diseases and healthy leaves was created. Data for both early and late stages of eyespot and northern leaf blight were collected. This dataset served as the basis for validating the superiority of DSTANet.

2. Materials and Methods

2.1. Materials

The maize disease images in this study were collected from Xiangyang Farm and Acheng Farm, located in Harbin City, China. The geographical information of the experimental area is shown in Figure 1, and the data acquisition protocol is shown in Table 1.
We expanded the data samples by combining the self-built dataset with public datasets. The final dataset contains healthy leaves (H), early-stage eyespot (ES-E), late-stage eyespot (ES-L), early-stage northern leaf blight (NLB-E), late-stage northern leaf blight (NLB-L), phosphorus deficiency (PD), zinc deficiency (ZD), common rust (CR), and gray leaf spot (GLS), with 1000 images each, for a total of 9000 images. The public datasets were PlantVillage [29] and PlantDoc [30], which are publicly available on Kaggle (https://www.kaggle.com/). Examples of the maize disease data are shown in Figure 2.
The original images were resized to 224 × 224 pixels. The dataset was then split randomly into training, validation, and test sets at a ratio of 6:2:2. In order to evaluate the model’s performance more objectively, we applied data augmentation techniques to the training set, including random brightness adjustment (increase and decrease), salt-and-pepper noise addition, random erasing, and random scaling.
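As a concrete illustration, augmentations of this kind can be expressed with torchvision transforms. The sketch below is not the exact pipeline used in this study; the parameter values (brightness range, scale range, noise amount, erasing probability) and the SaltPepperNoise helper class are assumptions made for illustration.

```python
import torch
from torchvision import transforms

class SaltPepperNoise:
    """Flip a small random fraction of pixels to black or white (illustrative helper)."""
    def __init__(self, amount=0.01):
        self.amount = amount

    def __call__(self, img):                 # img: float tensor in [0, 1], shape (C, H, W)
        mask = torch.rand(img.shape[1:])
        img = img.clone()
        img[:, mask < self.amount / 2] = 0.0        # pepper
        img[:, mask > 1 - self.amount / 2] = 1.0    # salt
        return img

train_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ColorJitter(brightness=0.3),               # random brightness up/down
    transforms.RandomAffine(degrees=0, scale=(0.8, 1.2)),  # random scaling
    transforms.ToTensor(),
    SaltPepperNoise(amount=0.01),                          # salt-and-pepper noise
    transforms.RandomErasing(p=0.25),                      # random erasing
])

eval_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
```

In such a setup, train_transform would be attached to the training split only, while the validation and test splits use eval_transform.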

2.2. Methodologies

2.2.1. Overall Structure of the Model

The similarity between different maize leaf diseases, the variability within the same disease across different stages, and the complexity of the real field environment lead to challenges for existing disease identification models in practical applications, including poor generalization and low accuracy. In order to address these problems, this study introduced a multi-stage neural network DSTANet that combined CNNs and transformers. This model took MobileViT [31] as the baseline and combined DSFM and MSTA. The model’s overall architectural framework is depicted in Figure 3a.

2.2.2. Decomposed Spatial Fusion Module

During feature extraction, attending to both inter-channel relationships and spatial positional relationships within feature maps could enable models to achieve better results. Classic attention mechanisms such as SENet [32] and CBAM [33] primarily considered encoding information based on inter-channel relationships, neglecting positional information. To effectively extract features of subtle lesions and mitigate the impact of inter-class similarity and intra-class variability for high-accuracy identification of maize leaf diseases, this study was inspired by ELA [34] and integrated one-dimensional convolution and group normalization [35] for feature enhancement. On this basis, a multi-branch feature extraction module, named the decomposed spatial fusion module (DSFM), was constructed. In the first and second branches, two one-dimensional positional feature maps were encoded to enhance the extraction of ambiguous and subtle features, capturing global contextual information. The third branch performed convolution and pooling operations for deeper processing and for learning more abstract local features. The structure of DSFM is illustrated in Figure 3b.
Firstly, the input $X \in \mathbb{R}^{C \times H \times W}$ is subjected to average pooling along the horizontal and vertical directions to capture long-range dependencies, mitigating the influence of irrelevant regions on predicting the disease category. The formulas are as follows, where $x_c$ denotes a single-channel feature map, and $X_c^h(h)$ and $X_c^w(w)$ capture the global receptive field and precisely locate the salient disease features:
$$X_c^h(h) = \frac{1}{W}\sum_{0 \le i < W} x_c(h, i)$$
$$X_c^w(w) = \frac{1}{H}\sum_{0 \le j < H} x_c(j, w)$$
The coordinate information obtained from the above formulas is input into a 1D convolution to enhance the positional encoding weights in the horizontal and vertical directions. Compared to 2D convolution, 1D convolution is not only adept at handling sequential information but is also more lightweight with higher computational efficiency. To precisely locate regions of interest, the DSFM employs 1D convolution with a kernel size of 7, which decides the coverage range of local interactions. The enhanced coordinate information is subsequently processed through group normalization, producing positional attention representations in both horizontal and vertical directions. The formulas are as follows.
$$y^h = \mathrm{GN}\left(\mathrm{Conv1d}_h\left(X_c^h(h)\right)\right)$$
$$y^w = \mathrm{GN}\left(\mathrm{Conv1d}_w\left(X_c^w(w)\right)\right)$$
The third branch employs a 3 × 3 convolution to expand the input $X$, thereby acquiring richer feature information, computed as $y_{ep} \in \mathbb{R}^{H \times W \times (C \times ep)}$. Subsequently, a pointwise convolution maps the feature channels to a higher dimension, and a dropout layer is introduced to prevent co-adaptation and overfitting within the model, yielding $y_{pro} \in \mathbb{R}^{H \times W \times D}$, where $D > C$.
$$y_{ep} = \mathrm{expand\_conv}(X)$$
$$y_{pro} = \mathrm{project\_conv}(y_{ep})$$
Finally, the feature tensors from the three parallel branches are fused and the weights are reallocated to obtain $Y$. The re-weight module consists of a depthwise separable convolution, average pooling, and a nonlinear activation function $\sigma$. The formula is as follows:
$$Y = \sigma\left(\mathrm{AvgPool}\left(\mathrm{DWConv2d}\left(y^h \cdot y^w \cdot y_{pro}\right)\right)\right)$$
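For readers who prefer code, the following PyTorch sketch shows one plausible realization of the three branches and the re-weight step described above. It is not the authors’ implementation: the expansion factor, group count, dropout rate, the choice to project branch three back to the input channel width, and the way the re-weighted output is returned are all assumptions (channels should be divisible by the group count).

```python
import torch
import torch.nn as nn

class DSFM(nn.Module):
    """Illustrative sketch of the decomposed spatial fusion module (DSFM)."""
    def __init__(self, channels, kernel_size=7, groups=8, expand=2, drop=0.1):
        super().__init__()
        pad = kernel_size // 2
        # Branches 1-2: 1D positional encoding along H and W, followed by group normalization.
        self.conv_h = nn.Conv1d(channels, channels, kernel_size, padding=pad, groups=channels)
        self.conv_w = nn.Conv1d(channels, channels, kernel_size, padding=pad, groups=channels)
        self.gn_h = nn.GroupNorm(groups, channels)
        self.gn_w = nn.GroupNorm(groups, channels)
        # Branch 3: 3x3 expansion convolution, pointwise projection, and dropout.
        self.expand_conv = nn.Sequential(
            nn.Conv2d(channels, channels * expand, 3, padding=1),
            nn.BatchNorm2d(channels * expand),
            nn.SiLU(),
        )
        self.project_conv = nn.Sequential(
            nn.Conv2d(channels * expand, channels, 1),
            nn.Dropout2d(drop),
        )
        # Re-weight module: depthwise separable conv, global average pooling, sigmoid.
        self.dw = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.pw = nn.Conv2d(channels, channels, 1)

    def forward(self, x):                        # x: (B, C, H, W)
        b, c, h, w = x.shape
        x_h = x.mean(dim=3)                      # average pooling along W -> (B, C, H)
        x_w = x.mean(dim=2)                      # average pooling along H -> (B, C, W)
        y_h = torch.sigmoid(self.gn_h(self.conv_h(x_h))).view(b, c, h, 1)
        y_w = torch.sigmoid(self.gn_w(self.conv_w(x_w))).view(b, c, 1, w)
        y_pro = self.project_conv(self.expand_conv(x))
        fused = y_pro * y_h * y_w                # broadcast positional attention onto branch 3
        gate = torch.sigmoid(self.pw(self.dw(fused)).mean(dim=(2, 3), keepdim=True))
        return fused * gate                      # re-weighted fused features
```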

2.2.3. Multi-Scale Token Aggregator

Traditional vision transformers lacked the local inductive bias characteristic of convolutional neural networks. MobileViT addressed this limitation by integrating the locality of CNNs with the globality of ViT. However, a single-scale token aggregation mechanism cannot fully leverage the abundant feature channel information within the hidden layers [36]. Therefore, we introduced the multi-scale token aggregator (MSTA). As shown in Figure 3c, the core of the module is a multi-scale aggregation mechanism that processes and fuses features at different scales using parallel depthwise separable convolutions. This enhances the model’s ability to perceive disease spots of various sizes in leaf disease images. Meanwhile, the use of depthwise separable convolutions effectively reduces the model’s computational overhead, satisfying the lightweight requirement.
Firstly, the input feature map $X \in \mathbb{R}^{H \times W \times C}$ is mapped to a higher dimension via a pointwise convolution to obtain $X_H \in \mathbb{R}^{H \times W \times (C \times e)}$, where $e$ is the channel expansion factor. This convolution layer consists of a 1 × 1 convolution, a GELU activation function, and a batch normalization layer. Subsequently, $X_H$ is processed in parallel by four different depthwise separable convolutions, with each convolution operating on one-quarter of the channels, resulting in $X_1, X_2, X_3, X_4$. Convolutions with kernel sizes of 3, 5, and 7 effectively capture multi-scale information, while the 1 × 1 convolution acts as a learnable channel dimension expansion factor. The formulas are as follows:
$$X_H = \mathrm{BN}\left(\mathrm{GELU}\left(\mathrm{Conv}_{1 \times 1}(X)\right)\right)$$
$$X_1 = \mathrm{DWConv}_{1 \times 1}\!\left(\tfrac{X_H}{4}\right),\quad X_2 = \mathrm{DWConv}_{3 \times 3}\!\left(\tfrac{X_H}{4}\right),\quad X_3 = \mathrm{DWConv}_{5 \times 5}\!\left(\tfrac{X_H}{4}\right),\quad X_4 = \mathrm{DWConv}_{7 \times 7}\!\left(\tfrac{X_H}{4}\right)$$
Finally, the outputs of the multi-scale depthwise separable convolutions are concatenated and added to the original features through a residual connection. This facilitates information flow, avoids vanishing gradients, and enables information fusion and complementarity between features at different scales. Another 1 × 1 pointwise convolution is then used for channel dimension reduction, restoring the feature map to the original input dimension and yielding the final output $X_{out} \in \mathbb{R}^{H \times W \times C}$. The formula is as follows:
$$X_{out} = \mathrm{Conv}_{1 \times 1}\left(\mathrm{Concat}(X_1, X_2, X_3, X_4) + X_H\right)$$
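A minimal PyTorch sketch of this aggregator is given below. The expansion factor e and the requirement that the expanded width be divisible by four are illustrative choices, not taken from the original implementation.

```python
import torch
import torch.nn as nn

class MSTA(nn.Module):
    """Illustrative sketch of the multi-scale token aggregator (MSTA)."""
    def __init__(self, channels, e=2):
        super().__init__()
        hidden = channels * e                         # expanded width; must be divisible by 4
        self.expand = nn.Sequential(
            nn.Conv2d(channels, hidden, 1),
            nn.GELU(),
            nn.BatchNorm2d(hidden),
        )
        q = hidden // 4
        # Four parallel depthwise convolutions, each on a quarter of the channels.
        self.branches = nn.ModuleList(
            [nn.Conv2d(q, q, k, padding=k // 2, groups=q) for k in (1, 3, 5, 7)]
        )
        self.reduce = nn.Conv2d(hidden, channels, 1)  # back to the input width

    def forward(self, x):                             # x: (B, C, H, W)
        xh = self.expand(x)                           # (B, C*e, H, W)
        chunks = torch.chunk(xh, 4, dim=1)            # split channels into quarters
        out = torch.cat([branch(c) for branch, c in zip(self.branches, chunks)], dim=1)
        out = out + xh                                # residual connection with X_H
        return self.reduce(out)                       # X_out: (B, C, H, W)
```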

2.2.4. Multi-Scale Token Aggregation Transformer

The multi-scale token aggregation transformer (MSTAT) is composed of three modules: local information encoder, global information encoder, and multi-scale feature fusion unit. The specific implementation is shown in Figure 3d.
Firstly, the local information encoder consists of a 3 × 3 convolution and a 1 × 1 convolution. The 3 × 3 convolution is used for local feature encoding, while the 1 × 1 pointwise convolution maps the feature map to a higher-dimensional feature space, resulting in the output $X_L$.
$$X_L = \mathrm{Conv}_{1 \times 1}\left(\mathrm{Conv}_{3 \times 3}(X)\right)$$
Secondly, to enable the model to learn global information with spatial inductive bias, in the transformer-based global information encoder, $X_L$ is unfolded into $N$ non-overlapping patches of equal size to obtain $X_U \in \mathbb{R}^{P \times N \times d}$ (where $P = h \times w$, $N = \frac{HW}{P}$; $P$ represents the number of pixels in each patch, $N$ represents the number of patches, and $h < n$, $w < n$ are the height and width of each patch, respectively). $X_U$ is input into $L$ stacked transformer layers to encode global information and obtain the dependencies between patches, resulting in $X_G \in \mathbb{R}^{P \times N \times d}$. The formulas are as follows:
$$X_U = \mathrm{Unfold}(X_L)$$
$$X_G(p) = \mathrm{Transformer}\left(X_U(p)\right), \quad 1 \le p \le P$$
Since $X_U(p)$ employs a 3 × 3 convolution for local information encoding, and $X_G(p)$ encodes the global information for the $p$-th location across $P$ patches, each weight value in $X_G$ represents an encoding of information from all pixels in $X$. Therefore, the overall effective receptive field of the MSTAT is $H \times W$ [31].
Unlike traditional vision transformers that discard inherent spatial relationships (both between patches and within each patch), this module retains the hierarchical structure at both levels. Specifically, it preserves the topological order of patches across the image while maintaining the absolute spatial arrangement of pixels within every local patch, thus capturing fine-grained positional information lost in standard ViT architectures. $X_G$ is then folded to obtain $X_F \in \mathbb{R}^{H \times W \times d}$.
$$X_F = \mathrm{Fold}(X_G)$$
In the multi-scale feature fusion unit, the MSTAT employs a 1 × 1 pointwise convolution to project $X_F$ into a $C$-dimensional space and then concatenates it with the original input $X$ to obtain $X_{Fm}$. Subsequently, a 3 × 3 convolution is applied to $X_{Fm}$ for preliminary feature fusion, and the result is fed into the MSTA for multi-scale feature aggregation to obtain $Y$.
$$X_{Fm} = \mathrm{Concat}\left(X, \mathrm{Conv}_{1 \times 1}(X_F)\right)$$
$$Y = \mathrm{MSTA}\left(\mathrm{Conv}_{3 \times 3}(X_{Fm})\right)$$
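The block as a whole can be sketched as follows, reusing the MSTA sketch from the previous subsection. The unfold and fold steps follow the MobileViT-style pixel-to-patch rearrangement described above; the transformer depth, head count, hidden dimension, and patch size are assumed values, and the spatial size is assumed to be divisible by the patch size.

```python
import torch
import torch.nn as nn

class MSTATBlock(nn.Module):
    """Illustrative sketch of the MSTAT block: local encoder, unfold, transformer, fold, fusion, MSTA."""
    def __init__(self, channels, dim=96, depth=2, heads=4, patch=2):
        super().__init__()
        self.patch = patch
        self.local = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),   # local feature encoding
            nn.Conv2d(channels, dim, 1),                   # project to transformer dimension d
        )
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=2 * dim, batch_first=True)
        self.global_enc = nn.TransformerEncoder(layer, num_layers=depth)
        self.proj = nn.Conv2d(dim, channels, 1)            # project X_F back to C channels
        self.fuse = nn.Conv2d(2 * channels, channels, 3, padding=1)
        self.msta = MSTA(channels)                         # MSTA sketch from the previous subsection

    def forward(self, x):                                  # x: (B, C, H, W); H, W divisible by patch
        b, c, h, w = x.shape
        p = self.patch
        xl = self.local(x)                                 # (B, dim, H, W)
        # Unfold: pixels sharing the same intra-patch position form one sequence over the N patches,
        # so the spatial order needed for folding back is preserved.
        xu = xl.reshape(b, -1, h // p, p, w // p, p).permute(0, 3, 5, 2, 4, 1)
        xu = xu.reshape(b * p * p, (h // p) * (w // p), -1)
        xg = self.global_enc(xu)                           # global encoding across patches
        # Fold: invert the rearrangement to recover (B, dim, H, W).
        xf = xg.reshape(b, p, p, h // p, w // p, -1).permute(0, 5, 3, 1, 4, 2)
        xf = xf.reshape(b, -1, h, w)
        y = self.fuse(torch.cat([x, self.proj(xf)], dim=1))
        return self.msta(y)                                # multi-scale token aggregation
```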

3. Results

3.1. Evaluation Index

In order to evaluate the effectiveness of the proposed model in identifying maize leaf diseases, this study used accuracy, precision, recall, and F1-score as evaluation indicators.
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$\mathrm{F1\text{-}score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
TP represents the instance for which both predicted and actual values are positive. FP represents the instance for which the predicted value is positive but the actual value is negative. FN represents the instance for which the predicted value is negative but the actual value is positive. TN represents the instance for which both predicted and actual values are negative. Accuracy is the proportion of all correctly classified samples to the total number of samples. Precision is the proportion of samples predicted as positive that were actually correct, reflecting the purity of the model’s positive predictions. Recall is the proportion of all actual positive samples that the model successfully identified, reflecting the model’s ability to find all relevant instances. F1-score provides a balanced measure between precision and recall by calculating their harmonic mean. To provide a comprehensive evaluation across all classes, the Macro Average was used.
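With scikit-learn, the macro-averaged metrics described above can be computed directly from the predicted and true labels, as in the short sketch below (the use of scikit-learn here is our assumption, not a statement about the authors’ tooling).

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate(y_true, y_pred):
    """Overall accuracy plus macro-averaged precision, recall, and F1-score."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro", zero_division=0),
        "recall": recall_score(y_true, y_pred, average="macro", zero_division=0),
        "f1_score": f1_score(y_true, y_pred, average="macro", zero_division=0),
    }

# Example: evaluate([0, 1, 2, 2], [0, 1, 2, 1])
```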

3.2. Parameter Setting

All the model training and testing in this study were deployed on the NVIDIA GeForce RTX 3090 24GB machine (NVIDIA, Santa Clara, CA, USA). The software environment was based on Python 3.8 and PyTorch 1.8.1. Other configurations included: the operating system was Ubuntu 20.04, the CPU was an AMD EPYC 7642, the RAM was 80 GB, and the GPU computing platform was CUDA 11.1.
The learning rate and optimizer are indispensable hyperparameters in the model training process and have a direct influence on the learning dynamics and final performance. The initial learning rate determines the step size of the weight update in each iteration, while the optimizer guides how the model weights are updated based on the loss function.
With the learning rate fixed at 0.001, the influence of SGD [37], Adam [38], and AdamW [39] on the accuracy of DSTANet was compared. The results are shown in Table 2, where the model obtained the best result with AdamW. To systematically evaluate the impact of different learning rates on DSTANet’s recognition accuracy, we then employed the AdamW optimizer across all experiments. The results are shown in Table 3, where the model performed best with an initial learning rate of 0.001. The final hyperparameter selection therefore converged on AdamW with a learning rate of 0.001, balancing convergence speed and recognition accuracy.
According to the above experimental results, the parameter settings for the initial learning rate, batch size, epochs, optimizer, learning rate adjustment strategy, and loss function during the training process are shown in Table 4.
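For reference, the configuration in Table 4 corresponds to a training setup along the following lines. DSTANet() and train_loader are placeholder names, and the label smoothing coefficient is an assumption, since only the use of label smoothing (not its value) is reported.

```python
import torch
import torch.nn as nn

# DSTANet() and train_loader are placeholders standing in for the model and data pipeline.
model = DSTANet().cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=150, eta_min=1e-9)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)   # smoothing value assumed, not reported

for epoch in range(150):
    model.train()
    for images, labels in train_loader:                # DataLoader with batch_size=64
        images, labels = images.cuda(), labels.cuda()
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()                                   # cosine annealing down to 1e-9
```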
The experimental content of this study included the analysis of the effectiveness of DSTANet architecture, the comparison with different models, the performance analysis of different models, the analysis of similar diseases identification, and the ablation experiments. The confusion matrix and Grad-CAM [40] feature visualization tools were combined to verify the effectiveness and novelty of the model.

3.3. Analysis of the Effectiveness of DSTANet Architecture

To validate the effectiveness of the model architecture proposed in this study and to ensure the stability of the results, a 5-fold cross-validation method was employed. The dataset was divided into five equal folds while preserving the original class distribution. In each iteration, four folds were used for training, and the remaining one fold was used for validation. This process was repeated five times, allowing each fold to serve as the validation set once, thereby effectively minimizing bias and variance. The experimental results for DSTANet were presented in Table 5.
The accuracy for each fold exceeded 95%, which provided strong evidence for the effectiveness of the model architecture constructed in this study. Through cross-validation, the phenomenon of obtaining an accidentally high accuracy due to a random data split was mitigated.
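A stratified 5-fold split of this kind can be produced with scikit-learn, as sketched below; the file names, random seed, and the use of StratifiedKFold itself are assumptions consistent with the description of class-preserving folds.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# 'paths' and 'labels' are placeholders for the image file paths and class indices.
paths = np.array([f"img_{i}.jpg" for i in range(9000)])
labels = np.repeat(np.arange(9), 1000)                 # 9 classes x 1000 images

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(paths, labels), start=1):
    train_paths, val_paths = paths[train_idx], paths[val_idx]
    # Build loaders from these index sets, train the model, and record the fold's metrics.
    print(f"Fold {fold}: {len(train_idx)} training / {len(val_idx)} validation images")
```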

3.4. Comparison with Different Models

To validate the superior identification performance of DSTANet for maize diseases in complex field environments, this section compares DSTANet with eight mainstream models. We selected representative state-of-the-art (SOTA) models in computer vision from recent years, including transformer-based models such as MobileViT, DaViT [41], ViT, and SwinTransformer [42], and traditional convolutional neural network (CNN) models such as MobileNetV3 [43], EfficientNetV2 [44], and ConvNeXt [45]. Additionally, we compared the currently popular VMamba [46] with our method.
The results of different models on the training set are shown in Figure 4 and Figure 5, where model accuracy gradually increased with the number of training epochs until convergence. The loss value on the y-axis in Figure 4 represents the difference between the predicted and actual labels; a larger loss value indicates a greater discrepancy between the model’s predictions and the ground truth. From the training loss curves, it can be seen that the loss of DSTANet was smaller, decreased faster, and fluctuated the least during training. As shown in Figure 5, after the 65th epoch, DSTANet showed faster convergence and higher accuracy than the other models. After the 120th epoch, the accuracy of DSTANet stabilized above 95%. The results of the different models are shown in Table 6. DSTANet’s accuracy, precision, recall, and F1-score were 96.11%, 96.17%, 96.11%, and 96.14%, respectively, all higher than those of the other advanced disease identification models. Compared to traditional convolutional neural networks, DSTANet’s accuracy was 3.17% higher than that of the best-performing EfficientNetV2. In comparison with the vision transformer models, DSTANet’s accuracy was 4.61% higher than that of the best-performing SwinTransformer. Compared to VMamba, a recent and widely discussed computer vision model, DSTANet’s accuracy was 3.33% higher.

3.5. Performance Analysis of Different Models

When designing and selecting a classification model, the number of parameters, floating point operations (FLOPs), and frames per second (FPS) are indispensable performance metrics. Table 7 compares the parameters, FLOPs, and FPS of DSTANet with those of the other models.
The parameters and FLOPs of DSTANet, at 1.9M and 0.6G, were only marginally higher than those of MobileViT and significantly lower than those of the other models. This is because the influence of the complex field environment was comprehensively considered during the design of the DSTANet structure, so the model is well matched to its deployment scenario. We measured the processing speed of each model on maize disease images under the same hardware environment. DSTANet achieved an FPS of 170, indicating its rapid response capability in practical deployments. Although some models have faster inference speeds than DSTANet, DSTANet remains the best choice when identification accuracy and computational complexity are considered together. These results collectively demonstrate that DSTANet maintains high performance while possessing low model complexity and fast inference speed, making it suitable for real-time deployment on hardware devices with limited resources.
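Parameter count and FPS can be measured with a simple benchmark such as the sketch below (FLOPs are usually obtained with a separate profiling library, which is omitted here). The warm-up length and the single 224 × 224 input reflect the setup described in this paper, while the helper itself is illustrative.

```python
import time
import torch

def benchmark(model, device="cuda", n_images=500):
    """Rough parameter count (in millions) and single-image throughput (FPS)."""
    model = model.to(device).eval()
    params_m = sum(p.numel() for p in model.parameters()) / 1e6
    x = torch.randn(1, 3, 224, 224, device=device)
    with torch.no_grad():
        for _ in range(20):                      # warm-up iterations
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.time()
        for _ in range(n_images):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
    fps = n_images / (time.time() - start)
    return params_m, fps
```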

3.6. Analysis of Inter-Class Similarity and Intra-Class Variability

One of the most significant challenges in automated maize leaf disease diagnosis lies in accurately distinguishing between morphologically similar diseases that share comparable visual characteristics while managing the inherent variability within each disease category. In real-world agricultural scenarios, corn leaf diseases often exhibit high inter-class similarity, where different diseases present similar symptoms, similar coloration patterns, and comparable lesion morphologies. Simultaneously, some diseases demonstrate intra-class variability due to disease progression stages. To provide a more intuitive demonstration of DSTANet’s superior identification capability in a field environment compared to other models, Figure 6 shows the confusion matrices plotted based on each model.
Analysis of the confusion matrix revealed that DSTANet confused eyespot with common rust in only two instances and could accurately distinguish between late-stage northern leaf blight and gray leaf spot. Regarding the different stages of northern leaf blight, DSTANet had only 17 confusion instances, which was also lower than other models. For the different stages of eyespot, DSTANet misclassified only 15 instances between them, but the confusion rate between these stages in other models was significantly higher than DSTANet’s. Compared with DSTANet, although the baseline model MobileViT misclassified CR into ES-L in only one instance, it could also accurately distinguish late-stage northern leaf blight and gray leaf spot. However, for different stages of northern leaf blight, MobileViT had 40 confusions, far more than DSTANet (only 17). For different stages of the eyespot, MobileViT misclassified 46 instances. It could be seen clearly that the overall performance of MobileViT was worse than that of DSTANet.
These results demonstrated DSTANet’s robustness to both inter-class similarity and intra-class variability, enabling high accuracy in identifying subtle lesion signatures. By exploiting the transformer’s strength in capturing global features, it also alleviated the loss of small-scale features caused by repeated convolutions.
The detailed identification results of DSTANet for each disease category are presented in Table 8. Out of the total 1800 images in the test set, DSTANet correctly classified 1730 samples. Its performance was significantly superior to other models.
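The per-class analysis above is based on a confusion matrix; such a matrix, together with the per-class precision and recall derived from it, can be obtained with scikit-learn as sketched below (the class name order follows Section 2.1, and the use of scikit-learn is an assumption about tooling).

```python
from sklearn.metrics import confusion_matrix, classification_report

class_names = ["H", "ES-E", "ES-L", "NLB-E", "NLB-L", "PD", "ZD", "CR", "GLS"]

# y_true and y_pred are the test-set labels and model predictions (placeholder names).
cm = confusion_matrix(y_true, y_pred, labels=range(len(class_names)))
print(cm)                                              # rows: true class, columns: predicted class
print(classification_report(y_true, y_pred, target_names=class_names, digits=4))
```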

3.7. Ablation Experiments

3.7.1. Ablation Experiment of MSTA

To validate the effectiveness of the 4-branch design in MSTA, this section conducted comprehensive ablation experiments to evaluate the impact of different branch configurations on model performance. This study compared four different MSTA variants: a 2-branch version using 1 × 1 and 3 × 3 depthwise convolutions with 1/2 channel allocation each, a 3-branch version incorporating 1 × 1, 3 × 3, and 5 × 5 convolutions with 1/3 channel allocation, our proposed 4-branch version utilizing 1 × 1, 3 × 3, 5 × 5, and 7 × 7 convolutions with 1/4 channel allocation, and a 5-branch version extending to 1 × 1, 3 × 3, 5 × 5, 7 × 7 and 9 × 9 convolutions with 1/5 channel allocation. All variants were evaluated under identical training conditions, including the same dataset, optimizer settings, and training epochs, to ensure fair comparison.
The experimental results, presented in Table 9, demonstrated that the 4-branch configuration achieved superior performance. The 4-branch MSTA attained an accuracy of 96.11%, precision of 96.17%, recall of 96.11%, and F1-score of 96.14%, representing a significant improvement of 2.04% in accuracy compared to the 2-branch version. While the 3-branch version showed better performance than the 2-branch version, it remained inferior to the 4-branch version. The 5-branch version exhibited performance degradation while increasing computational complexity, which suggested that too many branches may lead to feature redundancy.
The ablation experiment demonstrated that the 4-branch MSTA was the optimal design choice for maize leaf disease identification. The combination of 1 × 1, 3 × 3, 5 × 5, and 7 × 7 convolution kernels with quarter-channel allocation provided sufficient feature representation capacity while avoiding parameter redundancy. This configuration effectively matched the scale distribution characteristics of maize leaf diseases and demonstrated superior generalization capability across different pathological conditions, providing strong support for our architectural design decisions.

3.7.2. Ablation Experiment of DSTANet

To validate the effectiveness of the DSFM and MSTA modules introduced in this study, an ablation experiment was conducted in this section. Based on the baseline MobileViT, the DSFM and the MSTA were added separately, resulting in DSFMViT and MSTAViT. These models, along with MobileViT and DSTANet, were compared.
Table 10 presents the performance of all four models. DSTANet achieved the best overall results with 96.11% accuracy, exceeding DSFMViT (94.85%) by 1.26% and MSTAViT (94.75%) by 1.36%. Both DSFMViT and MSTAViT significantly outperformed MobileViT (90.94%).
To evaluate whether the proposed method could overcome environmental interference and distinguish similar diseases, the models were visualized using Grad-CAM. The results of the feature visualization are shown in Table 11. The red regions mark the areas the model treats as most important, illustrating how spatial image features correspond to class-specific weights in the classification model.
For MobileViT, some ground and leaf edge regions were shown in red and dark blue, which reflected that MobileViT’s attention to diseases was diverted, and the complex environment affected the model’s weight calculation for diseases. For example, in the case of PD and ZD, MobileViT failed to effectively differentiate between the ground and weeds, assigning excessive weight to them. Relative to MobileViT, MSTAViT enhanced the perception of lesions by expanding the receptive field for effective feature extraction while simultaneously overcoming the impact of environmental noise. In contrast, DSFMViT excelled at mitigating noise interference, focusing its regions of interest more precisely on diseased areas, which resulted in more accurate and comprehensive localization of pathogenic information. DSTANet combined the advantages of DSFM and MSTA, comprehensively perceiving the lesion while accurately extracting its location information, and was not affected by noise.
Furthermore, for diseases characterized by spots, such as CR, ES-E, and ES-L, DSTANet disregarded healthy parts of the leaf, intensified its focus on the lesions, and increased the weight assigned to them. For diseases manifesting as stripes, such as GLS and NLB, DSTANet could also effectively and precisely concentrate its attention on the diseased locations.
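For completeness, a minimal Grad-CAM routine of the kind used for these visualizations is sketched below. It is a generic implementation rather than the authors’ code; target_layer would typically be the last convolutional layer of the network, and the choice of layer is left to the user.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx=None):
    """Minimal Grad-CAM: weight the target layer's activations by the spatially
    averaged gradients of the class score, then apply ReLU and normalize."""
    feats, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

    logits = model(image.unsqueeze(0))                   # image: (3, 224, 224)
    if class_idx is None:
        class_idx = logits.argmax(dim=1).item()
    model.zero_grad()
    logits[0, class_idx].backward()

    weights = grads["a"].mean(dim=(2, 3), keepdim=True)  # global-average-pooled gradients
    cam = F.relu((weights * feats["a"]).sum(dim=1))      # weighted sum over channels
    cam = F.interpolate(cam.unsqueeze(1), size=image.shape[1:],
                        mode="bilinear", align_corners=False)[0, 0]
    h1.remove(); h2.remove()
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
```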

4. Discussion

The architecture effectiveness of DSTANet was validated through 5-fold cross-validation. The consistent performance across all folds (>95% accuracy) demonstrates that the observed high accuracy is not an artifact of favorable data partitioning but rather reflects the inherent effectiveness of our architectural design. This stability across different data splits provides strong evidence for the model’s generalization capability and suggests that the performance improvements achieved by DSTANet are systematic rather than coincidental.
In the comparison with different models, this study comprehensively evaluated DSTANet against eight other mainstream disease identification models. The results showed that DSTANet outperformed all the comparison models on the four key indicators of accuracy, precision, recall, and F1-score. Compared with EfficientNetV2, the best-performing traditional convolutional neural network, the accuracy of DSTANet was 3.17% higher. In the comparison with the vision transformer networks, the accuracy of DSTANet was also 4.67% higher than that of ViT. These results indicate that the other models were disturbed to varying degrees by factors such as inter-class similarity, intra-class variability, and complex field background noise, whereas DSTANet could effectively overcome these challenges and deliver outstanding identification performance. Furthermore, DSTANet offered a favorable balance of parameter count, FLOPs, and inference speed relative to the other models.
In the analysis of similar disease identification, this study visualized the identification results using the confusion matrix. DSTANet could accurately distinguish between gray leaf spot and northern leaf blight, as well as between common rust and eyespot. There were only two cases of misclassification between CR and ES, and GLS and NLB-L were identified precisely. Meanwhile, DSTANet also showed clear advantages over the other models in distinguishing the early and late stages of northern leaf blight and eyespot. As shown in Table 8, the precision of DSTANet for ES-E and NLB-E reached 93.37% and 95.29%, the recall reached 91.50% and 91.00%, and the F1-scores reached 92.42% and 93.10%. This is because DSTANet inherits the advantages of both the CNN and the transformer.
The ablation experiment on MSTA validated the rationality of the 1 × 1, 3 × 3, 5 × 5, and 7 × 7 convolutions with 1/4 channel allocation. The 4-branch structure provided sufficient feature representation capacity while avoiding parameter redundancy. The ablation experiments on DSTANet validated the effectiveness of the DSFM and MSTA. As shown in Table 10, models incorporating DSFM or MSTA individually improved accuracy over MobileViT by 3.91% and 3.81%, respectively. DSTANet achieved accuracies 1.26% and 1.36% higher than DSFMViT and MSTAViT, respectively, and 5.17% higher than MobileViT.
Visualization of the identification effects using Grad-CAM showed that the model with the DSFM could effectively concentrate its attention on lesion areas, distinguishing between the background and leaves while assigning different weights to the diseased and healthy regions of the leaves. Furthermore, the model introducing DSFM could capture more detailed texture information and preserve richer feature information from leaf lesions. These findings demonstrate the effectiveness of the DSFM.
MSTAViT demonstrated superior lesion perception capability compared to MobileViT. While MobileViT only focused on partial diseased regions, MSTAViT’s attention endeavored to cover the entire lesion area as comprehensively as possible. These results substantiated that multi-scale token aggregation, as opposed to single-scale convolutional feature extraction methods, could fully integrate information between channel feature maps, thereby enhancing recognition accuracy. DSTANet combined the advantages of DSFM and MSTA, enabling it to not only accurately identify leaf lesion locations while ignoring irrelevant background noise, but also possess stronger feature extraction and information perception capabilities.
However, this study still has the following limitations:
(1)
DSTANet lacks the ability to identify multiple labels, and can only recognize the disease with the most prominent symptoms. This limitation may lead to the spread of overlooked diseases, thus missing the best window period for disease prevention and control, increasing the cost of prevention, causing a decline in yield and a deterioration in corn quality, and reducing farmers’ income.
(2)
The high accuracy of DSTANet relies on a large amount of training data. However, the data collection of maize disease is subject to various restrictions, and it takes a lot of manpower and material resources to collect enough data.
(3)
The dataset we used only covers a few types of maize leaf diseases, and it still needs to be improved to strengthen the generalization ability of the model.
These problems limit further improvement of the model and remain to be addressed. In future work, we will explore a multi-label disease recognition algorithm that fuses text and image multi-modal data, and combine generative AI for high-quality data augmentation to improve the performance of the model. We will also deploy the model on mobile devices for farmers’ use as soon as possible, to guide the precise prevention and control of maize diseases and contribute to the sustainable development of the maize industry.

5. Conclusions

Taking maize leaf diseases in field environments as the research subject, we created a dataset containing six different categories of maize leaf diseases and healthy leaves. Data for both early and late stages of eyespot and northern leaf blight were also collected. Addressing the challenges brought by inter-class similarity and intra-class variability among leaf diseases, as well as noise interference in the field environment, this study designed DSTANet, which combines CNNs and transformers and introduces the DSFM and MSTA to improve identification performance. As a result, DSTANet achieved an accuracy of 96.11%, a precision of 96.17%, a recall of 96.11%, and an F1-score of 96.14%. Furthermore, with a parameter count of only 1.9 M, 0.6 GFLOPs, and the capability to recognize 170 images per second, its performance was significantly superior to that of the other models.
In the DSTANet constructed in this study, the introduction of DSFM and MSTA not only simplified the feature extraction process, achieving precise localization of leaf disease lesions, but also significantly enhanced the perception of disease information within the extracted features by fusing multi-scale information. This integration greatly improved the model’s identification efficiency and accuracy. The experimental results demonstrated that DSTANet accurately identified leaf diseases across various types and stages. The introduction of this algorithm enabled users to accurately control the occurrence of diseases in fields, providing technical support for early prevention and control of field diseases, and thereby promoting the precise control and scientific management of maize diseases.
Ablation experiments conducted on the MSTA module and the DSTANet architecture supported the rationality of our design choices. The ablation experiment on MSTA showed that the 4-branch MSTA had sufficient feature representation ability while avoiding parameter redundancy, effectively matching the scale distribution characteristics of maize leaf diseases. The ablation experiments on DSTANet also confirmed the effectiveness of DSFM and MSTA, with DSTANet achieving an accuracy of 96.11%, 5.17% higher than the baseline MobileViT.
Compared with other models, DSTANet had significant superiority and could effectively identify early-stage maize diseases, providing guidance for achieving precise prevention and control of early diseases. Future work will focus on model improvement, dataset expansion, and extending this framework to other crop disease identification.

Author Contributions

Conceptualization, X.G. and Y.J.; methodology, X.G., S.D. and Y.J.; software, J.W.; validation, X.G., L.H. and Y.L.; resources, S.D. and Y.J.; data curation, Y.L. and X.G.; writing—original draft preparation, X.G.; writing—review and editing, X.G., Y.J., L.H., Y.C. and S.D.; visualization, X.G. and L.H.; supervision, S.D. and Y.J.; project administration, S.D. and Y.J.; funding acquisition, S.D. and Y.J. All authors have read and agreed to the published version of the manuscript.

Funding

This study was funded by the Project of Laboratory of Advanced Agricultural Sciences, Heilongjiang Province (ZY04JD05-011).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Chapwanya, M.; Matusse, A.; Dumont, Y. On Synergistic Co-Infection in Crop Diseases. The Case of the Maize Lethal Necrosis Disease. Appl. Math. Modell. 2021, 90, 912–942. [Google Scholar] [CrossRef]
  2. Jurado, M.; Vázquez, C.; Marín, S.; Sanchis, V.; González-Jaén, M.T. PCR-based strategy to detect contamination with mycotoxigenic Fusarium species in maize. Syst. Appl. Microbiol. 2006, 29, 681–689. [Google Scholar] [CrossRef]
  3. Kusumo, B.S.; Heryana, A.; Mahendra, O.; Pardede, H.F. Machine Learning-Based for Automatic Detection of Corn-Plant Diseases Using Image Processing. In Proceedings of the 2018 International Conference on Computer, Control, Informatics and Its Applications (IC3INA), Tangerang, Indonesia, 1–2 November 2018; pp. 93–97. [Google Scholar] [CrossRef]
  4. Panigrahi, K.P. Maize Leaf Disease Detection and Classification Using Machine Learning Algorithms. In Progress in Computing, Analytics and Networking; Springer Nature: Berlin/Heidelberg, Germany, 2020; Volume 1119. [Google Scholar] [CrossRef]
  5. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. Available online: http://arxiv.org/abs/1512.03385 (accessed on 18 October 2024).
  6. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929. [Google Scholar]
  7. Yang, L.; Yu, X.; Zhang, S.; Long, H.; Zhang, H.; Xu, S.; Liao, Y. GoogLeNet Based on Residual Network and Attention Mechanism Identification of Rice Leaf Diseases. Comput. Electron. Agric. 2023, 204, 107543. [Google Scholar] [CrossRef]
  8. Zhang, Y.; Huang, S.; Zhou, G.; Hu, Y.; Li, L. Identification of Tomato Leaf Diseases Based on Multi-Channel Automatic Orientation Recurrent Attention Network. Comput. Electron. Agric. 2023, 205, 107605. [Google Scholar] [CrossRef]
  9. Liu, Y.; Su, J.; Zheng, Z.; Liu, D.; Song, Y.; Fang, Y.; Yang, P.; Su, B. GLDCNet: A Novel Convolutional Neural Network for Grapevine Leafroll Disease Recognition Using UAV-Based Imagery. Comput. Electron. Agric. 2024, 218, 108668. [Google Scholar] [CrossRef]
  10. Liu, B.; Huang, X.; Sun, L.; Wei, X.; Ji, Z.; Zhang, H. MCDCNet: Multi-Scale Constrained Deformable Convolution Network for Apple Leaf Disease Detection. Comput. Electron. Agric. 2024, 222, 109028. [Google Scholar] [CrossRef]
  11. Zhang, J.; Zhang, B.; Qi, C.; Nyalala, I.; Mecha, P.; Chen, K.; Gao, J. MAIANet: Signal Modulation in Cassava Leaf Disease Classification. Comput. Electron. Agric. 2024, 225, 109351. [Google Scholar] [CrossRef]
  12. Zeng, W.; Li, H.; Hu, G.; Liang, D. Lightweight Dense-Scale Network (LDSNet) for Corn Leaf Disease Identification. Comput. Electron. Agric. 2022, 197, 106943. [Google Scholar] [CrossRef]
  13. Li, E.; Wang, L.; Xie, Q.; Gao, R.; Su, Z.; Li, Y. A Novel Deep Learning Method for Maize Disease Identification Based on Small Sample-Size and Complex Background Datasets. Ecol. Inform. 2023, 75, 102011. [Google Scholar] [CrossRef]
  14. Xu, W.; Li, W.; Wang, L.; Pompelli, M.F. Enhancing Corn Pest and Disease Recognition through Deep Learning: A Comprehensive Analysis. Agronomy 2023, 13, 2242. [Google Scholar] [CrossRef]
  15. Bai, Y.; Nie, C.; Yu, X.; Gou, M.; Liu, S.; Zhu, Y.; Jiang, T.; Jia, X.; Liu, Y.; Nan, F.; et al. Comprehensive Analysis of Hyperspectral Features for Monitoring Canopy Maize Leaf Spot Disease. Comput. Electron. Agric. 2024, 225, 109350. [Google Scholar] [CrossRef]
  16. Wang, H.; Pan, X.; Zhu, Y.; Li, S.; Zhu, R. Maize Leaf Disease Recognition Based on TC-MRSN Model in Sustainable Agriculture. Comput. Electron. Agric. 2024, 221, 108915. [Google Scholar] [CrossRef]
  17. Zhang, F.; Bao, R.; Yan, B.; Wang, M.; Zhang, Y.; Fu, S. LSANNet: A Lightweight Convolutional Neural Network for Maize Leaf Disease Identification. Biosyst. Eng. 2024, 248, 97–107. [Google Scholar] [CrossRef]
  18. Wang, P.; Xiong, Y.; Zhang, H. Maize Leaf Disease Recognition Based on Improved MSRCR and OSCRNet. Crop Prot. 2024, 183, 106757. [Google Scholar] [CrossRef]
  19. Li, H.; Ruan, C.; Zhao, J.; Huang, L.; Dong, Y.; Huang, W.; Liang, D. Integrating High-Frequency Detail Information for Enhanced Corn Leaf Disease Recognition: A Model Utilizing Fusion Imagery. Eur. J. Agron. 2025, 164, 127489. [Google Scholar] [CrossRef]
  20. Wang, H.; He, M.; Zhu, M.; Liu, G. WCG-VMamba: A Multi-Modal Classification Model for Corn Disease. Comput. Electron. Agric. 2025, 230, 109835. [Google Scholar] [CrossRef]
  21. Liu, J.; Liu, F.; Fu, J. An Attention-Based Spatial-Spectral Joint Network for Maize Hyperspectral Images Disease Detection. Agriculture 2024, 14, 1951. [Google Scholar] [CrossRef]
  22. Badgujar, C.M.; Poulose, A.; Gan, H. Agricultural Object Detection with You Only Look Once (YOLO) Algorithm: A Bibliometric and Systematic Literature Review. Comput. Electron. Agric. 2024, 223, 109090. [Google Scholar] [CrossRef]
  23. Yang, S.; Yao, J.; Teng, G. Corn Leaf Spot Disease Recognition Based on Improved YOLOv8. Agriculture 2024, 14, 666. [Google Scholar] [CrossRef]
  24. Li, R.; Li, Y.; Qin, W.; Abbas, A.; Li, S.; Ji, R.; Wu, Y.; He, Y.; Yang, J. Lightweight Network for Corn Leaf Disease Identification Based on Improved YOLO V8s. Agriculture 2024, 14, 220. [Google Scholar] [CrossRef]
  25. Zhong, T.; Zhu, M.; Zhang, Q.; Zhang, Y.; Deng, S.; Guo, C.; Xu, L.; Liu, T.; Li, Y.; Bi, Y. The ZmWAKL–ZmWIK–ZmBLK1–ZmRBOH4 Module Provides Quantitative Resistance to Gray Leaf Spot in Maize. Nat. Genet. 2024, 56, 315–326. [Google Scholar] [CrossRef] [PubMed]
  26. Sun, J.; Yang, Y.; He, X.; Wu, X. Northern maize leaf blight detection under complex field environment based on deep learning. IEEE Access 2020, 8, 33679–33688. [Google Scholar] [CrossRef]
  27. Debnath, S.; Chhetri, S.; Biswas, S. Southern rust disease of corn—A review. Int. J. Curr. Microbiol. App. Sci. 2019, 8, 855–862. [Google Scholar] [CrossRef]
  28. Chen, N.; Xiao, S.; Sun, J.; He, L.; Liu, M.; Gao, W.; Xu, J.; Wang, H.; Huang, S.; Xue, C. Virulence and Molecular Diversity in the Kabatiella Zeae Population Causing Maize Eyespot in China. Plant Dis. 2020, 104, 3197–3206. [Google Scholar] [CrossRef]
  29. Hughes, D.P.; Salathe, M. An Open Access Repository of Images on Plant Health to Enable the Development of Mobile Disease Diagnostics. arXiv 2016, arXiv:1511.08060. [Google Scholar]
  30. Singh, D.; Jain, N.; Jain, P.; Kayal, P.; Kumawat, S.; Batra, N. PlantDoc: A Dataset for Visual Plant Disease Detection. In Proceedings of the 7th ACM IKDD CoDS and 25th COMAD, Hyderabad, India, 5–7 January 2020; pp. 249–253. [Google Scholar] [CrossRef]
  31. Mehta, S.; Rastegari, M. Mobilevit: Light-weight, general-purpose, and mobile-friendly vision transformer. arXiv 2021, arXiv:2110.02178. [Google Scholar]
  32. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-Excitation Networks. arXiv 2019, arXiv:1709.01507. [Google Scholar]
  33. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  34. Xu, W.; Wan, Y. ELA: Efficient Local Attention for Deep Convolutional Neural Networks 2024. arXiv 2024, arXiv:2403.01123. [Google Scholar]
  35. Wu, Y.; He, K. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar] [CrossRef]
  36. Lou, M.; Zhang, S.; Zhou, H.-Y.; Yang, S.; Wu, C.; Yu, Y. TransXNet: Learning Both Global and Local Dynamics With a Dual Dynamic Token Mixer for Visual Recognition. IEEE Trans. Neural Networks Learn. Syst. 2025, 36, 11534–11547. [Google Scholar] [CrossRef]
  37. Song, S.; Chaudhuri, K.; Sarwate, A.D. Stochastic Gradient Descent with Differentially Private Updates. In Proceedings of the 2013 IEEE Global Conference on Signal and Information Processing, Austin, TX, USA, 3–5 December 2013; pp. 245–248. [Google Scholar]
  38. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. Available online: https://arxiv.org/abs/1412.6980v9 (accessed on 15 June 2025).
  39. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
  40. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. Int. J. Comput. Vis. 2020, 128, 336–359. [Google Scholar] [CrossRef]
  41. Ding, M.; Xiao, B.; Codella, N.; Luo, P.; Wang, J.; Yuan, L. DaViT: Dual Attention Vision Transformers. In Computer Vision—ECCV 2022; Springer Nature Switzerland: Cham, Switzerland, 2022; Volume 13684, pp. 74–92. ISBN 978-3-031-20052-6. [Google Scholar]
  42. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  43. Howard, A.; Sandler, M.; Chu, G.; Chen, L.-C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for MobileNetV3. In Proceedings of the IEEE/CVF International Conference on Computer Vision 2019, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
  44. Tan, M.; Le, Q.V. EfficientNetV2: Smaller Models and Faster Training. In Proceedings of the Machine Learning Research, Virtual, 18–24 July 2021; Volume 139, pp. 10096–10106. [Google Scholar]
  45. Liu, Z.; Mao, H.; Wu, C.-Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A Convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986. [Google Scholar]
  46. Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Jiao, J.; Liu, Y. VMamba: Visual State Space Model. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 10–15 December 2024; Volume 37, pp. 103031–103063. [Google Scholar]
Figure 1. Geographical location of the experimental areas.
Figure 2. Example images of maize leaf diseases.
Figure 3. Overall model architecture: (a) Internal structure of DSTANet (The arrow represents downsampling). (b) Internal structure of decomposed spatial pusion module. (c) Internal structure of multi-scale token aggregation transformer. (d) Internal structure of multi-scale token aggregator.
Figure 3. Overall model architecture: (a) Internal structure of DSTANet (The arrow represents downsampling). (b) Internal structure of decomposed spatial pusion module. (c) Internal structure of multi-scale token aggregation transformer. (d) Internal structure of multi-scale token aggregator.
Sensors 25 04954 g003
Figure 4. Comparison of loss of different models.
Figure 4. Comparison of loss of different models.
Sensors 25 04954 g004
Figure 5. Comparison of accuracy of different models.
Figure 5. Comparison of accuracy of different models.
Sensors 25 04954 g005
Figure 6. Confusion matrix of maize disease classification.
Figure 6. Confusion matrix of maize disease classification.
Sensors 25 04954 g006
Table 1. Data acquisition protocol for maize diseases.
Data Acquisition Protocol | Parameters
Camera model | Xiaomi 13 Ultra (Xiaomi, Beijing, China), Redmi K40 (Xiaomi, Beijing, China)
Camera lens | IMX989 (Sony, Tokyo, Japan), IMX582 (Sony, Tokyo, Japan)
Illumination condition | Sunny, cloudy, rainy, morning, afternoon
Collection environment | Complex backgrounds (weed, soil, sky, etc.) under natural lighting conditions
Capture distance | 0.1–0.5 m between the leaf and the camera
Original image resolution | 3072 × 4096 pixels
Collection time | June to August 2024
Table 2. Comparison of DSTANet performance for different optimizers.
Optimizer | Epoch | Accuracy | Precision | Recall | F1-Score
SGD | 150 | 91.45% | 90.54% | 90.86% | 90.70%
Adam | 150 | 88.43% | 86.43% | 87.22% | 86.82%
AdamW | 150 | 96.11% | 96.17% | 96.11% | 96.14%
Table 3. Comparison of DSTANet performance with different learning rates.
Learning Rate | Epoch | Accuracy | Precision | Recall | F1-Score
0.01 | 150 | 53.54% | 52.25% | 43.53% | 47.49%
0.001 | 150 | 96.11% | 96.17% | 96.11% | 96.14%
0.0001 | 150 | 92.03% | 91.05% | 91.18% | 91.11%
0.00001 | 150 | 76.58% | 73.92% | 72.78% | 73.35%
Table 4. Training parameter settings.
Parameter | Value
Initial learning rate | 1 × 10⁻³
Batch size | 64
Epochs | 150
Optimizer | AdamW
LR scheduler | Cosine annealing
Minimum learning rate | 1 × 10⁻⁹
Loss function | Cross-entropy loss with label smoothing
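For reproducibility, the configuration in Table 4 can be expressed as a minimal PyTorch sketch. The `model` object is a placeholder and the label-smoothing coefficient of 0.1 is an assumption, since that value is not listed in the table.

```python
from torch import nn, optim

# Minimal sketch of the Table 4 training configuration.
# `model` is a placeholder; the label-smoothing value 0.1 is an assumption.
def build_training_setup(model, num_epochs=150):
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)      # cross-entropy with label smoothing
    optimizer = optim.AdamW(model.parameters(), lr=1e-3)      # initial learning rate 1e-3
    scheduler = optim.lr_scheduler.CosineAnnealingLR(         # cosine annealing down to 1e-9
        optimizer, T_max=num_epochs, eta_min=1e-9
    )
    return criterion, optimizer, scheduler
```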
Table 5. DSTANet’s 5-fold cross-validation results.
Fold | Accuracy | Precision | Recall | F1-Score
Fold 1 | 95.56% | 96.16% | 95.23% | 95.69%
Fold 2 | 95.33% | 95.48% | 95.32% | 95.40%
Fold 3 | 95.21% | 95.64% | 94.97% | 95.30%
Fold 4 | 95.46% | 94.89% | 94.76% | 94.82%
Fold 5 | 95.67% | 96.05% | 95.25% | 95.65%
Average | 95.44 ± 0.18% | 95.64 ± 0.51% | 95.11 ± 0.23% | 95.37 ± 0.35%
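The average row can be reproduced directly from the per-fold values; the short sketch below uses the accuracy column and assumes a sample standard deviation, which matches the reported spread up to rounding.

```python
import statistics

# Per-fold accuracies from Table 5 (in %); mean and sample standard deviation.
fold_accuracy = [95.56, 95.33, 95.21, 95.46, 95.67]
mean_acc = statistics.mean(fold_accuracy)
std_acc = statistics.stdev(fold_accuracy)
# Prints "Accuracy: 95.45 ± 0.18%"; the table's 95.44 ± 0.18% may reflect a
# different rounding or a population-standard-deviation convention.
print(f"Accuracy: {mean_acc:.2f} ± {std_acc:.2f}%")
```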
Table 6. Comparison of results of different models.
Model | Accuracy | Precision | Recall | F1-Score
MobileViT | 90.94% | 91.34% | 90.94% | 91.14%
DaViT | 89.33% | 89.84% | 89.33% | 89.58%
ViT | 91.44% | 91.73% | 91.44% | 91.58%
SwinTransformer | 91.50% | 92.04% | 91.49% | 91.76%
MobileNetV3 | 92.39% | 92.57% | 92.39% | 92.48%
EfficientNetV2 | 92.94% | 93.29% | 92.94% | 93.11%
ConvNeXt | 91.72% | 92.01% | 91.72% | 91.86%
VMamba | 92.78% | 92.94% | 92.78% | 92.86%
DSTANet | 96.11% | 96.17% | 96.11% | 96.14%
Table 7. Performance comparison between DSTANet and different models.
Model | Accuracy | Parameters | FLOPs | FPS
MobileViT | 90.94% | 1.2M | 0.41G | 155
DaViT | 89.33% | 28.3M | 4.5G | 77
ViT | 91.44% | 86.6M | 17.6G | 74
SwinTransformer | 91.50% | 29M | 4.5G | 124
MobileNetV3 | 92.39% | 2.5M | 5.9G | 202
EfficientNetV2 | 92.94% | 22M | 8.8G | 88
ConvNeXt | 91.72% | 28.6M | 4.5G | 155
VMamba | 92.77% | 22M | 4.5G | 187
DSTANet | 96.11% | 1.9M | 0.6G | 170
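The efficiency figures in Table 7 depend on how parameters and throughput are measured. The sketch below shows one common way to obtain a parameter count and a batch-size-1 FPS estimate in PyTorch; the input resolution, warm-up length, and run count are assumptions, and FLOPs would additionally require a profiling tool.

```python
import time
import torch

# Sketch of parameter counting and FPS measurement; `model` is a placeholder
# for the trained network, and the 256x256 input size is an assumption.
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

@torch.no_grad()
def measure_fps(model, device="cuda", input_size=(1, 3, 256, 256), n_runs=300):
    model.eval().to(device)
    x = torch.randn(*input_size, device=device)
    for _ in range(20):                     # warm-up iterations
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_runs):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return n_runs / (time.time() - start)   # images per second at batch size 1
```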
Table 8. The identification results of DSTANet for various diseases.
Type | Precision | Recall | F1-Score
Common Rust (CR) | 98.50% | 98.50% | 98.50%
Gray Leaf Spot (GLS) | 98.50% | 98.50% | 98.50%
Healthy (H) | 98.97% | 96.50% | 97.72%
Early-stage Eyespot (ES-E) | 93.37% | 91.50% | 92.42%
Late-stage Eyespot (ES-L) | 89.95% | 98.50% | 94.03%
Early-stage Northern Leaf Blight (NLB-E) | 95.29% | 91.00% | 93.10%
Late-stage Northern Leaf Blight (NLB-L) | 95.43% | 94.00% | 94.71%
Phosphorus Deficiency (PD) | 99.50% | 99.00% | 99.25%
Zinc Deficiency (ZD) | 96.06% | 97.50% | 96.77%
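The per-class metrics in Table 8 follow the standard precision, recall, and F1 definitions; a minimal sketch using scikit-learn's classification_report is given below, with random placeholder arrays standing in for the actual test-set labels and predictions.

```python
import numpy as np
from sklearn.metrics import classification_report

# Placeholder labels; replace y_true and y_pred with the real test-set labels
# and model predictions to obtain per-class precision/recall/F1 as in Table 8.
class_names = ["CR", "GLS", "H", "ES-E", "ES-L", "NLB-E", "NLB-L", "PD", "ZD"]
rng = np.random.default_rng(0)
y_true = rng.integers(0, len(class_names), size=1800)
y_pred = rng.integers(0, len(class_names), size=1800)
print(classification_report(y_true, y_pred, target_names=class_names, digits=4))
```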
Table 9. Comparison of MSTA results for different branch versions.
Branches | Kernel Sizes | Accuracy | Precision | Recall | F1-Score
2-branch | 1 × 1, 3 × 3 | 93.28% | 93.32% | 93.21% | 93.26%
3-branch | 1 × 1, 3 × 3, 5 × 5 | 94.87% | 95.12% | 94.57% | 94.84%
4-branch | 1 × 1, 3 × 3, 5 × 5, 7 × 7 | 96.11% | 96.17% | 96.11% | 96.14%
5-branch | 1 × 1, 3 × 3, 5 × 5, 7 × 7, 9 × 9 | 94.07% | 94.02% | 94.09% | 94.05%
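To make the branch configurations in Table 9 concrete, the sketch below shows an illustrative four-branch depthwise-convolution block using the 1 × 1, 3 × 3, 5 × 5, and 7 × 7 kernels; it captures only the multi-branch idea, and the channel splitting and fusion details of the actual MSTA are assumptions rather than the authors' implementation.

```python
import torch
from torch import nn

# Illustrative four-branch multi-scale block with the Table 9 kernel sizes.
# This is a sketch of the branching idea only, not the exact MSTA design.
class MultiScaleBranches(nn.Module):
    def __init__(self, channels, kernel_sizes=(1, 3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
            for k in kernel_sizes
        )
        self.fuse = nn.Conv2d(channels * len(kernel_sizes), channels, kernel_size=1)

    def forward(self, x):
        y = torch.cat([branch(x) for branch in self.branches], dim=1)
        return self.fuse(y)

# Example: aggregate a 32-channel feature map at four spatial scales.
x = torch.randn(1, 32, 64, 64)
print(MultiScaleBranches(32)(x).shape)   # torch.Size([1, 32, 64, 64])
```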
Table 10. Effects of DSFM and MSTA on model performance.
Model | DSFM | MSTA | Accuracy | Precision | Recall | F1-Score
MobileViT | – | – | 90.94% | 91.34% | 90.94% | 91.14%
DSFMViT | ✓ | – | 94.85% | 93.88% | 93.58% | 93.73%
MSTAViT | – | ✓ | 94.56% | 93.64% | 94.11% | 93.87%
DSTANet | ✓ | ✓ | 96.11% | 96.17% | 96.11% | 96.14%
Table 11. Heat maps of the ablation experiment. Each row shows the heat maps for one class (CR, GLS, H, ES-E, ES-L, NLB-E, NLB-L, PD, and ZD) produced by MobileViT, MSTAViT, DSFMViT, and DSTANet, respectively. (Heat map images are omitted here.)
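Assuming the heat maps in Table 11 are Grad-CAM-style visualizations, a minimal self-contained sketch is shown below; `model` and `target_layer` are placeholders for the trained network and one of its late feature layers.

```python
import torch
import torch.nn.functional as F

# Minimal Grad-CAM-style sketch; `model` and `target_layer` are placeholders.
def grad_cam(model, target_layer, image, class_idx=None):
    feats, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))

    model.eval()
    logits = model(image.unsqueeze(0))            # image: (3, H, W) tensor
    if class_idx is None:
        class_idx = logits.argmax(dim=1).item()   # explain the predicted class
    model.zero_grad()
    logits[0, class_idx].backward()
    h1.remove()
    h2.remove()

    activation, gradient = feats[0], grads[0]     # both (1, C, h, w)
    weights = gradient.mean(dim=(2, 3), keepdim=True)        # channel importance
    cam = F.relu((weights * activation).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[1:], mode="bilinear", align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]
    return cam.squeeze().detach()                 # (H, W) heat map
```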
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
