Lightweight Model Improvement and Application for Rice Disease Classification

Liu, Tonglai; Liu, Mingguang; Yang, Chengcheng; Wu, Ancong; Li, Xiaodong; Wei, Wenzhao

doi:10.3390/electronics14163331


by Tonglai Liu ¹, Mingguang Liu ¹, Chengcheng Yang ¹,²,*, Ancong Wu ¹, Xiaodong Li ¹ and Wenzhao Wei ³

¹ College of Artificial Intelligence, Zhongkai University of Agriculture and Engineering, Guangzhou 510550, China

² School of Computer Science and Engineering, Macau University of Science and Technology, Avenida Wai Long, Taipa, Macau

³ College of Mathematics and Informatics, South China Agricultural University, Wushan Road 483, Guangzhou 510642, China

* Author to whom correspondence should be addressed.

Electronics 2025, 14(16), 3331; https://doi.org/10.3390/electronics14163331

Submission received: 26 June 2025 / Revised: 18 August 2025 / Accepted: 19 August 2025 / Published: 21 August 2025

(This article belongs to the Special Issue Target Tracking and Recognition Techniques and Their Applications)


Abstract

The timely and correct identification of rice diseases is essential to ensuring rice productivity. However, many existing methods suffer from slow recognition, low accuracy, and overly complex models that are difficult to port. To address these difficulties, this study proposes an improved model for accurately classifying rice diseases based on a two-level routing attention mechanism and dynamic convolution. The model employs Alterable Kernel Convolution, which uses dynamic, irregularly shaped convolutional kernels, and Bi-level Routing Attention, which utilizes sparsity to reduce parameters and involves GPU-friendly dense matrix multiplication. Together, they achieve high-precision rice disease recognition while keeping the model lightweight and fast. The model classified 10 categories (nine rice diseases and healthy rice) with 97.31% accuracy and a 97.18% F1-score. Our proposed method outperforms MobileNetV3-large, EfficientNet-b0, Swin Transformer-tiny and ResNet-50 by 1.73%, 1.82%, 1.25% and 0.67%, respectively. Meanwhile, the model contains only $4.453 \times 10^{6}$ parameters and achieves an inference time of 6.13 s, which facilitates deployment on mobile devices. The proposed MobileViT_BiAK method effectively identifies rice diseases while providing a lightweight and high-performance classification solution.

Keywords:

rice disease; MobileViT; convolutional neural network; attention mechanism; transformer

1. Introduction

Rice, a staple for over half the world’s population, faces challenges from diseases and pests, which can lead to economic losses [1]. Quickly identifying and treating these diseases is crucial. Traditional methods rely on expert sensory identification, which is subjective and time-consuming. The recent application of artificial intelligence, particularly machine learning, offers a more objective and efficient approach to rice disease recognition [2]. At present, recognition methods fall into two groups: traditional machine learning methods and deep learning methods, the latter being a branch of machine learning. Traditional machine learning recognition is mainly divided into four stages: data preprocessing, image segmentation, feature extraction and classification. Pritlmoy constructed a multilayer perceptron neural network based on color and texture features [3]. Chao Ma trained a Support Vector Machine (SVM) classifier on Histogram of Oriented Gradients (HOG) features to obtain a disease classifier [4]. Bikash proposed a rice disease image recognition technique based on the Twin Support Vector Machine (TSVM) [5]. Although traditional machine learning has achieved some success in rice disease recognition, its feature extraction step relies on features designed by hand from the researcher’s experience, which is subjective, and large-scale manual extraction is cumbersome and time-consuming.

Deep learning methods, which are becoming increasingly sophisticated [6,7], have an advantage over traditional machine learning. They replace the cumbersome, subjective manual feature extraction step with features learned by the model, which greatly improves recognition accuracy and efficiency. Mohapatra et al. used transfer learning with an AlexNet model to automatically recognize and diagnose three rice diseases [8]. Dengshan Li et al. proposed a video recognition system based on Fast-RCNN [9]. Rahman et al. proposed a CNN architecture called Simple CNN [10]. Saleem et al. used the Mutant Particle Swarm Optimization (MUT-PSO) algorithm to search for the optimal CNN structure [11]. Sathya et al. proposed a novel reconstructed disease awareness-convolutional neural network (RDA-CNN) [12].

Deep learning methods are developing rapidly, and many scholars have begun to combine various techniques with backbone networks, allowing further improvements in model performance. The attention mechanism is a particularly effective technique. Refs. [13,14,15] used channel attention, adding modules such as Efficient Channel Attention (ECA) and Squeeze-and-Excitation (SE) attention to the backbone network to learn features along the channel dimension. However, channel information alone may not be enough, because the location of features also carries information that should not be ignored. Refs. [16,17] therefore used spatial attention mechanisms, such as Coordinate Attention (CA) and the Convolutional Block Attention Module (CBAM). These methods capture the location of key features and suppress the weight of background locations, which improves model performance. Spatial attention can capture the positional information of features, but it cannot establish connections between different features; self-attention addresses this problem. Self-Attention (SA), Multi-Head Self-Attention (MHSA), and window-based self-attention methods measure the strength of association between different image blocks by comparing them with one another. Refs. [18,19,20] demonstrate the effectiveness of self-attention. In addition to conventional supervised deep learning approaches, recent research has explored few-shot recognition and semi-supervised learning to address the challenge of limited labeled agricultural data. For example, Ref. [21] proposes a novel few-shot plant disease recognition framework that leverages meta-learning and attention mechanisms to achieve robust performance even with scarce training samples. Such approaches are particularly valuable for real-world agricultural scenarios, where obtaining large-scale labeled datasets is often infeasible.

In addition to the attention mechanism, convolution can also be improved. Ref. [22] proposed using transposed convolution and dilated convolution as the up-sampling and down-sampling operations in the feature pyramid, which achieved better performance than standard convolution. Ref. [23] replaced standard convolution with depthwise separable convolution and used deformable convolution as the feature extraction layer, achieving better detection than ordinary convolution. The flexibility of deformable convolution allows it to adapt to the shape of the features. Ref. [24] proposed Dynamic Snake Convolution, which exploits the slender, variable shape of its kernel to segment a cardiac vascular dataset and a remote sensing road dataset with outstanding results.

The methods described above add an attention mechanism or replace the standard convolution to further improve recognition accuracy. However, they still have drawbacks. The computational cost of the multi-head attention mechanism is too large. Although some works [19,25,26,27] sparsify the multi-head attention mechanism, they still fail to solve this problem well. Although depthwise separable convolution greatly reduces the number of parameters without decreasing accuracy, the fixed shape of its convolution kernel makes it ineffective at extracting disease features that take multiple shapes. Dynamic convolution can adapt to feature shapes through offsets, but its number of kernel parameters still grows at the same rate as in standard convolution, that is, quadratically with the kernel size. There is still room to reduce the number of parameters in the model.

Although the above methods have achieved some success, they still have shortcomings. Firstly, many methods do not construct multicategory disease datasets and therefore cannot distinguish between multiple rice diseases under real conditions. Secondly, some of the above methods focus only on local regions of the image and cannot capture global dependencies to form contextual associations. Although methods such as self-attention solve this problem, rice paddies are usually located in remote areas with complex terrain, and deploying transformers with large parameter counts and slow inference on computationally limited devices remains a major challenge. Based on this, this paper proposes MobileViT_BiAK, an improved hybrid CNN and Transformer architecture that aims to be both lightweight and accurate. The model adopts Alterable Kernel Convolution, which adapts to different disease features by changing the shape of its convolution kernels; because its number of kernel parameters grows only linearly, it remains lightweight and does not introduce heavy computation. Meanwhile, Bi-level Routing Attention divides the attention process into two phases: a coarse-grained filtering phase that retains the k most relevant regions, followed by a fine-grained attention phase restricted to those regions. This alleviates the large computational cost of Multi-Head Self-Attention. Our contributions are threefold:

A two-stage attention mechanism for capturing global dependencies is proposed, performing attention calculations at both coarse-grained and fine-grained levels. This approach reduces computational cost while maintaining accuracy, achieving lightweight performance.
Alterable Kernel Convolution is employed for feature extraction. Its dynamic nature allows it to better adapt to features of varying shapes, while its linearly growing parameter count effectively controls model complexity.
The proposed method performs well on a rice disease dataset containing 10 categories, accurately identifying targets even in complex background environments.

In this work, we selected MobileViT as the backbone because it uniquely integrates the advantages of CNNs for local feature extraction and Transformers for global dependency modeling, while maintaining a compact model size suitable for deployment on resource-limited agricultural devices. Compared with other lightweight architectures such as MobileNetV3, EfficientNet, and ShuffleNet, MobileViT demonstrated superior accuracy–efficiency trade-offs in our preliminary experiments on agricultural image datasets. Moreover, its modular architecture facilitates the seamless integration of enhanced convolution and attention modules, making it an ideal foundation for our proposed AKConv–BRA framework.

2. Related Work

2.1. Lightweight Backbone Networks for Rice Disease Recognition

Lightweight convolutional neural networks (CNNs) have become increasingly popular in agricultural disease detection because they offer a good trade-off between performance and computational cost, making them well-suited for deployment on mobile or embedded devices in field conditions. Notable examples include MobileNet [28], which utilizes depthwise separable convolutions and squeeze-and-excitation blocks to reduce the number of parameters while maintaining representational capacity. EfficientNet [29] achieves competitive performance through compound scaling of depth, width, and resolution, while ShuffleNet [30] introduces pointwise group convolution and channel shuffle to enhance efficiency. In the context of rice disease classification, compact backbones have been adapted to extract discriminative features from leaf images with low latency, which is essential for real-time diagnosis in the field. However, these models typically employ fixed-size convolution kernels, which may limit adaptability to the diverse morphologies of rice leaf lesions, such as the nearly circular spots caused by brown spot disease or the elongated streaks characteristic of bacterial leaf streak. This limitation motivates our integration of Alterable Kernel Convolution (AKConv), which enables dynamic adjustment of receptive fields with linear parameter growth, thereby enhancing adaptability without significantly increasing model complexity.

2.2. Attention Mechanisms and Global Context Modeling in Agriculture

Attention mechanisms have been increasingly integrated into agricultural vision systems to enhance feature representation by focusing computational resources on the most relevant spatial or channel-wise information. Channel attention modules, such as the Squeeze-and-Excitation (SE) block [31] and Efficient Channel Attention (ECA) [32], recalibrate feature responses along the channel dimension to emphasize disease-relevant spectral bands or texture patterns. Spatial attention modules, such as those in the Convolutional Block Attention Module (CBAM) [33], focus on localizing lesions and suppressing irrelevant background regions such as soil or sky. In rice disease detection, combining channel and spatial attention helps discriminate subtle symptoms under challenging lighting, occlusion, or complex background conditions.

With the emergence of Vision Transformers (ViTs) [34], global context modeling has gained prominence in agricultural image analysis. ViTs divide an image into patches and apply multi-head self-attention (MHSA) to capture long-range dependencies between all regions, which is advantageous when lesions are spatially scattered. Variants such as the Swin Transformer [26] and CSWin Transformer [25] reduce the computational cost of MHSA by restricting attention to local windows or cross-shaped patterns, but this can limit the modeling of cross-window relationships. In practice, the quadratic complexity of MHSA remains a bottleneck for deployment on resource-limited platforms. Bi-level Routing Attention (BRA) [35], as adopted in our approach, alleviates the computational burden by reducing the complexity to $O((HW)^{4/3})$, where H and W denote the spatial height and width of the feature map, respectively. This mechanism effectively captures both local and global dependencies, making it particularly well-suited for mobile agricultural applications.

2.3. Recent Advances and Gaps in Rice Disease Classification

Recent studies in rice disease classification have increasingly explored hybrid designs that integrate lightweight backbones with attention modules. For example, YOLO-based detectors augmented with channel attention have been employed for real-time detection in UAV imagery [36], while MobileNet variants with CBAM have improved recognition accuracy in handheld devices [37]. These approaches improve feature discriminability but tend to emphasize either fine-grained local texture extraction or global context modeling, rarely addressing both in a computationally balanced manner.

Another emerging research direction is data-efficient learning. Few-shot learning methods, such as attention-guided pyramidal feature meta-learning [21], have been applied to fine-grained agricultural disease recognition, showing strong potential under limited labeled data conditions. Semi-supervised approaches leveraging pseudo-labeling or consistency regularization have also been explored in crop disease scenarios, reducing annotation costs and improving generalization to new environments. However, such methods are rarely integrated with efficient global modeling strategies, and their performance on edge devices has not been thoroughly validated.

Our MobileViT_BiAK design explicitly addresses these gaps by unifying shape-adaptive local feature extraction (AKConv) with computationally efficient global modeling (BRA) within a MobileViT backbone. This architecture provides an improved trade-off between accuracy, efficiency, and lesion shape adaptability, and is validated through ablation studies and complexity analysis in real-field rice disease datasets.

Figure 1 summarizes the key drawbacks of existing methods for rice disease classification and contrasts them with our proposed approach. Standard convolution and depthwise separable convolution are limited by fixed kernel shapes, which cannot adapt well to lesions of different geometries. Multi-head self-attention captures global dependencies but has high computational complexity. In contrast, our AKConv module achieves shape adaptability with linear parameter growth, and the BRA module significantly reduces the computational cost of attention while maintaining global context modeling.

3. Methods

3.1. MobileViT

MobileViT [38] is a lightweight model released by Apple in 2021 that achieves competitive performance in various computer vision tasks while maintaining low computational complexity. In our proposed method, MobileViT serves as the backbone network, providing an efficient combination of convolutional layers for local feature extraction and transformer blocks [39] for capturing long-range dependencies. The processing pipeline follows four main steps: (1) input image preprocessing and patch embedding; (2) local feature extraction using initial convolutional layers; (3) global context modeling via transformer blocks and (4) classification through a global pooling layer and a fully connected layer. As shown in Table 1, the MobileViT model structure consists of multiple layers including convolution operations, MV2 modules, and MobileViT blocks. This detailed breakdown highlights the architecture’s efficiency in processing rice disease classification tasks. To enhance the baseline MobileViT, we integrate Alterable Kernel Convolution (AKConv) into the convolutional stages to improve adaptability to disease features of varying shapes, and we replace the standard multi-head self-attention in the transformer stages with Bi-level Routing Attention (BRA) to reduce computational cost while maintaining global feature modeling capability. The contribution of each main step is as follows: (1) MobileViT backbone: provides a compact yet powerful architecture capable of efficient local-global feature integration; (2) Alterable Kernel Convolution: adapts receptive field shapes dynamically to better match the morphological diversity of rice disease symptoms; (3) Bi-level Routing Attention: selectively focuses on the most relevant spatial regions, achieving sub-quadratic complexity and improving inference efficiency on high-resolution images.

3.2. The Attention Mechanism

In our proposed method, the original Multi-Head Self-Attention (MHSA) modules within MobileViT blocks are replaced by Bi-level Routing Attention (BRA). This modification introduces a two-stage attention process: (1) coarse-grained region selection, which identifies the top-k most relevant spatial regions for each query and (2) fine-grained token-level attention applied only to the selected regions. This design reduces the computational complexity from $O((HW)^{2})$ in MHSA to approximately $O((HW)^{4/3})$, while maintaining global feature modeling capabilities, where H and W denote the spatial height and width of the feature map, respectively. By embedding BRA into the MobileViT backbone, the model achieves improved efficiency and scalability on high-resolution agricultural images.

Like human attention, the attention mechanism in deep learning processes input data faster and more accurately by focusing on the more important parts, which improves performance and generalization. The Transformer uses multi-head attention instead of convolution for global context modeling. However, multi-head attention computes the correlation between all pairs of patches into which the input is divided, which is expensive. This problem has attracted considerable attention, and many works restrict each query to a sparse set of key-value pairs, so that only the few strongly correlated pairs are needed to obtain a global representation; examples include local windows [26], axial stripes [25], and dilated windows [27]. However, these sparsity patterns are designed by hand, whereas Bi-level Routing Attention (BRA) [35] locates the few most relevant key-value pairs for each query. The mechanism operates in two steps: finding the most relevant regions and token-to-token attention.

Finding the most relevant regions. Given a 2D input feature map $X \in R^{H \times W \times C}$, we divide it into $S \times S$ non-overlapping regions, each containing $\frac{HW}{S^{2}}$ feature vectors, which yields the reshaped feature map $X^{r} \in R^{S^{2} \times \frac{HW}{S^{2}} \times C}$. Then, we derive the query, key, and value tensors with linear projections as shown in Equation (1):

Q = X^{r} W^{q}, K = X^{r} W^{k}, V = X^{r} W^{v}

(1)

where $W^{q}, W^{k}, W^{v} \in R^{C \times C}$ are the projection weights for query, key, and value, respectively. The per-region averages of Q and K are then computed to derive the region-level queries and keys $Q^{r}, K^{r} \in R^{S^{2} \times C}$.

Next, the semantic adjacency matrix $A^{r} \in R^{S^{2} \times S^{2}}$ is computed by matrix multiplication of $Q^{r}$ and the transpose of $K^{r}$, as shown in Equation (2):

A^{r} = Q^{r} {(K^{r})}^{⊤}

(2)

$A^{r}$ measures the semantic relevance between pairs of regions. For each region, only the k most relevant regions are retained by applying a row-wise top-k operator to $A^{r}$. This gives the routing index matrix $I^{r} \in N^{S^{2} \times k}$, defined in Equation (3):

I^{r} = topkIndex (A^{r})

(3)
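As an illustration of Equations (2) and (3), the following minimal Python sketch computes region-level queries and keys by average pooling, forms the adjacency matrix, and keeps the top-k indices per row. It assumes row-major token lists and a feature map whose height and width are divisible by S. The function name `region_routing` and the toy input are hypothetical and are not taken from the authors' implementation.

```python
def region_routing(q, k, H, W, S, topk):
    """q, k: lists of H*W token vectors in row-major order.

    Returns (A^r, I^r): the S^2 x S^2 region adjacency matrix and the
    S^2 x topk routing index matrix.
    """
    assert H % S == 0 and W % S == 0, "sketch assumes H and W divisible by S"
    rh, rw = H // S, W // S          # region height and width in tokens
    C = len(q[0])

    def region_means(x):
        # Average all tokens that fall into the same region.
        sums = [[0.0] * C for _ in range(S * S)]
        for i in range(H):
            for j in range(W):
                r = (i // rh) * S + (j // rw)
                tok = x[i * W + j]
                for c in range(C):
                    sums[r][c] += tok[c]
        n = rh * rw
        return [[v / n for v in row] for row in sums]

    q_r, k_r = region_means(q), region_means(k)
    # Equation (2): A^r = Q^r (K^r)^T
    a_r = [[sum(a * b for a, b in zip(qi, kj)) for kj in k_r] for qi in q_r]
    # Equation (3): row-wise indices of the topk largest entries
    i_r = [sorted(range(S * S), key=lambda j: -row[j])[:topk] for row in a_r]
    return a_r, i_r


# Toy example (hypothetical values): a 4 x 4 map with C = 2, S = 2, k = 2.
region_vecs = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 2.0]]
tokens = [region_vecs[(i // 2) * 2 + (j // 2)] for i in range(4) for j in range(4)]
a_r, i_r = region_routing(tokens, tokens, H=4, W=4, S=2, topk=2)
```

In this toy input, regions whose mean vectors point in the same direction are routed to each other, which is the behavior the coarse-grained filtering step relies on.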

Token-to-token attention. With the region-to-region routing index matrix $I^{r}$, we can filter out the least relevant tokens: each query token in region i attends only to the tokens in the k routed regions indexed by $I_{i, 1}^{r}, I_{i, 2}^{r}, \dots, I_{i, k}^{r}$. We gather all key-value pairs of these regions as shown in Equation (4):

K^{g} = gather (K, I^{r}), V^{g} = gather (V, I^{r})

(4)

where $K^{g}, V^{g} \in R^{S^{2} \times k \cdot \frac{HW}{S^{2}} \times C}$ are the gathered key and value tensors.

Finally, attention is applied to the gathered key-value pairs, with an additional Local Context Enhancement (LCE) term [40], as shown in Equation (5):

O = Attention (Q, K^{g}, V^{g}) + LCE (V)

(5)
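The gather and attention steps of Equations (4) and (5) can be sketched in plain Python as follows. This is a simplified single-head illustration under stated assumptions: tokens are already grouped by region, the attention is scaled dot-product attention with a $1/\sqrt{C}$ factor, and the LCE(V) term is omitted. The names `gather` and `routed_attention` and the toy tensors are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def gather(regions, idx_row):
    """Concatenate the tokens of the routed regions for one query region."""
    return [tok for r in idx_row for tok in regions[r]]


def routed_attention(q_regions, k_regions, v_regions, i_r):
    """Attention(Q, K^g, V^g) per query region; the LCE(V) term is omitted."""
    out = []
    for r, q_tokens in enumerate(q_regions):
        k_g = gather(k_regions, i_r[r])   # k * (HW / S^2) keys
        v_g = gather(v_regions, i_r[r])   # matching values
        scale = math.sqrt(len(k_g[0]))    # assumed 1/sqrt(C) scaling
        region_out = []
        for q in q_tokens:
            scores = [sum(a * b for a, b in zip(q, key)) / scale for key in k_g]
            w = softmax(scores)
            region_out.append(
                [sum(wi * v[c] for wi, v in zip(w, v_g)) for c in range(len(v_g[0]))]
            )
        out.append(region_out)
    return out


# Toy example (hypothetical values): 2 regions, 2 tokens per region, C = 2, k = 1.
q_regions = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [0.5, 0.5]]]
k_regions = q_regions
v_regions = [[[1.0, 2.0], [1.0, 2.0]], [[3.0, 4.0], [5.0, 6.0]]]
i_r = [[0], [1]]
out = routed_attention(q_regions, k_regions, v_regions, i_r)
```

Each query only touches the keys and values of its routed regions, so the inner loop runs over $k \cdot \frac{HW}{S^{2}}$ tokens instead of all $HW$ tokens.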

3.3. The Convolution

In the convolutional component of our architecture, we replace the standard convolution layers in MobileViT with Alterable Kernel Convolution (AKConv). Standard convolution uses fixed square $n \times n$ kernels, so its number of parameters increases quadratically with the kernel size. AKConv instead allows dynamic, irregular kernel shapes whose parameter growth is linear with respect to kernel size. This adaptability enables the kernels to better match the morphological diversity of rice disease symptoms (e.g., circular lesions, elongated streaks) while keeping the model lightweight. Compared with deformable convolution, AKConv achieves a similar receptive field flexibility without incurring significant computational overhead, making it better suited for real-time agricultural applications.

Moreover, standard convolution has a fixed kernel shape, which cannot adapt well to the input samples of varying shapes and structures. This rigidity affects the accurate extraction of local features and ultimately affects overall performance.

To overcome these issues, we adopt a more flexible convolution mechanism called Alterable Kernel Convolution (AKConv) [41], whose parameters grow linearly and whose shape can adapt to the input. An illustration of AKConv is shown in Figure 2. The sampling grid $P_{n}$ takes the upper-left corner $(0, 0)$ as the initial sampling coordinate. The convolution operation corresponding to position $P_{0}$ is defined as shown in Equation (6):

Conv (P_{0}) = \sum w \times (P_{0} + P_{n})

(6)
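To make Equation (6) concrete, the sketch below builds an initial sampling grid for an arbitrary number of sampling points, starting from (0, 0) in the upper-left corner, and evaluates the sum at one position. It interprets the term $w \times (P_{0} + P_{n})$ as the weight multiplied by the feature value sampled at position $P_{0} + P_{n}$, omits learned offsets and interpolation, and treats out-of-range samples as zero. The row-filling layout in `initial_sampling_coords` is a hypothetical choice for illustration and may differ from the layout used in [41].

```python
import math


def initial_sampling_coords(num_points):
    """Hypothetical layout: fill rows of a near-square grid from (0, 0)."""
    base = max(1, round(math.sqrt(num_points)))
    rows, rest = divmod(num_points, base)
    coords = [(r, c) for r in range(rows) for c in range(base)]
    coords += [(rows, c) for c in range(rest)]   # partial last row, if any
    return coords


def conv_at(feature, p0, weights, coords):
    """Sum of w * feature[P0 + Pn] over all sampling points (no offsets)."""
    h, w = len(feature), len(feature[0])
    total = 0.0
    for wt, (dy, dx) in zip(weights, coords):
        y, x = p0[0] + dy, p0[1] + dx
        if 0 <= y < h and 0 <= x < w:            # zero padding outside the map
            total += wt * feature[y][x]
    return total


# Toy example (hypothetical values): a kernel with 5 sampling points.
coords5 = initial_sampling_coords(5)
feature = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
value_origin = conv_at(feature, (0, 0), [1.0] * 5, coords5)
value_center = conv_at(feature, (1, 1), [1.0] * 5, coords5)
```

A kernel with five sampling points has no square-kernel equivalent, which illustrates how the number of points can be chosen freely.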

Here, w denotes the convolutional parameter. After resampling, AKConv can extract features in one of three ways: (1) stacking the $3 \times 3$ convolutional features in the spatial dimension and then extracting the features using a convolution operation with a stride of 3; (2) transforming the features into four dimensions $(C, N, H, W)$, where N is the number of sampling points, and then extracting the features using Conv3d with a stride and kernel size of $(N, 1, 1)$; or (3) stacking the features along the channel dimension into $(CN, H, W)$ and then reducing the dimension to $(C, H, W)$ using a $1 \times 1$ convolution.

The size of the AKConv kernel, that is, its number of sampling points, can be set freely, so the number of parameters grows linearly with the kernel size and can be kept within a small range. Moreover, the shape of the convolution kernel is not constrained to the square structure of standard convolutions, allowing it to dynamically adapt to features of different shapes and sizes. Based on these advantages, it can be effectively applied in rice disease recognition.
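The difference in parameter growth can be checked with a short calculation. The sketch below counts only the main kernel weights of a single layer, assuming that a standard convolution of size n uses $n \times n$ weights per input-output channel pair while an AKConv kernel of size N uses N weights. Biases and any auxiliary layers (for example, layers that predict sampling offsets) are ignored, and the function names are hypothetical.

```python
def standard_conv_weights(c_in, c_out, n):
    """Main weights of an n x n standard convolution (quadratic in n)."""
    return c_in * c_out * n * n


def akconv_weights(c_in, c_out, num_points):
    """Main weights of a kernel with num_points sampling points (linear)."""
    return c_in * c_out * num_points


# Compare growth for sizes 1..6 with 64 input and 64 output channels.
growth = [
    (size, standard_conv_weights(64, 64, size), akconv_weights(64, 64, size))
    for size in range(1, 7)
]
for size, std, ak in growth:
    print(f"size={size}: standard={std:>7d}  akconv={ak:>6d}")
```

Under these assumptions, the two counts coincide only at size 1, and the gap widens by a factor equal to the size.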

AKConv was chosen over other variants of adaptive convolution, such as deformable convolution, because it achieves adaptive receptive field adjustment with linear parameter growth, avoiding the significant computational overhead often introduced by deformable kernels. This property is especially critical in real-time agricultural disease monitoring, where computational resources are limited. A detailed empirical comparison between AKConv, standard convolution, depthwise separable convolution, and DCNv2 in terms of parameter efficiency, accuracy, and inference time is provided in Section 5.3, further supporting the motivation for adopting AKConv.

3.4. Proposed Methods

The current rice disease recognition system suffers from limited dataset variety, low recognition accuracy, and poor portability. MobileViT is a lightweight model that combines the advantages of spatial inductive bias in convolutional neural networks (CNNs), enabling the model to adapt well to different applications via learnable weights and accelerating convergence. Simultaneously, it integrates the transformer’s ability to focus on global information, allowing the model to attend not only to local content but also to information with strong relevance across different spatial positions. This dual capability addresses the current challenges of low accuracy and weak portability in rice disease recognition.

Our idea is to improve the feature extraction component of the network, which specifically refers to the convolutional and attention modules responsible for learning low- and mid-level representations from the input images, rather than the full encoder. Rice disease symptoms manifest in diverse forms; for instance, brown spot lesions are circular, whereas bacterial leaf blight exhibits elongated, striped patterns. Therefore, by using Alterable Kernel Convolution (AKConv) we can better extract such shape-varied disease features. Unlike standard convolution, AKConv has a linear parameter growth trend, which enables control over the parameter count while enhancing model performance. This makes it possible to improve the recognition accuracy of rice diseases with only a small increase in the model size.

Considering the Multi-Head Self-Attention (MHSA) mechanism in MobileViT, “multi-head” indicates that the output is split into n blocks along the channel dimension, and each block applies a separate set of projection weights. MHSA is defined as shown in Equation (7):

MHSA (X) = Concat ({head}_{0}, {head}_{1}, \dots, {head}_{n}) W^{o}, {head}_{i} = Attention (X W_{i}^{q}, X W_{i}^{k}, X W_{i}^{v})

(7)

where ${head}_{i}$ denotes the output of the i-th attention head, and $X W_{i}^{q}$, $X W_{i}^{k}$, $X W_{i}^{v}$ are the corresponding query, key, and value projections. $W^{o}$ is the weight matrix used to combine all heads. MHSA has a complexity of $O(N^{2})$, where $N = HW$ is the number of tokens, which may result in high computational costs, overfitting, and weak generalization.

To address these issues, we replace MHSA with Bi-level Routing Attention (BRA), which first filters irrelevant regions at a coarse level and then applies attention at a fine-grained level to the remaining relevant regions. BRA enables a more efficient extraction of globally correlated features and has a computational complexity of $O((HW)^{4/3})$.

We name the proposed improved architecture MobileViT_BiAK. The architecture of the improved model is illustrated in Figure 3, and its detailed structure is shown in Figure 4.

Each block in Figure 3 is annotated with the corresponding resolution of the feature map (height × width), the number of channels, and the specific operation type (e.g., standard convolution, inverted residual block, AKConv, BRA). This enables readers to trace how spatial resolution and channel depth evolve throughout the network, from raw image input to the final classification layer. In Figure 4, both the AK-Biformer block (a) and the original Biformer block (b) are labeled with tensor dimensions at each stage, arrows indicating the direction of feature propagation, and the specific transformations applied in each sub-module. For the AK-Biformer block, we additionally highlight the dual-branch design: the AKConv branch for adaptive local feature extraction and the BRA branch for efficient global dependency modeling. These annotations and descriptions collectively provide a clear, step-by-step view of the internal processing flow, helping readers understand how the proposed architecture maintains lightweight characteristics while improving feature expressiveness and classification performance.

For an input feature map of size $H \times W$ with C channels, the standard multi-head self-attention (MHSA) has the following complexity:

O (MHSA) = O (H W C^{2} + {(H W)}^{2} C)

(8)

where the first term corresponds to the linear projections for Q, K, and V, and the second term to the dot-product attention across all tokens.

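In BRA, coarse region-level routing partitions the feature map into $S \times S$ regions, each with $\frac{HW}{S^{2}}$ tokens. Fine-grained attention is then applied only to the tokens of the top-k routed regions for each query. Region-level routing costs $O(S^{4} C)$ and token-level attention costs $O(k \frac{(HW)^{2}}{S^{2}} C)$, so with a fixed k and a partition factor $S \propto (HW)^{1/3}$ both terms scale as $(HW)^{4/3}$. This reduces the attention cost from $O((HW)^{2} C)$ to the following: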

O (BRA) \approx O (H W C^{2} + {(H W)}^{4 / 3} C)

(9)

Thus, BRA achieves sub-quadratic complexity in the number of tokens, enabling more efficient scaling to high-resolution inputs while preserving global context modeling.
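The two complexity expressions can be compared numerically with a rough operation count. The sketch below drops constant factors and assumes, as in Section 3.2, that the feature map is split into $S \times S$ regions and that each query attends to the tokens of k routed regions. The function names and the example sizes are hypothetical, and the counts are estimates, not measured FLOPs.

```python
def mhsa_cost(h, w, c):
    """Projection term HW*C^2 plus dense attention term (HW)^2*C."""
    n = h * w
    return n * c * c + n * n * c


def bra_cost(h, w, c, k, s):
    """Projection term plus region routing and routed token attention."""
    n = h * w
    assert n % (s * s) == 0, "sketch assumes HW divisible by S^2"
    routing = (s * s) ** 2 * c            # S^2 x S^2 region adjacency
    token = n * k * (n // (s * s)) * c    # each query sees k regions
    return n * c * c + routing + token


# Example (hypothetical sizes): a 32 x 32 map, C = 64, k = 4, S = 8.
dense = mhsa_cost(32, 32, 64)
routed = bra_cost(32, 32, 64, k=4, s=8)
print(f"MHSA: {dense}  BRA: {routed}  ratio: {dense / routed:.1f}x")
```

The gap between the two counts widens as the feature map grows, because the dense attention term is quadratic in the number of tokens.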

BRA was selected instead of alternative efficient global attention mechanisms, including those of Swin Transformer, Linformer, and Performer, due to its two-level routing strategy: coarse region-level routing and fine token-level attention. This design reduces the theoretical complexity from $O(N^{2})$ to approximately $O(N^{4/3})$, where $N = HW$ is the number of tokens, striking a better balance between efficiency and accuracy on high-resolution plant disease images.

4. Experiments and Results

The hardware environment consists of an Intel Xeon E5-2620 v3 processor, an NVIDIA RTX 3060 graphics card with 12 GB of memory, and 16 GB of RAM. The software environment is Ubuntu 20.04 with Python 3.9.5, and the deep learning framework is PyTorch 2.1.1. To analyze the effectiveness of the proposed method, we evaluate the model with four metrics: accuracy, precision, recall, and F1-score. These metrics are defined as follows:

Accuracy = \frac{T P + T N}{T P + T N + F P + F N}

(10)

Precision = \frac{T P}{T P + F P}

(11)

Recall = \frac{T P}{T P + F N}

(12)

F1\text{-}score = 2 \times \frac{Precision \times Recall}{Precision + Recall}

(13)

where $TP$, $TN$, $FP$, and $FN$ denote the number of true positives, true negatives, false positives, and false negatives, respectively. These definitions follow standard practices in the evaluation of image classification [42].
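Equations (10)–(13) are stated in terms of binary counts, so for a multi-class problem they are typically applied per class in a one-vs-rest fashion. The sketch below shows one way to do this from a confusion matrix. The 3 × 3 matrix is a hypothetical toy example, not data from this study, and the helper names are ours.

```python
def one_vs_rest_counts(cm, cls):
    """Return (TP, TN, FP, FN) for class `cls`; cm[i][j] counts true class i predicted as j."""
    total = sum(sum(row) for row in cm)
    tp = cm[cls][cls]
    fn = sum(cm[cls]) - tp                  # true `cls`, predicted as something else
    fp = sum(row[cls] for row in cm) - tp   # other classes predicted as `cls`
    tn = total - tp - fn - fp
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)  # Eq. (10)


def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0  # Eq. (11)


def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0  # Eq. (12)


def f1_score(p, r):
    return 2 * p * r / (p + r) if p + r else 0.0  # Eq. (13)


# Hypothetical 3-class confusion matrix (rows: true class, columns: predicted class).
cm = [[8, 1, 1],
      [0, 9, 1],
      [2, 0, 8]]
tp, tn, fp, fn = one_vs_rest_counts(cm, 0)
p, r = precision(tp, fp), recall(tp, fn)
print(tp, tn, fp, fn, round(p, 3), round(r, 3), round(f1_score(p, r), 3))
```

Note that the one-vs-rest accuracy of a single class counts true negatives, so it is not the same quantity as the overall multi-class accuracy (correct predictions divided by all samples).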

4.1. Dataset

Existing rice disease datasets cover few disease types and therefore transfer poorly to practical applications. We therefore constructed a dataset that contains nine types of rice diseases plus healthy leaves, for a total of 10 classes. All images were taken in natural environments with complicated backgrounds, which reflects the performance of the model in real conditions better than a dataset with a single uniform background. Rice disease images with natural backgrounds contain more interfering information and noise, which tests the model’s ability to extract key features more rigorously and helps to improve the model’s generalization ability and robustness. The dataset includes the following categories: bacterial leaf blight, bacterial leaf streak, bacterial panicle blight, blast, brown spot, dead heart, downy mildew, hispa, normal and tungro. It contains a total of 10,407 samples at a resolution of 480 × 640. The labels and corresponding sample numbers are shown in Table 2, and some of the images are shown in Figure 5.

To support robustness evaluation, the dataset was collected under diverse conditions: (1) Lighting: approximately 45% of the images were taken in bright sunlight, 35% under overcast skies, and 20% in shaded areas. (2) Angles: about 50% of the samples were captured from a top-down view, 30% from oblique angles (30–60°), and 20% from side views. (3) Backgrounds: the dataset includes varied field conditions, with 40% containing soil backgrounds, 25% containing water surfaces, 20% containing weed interference and 15% mixed or cluttered environments. These variations increase the difficulty of classification and better simulate real-world deployment scenarios in agricultural fields.

The dataset is split into training, validation, and test sets in an 8:1:1 ratio, with approximately equal proportions maintained for each class so that every class is represented consistently across the three subsets. Table 2 presents the detailed per-class sample counts for each subset. This stratified partitioning helps to prevent the split itself from biasing the model towards any specific category and enables a reliable assessment of the model’s generalization performance.
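A per-class 8:1:1 split of this kind can be sketched as follows. This is a minimal illustration and not the authors' actual procedure: the rounding rule, the random seed, and the function name `stratified_split` are assumptions, and only two class totals from Table 2 are used as toy input.

```python
import random
from collections import defaultdict


def stratified_split(labels, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split sample indices into train/val/test while keeping per-class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(ratios[0] * len(indices))
        n_val = round(ratios[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Toy label list built from two class totals in Table 2 (blast: 1738, tungro: 1088).
labels = ["blast"] * 1738 + ["tungro"] * 1088
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))
```

Splitting each class separately, rather than the pooled dataset, is what keeps rare classes such as bacterial panicle blight present in all three subsets.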

4.2. Rice Disease Classification

In this section, we validate the proposed method on the rice disease dataset. The model is trained with the AdamW optimizer [43], cross-entropy loss, an AKConv kernel size of 5, a learning rate of 0.0002, and 200 epochs.

As shown in Figure 6, thanks to transfer learning, both the training and validation losses decline rapidly in the first 20 epochs, indicating that the model has roughly learned the overall features of the dataset and already generalizes well. Likewise, the training and validation accuracy rise rapidly in the first 20 epochs and then increase slowly until the model converges. Although the loss and accuracy curves fluctuate, the loss trends steadily downward and the accuracy steadily upward, which shows that the model fits the characteristics of the dataset well; the loss and accuracy on the validation set are even slightly better than those on the training set. This suggests that the hybrid CNN-transformer architecture is effective and recognizes previously unseen samples well, reflecting strong generalization ability and robustness.

As shown in Table 3, the final classification accuracy of our proposed method is 97.31%, and the F1 score is 97.18%. This shows that our proposed method is effective. The specific classification results can be obtained from the confusion matrix in Figure 7.

The confusion matrix shows the classification results for each category, where the main diagonal gives the number of correct classifications per category. Bacterial leaf streak, bacterial panicle blight, brown spot and dead heart have the highest recognition rates, all reaching 100%. All other categories except downy mildew also exceed 95%, while downy mildew has the lowest rate at 85.48%. The confusion matrix shows that five downy mildew samples are misidentified as tungro. The sample images show that the two diseases look very similar, since both produce yellow spots on the leaves, which lowers the recognition accuracy for downy mildew. Follow-up work can target this confusion for further improvement. Overall, the proposed method reaches an accuracy of 97.31%.

Beyond the overall metrics, we further examined the per-class performance based on the confusion matrix in Figure 7. Table 4 reports the Precision, Recall, and F1 score for each class. Downy mildew and tungro obtain the lowest F1 scores: the visual similarity of their yellow spot symptoms on leaves makes them particularly difficult to distinguish, especially under complex background conditions. This indicates that incorporating spectral information or fine-grained texture descriptors could further improve the model’s ability to separate such visually similar diseases. Note that Table 4 reports macro averages (unweighted mean across classes), which may differ from the micro-averaged overall metrics in Table 3 due to class imbalance.
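The macro averages in the last row of Table 4 can be reproduced directly from the per-class rows. The sketch below copies the values from Table 4 and takes the unweighted mean of each column. It also recomputes the downy mildew F1 score from its precision and recall via Equation (13). The variable and function names are ours.

```python
# (precision, recall, F1) in %, copied from Table 4
TABLE4 = {
    "Bacterial Leaf Blight": (100.00, 97.92, 98.95),
    "Bacterial Leaf Streak": (100.00, 100.00, 100.00),
    "Bacterial Panicle Blight": (100.00, 100.00, 100.00),
    "Blast": (98.28, 98.28, 98.28),
    "Brown Spot": (100.00, 100.00, 100.00),
    "Dead Heart": (100.00, 100.00, 100.00),
    "Downy Mildew": (94.64, 85.48, 89.83),
    "Hispa": (98.75, 99.37, 99.06),
    "Normal": (98.31, 99.43, 98.87),
    "Tungro": (94.69, 98.17, 96.40),
}


def macro_average(rows):
    """Unweighted mean of each metric column: every class counts equally."""
    n = len(rows)
    return tuple(sum(row[i] for row in rows) / n for i in range(3))


macro_p, macro_r, macro_f1 = macro_average(list(TABLE4.values()))
print(f"macro precision={macro_p:.2f}  recall={macro_r:.2f}  F1={macro_f1:.2f}")

# Consistency check of Eq. (13) for the hardest class.
p, r, _ = TABLE4["Downy Mildew"]
downy_f1 = 2 * p * r / (p + r)
print(f"downy mildew F1 from P and R: {downy_f1:.2f}")
```

Because the macro average weights all ten classes equally, the single low-recall class lowers it by a fixed amount regardless of how many test images that class contains.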

5. Discussion

5.1. Comparison with Other Models

To further evaluate the effectiveness of our proposed method, we compare it with ResNet-50, Swin Transformer-tiny, EfficientNet-b0 and MobileNetV3-large. Table 5 shows the results of the different models on the dataset we constructed. Our proposed method achieves the highest classification performance among all models on every indicator, with an accuracy of 97.31% and an F1 score of 97.18%.

Note: While the proposed method does not achieve the smallest parameter size or fastest inference speed, it offers the highest accuracy and F1-score with a lightweight structure comparable to MobileNetV3-large and EfficientNet-b0, and significantly smaller than ResNet-50 or Swin Transformer-tiny. This demonstrates a better balance between precision and efficiency for deployment in resource-limited agricultural devices.

Table 5 shows that our proposed method is close to lightweight networks such as MobileNetV3-Large and EfficientNet-B0 in terms of the number of parameters, all of which are on the order of $4.0 \times 10^{6}$. Its inference time (6.13 s) is also comparable to that of the CNN-based networks (4.75–5.82 s), while it achieves the highest accuracy and F1 score among all compared models. Relative to MobileNetV3-Large and EfficientNet-B0, the gain in accuracy and F1-score comes with only a small increase in parameters and inference time. Compared with ResNet-50, the proposed model achieves superior accuracy and F1-score while using more than 80% fewer parameters at a similar inference time (6.13 s versus 5.76 s), which is important for deployment on mobile or edge agricultural devices. This shows that the main advantage of our approach lies in achieving a better balance between accuracy and efficiency rather than merely minimizing the computation cost. In particular, our method has significantly fewer parameters than Swin-Tiny and ResNet-50, while still outperforming both in terms of accuracy and F1 score.

The modest accuracy of Swin-Tiny relative to its size can be attributed to the nature of transformer architectures, which require large-scale datasets for effective training. Moreover, transformers lack the spatial inductive bias inherent in CNNs, which hampers their transfer learning performance when only moderate-scale pre-training is available. Additionally, Swin-Tiny has the largest number of parameters and the longest image processing time among all models tested. ResNet-50, a widely used model in industrial applications, delivers strong recognition performance, ranking second in both accuracy and F1 score in our comparison. However, it has roughly five times as many parameters as our model, leading to large memory and computational demands and making it less suitable for deployment on mobile or edge devices. Overall, CNN-based models exhibit faster inference in real-world applications.

Therefore, we adopt a hybrid CNN-transformer architecture, which preserves the spatial bias and efficiency of CNNs while leveraging the global modeling capacity of transformers. In conclusion, our proposed method achieves the best accuracy and F1 score with a parameter count and inference time close to those of compact CNNs, verifying its practicality and effectiveness.
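The trade-offs discussed above follow from simple arithmetic on Table 5. The sketch below copies the parameter counts, inference times, and accuracies from Table 5 and reports, for each baseline, the accuracy gain of the proposed model, its parameter saving, and its inference-time ratio. The helper name `relative_to` is ours.

```python
# (parameters in millions, inference time in s, accuracy in %), copied from Table 5
TABLE5 = {
    "MobileNetV3-large": (4.214, 4.75, 95.58),
    "EfficientNet-b0": (4.021, 5.82, 95.49),
    "Swin Transformer-tiny": (29.32, 12.62, 96.06),
    "ResNet-50": (23.53, 5.76, 96.64),
    "Ours": (4.453, 6.13, 97.31),
}


def relative_to(table, ours="Ours"):
    """Per baseline: (accuracy gain in points, parameter saving in %, time ratio ours/baseline)."""
    p0, t0, a0 = table[ours]
    out = {}
    for name, (p, t, a) in table.items():
        if name == ours:
            continue
        # A negative saving means the proposed model has MORE parameters than the baseline.
        out[name] = (a0 - a, 100.0 * (1.0 - p0 / p), t0 / t)
    return out


summary = relative_to(TABLE5)
for name, (gain, saving, ratio) in summary.items():
    print(f"{name:22s} accuracy +{gain:.2f}  params saved {saving:+.1f}%  time x{ratio:.2f}")
```

A time ratio above 1 means the proposed model is slower than that baseline, which is the case for the three CNN baselines and not for Swin Transformer-tiny.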

5.2. Ablation Study

To validate the effectiveness of AKConv and the Biformer block, we conducted an ablation study, the results of which are shown in Table 6. Introducing the Biformer block alone reduced both the number of parameters and the inference time. These improvements were even more pronounced after integrating AKConv: the number of parameters and the inference time decreased noticeably, demonstrating the clear advantages of AKConv. Unlike standard convolution with a fixed $N \times N$ kernel, whose parameter count grows quadratically with kernel size, AKConv lets the number of kernel parameters grow linearly. This flexibility significantly reduces the number of parameters and improves the inference speed. Furthermore, the dynamic shape of the AKConv kernel adapts better to the varying feature shapes in the input samples, improving the accuracy of the baseline model. The Bi-level Routing Attention (BRA) mechanism also contributes to the enhanced performance. By preserving only the top-k most relevant regions and applying fine-grained attention, BRA reduces redundant computation and focuses on meaningful semantic regions, further boosting accuracy and efficiency. The best results were achieved when both AKConv and the Biformer block were incorporated into the baseline model. Specifically, inference time fell by 1.39 s, accuracy rose by 0.94%, and the F1 score rose by 0.63%, all while reducing the number of parameters. These results validate that our proposed improvements are both practical and effective, yielding a lightweight and highly accurate rice disease classification model.
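The contrast between quadratic and linear parameter growth can be illustrated with a small count of convolution weights. This is a simplified sketch under stated assumptions: it counts only the kernel weights, one per sampled point and channel pair, and ignores biases and any offset-prediction branch. The channel widths and function names are hypothetical.

```python
def standard_conv_weights(c_in, c_out, k):
    """Weights of a standard k x k convolution: grows with k squared."""
    return c_in * c_out * k * k


def sampled_conv_weights(c_in, c_out, n_points):
    """Weights of a kernel with n_points sampled positions: grows linearly (simplified)."""
    return c_in * c_out * n_points


c_in = c_out = 64  # hypothetical channel widths
for size in (3, 5, 7):
    std = standard_conv_weights(c_in, c_out, size)
    lin = sampled_conv_weights(c_in, c_out, size)
    print(f"size {size}: standard={std:,}  linear={lin:,}  ratio={std / lin:.0f}x")
```

Under these assumptions the saving equals the kernel size itself, so enlarging the receptive field costs proportionally less with a linearly parameterized kernel.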

To illustrate the effects of different attention mechanisms more vividly, we use Gradient-weighted Class Activation Mapping (Grad-CAM) [44] to highlight the key regions in different disease samples. This allows researchers to observe intuitively how much attention the model allocates to each region of the input image. Three attention mechanisms are compared: MHSA, the Shifted Window mechanism, and BRA. As shown in Figure 8, redder colors indicate a higher degree of attention at a given location, while bluer colors denote lower attention. With the same color scale applied to all three mechanisms, BRA (ours) allocates on average 82.5% of its high-activation pixels to disease lesion regions, compared to 65.7% for MHSA and 71.2% for the Shifted Window mechanism, measured over the test set. This quantitative evidence indicates that BRA not only suppresses background noise more effectively but also focuses more precisely on disease-relevant patterns, which helps explain its superior classification performance. In some cases, MHSA erroneously allocates its attention to background areas, which may mislead the model and degrade its performance. The Shifted Window mechanism reduces computational cost and parameters by partitioning the attention window and focuses slightly better than MHSA. BRA performs best: it allocates more attention to the diseased regions in the foreground, enabling the model to concentrate more effectively on informative features and thereby achieve improved recognition results.
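A lesion-focus percentage of this kind can be computed from a normalized heatmap and a binary lesion mask. The sketch below is one plausible formulation and not necessarily the procedure used in this study: the activation threshold of 0.5, the availability of lesion masks, and the toy values are all assumptions made for illustration.

```python
def lesion_focus_ratio(heatmap, lesion_mask, threshold=0.5):
    """Fraction of high-activation pixels (heatmap >= threshold) that fall inside the lesion mask."""
    high = inside = 0
    for heat_row, mask_row in zip(heatmap, lesion_mask):
        for value, is_lesion in zip(heat_row, mask_row):
            if value >= threshold:
                high += 1
                inside += 1 if is_lesion else 0
    return inside / high if high else 0.0


# Hypothetical 3 x 3 Grad-CAM heatmap scaled to [0, 1] and a matching lesion mask.
heatmap = [[0.9, 0.8, 0.1],
           [0.7, 0.2, 0.1],
           [0.6, 0.1, 0.0]]
lesion_mask = [[1, 1, 0],
               [1, 0, 0],
               [0, 0, 0]]
ratio = lesion_focus_ratio(heatmap, lesion_mask)
print(f"{100 * ratio:.1f}% of high-activation pixels lie on the lesion")
```

Averaging this ratio over all test images would give a single figure per attention mechanism. The result depends on the chosen threshold, so the same threshold has to be used for every mechanism being compared.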

To further investigate the design choices of our proposed architecture, we performed extended ablation experiments considering three aspects: (1) AKConv kernel size variation ($3 \times 3$, $5 \times 5$, $7 \times 7$), (2) BRA top-k variation (k = 2, 4, 8), and (3) replacement of AKConv or BRA with alternative modules (DCNv2 and Swin Transformer attention). The results are reported in Table 7. We observe that a $5 \times 5$ kernel for AKConv achieves the best trade-off between accuracy and parameter efficiency, while $k = 4$ in BRA provides the optimal balance between accuracy and inference time. Alternative modules such as DCNv2 and Swin attention yield slight accuracy improvements but introduce significant computational overhead (e.g., +1.58 M parameters and +0.93 s inference time for DCNv2), confirming that AKConv and BRA offer a better balance between accuracy and efficiency for lightweight agricultural applications.

As shown in Table 8, AKConv achieves shape adaptability comparable to DCNv2 while maintaining a much lower parameter count and faster inference speed. Compared with standard and depth-wise separable convolution, the dynamic kernel shape of AKConv aligns better with the diverse morphologies of rice lesions, leading to higher precision. The linear parameter growth property ensures that increasing kernel size to capture larger receptive fields does not incur a quadratic increase in computational cost, making AKConv particularly suitable for real-time, resource-limited agricultural applications.

5.3. Complexity Analysis

To empirically validate the theoretical complexity, we measured the FLOPs and inference time of MHSA and BRA within MobileViT blocks at three resolutions: $128 \times 128$, $224 \times 224$, and $320 \times 320$. As shown in Figure 9, BRA scales more favorably, with FLOPs increasing by only ∼3.4× from $128^{2}$ to $320^{2}$, compared with ∼6.3× for MHSA. This matches the predicted $(HW)^{4/3}$ scaling behavior derived in Section 3.4. Furthermore, BRA consistently reduces the inference time by 25–30% across resolutions while maintaining comparable or better accuracy, confirming its suitability for lightweight, high-resolution agricultural image analysis.

6. Conclusions

In this paper, we proposed a lightweight and high-accuracy rice disease classification framework, MobileViT_BiAK, which enhances both local and global representation learning by replacing the original standard convolution and multi-head self-attention with the adaptive Alterable Kernel Convolution (AKConv) and Bi-level Routing Attention (BRA). This design preserves a compact model size and efficient inference speed, making it suitable for deployment on mobile and edge devices in agricultural scenarios.

The ablation study verifies that AKConv effectively captures diverse lesion morphologies through dynamically shaped kernels with linear parameter growth, while BRA efficiently models global dependencies via a hierarchical two-stage attention process that reduces computational complexity. Together, these components achieve a strong balance between recognition performance and computational efficiency.

Although the proposed method demonstrates robust overall performance, the analysis reveals challenges in distinguishing diseases with highly similar visual symptoms. Future work will explore the integration of additional spectral cues, multi-scale feature enhancement, and fine-grained texture modeling to address this limitation.

Overall, MobileViT_BiAK provides a practical and portable solution for intelligent, field-deployable plant disease monitoring systems, contributing to the broader development of smart and digital agriculture.

Author Contributions

All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by T.L. and C.Y. The first draft of the manuscript was written by T.L., and all authors commented on previous versions of the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Guangdong Basic and Applied Basic Research Foundation under Grant 2023A1515011230, the Science and Technology Program of Guangzhou under Grant 2023E04J0037, the Heyuan Social Science and Agriculture Project under Grant 2023015, the Science and Technology Planning Project of Yunfu under Grant 2023020205, the Key Construction Discipline Research Ability Enhancement Project of Guangdong Province under Grant 2022ZDJS022, and the Guangdong Province Science and Technology Innovation Strategy Special Fund (University Student Science and Technology Innovation Cultivation) Project for 2024 under Grant pdjh2024a199.

Data Availability Statement

Datasets for this research are available from the corresponding author on request.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this article.

References

1. Khush, G. Productivity improvements in rice. Nutr. Rev. 2003, 61 (Suppl. S6), S114–S116.
2. Kaur, A.; Guleria, K.; Trivedi, N.K. Rice leaf disease detection: A review. In Proceedings of the 2021 6th International Conference on Signal Processing, Computing and Control (ISPCC), Solan, India, 7–9 October 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 418–422.
3. Pritimoy, S.; Ujjwal, B.; Susanta, K.P. Color texture analysis of rice leaves diagnosing deficiency in the balance of mineral levels towards improvement of crop productivity. In Proceedings of the International Conference on Information Technology, Bhubaneswar, India, 17–20 December 2007; IEEE: Piscataway, NJ, USA, 2007.
4. Ma, C.; Yuan, T.; Yao, X.F.; Ji, Y.B.; Li, L.Y. Study on image recognition method of rice disease in field based on HOG+SVM. In Proceedings of the International Conference on Agricultural Engineering, Guangzhou, China, 12–15 May 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 123–130.
5. Bikash, C.; Sanjeev, P.P. Rice plant disease detection using twin support vector machine (TSVM). J. Sci. Eng. 2019, 7, 61–69.
6. Tian, C.; Zheng, M.; Li, B.; Zhang, Y.; Zhang, S.; Zhang, D. Perceptive self-supervised learning network for noisy image watermark removal. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 7069–7079.
7. Tian, C.; Zheng, M.; Jiao, T.; Zuo, W.; Zhang, Y.; Lin, C.-W. A self-supervised CNN for image watermark removal. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 7566–7576.
8. Mohapatra, D.; Das, N. A precise model for accurate rice disease diagnosis: A transfer learning approach. Proc. Indian Natl. Sci. Acad. 2023, 89, 162–171.
9. Li, D.; Wang, R.; Xie, C.; Liu, L.; Zhang, J.; Li, R.; Wang, F.; Zhou, M.; Liu, W. A recognition method for rice plant diseases and pests video detection based on deep convolutional neural network. Sensors 2020, 20, 578.
10. Rahman, C.R.; Arko, P.S.; Ali, M.E.; Khan, M.A.I.; Apon, S.H.; Nowrin, F.; Wasif, A. Identification and recognition of rice diseases and pests using convolutional neural networks. Biosyst. Eng. 2020, 194, 112–120.
11. Saleem, M.A.; Aamir, M.; Ibrahim, R.; Senan, N.; Alyas, T. An optimized convolution neural network architecture for paddy disease classification. Comput. Mater. Contin. 2022, 71, 6053–6067.
12. Sathya, K.; Rajalakshmi, M. RDA-CNN: Enhanced super resolution method for rice plant disease classification. Comput. Syst. Sci. Eng. 2022, 42, 33–47.
13. Ni, H.; Shi, Z.; Karungaru, S.; Lv, S.; Li, X.; Wang, X.; Zhang, J. Classification of typical pests and diseases of rice based on the ECA attention mechanism. Agriculture 2023, 13, 1066.
14. Yang, L.; Yu, X.; Zhang, S.; Long, H.; Zhang, H.; Xu, S.; Liao, Y. GoogLeNet based on residual network and attention mechanism identification of rice leaf diseases. Comput. Electron. Agric. 2023, 204, 107543.
15. Jiang, M.; Feng, C.; Fang, X.; Huang, Q.; Zhang, C.; Shi, X. Rice disease identification method based on attention mechanism and deep dense network. Electronics 2023, 12, 508.
16. Cheng, D.; Zhao, Z.; Feng, J. Rice diseases identification method based on improved YOLOv7-Tiny. Agriculture 2024, 14, 709.
17. Jia, L.; Wang, T.; Chen, Y.; Zang, Y.; Li, X.; Shi, H.; Gao, L. MobileNet-CA-YOLO: An improved YOLOv7 based on the MobileNetV3 and attention mechanism for rice pests and diseases detection. Agriculture 2023, 13, 1285.
18. Al-Gaashani, M.S.A.M.; Samee, N.A.; Alnashwan, R.; Khayyat, M.; Muthanna, M.S.A. Using a Resnet50 with a kernel attention mechanism for rice disease diagnosis. Life 2023, 13, 1277.
19. Zhang, Z.; Gong, Z.; Hong, Q.; Jiang, L. Swin-transformer based classification for rice diseases recognition. In Proceedings of the 2021 International Conference on Computer Information Science and Artificial Intelligence (CISAI), Beijing, China, 24–26 September 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 153–156.
20. Thakur, P.S.; Khanna, P.; Sheorey, T.; Ojha, A. Vision transformer for plant disease detection: PlantViT. In Proceedings of the International Conference on Computer Vision and Image Processing, Rupnagar, India, 3–5 December 2021; Springer: Cham, Switzerland, 2021; pp. 501–511.
21. Tang, H.; Yuan, C.; Li, Z.; Bai, X.; Wang, S. Learning Attention-Guided Pyramidal Features for Few-Shot Fine-Grained Recognition. Pattern Recognit. 2022, 130, 108792.
22. Cui, J.; Tan, F. Rice plaque detection and identification based on an improved convolutional neural network. Agriculture 2023, 13, 170.
23. Zhao, S.; Zhang, S.; Lu, J.; Wang, H.; Feng, Y.; Shi, C.; Li, D.; Zhao, R. A lightweight dead fish detection method based on deformable convolution and YOLOv4. Comput. Electron. Agric. 2022, 198, 107098.
24. Qi, Y.; He, Y.; Qi, X.; Zhang, Y.; Yang, G. Dynamic snake convolution based on topological geometric constraints for tubular structure segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 6070–6079.
25. Dong, X.; Bao, J.; Chen, D.; Zhang, W.; Yu, N.; Yuan, L.; Chen, D.; Guo, B. Cswin transformer: A general vision transformer backbone with cross-shaped windows. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12124–12134.
26. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022.
27. Tu, Z.; Talebi, H.; Zhang, H.; Yang, F.; Milanfar, P.; Bovik, A.; Li, Y. Maxvit: Multi-axis vision transformer. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 459–479.
28. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861.
29. Tan, M.; Le, Q. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the International Conference on Machine Learning (ICML 2019), Long Beach, CA, USA, 9–15 June 2019; PMLR: Cambridge, MA, USA, 2019; pp. 6105–6114.
30. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. Shufflenet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 6848–6856.
31. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141.
32. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 11534–11542.
33. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
34. Zhai, X.; Kolesnikov, A.; Houlsby, N.; Beyer, L. Scaling Vision Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 12104–12113.
35. Zhu, L.; Wang, X.; Ke, Z.; Zhang, W.; Lau, R. Biformer: Vision transformer with Bi-level Routing Attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 10323–10333.
36. Sangaiah, A.K.; Yu, F.N.; Lin, Y.B.; Xiong, N.; Liang, M. UAV T-YOLO-Rice: An Enhanced Tiny YOLO Networks for Rice Leaves Diseases Detection in Paddy Agronomy. IEEE Trans. Netw. Sci. Eng. 2024, 11, 5201–5216.
37. Ma, R.; Wang, J.; Zhao, W.; Zhao, Y.; Xie, J. Identification of Maize Seed Varieties Using MobileNetV2 with Improved Attention Mechanism CBAM. Agriculture 2022, 13, 11.
38. Mehta, S.; Rastegari, M. MobileViT: Light-weight, general-purpose, and mobile-friendly vision transformer. arXiv 2021, arXiv:2110.02178.
39. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
40. Ren, S.; Zhou, D.; He, S.; Feng, J.; Wang, X. Shunted self-attention via multi-scale token aggregation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10853–10862.
41. Zhang, X.; Song, Y.; Song, T.; Yang, D.; Ye, Y.; Zhou, J.; Zhang, L. AKConv: Convolutional Kernel with Arbitrary Sampled Shapes and Arbitrary Number of Parameters. Image Vis. Comput. 2024, 149, 105190.
42. Sokolova, M.; Lapalme, G. A Systematic Analysis of Performance Measures for Classification Tasks. Inf. Process. Manag. 2009, 45, 427–437.
43. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101.
44. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017.

Figure 1. Visual comparison of common limitations in existing rice disease classification methods. (a) Fixed-size standard convolution kernels fail to adapt to diverse lesion shapes (e.g., circular brown spot vs. elongated bacterial blight). (b) Depthwise separable convolution reduces parameters but still uses fixed kernel shapes, limiting adaptability. (c) Multi-head self-attention captures global context but incurs quadratic complexity, leading to high computational cost. (d) Our proposed AKConv dynamically adapts kernel shapes with linear parameter growth, while BRA reduces attention complexity to $O((HW)^{4/3})$ without losing global modeling capability.

Figure 2. AKConv with different convolution kernel sizes. Here, the numbers denote kernel indices, and the colors represent different kernel sizes.

Figure 3. The structure of MobileViT_BiAK. Arrows indicate the flow of feature maps between layers. Different colors represent different functional modules: green for convolutional layers, orange for AK-Biformer blocks, blue for MV2 modules, and purple for pooling/classification layers.

Figure 4. The structure of AK-Biformer block and Biformer: (a) AK-Biformer block; (b) Biformer.

Figure 5. Sample images of the 10 classes in the dataset. (a) bacterial leaf blight; (b) bacterial leaf streak; (c) bacterial panicle blight; (d) blast; (e) brown spot; (f) dead heart; (g) downy mildew; (h) hispa; (i) normal; (j) tungro.

Figure 6. Loss and accuracy curves for training and validation.

Figure 7. Confusion matrix diagram of classification results. The specific types represented by a–j are shown in Table 2.

Figure 8. Gradient-weighted class activation mapping of the attention mechanisms. Redder colors indicate a higher degree of attention at a given location, while bluer colors denote lower attention.

Figure 9. Empirical FLOPs scaling of MHSA vs. BRA across input resolutions ($128^{2}$, $224^{2}$, $320^{2}$). BRA exhibits a more favorable sub-quadratic growth, consistent with the theoretical $O((HW)^{4/3})$ complexity derived in Section 3.4.

Table 1. MobileViT structure. ↓2 denotes a downsampling operation with stride 2.

Layer	Output Size	Output Stride	Repeat
Image	$256 \times 256$	1	–
Conv- $3 \times 3$ , $↓ 2$	$128 \times 128$	2	1
MV2	$128 \times 128$	2	1
MV2, $↓ 2$	$64 \times 64$	4	1
MV2	$64 \times 64$	4	2
MV2, $↓ 2$	$32 \times 32$	8	1
MobileViT block ( $L = 2$ )	$32 \times 32$	8	1
MV2, $↓ 2$	$16 \times 16$	16	1
MobileViT block ( $L = 4$ )	$16 \times 16$	16	1
MV2, $↓ 2$	$8 \times 8$	32	1
MobileViT block ( $L = 3$ )	$8 \times 8$	32	1
Conv- $1 \times 1$	$8 \times 8$	32	1
Global pool	$1 \times 1$	256	–
Linear	$1 \times 1$	256	–

Table 2. Class distribution of the rice disease dataset across the train, validation, and test sets. The dataset is split in the ratio 8:1:1.

Label Type	Train	Validation	Test	Total
(a) Bacterial leaf blight	383	48	48	479
(b) Bacterial leaf streak	304	38	38	380
(c) Bacterial panicle blight	270	34	33	337
(d) Blast	1390	174	174	1738
(e) Brown spot	772	96	97	965
(f) Dead heart	1154	144	144	1442
(g) Downy mildew	496	62	62	620
(h) Hispa	1275	160	159	1594
(i) Normal	1411	176	177	1764
(j) Tungro	870	109	109	1088
Total	8325	1041	1041	10,407
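A per-class (stratified) 8:1:1 split like the one summarized in Table 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the function `stratified_split`, the rounding rule, and the seed are hypothetical, and the label list is synthetic, built only from the class totals in the table. The resulting validation and test counts may therefore differ from Table 2 by a sample or two in some classes.

```python
import random

def stratified_split(labels, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split sample indices per class so each class keeps roughly the given ratios."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * ratios[0])
        n_val = round(len(indices) * ratios[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

# Class totals from Table 2, used here only to build a synthetic label list.
totals = {"a": 479, "b": 380, "c": 337, "d": 1738, "e": 965,
          "f": 1442, "g": 620, "h": 1594, "i": 1764, "j": 1088}
labels = [name for name, count in totals.items() for _ in range(count)]

train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))
```

Splitting within each class keeps the strong imbalance of the dataset (337 to 1764 images per class) equally represented in all three subsets.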

Table 3. Overall results on the test set.

Accuracy (%)	Precision (%)	Recall (%)	F1 Score (%)
97.31	97.42	96.95	97.18

Table 4. Per-class precision, recall, and F1-score on the test set (%).

Class	Precision	Recall	F1
(a) Bacterial Leaf Blight	100.00	97.92	98.95
(b) Bacterial Leaf Streak	100.00	100.00	100.00
(c) Bacterial Panicle Blight	100.00	100.00	100.00
(d) Blast	98.28	98.28	98.28
(e) Brown Spot	100.00	100.00	100.00
(f) Dead Heart	100.00	100.00	100.00
(g) Downy Mildew	94.64	85.48	89.83
(h) Hispa	98.75	99.37	99.06
(i) Normal	98.31	99.43	98.87
(j) Tungro	94.69	98.17	96.40
Macro Avg.	98.47	97.86	98.14
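The "Macro Avg." row of Table 4 is the unweighted mean of each column over the ten classes, so every class counts equally regardless of how many test images it has. The sketch below recomputes it from the per-class values in the table; the helper name `macro_average` is hypothetical.

```python
# (precision, recall, F1) per class, copied from Table 4 (%).
per_class = {
    "Bacterial Leaf Blight": (100.00, 97.92, 98.95),
    "Bacterial Leaf Streak": (100.00, 100.00, 100.00),
    "Bacterial Panicle Blight": (100.00, 100.00, 100.00),
    "Blast": (98.28, 98.28, 98.28),
    "Brown Spot": (100.00, 100.00, 100.00),
    "Dead Heart": (100.00, 100.00, 100.00),
    "Downy Mildew": (94.64, 85.48, 89.83),
    "Hispa": (98.75, 99.37, 99.06),
    "Normal": (98.31, 99.43, 98.87),
    "Tungro": (94.69, 98.17, 96.40),
}

def macro_average(rows):
    """Unweighted mean of each metric column across classes."""
    n = len(rows)
    columns = zip(*rows.values())
    return tuple(sum(col) / n for col in columns)

macro_p, macro_r, macro_f1 = macro_average(per_class)
print(f"precision={macro_p:.3f} recall={macro_r:.3f} f1={macro_f1:.3f}")
```

Because the mean is unweighted, the comparatively low recall of downy mildew (85.48%) lowers the macro recall as much as any larger class would.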

Table 5. Comparison of the proposed method with representative models in terms of parameter size, inference time, and classification performance. This table highlights the accuracy–efficiency trade-off, showing that our method achieves the best accuracy and F1-score while maintaining a lightweight size comparable to other compact CNNs.

Methods	Parameters (M)	Inference Time (s)	Accuracy (%)	Precision (%)	Recall (%)	F1 Score (%)
MobileNetV3-large	4.214	4.75	95.58	94.90	95.63	95.26
EfficientNet-b0	4.021	5.82	95.49	95.01	95.09	95.05
Swin Transformer-tiny	29.32	12.62	96.06	94.99	95.72	95.35
ResNet-50	23.53	5.76	96.64	96.12	96.18	96.15
Ours	4.453	6.13	97.31	97.42	96.95	97.18
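To make the accuracy–efficiency trade-off in Table 5 concrete, the following sketch derives two quantities from the tabulated values: the accuracy gain of the proposed model over each baseline (in percentage points) and how many times more parameters each baseline has. The helper `compare_to_ours` is a hypothetical name, and only figures from the table are used.

```python
# Rows copied from Table 5: (parameters in millions, accuracy in %).
models = {
    "MobileNetV3-large": (4.214, 95.58),
    "EfficientNet-b0": (4.021, 95.49),
    "Swin Transformer-tiny": (29.32, 96.06),
    "ResNet-50": (23.53, 96.64),
}
ours_params, ours_acc = 4.453, 97.31

def compare_to_ours(models, ours_params, ours_acc):
    """Return (accuracy gain in points, baseline/ours parameter ratio) per baseline."""
    return {
        name: (ours_acc - acc, params / ours_params)
        for name, (params, acc) in models.items()
    }

comparison = compare_to_ours(models, ours_params, ours_acc)
for name, (gain, ratio) in comparison.items():
    print(f"{name}: +{gain:.2f} points, {ratio:.2f}x the parameters of ours")
```

The gains reproduce the 1.73, 1.82, 1.25 and 0.67 point margins quoted in the abstract, while Swin Transformer-tiny and ResNet-50 carry more than five times as many parameters.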

Table 6. MobileViT-based ablation experiments. "-" indicates the component is not used; "✓" indicates the component is included. Bold values denote the best performance.

Number	Parameters (M)	Time (s)	AKConv	BiFormer	Accuracy (%)	Precision (%)	Recall (%)	F1 Score (%)
1	4.961	7.52	-	-	96.37	96.72	96.40	96.55
2	4.434	6.39	✓	-	96.83	96.61	96.68	96.64
3	4.944	6.92	-	✓	96.93	97.05	96.54	96.79
4	4.453	6.13	✓	✓	97.31	97.42	96.95	97.18

Table 7. Extended ablation experiments on AKConv kernel size, BRA top-k, and alternative module replacements.

Configuration	Params (M)	Accuracy (%)	Time (s)
AKConv kernel size variation
MobileViT + AKConv ( $3 \times 3$ )	4.38	96.75	6.11
MobileViT + AKConv ( $5 \times 5$ )	4.50	97.31	6.13
MobileViT + AKConv ( $7 \times 7$ )	4.66	97.38	6.20
BRA top-k variation
MobileViT + BRA ( $k = 2$ )	4.44	96.92	6.05
MobileViT + BRA ( $k = 4$ )	4.45	97.31	6.13
MobileViT + BRA ( $k = 8$ )	4.46	97.52	6.58
Alternative module replacements
MobileViT + DCNv2	5.48	97.20	7.06
MobileViT + Swin Attention	4.83	97.49	6.77

Table 8. Comparison of convolution types in terms of parameter growth, adaptability, and empirical performance on the rice disease dataset.

Convolution Type	Param Growth	Shape Adapt.	Params (M)	Acc. (%)	Time (s)
Standard Conv	Quadratic	Fixed square	4.96	96.37	7.52
Depthwise Sep. Conv	Quadratic (reduced)	Fixed square	4.41	96.12	6.88
DCNv2	Quadratic	High (offset-based)	5.48	97.20	7.06
AKConv (ours, 5 × 5)	Linear	High (dynamic)	4.43	96.83	6.39

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Liu, T.; Liu, M.; Yang, C.; Wu, A.; Li, X.; Wei, W. Lightweight Model Improvement and Application for Rice Disease Classification. Electronics 2025, 14, 3331. https://doi.org/10.3390/electronics14163331
