1. Introduction
Against the backdrop of global food security and sustainable agricultural development, soybean is recognized as a key food and economic crop. Traditional disease diagnosis has relied on field inspections and expert experience, which are labor-intensive, time-consuming, and limited in coverage. As a result, such methods are insufficient to meet the need for early, rapid, and accurate disease monitoring in modern large-scale agriculture.
In early studies, the classification and detection of plant leaf diseases were predominantly carried out using traditional machine learning algorithms. For instance, Kaur et al. proposed a semi-automated soybean leaf disease detection system that combined image preprocessing and feature extraction with support vector machines (SVMs) for disease classification [1]. Jadhav et al. employed multi-class SVM and K-nearest neighbor (KNN) algorithms to recognize soybean leaf diseases and evaluate disease severity [2], while Aparajita et al. integrated statistical and color-based features with SVM for automated multi-class leaf disease classification, achieving high accuracy [3]. Similarly, Masazhar and Kamal developed a digital image processing-based system for oil palm disease identification using color-texture features and SVMs [4], and Pinki et al. applied content-based feature extraction with SVMs for rice leaf disease recognition [5]. These studies established a strong foundation for subsequent research on deep learning and large-scale multimodal models in plant disease recognition.
With the advancement of deep learning, convolutional neural networks (CNNs) and Transformer-based architectures have significantly improved plant disease detection performance. Wang et al. proposed a sliding segmentation approach using the Swin Transformer, achieving 99.64% accuracy for soybean bacterial leaf spot recognition [6]. Bevers et al. applied transfer learning with DenseNet201 to classify raw soybean field images, attaining 96.8% accuracy [7]. Karlekar and Seal developed SoyNet, a CNN-based model with background segmentation, for classifying 16 soybean leaf disease or healthy states, achieving 98.14% accuracy [8]. Zhang et al. constructed a synthetic soybean image dataset and introduced a multi-feature fusion Faster R-CNN model with an mAP of 83.34% [9]. Comparable progress has been achieved for other crops, such as rice, maize, and potatoes. However, these methods generally depend on large annotated datasets and tend to exhibit limited generalization when applied to new environments or unseen disease types [10,11,12,13,14].
Recently, the emergence of vision-language models (VLMs) and multimodal large language models (MLLMs) has opened new research avenues for multi-crop and multi-scenario disease recognition. Pan et al. developed ChatLeafDisease, which leveraged chain-of-thought prompting with LLMs for zero-shot crop disease classification, achieving 88.9% accuracy [15]. Qing et al. introduced YOLOPC-GPT, combining lightweight YOLOPC with GPT-4 for pest and disease diagnosis, reaching 94.5% accuracy [16]. Zhu et al. proposed PotatoGPT, integrating visual and textual modalities for disease recognition and prevention recommendation, achieving 98.43% accuracy [17,18]. Tang et al. designed GF-CNN, combining prompt engineering and gated convolutional networks for potato pest question answering, while Zhao et al. constructed a seabuckthorn disease detection system integrating LLMs and knowledge graphs, achieving 94% accuracy [19,20]. Yu et al. introduced AgriVLM, a vision-language framework employing LoRA fine-tuning to achieve over 90% accuracy in cross-modal agricultural tasks [21].
Although traditional convolutional neural networks perform reasonably well in plant disease classification, their generalization capability is limited under complex conditions and across different data distributions. Large language models provide stronger feature representations and cross-task transfer capabilities; however, their effective application in agricultural disease recognition remains challenging because annotated data are scarce and imbalanced. Under such small-sample conditions, conventional batch fine-tuning often leads to overfitting and unstable training, while existing parameter-efficient fine-tuning methods focus primarily on computational efficiency and may fail to fully adapt the visual encoder to fine-grained, domain-specific disease features.
To address these limitations, a disease recognition method suitable for small-sample scenarios was proposed, combining large language models with a stage-wise progressive fine-tuning strategy to classify three major soybean leaf diseases and healthy samples. During training, the dataset is gradually introduced in stages, allowing the model to first capture core task features and subsequently adapt to more complex sample distributions, thereby improving training stability and generalization. An auxiliary ViT model is used in the initial stage to determine optimal hyperparameters through Bayesian optimization, which are then transferred to the fine-tuning of Qwen2.5-VL-3B. This progressive, task-oriented strategy enables the large model to achieve high accuracy and robustness even with limited data, better utilizing small-sample information and enhancing fine-grained disease feature recognition compared to conventional batch fine-tuning or existing parameter-efficient methods. This approach provides a scalable framework for rapid and reliable disease diagnosis in precision agriculture.
2. Materials and Methods
2.1. Experimental Workflow
This section describes the methodological framework used to evaluate the visual capabilities of the multimodal large language model Qwen2.5-VL for plant disease detection and classification [22]. The overall experimental workflow is illustrated in Figure 1 and includes four main components: cross-architecture hyperparameter transfer, progressive parameter-efficient fine-tuning, zero-shot evaluation, and quantitative evaluation with ablation studies.
First, the Vision Transformer (ViT) model was trained using Bayesian optimization to identify the optimal hyperparameter configuration, which was then transferred to Qwen2.5-VL to provide high-quality initialization for subsequent fine-tuning [23,24]. Next, the model underwent progressive parameter-efficient fine-tuning, integrating parameter freezing, low-rank adaptation (LoRA), and prompt engineering strategies to improve convergence efficiency and generalization performance under limited sample conditions [25].
Then, zero-shot evaluation was conducted without any additional fine-tuning, directly leveraging the visual-language knowledge acquired during pre-training to assess the model’s inherent recognition ability on soybean leaf diseases. Finally, quantitative evaluation and ablation experiments were performed to comprehensively analyze model performance and verify the contribution of each methodological component.
2.2. Data Preprocessing
The dataset used in this study was sourced from the publicly available ASDID dataset (Auburn University, Auburn, AL, USA) [26]. It consists of soybean leaf images acquired under diverse lighting conditions and natural field environments and is widely adopted as a benchmark in plant disease-related computer vision research. Four categories were selected for analysis (healthy leaves, soybean rust, downy mildew, and Cercospora leaf blight), as illustrated in Figure 2.
To ensure the scientific rigor of the experiments, the dataset partitioning and usage are summarized in Table 1. During the hyperparameter optimization of the ViT auxiliary model, the model was trained on the training set and evaluated on an independent validation set, and the final assessment of the optimized model was conducted on the test set to comprehensively measure its generalization ability. During the fine-tuning of the Qwen2.5-VL model, the core hyperparameters were predetermined through a cross-architecture transfer strategy, and this stage was primarily intended to verify the effectiveness of these settings. All performance metrics of Qwen2.5-VL were derived from the test set. All images in the dataset were manually annotated by the original dataset creators, ensuring high-quality labels and accurate correspondence to the respective disease categories. To construct a high-quality and standardized dataset, a complete data processing pipeline was implemented, including data cleaning, quality filtering, lesion-region extraction, and multi-scale normalization, thereby providing reliable data support for model training.
To address the issues of uneven quality and substantial background noise in the original dataset, a two-stage preprocessing pipeline was designed, as illustrated in Figure 3. In the first stage, image quality was evaluated using the variance of the Laplacian to measure sharpness and the standard deviation of pixel intensity to measure contrast. All images were ranked according to a “sharpness-first, contrast-second” criterion, and the top 800 images from each category were selected to form a high-quality image subset. This subset was subsequently randomly divided into training, validation, and test sets for further experiments.
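The first-stage filter can be illustrated with a minimal sketch. It assumes grayscale images stored as nested lists, a 4-neighbour Laplacian kernel, and a lexicographic reading of the “sharpness-first, contrast-second” criterion. The function names are hypothetical, and the original pipeline may use a different kernel or tie-breaking rule.

```python
from statistics import pstdev, pvariance

Image = list[list[float]]  # grayscale image as rows of pixel intensities


def laplacian_variance(img: Image) -> float:
    """Sharpness proxy: variance of the 4-neighbour Laplacian over interior pixels."""
    h, w = len(img), len(img[0])
    responses = [
        img[r - 1][c] + img[r + 1][c] + img[r][c - 1] + img[r][c + 1] - 4 * img[r][c]
        for r in range(1, h - 1)
        for c in range(1, w - 1)
    ]
    return pvariance(responses)


def contrast(img: Image) -> float:
    """Contrast proxy: standard deviation of pixel intensity."""
    return pstdev([p for row in img for p in row])


def select_top_k(images: dict[str, Image], k: int) -> list[str]:
    """Rank by sharpness first and contrast second (assumed lexicographic), keep top k."""
    ranked = sorted(
        images,
        key=lambda name: (laplacian_variance(images[name]), contrast(images[name])),
        reverse=True,
    )
    return ranked[:k]


flat = [[10.0] * 4 for _ in range(4)]                                   # no edges
checker = [[255.0 * ((r + c) % 2) for c in range(4)] for r in range(4)]  # strong edges
best = select_top_k({"flat": flat, "checker": checker}, k=1)
```

In the study, the same ranking would be applied per category with k = 800.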
In the second stage, lesion localization and ROI extraction were performed using an attention-based deep learning strategy. ResNet-50 was adopted as the backbone feature extractor and trained exclusively on the training set. After training, the network parameters were frozen. Grad-CAM was then applied to the trained model to generate class activation maps for images in the training, validation, and test sets [27]. Based on these activation maps, regions of interest (ROIs) were automatically delineated and cropped.
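The text does not state how the ROIs were delineated from the activation maps. One common choice, shown here purely as an assumption, is to threshold the map at a fraction of its peak value and crop the bounding box of the surviving cells. The 0.5 threshold and both function names are hypothetical, and the map is assumed to be already upsampled to the image size.

```python
def roi_from_cam(cam, rel_threshold=0.5):
    """Return (top, left, bottom, right), inclusive, of cells >= rel_threshold * max."""
    peak = max(max(row) for row in cam)
    cutoff = rel_threshold * peak
    hits = [(r, c) for r, row in enumerate(cam) for c, v in enumerate(row) if v >= cutoff]
    rows = [r for r, _ in hits]
    cols = [c for _, c in hits]
    return min(rows), min(cols), max(rows), max(cols)


def crop(img, box):
    """Crop a nested-list image to an inclusive bounding box."""
    top, left, bottom, right = box
    return [row[left:right + 1] for row in img[top:bottom + 1]]


cam = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.9, 1.0, 0.0],
    [0.0, 0.6, 0.2, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
image = [[r * 4 + c for c in range(4)] for r in range(4)]
box = roi_from_cam(cam)
patch = crop(image, box)
```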
To ensure consistent representation and to accommodate lesions of varying spatial scales, the extracted ROIs were resized to three standardized resolutions: 224 × 224, 448 × 448, and 712 × 712 pixels. Through this procedure, the final multi-scale experimental dataset was established.
2.3. Qwen2.5-VL Model Architecture and Implementation
The identification and classification of soybean leaf diseases are investigated in this study. The task is regarded as a typical fine-grained visual classification problem, in which subtle lesion features must be captured and the overall leaf structure together with its contextual relationships must be understood. Although traditional convolutional neural networks (CNNs) are capable of extracting local features, their ability to model global information and long-range dependencies has been considered insufficient [28]. To overcome these limitations, a Transformer-based architecture is employed. Through its self-attention mechanism, global relationships among image regions are established from the initial stage of processing, which is essential for fine-grained disease classification.
Qwen2.5-VL-3B is adopted as the primary model. Owing to the Transformer’s global modeling capability and extensive pretraining on large-scale vision-language datasets, the model provides strong visual priors and semantic understanding, forming a solid basis for disease recognition. To mitigate the computational cost of hyperparameter optimization on such a large model, a Vision Transformer (ViT) is introduced as an auxiliary model. Owing to the architectural consistency between ViT and Qwen2.5-VL-3B, both relying on Transformer-based vision encoders, ViT serves as an effective reference for hyperparameter tuning. Moreover, ViT demonstrates strong performance with relatively low computational overhead, making it a suitable lightweight tool for optimization in this study.
In terms of model architecture, Qwen2.5-VL was composed of a high-efficiency visual encoder, a dynamic spatiotemporal modeling module, and a language model backbone. The visual encoder was based on the Vision Transformer (ViT) architecture. To handle high-resolution images and long videos, window attention was incorporated into certain layers, such that attention was computed within local windows, reducing computational complexity and improving training and inference efficiency. The encoder also employed SwiGLU activation functions and RMSNorm normalization, ensuring compatibility with the language model backbone and enabling stable cross-modal feature alignment.
Regarding dynamic resolution and spatiotemporal modeling, the naive dynamic resolution mechanism was extended. In the spatial dimension, arbitrary-resolution images were supported and converted into visual tokens. In the temporal dimension, dynamic FPS sampling was applied to accommodate videos with varying frame rates. By employing multimodal rotary position embedding (M-RoPE), temporal sequences and playback speed were learned, and specific moments within videos could be accurately localized.
For model initialization, Qwen2.5-VL was loaded with publicly released pretrained weights, which were obtained from large-scale multimodal data and encode general visual and linguistic representations. The text encoder utilized a BPE tokenizer with a vocabulary size of approximately 152k, including special tokens for multimodal alignment. The visual encoder partitioned input images into patches and mapped them into token embeddings. Each encoder layer consisted of multi-head self-attention and feedforward networks, with certain layers using window attention to reduce computation while preserving global dependencies. During fine-tuning, selected visual encoder parameters could be frozen to facilitate training on small datasets, and visual features were projected into the language model embedding space to enable joint cross-modal representation. In this study, Qwen2.5-VL is employed strictly as the experimental model for soybean leaf disease recognition.
2.4. Prompt Engineering
Prompt engineering refers to designing initial text or instructions that guide a model to generate outputs of a desired type or style, helping ensure responses meet specific expectations. By aligning prompt construction with natural language interaction patterns, it lowers the technical barrier for users and supports effective interaction with complex large models.
In this study, prompt engineering played an important role in enabling the model to accurately classify soybean leaf diseases. Carefully designed prompts directed the model’s attention toward key features relevant to disease diagnosis. To refine the prompts, a systematic iterative process was applied, involving adjustments based on analyses of instruction clarity, contextual completeness, and overall structure.
Initially, preliminary prompts were tested on a set of validation samples, and the resulting model outputs were carefully analyzed, with particular emphasis on classification accuracy, adherence to instructions, and consistency. It was observed that enforcing a structured output format was more effective than providing complex input structures. Specifically, the model was explicitly instructed to return classification results in JSON format, providing a clear framework for visual reasoning and encouraging the model to focus on the classification task rather than generating descriptive text. This approach ensured machine-readable outputs and provided a reliable basis for subsequent quantitative evaluation. It is important to emphasize that while the designed prompts constrain the output space to specific categories to ensure consistency, this process involves no gradient updates or exposure to labeled training instances, thereby strictly maintaining the zero-shot nature of the initial evaluation. As summarized in Table 2, through this data-driven optimization process, the general capabilities of Qwen2.5-VL were effectively guided and precisely applied to the task of soybean disease diagnosis.
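To illustrate how a JSON-constrained prompt yields machine-readable predictions, the following sketch builds an instruction that restricts the output to the four categories and parses the reply defensively. The prompt wording, the `label` key, and both function names are hypothetical and do not reproduce the prompts used in this study.

```python
import json

CLASSES = ["healthy", "soybean rust", "downy mildew", "Cercospora leaf blight"]


def build_prompt(classes=CLASSES) -> str:
    """Compose an instruction that restricts the answer to a closed label set."""
    options = ", ".join(f'"{c}"' for c in classes)
    return (
        "Classify the soybean leaf in the image. "
        f"Choose exactly one label from: {options}. "
        'Respond only with JSON of the form {"label": "<one of the labels>"}.'
    )


def parse_label(response: str, classes=CLASSES):
    """Return the predicted class, or None if the reply is malformed or off-list."""
    start, end = response.find("{"), response.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        payload = json.loads(response[start:end + 1])
    except json.JSONDecodeError:
        return None
    label = payload.get("label") if isinstance(payload, dict) else None
    return label if label in classes else None
```

Replies that cannot be parsed or that name an unknown class return `None`, so they can be counted as errors during quantitative evaluation rather than silently dropped.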
2.5. Bayesian Optimization for Fine-Tuning Hyperparameters
This study focused on evaluating model performance under data-scarce scenarios and employed few-shot learning to fine-tune the Qwen2.5-VL model. Independent training iterations were conducted for each image resolution, and the model’s performance was ultimately assessed based on the highest classification accuracy achieved on an independent test set.
The final effectiveness of model fine-tuning is highly dependent on hyperparameter configuration, including key variables such as the number of epochs, batch size, and learning rate, forming a multidimensional optimization problem. For instance, a high learning rate may accelerate convergence but risks falling into suboptimal solutions, while a large batch size stabilizes gradient estimates but may compromise model generalization. Therefore, a systematic exploration of the hyperparameter space is a prerequisite for maximizing model performance.
Traditional hyperparameter optimization (HPO) methods, such as grid search, rely on exhaustive exploration of discrete parameter combinations. Their computational complexity grows exponentially with the number of parameters, making them impractical for high-dimensional spaces [29]. To address this, Bayesian optimization was adopted in this study. This approach uses a probabilistic model to predict the potential impact of different hyperparameter combinations on model performance and iteratively searches for the optimal configuration. The Tree-structured Parzen Estimator (TPE) algorithm was employed, which constructs a surrogate function to approximate the objective function and performs global optimization in a sequential, model-based manner. Specifically, the algorithm builds a probability distribution based on the results of previous samples and uses an acquisition function to select the next evaluation point. Through iterative training, this process progressively enhances the predictive performance of the model.
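Specifically, the TPE method models the conditional probability $p(x \mid y)$ and the marginal probability $p(y)$, defining the expression for $p(x \mid y)$ as:
$$p(x \mid y) = \begin{cases} l(x), & y < y^{*} \\ g(x), & y \geq y^{*} \end{cases}$$
Here, $x$ represents the observations, and $y^{*}$ is a predefined threshold. $l(x)$ and $g(x)$ are the density functions formed by observations with loss values less than and greater than $y^{*}$, respectively.
The Expected Improvement ($EI$) is used as the acquisition function. Using Bayes’ theorem, the expression for the $EI$ acquisition function is:
$$EI_{y^{*}}(x) = \int_{-\infty}^{y^{*}} (y^{*} - y)\, p(y \mid x)\, dy = \int_{-\infty}^{y^{*}} (y^{*} - y)\, \frac{p(x \mid y)\, p(y)}{p(x)}\, dy$$
By defining $\gamma = p(y < y^{*})$, we can transform $EI$ to $EI_{y^{*}}(x) \propto \left( \gamma + \frac{g(x)}{l(x)} (1 - \gamma) \right)^{-1}$.
Based on the formula, the maximum $EI$ guides the selection of the next set of parameters by leveraging the distributions $l(x)$ (for good parameters) and $g(x)$ (for bad parameters). This process iteratively updates $l(x)$ and $g(x)$ until convergence, yielding the optimal hyperparameters.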
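The selection rule can be made concrete with a one-dimensional toy example. Past trials are split into a good group (lowest losses) and a bad group. A density is fitted to each group, and the candidate with the largest good-to-bad density ratio is chosen, which is equivalent to maximising the Expected Improvement in TPE. This is a simplified sketch, not Optuna's implementation: the fixed-bandwidth Gaussian kernel, the quantile of 0.25, the externally supplied candidate list, and all names are assumptions.

```python
import math


def kde(points, bandwidth):
    """Gaussian kernel density estimate over 1-D points."""
    norm = 1.0 / (len(points) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points)

    return density


def tpe_suggest(history, candidates, gamma=0.25, bandwidth=0.5):
    """Pick the candidate with the largest good/bad density ratio."""
    if len(history) < 2:
        raise ValueError("need at least two past trials to form both groups")
    ordered = sorted(history, key=lambda pair: pair[1])  # ascending loss
    n_good = min(len(ordered) - 1, max(1, math.ceil(gamma * len(ordered))))
    good_density = kde([x for x, _ in ordered[:n_good]], bandwidth)
    bad_density = kde([x for x, _ in ordered[n_good:]], bandwidth)
    return max(candidates, key=lambda x: good_density(x) / (bad_density(x) + 1e-12))


# Hypothetical past trials: x is log10(learning rate), loss is minimised near x = -4.
xs = [-6.0, -5.5, -5.0, -4.5, -4.0, -3.5, -3.0, -2.5]
history = [(x, (x + 4.0) ** 2) for x in xs]
suggestion = tpe_suggest(history, candidates=[-6.0, -5.0, -4.25, -3.0, -2.0])
```

The suggestion falls between the two best past trials, where good observations are dense and bad ones are sparse.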
To implement Bayesian optimization, this study adopted the open-source framework Optuna 3.6, utilizing its default Tree-structured Parzen Estimator (TPE) as the sampler. The entire optimization process was conducted on the preprocessed soybean leaf image dataset, which included training and validation samples with corresponding disease labels. Optuna’s trial-based framework records the model’s accuracy on the validation and test sets after each training epoch. Based on this, it dynamically updates its probabilistic surrogate model to provide more precise hyperparameter suggestions for subsequent trials. Furthermore, to maximize computational efficiency, Optuna’s built-in early stopping and pruning mechanisms were employed. This allows for the termination of unpromising trials at intermediate stages, saving resources and focusing efforts on more promising parameter configurations [30].
After 30 trials of Bayesian optimization, this study successfully identified the optimal hyperparameters for the ViT (Vision Transformer) model on the soybean dataset. For each image resolution, the best-performing model weights were saved as a .pth file for final evaluation, and their corresponding hyperparameters were archived in a .json file to guide subsequent experiments. This approach ensures that a fully optimized, specialized classification model was generated for each resolution during the experimental process.
2.6. Parameter-Efficient Fine-Tuning (PEFT) Setup
To enhance experimental efficiency, this research adopted a cross-architecture hyperparameter transfer strategy. This approach assumes that the optimal hyperparameters identified during the auxiliary ViT model’s training will yield comparable performance when applied to Qwen2.5-VL, thereby narrowing the search space for large model tuning.
Experiments were conducted on a workstation running Windows 10 Professional (Build 19045), equipped with an Intel processor, 32 GB of RAM, and an NVIDIA Quadro RTX 5000 GPU (16 GB VRAM). The software environment included Python 3.10.11, PyTorch 2.5.1+cu118, and CUDA 11.8.
In this study, the Parameter-Efficient Fine-Tuning (PEFT) framework was constructed by integrating a selective freezing mechanism with Low-Rank Adaptation (LoRA). First, to address the high training overhead and parameter redundancy of multimodal large language models on visual tasks, a selective freezing mechanism was introduced. This mechanism freezes the shallow, general-purpose representation layers and retains gradient updates only in the high-level, task-specific representation layers. This significantly reduces VRAM and computational consumption while maintaining the model’s stability in semantic transfer and cross-modal representation.
Building on this, a Low-Rank Adaptation (LoRA) module was further integrated. It is deployed via an external injection method into the key projection layers responsible for vision-language interaction. The introduction of LoRA allows the model to learn the discriminative features of the disease recognition task using a minimal number of additional parameters, all while keeping the original model weights fixed. This lightweight parameter-injection method not only ensures high efficiency during fine-tuning but also equips the model with better extensibility and cross-task transferability.
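The low-rank update can be written as y = Wx + (α/r)·B(Ax), where W is the frozen pretrained weight, A (r × d_in) and B (d_out × r) are the only trainable matrices, r is the rank, and α is a scaling factor. The sketch below is a dependency-free illustration of that formula with hypothetical values. The actual rank, scaling, and target layers are those listed in the parameter tables and are not reproduced here.

```python
def matvec(m, v):
    """Multiply a nested-list matrix by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(w, a, b, x, alpha, rank):
    """y = W x + (alpha / rank) * B (A x); W stays frozen, only A and B are trained."""
    base = matvec(w, x)
    update = matvec(b, matvec(a, x))
    scale = alpha / rank
    return [y + scale * u for y, u in zip(base, update)]


def added_parameter_ratio(d_out, d_in, rank):
    """Adapter parameters relative to the frozen weight matrix."""
    return rank * (d_in + d_out) / (d_out * d_in)


w = [[1.0, 2.0], [3.0, 4.0]]   # frozen pretrained weight (hypothetical values)
a = [[1.0, 1.0]]               # A: rank x d_in
b_init = [[0.0], [0.0]]        # B starts at zero, so the adapter is initially a no-op
b_trained = [[1.0], [2.0]]     # B after some hypothetical updates
x = [1.0, 1.0]

y_init = lora_forward(w, a, b_init, x, alpha=2.0, rank=1)
y_trained = lora_forward(w, a, b_trained, x, alpha=2.0, rank=1)
ratio = added_parameter_ratio(1024, 1024, 8)
```

Because B is initialised to zero, fine-tuning starts exactly from the pretrained behaviour, and for a 1024 × 1024 layer a rank-8 adapter adds under 2% additional parameters.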
Table 3, Table 4 and Table 5 summarize the parameter settings used for the Freeze-tuning and LoRA methods across the three different image resolutions.
The data required for fine-tuning were organized in JSONL format, with each entry consisting of an input-output pair. For the soybean disease recognition task, four stage-specific training files were constructed for each of the three input image resolutions (224 px, 448 px, and 712 px), accompanied by corresponding validation and test sets. These files were automatically converted from the original CSV dataset.
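A minimal sketch of the CSV-to-JSONL conversion is given below. The column names (`image_path`, `label`) and the record keys (`image`, `input`, `output`) are hypothetical, since the exact schema is not specified here. The actual field names depend on the fine-tuning toolkit used.

```python
import csv
import io
import json


def csv_to_jsonl(csv_text: str, prompt: str) -> str:
    """Turn rows with 'image_path' and 'label' columns into JSONL input-output pairs."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = {
            "image": row["image_path"],                     # path to the resized ROI
            "input": prompt,                                # instruction shown to the model
            "output": json.dumps({"label": row["label"]}),  # expected JSON answer
        }
        lines.append(json.dumps(record))
    return "\n".join(lines) + "\n"


csv_text = "image_path,label\nimgs/a.jpg,soybean rust\nimgs/b.jpg,healthy\n"
jsonl = csv_to_jsonl(csv_text, "Classify the soybean leaf and answer in JSON.")
records = [json.loads(line) for line in jsonl.splitlines()]
```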
Using these resolution-specific data files, LoRA-based parameter-efficient fine-tuning tasks for the Qwen2.5-VL model were conducted on a local GPU environment. During fine-tuning, the optimal hyperparameters identified in the ViT tuning phase were applied. Upon completion, the optimized models for each specific resolution were used for the final prediction tasks. This process ensured controlled computational resource usage, enabled the transfer and reuse of hyperparameters across different model architectures, and improved the overall efficiency of the experimental design.
2.7. Progressive Fine-Tuning Strategy
To further validate the generalizability of the near-optimal hyperparameters obtained via Bayesian optimization using the ViT auxiliary model, progressive fine-tuning experiments were conducted on the Qwen2.5-VL model.
The dataset was randomly divided into training, validation, and test sets at the beginning of the experiments, as summarized in Table 1. This split was performed only once, and the validation and test sets were kept fixed throughout all stages of the progressive fine-tuning process.
A four-stage progressive fine-tuning strategy was adopted and applied exclusively within the training set. In the first stage, the model was fine-tuned using 488 training samples. In each subsequent stage, an additional 488 training samples were cumulatively incorporated from the remaining training data. This staged strategy enables the evaluation of the applicability of the transferred hyperparameters and allows for assessing the model’s adaptability as the training dataset gradually expands.
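The cumulative staging can be sketched as follows: the training identifiers are shuffled once, and stage k uses the first k × 488 of them, so each stage contains all samples of the previous one. The fixed seed and the function name are assumptions, and the sketch does not attempt class-balanced sampling within each stage, which the study may or may not have applied.

```python
import random


def progressive_stages(train_ids, stage_size=488, n_stages=4, seed=0):
    """Return cumulative training subsets; stage k holds the first k * stage_size samples."""
    order = list(train_ids)
    if len(order) < stage_size * n_stages:
        raise ValueError("not enough training samples for the requested stages")
    random.Random(seed).shuffle(order)  # shuffle once so the stages are nested
    return [order[: stage_size * k] for k in range(1, n_stages + 1)]


stages = progressive_stages(range(1952))
sizes = [len(stage) for stage in stages]
```

With 1952 training images, the four stages contain 488, 976, 1464, and 1952 samples, respectively.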
During this process, 12 fine-tuned Qwen2.5-VL models were trained across different image resolutions of soybean leaves, each corresponding to a specific stage of the progressive fine-tuning strategy. After training, each model was used to generate predictions on the test set, enabling a systematic evaluation of performance changes across the stages. This approach allows for comparing the effects of different image resolutions on model performance and analyzing how the model’s adaptability and generalization ability improve with the progressive increase in training samples.
2.8. Model Evaluation Strategy
In the evaluation phase, both the fine-tuned Qwen2.5-VL and the auxiliary ViT models were employed to generate predictions on the test dataset. Each model processed the test images, and the resulting plant disease classifications were recorded individually for subsequent analysis. The predictions from the ViT model were also archived separately for comparison.
Four standard metrics were employed for quantitative evaluation: Accuracy, Precision, Recall, and F1-score [31].
Accuracy measures the proportion of all correctly classified samples relative to the total number of samples. It is calculated as:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
where the terms for a given class are defined as:
TP (True Positive): The number of positive samples (e.g., diseased) correctly classified as positive.
TN (True Negative): The number of negative samples (e.g., healthy) correctly classified as negative.
FP (False Positive): The number of negative samples incorrectly classified as positive.
FN (False Negative): The number of positive samples incorrectly classified as negative.
Recall (or Sensitivity) measures the model’s ability to identify all actual positive cases. It is calculated as:
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
Precision measures the proportion of predicted positive cases that were truly positive. It is calculated as:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
The F1-score is the harmonic mean of Precision and Recall, providing a single metric that balances both. A higher F1-score indicates better model performance. It is calculated as:
$$F1\text{-}score = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
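For a multi-class task, these quantities are computed per class in a one-vs-rest fashion and then averaged. The following sketch computes overall accuracy together with macro-averaged Precision, Recall, and F1-score from lists of true and predicted labels. The function name and the convention of returning 0 for an undefined ratio are assumptions.

```python
def macro_metrics(y_true, y_pred, classes):
    """Overall accuracy plus macro-averaged precision, recall, and F1 (one-vs-rest)."""
    precisions, recalls, f1s = [], [], []
    pairs = list(zip(y_true, y_pred))
    for cls in classes:
        tp = sum(t == cls and p == cls for t, p in pairs)
        fp = sum(t != cls and p == cls for t, p in pairs)
        fn = sum(t == cls and p != cls for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return {
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy example with two classes and one misclassified sample.
scores = macro_metrics(["a", "a", "b", "b"], ["a", "b", "b", "b"], classes=["a", "b"])
```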
To further assess the computer vision capabilities of the Qwen2.5-VL model, this study conducted a cross-resolution prediction experiment. In this experiment, models fine-tuned on medium-resolution images were used to predict diseases on the high-resolution test set. Concurrently, models fine-tuned on high-resolution images were used to predict diseases on the medium-resolution test set. This was done to evaluate the model’s adaptability and robustness across different image resolutions.
3. Results
This section presents and analyzes the experimental results of this study. The primary objective is to evaluate the performance difference of the multimodal large language model Qwen2.5-VL-3B on the soybean disease detection and classification task, both before and after fine-tuning. It also aims to validate the effectiveness of the proposed cross-architecture hyperparameter transfer and progressive fine-tuning strategies. This research employed a structured experimental design and a systematic performance evaluation workflow. To ensure the clarity and comparability of the results, the entire evaluation process adhered to a unified set of metrics and experimental settings.
First, a comparative experiment was conducted to verify the optimization effects of different auxiliary models on the fine-tuning results under the hyperparameter transfer strategy. Subsequently, during the few-shot learning phase, the performance improvements from Parameter-Efficient Fine-Tuning were systematically analyzed. Simultaneously, the Qwen2.5-VL-3B model was evaluated in a zero-shot phase without any fine-tuning to establish a baseline performance based solely on its pre-trained knowledge. Finally, in the cross-resolution experiment, the model’s generalization ability under mismatched training and testing input conditions was assessed, thereby comprehensively validating the effectiveness of the proposed methodology.
3.1. Comparison of ResNet-50 and ViT for Hyperparameter Transfer
As described in the methods section, this study first optimized the hyperparameters for both the ViT and ResNet-50 models. The optimal combination of key training parameters, such as learning rate and batch size, was automatically searched using the Bayesian optimization algorithm. Subsequently, the optimized hyperparameters were transferred to the Qwen2.5-VL multimodal large model for fine-tuning.
To validate the effectiveness of this transfer strategy, a soybean disease detection experiment was conducted using 1952 training images and 611 test images, covering three major soybean leaf diseases and healthy samples. Under the condition of a 448 × 448 pixel input image resolution, Figure 4 presents a detailed comparison of the performance metrics.
The experimental results indicate that ViT significantly outperformed ResNet-50 across multiple metrics, including Accuracy, Precision, Recall, and F1-score. During training, ViT’s Accuracy increased from 89.69% to 94.10% (), whereas ResNet-50’s Accuracy improved from 78.89% to 84.94%. The Recall of ViT also increased from 89.91% to 94.24%, demonstrating a substantial enhancement in the correct identification of disease samples.
This performance gap is primarily attributed to the architectural compatibility between ViT and Qwen2.5-VL. ViT, built upon the Transformer architecture, is designed to leverage self-attention for capturing global contextual relationships and long-range dependencies, while the visual encoder of Qwen2.5-VL employs a similar global feature modeling strategy. Consequently, hyperparameters optimized via ViT can be effectively transferred to Qwen2.5-VL, enabling fine-tuning to be performed more efficiently and stably. In contrast, the convolutional structure of ResNet-50 primarily extracts local features, which limits the effectiveness of hyperparameter transfer due to its architectural inconsistency.
By employing Bayesian optimization in combination with a progressive fine-tuning strategy, near-optimal parameter settings were obtained during early training stages, and further improvements were achieved in subsequent iterations. This synergistic approach allows the advantages of the Transformer architecture to be leveraged together with hyperparameter transfer, resulting in higher Accuracy and enhanced robustness of Qwen2.5-VL in soybean disease detection tasks.
3.2. Fine-Tuning Efficacy and Model Comparison
3.2.1. Performance Analysis of Fine-Tuning Strategies
To evaluate the impact of different fine-tuning strategies on the performance of the soybean disease classification model, this study designed two fine-tuning schemes: Progressive Fine-Tuning and Parameter-Efficient Fine-Tuning. Progressive fine-tuning involves gradually increasing the training samples in stages, with an additional 488 images added in each subsequent stage. This allows the model to first learn key soybean leaf features from a small dataset and then gradually adapt to a larger data distribution. In contrast, Parameter-Efficient Fine-Tuning uses the entire training set in a single stage for a one-time parameter update. The experimental results are shown in Table 6, where Precision, Recall, and F1-score are reported as macro-averages across the four classes, and Accuracy represents the overall accuracy.
The experimental results show that Progressive Fine-Tuning consistently outperformed Parameter-Efficient Fine-Tuning across all resolutions. The most significant improvement occurred at 224 px, where Accuracy increased from 84.50% to 92.03% and the F1-score rose from 85.20% to 92.12%. At 448 px and 712 px, Progressive Fine-Tuning maintained a stable and consistent advantage. The statistical significance of these performance gains is confirmed by a t-test (). To ensure a fair comparison, training epochs for Parameter-Efficient Fine-Tuning were precisely adjusted to match the cumulative sample exposure of the Progressive Fine-Tuning strategy, aligning the total computational budget and training time across both schemes.
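The budget-matching step can be illustrated with simple arithmetic. If each progressive stage is trained for the same number of epochs, the cumulative exposure is the sum of stage sizes multiplied by that epoch count. Dividing by the full training-set size gives the equivalent number of single-stage epochs. The equal-epochs assumption and the value of 2 epochs per stage are hypothetical; the actual schedules are not stated here.

```python
def matched_epochs(stage_sizes, epochs_per_stage, full_size):
    """Single-stage epoch count giving the same number of sample presentations."""
    exposure = sum(size * epochs_per_stage for size in stage_sizes)
    return exposure / full_size


# Four cumulative stages of 488, 976, 1464, and 1952 samples, 2 epochs each (assumed).
equivalent = matched_epochs([488, 976, 1464, 1952], epochs_per_stage=2, full_size=1952)
```

Under these assumptions the progressive schedule presents 9760 samples in total, which corresponds to 5 epochs over the full 1952-image training set.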
The stage-wise training approach mitigates overfitting under limited data conditions and allows the model to gradually learn from low-level to high-level image features, thereby enhancing its capacity to represent complex visual patterns. Additionally, progressive fine-tuning helps maintain smooth gradient updates, ensuring stable convergence and reliable predictive performance.
Further analysis was conducted to evaluate the performance of progressive fine-tuning under different image resolutions. As illustrated in Figure 5, both Precision and Recall consistently improve with increasing image resolution. From Phase 1 to Phase 4, performance gains are observed across all resolutions, with the most pronounced improvements occurring at 712 px, suggesting that higher-resolution inputs provide richer visual details for feature learning. In comparison, improvements at 224 px and 448 px are more moderate but remain stable, particularly in Phase 4. Overall, these results demonstrate that progressive fine-tuning effectively enhances model performance and exhibits strong adaptability across varying image resolutions.
3.2.2. Fine-Tuning Performance Comparison of ViT and Various MLLMs
In this study, the Qwen2.5-VL and InternVL2-2B-hf multimodal large language models, along with the ViT model, were fine-tuned under few-shot learning conditions to enhance their capabilities in identifying and classifying soybean leaf diseases. While the ViT model served as a baseline and underwent standard fine-tuning, the multimodal large language models were trained using the progressive fine-tuning strategy proposed in this research.
Figure 6 illustrates the comparative results among these models, including overall accuracy and macro-averaged metrics. These results provide a systematic assessment of the scalability and effectiveness of different models in soybean leaf disease detection and classification.
In comparative experiments under multi-scale input resolutions, Qwen2.5-VL consistently outperformed both ViT and InternVL2-2B-hf across all evaluation metrics. Performance improved for all models with increasing resolution, highlighting the importance of high-dimensional visual features in complex soybean disease classification. Specifically, ViT accuracy increased from 88.87% to 90.99%, while Qwen2.5-VL improved from 92.03% to 95.35% (), remaining superior across all resolutions. At 712 px, Qwen2.5-VL achieved an F1-score of 95.50%, exceeding that of InternVL2-2B-hf (94.64%). These results indicate that, under few-shot learning, Qwen2.5-VL exploits high-resolution visual information more effectively, leading to improved multi-class disease recognition.
3.3. Zero-Shot Classification and Fine-Tuned Performance
To evaluate the effectiveness of the fine-tuning strategies, we compared the large model without any fine-tuning (zero-shot), a model fine-tuned using Qwen’s default hyperparameters, and the proposed hyperparameter transfer fine-tuning approach. The results, including overall Accuracy and macro-averaged metrics, are presented in Table 7. A t-test was also performed to assess the statistical significance of the observed differences.
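The text does not state which variant of the t-test was applied. As one possibility, the following sketch computes Welch's two-sample t statistic and its approximate degrees of freedom using only the Python standard library. The choice of Welch's variant and the function name `welch_t` are assumptions, and the input values are arbitrary numbers used for illustration, not results from this study.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    `a` and `b` are sequences of per-run scores (for example, accuracies
    from repeated runs); each needs at least two values and the two
    groups must not both have zero variance.
    """
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Arbitrary illustrative values (NOT measurements from this study).
group_a = [2, 3, 4, 5, 6]
group_b = [1, 2, 3, 4, 5]
t_stat, dof = welch_t(group_a, group_b)
print(t_stat, dof)  # 1.0 8.0
```

The p-value would then be read from a t distribution with `dof` degrees of freedom. If the compared runs share the same data splits or seeds, a paired test may be more appropriate.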
Across all image resolutions, the initial stage of progressive fine-tuning (Phase-1) outperformed both the zero-shot model and the fine-tuning baseline based on Qwen’s default configurations. The zero-shot model attained Accuracy rates between 69.75% and 73.25% () across the tested resolutions. Notably, at the 712 px resolution, an Accuracy of 90.67% was obtained via Qwen-default fine-tuning; however, this was surpassed by the progressive fine-tuning strategy, which reached 92.96% (). This outcome reflects a steady increase in Accuracy as the input resolution rises, highlighting the significant optimization achieved through the progressive strategy. Furthermore, Precision, Recall, and F1-score all exceeded 85%, demonstrating a substantial enhancement in the model’s discriminative capability following initial task adaptation.
These trends were further corroborated by confusion matrix analysis. As illustrated in Figure 7, the zero-shot model exhibited notable confusion among multiple classes, particularly between cercospora leaf blight and rust, while downy mildew was recognized with relatively lower Accuracy. After the initial stage of progressive fine-tuning, the diagonal elements of the confusion matrix were markedly strengthened, and off-diagonal misclassifications were substantially reduced, indicating clearer boundaries between classes.
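For reference, the relationship between a confusion matrix and the reported metrics can be sketched as below. This is an illustrative implementation, not the evaluation code of this study. It assumes rows are true classes and columns are predicted classes, and that the macro F1-score is the mean of the per-class F1 values. The label `fourth_class` is a placeholder, since only three of the four disease classes are named in this section, and the example predictions are made up.

```python
def confusion_matrix(y_true, y_pred, classes):
    """Count matrix with rows = true class, columns = predicted class."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def macro_metrics(matrix):
    """Overall accuracy plus macro-averaged precision, recall and F1."""
    k = len(matrix)
    total = sum(sum(row) for row in matrix)
    precisions, recalls, f1s = [], [], []
    for i in range(k):
        tp = matrix[i][i]                              # diagonal element
        predicted = sum(matrix[r][i] for r in range(k))  # column sum
        actual = sum(matrix[i])                          # row sum
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
    return {
        "accuracy": sum(matrix[i][i] for i in range(k)) / total if total else 0.0,
        "precision": sum(precisions) / k,
        "recall": sum(recalls) / k,
        "f1": sum(f1s) / k,
    }


# Made-up labels for illustration; "fourth_class" is a placeholder name.
classes = ["cercospora_leaf_blight", "rust", "downy_mildew", "fourth_class"]
y_true = ["rust", "rust", "downy_mildew", "cercospora_leaf_blight", "fourth_class"]
y_pred = ["rust", "cercospora_leaf_blight", "downy_mildew",
          "cercospora_leaf_blight", "fourth_class"]
print(macro_metrics(confusion_matrix(y_true, y_pred, classes)))
```

Strengthened diagonal elements raise every per-class recall, while fewer off-diagonal counts in a column raise that class's precision, which is why both macro-averaged metrics improve together after fine-tuning.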
To provide readers with a more intuitive understanding of the model’s classification performance across different stages, Figure 8 presents visual examples of correctly and incorrectly classified leaves for each stage (Zero-shot and Fine-tuned). This allows readers to directly observe how the model’s predictions improve following progressive fine-tuning.
The observed performance improvement is mainly due to the effectiveness of progressive fine-tuning in task adaptation and feature transfer. Qwen-default fine-tuning may not fully capture the complex visual features of crop diseases. By introducing task-oriented parameter updates during the initial training stage, the model learns discriminative features relevant to the target task while preserving general representations. Progressive fine-tuning ensures better convergence and greater inter-class separability, whereas zero-shot models rely solely on general features, making them more prone to confusion. Significant performance gains are achieved early in the progressive fine-tuning process.
3.4. Soybean Disease Model Generalization on Multi-Source Data
To evaluate the generalization capability of the proposed soybean disease recognition model, supplementary experiments were conducted using a multi-source dataset. This dataset consists of 400 images collected from PlantVillage and local fields, covering four major types of soybean diseases. Although the soybean varieties in the test images are consistent with those in the ASDID dataset, significant differences exist in illumination conditions, image resolution, and complex backgrounds. This setup is designed to rigorously assess the model’s robustness in cross-domain scenarios and to verify whether it can extract disease lesion features that generalize beyond specific dataset environments. The results, shown in Table 8, include overall Accuracy and the macro-averaged metrics.
The fine-tuned large model demonstrated strong generalization capability on novel data sources. Although field-acquired images differed significantly from the ASDID training set in terms of natural lighting, shooting angles, and background complexity, the model achieved a classification accuracy of 91.00% at a resolution of 224 px and further improved to 95.91% at 712 px. These results show that the ViT-assisted hyperparameter combination effectively enhanced the model’s feature learning during fine-tuning, enabling it to handle heterogeneous real-world image data and perform well across different scenarios.
3.5. Cross-Resolution Classification and Evaluation
To further evaluate the model’s generalization performance across different image resolutions, a cross-resolution classification experiment was designed. The model was trained on high-resolution () images and tested on medium-resolution () images, and vice versa, enabling the assessment of its adaptability when the training and testing resolutions are inconsistent. The results, shown in Table 9, include overall accuracy and the macro-averaged metrics.
The experimental results indicate that when the model was trained on high-resolution images and tested on medium-resolution images, it achieved an Accuracy of 91.00% and an F1-score of 91.42%, outperforming the reverse scenario where the model was trained on medium-resolution images and tested on high-resolution images (Accuracy: 88.54%, F1-score: 89.32%). This demonstrates that models trained on high-resolution data maintain strong recognition and classification capabilities even when applied to downsampled inputs. The observed performance difference can be attributed to the richer textural details and spatial features available in high-resolution images, which allow the model to learn more robust and globally consistent representations during training. In contrast, training on medium-resolution images results in loss of feature information, limiting the model’s ability to capture fine-grained disease patterns and thereby reducing its generalization capability when predicting on high-resolution images.
4. Discussion
4.1. Effectiveness of the Proposed Adaptation Strategies
This study provides a comprehensive assessment of the capability of the multimodal large language model Qwen2.5-VL to recognize soybean leaf diseases under limited data conditions, with a particular focus on the effectiveness of cross-architecture hyperparameter transfer and progressive fine-tuning strategies. The experimental results demonstrate that both approaches significantly enhance the model’s adaptation efficiency, training stability, and generalization performance in downstream agricultural vision tasks.
With respect to the choice of auxiliary models, the Vision Transformer (ViT) exhibited a noticeably stronger hyperparameter transfer advantage than ResNet-50. Under identical training configurations, the model tuned with hyperparameters derived from ViT achieved an Accuracy improvement of 9.16% and showed more stable Recall performance. This finding highlights the importance of architectural consistency in hyperparameter transfer, as ViT and Qwen2.5-VL share the Transformer-based global attention mechanism, enabling more effective parameter migration. In contrast, the convolutional inductive bias of ResNet-50 leads to structural mismatches, thereby weakening its transfer capability.
The progressive fine-tuning strategy also delivered substantial improvements under small-sample conditions. Compared with Parameter-Efficient Fine-Tuning, the progressive approach consistently achieved higher Accuracy across all three input resolutions. The improvement was particularly evident in the low-resolution setting, where progressive fine-tuning achieved 92.03% Accuracy, surpassing the 84.50% () obtained through Parameter-Efficient Fine-Tuning. Confusion matrix analysis further confirmed that stage-wise training enhances category separability, markedly reducing misclassifications, especially for disease categories that were frequently confused in the zero-shot phase, such as rust and cercospora leaf blight. After progressive fine-tuning, these errors were nearly eliminated, indicating that the strategy enables the model to build more stable and discriminative decision boundaries.
Zero-shot performance remained limited, with Accuracy around 71%. However, after progressive fine-tuning, Accuracy across all resolutions increased to 85.69–95.35%, demonstrating that even minimal task-oriented tuning significantly strengthens the model’s ability to interpret fine-grained lesion patterns.
The cross-resolution experiments revealed an asymmetric generalization pattern. The model trained with high-resolution images achieved 91.00% Accuracy on medium-resolution inputs, outperforming the reverse direction (88.54%). This suggests that training with high-resolution data yields more robust semantic representations with better downward compatibility, while models trained at lower resolution tend to overfit coarse patterns and cannot effectively process detailed high-resolution images. This finding indicates that maintaining a single high-resolution model may suffice for deployment across devices with varying input resolutions.
4.2. Limitations and Future Directions
Despite the promising outcomes, several limitations should be acknowledged. First, the dataset primarily consists of publicly collected images captured under relatively controlled conditions. The model’s robustness in real field environments, with variable lighting, complex backgrounds, and diverse leaf orientations, remains to be thoroughly evaluated. Second, the progressive fine-tuning process employs a fixed sample-increment schedule and does not adapt dynamically to model convergence behavior or sample difficulty, which may limit the efficiency of the strategy. Third, prompt engineering in this study relies on manually crafted templates, and thus does not fully exploit multimodal models’ strengths in semantic understanding and dynamic instruction generation. Furthermore, the present work focuses exclusively on visual information and does not incorporate auxiliary data such as meteorological, soil, or temporal features, leaving room for broader multimodal fusion in future research.
In summary, the proposed framework for cross-architecture hyperparameter transfer and progressive fine-tuning provides a practical and effective solution for adapting multimodal large models to plant disease recognition tasks. Future research may explore validation in real field scenarios, develop adaptive progressive training mechanisms, automate prompt generation, and integrate heterogeneous data sources to enhance the model’s robustness, generalization capability, and decision-making intelligence.