1. Introduction
Sustainable agriculture depends on precise and energy-efficient crop stress detection strategies. Barley (Hordeum vulgare L.) [1] is one of the world’s most widely cultivated cereal crops and is important for food, feed, and industrial applications [2]. However, its productivity is highly sensitive to drought stress, particularly in semi-arid and Mediterranean regions where rainfall variability has intensified due to climate change [3]. Early and accurate detection of drought effects is essential to prevent yield losses and improve irrigation decisions [4].
Traditional drought stress monitoring approaches, including visual inspection and manual physiological measurements, are labor-intensive and offer limited scalability [5]. Early work in automated plant stress detection relied on classical machine learning approaches with hand-crafted features. Support vector machines (SVMs) with color and texture features achieved moderate accuracy for disease classification but required extensive feature engineering and struggled with generalization [6]. These conventional methods cannot supply the real-time data needed for large-scale decision support in precision agriculture.
The advent of deep learning revolutionized plant stress detection, with convolutional neural networks (CNNs) demonstrating superior performance in capturing morphological stress cues. Deep CNNs trained on large-scale datasets such as PlantVillage have achieved accuracies of 99% for disease classification across multiple crop species [7]. However, these models often suffered from domain shift when tested on real field conditions, highlighting critical generalization challenges.
A variety of deep learning architectures have been evaluated for plant disease classification [8]. These include CNNs, fine-tuned pretrained networks such as AlexNet, VGGNet, GoogLeNet, ResNet variants, DenseNet, Inception models, and hybrid approaches combining CNN-based feature extraction with traditional classifiers (SVM or BPNN). Reported accuracies vary across crops and datasets, from 76% to 99.75%, which highlights the influence of dataset characteristics, crop type, and model configuration. While deeper models generally improve performance, they can also increase computational complexity and training requirements.
To address computational constraints, lightweight architectures based on MobileNetV2 have demonstrated feasibility for efficient plant disease detection, achieving over 98% accuracy with significantly fewer parameters [9]. EfficientNet-based models have shown strong performance due to their compound scaling strategy, which enables improved accuracy with balanced depth–width–resolution optimization [10].
Despite these advances, image-only approaches remain limited. Visual symptoms [11] often appear after physiological stress has already occurred, reducing their effectiveness for early drought detection. Furthermore, their sensitivity to illumination changes, camera position, and background noise [12] can degrade performance in real-world agricultural settings. These studies focus on visual information and fail to incorporate environmental context, which is critical for understanding the physiological mechanisms underlying stress responses.
Parallel to computer vision approaches, extensive research has developed sensor-based plant monitoring systems. Infra-red thermometry pioneered the use of canopy temperature measurements for crop water stress assessment and led to the development of indices such as the Crop Water Stress Index (CWSI) [13]. This approach has been widely adopted; however, it requires precise calibration and is limited in addressing spatial heterogeneity [14]. Recent advances in the Internet of Things (IoT) enable automated and real-time crop monitoring by coupling RGB imaging with environmental sensing. IoT-based systems integrate multiple sensor types including soil moisture, temperature, humidity, and light intensity [15] to provide comprehensive environmental monitoring. Wireless sensor networks combined with machine learning techniques have been developed for irrigation management, using environmental data such as soil moisture and weather parameters. These approaches achieve prediction accuracies approaching 90% in controlled environments [16]. Fuzzy logic systems integrating soil and atmospheric sensors have also been implemented for automated irrigation control [17]. These systems monitor environmental and microclimatic conditions such as soil moisture, temperature, and humidity, which enables optimized irrigation management [18].
However, sensor-based monitoring systems that rely only on environmental parameters (soil moisture, temperature, humidity) are limited in their ability to directly characterize plant phenotypic responses. They cannot capture the spatial distribution patterns or morphological manifestations of stress at the canopy scale, which require imaging-based observation [19].
To overcome unimodal limitations, the integration of imaging and sensor data for plant monitoring has become a promising research direction. Agricultural studies have combined remote sensing imagery with ground-based sensor networks. Combining UAV images and weather data improves crop yield prediction with machine learning models [20]. However, such approaches use simple feature concatenation and do not learn interactions between different modalities using attention mechanisms. This limits their ability to adapt to different environments and field conditions. In medical imaging, cross-modal attention has been successful in combining modalities such as MRI and PET [21] by learning strong inter-modal relationships. This shows the potential of such methods for agricultural multimodal fusion.
Recent studies explore similar approaches in agriculture by fusing RGB and multispectral drone imagery with IoT-based environmental sensor data (temperature, humidity, soil moisture) for plant disease detection, achieving strong classification performance [22]. Recent multimodal architectures use attention mechanisms or transformer-based frameworks to encode cross-modal correlations by weighting visual features based on contextual sensor inputs [23]. However, many existing multimodal approaches use resource-intensive architectures that require GPUs during training and inference, which limits real-time deployment on mobile or edge devices in agricultural environments.
Despite improvements in fusion with attention-based models, achieving both effective cross-modal learning and efficient edge deployment remains difficult in agricultural systems. This is due to the strict constraints of edge devices and IoT platforms in terms of memory, processing power, and energy consumption. Recent surveys on Edge AI for crop disease detection confirm that deploying deep learning models on such resource-constrained devices remains challenging [24]. To address this issue, model compression techniques such as pruning and quantization have been explored to reduce model size and computational complexity, often with minimal impact on accuracy [25,26]. Recent works [27,28] have focused on efficient plant disease detection through compact CNNs and knowledge distillation, achieving high accuracy with models optimized for mobile deployment. In this context, TensorFlow Lite and INT8 quantization reduce model size and inference time, enabling deployment on embedded systems. However, the impact of these techniques on multimodal agricultural models remains under-investigated, and most optimization studies focus on unimodal models. In addition, few works systematically analyze training efficiency, CPU utilization, and memory consumption alongside predictive performance.
Despite advances in image-based stress detection, sensor monitoring, and model compression, their integration for practical agricultural deployment remains limited. In this context, existing multimodal approaches either rely on simple fusion strategies that fail to learn cross-modal interactions or use complex architectures that require GPU acceleration, which is incompatible with agricultural edge deployment.
In addition to these limitations, most studies validate models on single datasets, which limits the evaluation of their generalization performance. For instance, binary classification using large-scale datasets such as PlantVillage has been explored to reduce class complexity and improve model performance [29]. However, such datasets are rarely used as external validation for models trained on independent greenhouse or field data. Moreover, few multimodal studies consider cross-dataset transfer while also reducing computational cost for CPU-based deployment.
The key contributions of this work are:
A novel lightweight multimodal architecture that combines EfficientNetV2-S visual features with sensor data through cross-modal attention and gated fusion modules.
A systematic ablation study over 10 random seeds with one-way ANOVA (F(4,36) = 4.44, p = 0.005) indicating that the ranking of architectural configurations is stable across seeds; multimodal fusion improves accuracy by 0.9 pp over image-only (98.3% vs. 97.4%), while a sensor-only MLP achieves only 73.8%, confirming the complementary nature of both modalities.
A systematic CPU optimization strategy that reduces training time by 68.5% and achieves 90% model size reduction through INT8 quantization, enabling practical edge deployment.
A robust validation protocol on a custom greenhouse barley dataset (98.3 ± 1.5% mean accuracy over 10 seeds) and external validation on the PlantVillage dataset (99.97% accuracy), demonstrating both high performance and strong generalization ability.
2. Materials and Methods
2.1. System Description and Data Acquisition
Our system (Figure 1) combines visual and environmental information to enable reliable detection of drought stress in barley plants. The experimental setup consists of a Canon 2000D RGB camera (Canon Inc., Tokyo, Japan) used to capture high-resolution images of barley leaves, and an ESP32 microcontroller (Espressif Systems, Shanghai, China) connected to DHT11 sensors (AOSONG Electronics Co., Ltd., Guangzhou, China) to record temperature and relative humidity near the plant canopy, which serve as microclimatic proxy indicators of plant response to drought stress.
Data collection was carried out over a three-month period to cover a complete drought cycle. While several hundred images were initially collected, only 379 representative samples were retained after quality selection. Each image was paired with its corresponding temperature and humidity values. To ensure data reliability, blurred, underexposed and redundant captures were excluded.
Each image was annotated using a multimodal labeling protocol combining visual inspection and synchronized environmental measurements. The annotation relied on two complementary sources of information.
First, visual assessment was performed based on observable morphological symptoms commonly associated with plant water deficit, including leaf wilting, chlorosis (yellowing), discoloration, and loss of turgor. Second, environmental context was incorporated using temperature and relative humidity measurements collected concurrently with image acquisition using a DHT11 sensor positioned in close proximity to the plant canopy.
A sample was labeled as “stressed” when visible symptoms consistent with drought stress were observed in conjunction with environmental conditions indicative of water deficit, typically characterized by elevated temperature and reduced relative humidity.
Conversely, samples were labeled as “normal” when plants appeared visually healthy and environmental measurements remained within non-stress ranges. Temperature and humidity measurements represent local microclimatic conditions around the plant canopy, as opposed to direct physiological variables such as stomatal conductance or leaf water potential.
The annotation framework integrates morphological leaf observations with canopy microclimate measurements in a proxy-based labeling strategy, where labels are assigned based on the combined analysis of visual symptoms and concurrent canopy microclimate conditions.
Both image and sensor data were collected and transmitted in real time via Wi-Fi to a central workstation to enable remote monitoring.
To detect stress patterns, a Hybrid EfficientNet–Transformer Sensor Fusion multimodal model was applied to the collected images and sensor data.
2.2. Data Preprocessing Pipeline
A comprehensive preprocessing pipeline was implemented to prepare raw data for model training. It includes image preprocessing, data augmentation, sensor data preprocessing, dataset partitioning, class balancing, and pipeline optimization.
2.2.1. Image Preprocessing
The images captured by the Canon camera were in RGB format. Their dimensions ranged from 2068 to 6000 pixels in width and from 1333 to 4000 pixels in height. Each image had a color depth of 24 bits per pixel.
All RGB images were preprocessed to meet the input requirements of the EfficientNetV2-S architecture. Images were resized to 224 × 224 pixels to maintain a uniform spatial dimension within the dataset. Pixel intensity values were normalized through division by 255.0. This transforms the original [0, 255] range to a [0, 1] normalized scale. This normalization facilitates gradient-based optimization during training and ensures numerical stability. Additionally, all images were converted to 32-bit floating-point format to preserve precision within the computational pipeline.
2.2.2. Data Augmentation
To improve model generalization and reduce overfitting [30], given the limited dataset size (n = 379), we applied geometric and photometric data augmentation during training. Geometric transformations [31] included random horizontal flips (p = 0.5), random rotations (±15°), random shifts (±10% horizontal and vertical), and random zoom (0.9–1.1×). Photometric augmentations [32] included random brightness adjustment (±20%), random contrast adjustment (±20%), and random saturation adjustment (±15%).
Augmentation parameters were chosen to preserve biologically meaningful plant morphology while increasing the diversity of training samples.
Augmentations were applied on-the-fly during training using TensorFlow’s (version 2.20.0; Google LLC, Mountain View, CA, USA) tf.data API with parallel processing (num_parallel_calls = tf.data.AUTOTUNE). This approach avoids storing augmented images on disk and reduces storage requirements. Augmentations were applied only to the training set, while validation data was not augmented to ensure unbiased performance evaluation. Sensor measurements were not augmented, as they represent ground-truth environmental conditions.
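As an illustration of the on-the-fly augmentation described above, the following minimal sketch applies a random horizontal flip and brightness and contrast jitter to a single normalized image. It uses NumPy as a stand-in for the tf.data pipeline and covers only a subset of the listed transformations. The function name `augment`, the multiplicative interpretation of the ±20% ranges, and the synthetic input image are assumptions made for illustration.

```python
import numpy as np


def augment(image, rng):
    """Randomly flip and jitter an H x W x 3 image with values in [0, 1]."""
    out = image.astype(np.float32)
    # Random horizontal flip with probability 0.5.
    if rng.random() < 0.5:
        out = out[:, ::-1, :]
    # Brightness: scale intensities by a factor in [0.8, 1.2] (assumed multiplicative).
    out = out * float(rng.uniform(0.8, 1.2))
    # Contrast: stretch or compress around the image mean by a factor in [0.8, 1.2].
    mean = out.mean()
    out = (out - mean) * float(rng.uniform(0.8, 1.2)) + mean
    # Keep the result inside the normalized range expected by the model.
    return np.clip(out, 0.0, 1.0).astype(np.float32)


rng = np.random.default_rng(42)
image = rng.random((224, 224, 3), dtype=np.float32)  # synthetic stand-in image
augmented = augment(image, rng)
```

Because the transformation is drawn anew each time a sample is requested, no augmented copies need to be stored on disk.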
2.2.3. Sensor Data Preprocessing
Sensor measurements (temperature and humidity) were preprocessed using standardized normalization techniques.
A Z-score standardization [33] was applied using StandardScaler [34] from the scikit-learn library (version 1.7.2). This transformation adjusts each feature to have zero mean and unit variance. The aim of using such normalization is to ensure that sensor features contribute proportionally to model learning.
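The standardization step can be sketched in plain NumPy as follows. The sketch mirrors the z-score transform (population standard deviation, statistics estimated on training data only); the helper names and the temperature and humidity readings are hypothetical values used for illustration, not measurements from the dataset.

```python
import numpy as np


def fit_standardizer(train):
    """Estimate per-feature mean and standard deviation on training data only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)  # population standard deviation (ddof=0)
    std = np.where(std == 0.0, 1.0, std)  # guard against constant features
    return mean, std


def standardize(x, mean, std):
    """Apply z = (x - mean) / std with previously fitted statistics."""
    return (x - mean) / std


# Hypothetical (temperature in °C, relative humidity in %) readings.
train = np.array([[24.0, 60.0], [28.0, 48.0], [31.0, 40.0], [26.0, 55.0]])
test = np.array([[29.0, 45.0]])

mean, std = fit_standardizer(train)
train_z = standardize(train, mean, std)
test_z = standardize(test, mean, std)  # reuses the training statistics
```

Reusing the training statistics for the test samples keeps information about the held-out data from entering the transformation.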
2.2.4. Dataset Partitioning
The complete dataset (n = 379) was partitioned into training (80%, n = 303) and test (20%, n = 76) sets using stratified random sampling to maintain class balance across splits. The test set was reserved for final evaluation.
In k-fold cross-validation experiments, the training set was further split into five folds for hyperparameter tuning, while the test set was excluded from cross-validation [35] to preserve an independent evaluation protocol.
All ablation results reported in this paper use an 80/20 fixed split evaluated over 10 random seeds. All preprocessing steps, including image normalization and sensor standardization, were fitted exclusively on the training data and applied unchanged to test data to prevent data leakage.
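A stratified 80/20 partition of this kind can be sketched with the standard library alone. The function `stratified_split` is a hypothetical helper written for illustration and is not the routine used in the study; the label list simply reproduces the reported class counts (190 normal, 189 stressed).

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) so that each class keeps its proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Reserve the same fraction of every class for the test set.
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


labels = [0] * 190 + [1] * 189  # 0 = normal, 1 = stressed
train_idx, test_idx = stratified_split(labels)
```

Any preprocessing statistics would then be estimated on `train_idx` only and applied unchanged to `test_idx`.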
2.2.5. Class Balancing Strategy
Although the dataset is approximately balanced (50.1% healthy, 49.9% stressed), minor deviations from a perfect 50/50 distribution naturally arise due to real-world sampling variability. To ensure robustness to stochastic mini-batch sampling and to guarantee equal contribution of both classes during optimization, class weighting [36] was applied during training.
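Class weights were computed using Equation (1):
$$w_0 = \frac{N}{2 N_0} \quad \text{and} \quad w_1 = \frac{N}{2 N_1}, \tag{1}$$
where $N$ is the total number of samples and $N_0$ and $N_1$ are the numbers of normal and stressed samples, respectively.
This resulted in approximately equal weights for both classes ($w_0 \approx w_1$), consistent with the near-balanced nature of the dataset.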
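A minimal sketch of the class-weight computation is given below, assuming the common inverse-frequency ("balanced") scheme in which each class weight equals the number of samples divided by the product of the number of classes and the class count. The choice of scheme and the use of the full-dataset class counts are assumptions made for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


labels = [0] * 190 + [1] * 189  # 0 = normal, 1 = stressed
weights = balanced_class_weights(labels)
```

With a nearly balanced dataset both weights stay close to one, so the weighting acts only as a small correction.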
2.2.6. Data Pipeline Optimization
To maximize training efficiency on CPU hardware, we built an optimized data pipeline using TensorFlow’s tf.data API [37] with key enhancements: prefetching to load batches in parallel with training, multi-threaded data loading to leverage multiple CPU cores, in-memory caching to avoid redundant preprocessing after the first epoch, and efficient batch shuffling to ensure good sample mixing.
Profiling showed these optimizations reduced data loading time from 42% to just 8% of total training time, effectively eliminating the pipeline bottleneck. We also enabled XLA (Accelerated Linear Algebra) compilation [38] to optimize computational graph execution, reducing overhead and accelerating training iterations. These strategies allowed the model to fully utilize available CPU resources without idle periods waiting for data. All experiments were implemented in Python (version 3.10.19; Python Software Foundation, Wilmington, DE, USA) using the following libraries: NumPy (version 2.2.6), pandas (version 2.3.3), Matplotlib (version 3.10.8), seaborn (version 0.13.2), SciPy (version 1.15.3), statsmodels (version 0.14.6), and psutil (version 7.0.0).
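The idea behind prefetching, namely preparing upcoming batches in the background while the current batch is being consumed, can be illustrated with a bounded queue and a producer thread. This is a conceptual standard-library sketch and not the tf.data implementation; `prefetch` and `load_batches` are hypothetical names, and the integer "samples" stand in for image and sensor pairs.

```python
import queue
import threading


def prefetch(batches, buffer_size=2):
    """Yield items from `batches` while a background thread loads the next ones."""
    buffer = queue.Queue(maxsize=buffer_size)
    sentinel = object()

    def producer():
        for batch in batches:
            buffer.put(batch)  # blocks when the buffer is full
        buffer.put(sentinel)   # signals the end of the data

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is sentinel:
            return
        yield item


def load_batches(n_batches, batch_size=16):
    """Stand-in loader that yields batches of sample indices."""
    for start in range(0, n_batches * batch_size, batch_size):
        yield list(range(start, start + batch_size))


consumed = list(prefetch(load_batches(5)))
```

The bounded buffer keeps memory use fixed while allowing loading and computation to overlap.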
2.3. Proposed Model Architecture
The proposed multimodal architecture integrates visual and sensor modalities through a carefully designed fusion framework.
Figure 2 illustrates the complete architecture. It is composed of five main components: (1) visual feature extraction using EfficientNetV2-S, (2) sensor encoding branch, (3) cross-modal attention mechanism, (4) gated multimodal fusion module, and (5) classification head.
2.3.1. Visual Feature Extraction (EfficientNetV2-S)
The visual branch uses EfficientNetV2-S as the backbone for RGB image feature extraction. EfficientNetV2 improves training efficiency and accuracy through fused MBConv blocks, progressive learning, and optimized architecture search. We selected the S variant (21.5 M parameters) due to its superior accuracy–efficiency trade-off under CPU-constrained deployment, outperforming larger variants and alternative architectures in preliminary experiments.
The network is initialized with ImageNet-pretrained weights and fine-tuned on the barley dataset. Input images $I \in \mathbb{R}^{224 \times 224 \times 3}$ are processed through successive convolutional blocks with progressive spatial reduction and channel expansion. This produces high-level visual features that encode barley-specific morphology, texture, and color variations. A Global Average Pooling (GAP) layer is applied to the final convolutional feature maps, yielding a compact visual embedding $v_{img} \in \mathbb{R}^{1280}$, which is further refined using a fully connected projection layer with nonlinear activation.
To balance adaptation and generalization, early layers are partially frozen while deeper layers are fine-tuned using a reduced learning rate. This enables effective transfer learning and preserves pretrained representations. This strategy demonstrated strong performance in ablation studies.
2.3.2. Sensor Encoding Branch
The sensor branch processes temperature and humidity values through a multilayer perceptron (MLP) [39]. The encoding branch consists of two dense layers with nonlinear activation functions, combined with Batch Normalization to stabilize gradient propagation and improve generalization. The transformation can be expressed as [40]:
$$v_{sens} = \mathrm{MLP}(s)$$
where $s \in \mathbb{R}^{2}$ represents the raw sensor vector (temperature and humidity), and $v_{sens}$ denotes the learned physiological embedding.
2.3.3. Cross-Modal Attention Mechanism
To enable adaptive feature weighting and cross-modal information exchange, we implement a cross-modal attention mechanism [41,42] that allows each modality to attend to relevant features from the other modality.
The cross-attention module aligns visual and sensor representations. In this mechanism, the image embeddings are used as queries and the sensor embeddings as keys and values. This enables the model to modulate visual feature importance conditioned on sensor data.
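The attention computation is formulated as:
$$Q = W_Q\, v_{img}, \quad K = W_K\, v_{sens}, \quad V = W_V\, v_{sens}$$
$$v_{attn} = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$
where $W_Q$, $W_K$, $W_V$ denote learnable projection matrices and $d_k$ is the key dimension.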
This attention mechanism links temperature–humidity indicators with observable leaf stress patterns.
The output adjusts visual focus according to sensor data.
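A minimal NumPy sketch of scaled dot-product cross-attention, with image features as queries and sensor features as keys and values, is given below. It assumes that each embedding has been reshaped into a short sequence of tokens (4 image tokens, 2 sensor tokens); the token counts, dimensions, and random projection matrices are illustrative assumptions and not the values used in the model.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(img_tokens, sensor_tokens, w_q, w_k, w_v):
    """Image tokens query the sensor tokens (keys and values)."""
    q = img_tokens @ w_q      # (n_img, d_k)
    k = sensor_tokens @ w_k   # (n_sens, d_k)
    v = sensor_tokens @ w_v   # (n_sens, d_k)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1 over sensor tokens
    return weights @ v, weights


rng = np.random.default_rng(0)
n_img, n_sens, d_model, d_k = 4, 2, 32, 16  # assumed sizes
img_tokens = rng.normal(size=(n_img, d_model))
sensor_tokens = rng.normal(size=(n_sens, d_model))
w_q, w_k, w_v = (
    rng.normal(scale=d_model ** -0.5, size=(d_model, d_k)) for _ in range(3)
)

attended, attn_weights = cross_attention(img_tokens, sensor_tokens, w_q, w_k, w_v)
v_attn = attended.reshape(-1)  # flattened attention output
```

Each row of `attn_weights` describes how strongly one image token draws on each sensor token.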
2.3.4. Gated Multimodal Fusion and Classification Head
After attention alignment, cross-modal fusion [42] is implemented through three complementary integration mechanisms, as illustrated in Figure 3:
Batch-Normalization Path: The visual embedding is standardized using batch normalization to preserve numerical stability.
Multiplicative (Gating) Path: Element-wise multiplication [43] of the normalized visual and sensor embeddings serves as an adaptive gate. It amplifies or reduces image features in proportion to physiological importance.
Attention Path: The attention output $v_{attn}$ is flattened to match the other feature dimensions. It uses the sensor information to highlight image features that are important for drought stress detection.
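The outputs of these three branches are concatenated to form a unified representation [44]:
$$f_{concat} = [\, f_{BN};\; f_{gate};\; f_{attn} \,]$$
where $f_{BN}$, $f_{gate}$, and $f_{attn}$ denote the outputs of the batch-normalization, multiplicative, and attention paths, respectively.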
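To further enhance modality adaptivity, we apply learnable gates that weight the contributions of the visual ($v_{img}$) and sensor ($v_{sens}$) features, conditioned on both modalities [43]:
$$g_{img} = \sigma\!\left(W_{img}\,[v_{img}; v_{sens}] + b_{img}\right), \quad g_{sens} = \sigma\!\left(W_{sens}\,[v_{img}; v_{sens}] + b_{sens}\right)$$
where $W_{img}$, $W_{sens}$ are learnable weight matrices, $b_{img}$, $b_{sens}$ are bias vectors, and $\sigma$ is the sigmoid activation producing gates $g_{img}, g_{sens} \in (0, 1)$.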
The final fused representation combines the gated features [45] and is normalized via layer normalization [46].
This fused vector is passed through a classification head comprising dense layers with nonlinear activations, batch normalization, and a dropout rate of 0.3. The final output layer produces normalized probabilities corresponding to the predicted plant condition (Normal or Stressed). This hybrid attention-gated fusion ensures interpretability, CPU-efficient execution, and improved performance, as confirmed by ablation studies.
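The learnable gating and layer normalization steps can be sketched in NumPy as follows. The sketch covers only the gates and the normalization, not the three-path concatenation. It assumes that both embeddings share the same dimension, that each gate is a sigmoid of a linear map of the concatenated embeddings, and that the gated features are combined by summation; these choices, the dimension of 8, and the random weights are assumptions made for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def layer_norm(x, eps=1e-6):
    """Normalize a feature vector to zero mean and unit variance."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)


def gated_fusion(v_img, v_sens, w_img, b_img, w_sens, b_sens):
    """Weight each modality with a gate conditioned on both modalities."""
    joint = np.concatenate([v_img, v_sens])
    g_img = sigmoid(w_img @ joint + b_img)
    g_sens = sigmoid(w_sens @ joint + b_sens)
    fused = g_img * v_img + g_sens * v_sens  # assumed additive combination
    return layer_norm(fused), g_img, g_sens


rng = np.random.default_rng(0)
d = 8  # assumed shared embedding dimension
v_img, v_sens = rng.normal(size=d), rng.normal(size=d)
w_img, w_sens = rng.normal(size=(d, 2 * d)), rng.normal(size=(d, 2 * d))
b_img, b_sens = np.zeros(d), np.zeros(d)

z, g_img, g_sens = gated_fusion(v_img, v_sens, w_img, b_img, w_sens, b_sens)
```

Because each gate depends on both embeddings, the sensor context can raise or lower the weight given to individual visual features, and vice versa.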
2.4. Computational Optimization and Model Compression
Given the target deployment environment of agricultural edge devices with limited computational resources, we implemented a comprehensive optimization strategy encompassing CPU-specific training acceleration, model compression, and quantization. Along with an efficient CPU architecture, the model and training pipeline were optimized for computational efficiency on CPU-based systems. Multi-threading, AUTOTUNE [37] and TensorFlow’s XLA JIT compilation [38] optimize the data pipeline, accelerate training, and reduce resource usage. Selective backbone fine-tuning [47] and batch normalization [40] reduce unnecessary computation. These combined strategies balance predictive accuracy and efficiency and make the model compatible with real-time stress detection on resource-constrained agricultural devices.
2.4.1. CPU Optimization Strategy
Training deep neural networks on CPU hardware requires careful optimization to achieve acceptable training times. We implemented the following CPU-specific optimizations:
Threading Configuration: TensorFlow inter-op and intra-op parallelism threads were set to 6 (matching the number of physical CPU cores on our Intel Core i7 training machine (Intel Corporation, Santa Clara, CA, USA)). This enhances CPU utilization without oversubscription overhead.
Memory Layout Optimization: Model operations were configured to use NHWC (batch, height, width, channels) data layout instead of the GPU-optimized NCHW layout, as NHWC aligns with CPU cache line structure and enables better vectorization.
Mixed Precision Training: While typically associated with GPU training, we enabled TensorFlow’s experimental CPU mixed precision mode that uses AVX-512 instructions for float16 computation on supported CPUs, reducing memory bandwidth requirements.
Operator Fusion: Graph optimization passes were enabled to fuse consecutive operations, reducing memory traffic and improving instruction-level parallelism.
Batch Size Tuning: Larger batch sizes (16 vs. 4–8 typical for GPU training) were used to amortize per-batch overhead and improve CPU cache utilization.
These optimizations collectively reduced training time per epoch from 28.9 min to 9.1 min (68.5% reduction) compared to the non-optimized baseline. This makes CPU-only training practical for agricultural applications.
2.4.2. Training Acceleration Techniques
Beyond hardware-specific optimizations, we employed several algorithmic strategies to accelerate convergence:
Transfer Learning: Initializing EfficientNetV2 with ImageNet-pretrained weights reduced the required training epochs from approximately 80 to 30 by starting from a strong feature extractor.
Learning Rate Scheduling: The annealing schedule with warm restarts [48] improved convergence speed by approximately 20% compared to fixed learning rates or simple step decay.
Early Stopping: Monitoring validation loss with patience = 10 epochs prevented unnecessary training beyond the optimal point, typically saving 15–20 epochs.
Data Pipeline Optimization: Prefetching and parallel loading eliminated data loading as a bottleneck, ensuring efficient utilization of computational resources.
Together, these techniques enabled 5-fold cross-validation (5 complete training runs) to complete in 2.55 h total on CPU hardware, making iterative model development feasible without GPU access.
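Two of the techniques listed above, learning rate annealing with warm restarts and early stopping with a patience of 10 epochs, can be sketched as follows. The sketch assumes a cosine annealing shape with a fixed cycle length; the learning rate bounds, the cycle length, and the synthetic validation losses are hypothetical values chosen for illustration.

```python
import math


def cosine_warm_restarts(epoch, lr_max=1e-3, lr_min=1e-5, cycle_length=10):
    """Cosine-annealed learning rate that restarts every `cycle_length` epochs."""
    t = epoch % cycle_length
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / cycle_length))


class EarlyStopping:
    """Stop when the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best = float("inf")
        self.wait = 0

    def should_stop(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.wait = 0
        else:
            self.wait += 1
        return self.wait >= self.patience


# Synthetic validation losses: improvement for three epochs, then a plateau.
val_losses = [0.6, 0.4, 0.3] + [0.35] * 20
stopper = EarlyStopping(patience=10)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.should_stop(loss):
        stopped_at = epoch
        break
```

In this synthetic run the best loss occurs at epoch 2, so training halts once ten further epochs pass without improvement.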
2.4.3. TensorFlow Lite Conversion
For deployment on edge devices, we converted the trained Keras model (integrated in TensorFlow 2.20.0) to TensorFlow Lite (TFLite) format, which provides a lightweight inference engine optimized for mobile and embedded platforms [49]. The conversion process involved:
Model Freezing: The trained model graph and weights were frozen into a single checkpoint file, eliminating training-specific operations (dropout, batch normalization in training mode).
Graph Optimization: TFLite converter applied optimization passes including constant folding, unused node elimination, and operation fusion to simplify the computational graph.
Operator Selection: All model operations were verified to have TFLite implementations. Custom operations (if any) would require dedicated kernel implementations.
The resulting Float32 TFLite model achieved 55% size reduction with no accuracy loss. This reduction stems from eliminating TensorFlow framework overhead and Python (version 3.10.19) serialization artifacts, not from precision reduction.
2.4.4. INT8 Quantization
Further compression was achieved through post-training INT8 quantization [49], which converts 32-bit floating-point weights and activations to 8-bit integers. Quantization introduces controlled precision loss but dramatically reduces model size and accelerates inference on hardware with integer arithmetic units (common in edge devices).
We employed post-training quantization with a representative dataset (random sample of 100 training images) to calibrate quantization parameters (scale and zero-point) for each layer:
$$q = \mathrm{round}\!\left(\frac{x}{\mathrm{scale}}\right) + \mathrm{zero\_point}$$
where $x$ is a floating-point value, $q$ is its 8-bit integer representation, and scale and zero-point are determined by observing activation ranges during calibration. This approach requires no retraining, unlike quantization-aware training.
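The affine quantization mapping and its calibration from observed value ranges can be illustrated with the NumPy sketch below. It is a simplified stand-in for the converter's internal procedure: a single tensor is calibrated from its minimum and maximum, and the function names and the synthetic activations are assumptions made for illustration.

```python
import numpy as np


def calibrate(x, qmin=-128, qmax=127):
    """Derive scale and zero-point from the observed range (extended to include 0)."""
    x_min = min(float(x.min()), 0.0)
    x_max = max(float(x.max()), 0.0)
    scale = (x_max - x_min) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0  # degenerate case: all values are zero
    zero_point = int(round(qmin - x_min / scale))
    return scale, zero_point


def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Map floats to INT8: q = round(x / scale) + zero_point, clipped to range."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.int8)


def dequantize(q, scale, zero_point):
    """Approximate reconstruction: x ≈ scale * (q - zero_point)."""
    return (q.astype(np.float32) - zero_point) * scale


rng = np.random.default_rng(0)
activations = rng.normal(size=1000).astype(np.float32)  # synthetic calibration data

scale, zero_point = calibrate(activations)
q = quantize(activations, scale, zero_point)
recovered = dequantize(q, scale, zero_point)
max_error = float(np.abs(recovered - activations).max())
```

The reconstruction error stays on the order of one quantization step, which is the controlled precision loss traded for a fourfold reduction in storage per value.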
3. Results
3.1. Performance on the Greenhouse Barley Dataset
The proposed multimodal model was evaluated on the greenhouse barley dataset comprising 379 samples. The dataset was divided using an 80/20 stratified split (random_state = 42), yielding 303 training samples and 76 test samples. The class distribution was nearly balanced, with 190 normal (50.1%) and 189 stressed (49.9%) samples. The model was trained for 60 epochs with a batch size of 16 on CPU-only hardware.
Table 1 summarises the dataset characteristics.
The confusion matrix (Figure 4) illustrates the model’s classification performance on the 76-sample test set. The matrix entries are defined as follows:
True Negative (TN = 36): correctly predicted as “normal plant” (negative class).
False Positive (FP = 1): incorrectly predicted as “stressed plant” when the actual class is “normal.”
True Positive (TP = 39): correctly predicted as “stressed plant” (positive class).
False Negative (FN = 0): incorrectly predicted as “normal plant” when the actual class is “stressed.”
The model correctly classified 75 out of 76 samples, achieving an accuracy of 98.7%. The single misclassification was a healthy plant predicted as stressed (one false positive). Notably, recall reached 100%, confirming that no stressed plant was missed (no false negatives).
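The reported accuracy and recall follow directly from the four confusion-matrix entries. The short sketch below recomputes them, together with the other standard metrics derived from the same counts; the function name is a hypothetical helper.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute standard binary metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)        # sensitivity to stressed plants
    specificity = tn / (tn + fp)   # correct rejection of normal plants
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


metrics = classification_metrics(tp=39, tn=36, fp=1, fn=0)
```

With zero false negatives the recall equals 1, and the single false positive is the only term that lowers accuracy, precision, and specificity.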
Table 2 presents the full performance metrics, and the performance is further visualized in Figure 5.
Given the limited size of the greenhouse dataset (n = 379) relative to the capacity of EfficientNetV2-S (21.5 million parameters), a careful analysis of training dynamics is necessary to verify that the model learned generalizable features rather than memorising training samples.
Figure 6 presents the training and validation loss and accuracy curves over all 60 epochs.
Three key observations from these curves collectively indicate that the model does not suffer from severe overfitting:
The validation loss decreases from approximately 0.63 at epoch 1 to 0.027 at epoch 60, while the training loss stabilises around 0.062. This behaviour, where validation loss remains consistently lower than training loss after epoch 20, is contrary to the classical pattern of overfitting. It can be explained by two regularisation mechanisms applied exclusively during training: Dropout (rate = 0.3), which randomly deactivates 30% of neurons at each forward pass but is disabled during inference; and on-the-fly data augmentation (random flips, rotations, brightness, and contrast variations), which artificially increases the diversity of training samples. Since neither mechanism is applied to the validation set, the validation objective becomes structurally easier.
From approximately epoch 20 onward, both training and validation losses decrease smoothly and plateau at low values. No divergence is observed in which validation loss increases while training loss continues to decrease—this is a typical indicator of overfitting. The validation accuracy reaches 98.7% at epoch 60 (training accuracy: 98.0%), confirming stable convergence behaviour.
EfficientNetV2-S was initialised with ImageNet-pretrained weights. Only the last 40 layers were fine-tuned, while all earlier layers were frozen. Consequently, the number of trainable parameters learned from the 379 greenhouse samples represents only a small fraction of the 21.5 million total parameters. The frozen backbone provides generic visual representations (edges, textures, colour gradients, and shapes) that transfer effectively to plant stress detection, requiring only the final layers to adapt to barley-specific morphological patterns.
Taken together, pretrained transfer learning, data augmentation, and dropout regularisation form a robust strategy against overfitting in the small-data regime. The training curves confirm the effectiveness of these mechanisms: the model exhibits stable convergence without signs of memorisation, achieving a final validation accuracy of 98.7% with 100% recall on the 76-sample validation set.
3.2. K-Fold Cross-Validation Results
To provide a robust estimate of generalisation performance and to verify that the results are not dependent on a favourable random split, 5-fold stratified cross-validation was performed on the full greenhouse dataset (n = 379).
This protocol divides the dataset into five equal folds, trains and evaluates the model five times, each time using a different fold as the validation set and reports the mean and standard deviation across folds.
Table 3 reports comprehensive metrics for each fold, and Table 4 summarizes the training time per fold.
Figure 7, Figure 8 and Figure 9 provide visual summaries of the cross-validation behaviour.
The cross-validation results show high and relatively stable performance across all data partitions.
The mean accuracy is 97.6 ± 2.2%, with fold accuracies ranging from 94.7% to 100.0%, corresponding to a 5.3 percentage point variation. The model achieves a high mean recall of 98.5 ± 2.3%, with perfect recall in Folds 1, 2, and 5, indicating strong sensitivity in detecting stressed plants. Mean specificity is 96.8 ± 3.5%, although some variability is observed, particularly in Fold 5. The mean F1-score of 97.7 ± 2.1% confirms balanced performance across classes.
Overall, the modest standard deviations across folds (2.1 to 3.5 percentage points, depending on the metric) support the robustness of the proposed approach under greenhouse conditions.
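For reference, the per-fold metrics and their mean ± standard deviation can be derived from confusion-matrix counts as in the sketch below. The counts are hypothetical placeholders and are not the values behind Table 3; the use of the sample standard deviation (n − 1) is also an assumption, since the text does not state which form was used.

```python
from statistics import mean, stdev


def binary_metrics(tp, fp, tn, fn):
    """Accuracy, recall (sensitivity), specificity and F1 from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "recall": recall, "specificity": specificity, "f1": f1}


# Hypothetical (tp, fp, tn, fn) counts for five folds; for illustration only.
fold_counts = [(38, 1, 37, 0), (37, 2, 36, 1), (38, 0, 38, 0), (36, 1, 37, 2), (37, 3, 35, 0)]
per_fold = [binary_metrics(*counts) for counts in fold_counts]

for name in ("accuracy", "recall", "specificity", "f1"):
    values = [m[name] for m in per_fold]
    # Sample standard deviation across folds (assumed convention).
    print(f"{name}: {100 * mean(values):.1f} ± {100 * stdev(values):.1f}%")
```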
Total training time for 5-fold cross-validation (Figure 10) was 153.1 min (2.55 h), which indicates that training is feasible on CPU hardware.
3.3. Ablation Study Results
To assess the contribution of individual architectural components, we conducted systematic ablation studies comparing five model variants. Each variant was trained with ten different random seeds (42, 123, 456, 789, 2024, 2025, 2026, 2027, 2028, 2029), and Table 5 reports mean performance with standard deviations across seeds. A one-way ANOVA indicated a statistically significant effect of architectural configuration on accuracy (F(4,36) = 4.44, p = 0.005, partial η² = 0.33), suggesting that the observed differences between configurations are not explained by random seed variation alone. A Sensor-Only MLP baseline (temperature and humidity features only) is also reported to quantify the standalone contribution of the sensor branch.
Table 5 reports the performance of all ablation configurations averaged over 10 seeds.
3.3.1. Component-Wise Ablation Results
The ablation results are summarized as follows:
The fine-tuned multimodal configurations (Full Model, No-Gating, and Concat Fusion; 98.3–99.3% mean accuracy) outperform the Image-Only baseline (97.4 ± 1.8%), suggesting that sensor information provides complementary cues to visual features. The Sensor-Only MLP achieves substantially lower performance (73.8 ± 3.5% accuracy, 67.0 ± 3.5% F1-score), indicating that temperature and humidity alone are insufficient for reliable classification and that the visual branch is the dominant contributor to performance. The combination of both modalities improves accuracy, specificity, and AUC across all seeds, demonstrating the complementary nature of the two information sources.
The Full Model (98.3 ± 1.5%) reaches a slightly lower mean accuracy than No-Gating (99.1 ± 0.6%) and Concat Fusion (99.3 ± 0.7%), and no statistically significant difference is observed in pairwise comparisons across 10 seeds. The one-way ANOVA (F(4,36) = 4.44, p = 0.005) indicates that at least one configuration differs significantly from the others; however, post hoc comparisons among multimodal variants do not reach statistical significance after Bonferroni correction. This suggests that simpler fusion strategies are sufficient to capture most of the multimodal information under the current experimental conditions.
This outcome is consistent with the characteristics of the greenhouse dataset, which is relatively small (n = 379) and exhibits limited environmental variability. In such controlled settings, static fusion mechanisms such as concatenation can approximate effective feature integration, thereby reducing observable differences between fusion strategies.
In contrast, the gated cross-modal attention mechanism introduces a data-dependent and sample-specific weighting scheme that adaptively modulates the contribution of visual and sensor features according to their relative informativeness. This enables the model to handle modality imbalance, noise, or partial degradation more effectively than fixed fusion strategies.
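A sample-specific gate of this kind is often written as g = σ(W[v; s] + b), with the fused feature g ⊙ v + (1 − g) ⊙ s. The sketch below implements that common form in plain Python. It is an assumption about the general technique, not the exact module used in this work: the function name `gated_fusion`, the two-dimensional features, and the weight values are all hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(visual, sensor, gate_weights, gate_bias):
    """Fuse two equal-length feature vectors with a per-dimension gate.

    gate  = sigmoid(W @ [visual; sensor] + b)
    fused = gate * visual + (1 - gate) * sensor
    """
    joint = list(visual) + list(sensor)
    gate = [
        sigmoid(sum(w * x for w, x in zip(row, joint)) + b)
        for row, b in zip(gate_weights, gate_bias)
    ]
    fused = [g * v + (1.0 - g) * s for g, v, s in zip(gate, visual, sensor)]
    return fused, gate


# Hypothetical projected features for one sample.
visual = [0.8, -0.2]
sensor = [0.1, 0.5]

# Zero weights and bias give gate = 0.5, i.e. a plain average of the modalities.
neutral_fused, neutral_gate = gated_fusion(visual, sensor, [[0.0] * 4, [0.0] * 4], [0.0, 0.0])

# A large positive bias pushes the gate towards 1, so the visual branch dominates.
visual_fused, visual_gate = gated_fusion(visual, sensor, [[0.0] * 4, [0.0] * 4], [20.0, 20.0])

print("neutral gate:", neutral_gate, "fused:", neutral_fused)
print("visual-dominant gate:", visual_gate, "fused:", visual_fused)
```

Because the gate values are produced per sample, they can be logged alongside each prediction, which is what makes post hoc inspection of modality importance possible.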
Beyond performance, the gating mechanism also improves interpretability by explicitly exposing modality importance at the sample level, allowing post hoc analysis of how environmental and visual signals contribute to each prediction. This is particularly relevant in precision agriculture applications where understanding decision drivers is as important as prediction accuracy.
Therefore, although no statistically significant improvement is observed in this controlled dataset, the gated fusion strategy remains theoretically motivated and may generalise better to heterogeneous data, a hypothesis that is not tested here. The Full Model is retained due to its adaptive fusion capability, its interpretability, and its expected robustness in more heterogeneous and real-world conditions.
The Frozen Backbone variant (97.0 ± 2.5%) achieves the lowest mean accuracy among multimodal configurations and shows higher variability, confirming that fine-tuning the EfficientNetV2-S backbone is essential for optimal performance. This result validates the use of selective partial fine-tuning in the proposed architecture.
The fine-tuned multimodal configurations achieve standard deviations of at most ±1.5% across 10 seeds, indicating stable training behavior. The Image-Only (±1.8%) and Frozen Backbone (±2.5%) variants show slightly higher variability, suggesting that both multimodal fusion and backbone fine-tuning contribute to improved robustness. The ANOVA partial η² = 0.33 indicates a medium-to-large effect of architectural configuration on performance.
Figure 11 visualizes the comparative performance of the five ablation configurations across the six evaluation metrics using a radar chart.
3.3.2. Statistical Analysis
To quantify the effect of architectural design on model performance, a one-way ANOVA was performed on the 10-seed results across the five ablation configurations (Table 6). The analysis reveals a statistically significant effect of architectural configuration on both accuracy (F(4,36) = 4.44, p = 0.005, η² = 0.33) and F1-score (F(4,36) = 4.10, p = 0.008, η² = 0.31). These results indicate a moderate-to-large effect size, suggesting that approximately 30% of the variance in performance can be attributed to architectural differences rather than random initialization.
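The reported effect sizes can be cross-checked from the F statistics alone, using the standard identity partial η² = (F · df_effect) / (F · df_effect + df_error). The short sketch below applies it to the values quoted above; the function name is an assumption.

```python
def partial_eta_squared(f_stat, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    return (f_stat * df_effect) / (f_stat * df_effect + df_error)


for metric, f_stat in (("accuracy", 4.44), ("F1-score", 4.10)):
    eta = partial_eta_squared(f_stat, df_effect=4, df_error=36)
    print(f"{metric}: F(4,36) = {f_stat:.2f} -> partial eta^2 = {eta:.2f}")
```

Both values round to the effect sizes stated in the text (0.33 and 0.31).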
Post hoc analysis indicates that the variation is driven primarily by the Sensor-Only and Frozen Backbone configurations, while differences among the multimodal fusion variants (Full Model, Concat Fusion, and No-Gating) are not statistically significant after correction. This suggests that while multimodal learning is beneficial overall, the specific choice of fusion strategy has a limited impact under the current dataset conditions.
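To make the role of the correction concrete, the sketch below applies a Bonferroni adjustment to pairwise comparisons among the three fusion variants. The raw p-values are hypothetical placeholders chosen for illustration, not results from this study, and the function name `bonferroni` is an assumption.

```python
from itertools import combinations


def bonferroni(p_values, alpha=0.05):
    """Return (adjusted_p, significant) per test, with adjusted_p = min(p * m, 1)."""
    m = len(p_values)
    adjusted = [min(p * m, 1.0) for p in p_values]
    return [(p_adj, p_adj < alpha) for p_adj in adjusted]


variants = ["Full Model", "No-Gating", "Concat Fusion"]
pairs = list(combinations(variants, 2))

# Hypothetical raw p-values, one per pair; NOT taken from the paper.
raw_p = [0.09, 0.04, 0.60]
results = bonferroni(raw_p)

for (a, b), p, (p_adj, significant) in zip(pairs, raw_p, results):
    print(f"{a} vs {b}: raw p = {p:.2f}, adjusted p = {p_adj:.2f}, significant = {significant}")
```

In this illustration a raw p-value of 0.04 would pass an uncorrected 0.05 threshold but not the corrected one, which is the kind of situation the correction is meant to guard against.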
3.4. Baseline Comparison
To evaluate the effectiveness of the proposed approach, we compare it against classical machine learning methods and shallow deep learning architectures on the greenhouse dataset (379 samples) using an 80/20 train–test split.
The baselines include Logistic Regression, Random Forest, and a 3-layer shallow CNN. All classical baselines are evaluated using three random seeds (42, 123, 456) to ensure robustness.
In addition, AgriFusionNet [22] is re-implemented on our dataset and evaluated using its original 5-fold cross-validation protocol, in order to remain consistent with the evaluation setting reported in the original study. All models are assessed under identical preprocessing and input configurations, including image-only, sensor-only, and multimodal settings where applicable. Performance is measured using accuracy, F1-score, and AUC.
Table 7 summarizes all comparative results.
The results in Table 7 highlight two key comparative dimensions that validate the effectiveness of the proposed architecture.
First, classical baselines (Logistic Regression, Random Forest, and shallow CNN) are consistently and substantially outperformed by the proposed Full Model (98.3 ± 1.5%) across all evaluation metrics. This performance gap is systematic across all input modalities, indicating a structural limitation of shallow models rather than a training artifact.
Logistic Regression and Random Forest achieve relatively strong results in image-only and multimodal settings; however, their performance degrades significantly in sensor-only configurations. This mirrors the Sensor-Only MLP result (73.8 ± 3.5%) and suggests that temperature and humidity alone carry limited discriminative information regardless of the classifier. Logistic Regression is additionally restricted to linear decision boundaries. Random Forest can model nonlinear feature interactions, but like Logistic Regression it operates on fixed input features and lacks hierarchical representation learning.
The shallow CNN improves image feature extraction compared to classical methods, confirming the benefit of learned spatial representations. However, its limited depth restricts its ability to capture high-level semantic abstractions, resulting in suboptimal fusion compared to the proposed deep architecture. In contrast, the proposed model benefits from deep transfer learning (EfficientNetV2-S), which provides rich hierarchical feature representations pre-trained on large-scale datasets, enabling stronger generalization even under limited training data conditions.
Second, the comparison with AgriFusionNet highlights the superiority of the proposed fusion strategy over existing lightweight multimodal agricultural models. Despite being evaluated under its original 5-fold cross-validation protocol, AgriFusionNet achieves substantially lower performance (75.2% accuracy) on our dataset, indicating limited robustness under greenhouse-specific domain conditions. This performance degradation suggests that its fusion mechanism is not sufficiently adaptive to heterogeneous modality interactions or dataset shift.
In contrast, the proposed model achieves 98.3% mean accuracy. Since the ablation study shows no significant difference between gated and simpler fusion variants, this gap is more plausibly explained by the pretrained EfficientNetV2-S backbone and by the domain mismatch of AgriFusionNet than by gating alone. The gated multimodal fusion mechanism nevertheless adaptively weights image and sensor contributions depending on contextual relevance, which is intended to suppress noisy modalities while emphasizing informative signals.
Overall, these results indicate that the proposed architecture benefits from three complementary factors: strong hierarchical feature extraction via transfer learning, adaptive gated multimodal fusion, and robustness under limited-data conditions. This combination yields consistent gains over both classical baselines and a recent multimodal agricultural model.
3.5. Results on PlantVillage Dataset
To evaluate generalization to different plant species and stress types, we tested the greenhouse-trained model on the PlantVillage dataset reformulated as binary classification (healthy vs. diseased).
Table 8 describes dataset characteristics and Table 9 presents performance metrics.
The following tables present comprehensive performance metrics from the optimized agricultural stress detection model. Results are based on 5-fold cross-validation with a total test set of 2108 samples.
To assess the stability of the training process across different random initializations, the model was trained under three seeds (42, 123, and 456).
Figure 12 illustrates the training and validation loss and accuracy curves for each run.
The model achieved an accuracy of 99.97% on the PlantVillage dataset, which is noticeably higher than the performance obtained on the greenhouse dataset (98.3%). This result can be attributed to several factors:
Domain Characteristics: PlantVillage images have uniform white backgrounds and controlled lighting, which reduces visual complexity compared to greenhouse images with natural backgrounds and variable illumination.
Disease Severity: Many PlantVillage disease examples show advanced symptoms (severe lesions, extensive discoloration) that are more visually distinctive than the subtle early-stage drought stress in our greenhouse dataset.
Feature Transferability: EfficientNetV2 learns general visual patterns related to plant stress such as changes in leaf color and texture that transfer effectively across species and stress types. This improves its performance in different environments.
Binary Simplification: Combining 38 disease classes into a single ‘diseased’ category reduced task complexity.
This performance across datasets supports the model’s ability to detect different plant stress types beyond barley drought.
3.6. Cross-Dataset Performance Summary
Table 10 summarizes the model’s performance on both datasets. It demonstrates consistent results across different conditions.
Both datasets show strong performance, with the greenhouse barley dataset achieving 98.3% accuracy and the PlantVillage binary dataset reaching 99.97%. This confirms the model’s effectiveness for plant stress detection across different conditions. The performance gap reflects differences in dataset characteristics (controlled imaging conditions and more visually distinctive symptoms in PlantVillage) rather than any limitation of the model. Key observations include:
Robust Generalization: Although trained only on barley drought stress, the model achieves near-perfect classification on multi-species disease detection. This indicates that the learned features transfer effectively across different stress types.
Consistently High Recall: Recall exceeds 98% on both datasets, which is important in agriculture, where failing to detect stressed plants can have greater consequences than occasional false alarms.
Stable Performance: Cross-validation standard deviations of ±2–3% on the greenhouse dataset and below ±0.2% on PlantVillage indicate low variance across runs. This suggests that the model is robust to both training sample variability and random initialization.
3.7. Computational Efficiency Analysis
One important contribution of this work is to show that high-accuracy multimodal plant stress detection can be achieved on CPU-only hardware through systematic optimization.
Table 11, Table 12 and Table 13 quantify the impact of computational optimizations.
The optimization results demonstrate significant computational efficiency gains (Figure 13):
Training Acceleration: CPU-specific optimizations reduced the training time per epoch from 28.9 to 9.1 min (68.5% reduction), making model development practical on consumer hardware without GPU access. The drop in CPU usage (92% → 34%) indicates more efficient resource utilization, avoiding thread oversubscription and cache issues.
Memory Efficiency: RAM usage decreased by about 23% (6.4 GB → 4.9 GB) through optimized data pipelines and batch size tuning. This allows training on machines with 8 GB RAM, as used in agricultural IoT edge devices.
Model Compression: INT8 quantization achieved a 90% size reduction (150 MB → 14.9 MB) with minimal accuracy loss (97.6% → 97.3%, a drop of 0.3 percentage points). The compact model is small enough for the storage available on many embedded systems, which enables deployment on edge devices (Figure 14).
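As a rough illustration of what INT8 quantization does to a weight tensor, the sketch below uses symmetric per-tensor quantization in plain Python. The quantization scheme and toolchain actually used in this work are not specified in this section, so the scheme, the function names, and the example weights are all assumptions.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from their INT8 representation."""
    return [q * scale for q in quantized]


# Hypothetical weights for illustration.
weights = [0.5, -1.27, 0.003, 1.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("quantized:", quantized)
print(f"scale: {scale:.4f}, max reconstruction error: {max_error:.4f}")
```

Each weight is stored in one byte instead of four, and the reconstruction error per weight is bounded by half the scale, which is consistent with only a small accuracy change after quantization.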
These results suggest the proposed approach works well for deployment in real-world precision agriculture applications.
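The relative reductions follow directly from the before and after figures quoted in this section; the short sketch below recomputes them. Only the numeric pairs come from the text, and the helper name `percent_reduction` is an assumption.

```python
def percent_reduction(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


# Before/after values as quoted in the text.
reported = {
    "training time per epoch (min)": (28.9, 9.1),
    "RAM usage (GB)": (6.4, 4.9),
    "model size (MB)": (150.0, 14.9),
}

reductions = {name: percent_reduction(b, a) for name, (b, a) in reported.items()}
for name, (before, after) in reported.items():
    print(f"{name}: {before} -> {after} ({reductions[name]:.1f}% reduction)")
```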
4. Discussion
The results show that the integration of visual and environmental data improves plant stress detection. In the ablation study over 10 random seeds, combining temperature and humidity sensor data with RGB images increased mean accuracy by 0.9 percentage points (98.3 ± 1.5% vs. 97.4 ± 1.8% image-only). The sensor-only MLP achieved only 73.8 ± 3.5%, confirming that sensor data alone is insufficient and that the visual branch is the primary driver of performance.
This improvement can be attributed to the complementary information provided by visual and environmental data. RGB images reflect visible stress symptoms, such as leaf wilting and color changes. However, these symptoms generally appear only after stress has already induced internal physiological alterations in the plant. From a physiological standpoint, high temperature and low humidity increase atmospheric evaporative demand, leading to stomatal closure as an early stress response [50] that occurs before any visible color change is detectable. Environmental sensors that quantify these conditions therefore provide earlier indicators of stress than visual imagery-based approaches [51], enabling detection prior to the appearance of visible symptoms.
The integration of these two modalities enables earlier and more reliable plant stress detection.
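Atmospheric evaporative demand is commonly summarised by the vapour pressure deficit (VPD), which can be estimated from air temperature and relative humidity alone. The sketch below uses the Tetens approximation for saturation vapour pressure. VPD is not computed in this study; the function names and the example readings are assumptions, included only to show how temperature and humidity readings relate to evaporative demand.

```python
import math


def saturation_vapor_pressure_kpa(temp_c):
    """Tetens approximation of saturation vapour pressure over water, in kPa."""
    return 0.6108 * math.exp(17.27 * temp_c / (temp_c + 237.3))


def vapor_pressure_deficit_kpa(temp_c, rel_humidity_pct):
    """VPD = saturation vapour pressure minus actual vapour pressure, in kPa."""
    es = saturation_vapor_pressure_kpa(temp_c)
    return es * (1.0 - rel_humidity_pct / 100.0)


# Hypothetical greenhouse readings (temperature in °C, relative humidity in %).
mild = vapor_pressure_deficit_kpa(25.0, 60.0)
hot_dry = vapor_pressure_deficit_kpa(35.0, 30.0)

print(f"VPD at 25 °C / 60% RH: {mild:.2f} kPa")
print(f"VPD at 35 °C / 30% RH: {hot_dry:.2f} kPa")
```

Higher temperature and lower humidity both raise VPD, which matches the qualitative argument that these two readings carry information about the conditions that trigger stomatal closure.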
To better position the proposed approach within the broader context of plant stress monitoring, it is important to distinguish between direct physiological measurements and proxy-based environmental indicators. Physiological measurements such as leaf water potential and stomatal conductance are widely recognized as accurate indicators of plant water status [52]. Thermal-based indices such as the Crop Water Stress Index (CWSI) provide additional drought-related information derived from canopy temperature [53]. However, direct physiological measurements such as stomatal conductance, while accurate, are labour-intensive and unsuitable for automation [54], leading practical systems to rely on alternative sensing approaches combined with visual observations in multimodal frameworks.
The use of temperature and relative humidity as indirect indicators of plant water stress is supported by well-established physiological mechanisms. Under drought conditions, reduced water availability leads to stomatal closure, which decreases transpiration and results in increased leaf temperature. This principle forms the basis of the Crop Water Stress Index. In parallel, transpiration influences the humidity of the leaf boundary layer, which is linked to plant water status and gas exchange dynamics. Although the present study relies on low-cost sensors that measure environmental conditions rather than direct physiological variables, these measurements provide meaningful contextual information reflecting plant–environment interactions under water deficit conditions.
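For context, the empirical form of the Crop Water Stress Index normalises canopy temperature between a wet (fully transpiring) and a dry (non-transpiring) reference. The sketch below shows that calculation. Canopy temperature was not measured in this study, so the function name, the clamping to [0, 1], and the example temperatures are assumptions used purely for illustration.

```python
def crop_water_stress_index(t_canopy, t_wet, t_dry):
    """Empirical CWSI = (Tc - Twet) / (Tdry - Twet), clamped to [0, 1].

    t_wet: temperature of a fully transpiring reference surface (°C)
    t_dry: temperature of a non-transpiring reference surface (°C)
    """
    if t_dry <= t_wet:
        raise ValueError("t_dry must be greater than t_wet")
    cwsi = (t_canopy - t_wet) / (t_dry - t_wet)
    return min(max(cwsi, 0.0), 1.0)


# Hypothetical temperatures in °C.
well_watered = crop_water_stress_index(t_canopy=25.0, t_wet=24.0, t_dry=32.0)
stressed = crop_water_stress_index(t_canopy=30.0, t_wet=24.0, t_dry=32.0)

print(f"CWSI (well watered): {well_watered:.3f}")
print(f"CWSI (stressed):     {stressed:.3f}")
```

Values near 0 correspond to a freely transpiring canopy and values near 1 to a canopy whose stomata are largely closed.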
The attention mechanism allows the model to adaptively weigh each data modality. When stress symptoms are subtle, it assigns greater importance to sensor data, whereas for visible symptoms, it relies more on images. This adaptive strategy reflects the reasoning process of an experienced agronomist. However, the extended 10-seed ablation study shows that the gated fusion mechanism does not yield a statistically significant accuracy improvement over simpler fusion strategies (No-Gating: 99.1 ± 0.6%; Concat Fusion: 99.3 ± 0.7%; Full Model: 98.3 ± 1.5%) on this 379-sample dataset. The gated approach is retained for its interpretability and potential benefits on larger, more heterogeneous datasets where adaptive modality weighting is expected to be more impactful.
4.1. Generalization Across Datasets
Although the model was trained using greenhouse barley images, it achieved 99.97% accuracy on the PlantVillage dataset, which contains more than 26,000 images from 14 different crop species with multiple disease types. This strong cross-dataset performance suggests that the model learned general patterns of plant stress rather than barley-specific features.
This generalization can be explained by multiple factors. First, EfficientNetV2, pretrained on ImageNet, provides strong general visual features that perform well in plant image analysis. Second, many stress symptoms such as chlorophyll degradation and cell damage present similar patterns across different crops and stress types. Third, the model retained high performance even with differences in background and lighting between datasets. This indicates that it targets plant features rather than image artifacts.
4.2. Comparison with Related Work
Table 14 compares the proposed approach with recent deep learning methods for plant disease detection.
Most existing studies in Table 14 rely exclusively on image-based models. For instance, hybrid CNN–Vision Transformer architectures proposed by Prashant S. Thakur et al. [55] (98.86%) and lightweight convolutional models such as MobileNetV3-Small by A. T. Khan et al. [56] (99.50%) achieve high classification accuracy. Similarly, transformer-based approaches including I. Pacal [57] (99.24%) and the PDLC-ViT model by S. Hemalatha [58] (99.97%) further improve performance through attention mechanisms. However, as indicated in the table, all these methods operate only on visual inputs.
This reliance on image-only data introduces important limitations. High accuracies are typically reported on controlled datasets, where visual symptoms are clearly distinguishable. In such settings, models may implicitly learn dataset-specific features rather than robust disease representations. Moreover, image-based approaches cannot capture early physiological stress signals that are not yet visually observable, limiting their applicability for early detection and real-world deployment under varying environmental conditions.
In contrast, the AgriFusionNet model [22] demonstrates the impact of multimodal fusion combining RGB imagery with multispectral data and environmental sensor data, reporting 94.3% accuracy on its original multi-species, multi-disease dataset. When re-implemented on our single-species, single-stress 379-sample greenhouse dataset using 5-fold cross-validation, AgriFusionNet achieved 75.2 ± 5.2% mean accuracy, substantially lower than the proposed method (98.3 ± 1.5%). This gap reflects both the domain difference (AgriFusionNet was designed for multispectral + multi-disease data) and the benefit of the pretrained EfficientNetV2-S backbone used in our approach. Shallow baselines trained on the same dataset, namely Logistic Regression (92.1 ± 3.5%), Random Forest (89.9 ± 6.2%), and a 3-layer Shallow CNN (92.5 ± 3.0%), all perform substantially below our full model, supporting the added value of deep transfer learning even on small datasets.
The proposed method in our work adopts a multimodal framework by integrating visual plant features with sensor data via a gated cross-modal attention module. This design enables the model to learn interactions between visual plant traits and sensor data such as temperature and humidity, which indicate early physiological stress.
The model is also adapted for edge devices thanks to its compact size and INT8 quantization, with a memory footprint of about 14.9 MB while preserving high accuracy. As a result, it provides a well-rounded trade-off between accuracy, reliability, and efficiency for real-time plant stress detection in precision agriculture.
4.3. Limitations
Despite the strong performance of the proposed model, several limitations should be acknowledged. First, the annotation process relies on proxy-based indicators rather than direct physiological measurements such as leaf water potential or stomatal conductance. While the combination of visual symptoms and environmental context improves labeling consistency, it does not provide a definitive physiological characterization of drought stress.
Second, the use of temperature and humidity measurements introduces potential ambiguity, as similar environmental conditions may also be associated with other abiotic stresses, particularly heat stress. As a result, the proposed system identifies stress conditions consistent with drought but cannot fully distinguish between different stress types.
Third, the environmental measurements were obtained using a low-cost DHT11 sensor, which provides limited precision compared to scientific-grade instrumentation. Although this choice aligns with the objective of developing a cost-effective and deployable system, it may affect the accuracy of environmental characterization.
Additionally, several key environmental factors were not recorded or controlled, including soil type, light intensity, ventilation or airflow conditions, and precise irrigation levels. These variables are known to significantly influence plant responses to drought stress, and their absence limits both the interpretability and reproducibility of the experimental setup.
In the same context, all data were collected from a single greenhouse environment without replication across seasons, locations, or different plant batches. This restricts the variability captured in the dataset and limits the assessment of model robustness under diverse environmental and agronomic conditions.
In addition to these aspects, several methodological and experimental constraints further delimit the scope of the findings. The dataset comprises 379 samples collected from a single greenhouse over a limited time period. Although this scale is comparable to exploratory studies in precision agriculture, it remains relatively small for training a deep architecture such as EfficientNetV2-S. Measures including transfer learning, data augmentation, dropout regularization, and multi-seed evaluation were implemented to mitigate overfitting; however, the dataset does not fully capture variability in plant development stages, environmental dynamics, or inter-plant heterogeneity encountered in real-world deployments. Moreover, the single-environment setting prevents evaluation across different climatic conditions, greenhouse configurations, or geographical contexts.
The proxy-based annotation protocol, while more robust than purely visual labeling, does not provide physiologically validated ground truth. Direct measurements such as leaf water potential or stomatal conductance would enable a more precise quantification of drought stress severity. Consequently, class boundaries may remain approximate, particularly in borderline cases where stress symptoms are not yet fully expressed.
The environmental sensing setup introduces additional limitations. The DHT11 sensor provides moderate accuracy, which is sufficient for capturing general environmental trends but may not detect subtle transitions near stress thresholds.
Furthermore, the absence of calibration against reference instruments and the lack of detailed documentation of greenhouse conditions—such as irrigation regimes, light intensity, or substrate properties—limit experimental reproducibility and environmental interpretability.
From a modeling perspective, the ablation study indicates that the proposed gated fusion mechanism does not yield statistically significant improvements over simpler fusion strategies under the current experimental conditions. This suggests that, in relatively homogeneous environments, simpler multimodal fusion approaches may already be sufficient, while the benefits of adaptive gating may only emerge in more complex or heterogeneous datasets.
Finally, the framework is restricted to binary classification between healthy and drought-stressed plants. It does not address multi-stress scenarios or provide continuous stress severity estimation, both of which are important for practical precision agriculture applications. In addition, the absence of temporal modeling prevents the system from capturing the progressive nature of drought stress, limiting its ability to detect early-stage stress dynamics.
5. Conclusions
This study presents a multimodal deep learning framework for real-time drought stress detection in barley. It offers an accurate and computationally efficient solution for precision agriculture. The proposed architecture combines EfficientNetV2-S visual features with temperature and humidity data through cross-modal attention and gated fusion. This allows adaptive integration of visual and sensor inputs and is suitable for edge deployment. The model achieved 98.3 ± 1.5% mean accuracy across 10 random seeds with an 80/20 train–test split (379 samples), with the best individual seed achieving 100% across all metrics. An extended ablation study with ANOVA (F(4,36) = 4.44, p = 0.005) indicates that multimodal fusion (98.3%) outperforms image-only (97.4%) and sensor-only (73.8%) baselines. External validation on the PlantVillage dataset showed strong generalization, with 99.97% accuracy, indicating that the model captures stress patterns transferable across other plant species. Systematic CPU optimization reduced training time by 68.5%, while INT8 quantization decreased model size by 90% with minimal accuracy loss, enabling deployment on embedded devices. The overall system is cost-effective and can be integrated with existing farm management systems.
In future work, we will explore temporal modeling using sequential networks to detect stress before visual symptoms appear and extend the framework to classify multiple stress types, such as drought, heat, nutrient deficiency, or disease. Field validation under diverse conditions is essential to ensure robustness and adaptability. Improving interpretability through explainable AI techniques would increase trust among agronomists, while federated learning could enable collaborative model improvement across farms while preserving privacy. Predicting continuous stress severity and integrating additional sensors, such as multispectral, thermal, LiDAR, or soil sensors, would enhance monitoring precision. Overall, this work demonstrates that multimodal deep learning can be effectively applied in precision agriculture by combining affordable sensors, efficient edge AI, and transfer learning to support sustainable crop management under changing environmental conditions.