An Efficient Multimodal Framework for Barley Drought Stress Detection on Resource-Constrained Devices

Boukouba, Rihab; Ben Aissa, Dalenda; Guidara, Amira; Smaoui, Nadia; Ebel, Chantal

doi:10.3390/agriengineering8060230

by Rihab Boukouba ¹,²,*, Dalenda Ben Aissa ¹, Amira Guidara ³, Nadia Smaoui ⁴ and Chantal Ebel ⁵

¹ Microwave Electronics Research Laboratory, Faculty of Sciences of Tunis, Tunis El-Manar University, El Manar, Tunis 2029, Tunisia

² National Engineering School of Gabes, University of Gabes, Gabes 6029, Tunisia

³ ATMS Technologies of Medicine and Signals, National Engineering School of Sfax, University of Sfax, Sfax 3038, Tunisia

⁴ Control and Energy Management Laboratory, National Engineering School of Sfax, University of Sfax, Sfax 3038, Tunisia

⁵ Plant Physiology and Functional Genomics Research Unit, Higher Institute of Biotechnology, Sfax 3000, Tunisia

* Author to whom correspondence should be addressed.

AgriEngineering 2026, 8(6), 230; https://doi.org/10.3390/agriengineering8060230

Submission received: 22 March 2026 / Revised: 1 May 2026 / Accepted: 10 May 2026 / Published: 5 June 2026

(This article belongs to the Special Issue Precision Agriculture: Sensor-Based Systems and IoT-Enabled Machinery)

Abstract

Drought stress significantly impacts barley (Hordeum vulgare L.) production, necessitating early and accurate detection systems for precision agriculture. Traditional monitoring approaches rely on manual inspection or single-modality sensing, which often fail to capture the complex physiological responses to water deficit. This study presents a novel multimodal deep learning framework that integrates RGB imaging with environmental sensor data (temperature and humidity) for real-time drought stress classification in barley plants. The proposed architecture employs EfficientNetV2-S for visual feature extraction, coupled with a dedicated sensor encoding branch, unified through a cross-modal attention mechanism and gated multimodal fusion strategy. To address the computational constraints of agricultural IoT systems, we implemented comprehensive CPU optimization techniques and model compression via TensorFlow Lite INT8 quantization, achieving a 68.5% reduction in training time and 90% reduction in model size. The system was validated on a custom greenhouse dataset (379 samples, 80/20 split) and the PlantVillage dataset (26,000 images, binary reformulation). A 10-seed evaluation protocol demonstrated that the full multimodal model achieves 98.3 ± 1.5% accuracy, outperforming both an image-only baseline (97.4 ± 1.8%) and a sensor-only MLP (73.8 ± 3.5%). Across seeds, the model also achieved an F1-score of 98.34 ± 1.48% and ROC-AUC of 99.93 ± 0.13%. Ablation analysis with ANOVA (F(4,36) = 4.44, p = 0.005) confirmed that multimodal fusion improves accuracy by 0.92% over image-only models, with the full gated cross-modal attention mechanism outperforming all simplified baselines, including AgriFusionNet (75.22%), Shallow CNN (92.54%), Logistic Regression multimodal (92.11%), and Random Forest multimodal (89.91%). These results further show that relying on environmental data alone is insufficient, reinforcing the benefit of multimodal fusion. 
External validation on PlantVillage achieved 99.97% accuracy, demonstrating strong generalization capabilities. The optimized model operates efficiently on CPU-only hardware (training time: 9.1 min/epoch), making it suitable for edge deployment in resource-constrained agricultural environments. This work demonstrates that a low-cost, CPU-compatible multimodal deep learning system can reliably detect drought stress in barley under real greenhouse conditions and provides a practical and scalable solution for early stress monitoring in precision agriculture.

Keywords:

drought stress; barley; multimodal learning; deep learning; edge computing; IoT; precision agriculture; EfficientNet; sensor fusion; model compression

1. Introduction

Sustainable agriculture depends on precise and energy-efficient crop stress detection strategies. Barley (Hordeum vulgare L.) [1] is one of the world’s most widely cultivated cereal crops and is important for food, feed, and industrial applications [2]. However, its productivity is highly sensitive to drought stress, particularly in semi-arid and Mediterranean regions where rainfall variability has intensified due to climate change [3]. Early and accurate detection of drought effects is essential to prevent yield losses and improve irrigation decisions [4].

Traditional drought stress monitoring approaches, including visual inspection and manual physiological measurements, are labor-intensive and offer limited scalability [5]. Early work in automated plant stress detection relied on classical machine learning approaches with hand-crafted features. Support vector machines (SVMs) with color and texture features achieved moderate accuracy for disease classification but required extensive feature engineering and struggled with generalization [6]. These conventional methods cannot supply the real-time data needed for large-scale decision support in precision agriculture.

The advent of deep learning revolutionized plant stress detection, with convolutional neural networks (CNNs) demonstrating superior performance in capturing morphological stress cues. Deep CNNs trained on large-scale datasets such as PlantVillage have achieved accuracies of 99% for disease classification across multiple crop species [7]. However, these models often suffered from domain shift when tested on real field conditions, highlighting critical generalization challenges.

A variety of deep learning architectures have been evaluated for plant disease classification [8]. These include CNNs, fine-tuned pretrained networks such as AlexNet, VGGNet, GoogLeNet, ResNet variants, DenseNet, and Inception models, and hybrid approaches combining CNN-based feature extraction with traditional classifiers (SVM or BPNN). Reported accuracies vary across crops and datasets, from 76% to 99.75%, which highlights the influence of dataset characteristics, crop type, and model configuration. While deeper models generally improve performance, they can also increase computational complexity and training requirements.

To address computational constraints, lightweight architectures based on MobileNetV2 have demonstrated feasibility for efficient plant disease detection, achieving over 98% accuracy with significantly fewer parameters [9]. EfficientNet-based models have shown strong performance due to their compound scaling strategy, which enables improved accuracy with balanced depth–width–resolution optimization [10].

Despite these advances, image-only approaches remain limited. Visual symptoms [11] often appear after physiological stress has already occurred, reducing their effectiveness for early drought detection. Furthermore, their sensitivity to illumination changes, camera position, and background noise [12] can degrade performance in real-world agricultural settings. These studies focus on visual information and fail to incorporate environmental context, which is critical for understanding the physiological mechanisms underlying stress responses.

Parallel to computer vision approaches, extensive research has developed sensor-based plant monitoring systems. Infrared thermometry pioneered the use of canopy temperature measurements for crop water stress assessment and led to the development of indices such as the Crop Water Stress Index (CWSI) [13]. This approach has been widely adopted; however, it requires precise calibration and is limited in addressing spatial heterogeneity [14]. Recent advances in the Internet of Things (IoT) enable automated and real-time crop monitoring by coupling RGB imaging with environmental sensing. IoT-based systems integrate multiple sensor types including soil moisture, temperature, humidity, and light intensity [15] to provide comprehensive environmental monitoring. Wireless sensor networks combined with machine learning techniques have been developed for irrigation management, using environmental data such as soil moisture and weather parameters. These approaches achieve prediction accuracies approaching 90% in controlled environments [16]. Fuzzy logic systems integrating soil and atmospheric sensors have also been implemented for automated irrigation control [17]. By monitoring environmental and microclimatic conditions such as soil moisture, temperature, and humidity, these systems enable optimized irrigation management [18].

However, sensor-based monitoring systems that are based only on environmental parameters (soil moisture, temperature, humidity) are limited in their ability to directly characterize plant phenotypic responses. They cannot capture the spatial distribution patterns or morphological manifestations of stress at the canopy scale, which require imaging-based observation [19].

To overcome unimodal limitations, the integration of imaging and sensor data for plant monitoring has become a promising research direction. Agricultural studies have combined remote sensing imagery with ground-based sensor networks. Combining UAV images and weather data improves crop yield prediction with machine learning models [20]. However, such approaches use simple feature concatenation and do not learn interactions between different modalities using attention mechanisms. This limits their ability to adapt to different environments and field conditions. In medical imaging, cross-modal attention has been successful in combining modalities such as MRI and PET [21] by learning strong inter-modal relationships. This shows the potential of such methods for agricultural multimodal fusion.

Recent studies explore similar approaches in agriculture by fusing RGB and multispectral drone imagery with IoT-based environmental sensor data (temperature, humidity, soil moisture) for plant disease detection, achieving strong classification performance [22]. Recent multimodal architectures use attention mechanisms or transformer-based frameworks to encode cross-modal correlations by weighting visual features based on contextual sensor inputs [23]. However, many existing multimodal approaches use resource-intensive architectures that require GPUs during training and inference, which limits real-time deployment on mobile or edge devices in agricultural environments.

Despite improvements in fusion with attention-based models, achieving both effective cross-modal learning and efficient edge deployment remains difficult in agricultural systems. This is due to the strict constraints of edge devices and IoT platforms in terms of memory, processing power, and energy consumption. Recent surveys on Edge AI for crop disease detection confirm that deploying deep learning models on such resource-constrained devices remains challenging [24]. To address this issue, model compression techniques such as pruning and quantization have been explored to reduce model size and computational complexity, often with minimal impact on accuracy [25,26]. Recent works [27,28] have focused on efficient plant disease detection through compact CNNs and knowledge distillation, which achieve high accuracy with models optimized for mobile deployment. In this context, TensorFlow Lite and INT8 quantization reduce model size and inference time, which enable deployment on embedded systems. However, the impact of these techniques on multimodal agricultural models remains under-investigated, and most optimization studies focus on unimodal models. In addition, few works systematically analyze training efficiency, CPU utilization, and memory consumption alongside predictive performance.

Despite advances in image-based stress detection, sensor monitoring, and model compression, their integration for practical agricultural deployment remains limited. In this context, existing multimodal approaches either rely on simple fusion strategies that fail to learn cross-modal interactions or use complex architectures that require GPU acceleration, which is incompatible with agricultural edge deployment.

In addition to these limitations, most studies validate models on single datasets, which limits the evaluation of their generalization performance. For instance, binary classification using large-scale datasets such as PlantVillage has been explored to reduce class complexity and improve model performance [29]. However, such datasets are rarely used as external validation for models trained on independent greenhouse or field data. Moreover, few multimodal studies consider cross-dataset transfer while also reducing computational cost for CPU-based deployment.

The key contributions of this work are:

A novel lightweight multimodal architecture that combines EfficientNetV2-S visual features with sensor data through cross-modal attention and gated fusion modules.
A systematic ablation study over 10 random seeds with one-way ANOVA (F(4,36) = 4.44, p = 0.005) demonstrating stable performance across architectural configurations; multimodal fusion improves accuracy by 0.9 pp over image-only (98.3% vs. 97.4%), while a sensor-only MLP achieves only 73.8%, confirming the complementary nature of both modalities.
A systematic CPU optimization strategy that reduces training time by 68.5% and achieves 90% model size reduction through INT8 quantization, enabling practical edge deployment.
A robust validation protocol on a custom greenhouse barley dataset (98.3 ± 1.5% mean accuracy over 10 seeds) and external validation on the PlantVillage dataset (99.97% accuracy), demonstrating both high performance and strong generalization ability.

2. Materials and Methods

2.1. System Description and Data Acquisition

Our system (Figure 1) combines visual and environmental information to enable reliable detection of drought stress in barley plants. The experimental setup consists of a Canon 2000D RGB camera (Canon Inc., Tokyo, Japan) used to capture high-resolution images of barley leaves, and an ESP32 microcontroller (Espressif Systems, Shanghai, China) connected to DHT11 sensors (AOSONG Electronics Co., Ltd., Guangzhou, China) that record temperature and relative humidity close to the plant canopy. These readings serve as proxy indicators of the plant response to drought stress.

Data collection was carried out over a three-month period to cover a complete drought cycle. While several hundred images were initially collected, only 379 representative samples were retained after quality selection. Each image was paired with its corresponding temperature and humidity values. To ensure data reliability, blurred, underexposed and redundant captures were excluded.

Each image was annotated using a multimodal labeling protocol combining visual inspection and synchronized environmental measurements. The annotation relied on two complementary sources of information.

First, visual assessment was performed based on observable morphological symptoms commonly associated with plant water deficit, including leaf wilting, chlorosis (yellowing), discoloration, and loss of turgor. Second, environmental context was incorporated using temperature and relative humidity measurements collected concurrently with image acquisition using a DHT11 sensor positioned in close proximity to the plant canopy.

A sample was labeled as “stressed” when visible symptoms consistent with drought stress were observed in conjunction with environmental conditions indicative of water deficit, typically characterized by elevated temperature and reduced relative humidity.

Conversely, samples were labeled as “normal” when plants appeared visually healthy and environmental measurements remained within non-stress ranges. Temperature and humidity measurements represent local microclimatic conditions around the plant canopy, as opposed to direct physiological variables such as stomatal conductance or leaf water potential.

The annotation framework integrates morphological leaf observations with canopy microclimate measurements in a proxy-based labeling strategy, where labels are assigned based on the combined analysis of visual symptoms and concurrent canopy microclimate conditions.

Both image and sensor data were collected and transmitted in real time via Wi-Fi to a central workstation to enable remote monitoring.

To detect stress patterns, a Hybrid EfficientNet–Transformer Sensor Fusion multimodal model was applied to the collected images and sensor data.

2.2. Data Preprocessing Pipeline

A comprehensive preprocessing pipeline was implemented to prepare raw data for model training. It includes image preprocessing, data augmentation, sensor data preprocessing, dataset partitioning, class balancing, and pipeline optimization.

2.2.1. Image Preprocessing

The images captured by the Canon camera were in RGB format. Their dimensions ranged from 2068 to 6000 pixels in width and from 1333 to 4000 pixels in height. Each image used 24 bits per pixel across its color channels.

All RGB images were preprocessed to meet the input requirements of the EfficientNetV2-S architecture. Images were resized to 224 × 224 pixels to maintain a uniform spatial dimension within the dataset. Pixel intensity values were normalized through division by 255.0. This transforms the original [0, 255] range to a [0, 1] normalized scale. This normalization facilitates gradient-based optimization during training and ensures numerical stability. Additionally, all images were converted to 32-bit floating-point format to preserve precision within the computational pipeline.

2.2.2. Data Augmentation

To improve model generalization and reduce overfitting [30], given the limited dataset size (n = 379), we applied geometric and photometric data augmentation during training. Geometric transformations [31] included random horizontal flips (p = 0.5), random rotations (±15°), random shifts (±10% horizontal and vertical), and random zoom (0.9–1.1×). Photometric augmentations [32] included random brightness adjustment (±20%), random contrast adjustment (±20%), and random saturation adjustment (±15%).

Augmentation parameters were chosen to preserve biologically meaningful plant morphology while increasing the diversity of training samples.

Augmentations were applied on-the-fly during training using TensorFlow’s (version 2.20.0; Google LLC, Mountain View, CA, USA) tf.data API with parallel processing (num_parallel_calls = tf.data.AUTOTUNE). This approach avoids storing augmented images on disk and reduces storage requirements. Augmentations were applied only to the training set, while validation data was not augmented to ensure unbiased performance evaluation. Sensor measurements were not augmented, as they represent ground-truth environmental conditions.

2.2.3. Sensor Data Preprocessing

Sensor measurements (temperature and humidity) were preprocessed using standardized normalization techniques.

A Z-score standardization [33] was applied using StandardScaler [34] from the scikit-learn library (version 1.7.2). This transformation adjusts each feature to have zero mean and unit variance. The aim of using such normalization is to ensure that sensor features contribute proportionally to model learning.
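As a minimal sketch of this step, the snippet below reproduces the Z-score transformation in plain Python rather than calling scikit-learn. The sensor readings are hypothetical, and the helper names `fit_standardizer` and `transform` are illustrative only. The statistics are estimated on the training rows and then reused unchanged for the test rows.

```python
import statistics

def fit_standardizer(rows):
    """Return per-feature means and standard deviations from training rows."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Population standard deviation (ddof = 0), the convention StandardScaler uses.
    # A zero deviation is replaced by 1.0 to avoid division by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds

def transform(rows, means, stds):
    """Apply z = (x - mean) / std feature by feature."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

# Hypothetical (temperature in degrees C, relative humidity in %) readings.
train = [[24.0, 60.0], [28.0, 50.0], [32.0, 40.0], [36.0, 30.0]]
test = [[30.0, 45.0]]

means, stds = fit_standardizer(train)   # fitted on training data only
train_z = transform(train, means, stds)
test_z = transform(test, means, stds)   # same statistics reused for test data
```

After the transformation each training column has zero mean and unit variance, so temperature and humidity enter the sensor branch on a comparable scale.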

2.2.4. Dataset Partitioning

The complete dataset (n = 379) was partitioned into training (80%, n = 303) and test (20%, n = 76) sets using stratified random sampling to maintain class balance across splits. The test set was reserved for final evaluation.

In k-fold cross-validation experiments, the training set was further split into five folds for hyperparameter tuning, while the test set was excluded from cross-validation [35] to preserve an independent evaluation protocol.

All ablation results reported in this paper use an 80/20 fixed split evaluated over 10 random seeds. All preprocessing steps, including image normalization and sensor standardization, were fitted exclusively on the training data and applied unchanged to test data to prevent data leakage.
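The partitioning described above can be illustrated with a small stratified split written in plain Python. This is a sketch under stated assumptions, not the routine used in the study: the function `stratified_split` is hypothetical, the class counts (190 normal, 189 stressed) are those reported for the dataset, and the per-class rounding rule is an illustrative choice.

```python
import random

def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) with class proportions preserved."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set.
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = ["normal"] * 190 + ["stressed"] * 189
train_idx, test_idx = stratified_split(labels, seed=42)
```

Repeating the call with ten different `seed` values gives ten independent 80/20 partitions of the kind a multi-seed protocol averages over.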

2.2.5. Class Balancing Strategy

Although the dataset is approximately balanced (50.1% healthy, 49.9% stressed), minor deviations from a perfect 50/50 distribution naturally arise due to real-world sampling variability. To ensure robustness to stochastic mini-batch sampling and to guarantee equal contribution of both classes during optimization, class weighting [36] was applied during training.

Class weights were computed using Equation (1):

w_{c} = \frac{N}{C \times N_{c}}

(1)

where N is the total number of samples, C is the number of classes, and N_{c} is the number of samples in class c. This resulted in approximately equal weights for both classes (w_{healthy} = 1.00, w_{stressed} = 1.00), consistent with the near-balanced nature of the dataset.
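Equation (1) can be checked numerically with a few lines of Python. The sketch below assumes the weights are computed from the full-dataset class counts (190 normal, 189 stressed); the text does not state whether the full dataset or only the training split was used, and the function name `class_weights` is illustrative.

```python
def class_weights(counts):
    """Balanced class weights: w_c = N / (C * N_c)."""
    total = sum(counts.values())   # N, total number of samples
    n_classes = len(counts)        # C, number of classes
    return {label: total / (n_classes * n) for label, n in counts.items()}

weights = class_weights({"normal": 190, "stressed": 189})
rounded = {label: round(w, 2) for label, w in weights.items()}
```

Both weights differ from 1 only in the third decimal place, which is why they round to 1.00.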

2.2.6. Data Pipeline Optimization

To maximize training efficiency on CPU hardware, we built an optimized data pipeline using TensorFlow’s tf.data API [37] with key enhancements: prefetching to load batches in parallel with training, multi-threaded data loading to leverage multiple CPU cores, in-memory caching to avoid redundant preprocessing after the first epoch, and efficient batch shuffling to ensure good sample mixing.

Profiling showed these optimizations reduced data loading time from 42% to just 8% of total training time, effectively eliminating the pipeline bottleneck. We also enabled XLA (Accelerated Linear Algebra) compilation [38] to optimize computational graph execution, reducing overhead and accelerating training iterations. These strategies allowed the model to fully utilize available CPU resources without idle periods waiting for data. All experiments were implemented in Python (version 3.10.19; Python Software Foundation, Wilmington, DE, USA) using the following libraries: NumPy (version 2.2.6), pandas (version 2.3.3), Matplotlib (version 3.10.8), seaborn (version 0.13.2), SciPy (version 1.15.3), statsmodels (version 0.14.6), and psutil (version 7.0.0).

2.3. Proposed Model Architecture

The proposed multimodal architecture integrates visual and sensor modalities through a carefully designed fusion framework. Figure 2 illustrates the complete architecture. It is composed of five main components: (1) visual feature extraction using EfficientNetV2-S, (2) sensor encoding branch, (3) cross-modal attention mechanism, (4) gated multimodal fusion module, and (5) classification head.

2.3.1. Visual Feature Extraction (EfficientNetV2-S)

The visual branch uses EfficientNetV2-S as the backbone for RGB image feature extraction. EfficientNetV2 improves training efficiency and accuracy through fused MBConv blocks, progressive learning, and optimized architecture search. We selected the S variant (21.5 M parameters) due to its superior accuracy–efficiency trade-off under CPU-constrained deployment, outperforming larger variants and alternative architectures in preliminary experiments.

The network is initialized with ImageNet-pretrained weights and fine-tuned on the barley dataset. Input images I ∈ ℝ^{224 × 224 × 3} are processed through successive convolutional blocks with progressive spatial reduction and channel expansion. This produces high-level visual features that encode barley-specific morphology, texture, and color variations. A Global Average Pooling (GAP) layer is applied to the final convolutional feature maps. It results in a compact visual embedding v_{img} ∈ ℝ^{1280}, which is further refined using a fully connected projection layer with nonlinear activation.

To balance adaptation and generalization, early layers are partially frozen while deeper layers are fine-tuned using a reduced learning rate. This enables effective transfer learning and preserves pretrained representations. This strategy demonstrated strong performance in ablation studies.

2.3.2. Sensor Encoding Branch

The sensor branch processes temperature and humidity values through a multilayer perceptron [39] (MLP). The encoding branch consists of two dense layers with nonlinear activation functions, combined with Batch Normalization to stabilize gradient propagation and improve generalization. The transformation can be expressed as [40]:

v_{sens} = σ (W_{2} (σ (W_{1} x + b_{1})) + b_{2})

(2)

where x represents the raw sensor vector, and v_{sens} \in R^{d_{s}} denotes the learned physiological embedding.

2.3.3. Cross-Modal Attention Mechanism

To enable adaptive feature weighting and cross-modal information exchange, we implement a cross-modal attention mechanism [41,42] that allows each modality to attend to relevant features from the other modality.

The cross-attention module aligns visual and sensor representations. In this mechanism, the image embeddings are used as queries and sensor embeddings as keys and values. This enables the model to modulate visual feature importance conditioned by sensor data.

The attention computation is formulated as:

α = softmax (\frac{(W_{Q} v_{img}) (W_{K} v_{sens})^{T}}{\sqrt{d_{k}}}), v_{attn} = α (W_{V} v_{sens})

(3)

where W_{Q}, W_{K}, W_{V} denote learnable projection matrices.

This attention mechanism links temperature–humidity indicators with observable leaf stress patterns.

The output v_{attn} adjusts visual focus according to sensor data.
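The scaled dot-product form of Equation (3) can be sketched in plain Python as follows. Several details here are assumptions, since the text does not give tensor shapes: both embeddings are treated as short sequences of tokens, a row-vector convention is used (tokens multiply the projection matrices from the left), and the toy dimensions and identity projections are purely illustrative.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax_rows(m):
    out = []
    for row in m:
        peak = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(x - peak) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def cross_attention(v_img, v_sens, w_q, w_k, w_v):
    """Image tokens act as queries; sensor tokens act as keys and values."""
    q = matmul(v_img, w_q)     # shape (n_img, d_k)
    k = matmul(v_sens, w_k)    # shape (n_sens, d_k)
    v = matmul(v_sens, w_v)    # shape (n_sens, d_v)
    d_k = len(q[0])
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, transpose(k))]
    alpha = softmax_rows(scores)       # shape (n_img, n_sens)
    return alpha, matmul(alpha, v)     # attended output, shape (n_img, d_v)

# Toy inputs: 3 image tokens and 2 sensor tokens, each of dimension 2 (assumed).
identity = [[1.0, 0.0], [0.0, 1.0]]
v_img = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v_sens = [[0.5, -0.5], [-1.0, 1.0]]
alpha, v_attn = cross_attention(v_img, v_sens, identity, identity, identity)
```

Each row of `alpha` is a probability distribution over sensor tokens, so every image token receives a sensor-conditioned mixture of the value vectors.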

2.3.4. Gated Multimodal Fusion and Classification Head

After attention alignment, cross-modal fusion [42] is implemented through three complementary integration mechanisms, as illustrated in Figure 3:

Batch-Normalization Path: The visual embedding is standardized using batch normalization to preserve numerical stability.
Multiplicative (Gating) Path: Element-wise multiplication [43] of the normalized visual and sensor embeddings serves as an adaptive gate. It amplifies or reduces image features in proportion to physiological importance.
Attention Path: The attention output v_{attn} is flattened to match the other feature dimensions. It uses the sensor information to highlight image features that are important for drought stress detection.

The outputs of these three branches are concatenated to form a unified representation [44]:

h = [BN (v_{img}), v_{img} ⊙ v_{sens}, v_{attn}]

(4)

To further enhance modality adaptivity, we apply learnable gates that weight the contributions of the visual (v^{'}) and sensor (z^{'}) features, conditioned on both modalities [43]:

g_{v} = σ (W_{gv} v^{'} + W_{gz} z^{'} + b_{gv}), g_{z} = σ (W_{gv} v^{'} + W_{gz} z^{'} + b_{gz})

(5)

where W_{gv}, W_{gz} \in R^{1280 \times 1280}, b_{gv}, b_{gz} \in R^{1280}, and σ is the sigmoid activation producing gates g_{v}, g_{z} \in [0, 1]^{1280}.

The final fused representation combines the gated features [45]:

f = g_{v} ⊙ v^{'} + g_{z} ⊙ z^{'}

(6)

and is normalized via layer normalization [46]:

f_{norm} = LayerNorm (f)

(7)

This fused vector f_{norm} is passed through a classification head comprising dense layers with nonlinear activations, batch normalization, and a dropout rate of 0.3. The final output layer produces normalized probabilities corresponding to the predicted plant condition (Normal or Stressed). This hybrid attention-gated fusion ensures interpretability, CPU-efficient execution, and improved performance, as confirmed by ablation studies.
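The gating and normalization steps of Equations (5) to (7) can be illustrated with a small plain-Python sketch. It follows the equations as printed, with both gates sharing the weighted sum and differing only in their bias. The three-dimensional toy vectors, identity weight matrices, and the layer normalization without learnable scale and shift are assumptions made for brevity; the model described above uses 1280-dimensional features.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def gated_fusion(v, z, w_gv, w_gz, b_gv, b_gz):
    # Shared pre-activation W_gv v' + W_gz z' used by both gates (Equation 5).
    shared = [a + b for a, b in zip(matvec(w_gv, v), matvec(w_gz, z))]
    g_v = [sigmoid(s + b) for s, b in zip(shared, b_gv)]
    g_z = [sigmoid(s + b) for s, b in zip(shared, b_gz)]
    # Gated element-wise combination (Equation 6).
    f = [gv * vi + gz * zi for gv, vi, gz, zi in zip(g_v, v, g_z, z)]
    # Layer normalization (Equation 7).
    return g_v, g_z, layer_norm(f)

# Toy 3-dimensional features and parameters (all values are hypothetical).
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
v_prime = [1.0, -2.0, 0.5]
z_prime = [0.2, 0.4, -0.6]
g_v, g_z, f_norm = gated_fusion(
    v_prime, z_prime, identity, identity, [0.0, 0.0, 0.0], [0.5, 0.5, 0.5]
)
```

Because the sigmoid keeps every gate value strictly between 0 and 1, each feature dimension receives its own soft mixture of visual and sensor information before normalization.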

2.4. Computational Optimization and Model Compression

Given the target deployment environment of agricultural edge devices with limited computational resources, we implemented a comprehensive optimization strategy encompassing CPU-specific training acceleration, model compression, and quantization. Along with an efficient CPU architecture, the model and training pipeline were optimized for computational efficiency on CPU-based systems. Multi-threading, AUTOTUNE [37] and TensorFlow’s XLA JIT compilation [38] optimize the data pipeline, accelerate training, and reduce resource usage. Selective backbone fine-tuning [47] and batch normalization [40] reduce unnecessary computation. These combined strategies balance predictive accuracy and efficiency and make the model compatible with real-time stress detection on resource-constrained agricultural devices.

2.4.1. CPU Optimization Strategy

Training deep neural networks on CPU hardware requires careful optimization to achieve acceptable training times. We implemented the following CPU-specific optimizations:

Threading Configuration: TensorFlow inter-op and intra-op parallelism threads were set to 6 (matching the number of physical CPU cores on our Intel Core i7 training machine (Intel Corporation, Santa Clara, CA, USA)). This enhances CPU utilization without oversubscription overhead.
Memory Layout Optimization: Model operations were configured to use NHWC (batch, height, width, channels) data layout instead of the GPU-optimized NCHW layout, as NHWC aligns with CPU cache line structure and enables better vectorization.
Mixed Precision Training: While typically associated with GPU training, we enabled TensorFlow’s experimental CPU mixed precision mode that uses AVX-512 instructions for float16 computation on supported CPUs, reducing memory bandwidth requirements.
Operator Fusion: Graph optimization passes were enabled to fuse consecutive operations, reducing memory traffic and improving instruction-level parallelism.
Batch Size Tuning: Larger batch sizes (16 vs. 4–8 typical for GPU training) were used to amortize per-batch overhead and improve CPU cache utilization.

These optimizations collectively reduced training time per epoch from 28.9 min to 9.1 min (68.5% reduction) compared to the non-optimized baseline. This makes CPU-only training practical for agricultural applications.
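The reported reduction follows directly from the two per-epoch timings, as this short arithmetic check shows. The variable names are illustrative, and the speedup factor is derived here from the same two numbers rather than quoted from the text.

```python
baseline_min = 28.9    # minutes per epoch, non-optimized baseline
optimized_min = 9.1    # minutes per epoch, after CPU optimizations

reduction_pct = 100.0 * (baseline_min - optimized_min) / baseline_min
speedup = baseline_min / optimized_min
```

The same timings correspond to a speedup of roughly 3.2 times per epoch.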

2.4.2. Training Acceleration Techniques

Beyond hardware-specific optimizations, we employed several algorithmic strategies to accelerate convergence:

Transfer Learning: Initializing EfficientNetV2 with ImageNet-pretrained weights reduced the required training epochs from approximately 80 to 30 by starting from a strong feature extractor.
Learning Rate Scheduling: The annealing schedule with warm restarts [48] improved convergence speed by approximately 20% compared to fixed learning rates or simple step decay.
Early Stopping: Monitoring validation loss with patience = 10 epochs prevented unnecessary training beyond the optimal point, typically saving 15–20 epochs.
Data Pipeline Optimization: Prefetching and parallel loading eliminated data loading as a bottleneck, ensuring efficient utilization of computational resources.

Together, these techniques enabled 5-fold cross-validation (5 complete training runs) to complete in 2.55 h total on CPU hardware, making iterative model development feasible without GPU access.
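Of the techniques listed above, the early stopping rule is the simplest to illustrate. The sketch below is a plain-Python stand-in for a framework callback, with a hypothetical function name and a made-up validation loss curve. It stops once the validation loss has failed to improve for `patience` consecutive epochs.

```python
def early_stopping_epoch(val_losses, patience=10):
    """Return (stop_epoch, best_epoch), both counted from 1."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop training here.
            return epoch, best_epoch
    return len(val_losses), best_epoch

# Hypothetical curve: improves for 5 epochs, then plateaus at a higher loss.
val_losses = [1.0, 0.8, 0.6, 0.5, 0.45] + [0.5] * 20
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=10)
```

With this curve, training halts at epoch 15 instead of running all 25 epochs, and the weights from epoch 5 would be the ones to keep.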

2.4.3. TensorFlow Lite Conversion

For deployment on edge devices, we converted the trained Keras model (integrated in TensorFlow 2.20.0) to TensorFlow Lite (TFLite) format, which provides a lightweight inference engine optimized for mobile and embedded platforms [49]. The conversion process involved:

Model Freezing: The trained model graph and weights were frozen into a single checkpoint file, eliminating training-specific operations (dropout, batch normalization in training mode).
Graph Optimization: TFLite converter applied optimization passes including constant folding, unused node elimination, and operation fusion to simplify the computational graph.
Operator Selection: All model operations were verified to have TFLite implementations. Custom operations (if any) would require dedicated kernel implementations.

The resulting Float32 TFLite model achieved 55% size reduction with no accuracy loss. This reduction stems from eliminating TensorFlow framework overhead and Python (version 3.10.19) serialization artifacts, not from precision reduction.

2.4.4. INT8 Quantization

Further compression was achieved through post-training INT8 quantization [49], which converts 32-bit floating-point weights and activations to 8-bit integers. Quantization introduces controlled precision loss but dramatically reduces model size and accelerates inference on hardware with integer arithmetic units (common in edge devices).

We employed post-training quantization with a representative dataset (random sample of 100 training images) to calibrate quantization parameters (scale and zero-point) for each layer:

$$\text{quantized\_value} = \operatorname{round}\left(\frac{\text{float\_value}}{\text{scale}}\right) + \text{zero\_point} \tag{8}$$

where scale and zero-point are determined by observing activation ranges during calibration. This approach requires no retraining, unlike quantization-aware training.
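Equation (8) can be made concrete with a small pure-Python sketch. The min/max calibration rule, the signed INT8 range of [-128, 127], and the clamping step are assumptions made for illustration; the calibration performed internally by the TFLite converter may differ in detail.

```python
def calibrate(values, qmin=-128, qmax=127):
    """Derive (scale, zero_point) from an observed float range.

    Min/max calibration is an assumption of this sketch.
    """
    lo = min(min(values), 0.0)  # widen the range so 0.0 is exactly representable
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for an all-zero range
    zero_point = round(qmin - lo / scale)
    return scale, zero_point


def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Apply Eq. (8), then clamp to the INT8 range (clamping is added here)."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Map an INT8 code back to an approximate float value."""
    return scale * (q - zero_point)


# Hypothetical activation values observed during calibration.
activations = [-1.0, 0.0, 0.5, 2.0]
scale, zero_point = calibrate(activations)
codes = [quantize(a, scale, zero_point) for a in activations]
```

For values inside the calibrated range, the round-trip error of `dequantize(quantize(x))` is bounded by half the scale, which is the "controlled precision loss" referred to above.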

3. Results

3.1. Performance on the Greenhouse Barley Dataset

The proposed multimodal model was evaluated on the greenhouse barley dataset comprising 379 samples. The dataset was divided using an 80/20 stratified split (random_state = 42), yielding 303 training samples and 76 test samples. The class distribution was nearly balanced, with 190 normal (50.1%) and 189 stressed (49.9%) samples. The model was trained for 60 epochs with a batch size of 16 on CPU-only hardware. Table 1 summarises the dataset characteristics.

The confusion matrix (Figure 4) illustrates the model’s classification performance on the 76 test samples. The matrix entries are defined as follows:

True Negative (TN = 36): correctly predicted as “normal plant” (negative class).
False Positive (FP = 1): incorrectly predicted as “stressed plant” when the actual class is “normal.”
True Positive (TP = 39): correctly predicted as “stressed plant” (positive class).
False Negative (FN = 0): incorrectly predicted as “normal plant” when the actual class is “stressed.”

The model correctly classified 75 out of 76 samples, achieving an accuracy of 98.7%. The single misclassification was a healthy plant predicted as stressed (one false positive). Notably, recall reached 100%, confirming that no stressed plant was missed (no false negatives). Table 2 presents the full performance metrics, and the performance is further visualized in Figure 5.
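The headline metrics follow directly from the four confusion-matrix counts given above. The short sketch below recomputes them; the function name is illustrative.

```python
def binary_metrics(tn, fp, fn, tp):
    """Standard binary classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }


# Counts reported for the 76-sample test set.
metrics = binary_metrics(tn=36, fp=1, fn=0, tp=39)
```

With these counts, accuracy is 75/76, precision is 39/40, specificity is 36/37, and recall is exactly 1.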

Given the limited size of the greenhouse dataset (n = 379) relative to the capacity of EfficientNetV2-S (21.5 million parameters), a careful analysis of training dynamics is necessary to verify that the model learned generalizable features rather than memorising training samples. Figure 6 presents the training and validation loss and accuracy curves over all 60 epochs.

Three key observations from these curves collectively indicate that the model does not suffer from severe overfitting:

Validation loss converges below training loss throughout training.

The validation loss decreases from approximately 0.63 at epoch 1 to 0.027 at epoch 60, while the training loss stabilises around 0.062. This behaviour, where validation loss remains consistently lower than training loss after epoch 20, is contrary to the classical pattern of overfitting. It can be explained by two regularisation mechanisms applied exclusively during training: Dropout (rate = 0.3), which randomly deactivates 30% of neurons at each forward pass but is disabled during inference; and on-the-fly data augmentation (random flips, rotations, brightness, and contrast variations), which artificially increases the diversity of training samples. Since neither mechanism is applied to the validation set, the validation objective becomes structurally easier.

Both loss curves converge stably without divergence.

From approximately epoch 20 onward, both training and validation losses decrease smoothly and plateau at low values. The typical signature of overfitting, in which validation loss rises while training loss continues to fall, is not observed. The validation accuracy reaches 98.7% at epoch 60 (training accuracy: 98.0%), confirming stable convergence behaviour.

Transfer learning from ImageNet effectively mitigates the small-sample problem.

EfficientNetV2-S was initialised with ImageNet-pretrained weights. Only the last 40 layers were fine-tuned, while all earlier layers were frozen. Consequently, the number of trainable parameters learned from the 379 greenhouse samples represents only a small fraction of the 21.5 million total parameters. The frozen backbone provides generic visual representations (edges, textures, colour gradients, and shapes) that transfer effectively to plant stress detection, requiring only the final layers to adapt to barley-specific morphological patterns.
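The partial fine-tuning rule can be sketched as a helper that works on any sequence of objects exposing a `trainable` attribute, as Keras layers do. Whether the 40 layers are counted within the backbone or within the full model is not specified here, so the sketch assumes the backbone; the helper name is hypothetical.

```python
from types import SimpleNamespace


def freeze_all_but_last(layers, n_trainable=40):
    """Freeze every layer except the last n_trainable; return the trainable count."""
    cutoff = max(len(layers) - n_trainable, 0)
    for i, layer in enumerate(layers):
        layer.trainable = i >= cutoff
    return sum(1 for layer in layers if layer.trainable)


# Stand-in objects so the sketch runs without TensorFlow.
dummy_layers = [SimpleNamespace(trainable=True) for _ in range(100)]
n_open = freeze_all_but_last(dummy_layers, n_trainable=40)

# With Keras, the same helper would be applied to the backbone, e.g.:
#   backbone = tf.keras.applications.EfficientNetV2S(include_top=False, weights="imagenet")
#   freeze_all_but_last(backbone.layers, n_trainable=40)
```

After the call, only the final 40 entries remain trainable, which mirrors the setting described above.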

Taken together, pretrained transfer learning, data augmentation, and dropout regularisation form a robust strategy against overfitting in the small-data regime. The training curves confirm the effectiveness of these mechanisms: the model exhibits stable convergence without signs of memorisation, achieving a final validation accuracy of 98.7% with 100% recall on the 76-sample validation set.

3.2. K-Fold Cross-Validation Results

To provide a robust estimate of generalisation performance and to verify that the results are not dependent on a favourable random split, 5-fold stratified cross-validation was performed on the full greenhouse dataset (n = 379).

This protocol divides the dataset into five folds of approximately equal size, trains and evaluates the model five times, each time using a different fold as the validation set, and reports the mean and standard deviation across folds. Table 3 reports comprehensive metrics for each fold, and Table 4 summarizes the training time per fold. Figure 7, Figure 8 and Figure 9 provide visual summaries of the cross-validation behaviour.
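A stratified split of this kind can be sketched without external libraries. The round-robin assignment and the seed value are assumptions made for illustration and need not match the implementation used in the study; the class counts (190 normal, 189 stressed) are those of the greenhouse dataset.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k=5, seed=42):
    """Yield (train_idx, val_idx) pairs that preserve class proportions per fold."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for pos, idx in enumerate(idxs):
            folds[pos % k].append(idx)  # deal each class round-robin across folds
    for i in range(k):
        val = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, val


labels = [0] * 190 + [1] * 189  # 0 = normal, 1 = stressed
splits = list(stratified_kfold(labels))
```

Because 379 is not divisible by five, four validation folds contain 76 samples and one contains 75, each with 37 or 38 samples per class.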

The cross-validation results show high and relatively stable performance across all data partitions.

The mean accuracy is 97.6 ± 2.2%, with fold accuracies ranging from 94.7% to 100.0%, corresponding to a 5.3 percentage point variation. The model achieves a high mean recall of 98.5 ± 2.3%, with perfect recall in Folds 1, 2, and 5, indicating strong sensitivity in detecting stressed plants. Mean specificity is 96.8 ± 3.5%, although some variability is observed, particularly in Fold 5. The mean F1-score of 97.7 ± 2.1% confirms balanced performance across classes.

Overall, the small standard deviations across folds (between ±2.1% and ±3.5%, depending on the metric) support the robustness of the proposed approach under greenhouse conditions.

Total training time for 5-fold cross-validation (Figure 10) was 153.1 min (2.55 h), which demonstrates that training the model is feasible on CPU-only hardware.

3.3. Ablation Study Results

To assess the contribution of individual architectural components, we conducted systematic ablation studies comparing five model variants. Each variant was trained with ten different random seeds (42, 123, 456, 789, 2024, 2025, 2026, 2027, 2028, 2029), and Table 5 reports the mean performance and standard deviation of every configuration across these seeds. A one-way ANOVA confirmed that architectural configuration is a significant source of performance variation (F(4,36) = 4.44, p = 0.005, partial η² = 0.33), indicating that the observed ranking is not driven by random seed variation alone. A Sensor-Only MLP baseline (temperature and humidity features only) is also reported to quantify the standalone contribution of the sensor branch.

3.3.1. Component-Wise Ablation Results

The ablation results are summarized as follows:

Multimodal Fusion Value:

The fine-tuned multimodal configurations (Full Model, No-Gating, and Concat Fusion) outperform the Image-Only baseline (97.4 ± 1.8%) in mean accuracy, indicating that sensor information provides complementary cues to visual features. The Sensor-Only MLP achieves substantially lower performance (73.8 ± 3.5% accuracy, 67.0 ± 3.5% F1-score), indicating that temperature and humidity alone are insufficient for reliable classification and that the visual branch is the dominant contributor to performance. The combination of both modalities improves accuracy, specificity, and AUC across all seeds, demonstrating the complementary nature of the two information sources.

Gated Fusion Analysis:

Comparing the Full Model (98.3 ± 1.5%) to No-Gating (99.1 ± 0.6%) and Concat Fusion (99.3 ± 0.7%), no statistically significant advantage is observed in pairwise comparisons across 10 seeds. The one-way ANOVA (F(4,36) = 4.44, p = 0.005) indicates that at least one configuration differs significantly from the others; however, post hoc comparisons among multimodal variants do not reach statistical significance after Bonferroni correction. This suggests that simpler fusion strategies are sufficient to capture most of the multimodal information under the current experimental conditions.

This outcome is consistent with the characteristics of the greenhouse dataset, which is relatively small (n = 379) and exhibits limited environmental variability. In such controlled settings, static fusion mechanisms such as concatenation can approximate effective feature integration, thereby reducing observable differences between fusion strategies.

In contrast, the gated cross-modal attention mechanism introduces a data-dependent and sample-specific weighting scheme that adaptively modulates the contribution of visual and sensor features according to their relative informativeness. This enables the model to handle modality imbalance, noise, or partial degradation more effectively than fixed fusion strategies.

Beyond performance, the gating mechanism also improves interpretability by explicitly exposing modality importance at the sample level, allowing post hoc analysis of how environmental and visual signals contribute to each prediction. This is particularly relevant in precision agriculture applications where understanding decision drivers is as important as prediction accuracy.

Therefore, although no statistically significant improvement is observed in this controlled dataset, the gated fusion strategy remains theoretically motivated and is expected to generalise better to heterogeneous data, a hypothesis that the present dataset cannot test. The Full Model is retained for its adaptive fusion capability, improved interpretability, and expected robustness in more heterogeneous, real-world conditions.
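The idea of a sample-specific gate can be illustrated with a deliberately simplified sketch. It is a generic form of gated fusion and not the exact formulation of the proposed architecture: the scalar gate, the weight vectors, and the assumption that both feature vectors already share a common dimension are all illustrative choices.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_fusion(visual, sensor, w_visual, w_sensor, bias=0.0):
    """Blend two equal-length feature vectors using a scalar, input-dependent gate.

    In a trained model the weights would be learned; here they are passed in
    explicitly as hypothetical values.
    """
    z = bias
    z += sum(w * v for w, v in zip(w_visual, visual))
    z += sum(w * s for w, s in zip(w_sensor, sensor))
    gate = sigmoid(z)  # near 1: rely on visual features; near 0: rely on sensor features
    fused = [gate * v + (1.0 - gate) * s for v, s in zip(visual, sensor)]
    return fused, gate


# With zero weights the gate is 0.5 and the two modalities are averaged.
fused, gate = gated_fusion([1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [0.0, 0.0])
```

Because the gate value is computed per sample, it can be logged alongside each prediction, which is the basis of the interpretability argument made above.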

Frozen Backbone:

The Frozen Backbone variant (97.0 ± 2.5%) achieves the lowest mean accuracy among multimodal configurations and shows higher variability, confirming that fine-tuning the EfficientNetV2-S backbone is essential for optimal performance. This result validates the use of selective partial fine-tuning in the proposed architecture.

Stability:

The fine-tuned multimodal configurations (Full Model, No-Gating, and Concat Fusion) achieve standard deviations of at most ±1.5% across 10 seeds, indicating stable training behavior. The Image-Only (±1.8%) and Frozen Backbone (±2.5%) variants show slightly higher variability, suggesting that both multimodal fusion and backbone fine-tuning contribute to improved robustness. The ANOVA partial η² = 0.33 confirms a medium-to-large effect of architectural configuration on performance.

Figure 11 visualizes the comparative performance of the five ablation configurations across the six evaluation metrics using a radar chart.

3.3.2. Statistical Analysis

To quantify the effect of architectural design on model performance, a one-way ANOVA was performed on the 10-seed results across the five ablation configurations (Table 6). The analysis reveals a statistically significant effect of architectural configuration on both accuracy (F(4,36) = 4.44, p = 0.005, η² = 0.33) and F1-score (F(4,36) = 4.10, p = 0.008, η² = 0.31). These results indicate a moderate-to-large effect size, suggesting that approximately 30% of the variance in performance can be attributed to architectural differences rather than random initialization.
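The reported effect sizes can be cross-checked against the F statistics using the standard identity partial η² = (F · df_effect) / (F · df_effect + df_error). The sketch below applies it to the values quoted above. An error term with 36 degrees of freedom equals (5 − 1)(10 − 1), which would be consistent with treating the seed as a blocking factor, although the text does not state this explicitly.

```python
def partial_eta_squared(f_stat, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    return (f_stat * df_effect) / (f_stat * df_effect + df_error)


eta_accuracy = partial_eta_squared(4.44, 4, 36)  # accuracy
eta_f1 = partial_eta_squared(4.10, 4, 36)        # F1-score
```

Rounded to two decimals, these evaluate to 0.33 and 0.31, matching the reported effect sizes.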

Post hoc analysis indicates that the primary source of variation is driven by the Sensor-Only and Frozen Backbone configurations, while differences among the multimodal fusion variants (Full Model, Concat Fusion, and No-Gating) are not statistically significant after correction. This suggests that while multimodal learning is beneficial overall, the specific choice of fusion strategy has a limited impact under the current dataset conditions.
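The Bonferroni correction used in the post hoc step can be sketched as follows. The family of comparisons the authors corrected over is not stated, so the example assumes all pairwise comparisons among five configurations; the p-values in the second part are made-up placeholders and not results from this study.

```python
from math import comb


def bonferroni_threshold(alpha, n_comparisons):
    """Per-comparison significance threshold under Bonferroni correction."""
    return alpha / n_comparisons


def significant_after_bonferroni(p_values, alpha=0.05):
    """Flag which comparisons stay significant once the threshold is corrected."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return {name: p < threshold for name, p in p_values.items()}


# All pairwise comparisons among five configurations (assumed family).
n_pairs = comb(5, 2)
threshold = bonferroni_threshold(0.05, n_pairs)

# Placeholder p-values for three unnamed pairs, for illustration only.
flags = significant_after_bonferroni({"pair_a": 0.04, "pair_b": 0.20, "pair_c": 0.0001})
```

Under the assumed family of ten comparisons, a pairwise p-value must fall below 0.005 to remain significant, which illustrates why differences that look notable in isolation may not survive correction.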

3.4. Baseline Comparison

To evaluate the effectiveness of the proposed approach, we compare it against classical machine learning methods and shallow deep learning architectures on the greenhouse dataset (379 samples) using an 80/20 train–test split.

The baselines include Logistic Regression, Random Forest, and a 3-layer shallow CNN. All classical baselines are evaluated using three random seeds (42, 123, 456) to ensure robustness.

In addition, AgriFusionNet [22] is re-implemented on our dataset and evaluated using its original 5-fold cross-validation protocol, in order to remain consistent with the evaluation setting reported in the original study. All models are assessed under identical preprocessing and input configurations, including image-only, sensor-only, and multimodal settings where applicable. Performance is measured using accuracy, F1-score, and AUC. Table 7 summarizes all comparative results.

The results in Table 7 highlight two key comparative dimensions that validate the effectiveness of the proposed architecture.

First, classical baselines (Logistic Regression, Random Forest, and shallow CNN) are consistently and substantially outperformed by the proposed Full Model (98.3 ± 1.5%) across all evaluation metrics. This performance gap is systematic across all input modalities, indicating a structural limitation of shallow models rather than a training artifact.

Logistic Regression and Random Forest achieve relatively strong results in image-only and multimodal settings; however, their performance degrades significantly in sensor-only configurations. This behavior reflects their reliance on linear or weakly nonlinear decision boundaries, which are insufficient to model the complex interactions between temperature and humidity dynamics in greenhouse environments. Although ensemble-based methods such as Random Forest partially mitigate this limitation, they remain constrained by assumptions of weak feature dependencies and the absence of hierarchical representation learning.

The shallow CNN improves image feature extraction compared to classical methods, confirming the benefit of learned spatial representations. However, its limited depth restricts its ability to capture high-level semantic abstractions, resulting in suboptimal fusion compared to the proposed deep architecture. In contrast, the proposed model benefits from deep transfer learning (EfficientNetV2-S), which provides rich hierarchical feature representations pre-trained on large-scale datasets, enabling stronger generalization even under limited training data conditions.

Second, the comparison with AgriFusionNet highlights the superiority of the proposed fusion strategy over existing lightweight multimodal agricultural models. Despite being evaluated under its original 5-fold cross-validation protocol, AgriFusionNet achieves substantially lower performance (75.2% accuracy) on our dataset, indicating limited robustness under greenhouse-specific domain conditions. This performance degradation suggests that its fusion mechanism is not sufficiently adaptive to heterogeneous modality interactions or dataset shift.

In contrast, the proposed model achieves 98.3% accuracy on the same data. Because the ablation study found no statistically significant difference among the fusion variants, this gap cannot be attributed to gating alone; the pretrained EfficientNetV2-S backbone and the mismatch between the original domain of AgriFusionNet and our dataset are likely the main contributors. The gated multimodal fusion mechanism complements the backbone by adaptively weighting image and sensor contributions according to contextual relevance, which allows the model to suppress noisy modalities while emphasizing informative signals.

Overall, these results demonstrate that the proposed architecture achieves superior performance due to three synergistic factors: strong hierarchical feature extraction via transfer learning, adaptive gated multimodal fusion, and improved robustness under limited-data conditions. This combination enables consistent gains over both classical baselines and recent state-of-the-art multimodal agricultural models.

3.5. Results on PlantVillage Dataset

To evaluate generalization to different plant species and stress types, we tested the greenhouse-trained model on the PlantVillage dataset reformulated as binary classification (healthy vs. diseased). Table 8 describes dataset characteristics and Table 9 presents performance metrics.

The following tables present comprehensive performance metrics from the optimized agricultural stress detection model. Results are based on 5-fold cross-validation with a total test set of 2108 samples.

To assess the stability of the training process across different random initializations, the model was trained under three seeds (42, 123, and 456). Figure 12 illustrates the training and validation loss and accuracy curves for each run.

The model achieved an accuracy of 99.97% on the PlantVillage dataset, which is noticeably higher than the performance obtained on the greenhouse dataset (98.3%). This result can be attributed to several factors:

Domain Characteristics: PlantVillage images have uniform white backgrounds and controlled lighting, which reduces visual complexity compared to greenhouse images with natural backgrounds and variable illumination.
Disease Severity: Many PlantVillage disease examples show advanced symptoms (severe lesions, extensive discoloration) that are more visually distinctive than the subtle early-stage drought stress in our greenhouse dataset.
Feature Transferability: EfficientNetV2 learns general visual patterns related to plant stress such as changes in leaf color and texture that transfer effectively across species and stress types. This improves its performance in different environments.
Binary Simplification: Combining 38 disease classes into a single ‘diseased’ category reduced task complexity.

This performance across datasets confirms the model’s ability to generalize and detect different plant stress types beyond barley drought.

3.6. Cross-Dataset Performance Summary

Table 10 summarizes the model’s performance on both datasets. It demonstrates consistent results across different conditions.

Both datasets show strong performance, with the greenhouse barley dataset achieving 98.3% accuracy and the PlantVillage binary dataset reaching 99.97%. This confirms the model’s effectiveness for plant stress detection across different conditions. The performance gap reflects differences in dataset characteristics (controlled imaging conditions and more visually distinctive symptoms in PlantVillage) rather than any limitation of the model. Key observations include:

Robust Generalization: Although trained only on barley drought stress, the model achieves near-perfect classification on multi-species disease detection. This indicates that the learned features transfer effectively across different stress types.
Consistently High Recall: Recall exceeds 98% on both datasets, which is important in agriculture, where failing to detect stressed plants can have greater consequences than occasional false alarms.
Stable Performance: Cross-validation standard deviations of ±2–3% on the greenhouse dataset and below ±0.2% on PlantVillage indicate low variance across runs. This suggests that the model is robust to both training sample variability and random initialization.

3.7. Computational Efficiency Analysis

One important contribution of this work is to show that high-accuracy multimodal plant stress detection can be achieved on CPU-only hardware through systematic optimization. Table 11, Table 12 and Table 13 quantify the impact of computational optimizations.

The optimization results demonstrate significant computational efficiency gains (Figure 13):

Training Acceleration: CPU-specific optimizations reduced the training time per epoch from 28.9 to 9.1 min (68.5% reduction), making model development practical on consumer hardware without GPU access. The drop in CPU usage (92% → 34%) indicates more efficient resource utilization, avoiding thread oversubscription and cache issues.
Memory Efficiency: RAM usage decreased by about 23% (6.4 GB → 4.9 GB) through optimized data pipelines and batch size tuning. This allows training on machines with 8 GB of RAM, such as those used as agricultural IoT edge devices.
Model Compression: INT8 quantization achieved a 90% size reduction (150 MB → 14.9 MB) with minimal accuracy loss (97.6% → 97.3%, a drop of 0.3 percentage points). The compact model fits comfortably in the flash storage of microcontrollers and embedded systems, which enables deployment on edge devices (Figure 14).

These results suggest the proposed approach works well for deployment in real-world precision agriculture applications.
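The relative gains listed above can be recomputed from the before and after figures with a few lines of Python. The dictionary keys are illustrative labels; the numbers are those quoted in this section.

```python
def percent_reduction(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


reported = {
    "epoch time (min)": (28.9, 9.1),
    "RAM usage (GB)": (6.4, 4.9),
    "model size (MB)": (150.0, 14.9),
}

reductions = {name: percent_reduction(b, a) for name, (b, a) in reported.items()}
for name, value in reductions.items():
    print(f"{name}: {value:.1f}% reduction")
```

With the quoted figures this prints 68.5%, 23.4%, and 90.1%, respectively.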

4. Discussion

The results show that the integration of visual and environmental data improves plant stress detection. In the ablation study over 10 random seeds, combining temperature and humidity sensor data with RGB images increased mean accuracy by 0.9 percentage points (98.3 ± 1.5% vs. 97.4 ± 1.8% image-only). The sensor-only MLP achieved only 73.8 ± 3.5%, confirming that sensor data alone is insufficient and that the visual branch is the primary driver of performance.

This improvement can be attributed to the complementary information provided by visual and environmental data. RGB images reflect visible stress symptoms, such as leaf wilting and color changes. However, these symptoms generally appear only after stress has already induced internal physiological alterations in the plant. From a physiological standpoint, high temperature and low humidity increase atmospheric evaporative demand, leading to stomatal closure as an early stress response [50] that occurs before any visible color change is detectable. Environmental sensors that quantify these conditions therefore provide earlier indicators of stress than visual imagery-based approaches [51], enabling detection prior to the appearance of visible symptoms.

The integration of these two modalities enables earlier and more reliable plant stress detection.

To better position the proposed approach within the broader context of plant stress monitoring, it is important to distinguish between direct physiological measurements and proxy-based environmental indicators. Physiological measurements such as leaf water potential and stomatal conductance are widely recognized as accurate indicators of plant water status [52]. Thermal-based indices such as the Crop Water Stress Index (CWSI) provide additional drought-related information derived from canopy temperature [53]. However, direct physiological measurements such as stomatal conductance, while accurate, are labour-intensive and unsuitable for automation [54], leading practical systems to rely on alternative sensing approaches combined with visual observations in multimodal frameworks.

The use of temperature and relative humidity as indirect indicators of plant water stress is supported by well-established physiological mechanisms. Under drought conditions, reduced water availability leads to stomatal closure, which decreases transpiration and results in increased leaf temperature. This principle forms the basis of the Crop Water Stress Index. In parallel, transpiration influences the humidity of the leaf boundary layer, which is linked to plant water status and gas exchange dynamics. Although the present study relies on low-cost sensors that measure environmental conditions rather than direct physiological variables, these measurements provide meaningful contextual information reflecting plant–environment interactions under water deficit conditions.
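One standard way to express atmospheric evaporative demand from exactly these two readings is the vapour pressure deficit (VPD). The sketch below uses the Tetens approximation in its FAO-56 form; it is offered as an illustration of how temperature and relative humidity jointly encode drying conditions, and it is not stated that the proposed model computes VPD explicitly.

```python
import math


def saturation_vapor_pressure_kpa(temp_c):
    """Tetens approximation of saturation vapour pressure (kPa) at temp_c in Celsius."""
    return 0.6108 * math.exp(17.27 * temp_c / (temp_c + 237.3))


def vapor_pressure_deficit_kpa(temp_c, rh_percent):
    """Vapour pressure deficit (kPa) from air temperature and relative humidity."""
    return saturation_vapor_pressure_kpa(temp_c) * (1.0 - rh_percent / 100.0)


# Hypothetical readings: a hot, dry condition versus a cool, humid one.
vpd_hot_dry = vapor_pressure_deficit_kpa(35.0, 30.0)
vpd_cool_humid = vapor_pressure_deficit_kpa(20.0, 80.0)
```

The hot, dry reading yields a much larger deficit than the cool, humid one, which is the sense in which high temperature and low humidity raise evaporative demand.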

The attention mechanism allows the model to adaptively weigh each data modality. When stress symptoms are subtle, it assigns greater importance to sensor data, whereas for visible symptoms, it relies more on images. This adaptive strategy reflects the reasoning process of an experienced agronomist. However, the extended 10-seed ablation study shows that the gated fusion mechanism does not yield a statistically significant accuracy improvement over simpler fusion strategies (No-Gating: 99.1 ± 0.6%; Concat Fusion: 99.3 ± 0.7%; Full Model: 98.3 ± 1.5%) on this 379-sample dataset. The gated approach is retained for its interpretability and potential benefits on larger, more heterogeneous datasets where adaptive modality weighting is expected to be more impactful.

4.1. Generalization Across Datasets

Although the model was trained using greenhouse barley images, it achieved 99.97% accuracy on the PlantVillage dataset, which contains more than 26,000 images from 14 different crop species with multiple disease types. This strong cross-dataset performance suggests that the model learned general patterns of plant stress rather than barley-specific features.

This generalization can be explained by multiple factors. First, EfficientNetV2, pretrained on ImageNet, provides strong general visual features that perform well in plant image analysis. Second, many stress symptoms such as chlorophyll degradation and cell damage present similar patterns across different crops and stress types. Third, the model retained high performance even with differences in background and lighting between datasets. This indicates that it targets plant features rather than image artifacts.

4.2. Comparison with Related Work

Table 14 compares the proposed approach with recent deep learning methods for plant disease detection.

Most existing studies in Table 14 rely exclusively on image-based models. For instance, hybrid CNN–Vision Transformer architectures proposed by Prashant S. Thakur et al. [55] (98.86%) and lightweight convolutional models such as MobileNetV3-Small by A. T. Khan et al. [56] (99.50%) achieve high classification accuracy. Similarly, transformer-based approaches including I. Pacal [57] (99.24%) and the PDLC-ViT model by S. Hemalatha [58] (99.97%) further improve performance through attention mechanisms. However, as indicated in the table, all these methods operate exclusively on visual inputs.

This reliance on image-only data introduces important limitations. High accuracies are typically reported on controlled datasets, where visual symptoms are clearly distinguishable. In such settings, models may implicitly learn dataset-specific features rather than robust disease representations. Moreover, image-based approaches cannot capture early physiological stress signals that are not yet visually observable, limiting their applicability for early detection and real-world deployment under varying environmental conditions.

In contrast, the AgriFusionNet model [22] demonstrates the impact of multimodal fusion combining RGB imagery with multispectral data and environmental sensor data, reporting 94.3% accuracy on its original multi-species, multi-disease dataset. When re-implemented on our single-species, single-stress 379-sample greenhouse dataset using 5-fold cross-validation, AgriFusionNet achieved 75.2 ± 5.2% mean accuracy, substantially lower than the proposed method (98.3 ± 1.5%). This gap reflects both the domain difference (AgriFusionNet was designed for multispectral + multi-disease data) and the benefit of the pretrained EfficientNetV2-S backbone used in our approach. Shallow baselines trained on the same dataset, namely Logistic Regression (92.1 ± 3.5%), Random Forest (89.9 ± 6.2%), and a 3-layer Shallow CNN (92.5 ± 3.0%), all perform substantially below our full model, confirming the added value of deep transfer learning even on small datasets.

The proposed method in our work adopts a multimodal framework by integrating visual plant features with sensor data via a gated cross-modal attention module. This design enables the model to learn interactions between visual plant traits and sensor data such as temperature and humidity, which indicate early physiological stress.

The model is also adapted for edge devices thanks to its compact size and INT8 quantization, with a memory footprint of about 14.9 MB while preserving high accuracy. As a result, it provides a well-rounded trade-off between accuracy, reliability, and efficiency for real-time plant stress detection in precision agriculture.

4.3. Limitations

Despite the strong performance of the proposed model, several limitations should be acknowledged. First, the annotation process relies on proxy-based indicators rather than direct physiological measurements such as leaf water potential or stomatal conductance. While the combination of visual symptoms and environmental context improves labeling consistency, it does not provide a definitive physiological characterization of drought stress.

Second, the use of temperature and humidity measurements introduces potential ambiguity, as similar environmental conditions may also be associated with other abiotic stresses, particularly heat stress. As a result, the proposed system identifies stress conditions consistent with drought but cannot fully distinguish between different stress types.

Third, the environmental measurements were obtained using a low-cost DHT11 sensor, which provides limited precision compared to scientific-grade instrumentation. Although this choice aligns with the objective of developing a cost-effective and deployable system, it may affect the accuracy of environmental characterization.

Additionally, several key environmental factors were not recorded or controlled, including soil type, light intensity, ventilation or airflow conditions, and precise irrigation levels. These variables are known to significantly influence plant responses to drought stress, and their absence limits both the interpretability and reproducibility of the experimental setup.

In the same context, all data were collected from a single greenhouse environment without replication across seasons, locations, or different plant batches. This restricts the variability captured in the dataset and limits the assessment of model robustness under diverse environmental and agronomic conditions.

In addition to these aspects, several methodological and experimental constraints further delimit the scope of the findings. The dataset comprises 379 samples collected from a single greenhouse over a limited time period. Although this scale is comparable to exploratory studies in precision agriculture, it remains relatively small for training a deep architecture such as EfficientNetV2-S. Measures including transfer learning, data augmentation, dropout regularization, and multi-seed evaluation were implemented to mitigate overfitting; however, the dataset does not fully capture variability in plant development stages, environmental dynamics, or inter-plant heterogeneity encountered in real-world deployments. Moreover, the single-environment setting prevents evaluation across different climatic conditions, greenhouse configurations, or geographical contexts.

The proxy-based annotation protocol, while more robust than purely visual labeling, does not provide physiologically validated ground truth. Direct measurements such as leaf water potential or stomatal conductance would enable a more precise quantification of drought stress severity. Consequently, class boundaries may remain approximate, particularly in borderline cases where stress symptoms are not yet fully expressed.

The environmental sensing setup introduces additional limitations. The DHT11 sensor provides moderate accuracy, which is sufficient for capturing general environmental trends but may not detect subtle transitions near stress thresholds.

Furthermore, the absence of calibration against reference instruments and the lack of detailed documentation of greenhouse conditions—such as irrigation regimes, light intensity, or substrate properties—limit experimental reproducibility and environmental interpretability.

From a modeling perspective, the ablation study indicates that the proposed gated fusion mechanism does not yield statistically significant improvements over simpler fusion strategies under the current experimental conditions. This suggests that, in relatively homogeneous environments, simpler multimodal fusion approaches may already be sufficient, while the benefits of adaptive gating may only emerge in more complex or heterogeneous datasets.

Finally, the framework is restricted to binary classification between healthy and drought-stressed plants. It does not address multi-stress scenarios or provide continuous stress severity estimation, both of which are important for practical precision agriculture applications. In addition, the absence of temporal modeling prevents the system from capturing the progressive nature of drought stress, limiting its ability to detect early-stage stress dynamics.

5. Conclusions

This study presents a multimodal deep learning framework for real-time drought stress detection in barley that is both accurate and computationally efficient. The proposed architecture combines EfficientNetV2-S visual features with temperature and humidity data through cross-modal attention and gated fusion, allowing adaptive integration of visual and sensor inputs in a form suitable for edge deployment. On the 379-sample greenhouse dataset (80/20 train–test split), the model achieved 98.3 ± 1.5% mean accuracy across 10 random seeds, with the best individual seed reaching 100% on all metrics. An extended ablation study with ANOVA (F(4,36) = 4.44, p = 0.005) indicates significant differences among configurations, with the full multimodal model (98.3%) exceeding the image-only (97.4%) and sensor-only (73.8%) baselines. Gated fusion, however, was not significantly better than simpler fusion strategies. External validation on the PlantVillage dataset yielded 99.97% accuracy, suggesting that the model transfers to healthy-versus-diseased classification across other plant species. Systematic CPU optimization reduced training time by 68.5%, while INT8 quantization decreased model size by 90% with minimal accuracy loss, enabling deployment on embedded devices. The overall system relies on affordable sensors and is designed to integrate with existing farm management practices.

In future work, we will explore temporal modeling with sequential networks to detect stress before visual symptoms appear, and we will extend the framework to classify multiple stress types, such as drought, heat, nutrient deficiency, and disease. Field validation under diverse conditions remains essential to establish robustness and adaptability. Explainable AI techniques could improve interpretability and increase trust among agronomists, and federated learning could enable collaborative model improvement across farms while preserving privacy. Predicting continuous stress severity and integrating additional sensors, such as multispectral, thermal, LiDAR, or soil sensors, should further improve monitoring precision. Overall, this work demonstrates that multimodal deep learning can be applied effectively in precision agriculture by combining affordable sensors, efficient edge AI, and transfer learning to support sustainable crop management under changing environmental conditions.

Author Contributions

Conceptualization, all authors; methodology, R.B.; software, R.B.; validation, R.B., A.G. and N.S.; formal analysis, R.B., D.B.A., A.G. and N.S.; investigation, R.B.; resources, C.E.; data curation, R.B., A.G. and C.E.; writing—original draft preparation, R.B.; writing—review and editing, D.B.A., A.G., N.S. and C.E.; visualization, R.B.; supervision, D.B.A., N.S. and C.E.; project administration, D.B.A. and C.E. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available as they were collected in a private greenhouse facility and are intended for future research use.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

AI	Artificial Intelligence
API	Application Programming Interface
AUC	Area Under the Curve
AUTOTUNE	Automatic Tuning (TensorFlow tf.data parallelism parameter)
AVX-512	Advanced Vector Extensions 512-bit (Intel CPU instruction set)
BPNN	Back-Propagation Neural Network
CBAM	Convolutional Block Attention Module
CNN	Convolutional Neural Network
CPU	Central Processing Unit
CWSI	Crop Water Stress Index
DHT11	Digital Humidity and Temperature Sensor (model DHT11)
EfficientNet	Efficient Neural Network (compound scaling architecture)
EfficientNetV2	Efficient Neural Network Version 2
EfficientNetV2-S	Efficient Neural Network Version 2—Small variant
ESP32	Espressif Systems 32-bit Microcontroller
FN	False Negative
FP	False Positive
GAP	Global Average Pooling
GB	Gigabyte
GPU	Graphics Processing Unit
INT8	8-bit Integer (quantization format)
IoT	Internet of Things
JIT	Just-In-Time (compilation)
MB	Megabyte
MLP	Multilayer Perceptron
MobileNetV2	Mobile Neural Network Version 2
MRI	Magnetic Resonance Imaging
NCHW	Batch × Channels × Height × Width (data layout format)
NHWC	Batch × Height × Width × Channels (data layout format)
PET	Positron Emission Tomography
RAM	Random Access Memory
ReLU	Rectified Linear Unit
ResNet	Residual Network
RGB	Red, Green, Blue (color image format)
ROC	Receiver Operating Characteristic
SVM	Support Vector Machine
TFLite	TensorFlow Lite (lightweight inference framework)
tf.data	TensorFlow Data Pipeline API
TN	True Negative
TP	True Positive
UAV	Unmanned Aerial Vehicle
VGGNet	Visual Geometry Group Network
XLA	Accelerated Linear Algebra (TensorFlow compiler)

References

1. Lukinac, J.; Jukić, M. Barley in the production of cereal-based products. Plants 2022, 11, 3519.
2. Kaur, A.; Purewal, S.S.; Phimolsiripol, Y.; Punia Bangar, S. Unraveling the hidden potential of barley (Hordeum vulgare): An important review. Plants 2024, 13, 2421.
3. Benito-Verdugo, P.; Martínez-Fernández, J.; González-Zamora, Á.; Almendra-Martín, L.; Gaona, J.; Herrero-Jiménez, C.M. Impact of agricultural drought on barley and wheat yield: A comparative case study of Spain and Germany. Agriculture 2023, 13, 2111.
4. Zhang, L.; Bai, G.; Evett, S.R.; Colaizzi, P.D.; Xue, Q.; Marek, G.; Dhungel, R.; Zhao, H.; Wan, N.; Lin, X. Increased Irrigation Could Mitigate Future Warming-Induced Maize Yield Losses in the Ogallala Aquifer. Commun. Earth Environ. 2025, 6, 483.
5. Zarbakhsh, S.; Fakhrzad, F.; Rajkovic, D.; Niedbała, G.; Piekutowska, M. Approaches and challenges in machine learning for monitoring agricultural products and predicting plant physiological responses to biotic and abiotic stresses. Curr. Plant Biol. 2025, 43, 100535.
6. Sharma, N.; Sharma, P.; Kumar, N. Feature Engineering to Early Detection of Plant Disease Using Image Processing and Artificial Intelligence: A Comparative Analysis. Int. J. Latest Technol. Eng. Manag. Appl. Sci. 2025, 14, 1107–1113.
7. Salka, T.D.; Hanafi, M.B.; Rahman, S.M.S.A.A.; Zulperi, D.B.M.; Omar, Z. Plant Leaf Disease Detection and Classification Using Convolution Neural Networks Model: A Review. Artif. Intell. Rev. 2025, 58, 322.
8. Shafik, W.; Tufail, A.; De Silva Liyanage, C.; Apong, R.A.A.H.M. Using transfer learning-based plant disease classification and detection for sustainable agriculture. BMC Plant Biol. 2024, 24, 136.
9. Duhan, S.; Gulia, P.; Gill, N.S.; Narwal, E. RTR_Lite_MobileNetV2: A lightweight and efficient model for plant disease detection and classification. Curr. Plant Biol. 2025, 42, 100459.
10. Tan, M.; Le, Q.V. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019.
11. Sapes, G.; Schroeder, L.; Scott, A.; Clark, I.; Juzwik, J.; Montgomery, R.A.; Guzmán, Q.J.A.; Cavender-Bares, J. Mechanistic links between physiology and spectral reflectance enable previsual detection of oak wilt and drought stress. Proc. Natl. Acad. Sci. USA 2024, 121, e2316164121.
12. Paul, N.; Sunil, G.C.; Horvath, D.; Sun, X. Deep learning for plant stress detection: A comprehensive review of technologies, challenges, and future directions. Comput. Electron. Agric. 2025, 229, 109734.
13. Paulo, R.L.D.; Garcia, A.P.; Umezu, C.K.; Camargo, A.P.D.; Soares, F.T.; Albiero, D. Water stress index detection using a low-cost infrared sensor and excess green image processing. Sensors 2023, 23, 1318.
14. Hou, X.X.; Liu, Y.; Zhang, X.; Ma, Q.; Shang, G. Spatiotemporal dynamics and drivers of agricultural drought in the Huang-Huai-Hai Plain based on crop water stress index and spatial machine learning. Remote Sens. 2025, 17, 3678.
15. Kalaivani, T.S.; Kamireddy, T.; Govindakumar, S. IoT-Enabled Soil and Crop Monitoring System Using Low-Cost Smart Sensors for Precision Agriculture. Eng. Proc. 2025, 118, 77.
16. Meriç, M.K. Implementation of a Wireless Sensor Network for Irrigation Management in Drip Irrigation Systems. Sci. Rep. 2025, 15, 14157.
17. Liu, X.; Zhao, Z.; Rezaeipanah, A. Intelligent and Automatic Irrigation System Based on Internet of Things Using Fuzzy Control Technology. Sci. Rep. 2025, 15, 14577.
18. Abdelmoneim, A.A.; Kimaita, H.N.; Al Kalaany, C.M.; Derardja, B.; Dragonetti, G.; Khadra, R. IoT sensing for advanced irrigation management: A systematic review of trends, challenges, and future prospects. Sensors 2025, 25, 2291.
19. Abebe, A.M.; Kim, Y.; Kim, J.; Kim, S.L.; Baek, J. Image-based high-throughput phenotyping in horticultural crops. Plants 2023, 12, 2061.
20. Yan, Y.; Li, Y.; Jia, S.; Bai, Y.; Cao, B.; Mashori, A.S.; Zhang, W. Integration of UAV multispectral and meteorological data to improve maize yield prediction accuracy. Agronomy 2026, 16, 163.
21. Sameera, V.; Meghana, H.; Ameer, P.; Parayangat, M.; Abbas, M. Transformers for Multi-Modal Image Analysis in Healthcare. Comput. Mater. Contin. 2025, 84, 4259–4297.
22. Albahli, S. AgriFusionNet: A lightweight deep learning model for multisource plant disease diagnosis. Agriculture 2025, 15, 1523.
23. Wang, X.; Yan, F.; Li, B.; Yu, B.; Zhou, X.; Tang, X.; Lv, C. A multimodal data fusion and embedding attention mechanism-based method for eggplant disease detection. Plants 2025, 14, 786.
24. Madiwal, A.S.; Jha, R.B.; Barthakur, P. Edge AI and IoT for Real-Time Crop Disease Detection: A Survey of Trends, Architectures, and Challenges. Int. J. Res. Innov. Appl. Sci. 2025, 10, 1011–1027.
25. Liang, T.; Glossner, J.; Wang, L.; Shi, S.; Zhang, X. Pruning and quantization for deep neural network acceleration: A survey. Neurocomputing 2021, 461, 370–403.
26. Lee, H.; Lee, N.; Lee, S. A method of deep learning model optimization for image classification on edge device. Sensors 2022, 22, 7344.
27. Ghofrani, A.; Mahdian Toroghi, R. Knowledge Distillation in Plant Disease Recognition. Neural Comput. Appl. 2022, 34, 14287–14296.
28. Huang, Q.; Wu, X.; Wang, Q.; Dong, X.; Qin, Y.; Wu, X.; Hao, G. Knowledge distillation facilitates the lightweight and efficient plant diseases detection model. Plant Phenomics 2023, 5, 0062.
29. George, R.; Thuseethan, S.; Ragel, R.G.; Mahendrakumaran, K.; Nimishan, S.; Wimalasooriya, C.; Alazab, M. Past, present and future of deep plant leaf disease recognition: A survey. Comput. Electron. Agric. 2025, 234, 110128.
30. Shorten, C.; Khoshgoftaar, T.M. A survey on image data augmentation for deep learning. J. Big Data 2019, 6, 60.
31. Sebastianelli, A.; Nowakowski, A.; Di Cosmo, G.; Spiller, D.; Vitulli, R.; Mathieu, P.P.; Ullo, S.L. Data augmentation in classification and segmentation: A survey and new strategies. J. Imaging 2023, 9, 46.
32. Sivamani, S.; Chon, S.I.; Choi, D.Y.; Lee, D.H.; Park, J.H. Importance of Adaptive Photometric Augmentation for Different Convolutional Neural Networks. Comput. Mater. Contin. 2022, 73, 4433.
33. Nayak, S.C.; Nayak, S.; Patel, S. A Review on Data Preprocessing Techniques for Machine Learning: Z-Score Normalization and Feature Scaling. Eng. Appl. Artif. Intell. 2023, 126, 107022.
34. Singh, D.; Singh, B. Investigating the Impact of Data Normalization on Classification Performance. Appl. Soft Comput. 2020, 97, 105524.
35. Szeghalmy, S.; Fazekas, A. A comparative study of the use of stratified cross-validation and distribution-balanced stratified cross-validation in imbalanced learning. Sensors 2023, 23, 2333.
36. Johnson, J.M.; Khoshgoftaar, T.M. Survey on deep learning with class imbalance. J. Big Data 2019, 6, 27.
37. Murray, D.G.; Simsa, J.; Klimovic, A.; Indyk, I. tf.data: A Machine Learning Data Processing Framework. Proc. VLDB Endow. 2021, 14, 2945–2958.
38. He, X. Accelerated Linear Algebra Compiler for Computationally Efficient Numerical Models: Success and Potential Area of Improvement. PLoS ONE 2023, 18, e0282265.
39. Torres-Hernández, M.A.; Escobedo-Barajas, M.H.; Guerrero-Osuna, H.A.; Ibarra-Pérez, T.; Solís-Sánchez, L.O.; Martínez-Blanco, M.R. Performance Analysis of Embedded Multilayer Perceptron Artificial Neural Networks on Smart Cyber-Physical Systems for IoT Environments. Sensors 2023, 23, 6935.
40. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), Lille, France, 6–11 July 2015; pp. 448–456. Available online: https://proceedings.mlr.press/v37/ioffe15.html (accessed on 3 June 2021).
41. Lu, S.; Liu, M.; Yin, L.; Yin, Z.; Liu, X.; Zheng, W. The Multi-Modal Fusion in Visual Question Answering: A Review of Attention Mechanisms. PeerJ Comput. Sci. 2023, 9, e1400.
42. Kumar, H.; Rudd, S.; Ganesan, S. Cross-Modal Attention Fusion: A Deep Learning and Affective Computing Model for Emotion Recognition. Multimodal Technol. Interact. 2025, 9, 116.
43. Arevalo, J.; Solorio, T.; Montes-y-Gómez, M.; González, F.A. Gated Multimodal Networks. Neural Comput. Appl. 2020, 32, 10209–10228.
44. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
45. Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; Fu, Y. Image Super-Resolution Using Very Deep Residual Channel Attention Networks. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 286–301.
46. Xu, J.; Sun, X.; Zhang, Z.; Zhao, G.; Lin, J. Understanding and Improving Layer Normalization. In Proceedings of the 33rd International Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019; pp. 4381–4391. Available online: https://dl.acm.org/doi/10.5555/3454287.3454681 (accessed on 3 March 2021).
47. Zhuang, F.; Qi, Z.; Duan, K.; Xi, D.; Zhu, Y.; Zhu, H.; Xiong, H.; He, Q. A Comprehensive Survey on Transfer Learning. Proc. IEEE 2021, 109, 43–76.
48. Nakamura, K.; Hong, B.W. Learning-Rate Annealing Methods for Deep Neural Networks. Electronics 2021, 10, 2029.
49. Lekić, M.; Gardašević, G. TensorFlow Lite Micro: Embedded Machine Learning for Resource-Constrained Devices. Sensors 2021, 21, 2984.
50. Driesen, E.; Van den Ende, W.; De Proft, M.; Saeys, W. Influence of Environmental Factors Light, CO₂, Temperature, and Relative Humidity on Stomatal Opening and Development: A Review. Agronomy 2020, 10, 1975.
51. Elvanidi, A.; Katsoulas, N. Machine Learning-Based Crop Stress Detection in Greenhouses. Plants 2023, 12, 52.
52. Flexas, J.; Medrano, H. Drought-Inhibition of Photosynthesis in C3 Plants: Stomatal and Non-Stomatal Limitations Revisited. Ann. Bot. 2002, 89, 183–189.
53. Jackson, R.D.; Idso, S.B.; Reginato, R.J.; Pinter, P.J., Jr. Canopy Temperature as a Crop Water Stress Indicator. Water Resour. Res. 1981, 17, 1133–1138.
54. Jones, H.G. Irrigation Scheduling: Advantages and Pitfalls of Plant-Based Methods. J. Exp. Bot. 2004, 55, 2427–2436.
55. Thakur, P.S.; Chaturvedi, S.; Khanna, P.; Sheorey, T.; Ojha, A. Vision Transformer Meets Convolutional Neural Network for Plant Disease Classification. Ecol. Inform. 2023, 77, 102245.
56. Khan, A.T.; Jensen, S.M.; Khan, A.R.; Li, S. Plant Disease Detection Model for Edge Computing Devices. Front. Plant Sci. 2023, 14, 1308528.
57. Pacal, I. Enhancing Crop Productivity and Sustainability through Disease Identification in Maize Leaves: Exploiting a Large Dataset with an Advanced Vision Transformer Model. Expert Syst. Appl. 2024, 238, 122099.
58. Hemalatha, S.; Jayachandran, J.J.B. A Multitask Learning-Based Vision Transformer for Plant Disease Localization and Classification. Int. J. Comput. Intell. Syst. 2024, 17, 188.

Figure 1. Data collection system. Directional arrows indicate the direction of image and environmental data flow between the barley plant, the Canon EOS 2000D camera, the DHT11 sensor connected to the ESP32 module, and the workstation through the Wi-Fi web server.

Figure 2. Architecture of the multimodal barley drought stress detection system.

Figure 3. Proposed multimodal fusion architecture for Barley Stress Detection.

Figure 4. Confusion matrix of the full multimodal model.

Figure 5. Performance metrics of the model on the greenhouse test set.

Figure 6. Training and validation loss and accuracy curves.

Figure 7. Evolution of classification metrics across the five folds.

Figure 8. Mean performance metrics with standard deviations across the 5-fold cross-validation.

Figure 9. Boxplot distribution of classification metrics across the five folds. The red rhombus (◆) indicates the mean value.

Figure 10. Training time per fold during 5-fold cross-validation.

Figure 11. Radar chart comparing the mean performance of the five ablation configurations across six evaluation metrics.

Figure 12. Training and validation loss and accuracy curves of the full multimodal model on the PlantVillage binary classification task across three random seeds: (a) seed = 42, (b) seed = 123, (c) seed = 456.

Figure 13. Training time and memory usage across optimization stages.

Figure 14. Visual comparison of model file sizes across deployment formats.

Table 1. Greenhouse Barley dataset characteristics.

Characteristic	Value
Total Samples	379
Healthy Samples	190 (50.1%)
Stressed Samples	189 (49.9%)
Image Size	224 × 224 pixels
Sensor Features	Temperature, Humidity
Training Set	303 samples (80%)
Split Strategy	Stratified (random_state = 42)

Table 2. Full multimodal model performance.

Metric	Equation	Value
Accuracy	(TP + TN)/(TP + TN + FP + FN)	98.7%
Precision	TP/(TP + FP)	97.5%
Recall	TP/(TP + FN)	100.0%
F1-Score	2 × (Precision × Recall)/(Precision + Recall)	98.7%
Specificity	TN/(TN + FP)	97.3%
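As a worked check of the equations in Table 2, the sketch below evaluates them on a set of confusion-matrix counts. The counts (TP = 39, TN = 36, FP = 1, FN = 0) are an assumption: they were chosen because they sum to the 76 held-out samples implied by Table 1 (379 − 303) and reproduce the rounded values in the table. The actual counts are those shown in Figure 4.

```python
def classification_metrics(tp, tn, fp, fn):
    """Evaluate the Table 2 formulas on raw confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * (precision * recall) / (precision + recall)
    specificity = tn / (tn + fp)
    return {
        "Accuracy": accuracy,
        "Precision": precision,
        "Recall": recall,
        "F1-Score": f1,
        "Specificity": specificity,
    }

# Hypothetical counts for a 76-sample test set (assumed, see lead-in).
metrics = classification_metrics(tp=39, tn=36, fp=1, fn=0)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.1f}%")
```

With these assumed counts, a single false positive accounts for every value in the table that falls below 100%.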

Table 3. 5-fold cross-validation performance—Greenhouse dataset.

Fold	Accuracy	Precision	Recall	Specificity	F1-Score
1	98.7%	97.4%	100.0%	97.4%	98.7%
2	100.0%	100.0%	100.0%	100.0%	100.0%
3	94.7%	94.9%	94.9%	94.6%	94.9%
4	98.7%	100.0%	97.4%	100.0%	98.7%
5	96.0%	92.7%	100.0%	91.9%	96.2%
Mean	97.6%	97.0%	98.5%	96.8%	97.7%
Std Dev	±2.2%	±3.2%	±2.3%	±3.5%	±2.1%
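The Mean and Std Dev rows of Table 3 can be recomputed from the per-fold values. The short sketch below does so with Python's standard library, assuming the sample standard deviation (n − 1 denominator), which is the choice that matches the tabulated figures. The dictionary layout and variable names are illustrative only.

```python
from statistics import mean, stdev

# Per-fold percentages copied from Table 3 (folds 1 to 5).
folds = {
    "Accuracy":    [98.7, 100.0, 94.7, 98.7, 96.0],
    "Precision":   [97.4, 100.0, 94.9, 100.0, 92.7],
    "Recall":      [100.0, 100.0, 94.9, 97.4, 100.0],
    "Specificity": [97.4, 100.0, 94.6, 100.0, 91.9],
    "F1-Score":    [98.7, 100.0, 94.9, 98.7, 96.2],
}

# stdev() is the sample standard deviation (divides by n - 1).
summary = {name: (mean(vals), stdev(vals)) for name, vals in folds.items()}

for name, (m, s) in summary.items():
    print(f"{name}: {m:.1f} ± {s:.1f}")
```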

Table 4. Training time per fold—Greenhouse dataset.

Fold	Training Time (Minutes)
1	28.7
2	34.5
3	32.2
4	31.0
5	26.6
Mean	30.6

Table 5. Ablation study performance comparison (Mean ± Std across 10 seeds).

Configuration	Accuracy	Precision	Recall	F1-Score	Specificity	AUC
Full Model	98.3 ± 1.5%	98.2 ± 2.1%	98.5 ± 1.8%	98.3 ± 1.5%	98.1 ± 2.2%	0.999 ± 0.001
Image-Only	97.4 ± 1.8%	97.8 ± 2.7%	97.2 ± 2.2%	97.4 ± 1.7%	97.6 ± 3.0%	0.996 ± 0.003
No Gating	99.1 ± 0.6%	99.0 ± 1.3%	99.2 ± 1.2%	99.1 ± 0.6%	98.9 ± 1.4%	1.000 ± 0.001
Concat Fusion	99.3 ± 0.7%	99.3 ± 1.2%	99.5 ± 1.1%	99.4 ± 0.7%	99.2 ± 1.3%	1.000 ± 0.001
Frozen Backbone	97.0 ± 2.5%	96.1 ± 3.2%	98.2 ± 4.8%	97.1 ± 2.5%	96.0 ± 4.3%	0.997 ± 0.005
Sensor-Only MLP	73.8 ± 3.5%	95.0 ± 6.5%	52.1 ± 6.2%	67.0 ± 5.3%	97.6 ± 4.2%	0.718 ± 0.051

Table 6. One-Way ANOVA results across ablation configurations (10 seeds each).

Metric	F-Statistic	p-Value	df (Between, Within)	Partial η²	Interpretation
Accuracy	4.44	0.005	4, 36	0.33	Significant (p < 0.01)
F1-Score	4.10	0.008	4, 36	0.31	Significant (p < 0.01)

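Between-group df = 4 (5 configurations − 1); error df = 36, i.e., (5 − 1) × (10 − 1), consistent with the ten seeds being treated as a blocking factor shared across configurations.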
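The effect sizes in Table 6 can be recovered from the reported F-statistics and degrees of freedom using the standard identity partial η² = (F × df_effect) / (F × df_effect + df_error). The sketch below applies it to both rows; the function name is illustrative.

```python
def partial_eta_squared(f_stat, df_effect, df_error):
    """Partial eta squared from an F-statistic and its degrees of freedom."""
    return (f_stat * df_effect) / (f_stat * df_effect + df_error)

# Values taken from Table 6.
eta_accuracy = partial_eta_squared(4.44, 4, 36)
eta_f1 = partial_eta_squared(4.10, 4, 36)

print(f"Accuracy: {eta_accuracy:.2f}")  # 0.33
print(f"F1-Score: {eta_f1:.2f}")        # 0.31
```

Both results agree with the tabulated 0.33 and 0.31, which confirms that the reported effect sizes are based on an error df of 36.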

Table 7. Baseline comparison on greenhouse dataset (Mean ± Std, 80/20 split).

Model	Mode	Acc. (%)	F1 (%)	AUC	Seeds	Implementation Details
Logistic Regression	Image-Only	91.7 ± 3.0	91.6 ± 3.4	0.969 ± 0.024	3	Flattened features
Logistic Regression	Sensor-Only	72.8 ± 2.0	68.6 ± 3.3	0.738 ± 0.046	3	Sensor Only
Logistic Regression	Multimodal	92.1 ± 3.5	92.0 ± 3.8	0.971 ± 0.023	3	Image + sensor Fusion
Random Forest	Image-Only	87.3 ± 6.2	87.5 ± 6.1	0.953 ± 0.042	3	Flattened features
Random Forest	Sensor-Only	78.5 ± 2.7	73.4 ± 4.2	0.824 ± 0.036	3	Sensor Only
Random Forest	Multimodal	89.9 ± 6.2	90.3 ± 5.9	0.967 ± 0.039	3	Image + sensor Fusion
Shallow CNN (3-layers)	Image-Only	92.5 ± 3.0	92.6 ± 3.1	0.983 ± 0.008	3	CNN image features
Sensor-Only MLP	Sensor-Only	73.8 ± 3.5	67.0 ± 5.3	0.718 ± 0.051	10	Sensor Only
AgriFusionNet (re-impl.) *	Multimodal	75.2 ± 5.2	67.7 ± 8.0	0.851 ± 0.030	5-fold	On our dataset
Our Full Model	Multimodal	98.3 ± 1.5	98.3 ± 1.5	0.999 ± 0.001	10	EfficientNetV2-S + Gated

* AgriFusionNet re-implemented on our greenhouse dataset using 5-fold CV. The original paper reports 94.3% on a different multi-species multispectral dataset.

Table 8. PlantVillage dataset characteristics.

Characteristic	Value
Total Images	~26,000
Healthy Class	~13,000 images
Diseased Class	~13,000 images
Original Classes Mapped	38 → 2
Plant Species	14 species
Training Set	70% (~18,200)
Validation Set	15% (~3900)
Test Set	15% (~3900)

Table 9. PlantVillage binary classification performance.

Metric	Value
Accuracy	99.97%
Precision	99.96%
Recall	100%
F1-Score	99.98%
Specificity	99.81%

Table 10. Model performance across datasets.

Dataset	Accuracy (%)	Precision (%)	Recall (%)	F1-Score (%)	Specificity (%)
Greenhouse Barley (mean ± std)	98.3 ± 1.5	98.2 ± 2.1	98.5 ± 1.8	98.3 ± 1.5	98.1 ± 2.2
PlantVillage Binary (mean ± std)	99.97 ± 0.03	99.96 ± 0.03	100.00 ± 0.00	99.98 ± 0.02	99.81 ± 0.17

Table 11. CPU optimization impact.

Configuration	Training Time/Epoch	CPU Usage	RAM Usage
Non-Optimized	28.9 min	92%	6400 MB
Optimized	9.1 min	34%	4930 MB
Reduction	68.5%	65%	25%

Table 12. Model compression results.

Model Format	Size	Reduction from Original
Original Keras (.h5)	150 MB	—
TFLite Float32	46.3 MB	69%
TFLite INT8 Quantized	14.9 MB	90%
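The reduction percentages in Tables 11 and 12 follow from a simple relative-change calculation. The sketch below reproduces the training-time and model-size figures from the tabulated before and after values; the helper name is illustrative.

```python
def percent_reduction(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before

# Table 11: training time per epoch, in minutes.
time_reduction = percent_reduction(28.9, 9.1)
speedup = 28.9 / 9.1

# Table 12: model sizes in MB, relative to the 150 MB Keras file.
float32_reduction = percent_reduction(150, 46.3)
int8_reduction = percent_reduction(150, 14.9)

print(f"Training time: {time_reduction:.1f}% lower ({speedup:.1f}x faster)")
print(f"TFLite Float32: {float32_reduction:.0f}% smaller")
print(f"TFLite INT8: {int8_reduction:.0f}% smaller")
```

Applying the same helper to the rounded CPU and RAM entries of Table 11 gives roughly 63% and 23%, slightly below the tabulated 65% and 25%. The tabulated values may have been computed from unrounded measurements.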

Table 13. Resource configuration.

Parameter	Value
Maximum Epochs	60
Batch Size	16 images
RAM Usage	2048 MB
CPU Threads	6
Training Platform	CPU (Intel Core i7)

Table 14. Comparison of the proposed method with recent plant disease detection approaches, including image-only, multimodal, and shallow baseline studies. Rows marked “This Work” show experiments conducted on our 379-sample greenhouse dataset; * AgriFusionNet was re-implemented on our dataset using 5-fold cross-validation.

Study	Approach	Accuracy	Multimodal	Edge Deploy.	Model Size (M Params)
P. S. Thakur et al. [55]	CNN + Vision Transformer hybrid (PlantXViT)	98.86%	No	Partial	0.85
A. T. Khan et al. [56]	MobileNetV3-Small (edge-optimized)	99.50%	No	Yes	0.93
I. Pacal [57]	MaxViT large-scale benchmark	99.24%	No	No	—
S. Hemalatha & J. J. B. Jayachandran [58]	PDLC-ViT multitask transformer	99.97%	No	Partial	—
This Work (Logistic Regression)	LR: Image features + sensors (multimodal)	92.1 ± 3.5%	Yes	Yes	—
This Work (Random Forest)	RF: Image features + sensors (multimodal)	89.9 ± 6.2%	Yes	Yes	—
This Work (Shallow CNN)	3-layer CNN, image-only baseline	92.5 ± 3.0%	No	Partial	—
This Work (AgriFusionNet re-impl.)	AgriFusionNet on our greenhouse dataset (5-fold)	75.2 ± 5.2% *	Yes	Yes	12
Saleh Albahli [22]	AgriFusionNet—EfficientNetV2-B4 + IoT sensor fusion	94.3% (orig. dataset)	Yes	Yes	12
This Work (ours)	Multimodal EfficientNetV2 + Gated Cross-Modal Attention	98.3 ± 1.5% (greenhouse); 99.97% (PlantVillage)	Yes	Yes	~3.7 (INT8 ≈ 14.9 MB)

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Boukouba, R.; Ben Aissa, D.; Guidara, A.; Smaoui, N.; Ebel, C. An Efficient Multimodal Framework for Barley Drought Stress Detection on Resource-Constrained Devices. AgriEngineering 2026, 8, 230. https://doi.org/10.3390/agriengineering8060230

