1. Introduction
Global climate change has emerged as one of the most critical threats to the stability of agricultural production as the frequency and intensity of extreme weather events continue to increase [1]. These environmental disturbances destabilize the productivity of open-field farming and highlight the structural limitations of conventional production systems that rely heavily on regional and seasonal conditions. Consequently, Controlled-Environment Agriculture (CEA) has gained prominence as a key global strategy for ensuring stable food production by enabling the precise regulation of diverse environmental factors.
The advancement of CEA has accelerated the integration of engineering-based technologies—such as embedded sensing, automated irrigation, optical monitoring, and data-driven decision-making systems—into modern agricultural production frameworks. Recent progress in optical biosensing, multimodal imaging, and machine learning has enabled non-destructive and high-resolution monitoring of plant physiological states, offering engineering-driven solutions to challenges previously addressed solely through traditional agronomic approaches.
These technological innovations are particularly crucial for high-value horticultural crops cultivated in greenhouses and plant factories, where precise environmental control is directly linked to production efficiency and crop performance. Among such crops, basil (Ocimum basilicum L.) is one of the most widely cultivated herbs in controlled environments and has emerged as an ideal model crop for developing and validating engineering-based phenotyping and environmental optimization strategies [2].
Among various environmental factors, water availability is the most critical determinant of basil’s growth and physiological function. Water deficit reduces photosynthetic efficiency, nutrient uptake, and leaf expansion, ultimately degrading quality and marketable yield [3]. Although moderate water stress can enhance essential oil accumulation, severe drought significantly alters metabolic activity, leading to poor aroma quality and growth reduction [4]. Therefore, real-time, accurate monitoring of basil’s physiological response to water stress and subsequent recovery is essential for precision irrigation management.
Traditional diagnostic approaches—such as observing leaf wilting, discoloration, or reduced leaf area—detect visible symptoms that appear only after physiological disruption has occurred [5]. Although biochemical assays and gene-expression analyses can quantify drought responses [6], they are destructive, labor-intensive, and unsuitable for continuous monitoring in commercial cultivation.
To overcome these limitations, optical biosensing offers a non-destructive, real-time approach to monitor plant physiological states through optical signals [7]. In this framework, biological recognition elements such as chlorophyll fluorescence (CF) represent intrinsic photosynthetic indicators of stress; optical transducers (e.g., LEDs, filters, and sensors) convert these biological signals into measurable optical data; and computational interpretation using deep learning translates optical data into interpretable physiological information. This integration allows quantitative assessment of plant stress adaptation and recovery mechanisms in complex cultivation environments.
CF imaging, in particular, captures photochemical (qP) and non-photochemical quenching (NPQ) processes that directly reflect photosystem II (PSII) efficiency [8,9]. Because these dynamics precede morphological symptoms, CF serves as an ideal biorecognition signal for optical biosensing. However, single-modality measurements are often influenced by external illumination or geometric variations [10].
Recent advances in deep learning, especially convolutional neural networks (CNNs), have enabled the extraction of hierarchical spatial and temporal features from multimodal imaging data [11,12,13]. In particular, 3D convolutional neural networks (3D-CNNs) process volumetric data along x–y–z dimensions, capturing dynamic patterns within optical signal cubes more effectively than 2D models [14,15]. This approach has proven powerful in plant disease diagnosis and hyperspectral feature learning, yet its application to physiological stress detection using cost-effective optical sensors remains limited.
Previous studies employing 3D convolutional neural networks in agriculture have primarily focused on disease diagnosis or biochemical anomaly detection using hyperspectral image cubes, where volumetric learning has been shown to outperform 2D-CNNs by jointly modeling spatial and spectral correlations [16]. In parallel, chlorophyll fluorescence imaging has been extensively applied to early stress detection; however, many CF-based studies still rely on handcrafted feature extraction or sequential modeling approaches rather than direct volumetric representation learning [17]. More recently, multimodal fusion frameworks integrating hyperspectral and chlorophyll fluorescence information have demonstrated improved classification performance, yet these approaches typically treat each modality as an independent feature stream and do not explicitly preserve spatial–physiological alignment within a unified 3D volume [11].
In contrast, the present study advances beyond existing work by constructing an aligned RGB–depth–chlorophyll fluorescence fusion cube that encodes spatial structure and physiological dynamics simultaneously, and by applying a 3D-CNN to directly learn discriminative representations from this fused volume using cost-effective optical sensors. This design philosophy is consistent with recent cross-modal learning approaches in agricultural sensing, which emphasize modality-aware architectural integration to extract meaningful representations from heterogeneous inputs [18]. Furthermore, the proposed framework explicitly benchmarks its performance against traditional machine learning and 2D-CNN baselines, and systematically evaluates modality complementarity, thereby clarifying the specific contribution of volumetric multimodal learning for physiological stress and recovery phenotyping.
Therefore, this study proposes a 3D-CNN-based optical biosensing framework to classify basil’s physiological responses—normal, resistance, and recovery—under water deficit stress. RGB, depth, and chlorophyll fluorescence data were collected under controlled environmental conditions simulating plant factory systems. The specific objectives are to (1) acquire multimodal optical biosensing data of basil under controlled water-deficit and recovery conditions; (2) construct a 3D-CNN model that fuses multimodal optical signals into a unified biosensing parameter for feature learning; and (3) evaluate its performance compared to traditional machine learning and 2D-CNN approaches.
By bridging biological signal acquisition with deep multimodal representation learning, this study establishes a robust framework for non-destructive physiological monitoring. The proposed approach contributes to the advancement of AI-driven precision agriculture, offering a foundation for adaptive irrigation management and intelligent stress diagnosis in smart agriculture.
2. Materials and Methods
The overall workflow of this study (Figure 1) illustrates the optical biosensing pipeline, consisting of (i) biosignal acquisition from basil leaves, (ii) optical transduction and digitization, (iii) multimodal fusion, and (iv) deep-learning-based physiological classification.
2.1. Sample Preparation
Sweet basil (Ocimum basilicum L.) was selected as the biological recognition material due to its well-characterized photosynthetic and volatile responses to water availability.
The growth and imaging system used in this study consisted of a custom-built chamber equipped with environmental control and optical biosensing units (Figure 2a). The chamber integrated LED-based illumination and multiple optical sensors for synchronized image acquisition of RGB, depth, and chlorophyll fluorescence signals. The control unit managed imaging schedules and environmental parameters, including temperature, humidity, and light intensity, which were monitored and adjusted via a computer interface.
The environmental conditions followed general basil cultivation conditions [19], with daytime temperatures of 28 °C to 32 °C, nighttime temperatures of 22 °C to 24 °C, relative humidity of 40% to 70%, light intensity of 150 to 200 µmol·m⁻²·s⁻¹, and a photoperiod of 14 h.
Basil plants were cultivated hydroponically in growth trays (Figure 2b). After the basil plants developed at least four pairs of true leaves, plants with uniform growth were selected and divided into four groups: one control group and three treatment groups subjected to different irrigation conditions to induce varying degrees of water-deficit stress. The plants were then transferred to individual imaging containers within the growth and imaging chamber (Figure 2c), where both stress induction and subsequent recovery treatments were conducted under controlled environmental conditions. This setup ensured consistent imaging geometry, uniform environmental exposure, and reliable physiological measurements across all experimental groups.
Following the method of Gräf et al. (2021) [20], we captured thermal images of the leaves to check the leaf temperature differences between plants under normal and water-deficient conditions. During the water-deficit stress response, we observed that the average leaf temperature was more than 1 °C higher than that of the control group. Upon re-watering, the leaf temperature in the recovery response dropped to within 1 °C of the control group. To induce water-deficit stress responses, the treatment groups were maintained under drained conditions for 1 day, 3 days, and 9 days, respectively. To induce recovery responses, they were then re-watered under the same conditions as the control group. Consistent with these treatment conditions, the plants in each treatment group showed increased leaf temperatures during the drainage period compared to the control group, and their leaf temperatures returned to levels similar to the control group after re-watering, indicating recovery responses. Although thermal imaging was acquired concurrently, it was not included in the multimodal fusion cube for 3D-CNN training. Thermal data were used solely to physiologically validate the Resistance and Recovery labels, as leaf temperature is an indirect and environment-sensitive indicator compared to PSII-centered chlorophyll fluorescence signals.
2.2. Image Acquisition
For 9 days, basil plants underwent a drainage and re-watering treatment while RGB, depth, and chlorophyll fluorescence image data were collected. The data collection was carried out three times a day (morning, afternoon, and evening) using a growth and imaging chamber (PhytoChamber; PhytoWorks Inc., Gangneung-si, South Korea) (Figure 3). This equipment was developed by PhytoWorks Inc. and is provided as a modular, assembled system. The structure of the equipment ensures consistent environmental conditions and includes a control part (Figure 3a) located at the top, which commands the imaging process. The LED part (Figure 3b) and camera part (Figure 3c) are fixed above to consistently capture the top view of the plant growth part. The chamber itself (Figure 3d) isolates the system from the external environment, allowing independent setting of conditions such as temperature and humidity. It includes fans and Peltier elements for temperature and humidity control, as well as internal sensors for monitoring temperature, humidity, and CO₂ levels.
The equipment configuration involves a control part with a single-board computer (Raspberry Pi 4; Raspberry Pi Foundation, Cambridge, UK) (Figure 3a-1) that manages the chamber’s internal environment and another single-board computer (LattePanda Alpha; DF Robot, Shanghai, China) (Figure 3a-2) that commands the image sensors. The LED part contains white LEDs for plant growth and red and blue LEDs for chlorophyll fluorescence imaging. The camera part includes an RGB-D camera (Figure 3c-1), an Intel® RealSense™ LiDAR Camera L515 (Intel Corporation, Santa Clara, CA, USA); a thermal camera (Figure 3c-2) with a FLIR Lepton 3.5 thermal sensor (Teledyne FLIR LLC, Wilsonville, OR, USA); and a chlorophyll fluorescence image acquisition device (Figure 3c-3) consisting of a Basler ace acA1300-60gm-NIR GigE camera (Basler AG, Ahrensburg, Germany), a 6 mm C Series VIS-NIR fixed focal length lens (Edmund Optics Inc., Barrington, NJ, USA), and a longpass OD4 650 nm 12.5 mm filter (Edmund Optics Inc., Barrington, NJ, USA). All devices are fixed in place.
Thus, through this single unit of equipment (Figure 3), all types of plant image data used in the experiment were collected under consistent environmental conditions, including RGB, depth, thermal, and CF images. For the CF images, a modified protocol based on [21] was employed. Initially, Fo and Fm images were captured after 20 min of dark adaptation, followed by exposure to actinic light to induce the Kautsky effect and acquire Fp images. Subsequently, during the light-adaptation phase, the number of saturating flashes, denoted by n, was set to four; that is, saturating flashes were applied a total of four times to capture the Ft_Ln and Fm_Ln images. After these flashes, Ft_Lss and Fm_Lss images were taken upon reaching the steady-state fluorescence level in light. From these 13 directly measured images, we obtained 31 types of physiologically significant chlorophyll fluorescence parameters in the form of images. A detailed description of these 31 chlorophyll fluorescence parameters is provided in Table 1.
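Several entries in Table 1 follow directly from these measured frames; for orientation, three representative indices used throughout this study take their standard forms,

\[ F_v/F_m = \frac{F_m - F_o}{F_m}, \qquad \mathrm{NPQ} = \frac{F_m - F_m'}{F_m'}, \qquad Y(\mathrm{II}) = \frac{F_m' - F_t}{F_m'}, \]

where Fm′ and Ft denote the light-adapted maximum and steady-state fluorescence (the Fm_Ln/Fm_Lss and Ft_Ln/Ft_Lss frames in the protocol above).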
2.3. Dataset Preparation
Multimodal optical data were prepared through two parallel pipelines depending on the target learning framework: (i) volumetric image-based inputs for 3D-CNN training (Figure 4) and (ii) tabular feature vectors for conventional machine learning models.
2.3.1. ROIs Extraction and Labelling
After image acquisition, multimodal alignment was performed to ensure spatial correspondence among RGB, depth, and chlorophyll fluorescence (CF) modalities using the ORB feature-matching algorithm [28]. Regions of interest (ROIs) corresponding to basil leaves were then extracted and resized to 32 × 32 pixels. Each ROI was labeled as ‘Normal’, ‘Resistance’, or ‘Recovery’ according to the corresponding water treatment condition.
For deep learning–based analysis, multimodal image features within each ROI were organized into a unified volumetric representation. The 3D fusion cube comprised 130 optical parameter layers: 6 RGB–depth channels, 93 layers obtained by mapping the 31 chlorophyll fluorescence (CF) parameters to three color channels each (31 × 3 = 93), and 31 additional single-channel calculated CF parameter maps. These 130 feature layers were stacked to form a 3D fusion cube of size 130 × 32 × 32.
The overall preprocessing and fusion pipeline is schematically illustrated in Figure 4, including multimodal spatial alignment, ROI extraction, modality-wise normalization, and channel-wise stacking to construct the 3D fusion cube. This volumetric representation preserves both spatial heterogeneity and temporal–spectral continuity of the optical biosensing signals, forming the input for subsequent 3D-CNN model learning.
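To make the stacking step concrete, the sketch below assembles one fusion cube with NumPy. The array names and the min–max normalization helper are illustrative assumptions, not the authors’ code:

```python
import numpy as np

def min_max(layer: np.ndarray) -> np.ndarray:
    """Illustrative modality-wise normalization to [0, 1]."""
    rng = layer.max() - layer.min()
    return (layer - layer.min()) / rng if rng > 0 else np.zeros_like(layer)

# Hypothetical per-ROI inputs, spatially aligned and resized to 32 x 32:
rgb_depth = np.random.rand(6, 32, 32)      # 6 RGB/depth channels (split assumed)
cf_rgb    = np.random.rand(31, 3, 32, 32)  # 31 CF parameters x 3 color channels
cf_maps   = np.random.rand(31, 32, 32)     # 31 calculated CF parameter maps

cube_layers  = [min_max(c) for c in rgb_depth]            # 6 layers
cube_layers += [min_max(ch) for p in cf_rgb for ch in p]  # 93 layers
cube_layers += [min_max(m) for m in cf_maps]              # 31 layers

fusion_cube = np.stack(cube_layers, axis=0)  # shape: (130, 32, 32)
assert fusion_cube.shape == (130, 32, 32)
```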
2.3.2. Preparation of Input Features for Machine Learning Models
In contrast to deep learning networks, conventional machine learning algorithms require a tabular input structure in which each sample is represented by a fixed-length feature vector. To enable a direct comparison between machine learning–based and deep learning–based approaches, a separate feature preparation pipeline was employed.
For this purpose, image-derived chlorophyll fluorescence (CF) and color parameters were aggregated over each ROI to generate representative numerical descriptors. Specifically, mean intensity values of CF indices that have been reported to correlate strongly with drought stress responses [8,29] were extracted. Two feature configurations were considered: (1) a single-parameter case using only the Fv/Fm index, and (2) a multi-parameter case combining seven parameters (Fv/Fm, Y_Lss, Rfd_L3, NPQ_L2, and the R, G, and B intensity values).
Each feature vector thus represented the averaged physiological and optical responses of a single plant under a specific water treatment condition. The resulting tabular dataset was standardized using z-score normalization prior to model training to ensure comparable feature scaling across all machine learning models (Equation (1)):

\[ z_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j} \] (1)

where x_{ij} is the value of the j-th feature for the i-th sample, and μ_j and σ_j denote the mean and standard deviation of the j-th feature across all samples, respectively. This normalization ensured that each feature contributed equally to the classification process regardless of its original magnitude or unit.
2.3.3. Fusion as a 3D Fusion Parameter
The composition of the 3D fusion cube used for 3D-CNN training is summarized here. A total of 92, 78, and 56 plant images were collected for the ‘Normal’, ‘Resistance’, and ‘Recovery’ classes, respectively, yielding 368, 312, and 224 ROIs for each class.
Each ROI cube consisted of 130 image layers, including 6 layers from RGB and depth images, 93 layers corresponding to RGB-mapped images derived from 31 CF parameters, and 31 additional calculated CF parameter maps. These 32 × 32 pixel images were stacked along the spectral (z) axis to construct a 130 × 32 × 32 fusion parameter.
By organizing multimodal optical features in this volumetric form, parameters become learnable along the z-axis, particularly for CF data, where continuously acquired and gradually varying physiological responses are arranged sequentially within the cube. The resulting labeled fusion cubes were divided into training and test sets using an 8:2 ratio for each class. A detailed summary of the dataset composition is provided in Table 2.
2.4. Construction of Machine Learning and Deep Learning Models
2.4.1. Machine Learning Models
To establish baseline predictive models for basil’s physiological response to water availability, several machine learning algorithms were implemented, including Logistic Regression, k-Nearest Neighbors (k-NN), Support Vector Machine (SVM), and Light Gradient Boosting Machine (LightGBM). These models were designed to compare the effectiveness of conventional machine learning with deep learning-based approaches in phenotyping data analysis.
All models were developed in Python 3.8 using open-source libraries such as Scikit-learn and Keras. The dataset was divided into training and test sets at an 8:2 ratio. Before training, all features were standardized using z-score normalization to ensure consistent scaling across variables.
The Logistic Regression model estimates class probabilities by applying the sigmoid function to the linear combination of input features and weights (Equation (2)):

\[ P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^{\top}\mathbf{x} + b) = \frac{1}{1 + e^{-(\mathbf{w}^{\top}\mathbf{x} + b)}} \] (2)

Classification is performed using a threshold of 0.5. L2 regularization was applied to prevent overfitting, the convergence tolerance was set to 10⁻⁴ (the scikit-learn default), the optimization algorithm was set to ‘lbfgs’, and the maximum iteration count was 100. The model minimizes the cross-entropy loss (Equation (3)):

\[ J(\mathbf{w}) = -\frac{1}{N}\sum_{i=1}^{N}\left[ y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \right] \] (3)
The k-NN model classifies a test sample based on the majority class among its nearest neighbors in the feature space. The Euclidean distance metric was used to calculate similarity, and the number of neighbors k was set to 3. Uniform weights were applied, meaning that all neighboring samples contributed equally to the final decision.
The SVM model seeks the optimal hyperplane that maximizes the margin between classes by minimizing the following objective function (Equation (4)):

\[ \min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \; \frac{1}{2}\lVert \mathbf{w} \rVert^{2} + C \sum_{i=1}^{N} \xi_i \] (4)

subject to $y_i(\mathbf{w}^{\top}\phi(\mathbf{x}_i) + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$. Here, C is the penalty parameter controlling the trade-off between margin width and classification error, and φ represents the kernel mapping. The RBF (Radial Basis Function) kernel was selected, with the kernel coefficient set to ‘scale’ and the convergence tolerance to 10⁻³ (the scikit-learn default).
The LightGBM model is a gradient boosting framework that sequentially builds decision trees to minimize classification error. The model uses the Gradient Boosting Decision Tree (GBDT) algorithm, where each new tree corrects the residuals of the previous ones. The objective function at iteration t is expressed as Equation (5):

\[ \mathcal{L}^{(t)} = \sum_{i=1}^{N} l\!\left( y_i,\; \hat{y}_i^{(t-1)} + f_t(\mathbf{x}_i) \right) + \Omega(f_t) \] (5)

where l is the loss function and Ω(f_t) represents the regularization term for the tree f_t added at iteration t. The maximum number of leaf nodes per tree was set to 31, the learning rate to 0.1, and the number of boosting iterations to 100.
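These configurations map directly onto scikit-learn and LightGBM estimators. The sketch below mirrors the hyperparameters stated above (the tol values follow the library defaults assumed in the text; X_train and y_train are placeholders, not the authors’ variables):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from lightgbm import LGBMClassifier

models = {
    "LogisticRegression": LogisticRegression(penalty="l2", tol=1e-4,
                                             solver="lbfgs", max_iter=100),
    "kNN": KNeighborsClassifier(n_neighbors=3, weights="uniform",
                                metric="euclidean"),
    "SVM": SVC(kernel="rbf", gamma="scale", tol=1e-3),
    "LightGBM": LGBMClassifier(boosting_type="gbdt", num_leaves=31,
                               learning_rate=0.1, n_estimators=100),
}

# z-score normalization (Equation (1)) is applied ahead of every classifier.
pipelines = {name: make_pipeline(StandardScaler(), clf)
             for name, clf in models.items()}
# for name, pipe in pipelines.items():
#     pipe.fit(X_train, y_train)
#     print(name, pipe.score(X_test, y_test))
```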
2.4.2. Deep Learning Models: 2D-CNN and 3D-CNN
A model based on 3D-CNN was constructed to learn the features of the cube-shaped 3D fusion parameters. Additionally, a 2D-CNN-based model was developed to learn from 2D image parameters for comparison with the method utilizing the 3D fusion parameters. The convolution used in each model is illustrated in Figure 5, and its equations follow (Equations (6)–(8)).
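Written out from the variable definitions below (a reconstruction of the original typeset equations; the indexing convention may differ cosmetically), the 2D convolution, 3D convolution, and ReLU activation are:

\[ v_{ij}^{xy} = \sum_{m} \sum_{h=0}^{H_i - 1} \sum_{w=0}^{W_i - 1} k_{ijm}^{hw} \, v_{(i-1)m}^{(x+h)(y+w)} + r_{ij} \] (6)

\[ v_{ij}^{zxy} = \sum_{m} \sum_{b=0}^{B_i - 1} \sum_{h=0}^{H_i - 1} \sum_{w=0}^{W_i - 1} k_{ijm}^{bhw} \, v_{(i-1)m}^{(z+b)(x+h)(y+w)} + r_{ij} \] (7)

\[ f(v) = \max(0, v) \] (8)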
In Equations (6)–(8), v represents the output variable in the feature map. B, H, and W denote the sizes of the kernel along the spectral and the two spatial dimensions, respectively. (b, h, w) are the indices of the kernel, and (z, x, y) are the indices of the feature map, corresponding to the spectral dimension and the two spatial dimensions, respectively. k represents the kernel parameters. i, j, and m are the indices of the input layer, output layer, and feature map, respectively. M indicates the number of feature maps; thus, M_i represents the number of feature maps in the i-th layer. r is the bias term. In this study, the Rectified Linear Unit (ReLU) is chosen as the activation function (Equation (8)).
In Figure 5, B, H, and W represent the sizes of the kernel along the spectral and spatial dimensions, respectively, and M is the number of feature maps.
The architectures of the two models are illustrated in Figure 6. The 2D-CNN model (Figure 6a) consists of sequential 2D convolutional, max pooling, dropout, flatten, and dense layers. Each convolutional layer applies kernels of size (3, 3) across the x and y dimensions to generate spatial activation maps. The max pooling layers reduce feature map resolution, while the dropout layers randomly deactivate 25% of neurons to prevent overfitting. The flatten and dense layers transform extracted features into a fully connected representation for classification into ‘Normal’, ‘Resistance’, and ‘Recovery’ classes, with Softmax as the activation function.
The 3D-CNN model (Figure 6b) comprises 3D convolutional, max pooling, flatten, and dense layers. The 3D convolutional filters (kernel size: 3 × 3 × 3) slide across the x, y, and z dimensions of the 3D fusion cube, capturing both spatial and spectral correlations. The max pooling layer (2 × 2 × 2) reduces activation map size, while the flatten and dense layers convert the extracted features into a one-dimensional representation for final classification using Softmax.
In total, the network consists of five convolutional blocks with increasing numbers of filters (64–128), followed by fully connected layers for three-class classification using a Softmax activation function. The model was trained using the Adam optimizer with the default learning rate of 0.001, a batch size of 2, and 30 epochs. Early stopping and explicit regularization techniques were not applied, as the model exhibited stable convergence and no observable overfitting under the given training configuration. A detailed summary of the 3D-CNN architecture hyperparameters is provided in Table 3.
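A minimal Keras sketch of this architecture is given below. The exact filter progression, per-block pooling, and dense-layer width are assumptions made for illustration; Table 3 holds the authors’ definitive hyperparameter summary:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_3d_cnn(num_classes=3):
    """Sketch of the 3D-CNN: five conv blocks (64-128 filters), softmax head."""
    model = models.Sequential([layers.InputLayer(input_shape=(130, 32, 32, 1))])
    for filters in (64, 64, 96, 128, 128):  # assumed progression within 64-128
        model.add(layers.Conv3D(filters, kernel_size=(3, 3, 3),
                                padding="same", activation="relu"))
        model.add(layers.MaxPooling3D(pool_size=(2, 2, 2)))
    model.add(layers.Flatten())
    model.add(layers.Dense(128, activation="relu"))  # assumed width
    model.add(layers.Dense(num_classes, activation="softmax"))
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model

model = build_3d_cnn()
# model.fit(train_cubes, train_labels, batch_size=2, epochs=30)
```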
Both models were trained using the backpropagation algorithm and the Adam optimizer, with categorical cross-entropy as the loss function. For the 2D-CNN model, training was performed with a batch size of 32 and 100 epochs. For the 3D-CNN model, a smaller batch size of 2 and 30 epochs was used due to the higher computational load of volumetric data.
Model weights were iteratively updated to minimize prediction error, allowing each network to learn discriminative representations of basil’s physiological responses. All models were implemented using TensorFlow and Keras libraries under Python 3.8.
2.5. Performance Evaluation
To evaluate the performance of the constructed models, the types of basil responses to water availability were classified on the test set using the 3D-CNN model trained with fusion parameters and the model trained with 2D image parameters. Accuracy, precision, recall, and F1-score were compared using the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN); these metrics quantify how accurately the models classify each response type (Equations (9)–(12)).
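For completeness, the four metrics take their standard forms:

\[ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \] (9)

\[ \mathrm{Precision} = \frac{TP}{TP + FP} \] (10)

\[ \mathrm{Recall} = \frac{TP}{TP + FN} \] (11)

\[ F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \] (12)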
A confusion matrix method was used to create a 3 × 3 matrix illustrating the relationship between actual and predicted values, thus providing a detailed evaluation of the model’s performance in classifying each label.
To further assess the robustness and generalizability of the proposed 3D-CNN model, additional validation strategies were employed. First, a stratified K-fold cross-validation (K = 5) was conducted using the entire dataset to evaluate the stability of model performance across different data partitions. In each fold, the dataset was divided into training and validation subsets while preserving class distributions, and the model was trained and evaluated independently. Performance metrics were averaged across all folds to quantify variability and robustness.
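A minimal sketch of this protocol follows, reusing the build_3d_cnn constructor from the architecture sketch above; the arrays cubes and labels are illustrative placeholders, not the original data-loading code:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from tensorflow.keras.utils import to_categorical

# Illustrative placeholders: `cubes` holds (N, 130, 32, 32, 1) fusion cubes,
# `labels` integer class indices (0=Normal, 1=Resistance, 2=Recovery).
cubes = np.random.rand(40, 130, 32, 32, 1).astype("float32")
labels = np.random.randint(0, 3, size=40)
targets = to_categorical(labels, num_classes=3)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_acc = []
for train_idx, val_idx in skf.split(cubes, labels):  # preserves class ratios
    model = build_3d_cnn()  # re-initialize weights for each fold
    model.fit(cubes[train_idx], targets[train_idx],
              batch_size=2, epochs=30, verbose=0)
    _, acc = model.evaluate(cubes[val_idx], targets[val_idx], verbose=0)
    fold_acc.append(acc)

print(f"mean accuracy: {np.mean(fold_acc):.4f} +/- {np.std(fold_acc):.4f}")
```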
Second, learning curve analysis was performed by monitoring training and validation losses and accuracies over successive training epochs. Learning curves were used to examine model convergence behavior and to identify potential overfitting or underfitting tendencies during training. These complementary validation strategies provide a more comprehensive assessment of model reliability beyond a single train–test split.
Additionally, the receiver operating characteristic (ROC) curve method was used to represent the relationship between the true positive rate (TPR) and false positive rate (FPR), assessing the classification model’s performance across discrimination thresholds. This involves calculating the TPR and FPR at various thresholds, plotting the ROC curve, and then calculating the area under the curve (AUC) using the trapezoidal integration method; a value close to 1.0 indicates good classification performance (Equations (13)–(15)).
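In standard form, with K thresholded operating points indexed by k:

\[ \mathrm{TPR} = \frac{TP}{TP + FN} \] (13)

\[ \mathrm{FPR} = \frac{FP}{FP + TN} \] (14)

\[ \mathrm{AUC} \approx \sum_{k=1}^{K-1} \frac{\mathrm{TPR}_k + \mathrm{TPR}_{k+1}}{2} \left( \mathrm{FPR}_{k+1} - \mathrm{FPR}_k \right) \] (15)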
2.6. Feature Visualization Using t-SNE
To visualize the high-dimensional structure of the 3D-CNN input data and the feature representations learned by the network, t-distributed Stochastic Neighbor Embedding (t-SNE) was employed. t-SNE is a machine learning-based dimensionality reduction technique that converts pairwise distances between data points into probability distributions. It maps high-dimensional data to a lower-dimensional space by minimizing the divergence between the probability distribution of the original data (modeled by a Gaussian distribution) and the probability distribution of the low-dimensional embeddings (modeled by a t-distribution). The algorithm follows the formulation (Equations (16)–(18)).
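Written out in their standard form (a reconstruction; the original typeset equations may use the equivalent conditional formulation p_{j|i}), Equations (16)–(18) are:

\[ p_{ij} = \frac{\exp\left( -\lVert x_i - x_j \rVert^2 / 2\sigma_i^2 \right)}{\sum_{k \neq i} \exp\left( -\lVert x_i - x_k \rVert^2 / 2\sigma_i^2 \right)} \] (16)

\[ q_{ij} = \frac{\left( 1 + \lVert y_i - y_j \rVert^2 \right)^{-1}}{\sum_{k \neq l} \left( 1 + \lVert y_k - y_l \rVert^2 \right)^{-1}} \] (17)

\[ C = \mathrm{KL}(P \parallel Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}} \] (18)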
In Equation (16), p_{ij} represents the similarity between high-dimensional data points x_i and x_j, calculated using a Gaussian distribution. Here, x_i and x_j are data points in the high-dimensional space, ‖x_i − x_j‖ is the Euclidean distance between them, and σ_i is the standard deviation (Gaussian bandwidth) associated with data point x_i. In Equation (17), q_{ij} represents the similarity between low-dimensional data points y_i and y_j. Here, y_i and y_j are data points in the low-dimensional space, and ‖y_i − y_j‖ is the Euclidean distance between them. Equation (18) defines the cost function C, which uses the Kullback–Leibler (KL) divergence, KL(P ‖ Q), to measure the difference between the high-dimensional probability distribution P and the low-dimensional probability distribution Q. This difference is minimized using gradient descent.
3. Results and Discussion
3.1. Visualization of Chlorophyll Fluorescence Parameter Fusion
To illustrate the fusion process of optical biosensing data, representative chlorophyll fluorescence parameters were visualized before integration into the 3D fusion cube (Figure 7). Among the 31 fluorescence parameters, Fv/Fm and NPQ were selected as representative indicators of photosystem II efficiency and non-photochemical quenching, respectively. Each parameter image consisted of red, green, and blue color channels, as well as corresponding numerical value maps representing pixel-level intensity distributions. Temporal sequences (L1–L4 and Lss) were captured to reflect dynamic physiological responses under varying water availability conditions. These time-resolved images revealed gradual transitions in chlorophyll fluorescence intensity, corresponding to stress induction and recovery phases.
Through the fusion process, 130 parameter layers—including RGB, Depth, and multiple CF-derived channels—were stacked to form a single 3D cube-shaped fusion parameter. This cube structure preserved both spatial and temporal–spectral continuity, allowing the 3D-CNN model to learn complex physiological patterns within the integrated feature space. As shown in Figure 7, the color mapping along the temporal (z-axis) direction visually demonstrates the progressive changes in Fv/Fm and NPQ over time, confirming that the fusion approach effectively captures the continuous dynamics of basil’s physiological state under water-deficit stress.
In the context of AI-based analysis, the fusion cube serves not merely as a stacked collection of parameter images but as a temporal–spectral manifold that retains both spatial and chronological coherence of the basil’s physiological signals. This multidimensional representation allows the 3D-CNN to capture subtle variations that occur across time and spectral domains, enabling the model to recognize gradual stress–recovery transitions that may not be apparent in individual frames. By learning hierarchical spatiotemporal patterns within the fusion volume, the model effectively internalizes dynamic physiological cues that describe the plant’s adaptive responses under water deficit conditions.
Compared to conventional machine learning approaches such as Logistic Regression, k-NN, SVM, or LightGBM, which rely on pre-extracted statistical features from tabular data, the proposed 3D-CNN directly learns discriminative representations from raw fused images. Unlike 2D-CNNs that operate on single time slices and lack temporal continuity, the 3D-CNN framework leverages inter-frame dependencies along the temporal axis to infer sequential physiological processes.
This structural advantage allows the model to simultaneously analyze chlorophyll fluorescence dynamics and spatial heterogeneity, providing a more holistic interpretation of basil’s stress physiology. These observations are consistent with prior chlorophyll fluorescence imaging studies showing that time-resolved CF indicators sensitively capture early drought responses and recovery kinetics in horticultural and model plants [8,9].
3.2. Prediction Performance of Machine Learning Classifiers
To evaluate the ability of traditional algorithms to classify the physiological responses of basil to varying water availability, four representative machine learning classifiers—Logistic Regression, k-NN, SVM, and LightGBM—were trained and tested using chlorophyll fluorescence parameters (Table 4). Two scenarios were considered: (i) using only Fv/Fm as a single representative indicator of PSII efficiency, and (ii) using Fv/Fm combined with six additional parameters (Y_Lss, Rfd_L3, NPQ_L2, R, G, and B) that are known to be associated with drought stress responses.
Overall, the inclusion of multiple stress-related parameters resulted in higher classification accuracy compared to using Fv/Fm alone, indicating that multi-parametric inputs provide a more comprehensive representation of plant physiological states. When Fv/Fm alone was used, Logistic Regression achieved the highest accuracy (0.5193), demonstrating that even a simple linear classifier can capture distinguishable trends in photosynthetic efficiency under limited water availability. However, when multiple parameters were used, SVM exhibited the best performance (0.6077), reflecting its capability to handle non-linear and high-dimensional feature spaces. This improvement suggests that integrating parameters related to photochemical efficiency (Fv/Fm), non-photochemical quenching (NPQ), and spectral color features (R, G, B) enhances model sensitivity to complex drought-induced changes.
As the dimensionality of the input data increased, the underlying relationship between CF parameters and physiological states became more non-linear. Consequently, kernel-based models such as SVM outperformed linear approaches, effectively learning boundary distributions within the multi-dimensional feature space. These findings are consistent with previous research showing that simple models such as Logistic Regression tend to underperform in small or feature-interactive datasets, while SVMs can better generalize to non-linear relationships through kernel optimization [30,31]. In particular, studies applying k-NN and Logistic Regression to phenotyping and germination image analysis also reported limited discrimination power when underlying biological variability was high [31], aligning with the trends observed in the present work.
Nevertheless, the overall accuracy of all tested machine learning classifiers remained below 0.61, highlighting intrinsic limitations in their capacity to capture the dynamic and spatially heterogeneous responses of basil under water-deficit stress. Similar constraints have been noted in prior optical sensing studies, where handcrafted features were insufficient to represent the temporal dependencies of chlorophyll fluorescence signals [26]. Traditional models rely on manually derived statistical features and assume independence between temporal observations, thus lacking the ability to learn latent spatiotemporal dependencies inherent in biosensing data. These constraints emphasize the necessity of advanced deep learning approaches—such as the 3D-CNN fusion framework introduced in the following Section 3.3—to extract hierarchical representations and interpret complex physiological responses beyond the scope of conventional classifiers.
3.3. Ablation Study on the Contribution of Individual and Combined Modalities
To quantitatively evaluate the contribution of each sensing modality and to clarify their complementary roles, an ablation study was conducted by varying the combinations of input modalities. For this purpose, 3D-CNN models were independently trained and evaluated using different modality configurations, including five cases: RGB-only, CF-only, RGB + Depth, RGB + CF, and full multimodal fusion (RGB + Depth + CF). To ensure a fair comparison, all models shared the same network architecture, training protocol, and train–validation data split, with only the input channel configurations differing across experiments.
Table 5 summarizes the classification accuracies obtained for each modality configuration. Among the single-modality models, the CF-only model achieved the highest classification accuracy (84.53%), indicating that chlorophyll fluorescence (CF) provides the most direct physiological information related to water-deficit stress. In contrast, the RGB-only model exhibited relatively limited performance, with an accuracy of 64.08%, which can be attributed to the fact that RGB information primarily relies on color and appearance cues and is therefore less sensitive to early physiological changes.
When dual-modality inputs were employed, classification performance generally improved compared to single-modality models. In particular, the RGB + CF configuration achieved an accuracy of 95.67%, significantly outperforming both the RGB-only and CF-only models. This result suggests that color and structural context from RGB images provides complementary information to fluorescence-derived physiological signals.
The full multimodal fusion model (RGB + Depth + CF) achieved the highest classification accuracy of 96.90%, outperforming all partial modality combinations and further improving upon the RGB + CF configuration. This improvement indicates that depth information contributes to capturing subtle structural recovery patterns that are not fully represented by fluorescence dynamics alone. Specifically, RGB encodes surface appearance, Depth captures structural geometry, and CF provides time-resolved physiological activity, with each modality contributing distinct yet complementary information. Consequently, the synergistic integration of these modalities enables the 3D-CNN model to construct a more holistic representation of basil’s stress–recovery dynamics, which cannot be achieved by any single modality or partial fusion alone.
3.4. Comparison of Prediction Performance Between 2D-CNN and 3D-CNN Models
To further evaluate the effectiveness of the proposed fusion approach, the classification performance of a 3D-CNN model using the fusion parameter was compared with that of a 2D-CNN model trained on 2D RGB image parameters (Table 6). The 2D-CNN achieved an accuracy of 0.7679, a precision of 0.8585, a recall of 0.6644, and an F1 score of 0.6631. The relatively low recall and F1 scores (~0.66) indicate that the 2D-CNN frequently missed true instances of the stressed and recovering classes, suggesting that RGB images alone lack sufficient spectral–temporal information to capture subtle physiological transitions. In contrast, the 3D-CNN model using the fusion parameter exhibited remarkable improvements, achieving 0.9690 accuracy, 0.9733 precision, 0.9661 recall, and 0.9694 F1 score. The high recall and F1 values indicate that the model performs robustly in identifying all physiological response types with minimal false negatives.
Both deep learning models outperformed all machine learning classifiers presented in Table 4, confirming the superiority of deep feature extraction over manually engineered features in capturing basil’s complex physiological responses. Furthermore, the 3D-CNN, which learns from RGB, depth, and time-resolved CF parameters simultaneously, demonstrated a significant advantage over the 2D-CNN trained on single-frame RGB data. These results highlight that the inclusion of temporal–spectral information allows the model to infer subtle stress-induced patterns that static imaging cannot reveal.
This performance trend is consistent with prior multimodal imaging studies in plant phenotyping and stress detection, where 3D-CNN architectures outperformed 2D approaches by learning volumetric and spatiotemporal representations from hyperspectral and fluorescence imaging data. For example, Jung et al. [16] demonstrated that 3D-CNNs trained on hyperspectral image cubes achieved superior disease classification performance compared to 2D-CNNs by jointly modeling spatial and spectral correlations within plant tissues. Similarly, Dong et al. [17] reported that chlorophyll fluorescence-based stress diagnosis benefits from modeling temporal fluorescence dynamics, although their approach relied on feature extraction and sequential learning rather than direct volumetric convolution. More recently, Zhang et al. [11] showed that fusing hyperspectral and chlorophyll fluorescence information improves stress classification accuracy; however, their framework treated each modality as a separate feature stream rather than as a spatially aligned 3D volume.
In this context, the present study extends existing work by integrating RGB, depth, and chlorophyll fluorescence signals into a unified fusion cube and directly learning discriminative physiological representations through 3D convolution. Such volumetric multimodal learning enables more effective characterization of continuous stress and recovery dynamics, particularly for visually similar states such as ‘Normal’ and ‘Recovery’, which are difficult to distinguish using single-modality or 2D-based approaches.
Figure 8 shows the confusion matrices of the two deep learning models. The 2D-CNN achieved a 98% true positive rate for the ‘Normal’ class and 81% for the ‘Resistance’ class but exhibited substantial misclassification of the ‘Recovery’ samples as ‘Normal’, indicating poor discrimination of post-stress recovery responses. Conversely, the 3D-CNN achieved true positive rates above 91% for all three classes (Normal, Resistance, Recovery), demonstrating its ability to effectively differentiate among various physiological states. This suggests that basil leaves in the ‘Normal’ and ‘Recovery’ states share visually similar RGB textures, and that the integration of CF and depth information in the 3D-CNN enables a more accurate characterization of recovery dynamics.
3.5. Reliability and Practical Applicability of the 3D-CNN Model
Beyond classification accuracy, the practical deployment of deep learning–based phenotyping models requires reliable decision boundaries and sufficient computational efficiency. Therefore, this section evaluates the reliability of the proposed 3D-CNN model using ROC analysis and examines its inference time and model complexity to assess real-world applicability.
Before assessing classification reliability, the training stability and convergence behavior of the proposed 3D-CNN model were examined using learning curve analysis. As shown in Figure 9a, the training and validation curves exhibited consistent convergence with a minimal performance gap, indicating that the model learned generalized feature representations without severe overfitting. The smooth convergence trend further suggests that the selected training configuration was appropriate for multimodal feature learning.
Figure 9b presents the receiver operating characteristic (ROC) curves of the 3D-CNN fusion model, which depict the relationship between the true positive rate (TPR) and the false positive rate (FPR) for each response class. The ROC analysis was conducted to evaluate the model’s ability to distinguish among the three physiological states—Normal, Resistance, and Recovery—in basil leaves subjected to varying water conditions. The area under the curve (AUC) values were calculated as 0.90 for the Normal class, 0.93 for the Resistance class, and 0.92 for the Recovery class. These consistently high AUC values exceeding 0.90 demonstrate that the 3D fusion parameter model achieves strong separability in multi-class classification by maintaining a high TPR while minimizing the FPR across all categories.
The particularly high AUC observed for the Resistance class (0.93) indicates that the 3D-CNN model is exceptionally sensitive in detecting the onset of stress responses, even when physiological changes are subtle. This reflects the model’s ability to capture early alterations in photochemical efficiency and energy dissipation processes that occur during drought stress adaptation. Similar patterns have been observed in fluorescence-based physiological analyses, where Fv/Fm and NPQ were shown to sensitively reflect PSII regulation and photoprotective adjustments under water-deficit conditions [21,25].
Ultimately, the ROC analysis demonstrates that the proposed 3D-CNN fusion framework establishes stable and reliable decision boundaries across physiological response classes, providing a robust foundation for real-world deployment beyond improvements in classification accuracy. This emphasis on robust validation is consistent with recent multimodal machine learning studies, such as Guan et al. (2025) [32].
Furthermore, K-fold cross-validation results (Table 7) demonstrated stable classification performance across different data splits, with low variability in accuracy and F1-score among folds. This consistency confirms that the proposed 3D-CNN fusion framework is not sensitive to a specific train–test partition and is capable of generalizing across heterogeneous samples.
To evaluate the practical feasibility of the proposed 3D-CNN framework, the inference time and model complexity were quantitatively analyzed. Inference experiments were conducted on a workstation equipped with an NVIDIA RTX-3090 GPU, an Intel Core i9 CPU, and 64 GB of system memory. The inference time was measured as the average forward-pass latency per region of interest (ROI), excluding data loading and preprocessing, in order to reflect the pure computational cost of the model.
The proposed 3D-CNN model contains approximately 1.73 million trainable parameters, corresponding to a model size of approximately 6.6 MB. Despite its volumetric convolutional architecture, the average inference time of the 3D-CNN model was approximately 5 ms per ROI. From an operational perspective, stress diagnosis in greenhouse or vertical farming environments does not require frame-level video processing; instead, decision-support systems typically operate at minute- or hour-level intervals.
Considering these operational requirements, the observed inference time falls well within an acceptable range for practical deployment. Overall, the proposed 3D-CNN fusion model effectively achieves a balance between classification performance and computational efficiency, supporting its practical applicability in precision irrigation and smart farming applications.
3.6. Feature Distribution Analysis Using t-SNE
To interpret how the 3D-CNN model learns discriminative representations of basil physiological states, feature distributions at both the input level and the learned representation level were visualized using t-distributed stochastic neighbor embedding (t-SNE) (Figure 10). This analysis enables a qualitative assessment of how Normal, Resistance, and Recovery responses are organized within the feature space under different water availability conditions.
In the input-level embeddings (Figure 10a), the three classes exhibited partially separated yet overlapping distributions. Normal and Resistance samples were positioned relatively far apart, while Recovery samples formed an intermediate continuum between them. This overlap indicates that, despite the inclusion of multi-channel information, raw optical data alone are insufficient to clearly disentangle subtle physiological changes occurring during the early stages of water stress. Such ambiguity is consistent with previous chlorophyll fluorescence (CF)-based studies reporting that early drought responses are characterized by continuous dynamics—such as gradual induction of non-photochemical quenching (NPQ) and moderate declines in photosystem II (PSII) efficiency—rather than abrupt shifts in fluorescence indices [12,13].
In contrast, the t-SNE visualization of feature embeddings extracted from the final fully connected layer of the 3D-CNN (Figure 10b) revealed compact and well-separated clusters for the Normal, Resistance, and Recovery groups. This pronounced separation demonstrates that the 3D-CNN effectively learned nonlinear correlations between optical signals and physiological responses through hierarchical convolutional operations. The learned latent space preserves both spectral variability and temporal continuity, indicating that samples are organized according to underlying physiological processes rather than superficial visual similarities.
These results indicate that the 3D-CNN does not merely memorize image-level differences but instead learns a meaningful latent manifold that represents plant stress–response trajectories as smooth transitions across conditions. In particular, the intermediate positioning of the Recovery state between Normal and Resistance reflects gradual functional restoration following water-deficit-induced physiological impairment, consistent with known adaptation processes in which NPQ relaxation follows photoprotective activation and PSII quantum efficiency progressively recovers after re-watering [33].
The clear clustering in the learned feature space further suggests that the 3D-CNN performs implicit feature selection by emphasizing physiologically informative dimensions while suppressing less relevant noise. In this process, RGB, depth, and CF modalities provide complementary information: RGB features capture stress-related changes in leaf color distribution and chromatic uniformity [34], depth features encode structural responses such as leaf drooping and morphological restoration during recovery [35], and CF parameters provide direct insight into photosynthetic function.
Among these modalities, CF parameters play a central role by representing time-resolved functional states of PSII along the spectral–temporal axis of the 3D fusion cube. These include maximum and effective quantum efficiency indices (e.g., Fv/Fm and Y(II)) and dynamic NPQ metrics. As the 3D convolutional kernels slide across the spatial (x–y) and temporal–spectral (z) dimensions, the network simultaneously learns the temporal evolution and spatial distribution of these signals [15]. Features associated with persistently elevated NPQ and suppressed PSII efficiency dominate the Resistance cluster, whereas gradual NPQ relaxation and PSII functional recovery characterize the Recovery cluster, explaining its distinct yet intermediate position in the learned t-SNE space.
Overall, the t-SNE analysis demonstrates that the proposed 3D-CNN does not rely on static intensity-based differences for classification. Instead, it integratively encodes appearance cues from RGB data, structural information from depth measurements, and photosynthetic physiological dynamics from CF signals into a biologically meaningful latent space. By aligning multimodal optical information with established mechanisms of photosynthetic regulation and stress adaptation, the proposed framework substantially enhances the biological interpretability of AI-driven plant phenotyping.
3.7. Limitations and Future Perspectives
Despite the strong performance of the proposed multimodal 3D-CNN framework, several limitations should be considered. First, the experiments were conducted under controlled environmental conditions using a custom imaging chamber, which ensured stable illumination and imaging geometry. While this setting was essential for isolating physiological responses, it may limit direct transferability to commercial greenhouses or open-field environments where background complexity and environmental variability are greater.
Second, the present study focused on a single crop species, basil, under water-deficit stress. Although basil represents a suitable model crop for controlled-environment phenotyping, physiological responses and optical signatures can differ across species and stress types. Future work should therefore evaluate the generalizability of the proposed fusion strategy across multiple crops and environmental conditions to establish broader applicability.
Finally, multimodal learning using 3D-CNNs inherently involves higher computational complexity than 2D-based approaches. While the current model achieved acceptable inference speed for practical decision-support intervals, further optimization will be required for large-scale or real-time deployment. In this regard, future studies may explore modular sensing configurations [36] and hybrid or attention-based network architectures [37] to balance computational efficiency with physiological interpretability, thereby facilitating scalable deployment in smart farming systems.
4. Conclusions
This study demonstrated the effectiveness of a 3D-CNN-based multimodal data fusion approach for phenotyping basil (Ocimum basilicum L.) under varying water availability. By integrating RGB, depth, and time-resolved CF data, the proposed model captured both spatial and temporal–spectral features that collectively describe the plant’s physiological state transitions. This fusion framework enabled comprehensive monitoring of water-stress responses, providing insights into the mechanisms of resistance and recovery.
Compared to traditional machine learning classifiers and a 2D-CNN model trained on single-frame RGB images, the 3D-CNN achieved significantly higher classification accuracy and learned more distinct and biologically meaningful feature representations. The model effectively distinguished Normal, Resistance, and Recovery states, accurately reflecting basil’s adaptive dynamics under water-deficit and rehydration conditions. Feature-space visualization using t-SNE confirmed that the learned spatial–spectral embeddings corresponded to physiologically interpretable clusters rather than superficial visual differences, validating that the 3D-CNN captured latent manifold structures underlying real plant responses.
The proposed approach contributes to precision agriculture by providing a non-destructive method for continuous stress monitoring. By linking optical biosensing with deep learning, the 3D fusion framework enables the early detection of subtle stress cues and dynamic visualization of recovery processes. This method can be extended to other crops and abiotic stress scenarios, offering a scalable foundation for intelligent irrigation control and phenotyping automation.
Future work will focus on expanding the dataset to include broader environmental variability, validating model generalizability in real greenhouse and field conditions, and coupling the system with automated irrigation or climate-control mechanisms for closed-loop water management. Ultimately, this study establishes a basis for developing intelligent, data-driven phenotyping systems that connect multimodal optical sensing with temporal deep learning architectures to enhance water-use efficiency and crop resilience under diverse agricultural conditions.