1. Introduction
The accurate and reliable prediction of fresh weight in lettuce is a critical component for optimizing greenhouse production management, improving resource utilization efficiency, and ensuring yield [1,2]. As the most intuitive physical indicator of photosynthetic product accumulation and growth status, dynamic changes in fresh weight directly reflect the plant’s physiological response to environmental regulation [3]. Traditional fresh weight measurements rely on destructive sampling and weighing using electronic balances. Although this method offers high precision, its irreversible nature precludes continuous, non-destructive monitoring of individual plants throughout their entire life cycle. Furthermore, the time-consuming and labor-intensive nature of these operations makes it difficult to meet the high-throughput demands of intelligent phenotyping platforms [4,5]. Consequently, the development of non-destructive methods for estimating fresh weight has become a research hotspot in the field of precision agriculture [6,7].
The three-dimensional geometric characteristics of vegetation are strongly correlated with biomass, as they directly quantify the spatial occupancy and morphological distribution of plants [
8,
9]. Wang et al. [
10] found that the coefficient of determination (R2) between canopy structure features extracted from UAV multispectral imagery and aboveground biomass in grasslands was 0.73; by integrating coupled information on height and coverage, this approach significantly outperformed traditional spectral vegetation indices. Xie et al. [
11] employed LiDAR technology to calculate canopy height models and point cloud volumes, achieving an extraction accuracy of over 95%, thereby providing precise phenotypic parameters for biomass distribution modeling. However, relying solely on structural information has limitations. When plants are subjected to stress, internal physiological changes typically precede external morphological changes, resulting in a lag in the response of structural features during early growth stages or the initial phase of stress [
12]. Additionally, canopy overlap and shading can affect the estimation accuracy of key plant phenotypic parameters [
13,
14]. Therefore, relying solely on 3D structural information makes it difficult to fully reconstruct the true growth status of vegetation in complex canopy scenarios [
9,
15].
Multispectral features sensitively reflect physiological and biochemical states, such as chlorophyll content and water stress, whereas color and texture features extract richer spatial distribution information from plant surfaces; both serve as effective supplements to structural information [
16,
17]. However, relying solely on single-source data or two-dimensional information from a single perspective is highly susceptible to interference from canopy overlap, sensor saturation effects, and complex environmental backgrounds [
18]. Research indicates that the deep fusion of multisource phenotypic features through cross-modal information complementarity and multilevel feature synergy enables a more comprehensive analysis of complex systems [
19]. Zhang et al. [
20] significantly improved the estimation of aboveground biomass in sorghum by fusing morphological, color, and textural features from multiview images. In the diagnosis of water stress in summer maize, Xie et al. [
21] constructed a backpropagation neural network model by combining vegetation indices, image texture, and phenotypic parameters, significantly mitigating the underestimation of low stomatal conductance values. In a study predicting sorghum chlorophyll content, Zhang et al. [
22] constructed a PLSR model by integrating RGB color features, hyperspectral indices, and fluorescence intensity. The model achieved a prediction R2 of 0.90, far exceeding that of models using single-sensor features. Che et al. [
23] acquired image sequences at five growth stages using UAVs for maize biomass estimation. They extracted parameters, such as canopy structure and spectral characteristics, from both raw and reconstructed images and constructed aboveground biomass estimation models based on both single-parameter and multimodal data. Multimodal data fusion achieved a higher biomass estimation accuracy than single-parameter methods, with a coefficient of determination of 0.83. The fusion of multisource remote sensing features also validated the significant advantages of multimodal information synergy over single data sources. In summary, cross-modal feature fusion can effectively compensate for the inherent limitations of single information sources, enabling a more comprehensive and accurate analysis of vegetation phenotype and physiological status [
24,
25].
Multi-view imaging technology is a key method for enhancing the robustness of plant phenotyping. By leveraging spatial redundancy to overcome the limitations of single-view imaging in terms of information coverage and data accuracy [
26,
27], it effectively addresses the shortcomings of traditional single-view, single-sensor approaches in analyzing complex three-dimensional plant structures, physiological states, and environmental interactions, thereby opening new avenues for achieving precise phenotypic analysis from the organ to the population scale [
28]. Zhang et al. [
20] utilized multi-view image fusion and multi-category features to assess the above-ground biomass of sorghum. Compared to single-type variables and image information from a single viewpoint, fusion based on averaged multi-view image information significantly enhances the ability to capture the phenotypic characteristics of sorghum above-ground biomass. Li et al. [
29] estimated blueberry yield using multi-view images combined with the YOLOv8 object detection framework. Compared to single-view image information, a regression model based on multi-view image fusion significantly improved the accuracy of blueberry yield estimation, reducing the mean absolute percentage error to 24.6% and achieving an R2 of 0.77, an improvement of 5.2% to 15.7% over single-view methods. Zhang et al. [
30] used multi-angle remote sensing technology to estimate water use efficiency in winter wheat. Compared with traditional single-angle vertical observations, multi-angle remote sensing can capture richer information on canopy structure and outperform traditional single-angle spectral parameter models. Duan et al. [
31] utilized multi-angle imaging technology to collect color images of rice from multiple angles, enabling a comprehensive determination of the number of panicles. Multi-angle imaging effectively overcomes occlusion issues associated with a single viewpoint, significantly improving the accuracy and stability of panicle number identification. In summary, multi-view imaging effectively overcomes issues such as single-view occlusion and information loss by integrating multidimensional phenotypic characteristics. This method provides more comprehensive data and reduces random errors caused by a single measurement angle through information complementarity and averaging, thereby improving the accuracy of phenotypic analysis.
Traditional machine learning models generally suffer from a lack of transparency in their decision-making processes. Models tuned for high predictive accuracy often sacrifice interpretability, making it difficult to balance the two [
32,
33]. To address this challenge, explainable machine learning has rapidly developed in recent years. It not only reveals the internal processes of models but also uses model-agnostic explanation methods to transform any “black-box” model into an explainable one without requiring knowledge of its internal structure, providing explanations at both global and local levels [
34,
35]. Applying the aforementioned interpretability framework to crop fresh weight prediction using multimodal data fusion allows for the clear visualization of the influence pathways of different data sources on prediction results while maintaining model accuracy, thereby enhancing the model’s credibility and usability in practical applications [
36,
37].
To address these issues, we designed an integrated multi-sensor platform for data acquisition. Using multi-view morphological, color, and texture data, together with multispectral features of mature lettuce, we propose a yield prediction model for greenhouse-grown lettuce based on multi-view and multimodal feature fusion. We established a feature fusion framework that integrates optical sensor data, including multispectral vegetation indices and visible-light color and texture features, and introduced a collaborative strategy for multi-view and multi-scale features. Variance inflation factor (VIF) tests and correlation analysis were used to eliminate redundant features and select an optimal feature subset. Ensemble learning methods, such as random forest, were combined to construct a mapping model between the multimodal lettuce data and fresh weight, and SHAP explanations were used to interpret the prediction process. This approach leverages the complementary nature of spectral, morphological, color, and texture features to overcome the limitations of single-feature dimensions and angle dependency. This study provides a theoretical basis and technical guidance for the non-destructive estimation of the fresh weight of greenhouse-grown lettuce.
3. Results
3.1. Statistical Analysis of Fresh Weight Under Nitrogen Gradients
Fresh weight is a key indicator for assessing lettuce yield and nutritional status, and its dynamic changes reflect the critical role of nitrogen in regulating the assimilation and distribution of photosynthetic products. As an essential element for the synthesis of proteins, chlorophyll, and nucleic acids, nitrogen serves as the primary driver of photosynthetic carbon assimilation and determines the biomass accumulation [
62]. The statistical results for the lettuce fresh weight under different nitrogen gradients are presented in Table 6.
A statistical analysis of 120 lettuce samples revealed that the fresh weight ranged from 21.50 g to 81.65 g, with an average of 42.63 g across all samples. As nitrogen supply levels increased, the average fresh weight of lettuce in each group showed a stepwise upward trend: the average fresh weight at the low nitrogen level (N1) was 31.34 g, while at the high nitrogen level (N5) it was 51.68 g. Box-and-whisker plots of the fresh weight distribution for the different treatment groups are shown in Figure 3. The response of lettuce fresh weight to the nitrogen gradient exhibits distinct phases: from N1 to N4, the average fresh weight increased in a near-linear manner, with a total increase of 62.58% relative to the N1 group; from N4 to N5, the average fresh weight increased only slightly, from 50.95 g to 51.68 g (a growth rate of 1.43%), and the median position of the box remained essentially unchanged. This indicates that, under the current greenhouse conditions, lettuce growth exhibits a physiological saturation effect: as the nitrogen concentration approaches the N4 level, the plants’ assimilation capacity tends toward saturation, and the marginal contribution of continued nitrogen input to fresh weight accumulation decreases markedly.
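The two growth rates quoted above follow directly from the reported group means. The short sketch below recomputes them; only the N1, N4, and N5 means are used because the N2 and N3 means are not stated in this section, and the helper name `relative_increase` is introduced here purely for illustration.

```python
# Group means of fresh weight (g) as reported in the text.
mean_fw = {"N1": 31.34, "N4": 50.95, "N5": 51.68}


def relative_increase(start: float, end: float) -> float:
    """Percentage change from `start` to `end`."""
    return (end - start) / start * 100.0


n1_to_n4 = relative_increase(mean_fw["N1"], mean_fw["N4"])
n4_to_n5 = relative_increase(mean_fw["N4"], mean_fw["N5"])

print(f"N1 -> N4: {n1_to_n4:.2f}%")
print(f"N4 -> N5: {n4_to_n5:.2f}%")
```

Recomputing from the rounded means gives values within rounding distance of the reported 62.58% and 1.43%; any small difference presumably reflects the unrounded means used in the original calculation.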
3.2. Correlation Analysis Between Multi-View Phenotypic Features and Fresh Weight
In this study, the correlation between color indices and fresh weight exhibited a pronounced dependence on viewing angle. Under the top-down view, the absolute Pearson correlation coefficients (|r|) between the color indices and fresh weight were generally below 0.4, indicating that color information from the canopy top did not adequately characterize fresh weight accumulation. In contrast, the correlations between fresh weight and indices such as VARI, NDI, IGRVI, and GRRI were markedly stronger under the single-side and side-view-averaged perspectives: the correlation coefficients exceeded 0.48 under the single-side view and increased further to 0.52–0.53 after side-view averaging. These results suggest that oblique imaging more effectively captures the three-dimensional structure of plants and their pigment distribution in the vertical dimension, and therefore has stronger representational capability for fresh weight estimation.
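The comparison of single-side and side-view-averaged correlations can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the fresh weights and VARI values are synthetic, and the two side views were constructed with partly opposing noise so that averaging raises |r| by construction.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Synthetic (hypothetical) data: fresh weight in g and VARI from two side views.
fresh_weight = [25.0, 32.0, 41.0, 48.0, 55.0, 63.0]
vari_side_a = [0.30, 0.28, 0.47, 0.43, 0.58, 0.57]
vari_side_b = [0.225, 0.34, 0.38, 0.505, 0.535, 0.66]

# Side-view averaging: mean of the index over the available side views.
vari_side_avg = [(a + b) / 2 for a, b in zip(vari_side_a, vari_side_b)]

r_side_a = pearson_r(fresh_weight, vari_side_a)
r_side_b = pearson_r(fresh_weight, vari_side_b)
r_side_avg = pearson_r(fresh_weight, vari_side_avg)

print(f"|r| side A = {abs(r_side_a):.3f}, side B = {abs(r_side_b):.3f}, "
      f"averaged = {abs(r_side_avg):.3f}")
```

Averaging is not guaranteed to increase the correlation in general; it helps when the view-specific deviations are at least partly independent, which is the situation the text describes.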
In terms of textural features, the averaged oblique view demonstrated a clear advantage. Specifically, the absolute values of the correlation coefficients between fresh weight and homogeneity (HOM), entropy (ENT), and angular second moment (ASM), which reflect local uniformity, textural complexity, and consistency, respectively, reached 0.58, 0.60, and 0.59, indicating a relatively strong correlation. Furthermore, the correlation coefficients for mean area (MEA) and dissimilarity (DIS) reached 0.53. From a biophysical perspective, the leaf orderliness, geometric complexity, and irregularity extracted from the side view are closely associated with the accumulated fresh weight. In contrast, under the top-down view, most texture features exhibited low correlations because of the difficulty in capturing the complex spatial structure of the plant; only variance (VAR) and correlation (COR) showed moderate correlations.
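For readers unfamiliar with these texture descriptors, the sketch below computes HOM, ENT, and ASM from a gray-level co-occurrence matrix (GLCM) using their standard definitions. The pixel offset, the symmetric counting, and the number of gray levels are assumptions chosen for illustration; the settings used in this study are not stated in this section.

```python
import math


def glcm(image, levels, dx=1, dy=0):
    """Normalized, symmetric GLCM of an integer image for one pixel offset."""
    counts = [[0.0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = image[r][c], image[r2][c2]
                counts[i][j] += 1  # pair (i, j)
                counts[j][i] += 1  # symmetric counterpart (j, i)
    total = sum(map(sum, counts))
    return [[v / total for v in row] for row in counts]


def texture_features(p):
    """Homogeneity, entropy, and angular second moment of a normalized GLCM."""
    hom = ent = asm = 0.0
    for i, row in enumerate(p):
        for j, v in enumerate(row):
            hom += v / (1 + (i - j) ** 2)
            asm += v * v
            if v > 0:
                ent -= v * math.log(v)
    return {"HOM": hom, "ENT": ent, "ASM": asm}


flat_patch = [[0, 0, 0], [0, 0, 0]]   # perfectly uniform patch
checker_patch = [[0, 1], [1, 0]]      # alternating gray levels

flat = texture_features(glcm(flat_patch, levels=2))
checker = texture_features(glcm(checker_patch, levels=2))
print(flat)
print(checker)
```

The uniform patch yields maximal homogeneity and ASM with zero entropy, whereas the alternating patch lowers HOM and ASM and raises ENT, which matches the interpretation of these features as uniformity, consistency, and complexity.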
The performance of multispectral vegetation indices derived from data acquired at different viewing angles demonstrated clear complementary characteristics. The side-view angle generally yields superior results, with correlation coefficients of 0.63 and 0.61 for the SR and MSR indices, respectively, indicating excellent predictive potential. However, the top-down view possesses unique advantages in capturing spectral features that reflect canopy physiological status; the correlation coefficients for NDRE and MTCI under this view reached 0.57 and 0.58, respectively, significantly outperforming the oblique view. The correlations of GNDVI, GWDRVI, and SV_CI_green under the top-down view also reached 0.52. The above analysis indicates that the side-view perspective excels at characterizing the three-dimensional morphology and textural attributes of plants, whereas the top-down view is more sensitive to spectral physiological responses at the top of the canopy. The organic integration of these two perspectives provides a scientific basis for constructing high-precision multimodal data-prediction models.
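Several of the indices named above can be written compactly from band reflectances. The sketch below uses the commonly cited formulations of SR, MSR, NDRE, MTCI, GNDVI, and CI_green; the exact band centers and index definitions adopted in this study are not given in this section, so the formulas and the reflectance values are assumptions for illustration.

```python
import math


def vegetation_indices(green, red, red_edge, nir):
    """Common multispectral indices from band reflectances (0-1 scale)."""
    sr = nir / red
    return {
        "SR": sr,                                            # simple ratio
        "MSR": (sr - 1) / (math.sqrt(sr) + 1),               # modified simple ratio
        "NDRE": (nir - red_edge) / (nir + red_edge),         # normalized difference red edge
        "MTCI": (nir - red_edge) / (red_edge - red),         # MERIS terrestrial chlorophyll index
        "GNDVI": (nir - green) / (nir + green),              # green NDVI
        "CI_green": nir / green - 1,                         # green chlorophyll index
    }


# Hypothetical canopy reflectances for a healthy leaf surface.
indices = vegetation_indices(green=0.10, red=0.05, red_edge=0.20, nir=0.45)
for name, value in indices.items():
    print(f"{name}: {value:.3f}")
```

In a multi-view setting, the same function would be applied to the band values extracted from each view, giving view-specific variables such as the SV_ and TV_ prefixed indices referred to in the feature analysis.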
3.3. Feature Selection and Multicollinearity Test Results
Based on a comprehensive evaluation using Pearson’s correlation coefficient and the variance inflation factor (VIF), this study selected nine optimal features from the 66 original feature variables and constructed the final feature space for fresh weight prediction.
Table 7 lists the final variables selected through correlation and VIF screening.
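A VIF screen of this kind can be sketched as follows. The example relies on the identity that the VIF of each feature equals the corresponding diagonal element of the inverse correlation matrix, and it removes the feature with the largest VIF until all values fall below a threshold. The threshold of 10 is a common rule of thumb assumed here, and the feature names and values are hypothetical; the screening criteria actually used in this study are not stated in this section.

```python
import math


def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting (small dense matrices only)."""
    n = len(matrix)
    aug = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(aug[k][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for k in range(n):
            if k != col:
                factor = aug[k][col]
                aug[k] = [a - factor * b for a, b in zip(aug[k], aug[col])]
    return [row[n:] for row in aug]


def vif_scores(features):
    """VIF per feature: diagonal of the inverse correlation matrix."""
    names = list(features)
    corr = [[pearson_r(features[a], features[b]) for b in names] for a in names]
    inverse = invert(corr)
    return {name: inverse[i][i] for i, name in enumerate(names)}


def drop_collinear(features, threshold=10.0):
    """Iteratively drop the feature with the highest VIF above `threshold`."""
    kept = dict(features)
    while len(kept) > 1:
        scores = vif_scores(kept)
        worst = max(scores, key=scores.get)
        if scores[worst] <= threshold:
            break
        del kept[worst]
    return list(kept)


# Hypothetical features: two nearly identical size measures and one color index.
features = {
    "hull_area": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "hull_area_copy": [1.01, 1.98, 3.015, 3.99, 5.02, 5.985],
    "color_index": [3.0, 1.0, 4.0, 1.0, 5.0, 2.0],
}
selected = drop_collinear(features)
print(selected)
```

Exactly collinear columns would make the correlation matrix singular, so a production implementation would need to guard against that case; here one of the two near-duplicate size features is removed and the weakly correlated color index is retained.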
3.4. Comparative Analysis of Fresh-Weight Prediction Models
To evaluate the predictive accuracy of fresh weight in lettuce, this study selected six machine learning algorithms for modeling and analysis: support vector regression (SVR), random forest regression (RFR), gradient boosted decision tree regression (GBDT), K-nearest neighbor regression (KNN), Extreme Gradient Boosting (XGBoost), and backpropagation neural networks (BPNNs). A baseline model was constructed using only morphological features as primary features; subsequently, RGB color indices were incorporated to investigate the complementary effect of visible light bands on spectral information; thereafter, textural features were introduced to correct for canopy geometric effects by utilizing microstructural information; finally, multispectral vegetation indices were introduced. Through an incremental feature fusion strategy, the study systematically examined the performance gains of multimodal data on the model. The regression evaluation results under different feature combinations, using four assessment metrics, namely the coefficient of determination (R2), root mean square error (RMSE), normalized root mean square error (RMSEn), and mean absolute error (MAE), are shown in Figure 4.
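The four metrics can be computed as in the following sketch. The normalization used for RMSEn is an assumption: this section does not define it, and both range-based and mean-based denominators are common. The range of the observed values is used here because the reported RMSE of 3.23 g and RMSEn of 5.60% imply a denominator of roughly 57.7 g, which is closer to the fresh-weight range than to the mean, but this remains an inference. The observed and predicted values are hypothetical.

```python
import math


def regression_metrics(observed, predicted):
    """R2, RMSE, range-normalized RMSE (%), and MAE for paired values."""
    n = len(observed)
    mean_obs = sum(observed) / n
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    rmse = math.sqrt(ss_res / n)
    return {
        "R2": 1 - ss_res / ss_tot,
        "RMSE": rmse,
        # Assumption: normalize by the observed range, expressed in percent.
        "RMSEn": rmse / (max(observed) - min(observed)) * 100,
        "MAE": sum(abs(o - p) for o, p in zip(observed, predicted)) / n,
    }


# Hypothetical measured and predicted fresh weights (g).
observed = [30.0, 40.0, 50.0, 60.0]
predicted = [32.0, 38.0, 53.0, 59.0]
metrics = regression_metrics(observed, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```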
The RFR model demonstrated excellent fitting capabilities across all feature combinations, with R2 values consistently ranging between 0.80 and 0.82, indicating robust overall predictive performance. When using only morphological features as input, the RFR model achieved an R2 of 0.80, an RMSE of 5.27 g, and an MAE of 3.95 g on the test set, indicating that morphological features could adequately reflect the variation patterns in the fresh weight of lettuce. After introducing RGB color features, the model’s R2 value on the test set increased slightly to 0.81, while the improvement in RMSE was not significant, with an MAE of 3.90 g, indicating that color features had a limited effect on enhancing the performance of the RFR model. After further integrating texture features, the R2 value on the test set increased to 0.82, the RMSE decreased to 5.09 g, and the MAE decreased to 3.65 g, indicating that texture information could supplement plant structural details, thereby improving prediction accuracy. When multispectral features were integrated to form a complete feature set, the testing error of the model further decreased, with an MAE of 3.61 g; however, the rate of performance improvement gradually leveled off.
The SVR model demonstrated excellent fitting capability and stable generalization performance under various feature combinations, particularly under multimodal feature fusion conditions. When using only morphological features, the SVR model achieved an R2 of 0.84, an RMSE of 4.72 g, and an MAE of 3.51 g on the test set, with prediction accuracy significantly higher than that of the RFR model, indicating that SVR can effectively capture the nonlinear relationship between morphological features and fresh weight. After introducing RGB color features, the R2 on the test set increased to 0.86, the RMSE decreased to 4.39 g, and the MAE decreased to 3.24 g, indicating that color information plays a significant complementary role in the SVR model. After further integrating texture features, the model maintained a stable predictive performance on the test set, with R2 = 0.85, RMSE = 4.66 g, and MAE = 2.92 g, showing no obvious signs of overfitting. Under the full feature combination, the SVR model achieved the best prediction results: R2 = 0.93, RMSE = 3.23 g, RMSEn = 5.60%, and MAE = 2.31 g on the test set, significantly outperforming other models.
When using only morphological features, the GBDT model achieved an R2 value of 0.77, an RMSE of 5.64 g, and an MAE of 4.27 g on the test set. After introducing RGB color features, the R2 value on the test set slightly improved to 0.79, and the MAE to 3.95 g; however, the model’s overall predictive performance remained limited. Upon further integration of texture and multispectral features, the model performance on the test set declined, with R2 dropping to 0.72 and 0.70, respectively, while RMSE increased significantly, with MAE values of 3.55 g and 4.13 g, respectively. This indicates that, with a limited sample size, the GBDT model is sensitive to high-dimensional features, faces a significant risk of overfitting, and requires improvement in its generalization ability.
The XGBoost model maintains high training accuracy while demonstrating good predictive stability on the test set. When using only morphological features, the model achieved an R2 of 0.72 and an MAE of 4.68 g on the test set, indicating relatively average predictive performance. After introducing RGB color features, the R2 on the test set significantly improved to 0.85, whereas the RMSE decreased to 4.62 g and the MAE decreased to 3.46 g, indicating that color features significantly enhanced model performance. After further integrating texture features, the model’s R2 on the test set increased to 0.88, with the RMSE further decreasing to 4.10 g and the MAE to 2.85 g, achieving the model’s optimal predictive performance. Under the full feature combination, the model’s predictive performance declined slightly but remained at a high level, with R2 = 0.87 and an MAE of 2.87 g.
The KNN model exhibited a perfect fit for the training set, with R2 values reaching 1 for all feature combinations. However, the R2 values of the model for the test set were generally low, and its prediction errors were large, indicating that the model was highly dependent on the training samples and had limited generalization ability. On the training set, the predicted values of the KNN model aligned almost perfectly with the 1:1 reference line, indicating a strong local fitting capability for the training samples. However, in the regression scatter plot of the test set, the predicted points showed a clear divergence trend, with some samples deviating significantly from the 1:1 reference line. In particular, large prediction errors were observed in the medium-to-high fresh weight range, revealing distinct dispersion characteristics. The mean absolute error (MAE) for the morphological feature set was 5.12 g; after incorporating RGB features, the MAE was 3.75 g; after integrating texture features, the MAE was 3.58 g; and with the full feature combination, the MAE was 3.38 g. These results indicate that when the sample size is limited and the feature dimension is high, the KNN model struggles to make stable predictions for unseen samples.
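A training R2 of exactly 1 is what a nearest-neighbor regressor produces when each training sample is effectively predicted from itself, for example with k = 1 or with distance weighting. The sketch below illustrates this with a minimal k-nearest-neighbor regressor on synthetic data; the neighbor count and weighting actually used in this study are not reported in this section, so the configuration is an assumption.

```python
def knn_predict(train_x, train_y, query, k=1):
    """Unweighted k-nearest-neighbor regression with squared Euclidean distance."""
    order = sorted(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], query)),
    )
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / k


def r2_score(observed, predicted):
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


# Synthetic training data: one feature and fresh weight (g).
train_x = [[1.0], [2.0], [3.0], [4.0]]
train_y = [10.0, 25.0, 28.0, 45.0]

# Predicting the training samples themselves.
fit_k1 = [knn_predict(train_x, train_y, x, k=1) for x in train_x]
fit_k3 = [knn_predict(train_x, train_y, x, k=3) for x in train_x]

train_r2_k1 = r2_score(train_y, fit_k1)
train_r2_k3 = r2_score(train_y, fit_k3)
print(f"training R2 with k=1: {train_r2_k1:.3f}, with k=3: {train_r2_k3:.3f}")
```

With k = 1 every training point is its own nearest neighbor, so the training fit is perfect regardless of how well the model generalizes; the training R2 therefore carries no information about test performance.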
The overall prediction performance of the BPNN model improved gradually with an increasing feature dimension, indicating that multimodal data fusion has a positive effect on neural network models. The MAE for morphological features alone was 3.95 g; after adding RGB features, the MAE decreased to 3.44 g; after integrating texture features, the MAE was 3.27 g; and with the full feature combination, the MAE reached 2.81 g. However, the prediction accuracy of this model on the test set consistently remained lower than that of the SVR and XGBoost models.
3.5. Cross-Model Comparison of Machine-Learning Algorithms
Figure 5 illustrates the differences in the fresh-weight prediction performance of the six ML models under the four-stage incremental feature combination. The features are introduced in stages: morphological features (MFs), color indices (CIs), texture indices (TIs), and multispectral vegetation indices (VIs). Analysis of multiple metrics, including R2, RMSE, MAE, and RMSEn, shows how feature increments and model adaptability influence prediction performance.
When using only morphological features, each model can achieve a basic estimation of fresh weight based on the three-dimensional structural information of the plant; however, the performance varies considerably. Among them, the SVR model performed best, with R2 = 0.84, RMSE = 4.72 g, and MAE = 3.51 g, effectively capturing the nonlinear relationship between morphological features and fresh weight. The BPNN and RFR models demonstrated good stability, with R2 values of 0.81 and 0.80, respectively, and MAE values of 3.95 g each. The XGBoost and GBDT models exhibited relatively lower prediction accuracy, with R2 values of 0.72 and 0.77, respectively, and MAE values of 4.68 g and 4.27 g, respectively. The KNN model had the highest error and poorest performance, with an R2 of only 0.69 and an MAE as high as 5.12 g. This indicates that morphological features are the core foundation for estimating the fresh weight of lettuce; however, the models differ considerably in their ability to extract information from structural features alone.
After introducing RGB color features, the overall prediction accuracy of all models improved, and the errors decreased slightly. The SVR model showed significant gains, with R2 increasing to 0.86, RMSE decreasing to 4.39 g, and MAE decreasing to 3.24 g; the XGBoost model achieved a leap in performance, with R2 increasing from 0.72 to 0.85 and MAE decreasing to 3.46 g; the BPNN, KNN, GBDT, and RFR models all showed varying degrees of optimization, with MAE decreasing to 3.44 g, 3.75 g, 3.95 g, and 3.90 g, respectively. It was evident that RGB color features could supplement information on the plant’s visual growth status and effectively address the information gaps in single morphological features, enhancing the models’ predictive capabilities.
After further integration of the texture features, the performance of most models continued to improve, demonstrating a notable synergistic effect from the combination of multiple features. XGBoost achieved the best performance at this stage, with R2 = 0.88 and MAE = 2.85 g; SVR remained competitive, with R2 = 0.85 and MAE reduced to 2.92 g; the errors of the BPNN, RFR, and KNN models continued to decrease, with MAE falling to 3.27 g, 3.65 g, and 3.58 g, respectively; and the GBDT model showed no clear improvement, with its R2 falling to 0.72 although its MAE decreased to 3.55 g. Texture features can characterize differences in canopy surface details, complementing morphological and color features and enriching the model’s feature input dimensions.
After multispectral features were finally integrated to construct a full-dimensional feature space, the performances of the various models diverged significantly. The SVR model demonstrated a distinct advantage in handling high-dimensional features, achieving the best test-set results of all models and feature combinations, with R2 = 0.93, RMSE = 3.23 g, RMSEn = 5.60%, and MAE = 2.31 g, demonstrating strong robustness and generalization ability. XGBoost and BPNN maintained high accuracy levels, with R2 values of 0.87 and 0.85, respectively, and MAE values of 2.87 g and 2.81 g, respectively. The RFR and KNN models showed performance saturation, with no significant improvement from the added features. The GBDT model exhibited significant overfitting and degradation, with R2 dropping to 0.70 and MAE rising to 4.13 g, indicating poor adaptability to high-dimensional features and a weak generalization ability.
In summary, progressive multimodal feature fusion can sequentially supplement plant structure, growth status, and physiological information, thereby effectively improving the prediction accuracy of the fresh weight. Low-dimensional feature fusion can achieve stable gains across all models, whereas model adaptability varies significantly under high-dimensional feature scenarios. Based on a comprehensive evaluation of all indicators, the SVR model demonstrated the highest prediction accuracy, lowest error, and best stability, indicating that it is the optimal model for estimating the fresh weight of lettuce.
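The stage-by-stage comparison can be summarized by tabulating the test-set MAE values reported in Section 3.4 and selecting the lowest-error model at each stage. In the sketch below, the numerical values are taken from the text, whereas the stage labels (for example, "MF+CI") are shorthand introduced here for illustration.

```python
# Feature stages in the order they were introduced (labels are shorthand).
stages = ["MF", "MF+CI", "MF+CI+TI", "MF+CI+TI+VI"]

# Test-set MAE (g) per model and stage, as reported in the text.
reported_mae = {
    "SVR": [3.51, 3.24, 2.92, 2.31],
    "RFR": [3.95, 3.90, 3.65, 3.61],
    "GBDT": [4.27, 3.95, 3.55, 4.13],
    "XGBoost": [4.68, 3.46, 2.85, 2.87],
    "KNN": [5.12, 3.75, 3.58, 3.38],
    "BPNN": [3.95, 3.44, 3.27, 2.81],
}

# Model with the lowest MAE at each stage.
best_per_stage = {
    stage: min(reported_mae, key=lambda model: reported_mae[model][i])
    for i, stage in enumerate(stages)
}

# Change in MAE between the morphology-only and full feature sets.
mae_change = {
    model: round(values[-1] - values[0], 2) for model, values in reported_mae.items()
}

for stage, model in best_per_stage.items():
    print(f"{stage}: {model}")
print(mae_change)
```

By MAE, SVR is the best model at every stage except the texture stage, where XGBoost is marginally ahead, and every model has a lower MAE with the full feature set than with morphological features alone.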
3.6. SHAP-Based Feature Contribution Analysis
To elucidate how the optimal SVR model makes decisions on independent samples and to clarify the contribution of each modality’s phenotypic variables within the algorithm, this study applied the game-theoretic Kernel SHAP explainability framework and conducted the quantitative analysis using only the 30% independent test set. Because this held-out test set was never involved in model training or the hyperparameter grid search, and thus constitutes data entirely unseen by the model, the marginal contributions quantified on it provide an unbiased assessment of how the features contribute to generalization. The contribution weights quantified by the SHAP framework essentially constitute an a posteriori verification of the mapping fitted by the supervised machine learning algorithm, reflecting the marginal variation in the predicted output contributed by each variable in the construction of the decision hyperplane.
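The marginal contributions that SHAP reports are Shapley values. The sketch below computes them exactly by enumerating all feature coalitions for a toy model, replacing absent features with baseline values. It is not Kernel SHAP itself, which approximates the same quantities by weighted regression over sampled coalitions and a background dataset, and the toy model, sample, and baseline are hypothetical rather than the fitted SVR.

```python
from itertools import combinations
from math import factorial, isclose


def shapley_values(model, sample, baseline):
    """Exact Shapley values; absent features take their baseline value."""
    n = len(sample)

    def value(subset):
        mixed = [sample[i] if i in subset else baseline[i] for i in range(n)]
        return model(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_model(x):
    """Hypothetical predictor: one additive term and one interaction term."""
    return 10.0 + 2.0 * x[0] + 3.0 * x[1] * x[2]


sample = [5.0, 2.0, 1.0]
baseline = [1.0, 0.0, 0.0]
phi = shapley_values(toy_model, sample, baseline)
print(phi)

# Efficiency property: contributions sum to f(sample) - f(baseline).
assert isclose(sum(phi), toy_model(sample) - toy_model(baseline))
```

The additive feature receives its full effect, while the interaction effect is split equally between the two interacting features. The efficiency property checked at the end is what allows SHAP values to be read as a decomposition of each prediction relative to a baseline.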
As evidenced by the global feature importance calculated on the independent test set (Figure 6A), the decision-making of the optimal SVR model exhibits distinct characteristics of “shape dominance, with multidimensional synergy among multispectral and color indices.” The point cloud convex hull surface area (PCCHSA), which characterizes the scale of the three-dimensional geometric envelope of lettuce plants, holds a clearly dominant position in the test set decisions, with the largest marginal contribution by a wide margin. This post hoc analysis indicates that the overall baseline for fresh weight estimation of lettuce depends heavily on strong morphological and dimensional variables. However, relying solely on morphological features can easily lead to fitting biases in algorithms lacking internal regularization constraints, as they fail to suppress spatial noise at the edges of the greenhouse, resulting in severe local overfitting within small sample spaces. The support vector regression (SVR) model selected in this study, through parameter regularization and margin maximization, stably captures the dominant contribution of PCCHSA while integrating the two-dimensional morphological feature projected circumcircle radius (PCCR), the green–red ratio index (GRRI), and multispectral physiological features from the oblique and top views (such as SV_SR, SV_CI_green, and TV_CARI). Although these multimodal features, which combine spectral, color, and morphological information, have limited global marginal contributions, they provide fine-grained physiological state corrections and complementary information within the algorithm’s hyperplane. This suggests that multimodal fusion improves generalization stability compared with predictions based on morphological features alone.
As shown in the distribution of feature influences on the test set (Figure 6B), a clear nonlinear relationship emerges between the feature values and model outputs. The distribution of SHAP values for PCCHSA in the test set samples is strongly skewed to the right, with high-value points densely clustered within the positive contribution range, indicating that when the convex hull surface area of the lettuce plant exceeds a baseline threshold, the predicted fresh weight exhibits a pronounced positive response. The SHAP values of the multispectral features (such as SV_SR and TV_CARI) and the RGB color index (GRRI) were relatively symmetrical and concentrated on both sides of zero, with high and low feature values overlapping in localized regions. This indicates that, on the 30% independent test set, the model uses spectral physiological features and color indices mainly for fine-tuning: starting from the biomass baseline determined by morphology, it adjusts the prediction in either direction according to the physiological activity reflected in multispectral reflectance and the stress-induced pigment differences manifested in the RGB color space.
In summary, by conducting a SHAP analysis on an independently reserved test set, this study elucidates the direction and magnitude of each modal feature’s contribution to the model output from a post hoc perspective. The SVR model does not distribute equal weights across phenotypic variables; rather, it uses three-dimensional morphological features as the core driver, supplemented by multispectral physiological features and RGB color indices for fine-tuning. These findings are consistent with the test set coefficient of determination (R2) and provide an internal logical basis for evaluating the generalization behavior and robustness optimization of multimodal phenotypic inversion techniques in complex, uncontrolled facility environments.
3.7. Independent Sample Trial of the Optimal Model for Fresh Weight of Lettuce
To further validate the generalization performance of the optimal model on independent samples, we conducted field validation trials on independent batches at a Venlo-type greenhouse at Jiangsu University (32.2° N, 119.5° E). The test material comprised Italian bolt-resistant lettuce. Cultivation methods and nutrient solution management conditions were consistent with those used in the modeling experiments, with standardized irrigation using a modified Hoagland nutrient solution. Once the lettuce reached maturity, phenotypic data were collected, and destructive fresh weight measurements were taken to construct an independent validation dataset.
The validation process strictly followed the previous modeling workflow: the collected phenotypic traits were input into the trained optimal prediction model to obtain fresh-weight predictions, which were then compared and analyzed against the simultaneously obtained experimental data. Ninety lettuce samples were collected for this independent validation. The results of the fit analysis between the model predictions and experimental values based on the independent validation set are shown in
Figure 7. The model predictions showed a significant linear correlation with the measured values, with a coefficient of determination (
R2) of 0.86 and a root mean square error (RMSE) of 3.36 g. The prediction error fell within a reasonable range, indicating that the optimal model possessed good generalization ability under controlled cultivation conditions. This validated the feasibility and reliability of this modeling method for the non-destructive prediction of fresh weight in mature lettuce.
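For reference, the two evaluation metrics can be computed as in the sketch below. It assumes the common definition of R2 as one minus the ratio of residual to total sum of squares (some studies instead report the squared Pearson correlation), and the measured and predicted values are hypothetical, not data from this trial.

```python
from math import sqrt


def r2_and_rmse(measured, predicted):
    """Return (R2, RMSE) for paired measured and predicted values."""
    n = len(measured)
    mean_y = sum(measured) / n
    ss_res = sum((y - p) ** 2 for y, p in zip(measured, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in measured)
    return 1.0 - ss_res / ss_tot, sqrt(ss_res / n)


# Hypothetical fresh weights in grams, for illustration only
measured = [100.0, 110.0, 120.0, 130.0, 140.0]
predicted = [102.0, 108.0, 121.0, 127.0, 142.0]
r2, rmse = r2_and_rmse(measured, predicted)
print(f"R2 = {r2:.3f}, RMSE = {rmse:.2f} g")
```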
4. Discussion
4.1. Association Mechanism Between Multimodal Features and Lettuce Fresh Weight
The phenotypic characteristics of a plant’s aboveground parts serve as the direct physical basis for biomass accumulation, and their precise characterization plays a decisive role in predicting the fresh weight. Multimodal phenotypic traits, encompassing structural, physiological, color, and textural dimensions, collectively regulate the spatial configuration of photosynthetic organs, light capture efficiency, and rates of material synthesis. Through phenotypic plasticity, these traits respond to environmental changes, influencing biomass allocation patterns and accumulation levels. Three-dimensional morphological traits, such as PCCHSA and PER_W, directly reflect a plant’s ability to occupy space within the canopy, its potential for vertical expansion, and the overall complexity of its structure. These traits determine the efficiency of light radiation transmission, interception, and utilization within the canopy and represent the most critical physical factors driving fresh weight in lettuce [
63]. The more developed the canopy structure and the more fully it expands spatially, the larger the plant’s photosynthetically active area becomes, and the faster the rate of fresh weight accumulation [
64].
Color indices, such as the GRRI, can sensitively characterize chlorophyll content, photosynthetic efficiency, nitrogen nutrient levels, and plant health, providing critical physiological information to complement fresh weight predictions. This enables models to capture the contribution of internal physiological changes to fresh weight, rather than relying solely on external morphology. Multispectral indices, such as the SV_CI_green index, can accurately distinguish between plant growth health and senescence levels, rapidly indicating differences in overall growth vigor. They provide an intuitive basis for the visual assessment of fresh weight gradients in lettuce and serve as a crucial bridge linking external appearance to internal growth status [
65]. RGB textural features analyze crop surface uniformity, complexity, and detail heterogeneity to reflect plant growth consistency, leaf arrangement patterns, and canopy compactness. They serve to refine and correct local variations that are difficult to capture through morphological and spectral features, further enhancing the model’s ability to distinguish fresh weight variations across different individuals and growth stages. The four categories of multimodal features—morphology, spectroscopy, color, and texture—are not mutually exclusive. Instead, they collectively describe the formation and accumulation patterns of lettuce fresh weight across four levels: structural volume, physiological vitality, visual color, and surface heterogeneity. These features complement and reinforce one another, jointly constructing a comprehensive, systematic, and biologically meaningful phenotypic characterization system [
66].
4.2. Analysis of Multimodal Feature Selection and Modeling Strategy
The accuracy and reliability of biomass estimation depend heavily on the scientific selection of feature variables. Establishing a rigorous and efficient feature selection framework is a core prerequisite for improving model learning efficiency, enhancing prediction stability, and ensuring biological interpretability. The multimodal feature set constructed in this study encompasses four major categories of indicators: 3D morphology, spectral physiology, color, and texture. Features derived from different modalities but calculated from the same source data generally exhibit strong linear correlations and information redundancy; in particular, canopy structure parameters extracted from 3D point cloud reconstructions are highly prone to severe multicollinearity issues because of the nature of data generation from the same source. If all raw features are directly fed into the model for training, this will not only significantly increase computational redundancy and reduce model runtime efficiency but also cause bias and distortion in regression parameter estimates, amplify the risk of model overfitting, and simultaneously obscure and dilute the true driving effects of key core phenotypes on fresh weight. Ultimately, this will weaken the model’s generalization ability across samples and growth stages, as well as the interpretability of the results.
To effectively address the issue of multicollinearity among multimodal features and improve the quality of model inputs, this study employs a two-step screening strategy that combines Pearson correlation analysis with the variance inflation factor (VIF) to achieve standardized dimensionality reduction of high-dimensional phenotypic data. First, Pearson correlation analysis is used to identify and eliminate highly redundant variable combinations within the feature set, thereby reducing information overlap. Subsequently, the VIF test is used to quantitatively assess the strength of multicollinearity among the remaining variables and to eliminate indicators with excessively high VIF values that could compromise model stability. This screening process is grounded in statistical rigor while retaining key phenotypic features with clear biological significance. It avoids the loss of valuable information that can result from simple exclusion or subjective selection, ensuring that the final feature set input into the model possesses independence, representativeness, and interpretability.
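The two-step screening can be sketched as follows. This is an illustrative implementation under stated assumptions, not the exact procedure of this study: the thresholds (|r| > 0.9 and VIF > 10), the rule of keeping the member of a redundant pair that correlates more strongly with the target, and the toy feature names and values are all hypothetical. The VIF of each feature is taken from the diagonal of the inverse correlation matrix, which is equivalent to 1/(1 - R2) from regressing that feature on the others.

```python
from math import sqrt


def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(va * vb)


def invert(m):
    """Gauss-Jordan inverse with partial pivoting (small matrices only).

    Raises ZeroDivisionError if the matrix is singular, i.e. if an exact
    linear dependence among the features survives step 1.
    """
    n = len(m)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(features):
    """VIF per feature: diagonal of the inverse correlation matrix."""
    names = list(features)
    corr = [[pearson(features[a], features[b]) for b in names] for a in names]
    inv = invert(corr)
    return {name: inv[i][i] for i, name in enumerate(names)}


def two_step_screen(features, target, r_max=0.9, vif_max=10.0):
    kept = dict(features)
    names = list(kept)
    # Step 1: for each highly correlated pair, drop the feature that is
    # less correlated with the target (assumed tie-breaking rule).
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if a in kept and b in kept and abs(pearson(kept[a], kept[b])) > r_max:
                weaker = min((a, b), key=lambda k: abs(pearson(kept[k], target)))
                del kept[weaker]
    # Step 2: repeatedly drop the feature with the largest VIF above vif_max.
    while len(kept) > 1:
        scores = vif(kept)
        worst = max(scores, key=scores.get)
        if scores[worst] <= vif_max:
            break
        del kept[worst]
    return list(kept)


# Hypothetical toy data: "a_copy" nearly duplicates "a"; "a_plus_b" is close
# to a + b, so it passes the pairwise check but has a very high VIF.
features = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "a_copy": [2.1, 3.9, 6.2, 7.8, 10.1, 11.9, 14.2, 15.8],
    "b": [5.0, 3.0, 6.0, 2.0, 7.0, 1.0, 8.0, 4.0],
    "a_plus_b": [6.5, 5.0, 9.0, 5.5, 11.5, 7.0, 15.0, 12.5],
}
target = [15.0, 25.0, 35.0, 45.0, 55.0, 65.0, 75.0, 85.0]
selected = two_step_screen(features, target)
print(selected)
```

The toy data illustrate why both steps are needed: the pairwise Pearson check removes the near-duplicate feature, whereas the feature that is a near-linear combination of two others is only caught by the VIF step.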
To further elucidate the contribution mechanisms of each selected feature to fresh weight prediction, this study utilized the SHAP method to quantify and visually rank the contribution of the screened multimodal features. In terms of the global importance ranking, the three core features retained through the two-step screening process (PCCHSA, GRRI, and PER_W) consistently ranked in the top three, playing a dominant role in predicting lettuce fresh weight. They were followed by SV_SR, ENT, and PCCR, while spectral color and low-level texture features contributed the least. Among these, three-dimensional morphological features, such as PCCHSA, can accurately characterize the plant canopy’s horizontal coverage, three-dimensional spatial expansion capacity, and canopy compactness, directly determining the total amount of light energy captured by the plant and the population’s resource utilization efficiency, and are highly coupled with biomass accumulation and fresh weight formation; color features, such as GRRI, can sensitively characterize plant chlorophyll levels, photosynthetic physiological activity, and nutritional status; two-dimensional morphological features, such as PER_W, supplement the scale of lateral canopy growth; and color and texture features depict the apparent color and surface heterogeneity of the canopy. These four categories of data complement each other’s strengths and collaborate in modeling, reflecting the complex synergistic mechanisms of multimodal features.
In summary, the multi-stage feature selection method based on Pearson correlation coefficients and VIF tests can significantly reduce multicollinearity and information redundancy in multimodal data. While streamlining the model’s input dimensions, it effectively retains phenotypic features that play a key role in driving fresh weight. The optimized feature set constructed through this screening process supports high-precision, robust fresh weight prediction models and provides quantitative, interpretable phenotypic indicators for greenhouse lettuce cultivation management, population structure optimization, and the early, efficient screening of breeding materials, offering significant theoretical reference value and potential for practical application.
The six machine learning algorithms selected in this study exhibited significant performance differences when processing the pre-screened multimodal features, which is closely related to the adaptability of different algorithm architectures to small-sample, high-dimensional data. Among them, the K-Nearest Neighbors (KNN) model achieved extremely high fitting accuracy on the training set, but its prediction error increased significantly on the test set. This performance suggests that instance-based learning algorithms, which lack internal regularization constraints and rely solely on spatial distance for prediction, are highly prone to overfitting local training noise when processing complex nonlinear plant phenotypic data, thereby losing their generalization ability. In stark contrast, models incorporating parameter penalties or ensemble learning mechanisms demonstrated strong robustness. Preferred models, represented by Support Vector Regression (SVR), maintained good continuity and stability in error metrics across the training and test sets. This is primarily because SVR constructs optimal decision boundaries based on support vectors, and because the data were Z-score standardized prior to modeling. Standardization eliminates dimensional differences among features of different modalities and prevents features with excessively large absolute values from dominating model training, thereby achieving better global approximation.
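The effect of Z-score standardization on features with very different scales can be sketched as follows. The sketch assumes that the mean and standard deviation are estimated on the training set only and then reused unchanged for the test set; the two toy features and their values are hypothetical.

```python
from math import sqrt


def fit_zscore(train_rows):
    """Per-feature mean and population standard deviation of the training rows.

    Assumes every feature varies in the training set (non-zero std).
    """
    n = len(train_rows)
    cols = list(zip(*train_rows))
    means = [sum(c) / n for c in cols]
    stds = [sqrt(sum((v - m) ** 2 for v in c) / n) for c, m in zip(cols, means)]
    return means, stds


def apply_zscore(rows, means, stds):
    """Standardize rows with previously fitted training statistics."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


# Hypothetical features on very different scales:
# column 0 is an area-like quantity, column 1 a color-index-like ratio.
train = [[100.0, 0.2], [200.0, 0.4], [300.0, 0.6]]
test = [[250.0, 0.5]]

means, stds = fit_zscore(train)
train_z = apply_zscore(train, means, stds)
test_z = apply_zscore(test, means, stds)
print(test_z)
```

Before scaling, a Euclidean distance between two samples would be determined almost entirely by the first column; after scaling, both columns contribute on a comparable footing, which matters for distance-based and kernel-based models alike.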
From the perspective of big data requirements in traditional machine learning, the 120 lettuce samples used in this study indeed constitute a relatively small sample size, which is an objective reason for the slight discrepancy in performance between the training and testing sets for some core models. However, due to constraints such as controlled-environment cultivation, destructive phenotypic data collection, and intensive physicochemical measurements, individual phenotypic experiments in the field of agronomy are often subject to dual spatial and temporal limitations, making it difficult to accumulate large-scale samples in the short term. According to statistical learning theory, a model’s generalization ability depends not only on the absolute number of samples but also on the relative ratio of independent observations to the number of input features. In this study, Pearson correlation analysis and VIF tests were used to reduce the initial 66 input variables to 9 core variables, increasing the ratio of effective samples to features to over 13:1. This mathematical structure satisfies the basic requirements for constructing stable, non-random decision boundaries in shallow machine learning models. Furthermore, this study introduced a temporally independent external batch dataset and conducted a blind validation while maintaining identical preprocessing parameters and optimal model structure. The model maintained high predictive accuracy on entirely new, independent samples. This result confirms that, through rigorous feature selection, scaling, and regularization control, it is feasible and reliable to construct a high-precision plant phenotype estimation model with cross-batch and cross-temporal generalization capabilities within the controlled sample space of 120 plants.
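The sample-to-feature ratio quoted above can be checked with simple arithmetic. The sketch below uses the counts stated in the text (120 samples, 66 initial variables, 9 retained variables); the 70/30 split is an assumption inferred from the mention of a 30% test set, and the training-only ratio is shown for comparison only.

```python
n_samples = 120        # total plants, as stated in the text
n_initial = 66         # input variables before screening
n_selected = 9         # variables retained after Pearson + VIF screening
test_fraction = 0.30   # assumption: 30% of samples held out for testing

ratio_before = n_samples / n_initial        # all samples, before screening
ratio_after = n_samples / n_selected        # all samples, after screening
n_train = round(n_samples * (1 - test_fraction))
ratio_train = n_train / n_selected          # training samples only

print(f"before screening: {ratio_before:.2f}:1")
print(f"after screening:  {ratio_after:.2f}:1")
print(f"training only ({n_train} samples): {ratio_train:.2f}:1")
```

With all 120 samples the ratio after screening is about 13.3:1, in line with the figure quoted above; if only the assumed training portion is counted, it is about 9.3:1.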
4.3. Analysis of the Impact and Applications of the Greenhouse Microenvironment
Spatial heterogeneity within controlled-environment greenhouses and the resulting microclimate gradients are key factors influencing the precise analysis of crop multimodal phenotypes. Areas near exterior walls, evaporative cooling systems, or entrances within greenhouses commonly exhibit uneven ventilation rates, attenuated sunlight, and localized shading. These variations in local total radiation directly trigger morphological plasticity responses in crops. For example, in locations with excessive direct sunlight or uneven local light distribution, lettuce often optimizes light capture by adjusting internode elongation and leaf spread angles, leading to abnormal increases in plant height or heterogeneous changes in canopy geometry. From the perspective of decision mechanisms in statistical modeling and machine learning algorithms, such spatial interference introduces underlying fitting biases for algorithms based on different mathematical architectures. For algorithms such as K-nearest neighbors (KNN), which lack internal regularization constraints and rely solely on distances in the multidimensional feature space for instance-based learning, the introduction of spatial noise is catastrophic. Once plants at certain locations exhibit abnormal height or color shifts due to excessive sunlight absorption, the KNN algorithm’s neighborhood determination is directly dominated by these location-induced pseudo-features when calculating Euclidean distances in the feature space. This makes the algorithm highly susceptible to severe local overfitting within small sample spaces, resulting in a significant increase in prediction error on the test set. Superior models, such as support vector regression (SVR), possess a certain degree of mathematical robustness against nonlinear biological phenotypic noise owing to regularization, structural margin maximization, or residual iteration mechanisms. They can effectively mitigate the interference of features with extreme absolute values on the global hyperplane, demonstrating stronger generalization stability.
Furthermore, a scientific assessment of the performance loss when applying this model to in situ greenhouse production environments without controlled darkbox components serves as a crucial basis for evaluating its industrial transferability. The core function of a darkbox platform is to provide a constant light source and completely isolate the system from external stray light, thereby enabling the capture of weak crop phenotypic signals with a high signal-to-noise ratio. However, when directly deployed in actual greenhouse production rows—where lighting is variable, natural shadows are intertwined, and complex cultivation backgrounds exist (such as reflections from perlite substrates and shadows from pipes)—the model’s predictive performance is expected to decline significantly. This is due to dynamic fluctuations in natural sunlight intensity, slight plant deformations caused by fans, and reduced canopy segmentation accuracy resulting from complex backgrounds. Therefore, a thorough analysis of the spatial distribution patterns of shadows and the physical denoising mechanisms in in situ greenhouse environments has become a key breakthrough for advancing the practical application of multimodal phenotypic inversion technology. This study pioneered the establishment of mapping boundaries under controlled conditions; subsequent work will gradually bridge the gap toward in situ online application in greenhouses by introducing a canopy segmentation network with illumination robustness and a dynamic natural light correction model.
It should also be emphasized that the core scenario of the multimodal phenotypic inversion system in this study is explicitly grounded in highly controlled greenhouse environments. Modern smart greenhouses exhibit highly standardized, industrialized, and equipment-integrated operational characteristics, typically equipped with automated PLC-controlled moving rails, suspended track-based phenotyping platforms, or inspection robots. Under this closed, intensive production model, integrating high signal-to-noise ratio controlled imaging darkrooms or shading yield measurement components directly into automated inspection lines through hardware modifications is highly feasible from an engineering perspective. Therefore, this study’s selection of a controlled greenhouse darkroom environment to establish phenotypic mapping boundaries precisely aligns with the industrial technological demands of modern facility agriculture for fully automated, non-destructive, high-precision yield measurement and precise growth regulation. By first elucidating the multimodal phenotype-driven mechanisms of lettuce fresh weight in an ideal controlled environment, this study lays a solid theoretical foundation for the next phase of solidifying this model into a standardized online detection algorithm module for factory-style smart greenhouse production lines. It also provides critical algorithmic and data support for subsequent, deeper-level digital production in smart facility agriculture, offering clear prospects for practical application in the modern, high-level facility horticulture industry.
4.4. Advantage Analysis of Multimodal Data Fusion in Lettuce Fresh Weight Prediction
Multimodal data fusion technology offers significant advantages in predicting the fresh weight of lettuce by enhancing information complementarity, improving model expressiveness, and increasing prediction stability. In terms of information representation, a single data source typically reflects only certain aspects of crop growth. For example, studies on fresh weight estimation using morphological parameters derived from 3D reconstruction primarily rely on geometric structural information, making it difficult to capture internal physiological changes within the plant [
38]. When using multispectral data or vegetation indices to characterize chlorophyll content and nitrogen status, estimates are easily disrupted by variations in canopy structure, limiting prediction accuracy [
51]. Furthermore, studies relying solely on RGB imagery tend to focus on color or texture information, resulting in a limited ability to capture complex canopy structures [
42]. Multimodal data fusion technology integrates morphological structural information with spectral physiological information, thereby enabling a transition from structural characterization to structure–physiology coupling.
Multimodal data fusion demonstrated a significant cumulative gain effect in terms of model performance. With the introduction of color, texture, and multispectral features, the prediction accuracy of each model showed a stepwise improvement trend, consistent with the conclusions of previous studies on crop biomass estimation [
46,
57]. In this study, the support vector regression (SVR) model achieved the best results after fusing multimodal data (
R2 = 0.93, RMSE = 3.23 g), significantly outperforming scenarios with single-feature inputs. This indicates that multimodal data inputs not only provide more effective features but also enhance the model’s sensitivity to key variables in high-dimensional feature spaces.
With respect to model generalization and stability, single-source data are susceptible to variations in imaging angles, lighting conditions, and individual plant differences; in contrast, multimodal data fusion enhances the model’s adaptability to complex environmental changes through information redundancy and complementarity. In a study on crop biomass estimation based on UAV multispectral imagery, Osco et al. [
67] found that models integrating multiple vegetation indices and textural features exhibited more stable predictive performance across different growth stages, with a smaller variation in the coefficient of determination (
R2) than single-feature models. This confirms the effectiveness of multimodal data fusion in enhancing the temporal stability of models [
68].
Multimodal data fusion overcomes the dimensional limitations of single data sources. By leveraging the synergy of multimodal data, it enhances the accuracy, reliability, and applicability of fresh weight prediction. Compared with traditional single-modal methods, it offers significant advantages and provides more efficient and robust technical support for the dynamic monitoring of greenhouse lettuce growth and the precise estimation of yield.
4.5. Limitations and Future Work
This study achieved high inversion accuracy and good interpretability in multimodal data fusion and fresh weight prediction for lettuce; however, owing to limitations in experimental conditions and the depth of feature extraction, including a small sample size and a single experimental design, the following limitations remain to be addressed. First, this study utilized only one lettuce variety, a single greenhouse experiment, five nitrogen levels, and 120 plants, with data collected only once at maturity. It lacks validation across multiple varieties and batches. Second, regarding data acquisition, all data in this study were collected under controlled greenhouse darkroom conditions, where the lighting was stable and the background was uniform, thereby reducing the impact of external interference on image quality. However, there is a significant gap between this idealized experimental environment and the actual application scenarios in open fields and open-air agricultural facilities. Existing research has confirmed that in real-world agricultural settings, numerous factors, including differences between RGB and near-infrared imaging, heterogeneity in plant geometry, fluctuations in light intensity, canopy shading, and image quality degradation, can significantly reduce the operational stability and generalization performance of visual phenotyping models. Consequently, advancing robust, highly transferable multimodal phenotyping modeling technologies has become a key research direction in this field [
69]. Therefore, future work should involve conducting validation trials in field or semi-open environments, expanding the sample size, and increasing the number of varieties and growth stages to assess the model’s robustness and generalizability in complex scenarios. Third, regarding feature construction and fusion methods, this study primarily employed manually designed morphological parameters, color indices, texture features, and vegetation indices for fusion analysis. Although this approach has clear physical significance and strong interpretability, information within the high-dimensional complex feature space may remain unexploited. In the future, deep learning methods could be introduced to enable automatic feature extraction and end-to-end modeling, thereby further enhancing model performance.
5. Conclusions
This study developed a model for predicting the fresh weight of lettuce by integrating multi-view morphological, color, texture, and multispectral features, thereby achieving accurate predictions of fresh weight under controlled greenhouse conditions. The results indicate that multimodal data fusion can effectively overcome the limitations of single data sources, jointly characterize plant geometry and physiological information, and improve prediction accuracy.
In terms of model performance, we compared the performance of six typical machine learning algorithms using an incremental feature fusion strategy. The results show that the SVR model performed best in the full-dimensional feature space, with a test set coefficient of determination (R2) of 0.93 and a root mean square error (RMSE) of only 3.23 g, significantly outperforming the RFR and other compared models. As color, texture, and multispectral features were progressively introduced, the performance of all models continued to improve, validating the effectiveness of multimodal data fusion in fresh-weight prediction. Feature contribution analysis further elucidated the decision-making mechanisms of the models. The SHAP explanation model indicated that the PCCHSA is the core driving factor for the fresh weight prediction of lettuce, establishing the physical foundation for fresh weight inversion; meanwhile, spectral indices and texture features play a fine-tuning role by providing information on physiological status and spatial heterogeneity. Furthermore, multi-view imaging demonstrated a distinct complementary effect among features: the side view excels at capturing the three-dimensional structure, while the top view is more sensitive to spectral responses; the fusion of these two views effectively enhanced the depth of feature representation.
In summary, this study provides a feasible approach and technical guidance for the nondestructive estimation of the fresh weight of lettuce under controlled greenhouse conditions and offers a scientific basis for the phenotypic modeling of protected-culture crops.