A Machine Learning Model for Photorespiration Response to Multi-Factors

Abstract: Photorespiration consumes a large share of leaf photosynthetic output, yet few studies have examined its response to multiple factors. In this study, a machine learning model of the photorespiration rate of cucumber leaves in response to multiple factors was established, providing a theoretical basis for photorespiration-related studies. Machine learning models built with different methods were designed and compared. The photorespiration rate was expressed as the difference between the photosynthetic rate at 2% O2 and at 21% O2. The results show that the XGBoost models had the best fit, with an explained variance score of 0.970 for both photosynthetic rate datasets (measured in air and at 2% O2), mean absolute errors of 0.327 and 0.181, root mean square errors of 1.607 and 1.469, respectively, and a coefficient of determination of 0.970 for both. In addition, this study indicates the importance of temperature, humidity, and the physiological status of the leaves for the predicted photorespiration. The model established in this study performed well, with high accuracy and generalization ability. As an early exploration of photorespiration rate simulation, it has theoretical significance and application prospects.


Introduction
Photosynthesis is a crucial factor in crop yield formation. Besides extrinsic factors such as the microclimate around the crop, including light intensity [1,2], CO2 concentration [3], temperature [4], and humidity [5], many intrinsic factors affect the rate of photosynthesis, such as photorespiration, leaf age, chlorophyll content, and genes. In particular, photorespiration is a process closely related to photosynthesis that physiologically reduces its efficiency. The enzyme ribulose-1,5-bisphosphate carboxylase/oxygenase (RuBisCO) catalyzes the addition of CO2 to ribulose-1,5-bisphosphate (RuBP), forming two molecules of 3-phosphoglycerate (3-PGA); this is called the carboxylation reaction. However, in C3 plants, part of the RuBisCO also oxygenates RuBP, producing 2-phosphoglycolate (2-PG), which is harmful to the plant and must be metabolized, thus causing waste and reducing the efficiency of photosynthesis [6]. In recent decades, many researchers have suggested the possibility of manipulating photorespiration using modern biotechnology [7][8][9][10][11], but applying biotechnological approaches in crop production remains a major challenge. Instead of genetic editing and breeding, physical methods to reduce photorespiration can produce results instantly and only require an understanding of how it changes with the environment.
Photorespiration, like photosynthesis, is influenced by the surrounding environment. Most obviously, the ambient CO2 concentration is closely related to the rate of photorespiration because O2 acts as a competitive inhibitor for RuBisCO, i.e., the higher the ambient CO2 concentration, the less photorespiration reduces photosynthetic efficiency. Like most other enzymes, the kinetic properties of RuBisCO depend strongly on temperature, and the correlation varies among plant species [12,13]. This results in generally higher rates of oxygenation and higher losses of carbon from photorespiration at higher leaf temperatures [14]. Studies have shown that the rate of photorespiration increases at high light intensities [15][16][17]. In addition, humidity also influences the rate of photorespiration. It therefore becomes possible to control the environment to reduce the consumption of photosynthesis by photorespiration. For example, in greenhouse production or a closed system, slow release of dry ice can be applied to increase the concentration of CO2 in the air.
Photosynthesis models have been of great interest to researchers for decades, since most of the dry matter of fruits originates from carbohydrates produced by photosynthesis. In contrast, photorespiration models have received little attention. It has been extensively discussed in the literature that the response of photosynthesis to a specific environmental factor varies with plant species. For photosynthesis-irradiance curves, non-rectangular hyperbolic or rectangular hyperbolic models have commonly been used. Ye [18] and Ye et al. [19] developed an alternative model with additional restriction terms to improve the fit. Lanoue et al. [20] studied the translocation of Solanum lycopersicum cv. assimilates with differing spectral quality. The Farquhar-von Caemmerer-Berry (FvCB) model was proposed over 40 years ago [21], and worldwide studies and improvements on the model have made significant progress [22][23][24]. Considerable results have accumulated from using the model to fit net assimilation rate-intercellular CO2 concentration (A-Ci) curves and to estimate photosynthetic biochemical parameters [25][26][27][28]. In the field, irradiance changes constantly, which makes static assimilation models inaccurate, and dynamic models were developed to solve this problem [29,30]. For photosynthetic response models with multiple environmental factors, mechanistic models become harder to formulate as the number of factors grows, whereas empirical models are easier to implement and have been widely studied. Li et al. [31] described the daily variation of photosynthesis with temperature and photosynthetically active radiation (PAR) in greenhouse cucumber. Peri et al. [32] studied the photosynthetic response of Berberidaceae to light intensity, water status, and leaf age.
Zhang and Wang [33] simulated canopy photosynthesis in greenhouse tomato based on simple leaf photosynthesis. While empirical models are mostly built with traditional statistical methods, a new type of approach designed to make the most accurate predictions possible, machine learning, is gradually gaining attention. Goltsev et al. [34] developed a neural network (NN) model relating chlorophyll fluorescence to the effect of water stress on leaf photosynthetic processes. Heckmann et al. [35] evaluated the potential of leaf reflectance spectroscopy to predict parameters of photosynthetic capacity in Brassica oleracea and Zea mays using a two-layer NN. Jian et al. [36] established tomato population photosynthetic rate prediction models based on support vector regression (SVR), with the correlation coefficient reaching a maximum of 0.9883. Zhang et al. [37] compared the effectiveness of several machine learning approaches in predicting photosynthetic rates from leaf phenotypes, with extreme gradient boosting (XGBoost) performing best. Although these studies have advanced the application of machine learning models of photosynthesis, few have modeled the effects of photorespiration on photosynthesis.
Despite its importance, photorespiration is often ignored and left undiscussed by those who study photosynthesis models. Although the response of photorespiration to a single environmental factor has been studied in depth [38][39][40][41], its response to multiple environmental factors has received little study. Currently, there is no effective and convenient method or evaluation index to assess the negative effects of photorespiration. Additionally, determination of the photorespiration rate is mostly based on complex physiological experiments [42], which require much time and many resources and cannot be used quickly and directly in production practice. Consequently, there is an urgent need for methods to rapidly assess the amount of photorespiration and its effect on photosynthetic rates.
Therefore, taking greenhouse cucumber (Cucumis sativus L.) as an example and building on previous machine learning models of photosynthesis, this study focused on photorespiration and established a model of its response to multiple environmental factors, thus achieving an accurate estimation of the negative effects of photorespiration. In the meantime, the effects of environmental factors on photorespiration were analyzed, and the factors requiring particular attention were identified. Machine learning methods such as polynomial regression, k-nearest neighbors (KNN), Gaussian process (GP), SVR, adaptive boosting (AdaBoost), gradient boosting decision tree (GBDT), XGBoost, and NN were used to fit the relationship between environmental factors and photorespiration, and the modeling effects of the different approaches were compared. This study investigated using machine learning methods with different environmental conditions to predict the photorespiration rate at different leaf positions in two growth stages (seedling and flowering-fruit) of cucumber. It will provide a theoretical basis for studies related to photorespiration.

Data Access
Cucumber, an annual creeping vine in the Cucurbitaceae (gourd) family, is an important cash crop widely cultivated in temperate and tropical regions around the world [43]. In this study, cucumber was used as the subject of the photorespiration model. The cultivar used in the experiment was 'Jingyou 409', a northern Chinese variety. The experiment was conducted in a plastic greenhouse at the Baima Teaching and Research Base, Nanjing Agricultural University, Nanjing, China (31°36′ N, 119°10′ E). The mono-cropped cucumber was grown twice, in spring (April to July) and fall (September to November) of 2020, at a density of 4.5 plants/m2 (Figure 1a). During the experimental period, fertigation, pest control, and environmental control measures were taken according to cucumber production routines. Plants of much the same progress and condition were selected for measurement by simple random sampling at the seedling and flowering-fruit stages. Photosynthetic parameters of cucumber leaves were measured with a LI-6400XT Portable Photosynthesis System (LI-COR, USA) (Figure 1b). To avoid the influence of midday depression on data collection, the experiment was conducted from 8:00-11:00 and 14:00-17:00 daily. The CO2 injector and LED leaf chamber modules of the LI-6400XT created the CO2 concentration and photosynthetically active radiation (PAR, substituted by photosynthetic photon flux density, PPFD) in the microclimate, respectively. CO2 concentration gradients of 200-800 ppm and PAR gradients of 0-2000 μmol m−2 s−1 were set using the AutoPrograms. Humidity and temperature were taken from the greenhouse environmental measurements at the same time. To reduce measurement error, the leaves of three adjacent cucumber plants were randomly selected for measurement under each group of environmental conditions.
Leaf positions 1, 3, 5, 7 and 9 of each cucumber plant were selected (counting from the first fully expanded individual leaf from growing point) for measurement. The location of each tested plant in the greenhouse was also recorded.
One realizable approach to quantifying photorespiration in the field is to suppress it by supplying the leaf with only 2% O2 (instead of the normal 21% O2) and measuring the increased rate of photosynthesis [42]. An air tank with only 2% O2 was connected to the air inlet of the LI-6400XT, with the line vented to the atmosphere through a T-fitting on the pipe to prevent the internal pump from fighting against the external tank pressure (Figure 1c).
A total of 4550 sets of data were obtained from these measurements, which were used to form the initial dataset. A total of seven independent variables, features (PAR, CO2 concentration, temperature, humidity, growth stage, leaf position, plant location), and one dependent variable, label (photosynthetic rate with the effect of photorespiration and photosynthetic rate without the effect of photorespiration), were included.

Data Preprocessing
Despite the stability criterion being implemented, the data logged by the AutoProgram may occasionally be incorrect, because the machine's judgment is not truly representative of the leaves' adaptation to the environment, even though the net photosynthetic rate stabilizes quite rapidly. These erroneous data are not valid for building the model and would instead reduce its accuracy. After cleaning, 4318 sets of data remained. Figure 2 shows a box plot of these data. For records with large PPFD but a low net photosynthetic rate, or low PPFD but a high net photosynthetic rate, the PPFD was treated as a missing value to reduce noise in the dataset. The missing values were then filled using a random forest (RF) regression algorithm [44]. Any regression is a process of learning from the feature matrix X (where each row x_i denotes one sample of data) and then solving for the continuous label y. This is possible because the regression algorithm assumes that some relationship y = f(X) exists between X and y; in fact, features and labels can be converted to each other. In this case, the unmissed values of PPFD were taken as the new label, and the corresponding other features together with the original label, the net photosynthetic rate, were taken as features to predict the missing PPFD values and fill them in.
RF, first introduced in an article by Breiman [45], is an ensemble learning method that uses bootstrap aggregating, which operates by constructing numerous decision trees (CART) at training time and outputting the mean prediction of the individual trees. In random forest regression, each tree is built using a deterministic algorithm by selecting a random set of variables and a random sample from the dataset.
A total of 186 sets of data with missing values were filled by RF. Because the branch nodes of the decision tree select some features (not all features), for high-dimensional data, the filled data have randomness and uncertainty, which better reflects the real distribution of the missing data, and have good accuracy. Although this will make the correlation between the features stronger, it will not affect the training of the final model in this experiment.
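The imputation procedure described above can be sketched as follows. The data here are synthetic stand-ins for the study's measurements, and the column choices (three environmental features plus the net photosynthetic rate) are illustrative only:

```python
# RF imputation sketch: treat known PPFD as the label, and the other
# features plus the original label (net photosynthetic rate) as predictors.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 200
X_other = rng.uniform(0, 1, size=(n, 3))        # stand-ins for CO2, temperature, humidity
ppfd = rng.uniform(0, 2000, size=n)             # feature containing "missing" entries
net_ps = 0.01 * ppfd + 5 * X_other[:, 0] + rng.normal(0, 0.5, n)  # original label

mask = rng.random(n) < 0.05                     # mark ~5% of PPFD values as missing
# Fit on rows where PPFD is known: other features + original label -> PPFD
X_known = np.column_stack([X_other[~mask], net_ps[~mask]])
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_known, ppfd[~mask])
# Predict the missing PPFD values and fill them in
X_missing = np.column_stack([X_other[mask], net_ps[mask]])
ppfd[mask] = rf.predict(X_missing)
```

Because RF predictions are averages over leaf values seen in training, the filled values always stay within the observed PPFD range.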
Plants at the seedling and flowering-fruit stages were measured. To make computation easier, codes must be specified for the categorical features of the fitted regression model. For the feature 'growth stage', the seedling stage was coded 0 and the flowering-fruit stage 1 in the dataset oriented to the general machine learning approaches, while the seedling stage was coded 1 and the flowering-fruit stage 2 in the dataset oriented to the neural network approach. As shown in Figure 1d, a two-dimensional Cartesian coordinate system was established on the ground plane of the greenhouse, with the southwest corner as the origin. Positive coordinates ran along the span direction (east, x-axis) and the length direction (north, y-axis). The greenhouse was thus divided into nine sections, with location labels (1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3), (3, 1), (3, 2), and (3, 3).
The range of values for each continuous feature in the dataset varies widely, and the features are in different units. This is a significant obstacle, as some algorithms are highly sensitive to feature scale, so scaling these features is imperative. In gradient- and matrix-based algorithms, such as SVR and NN, nondimensionalization can speed up the solution, while in distance-based models, such as KNN, it can improve the accuracy of the model and avoid a feature with a particularly large range of values dominating the distance calculation. A frequently used dimensionless method is standardization, which is insensitive to outliers. The standard score of a sample x is calculated as z = (x − μ)/σ, where μ is the mean of the samples and σ is their standard deviation.
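A minimal sketch of this standardization step, using scikit-learn's StandardScaler on illustrative values:

```python
# Column-wise z-score standardization: (x - mean) / std for each feature.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[200.0, 25.0],
              [500.0, 28.0],
              [800.0, 31.0]])   # e.g., CO2 concentration and temperature (illustrative)
scaler = StandardScaler()
X_std = scaler.fit_transform(X)  # each column now has mean 0 and std 1
```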

Polynomial Regression
The traditional approach to solving multivariate nonlinear regression models is to convert them into the linear form of a multiple regression model. For a vector of eight features [x1, x2, x3, ⋯, x8], the equation of the fitted linear model is:

ŷ = w0 + w1x1 + w2x2 + ⋯ + w8x8

where w0 is the bias, [w1, w2, w3, ⋯, w8] are the weights, and ŷ is the predicted value. If the features are transformed into a two-dimensional (degree-2) form, as follows:

[x1, x2, ⋯, x8] → [x1, x2, ⋯, x8, x1^2, x1x2, ⋯, x8^2]

the linear model becomes the following equation:

ŷ = w0 + w1x1 + ⋯ + w8x8 + w9x1^2 + w10x1x2 + ⋯

In fact, this is still a linear model. With the re-labeling of the features (z1 = x1, ⋯, zm = x8^2), the equation can be written:

ŷ = w0 + w1z1 + w2z2 + ⋯ + wmzm

where w0 is the bias and [z1, z2, z3, ⋯, zm] are the substitute features. Polynomial regression can fit a nonlinear dataset well, is not particularly prone to overfitting, and retains a certain degree of interpretability [47]. However, the complexity of the model increases dramatically with the degree of the feature conversion.
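The feature-expansion trick above can be sketched with scikit-learn on toy quadratic data (the pipeline components are standard scikit-learn classes, not the study's code):

```python
# Degree-2 polynomial feature expansion followed by an ordinary linear fit,
# matching the re-labeling trick described above.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(300, 2))
# A target that is nonlinear in the raw features but linear in the expanded ones
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] ** 2 + rng.normal(0, 0.01, 300)

model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
r2 = model.score(X, y)   # near 1 on this almost noise-free data
```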

K-Nearest Neighbors
The KNN is a nonparametric regression method, one of the simplest machine learning algorithms. It makes predictions by searching the historical database for data that are similar to the current observations. Unlike the general regression method, instead of finding an exact correlation between labels and features, it uses a pattern-matching algorithm to find a set of data that are similar to the input features and assigns the weighted average of the labels of these neighbors to the sample.
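A minimal distance-weighted KNN regression sketch on toy data, illustrating the weighted-average-of-neighbors prediction described above:

```python
# KNN regression: predict by the distance-weighted average of the labels
# of the k most similar historical samples.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = np.linspace(0, 10, 50).reshape(-1, 1)   # "historical database" of observations
y = np.sin(X).ravel()
knn = KNeighborsRegressor(n_neighbors=5, weights="distance").fit(X, y)
pred = knn.predict([[5.0]])   # close to sin(5.0), interpolated from neighbors
```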

Gaussian Process Regression
GP regression is a nonparametric, Bayesian approach to regression that works well on small datasets. The prediction interpolates the observations. The prediction is probabilistic (Gaussian) so that one can compute empirical confidence intervals and decide based on those if one should refit (adaptive fitting) the prediction in some region of interest.
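The probabilistic output can be illustrated with scikit-learn's GaussianProcessRegressor, which returns a standard deviation alongside each mean prediction (toy data; the kernel choice is illustrative):

```python
# GP regression returns a per-point mean and standard deviation; the
# uncertainty is small near observed points and grows away from them.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

X = np.array([[1.0], [3.0], [5.0], [6.0], [8.0]])
y = np.sin(X).ravel()
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-6).fit(X, y)
# 4.0 lies between observations; 9.5 is an extrapolation beyond the data
mean, std = gp.predict(np.array([[4.0], [9.5]]), return_std=True)
```

The growing standard deviation is what allows the adaptive refitting decision mentioned above: regions with wide confidence intervals are candidates for more data.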

Support Vector Regression
Support vector machine (SVM) is a machine learning algorithm based on the Vapnik-Chervonenkis (VC) dimension and the Structural Risk Minimization (SRM) inductive principle. Its main idea is to map linearly inseparable data from a low-dimensional space to a high-dimensional space via kernel functions in order to find a linearly separable classification surface [48]. By adopting an ε-insensitive loss function, SVM was extended to regression fitting as SVR.
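A minimal SVR sketch with an RBF kernel and ε-insensitive loss (toy data; the hyperparameters are illustrative, not the study's tuned values):

```python
# SVR: RBF kernel maps the data implicitly to a high-dimensional space;
# residuals smaller than epsilon incur no loss.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(2)
X = rng.uniform(0, 5, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.05, 200)
svr = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
score = svr.score(X, y)   # R^2 on the training data
```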

Adaptive Boosting
The AdaBoost algorithm is a meta-estimator that begins by fitting a regressor on the original dataset and then fits additional copies of the regressor on the same dataset, with the instance weights adjusted according to the error of the current prediction [49]. As such, subsequent regressors focus more on difficult cases.
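A minimal sketch using scikit-learn's AdaBoostRegressor, whose default base estimator is a shallow decision tree (toy data; settings are illustrative):

```python
# AdaBoost regression: each boosting stage reweights the training samples
# according to the error of the current ensemble.
import numpy as np
from sklearn.ensemble import AdaBoostRegressor

rng = np.random.default_rng(3)
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, 300)
ada = AdaBoostRegressor(n_estimators=100, random_state=0).fit(X, y)
score = ada.score(X, y)   # R^2 on the training data
```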

Gradient Boosting Decision Tree
The GBDT algorithm is an iterative decision tree algorithm based on boosting, an ensemble learning technique. GBDT computes the residual for each sample, trains successive weak learners (regression trees) on these residuals, and combines them into a single strong learner, i.e., the conclusions of all regression trees are accumulated to obtain the final prediction.
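The residual-fitting idea can be observed with scikit-learn's GradientBoostingRegressor: its staged predictions show the training error falling as successive trees correct the residuals of the ensemble so far (toy data):

```python
# GBDT: each new tree is fit to the residuals of the accumulated ensemble,
# so the staged training error decreases monotonically in practice.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(4)
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, 300)
gbdt = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1,
                                 max_depth=3, random_state=0).fit(X, y)
# Training MSE after 1, 2, ..., 200 trees
errors = [np.mean((y - p) ** 2) for p in gbdt.staged_predict(X)]
```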

Extreme Gradient Boosting
XGBoost is an optimized distributed gradient boosting system designed to be highly efficient, flexible, and portable [50]. The core of XGBoost is an ensemble algorithm implemented with gradient boosted trees. Unlike GBDT, the predictions of XGBoost are obtained by directly summing the leaf-node weights across all weak learners, using a second-order Taylor expansion of the loss function and including a regularization term.

Neural Network
An NN is an algorithm inspired by the neural structure of the human brain. A "neuron" in the NN is a mathematical function, i.e., a single-layer perceptron:

y = f(w·x + b)

where x is the input signal, w is the weights, b is the bias, f is the activation function, and y is the output signal.
A multilayer perceptron is an NN consisting of fully connected layers with at least one hidden layer, where the output of each hidden layer is transformed by an activation function. An NN can be trained using backpropagation (BP) and stochastic gradient descent (SGD). SGD is an optimization method used to minimize a loss function; BP is an efficient technique for computing the model gradients that SGD uses.
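A minimal multilayer perceptron sketch with scikit-learn's MLPRegressor (toy data; the architecture is illustrative, not the study's tuned network):

```python
# A small MLP: two hidden layers with ReLU activations, trained by an
# SGD-style optimizer (Adam) using backpropagated gradients.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(6)
X = rng.uniform(-1, 1, size=(400, 2))
y = X[:, 0] ** 2 + X[:, 1]
mlp = MLPRegressor(hidden_layer_sizes=(32, 32), activation="relu",
                   solver="adam", max_iter=2000, random_state=0).fit(X, y)
score = mlp.score(X, y)   # R^2 on the training data
```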

Optimization Technologies
After the model is selected, its hyperparameters are also very important, as different hyperparameters affect the generalization ability of the model differently. The best hyperparameters should be chosen to maximize the accuracy of the model. Grid search is a method to optimize model performance by exhaustively traversing a given grid of parameter combinations. For the NN models, grid search was used to iterate through different numbers of hidden layers, neurons per layer, activation functions, etc.
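A minimal grid search sketch with scikit-learn's GridSearchCV, here applied to KNN hyperparameters (the grid itself is illustrative):

```python
# Grid search with cross-validation: every combination in param_grid is
# fitted and scored, and the best-scoring combination is retained.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(7)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel()
grid = GridSearchCV(KNeighborsRegressor(),
                    param_grid={"n_neighbors": [3, 5, 9],
                                "weights": ["uniform", "distance"]},
                    cv=5)
grid.fit(X, y)
best = grid.best_params_   # the winning hyperparameter combination
```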
The genetic algorithm (GA) is a randomized search technique that draws on the mechanisms of genetics and selection in biology and follows the principle of 'survival of the fittest'. The genetic algorithm simulates the evolution of an artificial population (a set of encoded parameter combinations), and through the mechanisms of selection, crossover, and mutation, a set of candidate individuals is retained in each iteration and the process is repeated. After several generations of evolution, the population ideally reaches a near-optimal state, i.e., the optimal solution is obtained. GA was used for hyperparameter optimization of all machine learning models except the NN models.

Performance Evaluation
Generalization error, an inherent property of a learning method, describes the predictive power of a model on unknown data. Generalization error can certainly guide model improvement, but measuring it directly is costly and inefficient, so an alternative is to estimate it from existing data. Despite the data preprocessing described above, noise in the dataset cannot be avoided, i.e., the fit on the training data does not represent the generalization error well, so the dataset needs to be partitioned. The entire dataset is divided into three parts: a portion used to train the model, called the training set; a portion used to tune the hyperparameters of the model, called the validation set; and a portion used to test the model, whose results approximate the generalization error, called the test set.
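The 60/20/20 partition described above can be sketched with two calls to scikit-learn's train_test_split:

```python
# First split off 40% of the data, then halve it into validation and test
# sets, yielding a 60/20/20 train/validation/test partition.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.arange(100)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)
# len(X_train), len(X_val), len(X_test) -> 60, 20, 20
```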
The accuracy of a regression model is assessed by calculating the difference between the true labels and the predicted values. There are two perspectives from which to assess regression effectiveness: whether the model predicts the correct values, and whether it captures enough information.
The root mean squared error (RMSE) is a commonly used measure of the difference between predicted and actual values. It indicates the absolute fit of the model to the data, i.e., how close the observed data points are to the model's predicted values. The mean squared error (MSE) is the average of the squared differences between the predicted and actual values. However, RMSE is more widely used than MSE to evaluate regression performance, as it has the same units as the predicted variable. The formula for RMSE is as follows:

RMSE = sqrt((1/n) Σ_{i=1..n} (y_i − ŷ_i)^2)

where n is the number of actual values, [y_1, y_2, ⋯, y_n] are the actual values, and [ŷ_1, ŷ_2, ⋯, ŷ_n] are the predicted values.
For regression algorithms, it is not enough to explore the accuracy of the data prediction. In addition to the numerical size of the data itself, the model is expected to capture patterns in the data, such as distribution and monotonicity, information that cannot be evaluated using RMSE. To measure how well the model captures the information in the data, the coefficient of determination (R2) is used. R2 represents the proportion of the variance in the dependent variable that is explained by the independent variables. It is a scale-free score, i.e., irrespective of the values being small or large, R2 is at most one. The formula for R2 is as follows:

R2 = 1 − Σ_i (y_i − ŷ_i)^2 / Σ_i (y_i − ȳ)^2

where ŷ_i is the predicted value and ȳ is the mean of the actual values.
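Both metrics can be computed directly from predictions; a small worked example consistent with the formulas above:

```python
# RMSE and R^2 for a toy set of predictions.
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.0])
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # sqrt(0.125) ≈ 0.3536
r2 = r2_score(y_true, y_pred)                       # 1 - 0.5/20 = 0.975
```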

Performance of the Models
This article divided the dataset using five-fold cross-validation and used genetic algorithm (GA) and grid search technologies to optimize the hyperparameters of the polynomial regression, KNN, GP, SVR, RF, AdaBoost, GBDT, XGBoost, and NN models, then found the upper limit of the fitting performance of each model for the two photosynthesis datasets measured using air and 2% O2, respectively. Table 1 shows the optimal hyperparameters for these models and their performance on four evaluation metrics (explained variance score, EV; mean absolute percentage error, MAPE; root mean square error, RMSE; coefficient of determination, R2) when using holdout-divided datasets (60% training set, 20% validation set, and 20% test set). As can be seen from Table 1, for both datasets, the XGBoost models have the best predictive performance. The R2 of all these models is above 0.85, which indicates that all of them fit well and that the independent variables explain a substantial share of the dependent variable, with the RF, GBDT, XGBoost, and NN (air-based) models reaching above 0.95 and XGBoost being the best. It can be observed from this table that the EV and R2 scores of each model are approximately the same, because their average error is almost zero. For the evaluation metric MAPE, the SVR model also performs well besides the three methods RF, GBDT, and XGBoost, while the other models have values > 1, which means that these models predict poorly when the photosynthesis value is low. The XGBoost model has the smallest RMSE value, which suggests that its prediction errors are small, with only a limited difference from GBDT and RF. Among these models, polynomial regression has the fastest modeling and solving speed, at 25 ms. Although it has the worst results among all the models established, it still achieves a certain prediction accuracy.
GP, SVR, and NN were slower to build; the number of epochs was set to 1000 to improve the accuracy of the NN model, and it can usually be reduced to shorten the training time. Except for these three slower models, the training and prediction elapsed time of all the models is within 1000 ms. Therefore, the XGBoost model is optimal in terms of both prediction accuracy and time efficiency. In particular, analyzing the XGBoost model in more detail (Figure 3), it is found that in most cases the predicted values are very close to the desired values. The accuracy of the photosynthesis model with the adverse effect of photorespiration decreases when the measured values are higher, most significantly in the validation set. In contrast, the model of photosynthesis without the effect of photorespiration shows no such problem. NNs are increasingly used, and the interpretability of the models built by NNs is given more importance [51]. However, the interpretability of NN models is relatively poor. Many studies have shown that neural networks seem to have stronger predictive capability [52]. However, for the photorespiration model discussed in this study, building the model with the NN approach not only required a lot of computing power and time, but also required more time to make predictions (108.5 times more than the model built using XGBoost), and above all the model performed worse (R2 of 0.958 and 0.933 for the test set, lower than 0.970 and 0.970 for the XGBoost model). From these three aspects, the XGBoost algorithm, developed from decision trees, is more advantageous. The XGBoost algorithm has been used in many fields [53,54], and an increasing number of algorithms use this idea: many weak learner instances are pooled (via boosting, bagging, etc.) together to create a strong ensemble learner, with considerable success [55,56]. Thus, researchers should pay more attention to ensemble learning.

Potential for Model Performance Enhancing
Data and features determine the upper limit of machine learning, so both are crucial. Although the dataset used to build this model is fairly large (over 4000 samples), an experimentally measured dataset inevitably fails to cover some ranges of the features, i.e., the samples are not fully representative of the overall population. This demands high generalization ability from the model. In particular, this study combined the two machine learning models through an arithmetic operation (subtraction) rather than starting from the mechanism. Thus, the generalization errors of the two models propagated into the photorespiration predictions through the same arithmetic operation, putting some predictions outside the theoretical range. This can only be solved by providing more data to the model.
More features are not always better. In this study, the filter method was used to select features for the datasets. The features were first filtered by their variances. If the variance of a feature is small, the samples vary little in this feature, which means the feature is of little use for differentiating samples. After calculating the variance of each feature, it was found that the variance of cucumber growth stage was the smallest at 0.21, followed by plant location (X, Y) in the greenhouse at 0.61 and 0.65, respectively; the variance of all other features was larger (value > 5). Therefore, the growth stage could be excluded. If it is excluded and the XGBoost model is reconstructed, the model performance is as shown in Table 2. The comparison between Table 2 and Table 1 shows that the new model brings limited improvement for the dataset measured using air and is somewhat worse for the dataset measured using 2% O2, meaning that excluding this feature has little effect on the model. Next, the correlation between features was checked. If two variables are highly correlated with each other, they provide redundant information about the target; essentially, an accurate prediction of the target can be made using only one of them. In addition, removing redundant variables helps reduce dimensionality and damp out noise. Figure 4 shows that there is no strong association between any two features (no correlation coefficient < −0.5 or > 0.5).
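The variance-filtering step can be sketched with scikit-learn's VarianceThreshold (the data and threshold here are illustrative, not the study's values):

```python
# Variance filter: features whose variance falls below the threshold are
# dropped, since they barely vary across samples.
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0, 10.0, 200],
              [0, 12.0, 500],
              [1, 11.0, 800],
              [0, 13.0, 400]])   # column 0: a low-variance categorical stand-in
selector = VarianceThreshold(threshold=0.5)
X_sel = selector.fit_transform(X)
kept = selector.get_support()    # boolean mask: column 0 dropped, 1 and 2 kept
```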

Interpretability of the Model and the Main Factors Affecting the Photorespiration
The XGBoost method is based on gradient boosted trees and showed high accuracy in this study, but its interpretability is not clear enough, and it is often referred to as a "black box" model. The importance ranking of feature values (Figure 5) can therefore only give a general overview of each feature's importance and cannot determine the relationship between the features and the final prediction results. Some researchers have used the SHapley Additive exPlanations (SHAP) method, which originates from cooperative game theory, to explore the influence of each feature of a model [57,58]. Figure 6 shows the SHAP values that order features by their importance to the photosynthesis rate (a) with the adverse effect of photorespiration and (b) without the effect. The order of influence of the two groups of features is almost the same, except for position in the greenhouse. PAR is a very important feature and is positively correlated with photosynthesis rate most of the time; however, when the relationship is negative, the influence is significantly stronger. Additionally, PAR is less influential in regular photosynthesis and more influential when photorespiration is excluded, implying that PAR is less important for the predicted photorespiration values obtained by subtracting the two predictions. Humidity is mostly negatively correlated in Figure 6a; while its importance is relatively small, when it shows a positive correlation its importance becomes large, reaching almost half that of PAR at the highest. The leaf position, on the other hand, behaves in just the opposite way to humidity. In the prediction of photosynthetic rate without the effect of photorespiration (Figure 6b), the influence of humidity is lower overall and the positive and negative correlations are more even; leaf position follows the same trend as in Figure 6a. Growth stage is the least important feature in both models.
Two samples with approximately the same feature values were selected from each of the two models' training sets for visual analysis of feature importance, as shown in Figure 7a-d. Features positively correlated with the prediction are shown in red, negatively correlated features in blue, and the effects of these features are plotted against each other to yield the prediction. In all plots, PAR has the largest positive effect. The difference between Figure 7a,b is that CO2 concentration and Location Y, which have positive effects in (a), have negative effects in (b). The difference between Figure 7c,d is that Temperature, which has a weak positive effect in (c), exerts a stronger negative effect in (d), while Relative Humidity, which has a large negative effect in (c), has a smaller effect in (d). In conclusion, the importance of each feature in the photorespiration model differs significantly from that in the photosynthesis model. Figure 7. SHAP force plots. The feature values of (a) and (b) are approximately the same: (a) comes from the photosynthesis model with the effect of photorespiration and (b) from the photosynthesis model without it. The feature values of (c) and (d) are likewise approximately the same: (c) comes from the model with the effect of photorespiration and (d) from the model without it.
Although the analysis above revealed partial trends for individual features, SHAP analysis of the original XGBoost models alone is still not fully conclusive, since the photorespiration model is obtained by subtracting the two of them. It was therefore necessary to also analyze models constructed with polynomial regression, the most interpretable approach considered here.
Although polynomial regression performed weakest among these models, its R² still reached about 0.9 (Table 1). Following the operational logic of Equation (5), the two models built using polynomial regression can be written as closed-form analytic expressions; the highest-degree term in the polynomial is of degree three, so the photorespiration model can be expressed directly in equations. The meaning and coefficients of each term of the photosynthesis model and the photorespiration model are shown in Table 3. For the photosynthesis model, PAR, CO2 concentration, Location Y in the greenhouse (which mainly affects the microclimate), leaf position, and relative humidity (absolute coefficient > 1.5) were the most important features. The coefficients of light and light² had the largest absolute values, 8.221 and −3.704, respectively, followed by CO2 concentration. These results are consistent with the general understanding of photosynthesis.
In contrast, for the predictive model of photorespiration, it was found that leaf position, growth stage, location in the greenhouse, temperature, relative humidity, and their mutual products (absolute coefficient > 9) were the important features, while the absolute coefficient of PAR was not as high. This suggests that studies of photorespiration should pay more attention to growth stage, leaf position, temperature, and relative humidity than to PAR.

Soft Sensors and Extensibility
Some studies have claimed that such predictive models can be applied as soft sensors in greenhouses [59,60], which significantly reduces the cost of configuring physical sensors. However, using a model as a sensor in this way requires that it be sufficiently accurate and have a very high generalization ability. The model established here for the photorespiration of cucumber allows the leaf photorespiration rate to be predicted from basic, readily obtainable meteorological parameters around the leaf with high simulation accuracy. The model developed in this study can therefore be applied as a soft sensor and serves as a useful exploration in photorespiration rate simulation. Machine-learning simulation of photorespiration relies on statistical models built from datasets. In this study, measurements from two production seasons of cucumber were used to improve accuracy; in fact, with the method provided here, measurements from only one production season of a given crop are needed to model its photorespiration with high accuracy. This means the model can be readily extended and applied to other plants.

Conclusions
This study compared the performance of several machine learning models for predicting the photorespiration rate of cucumber leaves. The photorespiration rate was expressed as the difference between the photosynthetic rates at 2% and 21% O2 concentrations. To this end, air temperature, relative humidity, CO2 concentration, and PAR measured near the leaves, together with crop growth stage, leaf position, and plant location in the greenhouse, were used as input variables, and the photosynthetic rates measured using air and 2% O2, respectively, were used as output variables. The advantages and disadvantages of the models built with the different machine learning methods were also compared. The XGBoost models showed the best generalization ability and the shortest training time, with a test-set R² of 0.970 for both the air and 2% O2 measurements. Feature importance was also discussed: PAR and CO2 concentration have a greater effect on the photosynthetic rate itself, but temperature and relative humidity are more significant for photorespiration. The model established in this study performed well, with high accuracy and generalization ability. In addition, it can be used as a soft sensor and has some potential to be extended to other plants.