1. Introduction
The most important call for a sustainable future in the food sector is to produce more food per hectare without expanding agricultural land in order to accommodate the rapid growth of the world population. To achieve this, there is a need to increase the productivity in greenhouse hydroponic cultivation by redesigning the operation control system [1].
The development of a machine learning (ML) model that combines climate and crop physiology data to detect different types of stress can improve greenhouse operation.
Up to now, it was not feasible to incorporate crop physiology data in an ML model since most agronomy factors are measured using labor- and time-consuming protocols [2]. Leaf temperature is one of the few indicators that can be measured as a time series to produce the large database required to build a machine learning model. However, leaf or crop temperature is an unstable factor that can vary strongly with the climatic and abiotic conditions and cannot be used on its own to estimate different types of crop stress [3]. The combination of leaf temperature with photosynthesis (Ps) could improve the methodology of defining the type of stress produced in vegetable cultivation.
Recently, the photochemical reflectance index (PRI), which is correlated with rapid changes in the de-epoxidation of the xanthophyll cycle, and the photosynthesis efficiency (Ps) have shown very good results, and both can be measured using soft sensors (i.e., mathematical models using real-time sensor data) to produce a time-series database.
In this research, a methodology for developing a gradient boosting model that combines climatic data with leaf temperature and photosynthesis rate is presented. To this end, a multisensory tower placed within the greenhouse was used to record how the physiological status of the tomato plants changed with their surrounding microclimate. The plants were cultivated under extreme conditions of air temperature and water in the root zone. The resulting database was used to train and test the model.
2. Material and Methods
The measurements were carried out from May to December of 2019 in one of the five compartments of a multi-tunnel greenhouse with a total ground area of 1500 m² (250 m² each compartment). The greenhouse was located at the facilities of the University of Thessaly, Velestino, Volos (latitude 39°22′, longitude 22°44′, and altitude 85 m) in the continental area of eastern Greece.
The tomato plants (Solanum lycopersicum cv. Elpida, Spirou Ltd., Athens, Greece) were cultivated in perlite slabs (ISOCON Perloflor Hydro 1, ISOCON S.A., Athens, Greece). The plants were fertigated with a fresh nutrient solution with set points of electrical conductivity (EC) around 2 dS m⁻¹ and pH 5.8. The nutrient solution supplied to the crop was a standard nutrient solution for tomatoes grown in open hydroponic systems adapted to Mediterranean climatic conditions. The nutrient solution was supplied via a drip system and controlled by a time-program irrigation controller (8 irrigation events per day).
In order to record the physiological response of the plants to their surrounding microclimate, tomato plants were subjected to different types of stress. Specifically, the plants were cultivated under (i) a low air temperature around 15 °C (LTS treatment) and (ii) a low water concentration in the root zone with a dose of 30 mL per plant (LWS treatment). Additionally, measurements of (iii) non-stressed plants (NoS treatment) were recorded.
In order to build the database of crop physiology and microclimate under the extreme conditions described above, a multisensory tower was built, consisting of an air temperature sensor (Thygro SDI-12, Symmetron, Gerakas, Greece), relative humidity sensor (Thygro SDI-12, Symmetron, Gerakas, Greece), solar radiation sensor (SP-SS, Apogee Instruments, Logan, UT, USA), leaf temperature sensor (Thermocouples, type T, Delta Ltd., Pico Rivera, CA, USA), leaf wetness sensor (PS-0061-AD, Netsense, Calenzano, Italy), and PRI sensor (type SRS-PRI, Meter Group, Pullman, WA, USA). The multisensory tower was placed within the greenhouse, parallel to the vertical axis of the tomato’s main stem. The measurements started 10 days after each treatment was applied and lasted for 25 days.
In total, nine features, namely air temperature (Ta, °C), relative humidity (RH, %), solar radiation (SR, W m⁻²), leaf temperature (TL, °C), leaf wetness in young leaves (Lwup, %), leaf wetness in mature leaves (Lwdn, %), photochemical reflectance index (PRI), photosynthesis rate (Ps, μmol m⁻² s⁻¹), and crop water stress index (CWSI), were supplied to the model in order to predict three output classes (LTS, LWS, and NoS).
In the current research, the CWSI developed by Jackson et al. [4] was calculated. The methodology followed in the current research was described in Baille et al. [5]. The calibration procedure of the remote PRI sensor and the Ps calculation were presented in Elvanidi & Katsoulas [6]. The resulting dataset comprised 10,763 values.
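For orientation, the CWSI is often written in a normalized form that places the measured leaf-to-air temperature difference between a non-stressed lower limit and a non-transpiring upper limit. The Python sketch below assumes that form. The function name `cwsi` and the two baseline values are hypothetical, and this is not necessarily the exact procedure of [4] or [5].

```python
def cwsi(t_leaf, t_air, dt_lower, dt_upper):
    """Normalized crop water stress index (assumed baseline form).

    t_leaf, t_air : leaf and air temperature (deg C)
    dt_lower      : leaf-air difference of a well-watered crop (hypothetical baseline)
    dt_upper      : leaf-air difference of a non-transpiring crop (hypothetical baseline)
    """
    if dt_upper <= dt_lower:
        raise ValueError("dt_upper must be greater than dt_lower")
    dt = t_leaf - t_air
    index = (dt - dt_lower) / (dt_upper - dt_lower)
    # Clip to [0, 1]: 0 means no stress, 1 means maximum stress.
    return min(1.0, max(0.0, index))


# Illustrative values only; they are not measurements from this study.
example_cwsi = cwsi(t_leaf=24.0, t_air=25.0, dt_lower=-3.0, dt_upper=2.0)
print(round(example_cwsi, 2))
```

In practice the two baselines depend on the crop and on the vapor pressure deficit, so they would have to be determined for each site rather than fixed as constants.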
To obtain high performance on greenhouse data, a series of ML algorithms, such as gradient boosting (GB), multilayer perceptron (MLP), and other artificial neural network algorithms, were examined. Among them, the GB technique performed best on the studied sample, in which the measured parameters were treated as distinct observations rather than as a time series. GB belongs to the ensemble learning algorithms, which rely on a collective decision from weak prediction models, here decision trees.
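The idea of a collective decision built from weak trees can be illustrated with a short boosting loop. The sketch below is a simplified regression variant with squared loss on synthetic data, chosen because it is the easiest case to follow; it is not the classifier used in this study, and all data and settings are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-dimensional data (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.5
prediction = np.full_like(y, y.mean())  # start from a constant model
mse_history = [np.mean((y - prediction) ** 2)]
trees = []

for _ in range(20):
    # For squared loss, the negative gradient is the residual.
    residuals = y - prediction
    stump = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X, residuals)
    # Each weak tree contributes a small, shrunken correction.
    prediction += learning_rate * stump.predict(X)
    trees.append(stump)
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"training MSE: {mse_history[0]:.4f} -> {mse_history[-1]:.4f}")
```

Each stump alone is a poor predictor, but the shrunken sum of their corrections steadily lowers the training error, which is the mechanism the ensemble relies on.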
In the model, a list of hyperparameters was tuned (learning rate, number of estimators, max tree depth, max features). The cross-validation process was repeated 50 times. The methodology followed in the current research is described in Friedman et al. [7], Khan et al. [8], and Karamoutsou [9].
The dataset was divided into two parts: one for training and validation (80%; 8610) and a second for testing (20%; 2152). All steps, including learning and classification, were implemented in Python. For machine learning, the Python library Scikit-learn [10] and the Spyder environment were used.
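A minimal sketch of such a workflow in Scikit-learn is given below. It uses synthetic data in place of the greenhouse database, so the printed score says nothing about the real model. The stratified split, the random seeds, and the sample size are assumptions. The hyperparameter values are the optima reported in the Results section.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

FEATURES = ["Ta", "RH", "SR", "TL", "Lwup", "Lwdn", "PRI", "Ps", "CWSI"]
CLASSES = np.array(["LTS", "LWS", "NoS"])

# Synthetic stand-in for the measured database (assumption).
X, y_idx = make_classification(
    n_samples=600, n_features=len(FEATURES), n_informative=5,
    n_redundant=0, n_classes=3, random_state=0,
)
y = CLASSES[y_idx]

# 80% for training and validation, 20% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = GradientBoostingClassifier(
    learning_rate=0.5, n_estimators=70, max_depth=9, max_features=7,
    random_state=0,
)
clf.fit(X_train, y_train)

test_accuracy = clf.score(X_test, y_test)
print(f"test accuracy on synthetic data: {test_accuracy:.3f}")
```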
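The statistical criteria were the accuracy (1), the positive predictive value (PPV or precision) (2), the sensitivity (or recall) (3), and the F1-score (4), where P is the number of real positive cases in the data, N is the number of real negative cases in the data, and TP, TN, FP, and FN denote the true positives, true negatives, false positives, and false negatives, respectively:
$$\mathrm{Accuracy} = \frac{TP + TN}{P + N} \quad (1)$$
$$\mathrm{PPV} = \frac{TP}{TP + FP} \quad (2)$$
$$\mathrm{Sensitivity} = \frac{TP}{P} = \frac{TP}{TP + FN} \quad (3)$$
$$F_1 = \frac{2\,TP}{2\,TP + FP + FN} \quad (4)$$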
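As a minimal sketch, the four criteria can be computed from confusion-matrix counts as follows. The standard definitions are assumed, with P = TP + FN and N = TN + FP. The function name `binary_metrics` and the example counts are hypothetical.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, sensitivity, and F1 from confusion-matrix counts."""
    p = tp + fn  # real positive cases
    n = tn + fp  # real negative cases
    return {
        "accuracy": (tp + tn) / (p + n),
        "precision": tp / (tp + fp),
        "sensitivity": tp / p,
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


# Illustrative counts only; they are not results from this study.
example = binary_metrics(tp=8, fp=2, fn=2, tn=8)
for name, value in example.items():
    print(f"{name}: {value:.2f}")
```

For a three-class problem such as LTS, LWS, and NoS, the counts would typically be taken one class against the rest and the per-class scores averaged.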
3. Results
During the training procedure, the optimum value of each hyperparameter was determined.
The learning rate can range from 0 to 1, with the most common values being 0.001–0.3. Smaller values make the model more robust to the specific characteristics of each individual tree and reduce the possibility of overfitting; however, they increase the risk of not reaching the optimum with a fixed number of trees. For the current GB-based classifier, the optimum value, chosen among the candidate values (0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1), was 0.5.
The optimum number of estimators, which defines the total number of sequential trees, was chosen among the values (10, 20, 30, 40, 50, 60, 70, 80, 90, 100) and was 70.
The optimum max tree depth, which controls the depth of the individual trees, was chosen among the values (1, 2, 3, 4, 5, 6, 7, 8, 9, 10) and was 9.
The optimum max features value, which defines the number of features considered when searching for the best split, was chosen among the values (1, 2, 3, 4, 5, 6, 7, 8) and was 7.
Combining the optimum hyperparameters yielded the current GB model for detecting the three specific types of stress. According to the training process, the number of features supplied to the model was set to 9.
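A search over these candidate values could be set up as in the sketch below. It assumes an exhaustive grid search with stratified five-fold cross-validation, which may differ from the procedure actually followed. The full grid is only counted, while a much smaller demonstration grid is fitted on synthetic data to keep the run short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, ParameterGrid, StratifiedKFold

# Candidate values as listed in the text.
full_grid = {
    "learning_rate": [0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1.0],
    "n_estimators": list(range(10, 101, 10)),
    "max_depth": list(range(1, 11)),
    "max_features": list(range(1, 9)),
}
n_candidates = len(ParameterGrid(full_grid))  # 7 * 10 * 10 * 8 combinations
print(f"full grid: {n_candidates} combinations")

# Reduced grid and synthetic data (assumptions) for a quick demonstration.
demo_grid = {
    "learning_rate": [0.1, 0.5],
    "n_estimators": [10, 70],
    "max_depth": [3],
    "max_features": [7],
}
X, y = make_classification(
    n_samples=300, n_features=9, n_informative=5,
    n_redundant=0, n_classes=3, random_state=0,
)
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    demo_grid,
    scoring="accuracy",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)
best_params = search.best_params_
print("best parameters on synthetic data:", best_params)
```

With 5600 combinations, the full grid is costly to evaluate exhaustively, so a randomized or staged search may be a reasonable alternative.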
Figure 1 shows the feature importance values obtained from the GB approach as a histogram. Of the nine features, two contribute most to classifying the three types of stress, namely (a) TL and (b) Ta. The other features complement the prediction by further improving the model. Therefore, in the current algorithm, the more variables were supplied as inputs, the higher the prediction accuracy. Decreasing the number of inputs would require a larger tested sample, since the greenhouse is a nonlinear system in which a lack of data makes the complex dynamic relation between the climatic factors and the crop physiology difficult to predict.
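Feature importance values of this kind can be read from a fitted Scikit-learn model through its `feature_importances_` attribute. In the sketch below the data are synthetic, and the labels are deliberately constructed from the TL and Ta columns only, so the resulting ranking is an artifact of that assumption and not a reproduction of Figure 1.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

FEATURES = ["Ta", "RH", "SR", "TL", "Lwup", "Lwdn", "PRI", "Ps", "CWSI"]

# Synthetic features; the class depends only on TL and Ta (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(600, len(FEATURES)))
ta = X[:, FEATURES.index("Ta")]
tl = X[:, FEATURES.index("TL")]
y = (tl > 0).astype(int) + (ta > 0.5).astype(int)  # three classes: 0, 1, 2

clf = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=0)
clf.fit(X, y)

# Impurity-based importances, normalized to sum to 1.
ranking = sorted(
    zip(FEATURES, clf.feature_importances_),
    key=lambda item: item[1],
    reverse=True,
)
for name, importance in ranking:
    print(f"{name:>5s}  {importance:.3f}")
```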
Table 1 presents the statistical criteria obtained with the GB algorithm during the training and testing processes. The GB algorithm scored highly on the training set, where the accuracy, precision, sensitivity, and F1-score were all 100%. The GB model belongs to the family of models that can handle even features with low predictive power. In addition, the GB model performed well on the test set, with 98% accuracy, 98% precision, 98% sensitivity, and a 98% F1-score. The small gap between the training and testing metrics suggests that overfitting was limited.
Figure 2 shows the performance distribution for the GB model according to the three types of stress. More specifically, the GB model correctly “understood” all cases presented as LTS; it “confused” 16 NoS cases as LWS, 21 LWS cases as NoS, and only 1 LWS case as LTS.
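The reported misclassifications can be checked against the overall test accuracy with a short calculation. The sketch below uses only the counts stated above and the test-set size of 2152 given in the Methods section; the variable names are illustrative.

```python
# Test-set size and misclassified cases as reported in the text.
n_test = 2152
misclassified = {
    ("NoS", "LWS"): 16,  # true NoS predicted as LWS
    ("LWS", "NoS"): 21,  # true LWS predicted as NoS
    ("LWS", "LTS"): 1,   # true LWS predicted as LTS
}

n_errors = sum(misclassified.values())
accuracy = 1 - n_errors / n_test
print(f"{n_errors} errors out of {n_test} cases, accuracy = {accuracy:.3f}")
```

The 38 errors correspond to an accuracy of about 98.2%, in line with the 98% reported for the test set. Per-class precision and sensitivity cannot be derived this way, because the number of test cases in each class is not stated here.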
4. Discussion
The gradient boosting algorithm is one of the most powerful algorithms in the field of machine learning. It can be used to predict not only a continuous target variable (as a regressor) but also a categorical target variable (as a classifier). In the current research, qualitative and quantitative data were involved in building the ML model. Additionally, the GB algorithm can build an efficient and accurate ML model in less time. The GB algorithm performs well with small datasets and with unbalanced data, such as those arising in real-time data management [11,12]. Ravi and Baranidharan [13] and Cai et al. [14] maintain that the GB algorithm is faster than all other machine learning algorithms.
In the current research, the GB algorithm was applied for the first time to classify qualitative and quantitative data under greenhouse conditions, with very good statistical results. The developed model may be applicable to other greenhouse systems in the Mediterranean region that cultivate tomato crops in hydroponics.
The next step of the current research is to improve the model that was developed by the GB algorithm by decreasing the number of inputs in order to define more types of stress, such as stress occurring in the plants due to high air temperature and low nutrient performance.
5. Conclusions
The current research presented the development of a gradient boosting model to predict three types of stress under greenhouse conditions. The model was built for tomato crops, and the training and testing of the model were performed on a sample of 10,763 values. In the model, nine input features were used to predict three outputs. The developed GB model achieved high statistical criteria, with more than 98% accuracy, and performed robustly on greenhouse data, so that it can be connected with the operation systems already in use. The future perspective of the current research is to extend the model in order to predict more than three types of stress. Applying the current model in greenhouse cultivation allows more efficient and precise farming with less manpower and high-quality production, contributing to a further reduction of resource inputs, energy use, and environmental footprint.