Soil Temperature Estimation with Meteorological Parameters by Using Tree-Based Hybrid Data Mining Models

Abstract: The temperature of the soil at different depths is one of the most important factors used in different disciplines, such as hydrology, soil science, civil engineering, construction, geotechnology, ecology, meteorology, agriculture, and environmental studies. In addition to physical and spatial variables, meteorological elements are also effective in changing soil temperatures at different depths. The use of machine-learning models is increasing day by day in many complex and nonlinear branches of science. These data-driven models seek solutions to complex and nonlinear problems using data observed in the past. In this research, decision tree (DT), gradient boosted trees (GBT), and hybrid DT–GBT models were used to estimate soil temperature. The soil temperatures at 5, 10, and 20 cm depths were estimated using the daily minimum, maximum, and mean temperature, sunshine intensity and duration, and precipitation data measured between 1993 and 2018 at Divrigi station in Sivas province in Turkey. To predict the soil temperature at different depths, the time windowing technique was used on the input data. According to the results, the hybrid DT–GBT method estimated the soil temperature at 5 cm depth most successfully, followed by the GBT and DT methods. However, the best estimate at soil depths of 10 and 20 cm was obtained with the DT model. The accuracy of the models also increased with increasing soil depth. In the prediction of soil temperature, sunshine duration and air temperature were determined to be the most important factors, and precipitation was the least significant meteorological variable. According to the evaluation criteria used, such as the Nash-Sutcliffe coefficient, R, MAE, RMSE, and Taylor diagrams, all three data-based models (DT, GBT, and hybrid DT–GBT) can be recommended for predicting soil temperature.


Introduction
Determining the temperature at different soil depths is important for planning in many disciplines and engineering fields. It is a parameter that needs to be known or predicted in fields such as hydrology, soil science, construction, geotechnology, ecology, meteorology, agriculture, and environmental studies. Frost forecasting in the soil is also important in terms of operating these projects and determining the working season in drinking and agricultural water networks, oil and [...] and maximum air temperature, relative humidity, sunshine duration, and solar radiation as model inputs in modeling. Kim et al. [22] estimated the soil temperature by MLP-ANN and ANFIS methods. They used different meteorological parameters as model inputs and obtained successful results. Yener et al. [23] investigated the effect of meteorological parameters on soil temperature in Turkey. They observed that soil temperature values are affected by various parameters, such as thermal conductivity, short-term climatic conditions, and humidity. Sattari et al. [10] estimated the soil temperature for different depths in an agricultural region of Iran's Isfahan province with the help of meteorological parameters. They made successful predictions using artificial neural networks and the M5 tree model. Samadianfard et al. [24] successfully predicted the daily average soil temperature in Tabriz, Iran, with wavelet artificial neural networks and gene expression programming methods. According to the results of that study, air temperature, sunshine duration, and radiation were the most important factors affecting soil temperature. Feng et al. [25] estimated the soil temperature at various depths at half-hourly intervals in China using meteorological variables, such as wind speed, air temperature, relative humidity, solar radiation, and vapor pressure deficit, and four machine-learning models. 
Among the models used, the extreme learning machine method was found to be much more successful than artificial neural networks and random forest approaches. Costache et al. [26] successfully used the gradient boosting trees (GBT) and multilayer perceptron (MLP) method to evaluate the flood potential and to predict flood sensitive areas in the Trotus river basin in Romania. Matei et al. [27,28] and Anton et al. [29] used various techniques, such as collaborative or context-aware data mining, for predicting the soil moisture in Transylvania, Romania. Wu et al. [30] used the gradient boosting decision tree (GBDT) algorithm to predict urban floods in Zhengzhou City. In modeling, factors, such as amount of precipitation, duration, intensity, evaporation, land use, permeability, water collection area, and slope, were used.
The aim of this study is to estimate soil temperature at depths of 5, 10, and 20 cm using the DT and GBT methods at the Divrigi meteorological station in Sivas province, Turkey, and to compare the results with those of the proposed hybrid DT-GBT method. The effect of meteorological variables on soil temperature is investigated by using different input combinations.

Material
This study was carried out using values measured at the weather station located in the Divrigi district of Sivas, Turkey (Figure 1). With an area of 27,202 km², Sivas is Turkey's second-largest province, and 66.5% of its active population is employed in the agricultural sector. The province is an important vegetative production center offering a wide variety of agricultural products, thanks to its large agricultural area and microclimatic agricultural basins. Of its land, 41% is suitable for agriculture, 27% is pasture, 13% is forest and shrubbery, and 19% is non-agricultural. By 2018 cultivation area, Sivas ranks first in the country for oats, second for trefoil, third for wheat, sixth for alfalfa, seventh for sugar beet, and eighth for potato [31,32].
Daily data measured in Turkish State Meteorological Service Sivas Divrigi station between 15 September 2009 and 31 December 2018 were used in the study. Measurements in the meteorological stations operated by the State Meteorological Service in Turkey are conducted according to standards set by the World Meteorological Organization. Measurements made manually in previous years are now made through automatic stations. Automatic meteorology stations consist of sensors sensitive to changes in meteorological parameters and measuring the amount of these changes. These stations have the main (central) processing unit that makes the necessary calculations to convert the measurements obtained by the sensors into meteorological information, the display units that enable the information to be displayed, and the communication units that enable the information to be transmitted to the center. The station also has a data acquisition unit, communication interface, and power supply [33,34].
Basic statistics about the data used are given in Table 1. Soil temperature values at a depth of 5 cm vary more than those at 10 cm and 20 cm. The daily change of the average soil temperature at different depths throughout the year is given in Figure 2; the difference between the minimum and maximum temperature values is high. The data were split chronologically into training and test sets at a 70-30 ratio. The initial data set had 3395 records and was stored in two separate repositories: the first 70%, from September 2009 to March 2016, went into the Training Data repository and was used to train the models, while the remaining 30%, from March 2016 to December 2018, went into the Test Data repository and was used to validate them. These two repositories were used in all the created processes. We evaluated the best method and scenario for each of the proposed algorithms in order to implement and run a process that covered all the decided scenarios.
Table 1. Statistical properties of daily data related to air and soil.
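The chronological 70-30 split described above can be illustrated with a minimal Python sketch. It is not the Rapid Miner process used in the study: the function name `chronological_split` and the placeholder values are assumptions made for illustration, and only the date range and the record count come from the text.

```python
from datetime import date, timedelta


def chronological_split(records, train_fraction=0.7):
    """Split time-stamped records into a leading training part and a trailing test part."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    ordered = sorted(records, key=lambda record: record[0])  # oldest record first
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# One synthetic record per day from 15 September 2009 to 31 December 2018.
# The values are placeholders, not station measurements.
start = date(2009, 9, 15)
records = [(start + timedelta(days=i), float(i % 30)) for i in range(3395)]

train, test = chronological_split(records)
print(len(train), len(test))
print("training ends:", train[-1][0], "testing starts:", test[0][0])
```

With one record per day over this period, the boundary falls in March 2016, which agrees with the split reported in the text. No shuffling is applied, so the test period lies entirely after the training period.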

Methods
The data mining processes were implemented in Rapid Miner Studio (version 9.4-Educational Edition, RapidMiner Inc., Boston, MA, USA). It is a tool that provides a comprehensive set of operators and offers easy to use and understand structures for modelling complex data mining processes [35]. The machine-learning algorithms used for predicting the soil temperature are described below.

Gradient Boosted Trees (GBT)
Gradient boosted trees consist of an ensemble of regression or classification tree models. In the scenarios tested here, the method is used for regression. According to Freund and Schapire [36], regression GBT is a generalization of boosting to arbitrary differentiable loss functions. The trees are learned sequentially by a forward stagewise procedure [37]. The GBT implementation in Rapid Miner uses the H2O 3.8.2.6 algorithm, which follows the algorithm specified by Hastie et al. [38].
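To make the forward stagewise idea concrete, the following is a minimal Python sketch of gradient boosting for squared loss, using one-split trees (stumps) on a single input. It is an illustration under simplifying assumptions, not the H2O implementation used by Rapid Miner; the function names, the learning rate, the number of trees, and the synthetic data are all hypothetical.

```python
def fit_stump(x, residuals):
    """Fit a one-split regression tree by minimizing the sum of squared errors."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # skip the largest value so both sides are non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        raise ValueError("x needs at least two distinct values")
    return best[1:]


def fit_gbt(x, y, n_trees=50, learning_rate=0.1):
    """Forward stagewise boosting: each stump is fitted to the current residuals."""
    base = sum(y) / len(y)  # initial constant model
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        # For squared loss, the negative gradient equals the plain residual.
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        threshold, left_value, right_value = fit_stump(x, residuals)
        stumps.append((threshold, left_value, right_value))
        predictions = [
            pi + learning_rate * (left_value if xi <= threshold else right_value)
            for xi, pi in zip(x, predictions)
        ]
    return {"base": base, "learning_rate": learning_rate, "stumps": stumps}


def predict_gbt(model, xi):
    """Sum the base value and the shrunken contribution of every stump."""
    total = model["base"]
    for threshold, left_value, right_value in model["stumps"]:
        total += model["learning_rate"] * (left_value if xi <= threshold else right_value)
    return total


def mse(observed, predicted):
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


# Synthetic data: soil temperature as a linear function of air temperature.
air_temp = [float(t) for t in range(-5, 30)]
soil_temp = [2.0 + 0.8 * t for t in air_temp]

model = fit_gbt(air_temp, soil_temp)
baseline_mse = mse(soil_temp, [model["base"]] * len(soil_temp))
boosted_mse = mse(soil_temp, [predict_gbt(model, t) for t in air_temp])
print(baseline_mse, boosted_mse)
```

Each added stump lowers the training error a little, which is why the number of trees and the tree depth have to be tuned together against held-out data.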

Decision Trees (DT)
A decision tree (DT) is a tree-like collection of nodes used to predict membership of a class or to estimate a numerical target value. Each node corresponds to a splitting rule for one specific attribute. This is a simple and widely used method in data mining [39].
The output of the learner is a tree model, which is later used for prediction. Minimization of the sum of squares is used as the splitting criterion.
As Hastie et al. [38] specified, the tree size influences the complexity of the resulting model, and the optimal size of the tree should be chosen adaptively. In Rapid Miner, the tree size is controlled by the "maximal depth" parameter, for which we tried different values in the optimization part.
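As an illustration of the least-squares criterion and the role of the maximal depth, here is a minimal Python sketch of a regression tree. It is not the Rapid Miner operator: the function names and the toy data are assumptions, and pruning and minimal leaf sizes are left out.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(rows, targets):
    """Return (attribute index, threshold) with the lowest summed SSE, or None."""
    best, best_score = None, sse(targets)
    for j in range(len(rows[0])):
        for threshold in sorted({row[j] for row in rows})[:-1]:
            left = [t for row, t in zip(rows, targets) if row[j] <= threshold]
            right = [t for row, t in zip(rows, targets) if row[j] > threshold]
            score = sse(left) + sse(right)
            if score < best_score:  # keep only splits that reduce the error
                best, best_score = (j, threshold), score
    return best


def build_tree(rows, targets, maximal_depth):
    """Grow the tree recursively until maximal_depth is used up or no split helps."""
    split = best_split(rows, targets) if maximal_depth > 0 else None
    if split is None:
        return sum(targets) / len(targets)  # leaf: mean of the targets
    j, threshold = split
    left = [i for i, row in enumerate(rows) if row[j] <= threshold]
    right = [i for i, row in enumerate(rows) if row[j] > threshold]
    return (
        j,
        threshold,
        build_tree([rows[i] for i in left], [targets[i] for i in left], maximal_depth - 1),
        build_tree([rows[i] for i in right], [targets[i] for i in right], maximal_depth - 1),
    )


def predict(tree, row):
    """Follow the splitting rules down to a leaf value."""
    while isinstance(tree, tuple):
        j, threshold, left, right = tree
        tree = left if row[j] <= threshold else right
    return tree


# Toy data with one attribute and a step-shaped target.
rows = [[1.0], [2.0], [3.0], [4.0]]
targets = [10.0, 10.0, 20.0, 20.0]
tree = build_tree(rows, targets, maximal_depth=1)
print(predict(tree, [1.5]), predict(tree, [3.5]))
```

A larger maximal depth lets the tree follow the training data more closely, which is the complexity trade-off mentioned above.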

Hybrid DT-GBT
The proposed hybrid DT-GBT approach uses the vote operator capabilities offered by Rapid Miner. It is a nested operator, meaning it has a subprocess. It also requires at least two learners, called base learners. For classification, this operator uses a majority vote, while for regression it uses the average on top of the predictions of the base learners provided in the subprocess. For classification, all the operators in the subprocess accept the given dataset and generate a classification model. For predicting an unknown example, this operator applies all the classification models from its subprocess and assigns the predicted class with maximum votes to the unknown example.
In case of regression, all the operators in the subprocess of the vote operator accept the given dataset and generate a regression model. In the proposed hybrid DT-GBT approach, GBT and DT are included in the subprocess and are considered base learners. To predict an unknown value, the operator uses the average on top of the predictions of the base learners defined.
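The voting rule described above can be sketched in a few lines of Python. The sketch only mirrors the described behavior (averaging for regression, majority vote for classification); the function names and the stand-in base learners are hypothetical and do not reproduce Rapid Miner's operator.

```python
from collections import Counter


def vote_regression(base_learners, example):
    """Average the numerical predictions of all base learners."""
    if len(base_learners) < 2:
        raise ValueError("voting needs at least two base learners")
    predictions = [learner(example) for learner in base_learners]
    return sum(predictions) / len(predictions)


def vote_classification(base_learners, example):
    """Return the class predicted by the largest number of base learners."""
    if len(base_learners) < 2:
        raise ValueError("voting needs at least two base learners")
    votes = Counter(learner(example) for learner in base_learners)
    return votes.most_common(1)[0][0]


# Stand-ins for a trained DT and a trained GBT model (fixed outputs, for illustration only).
def dt_model(example):
    return 12.0


def gbt_model(example):
    return 13.0


hybrid_prediction = vote_regression([dt_model, gbt_model], example=None)
print(hybrid_prediction)
```

With two base learners, the hybrid prediction always lies between the two individual predictions, so the hybrid can only beat both base learners on a given day when their errors have opposite signs.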

Metrics Performed for Evaluation
Five different well-known metrics were calculated for evaluating the models (Equations (1) and (2)).

• Root mean squared error (RMSE): the standard deviation of the residuals (prediction errors).

• Pearson correlation coefficient (r): used to obtain the strength and direction of the linear relationship between the predicted value and the observed value of the soil temperature.

• Mean absolute error (MAE): commonly used in forecasting time series.

• Nash-Sutcliffe coefficient (NS): used to describe the accuracy of model outputs:

$$NS = 1 - \frac{\sum_{i=1}^{n} (d_i - p_i)^2}{\sum_{i=1}^{n} (d_i - \bar{d})^2} \quad (1)$$

where $n$ is the number of outputs, $p_i$ is the i-th predicted output, $d_i$ is the i-th desired observed output, and $\bar{d}$ is the mean of the observed outputs [40,41].

• Kling-Gupta efficiency (KGE): first introduced by Gupta et al. [42] as an improvement to the Nash-Sutcliffe efficiency. It facilitates the separate analysis of the relative importance of correlation, bias, and variability in the process of hydrological modelling:

$$KGE = 1 - \sqrt{(r - 1)^2 + \left(\frac{\sigma_{sim}}{\sigma_{obs}} - 1\right)^2 + \left(\frac{\mu_{sim}}{\mu_{obs}} - 1\right)^2} \quad (2)$$

where $r$ is the linear correlation between observed and predicted values, $\sigma_{obs}$ is the standard deviation in observations, $\sigma_{sim}$ the standard deviation in simulations, $\mu_{sim}$ the simulation mean, and $\mu_{obs}$ the observation mean.
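The five metrics listed above can be computed with a short Python sketch. It uses the standard textbook definitions (NS as one minus the ratio of the squared errors to the squared deviations of the observations from their mean, and KGE as one minus the Euclidean distance of the correlation, variability ratio, and mean ratio from their ideal value of 1). The function names and the sample values are assumptions for illustration and are not taken from the study.

```python
import math


def _mean(values):
    return sum(values) / len(values)


def _std(values):
    mean = _mean(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


def rmse(observed, predicted):
    return math.sqrt(_mean([(d - p) ** 2 for d, p in zip(observed, predicted)]))


def mae(observed, predicted):
    return _mean([abs(d - p) for d, p in zip(observed, predicted)])


def pearson_r(observed, predicted):
    mean_obs, mean_pred = _mean(observed), _mean(predicted)
    covariance = sum((d - mean_obs) * (p - mean_pred) for d, p in zip(observed, predicted))
    spread = math.sqrt(
        sum((d - mean_obs) ** 2 for d in observed)
        * sum((p - mean_pred) ** 2 for p in predicted)
    )
    return covariance / spread


def nash_sutcliffe(observed, predicted):
    mean_obs = _mean(observed)
    squared_errors = sum((d - p) ** 2 for d, p in zip(observed, predicted))
    return 1.0 - squared_errors / sum((d - mean_obs) ** 2 for d in observed)


def kling_gupta(observed, predicted):
    r = pearson_r(observed, predicted)
    variability_ratio = _std(predicted) / _std(observed)  # sigma_sim / sigma_obs
    mean_ratio = _mean(predicted) / _mean(observed)       # mu_sim / mu_obs
    return 1.0 - math.sqrt(
        (r - 1.0) ** 2 + (variability_ratio - 1.0) ** 2 + (mean_ratio - 1.0) ** 2
    )


# Illustrative values only: every prediction is 1 degree too warm.
observed = [10.0, 12.0, 14.0, 16.0]
predicted = [11.0, 13.0, 15.0, 17.0]
for metric in (rmse, mae, pearson_r, nash_sutcliffe, kling_gupta):
    print(metric.__name__, round(metric(observed, predicted), 4))
```

In this example the constant bias leaves r at 1 but lowers NS and KGE, which shows why correlation alone is not a sufficient accuracy measure.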

Parameter Setup
To predict the soil temperature at different depths, the time windowing technique was used on the input data. Windowing is used to split time series into input vectors. A time series is a set of measurements performed on a specific process that are registered sequentially in time. As Koskela et al. [43] point out, by using the windowing technique, the problem is translated into deciding the length and type of the window to be used.
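A minimal Python sketch of windowing is given below. It assumes that the input vector for a given day is built from the meteorological values of the preceding `window_size` days and that the target is the soil temperature of that day; whether the study's windows are aligned exactly this way is not stated, so the alignment, the function name, and the sample values are assumptions.

```python
def make_windows(features, targets, window_size):
    """Turn daily series into (input vector, target) examples.

    features: one tuple of meteorological values per day, in time order.
    targets:  one soil temperature per day, aligned with features.
    """
    examples = []
    for day in range(window_size, len(features)):
        window = features[day - window_size:day]  # the previous window_size days
        input_vector = [value for row in window for value in row]  # flatten the window
        examples.append((input_vector, targets[day]))
    return examples


# Illustrative values only: (mean air temperature, sunshine duration) per day.
features = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0), (4.0, 40.0), (5.0, 50.0)]
soil_temperature = [0.1, 0.2, 0.3, 0.4, 0.5]

for input_vector, target in make_windows(features, soil_temperature, window_size=3):
    print(input_vector, "->", target)
```

A longer window adds `window_size` times the number of variables to each input vector while also removing examples from the start of the series, so the window length is a parameter worth tuning.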

Scenarios and Implementation
In the study, 8 different input scenarios were taken into account to determine the meteorological variables that have the most impact on soil temperature and to evaluate the predictive power of the prediction models to be used based on these variables. The scenarios in Table 2 are based on the physics of soil temperature change and a literature search.
For validating the best combination for the machine-learning algorithms, a particularization of the configurable scenarios platform for designing prediction models, described in Avram et al. [44] and Avram et al. [45], was used, although the platform was designed to be general enough to support collaborative and context-aware data mining. As Anton et al. [46] specify, context-aware data mining follows the same steps as classical data mining but includes real-time context in the data mining process, while the collaborative scenario involves completing the data of the studied source with data taken from similar sources (for example, one or more locations in close proximity to the studied one). In the current research, the focus was on the classical data mining approach, applied in the DT, GBT, and hybrid DT-GBT methods.
Table 2. Scenarios used and input variables.
Below are the steps describing the modelled process behind each machine-learning method. Since three models were chosen (DT, GBT, and hybrid DT-GBT), there were 3 Rapid Miner processes, each following the same structure. For each test scenario in the list:
• establish the predicted value as specified in the scenario;
• select only the attributes specified;
• generate the model on the training data using windowing;
• apply the generated model on the test data;
• store the results.
The aggregated results were then subject to analysis, and conclusions were drawn based on these.

Results
To predict the soil temperature at different depths (5 cm, 10 cm, and 20 cm), the machine-learning algorithms were trained using windows of previous days. To establish the best window size, the values 3, 5, and 7 were tested at the beginning of the experiments. Table 3 presents the measured RMSE (°C) values for each algorithm used. The best results were obtained with a window of 3 previous days; increasing the number of days in the window did not improve the results. Table 4 presents the results obtained for different maximal depth values. A maximal depth of 10 was used for the decision tree algorithm in the experiments; for a maximal depth higher than 10, the overall accuracy of the predictions starts to decrease. Table 5 shows the results obtained for the combinations of maximal depth and number of trees tested for GBT. After this phase, the combination of 200 trees and a maximal depth of 20 was used in the experiments. For the hybrid DT-GBT approach, the best parameters obtained for each algorithm were used. The performance of the models and input scenarios used to estimate the temperature at different soil depths was then determined; 70% of all data were used for training the models and the remaining 30% for testing.
RMSE (°C) was computed for all scenarios and algorithms chosen, as seen in Table 6. The scenario combinations with the lowest RMSE were considered the best and analyzed in more detail.
As seen in Table 7, which covers only the best selected scenario for each depth, the DT model predicted the soil temperature most accurately at a depth of 20 cm, followed by 10 cm and 5 cm. The soil temperature at a depth of 5 cm was predicted with relatively high accuracy and low error (NS = 0.9669, KGE = 0.957, R = 0.9833, MAE = 1.4533, and RMSE = 2.0188). Soil temperature at a depth of 5 cm was more affected by the Sunshine Intensity and Sunshine Duration parameters than by other variables. Time series and scatter plots for all three depths are given in Figure 3. The DT model successfully estimated the soil temperature at different depths.
Table 8 gives the performance of the inputs and scenarios that yield the best results for 5, 10, and 20 cm soil depths according to the GBT model. As in the DT method, the best GBT estimates were for 20, 10, and 5 cm depths, respectively. According to Table 8, the MeanT variable alone is sufficient as an input to determine the temperature at a depth of 5 cm (NS = 0.9446, KGE = 0.857, R = 0.9793, MAE = 1.9144, RMSE = 2.6109). However, the input scenario consisting of four variables (MinT-MaxT-MeanT-Sunshine Duration) gave the best results for 10 and 20 cm soil depths. The time series and scatter plots of the GBT model results for all three depths are given in Figure 4. A very high level of agreement was achieved between the values predicted by the GBT model and the observed values at all depths, except for a few days.
Table 9 gives the performance of the input scenarios that yield the best results for temperatures at 5, 10, and 20 cm soil depths according to the DT-GBT hybrid model. In Table 9, the best result was obtained when the temperature of 5 cm soil depth was taken as the input [...]. The results in Table 9 show that, as soil depth increases, the accuracy rate of the model also increases.
According to the DT-GBT hybrid model results, time series graphs and scatter plots for all three depths are given in Figure 5. Except for a few days, a very high agreement was observed between the values estimated by the DT-GBT hybrid model and the observed values, especially at 10 cm and 20 cm depths.
In the remainder of the study, the methods were compared with each other for different depths.
The performance of the methods in the test period for a depth of 5 cm is given in Table 10, where the basic statistical values of the three methods in their best scenarios can be compared with the measured values. The results obtained from the methods are in the second, third, and fourth columns; the last column gives the figures for the measured values. At 5 cm depth, the DT-GBT hybrid method gave more accurate results than the other methods in terms of R value (R = 0.9954). However, DT was the most accurate in terms of minimum, maximum, and standard deviation values, while the mean of the GBT results was closest to the mean of the observed temperature values. In general, all three methods predicted soil temperature at a depth of 5 cm accurately. The performance of the methods in the test period for the prediction of the soil temperature at 10 cm depth is given in Table 11. In terms of R value, the DT method was more accurate than the other methods at 10 cm depth (R = 0.9983). At the same time, the DT results are very close to the observed temperature values in terms of minimum, maximum, mean, and standard deviation. All three methods thus successfully predicted soil temperature at 10 cm depth.
Table 11. Statistics for selected scenarios in used methods (ST10).

The performance of the methods in the test period for a depth of 20 cm is given in Table 12. In terms of the R value, the DT method showed slightly better results than the other methods at 20 cm depth (R = 0.9994). At the same time, the DT results are very close to the observed temperature values in terms of minimum, maximum, mean, and standard deviation. The DT method is very closely followed by the hybrid DT-GBT method. All three methods thus successfully predicted soil temperature at 20 cm depth.

It can be understood from Tables 10-12 that sunshine duration affects the soil temperature more than the other meteorological variables, especially at 10 and 20 cm depth. Sunshine duration is the most important variable, since it causes the soil to heat. After sunshine duration, the variables that express air temperature play an important role in soil warming, especially at 5 cm depth. The performance of the models used for different depths is given visually as a Taylor diagram in Figure 6. As can be seen from Figure 6a, the best predictions of soil temperature at 5 cm depth were obtained with the hybrid DT-GBT, GBT, and DT methods, in that order. As seen in Figure 6b,c, the best results at 10 and 20 cm soil depths were obtained with the DT, hybrid DT-GBT, and GBT methods, in that order.

Discussion
Estimation of soil temperature is one of the most important factors in the management of economic activities such as agriculture, construction, and agricultural insurance. Soil temperature depends on meteorological variables and can be measured at meteorological stations, but measuring it is relatively costly and requires expert staff.

Unfortunately, in Turkish districts outside the metropolitan areas, such as Divrigi, many meteorological parameters are measured at only one location. The transferability of a data mining model trained at a single point is likely to be low. However, the altitude change within the district is not very high, and there are no other long-term measuring stations. Naturally, the results obtained here cannot be generalized to other regions and other conditions.
The evaluation criteria were taken into account in the selection of the best scenario and the best model. The accuracy rate obtained under these operating conditions is quite good (NS: 0.9446-0.9942, KGE: 0.857-0.995, R: 0.9793-0.9971). Accuracy rate in all data-based models can be increased by discovering hidden patterns and minimizing the noise in the data. It is possible to make data smoother and more predictable with data preprocessing. With various preprocessing and filtering methods, the stochastic feature among the data can be reduced, and the accuracy of the model can be increased.
Using data-based models, soil temperature can be estimated at different depths with meteorological variables measured in the past. In this study, the performance of the hybrid DT-GBT method, built from the DT and GBT methods, was compared with that of the individual methods in estimating soil temperature at different depths. The meteorological variables associated with the temperature were considered as input scenarios in eight different combinations. According to the results, the best predictions at 5 cm soil depth were obtained with the hybrid DT-GBT, GBT, and DT methods, in that order. At 10 and 20 cm soil depths, the best estimates were obtained with the DT, hybrid DT-GBT, and GBT models, in that order. At the same time, the accuracy rate of the models increased with increasing soil depth. Sunshine duration was the most important meteorological variable for soil temperature at 10 and 20 cm depth, and air temperature was the most important at 5 cm soil depth. Precipitation had no noticeable effect on soil temperature in any model or at any depth. As a result, the DT, GBT, and hybrid DT-GBT models have been used successfully for predicting soil temperature.
Soil temperature is important for plant root development and the activity of microorganisms. It is not possible to measure this temperature at different depths, especially in the field conditions where vegetative production is made, because it is a costly process that requires equipment and expert staff. However, if successful models can be established for different regions and conditions, the soil temperature can be predicted without the need for land measurement, equipment, or labor. These predictions can assist in agricultural soil, fertilizer, and water resources management. Although three different artificial intelligence methods were used in this study, we did not have the chance to test them in different climatic and regional conditions. It cannot be generalized that the proposed model that makes the best estimates will be valid in all conditions, but it has been concluded that the methods can be used in the estimation of the soil temperature due to the successful results.