Forecasting Rainfed Agricultural Production in Arid and Semi-Arid Lands Using Learning Machine Methods: A Case Study

: With the rising demand for food products and the direct impact of climate change on food production in many parts of the world, recent years have seen growing interest in the subject of food security and the role of rainfed farming in this area. Machine learning methods can be used to predict crop yield based on a combination of remote sensing data and data collected by ground weather stations. This paper argues that forecasting drylands farming yield can be reliable for management purpose under uncertain conditions using machine learning methods and remote sensing data and determines which indicators are most important in predicting the yield of chickpea. In this study, the yield of rainfed chickpea farms in 11 top chickpea producing counties in Kermanshah province, Iran, was predicted using three machine learning methods, namely support vector regression (SVR), random forest (RF), and K-nearest neighbors (KNN). To improve prediction accuracy, for each county, remote sensing data were overlaid by the satellite images of rainfed farms with a suitable slope and altitude for rainfed farming. An integrated database was created by combining weather data, remote sensing data, and chickpea yield statistics. The methods were evaluated using the leave-one-out cross-validation (LOOCV) technique and compared in terms of multiple measures. Given the sensitivity of rainfed chickpea yield to the time of data, the predictions were made in two scenarios: (1) using the averages of the data of all growing months, and (2) using the data of a combination of months. The results showed that RF provides more accurate yield predictions than other methods. The predictions of this method were 7–8% different from the statistics reported by the Statistical Center and the Ministry of Agriculture of Iran. It was found that for pre-harvest prediction of rainfed chickpea yield, using the data of the March–April period (the averages of two months) offers the best result in terms of the correlation coefﬁcient for the relationship between the yield and the predictor indices.


Introduction
Agriculture is directly and indirectly the world's main source of food supply [1]. With the rising demand for food products due to global population growth, and also the direct impact of climate change on food production in many areas, food security is again becoming a crucial issue for many countries. The agricultural output of each region very much depends on its climate and the related parameters [2]. One of the most important determinants of agricultural production efficiency and output and consequently food security in a region is people's understanding of the region's climatic characteristics and their impact on the yield of different crops. This issue is especially important in rainfed farming. Rainfed farming is a type of agriculture in which crops are almost entirely irrigated by rainfall, which makes it extremely dependent on climatic conditions [3]. Legumes are a family of protein-rich agricultural products that are not only essential for a healthy diet, but also serve as an important source of income for millions of families who grow them alongside other crops. Farming legumes is also known to improve soil nitrogen levels, which plays an important role in sustainable production [4]. Chickpea (Cicer arietinum L.) is a drought-resistant plant that grows well in arid and semi-arid areas. This plant can tolerate not only high temperatures but also low temperatures to some extent [5], which makes it a very good option for rainfed farms. With a total output of 271,487 tons (2017), Iran is the world's ninth-largest producer of chickpea after the United States [6]. Almost 80% of Iranian chickpeas are produced in four provinces: Kermanshah, Lorestan, Kurdistan, and West Azerbaijan. Accounting for nearly 25.2% of the country's total chickpea production, Kermanshah province is the largest producer of chickpea in Iran. More than 80% of Kermanshah's chickpeas are produced in rainfed farms, which have an average yield of 456kg/hectare [7]. Kermanshah has a semi-arid mountainous climate [8], which is perfect for rainfed chickpea farming. However, in recent years, the province has struggled with water scarcity issues due to the effects of climate change, especially prolonged dry spells and droughts, and the consequent overexploitation of groundwater resources [9]. Allowing agriculture to expand beyond the region's ecological and environmental capacities has adverse consequences-such as soil erosion, desertification, land pollution, and environmental degradation-which will ultimately result in the destruction of the region's natural resources and the decline of its agricultural output; trends that will actually move the region away from sustainable development [10]. Any information about the vegetation cover of a region and how it is impacted by climatic variables can substantially benefit the management of agricultural yield in that region [11]. However, the traditional methods of gathering information about climate and vegetation cover could be not only time-consuming but also inaccurate. Today, such information can be collected more easily by using remote sensing technologies to monitor vegetation and climate variables and track their temporal and spatial changes on regional and global scales. Indeed, these technologies can offer an expansive and integrated view of any area with extremely high accuracy. By combining the remote sensing data with climatic measurements, one can create a real-time map of crop characteristics, which can be used in the quantification of annual net outputs on different scales and also the differentiation of vegetation covers on continental and regional scales, which could be highly valuable for effective land management [12,13]. Given the massive volume of these data (spatial and temporal) which grow with every second, many machine learning methods have been developed to take advantage of emerging computer hardware and their computing power in processing these data. Using the aforementioned methods and data, it is possible to predict the yield of different crops in different areas in order to guide government policies that prioritize food security, especially the decisions on the export and import of food products, and also boost the agricultural economy by helping farmers and agriculture-related businesses make more informed decisions about crops.

Review of Literature
Numerous studies have shown that massive spatio-temporal data obtained from the maps and satellite images produced by satellite spectroscopy can be combined with regional meteorological data to accurately predict crop yield on different scales using statistical models [11,[14][15][16][17][18]. The global problems during the recent decades have forced many researchers to pay more attention to clean energy sources for economic growth and environmental issues [19][20][21], therefore using rainfed agriculture would help to reduce pollutions. To predict crop yield, climate data such as temperature, precipitation, sunshine hours, humidity, etc. must be given to these models as input data. While climate data very well describe the environmental conditions that affect crop growth, they are not the only parameters that affect crop yield and cannot fully represent growth conditions. Climate data are more useful in crop yield estimations when combined with plant growth data [17]. Satellite images taken with remote sensing tools, such as spectral images in different bands, can offer direct information about the growth condition of plants. These spectral bands include optical spectra (visible and infrared) and microwave and thermal bands. Among these, light spectra such as infrared and red are widely used in the monitoring of plant growth parameters. Given the nonlinear nature of the empirical equations for the relationship between crop yield, crop growth indices, and climatic variables and the massive volume of spatio-temporal data that must be processed in statistical models to predict crop yield, these predictions require extensive computation. However, such computations can now be done easily thanks to advanced computer technology. In recent years, researchers have shown growing interest in the use of machine learning methods to predict crop yield [22]. Support vector machine (SVM), random forest (RF), and K-nearest neighbors (KNN) are three of the machine learning methods that have been successfully used to predict crop yield through classification and regression analysis of remote sensing data and climatic measurements.
For example, Gandhi et al. [10] used an SVM-based method to predict rice crop yield in order to facilitate crop selection in 27 districts of Maharashtra state in India.
In a study by Medar et al. [14], support vector regression (SVR) was used to predict the sugarcane yield in Karnataka, India. These researchers used a combination of climate and soil data with the satellite maps pertaining to the 2008-2018 period as the remote sensing data.
Chen et al. [15] used SVM to predict the yield of paddy rice in Chongqing, China, based on the weather data of 34 stations in this region for the 1985-2012 period. After compared the results of SVM with those of back-propagation neural network (BPNN) and multiple linear regression (MLR), this study reported that SVM is a better method for determining the effects of climate variables on crop yield.
In a study by Ejaz et al. [13], three methods including artificial neural network (ANN), linear regression (LR), and SVR were used to predict wheat yield in Punjab, Pakistan. These researchers used fertilizer and pesticide data as well as weather and soil data. All of these data, which were related to a 21-year period, were collected from the office of statistics and agriculture in Sargodha.
Kouadio et al. [16] used an ANN to predict the yield of robusta coffee in Vietnam based on SOM data and then compared the results with the results of RF and MLR. The data used in this study included soil fertility and soil yield data randomly collected from 44 farms in five districts of Lam Dong Province and the monthly averages of weather data for a 30-year period.
In a study by Kim & Lee [17], corn yield in Iowa, USA was estimated by the use of SVM, RF, ERT, and DL. The data used in this study were extracted from MODIS images of the Terra satellite and SM data obtained from ESA, CCI, which offer comprehensive global information with a resolution of 0.25 • from active and passive microwave sensors. For modeling, these researchers defined 13 time periods consisting of 5 months in which the crop is grown and harvested. Narasimhamurthy and Kumar [18] used RF to predict the rice yield in Andhra Pradesh, India. They used three types of data including temperature, precipitation, and perception data to estimate the rice yield in the studied area. The accuracy of their predictions with RF was 85.89%.
Parviz [23] used two methods, SVR and MLR, to predict barley yield in three Iranian provinces with arid, semi-arid, and humid climates. The results of this study showed that SVR is a better method for making a prediction for the considered 36-year period.
In a study by Kuwata and Shibasaki [24], SVR and DL were used to predict corn yield in Illinois, USA. In this study, DL was said to be preferred because of its advantage in the automatic selection of important features and also its greater flexibility in dealing with predefined features.
Tiwari and Shukla [25] conducted a similar study with BPNN and compared their training times and errors. These researchers used precipitation data and growth indices for yield prediction.
Han et al. [22] used 8 machine learning algorithms to predict the yield of winter wheat in 629 districts of a province of China. For this purpose, they defined multiple time windows and introduced an integrated framework of weather, remote sensing, and soil data for yield prediction based on the Google Earth Engine. These researchers used three groups of weather data and remote sensing data and soil data for their predictions.
In a study by Sharifi [26], which aimed to predict barley yield in Boshruyeh, Iran, this objective was pursued in three stages: (1) building a framework for predicting barley yield based on five-year weather data such as temperature and precipitation and remote sensing data such as NDVI And EVI for 24 districts; (2) identifying the best machine learning method for predicting barley yield by comparing four methods, namely GPR, KNN, DT, BPNN, which GPR to be the best method; and (3) identifying the best time for predicting crop yield before harvest. Tables 1 and 2 compares the past studies on the prediction of crop yield with machine learning methods.
Sharifi [23] investigated prediction of barley yield in Iran, which is one of the rainfed crops that can be grown in the region, NDVI, EVI, temperature and evaporation indicators were considered. Although, other factors such as soil moisture and rainfall can be considered in the growth of rain-fed barley. The irrigation method based on ground water, which means that farmers use water pumps to pump water was also investigated. Studies such as [13,17,20] have considered crops that are grown by irrigation. There have been other studies related to crops that are mainly located in areas where there is a lot of rain, such [10,[14][15][16]18,21]. Hen et al. [22] studies winter wheat, criteria such as air pressure that have slight changes in the growing season and can be considered almost constant. In irrigation agriculture, due to the fact that water resources are available and there is no vital need for rainfall, indicators such as rainfall can be ignored, such as [10]. In another study conducted in Iran for barley, rainfall index, temperature, and speed wind were considered as indicators affecting plant growth. It was found that wind speed index except in stormy areas, has no effect on forecasting and can be ignored.
SVR has been widely used in related studies such as [10,[13][14][15]17,19,20,22]. The SVR that is an optimized model of the SVM method, can be used to predict problems with continuous data like what we use in this study. Other researchers showed that, according to the intrinsic characteristics of crops and plant growth conditions, this method is suitable for forecasting [13,14,20].
Studies that used RF method, showed that this method has an acceptable performance for predicting the yield of agricultural products in different conditions with different indicator [10,16,17,22]. Kim and Lee [17] specifically demonstrated the capability of this method in using remote sensing indicators. Due to the unstable nature of climate phenomena and factors affecting plant growth, using the KNN method along with other methods gives more reliable performance to the results [22].  In this study, for the first time the prediction of rainfed chickpea farming for 11 counties with the highest chickpea production in Kermanshah Province of Iran was performed by the use of SVM, RF, and KNN machine learning methods, and a combination of remote sensing and local synoptic station data.
The above comparison shows that most of the past studies have been focused on crops that are either irrigated or cultivated in rainy areas. In Iran, however, rainfed agriculture in areas with limited access to water accounts for a fairly large portion of the country's total agricultural output and therefore plays a critical role in the food security of this country [3,6]. Considering the climate of Kermanshah and its evolving water scarcity issues, there seems to be some room for the expansion of rainfed farming in this province. Also, since this province has an international border (with Iraq) and good access to the region's major transit routes, it may be able to play a bolder role in the food security of the region, which may also generate an economic boost for its own farmers and agriculturerelated businesses. In this study, we tried to predict the yield of rainfed chickpea farming in 11 counties with the highest chickpea production in Kermanshah by the use of SVM, RF, and KNN machine learning methods and a combination of remote sensing, weather, and soil data.

Geographical Characteristics of Kermanshah Province
Kermanshah is one of the western provinces of Iran and has a border with Iraq. This province consists of 14 counties and its capital is the city of Kermanshah. Kermanshah province has an area of 24,640 km 2 (17th largest in Iran), a semi-arid mountainous climate [8], an average altitude of 1400 m (above sea level), an average temperature of 14 • C, and average annual precipitation of 450 mm, which is of Mediterranean type [27]. The geographical location of Kermanshah province is shown in Figure 1. The province has a very diverse climate (for example, while Qasr-e Shirin in the west of the province has a hot climate, Paveh in the north of the province has a cold climate, and Kermanshah in the center of the province has a temperate climate). Since time immemorial, rainfed farming has accounted for a major portion of agricultural output in this area. Kermanshah has long been one of Iran's top five provinces in terms of chickpea production and currently accounts for 20% of total chickpea production in this country. Over the last 12 years, the average chickpea yield in this province has been 441 kg/hectare. In terms of geographical coordinates, Kermanshah province is extended between 45 • 20 39" and 48 • 01 58" eastern longitudes and 33 • 37 08" and 35 • 17 08" northern latitudes. by the use of SVM, RF, and KNN machine learning methods, and a combination of remote sensing and local synoptic station data. The above comparison shows that most of the past studies have been focused on crops that are either irrigated or cultivated in rainy areas. In Iran, however, rainfed agriculture in areas with limited access to water accounts for a fairly large portion of the country's total agricultural output and therefore plays a critical role in the food security of this country [3,6]. Considering the climate of Kermanshah and its evolving water scarcity issues, there seems to be some room for the expansion of rainfed farming in this province. Also, since this province has an international border (with Iraq) and good access to the region's major transit routes, it may be able to play a bolder role in the food security of the region, which may also generate an economic boost for its own farmers and agriculture-related businesses. In this study, we tried to predict the yield of rainfed chickpea farming in 11 counties with the highest chickpea production in Kermanshah by the use of SVM, RF, and KNN machine learning methods and a combination of remote sensing, weather, and soil data.

Geographical Characteristics of Kermanshah Province
Kermanshah is one of the western provinces of Iran and has a border with Iraq. This province consists of 14 counties and its capital is the city of Kermanshah. Kermanshah province has an area of 24,640 km 2 (17th largest in Iran), a semi-arid mountainous climate [8], an average altitude of 1400 m (above sea level), an average temperature of 14 °C, and average annual precipitation of 450 mm, which is of Mediterranean type [27]. The geographical location of Kermanshah province is shown in Figure 1. The province has a very diverse climate (for example, while Qasr-e Shirin in the west of the province has a hot climate, Paveh in the north of the province has a cold climate, and Kermanshah in the center of the province has a temperate climate). Since time immemorial, rainfed farming has accounted for a major portion of agricultural output in this area. Kermanshah has long been one of Iran's top five provinces in terms of chickpea production and currently accounts for 20% of total chickpea production in this country. Over the last 12 years, the average chickpea yield in this province has been 441 kg/hectare. In terms of geographical coordinates, Kermanshah province is extended between 45°20′39″ and 48°01′58″ eastern longitudes and 33°37′08″ and 35°17′08″ northern latitudes.

Methodology
In this section, we first describe the factors affecting the growth of rainfed plants. The methods used to predict is then explained. First, 9 remote sensing indicators and its sources and features will be explained. Then, using three filters of slope, elevation, and dryland farm density, we consider only the data related to lands that have appropriate slope, elevation and in which dryland cultivation takes place. After that the remote sensing data are combined with seven meteorological data that obtained from Iran Meteorological Organization. Finally, by adding the actual amount of dryland chickpea yields, our main data set is created for each year. In the following, the machine learning methods that used this data set for forecasting are explained. After that validation technique that used to estimate the accuracy of these methods are explained.

Crop Yield Parameters
In this study, crop yield predictions were made based on two groups of data: remote sensing data collected by satellite and weather data collected by regional synoptic stations. The remote sensing data included the data from MODIS, which provide information about plant growth during planting, germination, growth, and harvest, and also the soil moisture and CHIRPS data, which offer information about the water used for irrigation in rainfed agriculture. The meteorological data, which were intended to give information about the environmental conditions of the plants, were obtained from ground stations. All collected data are presented in Appendices A and B.

Indicators of Plant Growth in the Region
MODIS indices that could be related to plant growth and crop yield-including gross primary production (GPP), enhanced vegetation index (EVI), leaf area index (LAI), fraction of photosynthetically active radiation (FPAR), normalized difference vegetation index (NDVI), and evapotranspiration (ET)-were obtained from the Terra satellite. Figure 2a-f show the averages of these induces for 11 counties in the study area for the period 2010-2017. It should be explained that the counties in the middle part of the province have a temperate climate, those in the northern part have a cold climate, and those in the western part have a hot and dry climate. NDVI, the map of which is illustrated in Figure 2a, is a simple graphical index developed for remote sensing analysis to assess the presence or absence of vegetation in an area. This index is computed based on the difference between near-infrared and red-light spectra. NDVI values close to zero (between −0.1 and +0.1) usually indicate bare rock, sand, or snow surfaces. Higher positive NDVI values (about +0.2 to +0.4) indicate shrub and grassland cover. EVI has been developed to cover some of the limitations of NDVI and particularly to minimize atmospheric effects and differences in blue and red reflectance [28]. The EVI map of the area is shown in Figure 2b. This index varies in the range of −1 to +1. LAI, which is plotted in Figure 2c, is the fraction of the surface area of the vegetation canopy to the surface area of the ground it covers. To obtain LAI for farmlands, the leaf area of the plants in one square meter of land should be measured. Another key parameter for the analysis of plant structures in ecosystems and also in crop yield models is FPAR, which provides a lot of information about photosynthesis, energy exchange, and carbon exchange of vegetation covers [29]. The FPAR map of the study area is shown in Figure 2d. FPAR is the fraction of photosynthetically active radiation (light available for photosynthesis) that is absorbed by the plant canopy. ET, the map of which is shown in Figure 2e, is the sum of evaporation and transpiration. This index has been defined because the evaporation from the moist soil surface cannot be easily separated from the transpiration from the plant surface. It has been shown ET has a relationship with crop yield and could be effective in its estimation [30]. The GPP map of the area is shown in Figure 2f. GPP is the ratio of solar energy captured in the plant's sugar molecules during the photosynthesis process, which then plants use for their cellular metabolisms, such as respiration and growth. Soil moisture content is also a key parameter for plant growth. Figure 2g-h shows the surface soil moisture and subsurface soil moisture, which both have an impact on planting, germination, harvesting of rainfed plants. In general, soil moisture is the water content in the soil, which can be expressed in terms of mass and volume percentages. In rainfed farming, where crops are not irrigated by any water source other than rain, soil moisture is an important determinant of crop yield, although it is also affected by a wide range of variables. In this study, the soil moisture map of the study area was created using NASA-USDA SMAP Global soil moisture data. for plant growth. Figure 2g-h shows the surface soil moisture and subsurface soil moisture, which both have an impact on planting, germination, harvesting of rainfed plants.
In general, soil moisture is the water content in the soil, which can be expressed in terms of mass and volume percentages. In rainfed farming, where crops are not irrigated by any water source other than rain, soil moisture is an important determinant of crop yield, although it is also affected by a wide range of variables. In this study, the soil moisture map of the study area was created using NASA-USDA SMAP Global soil moisture data.    The sources of the data used in the study and the temporal and spatial resolutions of the collected remote sensing data are given in Table 3. The indices related to plant growth were all obtained from MODIS data of the Terra satellite. The specifications of MODIS data products used in the study are also listed in the table. Given the differences in temporal and spatial resolutions, the satellite images were constructed using the monthly averages of the data.

. Indicators of Environmental Conditions of Plant Growth
To obtain the factors of environmental conditions of plant growth, the daily weather data collected by 11 synoptic stations of the Iran Meteorological Organization in the study area over an 8-year period were gathered. This included eight types of data, namely minimum temperature, maximum temperature, average temperature, sunny hours, maximum humidity, minimum humidity, average humidity, and dry temperature in growing, germination, and flowering seasons. Environmental factors also included slope and altitude, which were determined using the DEM of Kermanshah province. After giving a score to each 100 m 2 parcel of land, this map was transformed into the raster format. Since multiple layers of data had to be combined, the data of the layers were standardized before overlay. As the slope map illustrated in Figure 3a shows, a large portion of the province has a slope of 0-10 degrees. The northwestern and northeastern parts of the province generally have slopes of more than 10 degrees. In slopes of over 10 degrees, the shallowness of the soil and poor plowing conditions make the land more susceptible to environmental degradation. Between the mountains of the province, there are several plains with very gentle slopes from northwest to southeast. Figure 3b shows the altitude map of the study area. As can be seen, in Kermanshah province, altitude decreases from east to west. The conditions are more suitable in the central plains of the province, where the average altitude ranges from 800 to 1500 m. Some east and northeastern parts of the province, such The sources of the data used in the study and the temporal and spatial resolutions of the collected remote sensing data are given in Table 3. The indices related to plant growth were all obtained from MODIS data of the Terra satellite. The specifications of MODIS data products used in the study are also listed in the table. Given the differences in temporal and spatial resolutions, the satellite images were constructed using the monthly averages of the data.

. Indicators of Environmental Conditions of Plant Growth
To obtain the factors of environmental conditions of plant growth, the daily weather data collected by 11 synoptic stations of the Iran Meteorological Organization in the study area over an 8-year period were gathered. This included eight types of data, namely minimum temperature, maximum temperature, average temperature, sunny hours, maximum humidity, minimum humidity, average humidity, and dry temperature in growing, germination, and flowering seasons. Environmental factors also included slope and altitude, which were determined using the DEM of Kermanshah province. After giving a score to each 100 m 2 parcel of land, this map was transformed into the raster format. Since multiple layers of data had to be combined, the data of the layers were standardized before overlay. As the slope map illustrated in Figure 3a shows, a large portion of the province has a slope of 0-10 degrees. The northwestern and northeastern parts of the province generally have slopes of more than 10 degrees. In slopes of over 10 degrees, the shallowness of the soil and poor plowing conditions make the land more susceptible to environmental degradation. Between the mountains of the province, there are several plains with very gentle slopes from northwest to southeast. Figure 3b shows the altitude map of the study area. As can be seen, in Kermanshah province, altitude decreases from east to west. The conditions are more suitable in the central plains of the province, where the average altitude ranges from 800 to 1500 m. Some east and northeastern parts of the province, such as Sonqor County, have an average altitude of over 1500 m. In contrast, the western parts of the province-such as Qasr-e Shirin, Sarpol-e Zahab, and Gilan-e Gharb counties have an average altitude of less than 800 m. as Sonqor County, have an average altitude of over 1500 m. In contrast, the western parts of the province-such as Qasr-e Shirin, Sarpol-e Zahab, and Gilan-e Gharb counties have an average altitude of less than 800 m.
(a) Slope map of the study area (b) Altitude map of the study area

CHIRPS Data
The rainfall data of the study area were collected from the monthly rainfall data contained in the CHIRPS dataset. The monthly cumulative rainfall data of CHIRPS have been obtained by the integration of global monthly precipitation climatology (CHPclim) data, precipitation estimates based on the TIR band of satellites, and data collected from ground stations. This dataset covers the geographical coordinates ranging from − 50 to 50 degrees and time frames ranging from 1981 to the present. Several previous studies have shown the good consistency of this dataset with the data of ground stations [31]. According to [32], the average correlation of this dataset with Iran's rainfall data for spring, summer, autumn, and winter months is 0.61, 0.56, 0.59, and 0.45, respectively. CHIRPS dataset was downloaded from the website of the University of California, Santa Barbara (https://data.chc.ucsb.edu/products/CHIRPS-2.0/global_annual/tifs (accessed on 24 October 2020)). To create the rainfall map shown in Figure 4, we used the 8-year average of daily rainfall data of Kermanshah. This map shows that the annual rainfall is highest in the northern counties such as Paveh and lowest in the western counties such as Qasr-e Shirin. The ArcPy library was used to perform mathematical operations on daily rainfall raster files in millimeters.

CHIRPS Data
The rainfall data of the study area were collected from the monthly rainfall data contained in the CHIRPS dataset. The monthly cumulative rainfall data of CHIRPS have been obtained by the integration of global monthly precipitation climatology (CHPclim) data, precipitation estimates based on the TIR band of satellites, and data collected from ground stations. This dataset covers the geographical coordinates ranging from −50 to 50 degrees and time frames ranging from 1981 to the present. Several previous studies have shown the good consistency of this dataset with the data of ground stations [31]. According to [32], the average correlation of this dataset with Iran's rainfall data for spring, summer, autumn, and winter months is 0.61, 0.56, 0.59, and 0.45, respectively. CHIRPS dataset was downloaded from the website of the University of California, Santa Barbara (https://data.chc.ucsb.edu/products/CHIRPS-2.0/global_annual/tifs (accessed on 24 October 2020)). To create the rainfall map shown in Figure 4, we used the 8-year average of daily rainfall data of Kermanshah. This map shows that the annual rainfall is highest in the northern counties such as Paveh and lowest in the western counties such as Qasr-e Shirin. The ArcPy library was used to perform mathematical operations on daily rainfall raster files in millimeters.

CHIRPS Data
The rainfall data of the study area were collected from the monthly rainfall data contained in the CHIRPS dataset. The monthly cumulative rainfall data of CHIRPS have been obtained by the integration of global monthly precipitation climatology (CHPclim) data, precipitation estimates based on the TIR band of satellites, and data collected from ground stations. This dataset covers the geographical coordinates ranging from − 50 to 50 degrees and time frames ranging from 1981 to the present. Several previous studies have shown the good consistency of this dataset with the data of ground stations [31]. According to [32], the average correlation of this dataset with Iran's rainfall data for spring, summer, autumn, and winter months is 0.61, 0.56, 0.59, and 0.45, respectively. CHIRPS dataset was downloaded from the website of the University of California, Santa Barbara (https://data.chc.ucsb.edu/products/CHIRPS-2.0/global_annual/tifs (accessed on 24 October 2020)). To create the rainfall map shown in Figure 4, we used the 8-year average of daily rainfall data of Kermanshah. This map shows that the annual rainfall is highest in the northern counties such as Paveh and lowest in the western counties such as Qasr-e Shirin. The ArcPy library was used to perform mathematical operations on daily rainfall raster files in millimeters.

Rainfed Chickpea Farming and Yield Statistics of the Area
The rainfed chickpea production statistics of the study area were obtained from the database of the Statistical Center of Iran and the annual reports of the Ministry of Agriculture of Iran at the county level. Since the location information of rainfed farms by crop type was not available, we used GFSAD products to create raster images of rainfed farms in the study area. Combining the data of multiple satellite sensors, these products allow users to determine the type of land in terms of irrigation. Figure 5 shows the map of rainfed farms in Kermanshah according to GFSAD products (2010) with a resolution of 1 km. The rainfed chickpea production statistics of the study area were obtained from the database of the Statistical Center of Iran and the annual reports of the Ministry of Agriculture of Iran at the county level. Since the location information of rainfed farms by crop type was not available, we used GFSAD products to create raster images of rainfed farms in the study area. Combining the data of multiple satellite sensors, these products allow users to determine the type of land in terms of irrigation. Figure 5 shows the map of rainfed farms in Kermanshah according to GFSAD products (2010) with a resolution of 1 km.

Machine Learning (ML) Methods
Machine learning (ML) is a method for data analysis that automates construction of model. It is a kind of AI (artificial intelligence) based on the theory that computers which can learn from data, identify patterns and make decisions without being programmed to perform specific tasks and automatically improve by experience and by implementing of data. ML learn from past calculations for producing repeatable decisions, reliable, and results. The repetitive aspect of machine learning is essential because the models are exposed to new data, also are able to independently modify patterns according to the new data. An essence aim of a ML learner is to generalize from its experience. Generalization is the ability of a learning machine to perform accurately on unseen samples after taking a learning data set. The training samples come from some generally unknown probability distribution and the learner has to construct a general model about this space that empowers ML learner to produce appropriately accurate predictions in new cases. Depending on the nature of the feedback which the ML learner receives from its environment, ML methods are usually divided into three categories; supervised, unsupervised, and reinforcement learning. In this study, we use supervised learning methods. Supervised learning methods build a mathematical model of a data set that includes both the inputs and its related real outputs-also known as a supervisory feedback-and learn a function that can be used to predict the output associated with new inputs. The data set is known as training data, and consists of a set of training samples. Here, we mean data set of remote sensing and meteorological data which are integrated and described in Section 4.1. An optimal function will allow the ML method to accurately predict the output for inputs that were not a part of the training data set. When the algorithm can do the prediction well and improves the accuracy of its predictions is said learned to perform that task. Supervised methods consist of two classification and regression algorithms, classification algorithms are used when the outputs are restricted to a limited set of classes, for example, sorting emails to spam and unspam emails. Regression algorithms are employed when the outputs may have any numerical value within a range, for example, what we do in this study prediction of crop yield according to growing conditions data set. The machine

Machine Learning (ML) Methods
Machine learning (ML) is a method for data analysis that automates construction of model. It is a kind of AI (artificial intelligence) based on the theory that computers which can learn from data, identify patterns and make decisions without being programmed to perform specific tasks and automatically improve by experience and by implementing of data. ML learn from past calculations for producing repeatable decisions, reliable, and results. The repetitive aspect of machine learning is essential because the models are exposed to new data, also are able to independently modify patterns according to the new data. An essence aim of a ML learner is to generalize from its experience. Generalization is the ability of a learning machine to perform accurately on unseen samples after taking a learning data set. The training samples come from some generally unknown probability distribution and the learner has to construct a general model about this space that empowers ML learner to produce appropriately accurate predictions in new cases. Depending on the nature of the feedback which the ML learner receives from its environment, ML methods are usually divided into three categories; supervised, unsupervised, and reinforcement learning. In this study, we use supervised learning methods. Supervised learning methods build a mathematical model of a data set that includes both the inputs and its related real outputs-also known as a supervisory feedback-and learn a function that can be used to predict the output associated with new inputs. The data set is known as training data, and consists of a set of training samples. Here, we mean data set of remote sensing and meteorological data which are integrated and described in Section 4.1. An optimal function will allow the ML method to accurately predict the output for inputs that were not a part of the training data set. When the algorithm can do the prediction well and improves the accuracy of its predictions is said learned to perform that task. Supervised methods consist of two classification and regression algorithms, classification algorithms are used when the outputs are restricted to a limited set of classes, for example, sorting emails to spam and unspam emails. Regression algorithms are employed when the outputs may have any numerical value within a range, for example, what we do in this study prediction of crop yield according to growing conditions data set. The machine learning methods used in this study for yield chickpea prediction were random forest (RF), support vector machine (SVM), and K-nearest neighbors (KNN). First, these methods were trained with a set of training data, and then their ability to predict new data was evaluated by evaluation methods which are explained in Section 4.3.

Support Vector Machine
As in many other machine learning methods, SVM's modeling process involves two stages of training and testing. At the end of the training phase, testing data must be used to evaluate the generalizability of the trained model [33,34]. The solution of the SVR problem, which is the regression model of SVM, on a dataset consisting of L samples of form {(x 1 , y 1 ), (x 2 , y 2 ), . . . , (x L , y L ), x ∈ R m , y ∈ R} is a linear function in the form of Equation (1), which can estimate output values based on inputs [33] In this equation,ŷ is the estimated value, x is the input vector, w is the weight vector, and b is the bias.
As shown in Equation (2), the cost function introduced by Vapnik ignores the errors within a certain distance of the true value (epsilon). In other words, it considers some deviations to be acceptable [33]. The control parameters of the response function, i.e., weight and bias vectors, are obtained by solving the optimization problem of Equation (3) In this equation, ε is the acceptable error in the cost function,ŷ is the estimated value, and y is the true value.
Minimize : In this equation, w 2 is the weight vector, ξ * i and ξ i are the slack variables, C is the penalty or tuning parameter.
Using Lagrange coefficients, the optimization problem of Equation (3) can be converted to the Lagrangian function of Equation (4). The Lagrangian coefficients α and α* are obtained by maximizing this function under the defined constraints.
The above optimization problem can be solved with the help of quadratic programming methods, in which case the global extremum will be certainly reached and there will be no risk of being trapped in local extremums [35]. Thus, after computing α and α* in Equation (4), the optimal weight vector w 0 , the optimal bias b 0 , and the prediction function of Equation (1) can be written as Equations (5)-(7) [36] In these equations x i is the input vector with which the model is trained, χ is the new input vector, xr and xs are support vectors, w 0 is the optimal weight vector, and b 0 is the optimal bias value.
Data points with non-zero Lagrangian coefficients are called support vectors. Geometrically, these data points have a larger prediction error than epsilon. The parameters C and ε should be specified by the user. C is a tuning parameter and can be set to values ranging from zero to infinity. If C is set to large values, SVR will not allow errors in the training data, which results in lower generalizability. If C is set to values closer to zero, SVR will allow larger errors, because then the value of slack variables in Equation (3) becomes less important [36,37] and the model becomes less sensitive to the occurrence of errors in the training dataset. The parameter ε can also take values from zero to infinity. The value of this parameter greatly affects the condition of support vectors and consequently the performance of the model. Although setting ε to very large values will decrease the number of support vectors, which in itself is desirable, it is wrong to pursue this goal by widening the ε-band. On the other hand, setting ε to very small values will result in having a large number of support vectors, which increases the risk of over-training [37].

Random Forest
This method operates based on a structure consisting of numerous regression trees (CARTs) and its output is computed by integrating (or averaging) the output of these CARTs. The principles of ensemble training techniques are based on the assumption that some training methods perform better in certain situations and therefore using a group of these methods can result in good performance in a wide range of situations. In other words, a combination of several models (in our case, prediction models) tends to be more accurate than a single model. A decision tree is a non-parametric statistical model (in the sense that no value is pre-set for its variables) first introduced by Breiman et al. [38,39], which can be described as a structure consisting of many nodes and leaves that grow in the training process. To train a tree, one needs a set of training data like L n = {(x 1 , y 1 ), (x 2 , y 2 ), . . . , (x n , y n )} where n is the number of observations, x is the input vector with m features X = {x 1 , x 2 , . . . , x m }, and y is the output scalar. During training, the input data are split at each node, beginning with the root node. Each node uses its own splitting operator for this purpose. This process continues until the largest possible number of leaves are obtained. At the end of this training process, the prediction function t = (X, L n ) is built on the training data. In the random forest (RF) method, each tree creates a random sample of data (hence the term random in the name). All of the generated decision trees are then combined by a bootstrap-aggregation (bagging) algorithm [40]. A bootstrap dataset is a set of instances like L θ 1 n , L θ 2 n , . . . L θ q n that are obtained by random sampling with replacement from the training data. In RF, each decision tree is trained with a set of these values. Using the q trees created for the new input, the method obtains q predictions in the form ofŷ 1 = t x, L θ 1 n ,ŷ 2 = t x, L θ 2 n , . . . ,ŷ q = t x, L θ q n , and then integrates the predictions of individually trained trees by Equation (8) for regression problems [41,42].ŷ In this equation,ŷ i is the prediction of the i-th tree. The main advantage of using a bagging algorithm is that it helps avoid correlation between different trees, which results in the higher diversity of trees created from different samples. Another advantage of this algorithm is its noise resistance, because it produces independent trees based on different training samples, and while a decision tree alone could be noise-sensitive, the combination of several independent decision trees tends to be less noise-sensitive [43].

K-Nearest Neighbors
First introduced by Cover & Hart [44], KNN is a machine learning method used for regression. In this application, KNN considers each data record as a vector in an m-dimensional space (where m is the number of features), and predicts the value of each new sample based on the values of K records that are closest to that point in that space [45]. In this method, the closeness of the new point x and the training point x i is measured by a Euclidean distance function in the form of Equation (9) as following [45] d(x, . . , n and j = 1, 2, . . . , m In this equation, n is the number of training samples and m is the number of input samples.
The steps of KNN can be summarized as follows.
Step 1: Specifying the parameter K (the number of nearest neighbors).
Step 2: Computing the distance of the new input from all training samples.
Step 3: Sorting the computed distances in ascending order and selecting the K samples with the shortest distance.
Step 4: Forming the set of values of the selected K samples.
Step 5: Computing the predicted value of the new sample by weighted averaging of the values in the set of Step 4.
The first step in this method is to choose the value of K (the number of neighbors). After selecting K, the Euclidean distance between the new sample and the existing samples, i.e., d(x, x i ) for i = 1, 2, . . . , n must be computed. The values of the K samples with the lowest d (shortest distance from the new sample) must then be put in Equation (10) to compute the predicted value of the new input [46] Here, w i is the weight of each training sample, which depends on its distance from the new sample and is given by Equation (11) As this equation shows, samples that are closer to the new sample will have a greater impact on the prediction. In KNN, the new values are produced based on the previous data regardless of how much noise they have, which makes KNN sensitive to noise and irrelevant features [46].

Cross Validation
Overfitting is a modeling error that happens when a function is too closely fit to a limited data set. In such a case, the model tries to cover the data from the sample and even the noise values with a lot of changes. While such a model should reflect the behavior of society. In such cases, if obtained regression model is used to predict another sample, the predicted values will not seem appropriate at all. We used cross validation to overcome the problem of overfitting of ML methods. Cross-validation is model validation technique for assessing how the results of a mathematical analysis will generalize to unseen data set. Cross-validation randomly split the training dataset into k equal subsets. Then train Ml method on k−1 of them and measure the prediction error on the kth subset, and do this in such a way which all k subsets be used for training and for calculating the prediction error. Then get average of the prediction errors from all k subset. Leave-one-out cross-validation (LOOCV) is a kind of cross-validation which number of folds will be equal to number of instances in data set. Thus, the learning algorithm is applied once for each instance, using all other instances as a training set and using the selected instance as a single-item test set. Good machine learning models such as SVR, RF, and KNN are the ones which have good prediction accuracy, in other words, small prediction error shows that the method is capable of accurately predicting when faced with unfamiliar data. Validation measures quantify the deviation between the predicted data and the true data [17]. In this study, the accuracy of the machine learning methods for predicting the yield of products was evaluated using the LOOCV technique with MAE, MBE, RMSE, MAPE, and CC as measures. The equations of these measures are given below [15,17,26].
Mean absolute error (MAE) is a measure of errors between paired observations, which refers to the real yield of chickpeas obtained from the Statistical Center of Iran and the value predicted by forecasting methods. MAE is calculated as Equation (12) It is thus an arithmetic average of the absolute errors. In these equations, y i denotes the true values,ŷ i denotes the predicted values. The mean absolute error uses the same scale as the data being measured. Mean bias error (MBE) is mainly used to estimate the average bias in the model and to determine if any steps need to be taken to correct the model bias. MBE captures the average bias in the forecasting. MBE is calculated as Equation (13) In these equations, y i denotes the true values,ŷ i denotes the predicted values and n is the number of samples. RMSE is a measure of the differences between predicted values and real values or the quadratic mean of these differences. The RMSE serves to aggregate the errors in predictions for various data points into a single measure. In other word, is a measure of accuracy, to compare forecasting errors of different models for a particular data set. RMSE is calculated as Equation (14) where, y i denotes the true values,ŷ i denotes the predicted values and n is the number of samples. Mean absolute percentage error (MAPE) is also a measure of prediction accuracy of a forecasting method and is commonly used as a loss function for regression problems and in model evaluation, because of its very intuitive interpretation. This method expresses the prediction error as a percentage and is easier to express. MAPE is calculated as Equation (15) Here, y i denotes the true values,ŷ i denotes the predicted values, and n is the number of samples. Correlation coefficient (CC) is a statistical relationship between two variables including real and predicted values of chickpea yield. CC assumes values in the range from −1 to +1, where 0 the strongest possible disagreement and ±1 indicates the strongest possible agreement. A number closer to 1 indicates a direct relationship between the two input arguments. Conversely, a number closer to −1 indicates an inverse relationship between them. Correlation coefficient is calculated as Equation (16) [15,17,26] In these equations, y i denotes the true values,ŷ i denotes the predicted values, andŷ i and y i are their respective averages and n is the number of samples.
Clearly, there are similar and other methods of prediction which could be implemented for any research works [47][48][49][50].

Results and Discussion
Since goal of this research was to estimate the yield of rainfed chickpea farms in Kermanshah, information about other types of land and other areas were irrelevant. Therefore, we combined the raster map of rainfed farms of the province with the raster maps of slope and altitude (Figure 3a,b) in order to filter the obtained satellite data and reduce irrelevant and outlier data. Since the crop yield data provided in the reports of the Statistical Center of Iran and the Ministry of Agriculture of Iran are at the county level, other data were also transformed into county level by the use of zonal operators. For this purpose, all pixels except those related to rainfed farms, slopes of less than 10 degrees, and non-mountainous heights were filtered out. In this way, we created an integrated database of remote sensing data, weather data, and rainfed chickpea farms in Kermanshah. Rainfed spring chickpea, which accounts for the largest portion of chickpea production in Kermanshah, is typically planted in late winter and harvested in late spring. Since the selected factors had different degrees of sensitivity to the time of data, we defined two scenarios for using the data of each month and also the combination of consecutive months (e.g., MA for March and April): (1) scenario (A), in which we used the average values obtained for the entire cultivation period from planting to harvest; (2) scenario (OC), in which we used an optimal combination of months depending on to the correlation coefficient for the relationship between the crop yield and each predictor variable. Table 4 shows the correlation coefficient for the relationship between the average value of each factor (predictor) in each month between March and June (planting, germination, growth, and harvest months for rainfed chickpea farming in the area) and the crop yield in 2010-2017. The highlighted cells in this table show the periods selected for the OC scenario on account of having the highest correlation coefficient. Table 5 illustrate the average of real and predicted values by the methods for the whole province in different years from 2010 to 2017 according to these two scenarios.   scenarios. Similar results were obtained in terms of MAE and MBE, showing RF to be the most accurate, SVR to be the second most accurate, and KNN to be the least accurate of the implemented methods. All methods performed better in all evaluation measures in the OC scenario, which indicates that this method (the optimal combination of months) is more appropriate for predicting the yield of rainfed chickpeas. Figure 6a-f show the difference between the actual values and the prediction made by the methods in the A and OC scenarios. For both scenarios, the line drawn for RF (Figure 6c,d) is closest to the 1:1 line, reflecting the better performance of this method in predicting the yield of rainfed chickpea farms. Given that the lowest chickpea yield has occurred in 2015 due to drought and the highest yield has occurred in 2014 because of heavy Mediterranean rainfall, KNN has performed poorly in predicting the yield of both of these years in both scenarios. This is evident in the lines plotted for KNN in Figure 6e,f.
In Figure 6a,b, it can be seen that SVR has not performed as poorly as KNN, but has failed to produce acceptable predictions for the yield in 2014 (the year with the highest yield in the data). Overall, RF has had the most reliable performance in predicting chickpea yield even in the presence of yield fluctuations due to rainy years and droughts.
Figure 6a-f show the difference between the actual values and the prediction made by the methods in the A and OC scenarios. For both scenarios, the line drawn for RF (Figure 6c,d) is closest to the 1:1 line, reflecting the better performance of this method in predicting the yield of rainfed chickpea farms. Given that the lowest chickpea yield has occurred in 2015 due to drought and the highest yield has occurred in 2014 because of heavy Mediterranean rainfall, KNN has performed poorly in predicting the yield of both of these years in both scenarios. This is evident in the lines plotted for KNN in Figure 6e, f. In Figure 6a, b, it can be seen that SVR has not performed as poorly as KNN, but has failed to produce acceptable predictions for the yield in 2014 (the year with the highest yield in the data). Overall, RF has had the most reliable performance in predicting chickpea yield even in the presence of yield fluctuations due to rainy years and droughts.  . Results of the prediction models. Figure 6. Results of the prediction models.
The use of machine learning techniques for predicting crop yields can present new opportunities for decision makers to increase food security. Especially in areas where crop yields are affected by uncertain conditions, like rainfed agriculture. In rainfed agriculture, due to the fact that the plant is not irrigated by the farmer and the plant receives its required moisture from the environment, and these environmental conditions are variable. Use of these influential factors and machine learning techniques for predicting crop yields provides useful information to government decision makers, warehousing and farmers.
Remote sensing data are comprehensive, accurate and varied. However, they are only available in developed countries. The needs of these countries determine the priority of the time and place of collection of this data. Also due to the fact that the collection of this information is done from different satellites and different sensors. There are different scales, units and time periods, which prolongs the process of refining and collecting data. Meteorological data obtained from synoptic stations have less lost information and are easier to access for less developed countries because they are collected locally. This data also needs to be refined. Access to them in the studied area was difficult. Also, some sensors were not present in all stations. Considering two different time scenarios for data collection as well as combining data for months shows the impact of some indicators is greater in certain seasons. For example, in the RF method for the OC time scenario, the most important criteria were surface moisture, minimum air humidity, and subsurface humidity in March, January, and January, respectively. For SVR and KNN methods, the most important indicators were ET and GPP, which were MA (average March and April) and April, respectively. The application of machine learning methods using the collected data showed that the random forest method had the least errors for the OC scenario. In summary, defining different time scenarios for data collection enables the researcher to make more accurate estimates of future rainfed agricultural yield by collecting specific data at specific times.

Conclusions
The rising world population and the consequent increase in demand for reliable food supplies and also the impact of climate change on food production have rekindled concerns about food security in many parts of the world. In many areas that lack sufficient water resources, rainfed farms are the main source of food and income, but these farms are being increasingly affected by climate change. Predicting the yields of these farms based on climate and remote sensing data can greatly guide government policies and help farmers and agriculture-related businesses make better decisions regarding resource procurement and allocation. In this study, the yield of rainfed chickpea farms in 11 top chickpea producing counties in Kermanshah province of Iran was predicted using SVR, RF, and KNN. The predictions were then evaluated in terms of multiple measures. Given the sensitivity of rainfed chickpea yield to the time of data, the predictions were made in two scenarios: (1) using the averages of the data of all growing months; and (2) using the data of a combination of months with the highest correlation with the yield. A total of 16 factors that tend to affect plant growth were collected from weather and remote sensing data, and an integrated database of these factors was formed. To improve prediction accuracy, the raster maps of rainfed farms were merged with slope and altitude maps of the province. Overall, RF showed the highest prediction accuracy in terms of all computed validation measures. The obtained predictions were only 7-8% different from the statistics reported by the Statistical Center and the Ministry of Agriculture of Iran, indicating that machine learning methods can indeed be a reliable choice for crop yield modeling. In future studies, it could be worthwhile to examine the effect of genotype and seed type of spring and autumn chickpea cultivated in local farms and also the soil type of these farms on the yield prediction at farm and regional level and also to explore the economic dimension of this discussion. Uncertainties of this study are related to amount of rainfall in the future which has been affected by global warming and environmental issues in the past two decades all over the world specially in Iran. We tried to predict production of chickpea in Kermanshah Province as one of the main producers of this product in Iran. Results of this research could help government policy makers to have an accurate statistic of chickpea. It is essential for import and export too.

Appendix B
Tables A9-A16 are data for real and prediction values of dryland chickpea yield for optimum combination scenario from 2010 to 2017.