Predicting Rice Pest Population Occurrence with Satellite-Derived Crop Phenology, Ground Meteorological Observation, and Machine Learning: A Case Study for the Central Plain of Thailand

Featured Application: Integration of ground-based weather variables, satellite-derived host-plant phenology and ANN modelling were applied for rice pest warning and prediction to support an integrated pest management (IPM) programme in the central plain of Thailand. Abstract: The brown planthopper Nilaparvata lugens (BPH) is one of the most harmful insect pests in rice paddy fields, which causes considerable yield loss and consequent economic problems, particularly in the central plain of Thailand. Accurate and timely forecasting of pest population incidence would support farmers in planning effective mitigation. In this study, artificial neural network (ANN), random forest (RF) and classic linear multiple regression (MLR) analyses were applied and compared to forecast the BPH population using weather and host-plant phenology factors during the crop dry season from 2006 to 2016 in the central plain of Thailand. Data from satellite earth observation was used to monitor crop phenology factors affecting BPH population density. An ANN model with integrated ground-based meteorological variables and satellite-derived host plant variables was more accurate for short-term forecasting of the peak abundance of BPH when compared with RF and MLR, according to a reasonably validating dataset (RMSE of natural log-transformed (ln) BPH light trap catches = 1.686, 1.737, and 2.015, respectively). This finding indicates that the utilization of ground meteorological observations, satellite-derived NDVI time series, and ANN have the potential to predict BPH population density in support of integrated pest management programs. We expect the results from this study can be applied in conjunction with the satellite-based rice monitoring system developed by the Geo-Informatic and Space Technology Development Agency of Thailand (GISTDA; http://rice.gistda.or.th) to support an effective pest early warning system.


Introduction
Nilaparvata lugens (brown planthopper; hereafter BPH) is one of the most damaging rice insect pests in the temperate and tropical regions of East and Southeast Asia [1]. It causes serious crop loss directly by sucking the sap of plants, causing the plants to wilt (known as hopper burn), as well as indirectly by transmitting viral diseases [2]. The life cycle of the BPH lasts around 30 to 40 days, consisting of an egg stage, five nymphal instar stages, and an adult stage. In the tropics, the egg stage lasts 7 to 11 days. The nymphal stage takes about 10 to 18 days from the hatching of the first-instar nymph to the adult stage, and adult planthoppers live for 18 to 20 days [3]. Both BPH nymphs and adults can damage rice plants [4]. For short duration rice varieties, BPH can feed on rice plants throughout the cropping season and reproduce continuously for three to four generations. From 2010 to 2013, BPH infestation in Thailand was estimated to have affected at least 800,000 hectares of rice and caused a 60% loss in production costing $800 million per year, especially in the central plain and lower northern regions [5]. This damage was partly a result of the rapid increase in insect populations, with farmers being unable to forecast outbreaks. Continual monitoring and accurate forecasting of pest population during the crop growing period could be useful in protecting rice crops against BPH.
Both abiotic and biotic factors are believed to be responsible for changes in BPH populations [6]. The influence of abiotic factors such as weather variables has been well documented. Yadav, Chander, and Selvaraj [7] found that temperature and evening relative humidity were key factors influencing BPH light trap catches. In addition, Win et al. [6] found that temperature, rainfall, and relative humidity influenced populations of both BPH and white-backed planthoppers during the rainy and summer seasons. Prasannakumar and Chander [8] similarly observed that weekly maximum temperature, rainfall, and evening humidity were significant factors that affected BPH light trap catches in the rainy season. Thus, climatic factors can be expected to be indicators of BPH attacks [9].
Biotic factors such as host plant phenology can also affect BPH populations [10]. Alam [11] found that crop age has a highly significant effect on populations. In addition, BPH tends to be more numerous after 60 days' post-transplantation when crops are close to maturity, as well as between the heading and harvesting stages during dry seasons [12,13]. Otuka et al. [14] found that BPH is a monophagous insect whose occurrence depends on host rice plants. Therefore, the temporal and spatial patterns of rice cultivation in the region should affect BPH incidence.
Satellite images present an alternative approach for investigating the spatiotemporal pattern of rice phenology by providing information on phenological stages on a large scale. Vegetation indices derived from satellite images are used as an essential parameter for crop growth information detection, especially the normalized difference vegetation index (NDVI). A time series of NDVI data has been popularly used in continuous monitoring of land cover characteristics and vegetation phenology [15][16][17][18][19][20][21]. Among optical satellite data, the moderate resolution imaging spectroradiometer (MODIS) is advantageous in that it provides broad coverage, high spectral and temporal resolution, and free access to data. Therefore, many researchers in remote sensing have applied MODIS NDVItime series data to study the growth stages of rice [22][23][24][25]. To date, however, remote sensing data have not been used to model BPH fluctuations in Thailand.
Statistical modelling for early prediction of pest risks is one strategy that has been widely adopted. The multiple linear regression model (MLR) is a statistical method used to investigate the relationship between one response variable (the dependent variable) with two or more variables (independent variables). Some studies have used this approach to predict BPH population abundance in both field and light trap catches [6,7,9,[26][27][28][29][30]. Yadav, Chander, and Selvaraj [7] applied MLR and Geographic Information Systems with light trap and weather data to develop BPH outbreak potential zones at Maruteru in India. They used MLR pest-weather models to determine the risk of BPH outbreaks. Jayanthi [31] used the occurrence of small immature guajava (Psidium Guajava) fruits as a single and accurate indicator of host plant phenology variables and found that it was a more reliable indicator of changes in fruit fly (Bactrocera dorsalis) populations than variables. However, due to the nature of the linear relationships in the parameters, linear regression models may not provide accurate predictions in some complex situations such as with non-linear and extreme values data [32].
The artificial neural network (ANN) model is a tool that can model complex and non-linear relationships without any prior assumptions about the relations between independent and dependent variables. ANN is a type of machine learning model and is relatively competitive in terms of usefulness with conventional regression and statistical models. ANN has been successfully applied in various disciplines, such as agriculture, medical science, education, finance, management, security, engineering, trading commodity, and art [33][34][35][36][37][38][39][40][41][42]. For pest forecasting, ANN has been applied to model the pest population dynamics for the pod borer Helicoverpa armigera (Lepidoptera: Noctuidae) [43], the fruit fly Bactrocera dorsalis (Diptera: Tephritidae) [44], rice pests [45] and the paddy stem borer (Scirpophaga incertulas) [46]. In most of these studies, the authors referred to the limitation of linear multiple regression (MLR) and concluded that the ANN technique is more precise than MLR and is a good choice for predicting pest occurrence. However, ANN is unfamiliar to most entomology researchers so its use is not widespread in entomological research [46].
Random forest (RF) is an ensemble algorithm proposed by Breiman [47]. The algorithm combines bagging a sampling approach and the random selection of features to construct a collection of decision trees with controlled variation [48]. RF are fast and easy to implement, yields highly accurate predictions and can handle a very large number of input variables without overfitting [49]. This method has been implemented in many different fields in recent years [50] including crop pest and disease prediction. Ayub [51] applied several data mining techniques such as random forest, support vector machine, neural network, k-nearest neighbors, decision tree, and Gaussian naïve bayes to predict grass grub damage and indicated that neural network and random forest performed slightly better than other classifiers. Balaban et al. [52] used RF to predict the nymphal stage percentages of the sunn pest in the Middle Eastern region and showed that their models have accuracy over 99% and also offer credible confidence intervals. However, the benefits of RF in predicting the crop pest population are still limited.
This study focused on the use of ground-based weather parameters, time series of satellitederived vegetation indices, and machine learning approaches. We aimed to: (1) investigate the association between lagged ground meteorological parameters and satellite-based rice phenology in relation to the temporal patterns of BPH population incidences; (2) compare models with only weather variables and models with a combination of weather and host-plant variables; (3) develop a predictive model for the population occurrence of BPH based on ANN and RF; and (4) evaluate the performance of the ANN and RF models using a traditional MLR model.
We expected this method to result in an improved approach to BPH management for the following reasons:

•
In contrast to previous studies that have only used weather factors from ground stations for pest population forecasting, this study proposed an approach that also uses host-plant phenology data extracted from satellite-based NDVI time series in order to improve the accuracy of forecasting models.

•
By using a combination of weather and host-plant phenology data integrated with an ANN algorithm, a more accurate and precise forecasting can be provided (in comparison to the traditionally used linear regression models). These findings could be applied to support an integrated pest management (IPM) programme to help farmers reduce pesticide use and minimize crop loss in rice paddy fields, such as in the central plain of Thailand.

Study Sites
Four study sites were equipped with light trap stations installed near the rice paddy fields. The site were located at (A) Chai Nat Rice Research Centre in the Mueang district of Chai Nat province (15.197°   All sites were situated within the central plain of Thailand, which is characterized by flat topography, alluvium soil, and a tropical climate. Most agriculture in this area is rice plantation. Two main rice crop seasons pertain to the dry season and the wet season, where the dry season crop is generally planted in November-December and harvested in March-April and the rainy season crop is planted in May-June and harvested in September-October. Water for cultivation is mainly supplied by well-irrigation systems, so that double or continuous rice cropping with staggered planting times is sometimes practiced. The rice varieties most planted in the area are nonphotoperiod sensitive and takes around 100 to 140 days to mature, including such types as "Pathum Thani-1", "Chai Nat-1", and "Suphan Buri-1". Due to the suitable environment and continuous rice planting throughout the year, BPH outbreaks occur frequently in this area.

Meteorological Data
Daily meteorological data including minimum temperature (Tmin), maximum temperature (Tmax), mean relative humidity (RH) and rainfall were collected from a weather logger at Site A from 2006 to 2016 and three meteorological stations belonging to the Thai Meteorological Department from 2010 to 2013. All daily data were calculated as average monthly minimum and maximum temperature, average monthly relative humidity, and accumulated monthly rainfall.

Satellite Data
The MODIS Surface Reflectance (MOD09Q1) instrument was used in this study to detect the growth stages of rice crops through a time-series profile of vegetation indices. These data were derived from the TERRA MODIS satellite and downloaded from the Land Processes Distributed Active Archive Center (https://lpdaac.usgs.gov) on a REVERB interface with h27v07 tile between 2006 and 2016 in 10-year periods. These data provided an eight-day composite at a 250-metre spatial resolution as a gridded level-3 product in a sinusoidal projection.

BPH trap Catches and Weather Data Preprocessing
To predict BPH abundances, weather variables (including Tmin, Tmax, RH, and rainfall) and host-plant variables such as NDVI derived from satellite remote sensed data (more detail given in Section 2.3.1.2) for the prediction months and one-month prior were used as input variables. In addition, the monthly BPH light trap catches were natural log-transformed prior to analysis in order to meet the assumptions of the regression [53][54][55][56] Table 1. MODIS HDF products were re-projected from the sinusoidal projection to geographical projection WGS84 and converted from an HDF to a GeoTIFF format using the MODIS Reprojection Tool. Both the eight-day red band and NIR band data were then calculated to eight-day NDVI data and stacked to the original 8-day NDVI time series. After preprocessing, TIMESAT software was used to remove noise and generate the smoothed eight-day NDVI time series using a Savitzky and Golay filter [57].
We obtained monthly NDVI values for each crop during the dry season, which are shown in Table 1 according to each study site, to use as input variables for the host-plant phenology factor for model development. The method was as follows: 1) The buffer was drawn with a radius of 10 km around the light trap. This distance is considered appropriate in terms of the migratory behavior of BPH [58]. 2) Only the rice paddy fields that are within the buffer from the land use map were extracted. 3) Only the grids within the rice paddy fields from the smoothed eight-day NDVI-time series data obtained from MODIS were selected ( Figure 3). The values in the grid are the NDVI profile of the rice at that grid location from 2006 to 2016. 4) Then, the NDVI profiles from each grid were classified into ten groups using the k-mean clustering classification method, so that they have a similar NDVI profile pattern within each group. This classification method can be used to identify and monitor the same information in the same group based on Euclidean distance analysis [59]. 5) The start date of season data was extracted from the smoothed eight-day NDVI-time series data using TIMESAT software. 6) One of ten groups that has a sowing date similar to the start date of season was selected based on the start date of season data. 7) The selected NDVI profile group was analyzed using median statistics in order to obtain a single NDVI profile that could represent the rice paddy field area in the buffer zone. 8) Finally, the eight-day NDVI-time series data were used to compute monthly means. Population occurrences of BPH have quantified the relationship with climatic and host-plant phenology variables, including Tmin, Tmax, RH, rainfall, and NDVI at both the prediction months and one month prior, using Pearson correlation analyses with regard to the crop dry season from December to March. Significant correlation coefficient (R) values were taken as criteria to select suitable factor(s) for developing linear models with trap catches of BPH.
An MLR model with a stepwise selection method was developed, considering only weather variables and a combination of weather and host-plant phenology variables to achieve a maximum coefficient of determination (R 2 ) for estimating the trap catch.
MLR is a statistical method in regression for analysis of relationships between a single dependent variable and two or more independent variables [60] in the form (Equation (1)): where is a response variable, to are explanatory variables, represents the intercept, to are the slopes of the relationship between and to , and is the residual or fitted error. Stepwise regression is an automatic method, useful when the number of explanatory variables is large and where it is not possible to fit all potential models [61]. The MLR model was generated by IBM SPSS Statistic 25.

Artificial Neural Networks (ANNs)
Artificial neural network (ANN) modelling is a non-linear statistical technique. It can be used to solve problems for which conventional statistical methods are not suitable [40]. The main neural network topology consists of an input layer, an output layer, and usually one or more hidden layers. Each layer contains nodes connected to nodes at the adjacent layer(s). Each connection link has an associated weight which, in a typical neural network, multiplies the signal being transmitted. Then, each neuron applies an activation function (usually non-linear) to its net input to determine its output signal.
A three-layer (including input, hidden, and output layers) back propagation-ANN (BP-ANN) with sigmoid activation function was employed to predict monthly BPH trap catches using the same set of independent variables considered in the MLR. The number of hidden neurons was determined, to follow Morariu [62]. Optimal learning rates, momentum constants, and the number of training epochs were adjusted empirically based on training efficiency and accuracy. All analytic parameters for the trained neural network applied to BPH trap catch estimation are specified in Table 2. In this work, a customized python script has been used to construct the ANN models. Generally, random forest is used for data classification and regression [63]. RF for regression is formed by growing trees depending on a random vector such that the tree predictor takes on numerical values as opposed to class labels [47].
To train the RF model, we experimented with the set of a number of trees, consisting of 200, 500 and 1000 trees. Other hyperparameters were set to the default. The training dataset consisted of all variables that remained, following the ANN and MLR modelling procedure. A regression tree grew based on the training data for each of the boostrap samples. At each node, predictors were randomly sampled and the best split from those variables was selected. Next, new data were predicted by aggregating the predictions. The RF model then have n-estimators, and the models with the highest R 2 and smallest root mean square error (RMSE) were selected as the best RF model. The importance of the input variables was calculated and plotted. In this study, the scikit-learn library with python was applied to develop the model.

Model Validity and Performance Evaluation for MLR, ANN, and RF Models
A test dataset of BPH light trap catches that was not used for the training process was obtained from the CHN, SPB, NTB, and KKL light trap stations in the 2016, 2013, 2012, and 2012 crop dry seasons, respectively. These data were used to evaluate the statistical prediction performance of the ANN, RF, and MLR models according to R 2 values between observed and predicted values and the RMSE [64,65].
All procedures from Section 2.2 and 2.3 are summarized in the methodology flowchart in Figure  4. .

Relationship between Climatic and Host-Plant Phenology Variables and the BPH Light Trap Catch
Correlation coefficients (R) were calculated between the natural log-transformed BPH light trap catch (lnBPH) and the time-lagged environmental variables selected (0 and 1-previous-month weather and host plant parameters) ( Table 3). The results showed that the BPH light trap catch had significant positive relationships with minimum temperatures with no lag (R = 0.608, p < 0.01) and with maximum temperatures at no lag and at a lag of one month (R = 0.546 and 0.457, p < 0.01, respectively) and NDVI at a lag of one month (R = 0.738, p < 0.01). Meanwhile, the relationships between trap catch and RH and rainfall were not significant. The highly significant correlation variables were selected into the linear and non-linear models for predicting BPH light trap catches.

Prediction with Multiple Linear Regression
An MLR model with stepwise selection was employed based on a combination of weather and host-plant phenology variables (Table 4). Table 4 shows the standardized beta coefficients that present the contributions of each variable to the model. The t and p values show the impact of the independent variables on the dependent variable. Only Tmin and NDVI showed a significant positive relationship with BPH trap catches. The large t value (t = 5.140) and corresponding lowest p-value (p < 0.01) support the result for NDVI at a one-month lag, which had the highest beta coefficient. A model based on weather variables alone (Model 1) could explain approximately 35% (adjusted R 2 = 0.356) of the variability in the BPH light trap catch. In contrast, the model involving incorporated weather and host-plant phenology variables could explain approximately 58% of the variability of the light trap catch (adjusted R 2 = 0.585) (Model 2) with an acceptable variance inflation factor (VIF <10) value, indicating that there was no multicollinearity among independent variables. Considering both the adjusted R 2 and the standard error of estimate (SEE), the model with the combination of both significant weather and host-plant phenology was selected as the best model (Model 2). The SEE represents the degree to which the predicted values differ from the observed values on the criterion measure. Lower values of the SEE indicate greater accuracy in prediction [66].
The residual of the stepwise regression model for the prediction of BPH was tested for normality and homoscedasticity. The normal probability plot (Figure 5a) indicates that residuals follow normality. However, the residual plot ( Figure 5b) showed a signal of funneling-shape, suggesting that the residuals were not random. Based on the stepwise regression model results shown in Table 4, the predictive models can be expressed using regression equations (Equation (2)): ln (BPH) = -11.470 + (0.432 × Tmin at lag 0 month) + (10.512 × NDVI at lag 1 month) (2) where the response variable ln (BPH) denotes the natural log of BPH incidence, and the explanatory variables Tmin at lag 0 month denotes the minimum temperature during the current month, and NDVI at lag 1 month denotes NDVI during the previous month.

Prediction with Artificial Neural Network
The layer and processor elements of the neural network were trained to produce the optimum number that fulfills the process of prediction with the lowest error, using trial and error. The neuron number in this hidden layer was found to be five. The learning rate was taken as 0.4, the momentum was 0.5, and the maximum iteration number was 1000. The optimum number of different neural network elements is shown in Table 5, and their architecture structure is shown in Figure 6. Table 5. Architecture, specification, and statistical information of the optimal neural network model.

Prediction with Random Forest
Based on the training dataset, the result indicated that the RF models with different numbers of trees (ntree = 200, 500, and 1000) produced nearly identical results (R 2 = 0.956, 0.953, and 0.951; RMSE = 0.723, 0.731, and 0.738, respectively). Figure 7 provides an indication of the importance of the input variables in the RF model of natural log-transformed (ln) BPH light trap catches. NDVI at a lag time of one month is the most important input variable. Rainfall at a lag time of one month is the least important input variable in the RF model.

Model Validation and Prediction Performance Comparison between MLR, ANN and RF Models
Plot validation comparing the actual and predicted natural log-transformed (ln) BPH light trap catches for the testing dataset in the different light trap stations and different crop dry seasons using the traditional MLR, ANN, and RF modelling is shown in Figure 8. The results indicated that all three models usually provided good prediction in February and March for all sites except in the cases of abnormally high BPH population levels.   Figure 9 shows the scatter plots of the comparison between actual natural log transformed BPH populations as predicted by the MLR, ANN and RF models. Comparing R 2 and RMSE, ANN produced more accurate yield predictions than MLR but was slightly different to RF. The R 2 for ANN, RF, and MLR were 0.770, 0.754, and 0.645, respectively. The RMSE for ANN, RF, and MLR were 1.686, 1.737, and 2.015, respectively.

Discussion
Forecasting pest population abundance can help in developing pest management strategies that reduce the quantity of insecticide application, and is a useful component for successful area-wide implementation of IPM [31]. For a pest forecasting model, weather variables such as temperature, humidity, rainfall, sunshine duration and others are often used as abiotic predictors for model development [54,[67][68][69]. Biotic factors such as host plant phenology can also affect pest population dynamics [31,70]. We found that the minimum temperature affects BPH development and population dynamics. This is similar to Ali et al. [27] who found that increasing temperatures positively influence BPH populations at low temperatures but have a negative influence at high temperatures.
Satellite-based NDVI time series were applied to represent the host plant phenology factor to study the association between rice growth stages and pest population. We found that during the months of January and March in the dry crop season, NDVI at a lag time of one month with regard to rice paddy fields has a significant positive correlation with BPH population fluctuations. This relationship may be due to the association between peak BPH population in each generation and different rice growth stages [10,71]. In general, the peak catches of adult BPH were observed when rice was mature and post-harvest, as a result of peak nymph populations occurring in the rice paddy fields in the preceding month [71]. In terms of the utilization of phenological information with regard to rice paddy fields based on the satellite time series, these results are in agreement with a previous study in which rice phenology was assessed by remote sensing for the first time to observe BPH migration in the Vietnamese Mekong Delta; the peak BPH occurrence periodically changed according to the harvesting stage in the rice cultivated area [14]. One advantage of this remotely sensed hostplant indicator is the reduction in cost, time and labor required to obtain rice development status from field observations, since satellite earth observation data can take advantage of both spatial and temporal crop information.
The stepwise multiple regression model developed in this study suggests that the combination of the satellite-based rice phenology variable and ground-based weather variables could slightly increase the accuracy of the model (adjusted R 2 = 0.585) when compared with weather factors alone (adjusted R 2 = 0.356), to a statistically significant degree (R 2 change = 0.229, p < 0.01).
ANN is an appropriate choice for pest occurrence forecasting [46]. As has been observed in previous studies [45], our findings indicated that the ANN model yielded more accurate pest prediction than the MLR model. This may be because the weights assigned to each neural network connection allow a more precise pattern in the input data to be identified [44]. The RF model also performed better than the MLR model, but not better than the ANN model for our dataset. This difference does not mean that RF could not predict the insect catches. To make the ANN and RF models more robust, all related factors such as hyperparameters should be adjusted appropriately and the amount of collected data (in terms of spatial area and length of collection period) should be greater.
Since a rice paddy field is an open environment, factors driving BPH population growth are variable. Apart from meteorological and host plant factors, natural enemies, rice resistance and even farmer practices may affect BPH population dynamics. The differences between the observed and predicted BPH abundance, especially at the SPB light trap station, could be due to changes in the seasonal and surrounding environment [26]. The patterns of rice cultivation and rice phenological stage, as observed by satellites with greater spatial, spectral, and temporal resolution, could be considered in further work.

Conclusions
This investigation examined the use of integrated ground-based meteorological variables and satellite-derived host plant variables to build a prediction model for population occurrence of BPH during the rice crop dry season in the central plain of Thailand. MLR, ANN, and RF models were developed based on these variables. MLR was employed to investigate the prediction ability of weather variables alone or in combination with host-plant phenology, while ANN and RF models were established to enhance the prediction performance of the model compared to MLR.
Tmin without lag and NDVI at a one-month lag were selected as the major factors in the MLR model based on their high correlation coefficients. Moreover, we found that the combination of these variables yielded higher accuracy in predicting light trap catches than weather variables alone. Use of ANN (with a 10-5-1 structure) and RF (with number of trees: 500) yielded a more efficient performance than the MLR in terms of R 2 and RMSE in testing (ANN model: R 2 = 0.770, RMSE = 1.686; RF model: R 2 = 0.754, RMSE = 1.737; and MLR model: R 2 = 0.645, RMSE = 2.015) dataset validation.
In summary, an ANN model integrating ground-based meteorological variables and satellitederived host plant variables can be utilized for short-term forecasting of the peak abundance of BPH, to be applied in timely management of crop pests after reasonable validation with different location and seasonal data. We expect that these results can be applied in conjunction with the satellite-based rice monitoring system developed by the Geo-Informatic and Space Technology Development Agency (GISTDA; http://rice.gistda.or.th) to support an effective pest early warning system.