Climate-Based Modeling and Prediction of Rice Gall Midge Populations Using Count Time Series and Machine Learning Approaches

: The Asian rice gall midge ( Orseolia oryzae (Wood-Mason)) is a major insect pest in rice cultivation. Therefore, development of a reliable system for the timely prediction of this insect would be a valuable tool in pest management. In this study, occurring between the period from 2013–2018: (i) gall midge populations were recorded using a light trap with an incandescent bulb, and (ii) climatological parameters (air temperature, air relative humidity, rainfall and insulations) were measured at four intensive rice cropping agroecosystems that are endemic for gall midge incidence in India. In addition, weekly cumulative trapped gall midge populations and weekly averages of climatological data were subjected to count time series (Integer-valued Generalized Autoregressive Conditional Heteroscedastic—INGARCH) and machine learning (Artiﬁcial Neural Network—ANN, and Support Vector Regression—SVR) models. The empirical results revealed that the ANN with exogenous variable (ANNX) model outperformed INGRACH with exogenous variable (INGRCHX) and SVR with exogenous variable (SVRX) models in the prediction of gall midge populations in both training and testing data sets. Moreover, the Diebold–Mariano (DM) test conﬁrmed the signiﬁcant superiority of the ANNX model over INGARCHX and SVRX models in modeling and predicting rice gall midge populations. Utilizing the presented efﬁcient early warning system based on a robust statistical model to predict the build-up of gall midge population could greatly contribute to the design and implementation of both proactive and more sustainable site-speciﬁc pest management strategies to avoid signiﬁcant rice yield losses.


Introduction
Rice is the staple food crop for more than half of the world's population. The Asian gall midge, Orseolia oryzae (Wood-Mason) (Cecidomyiidae: Diptera) (Figure 1a), is one of the most common difficult-to-control rice pests in South and Southeast Asia [1][2][3]. In India, it is the third most important rice pest after the stem borer and the plant hoppers [2], affecting 30-70% of the total rice area [4]. It is most prevalent in the states of Andhra Pradesh, Telangana, Tamil Nadu, Kerala, Goa, Karnataka, Maharashtra, Madhya Pradesh, Bihar, Odisha, Assam, Manipur, and in certain niches of West Bengal, and Uttar Pradesh of India [5][6][7][8]. The gall midge completes its life cycle in 19-23 days at 22 to 28 • C and about 85% humidity. In April and May, pre-monsoon rains in India amplify insect activity in rice stubble, self-sown rice, and other hosts. The late planted rice varieties receive extensive damage. Insect activity peaks between the last week of August and the first week of October. Graminaceous weeds (Leersia hexandra and Echinochloa crus-galli) and wild rice varieties (Oryza nivara, O. barthii, and O. rufipogon) serve as alternate hosts [6]. Younglings feed on the shoot meristem during the tillering stage of crops, resulting in the formation of a tubular structure called 'silver shoots' (Figure 1b). The affected tillers fail to bear panicles. Yield losses caused by the gall midge are highly variable depending on the severity of attack; however, in extreme cases, complete yield loss of crop is possible. In Southern India alone, yield loss due to gall midge was estimated to be about 0.8% of total yield or approximately US$ 80.00 million [2]. Besides the inherent biotic potential of insects, to a large extent, abiotic factors determine the abundance of insect pests in a crop ecosystem. Therefore, an efficient early warning system based on a robust statistical model to predict gall midge population buildup is of great importance in designing and implementing a proactive and more sustainable site-specific pest control and management strategy. Count time series modeling is a popular statistical approach in which integer autocorrelated discrete count observations are considered as inputs, and the observations are assumed to be derived from Poisson and negative Binomial distributions. Crop pest modeling is one of the major areas of count time series modeling wherein daily or weekly counts of insects (pests), which are autocorrelated in nature, are considered. Though count time series models and machine learning techniques are applied in different areas, application of these techniques is novel for the modeling and forecasting of gall midge populations. Some of the count time series models were applied: in stock exchange data [9,10], monthly claims count of workers in the heavy manufacturing industry data [11], monthly strike count time series [12], Campylobacterosis infections count time series [13,14], prediction of number of dengue incidents in Jakarta [15], and network traffic count time series [16]. Ref. [17] reviewed regression-and machine learning-based crop pest prediction methods. Refs. [18,19] developed hybrid time series and machine learning models for crop yield predictions.
The machine learning models were employed in the prediction of various agricultural fields: prediction of oil seed production [20], banana yield prediction [21], rice yield prediction [22,23], prediction of rice pests [24], prediction of early blight severity in tomato crop [25], and prediction of sugarcane borer disease [26].
Predicting rice gall midge populations based on climatological parameters greatly aids the take up of preventive crop protection measures. However, past works on forecasting insect pest populations were mostly limited to multiple regression analysis and classical time series models. For count data that follows non-Gaussian distribution; Poission and negative binomial, transformation to normality does not improve the accuracy of the prediction model. Despite the generalized linear model (INGARCH) being better suited for count data, their applications are limited in the field of pest modelling [27,28]. For highly heterogeneous and nonlinear data, prediction models like multiple linear regression and auto regressive integrated moving average models were also reported to be not effective [20,21]. However, machine learning models such as SVR, and ANN could be effective in such conditions as they are assumption-free and data driven.
Modeling and forecasting of insect pest populations is used to provide an aid in decision making and in planning of crop management activities adequately. However, the present work is undertaken to develop a robust statistical model for predicting gall midge populations based on climatological input parameters that are crucial for in life cycles of this rice pest using count time series and machine learning approaches. The models were developed to predict gall midge population to minimize the yield losses. For the first time, we have applied the count time series model, i.e., INGARCH, with weather variables in insect pest modelling area of agriculture, revealing few applications of machine learning models in modeling and forecasting pest populations in general. However, the gall midge prediction is the first attempt in modeling and forecasting using machine learning techniques like ANN and SVR.
The methodological framework begins with basic descriptive statistics, correlation analysis, and stepwise regression analysis to understand the causal relationships between gall midge populations and input variables. Advanced computational methods, such as INGARCH, ANN and SVR with exogenous weather variables, are developed to model and predict gall midge populations in Indian hot spots.

Data Collection
The Chinsurah type light trap with a 200-watt bulb was used in the study because it is successfully used in monitoring of rice gall midge and other insect pests in rice ecosystems throughout the year [29,30]. The bulb was illuminated daily from 6:00 pm to 6:00 am. In the morning, the collected rice gall midges were brought to the laboratory and the number of individuals caught per day were manually counted, summed and presented as weekly cumulative catches [31]. If the insect catches were too large, the population was divided into different equal subgroups, one subgroup was counted and then multiplied with the remaining number of subgroups. Data were collected at four hot spot locations in India: Warangal, Maruteru, Pattambi and Jagdalpur ( Figure 2). Corresponding climatological data on maximum temperature (MAXT), minimum temperature (MINT), total rainfall (RF), morning relative humidity (RHM), evening relative humidity (RHE) and sunshine hours (SSH) were also collected from automatic weather stations at the respective locations. Standard meteorological week (SMW)-wise cumulative catches of gall midge and weekly averages of climatological parameters were considered for this study. Six-week observations were used as testing/validation sets, and remaining observations were used as the training data set.

Statistical Models
Statistical modeling started with descriptive statistical parameters encompassing mean, standard error (SE), skewness, kurtosis, minimum observation, maximum observation, and coefficient of variations (CV), which are important in depicting the nature of the studied data. Apart from the descriptive statistics, data were depicted graphically with time series plots. Pearson's product moment correlation analysis was carried out to determine the interrelationship among the variables used in the study. Stepwise, a multiple regression analysis was done to understand the cause-and-effect relationship among the gall midge populations and exogenous weather variables. The regression equation in terms of matrix notation can be expressed as; where, Y is the variable, X is the vector of exogenous variables, β is the regression coefficient vector, and e is the residuals term assumed to be normally distributed with e ∼ N 0, σ 2 .
The time series plots, INGARCH, ANN and SVR models were developed in R software (R Core Team 2018). Correlation analysis and stepwise regression analysis were carried out in SAS 9.3 version [32], available at ICAR-Indian Institute of Rice Research, Hyderabad, India. The time series following the generalized linear model (GLM) framework was elaborated by [33]. INGARCH models are the class of GLM [34,35], in which the conditional distribution of dependent variable is assumed to follow popular discrete distributions like Poisson, negative binomial, generalized Poisson and double Poisson distributions [10].
Let the count time series be {Y t : t ∈ N} and time varying r-dimensional covariate vector say {X t : t ∈ N} i.e., X t = (X t,1, . . . , X t,r, ) T . The conditional mean becomes E(Y t |F t−1 ) = λ t and F t is historical data. The generalized model form is expressed as follows: Case 1: Consider the situation where g and g are equal to identity, i.e., g(x)=, Assuming further that Y t |Y t−1 is Poisson distributed, then we obtain an INGARCH model of order p and q, abbreviated as INGARCH (p, q) model. If q = 0, the model can be referred to as the INAGARCH (p) model. These models are also known as autoregressive conditional Poisson (ACP) models [9].
Case 2: The negative binomial distribution allows for a conditional variance to be larger than the mean λ t , which is often referred to as over-dispersion (with overdispersion parameter ∅) [36]. It is assumed that the Poisson distribution is a limiting case of the negative binomial distribution by the assumption: Further details about INGARCH model estimation using conditional likelihood estimation, especially on asymptotic properties, are given by [34] and [37]. The standard INGARCH model allows forecasts to be made based on only past values of the forecast variable. The model assumes that future values of a variable depend on its past values and values of past exogenous variables. The INGARCHX model is an extended version of the INGARCH model [38].

Support Vector Regression (SVR)
The principal idea involved in SVR is to transform the original input space into high dimensional variable space and then build the regression or time series model in a transformed high dimensional feature space [39]. A vector of data set says , where x i ∈ R n is the input vector, y i is the scalar output, and N is the size of data set. The general equation SVR can be written as follows: where, W is weight vector, b is bias term, and superscript T denotes the transpose. The coefficients W and b are estimated from data by minimizing the following regularized risk function: This regularized risk function minimizes both the empirical error and regularized term simultaneously, which helps in avoiding both under and overfitting of the model. In Equation (8), the first term 1 2 w 2 is called the 'regularized term', which measures the flatness of the function. Minimizing 1 2 w 2 will make a function as flat as possible. The second term 1 is called the 'empirical error', which was estimated by Vapnik ε-insensitive loss function as follows: where, y i is actual value and f (xi) is an estimate value. The most commonly used kernel function is the radial basis function (RBF) which is given as follows: The performance of RBF kernel function requires optimization of two hyper-parameters: regularization parameter C, which balances the complexity and approximation accuracy of the model, and the Kernel bandwidth parameter, which represents the variance of the RBF kernel function, γ. In SVR and ANN also the exogenous variables are used for both modeling and forecasting purposes as in INGARCHX model. Schematic representation of SVR architecture is given in Figure S1.

Artificial Neural Network (ANN)
ANN is the most widely used machine learning technique in the last several decades. In the area of time series modeling, the ANN is commonly referred to as the autoregressive neural network as it considers time lags as inputs. The time series framework for ANN can be mathematically modeled using a neural network with implicit functional representation of time. The general expression for the final output Y t of a multi-layer feed forward autoregressive neural network is expressed as follows: where, α j (j = 0, 1, 2, . . . , q) and β ij (i = 0, 1, 2, . . . , p, j = 0, 1, 2, . . . , q) are the model parameters, also called as the synopsis weights, p is the number of input nodes, q is the number of hidden nodes, and g is the activation function. Training part in ANN minimizes the error function between actual and predicted values. The error function of autoregressive ANN is expressed as follows: where, N is the total number of error terms. The parameters of the neural network w ij are changed by an amount of changes in ∆w ij as ∆w ij = −η ∂E ∂w ij , where, η is the learning rate [20,40]. As in INGARCHX and SVRX models, the exogenous variable will also be used to model the pest count, and hence becomes ANNX model. Graphical representation of ANN architecture is given in Figure S2.

Comparison Criteria
Mean square error (MSE) and root mean square error (RMSE) were used as comparison criteria for the model performance. The mean square error (MSE) is the average of the sum of squared error values and given as: RMSE is also known as standard error of estimate in regression analysis, and is given as: where, Y i is the actual value,Ŷ i is the predicted value, and N is the number of observations.

Diebold-Merino Test
The Diebold-Mariano (DM) test is employed to determine the statistical significance difference among the models used, based on the residuals of the models [41]. Consider the residuals of two models as r 1 and r 2 , and d i is the absolute difference between residuals; d i = |r 1 | − |r 2 | and the autocovariance function γ k is expressed as: The Diebold-Mariano test statistic is expressed as: where, h = n 1/3 + 1. For testing of the hypothesis, the null hypothesis (H 0 ) and the alternative hypothesis (H 1 ), H 0 = E(d) = 0 or the forecast accuracy is similar for two models, and H 1 = E(d) = 0 or the forecast accuracy is different for two models.

Results
The time series plots of weekly (SMW wise) counts of gall midge light trap catches of four study sites during the observed 2013-2018 period were plotted in Figure 3. Year-wise time series plots of gall midge populations at all hot spot locations are depicted in Figures  S3-S6. The time series plots show that at all examined locations, the gall midge incidence was higher between the 35th to 45th SMWs, except at the Maruteru centre, where it showed two peaks, between the 10th to 20th SMWs and between the 35th to 45th SMWs. Summary statistics of the dependent variable gall midge population and exogenous weather variables were calculated and presented in Table 1. For instance, Asian gall midge populations at Warangal, Maruteru, Pattambi and Jagdalpur were 42, 215, 22 and 6, respectively. The number of population oscillates are in a wide range (0-875), leading to a high percentage of CV and an abnormality of data as skewness and kurtosis are out of normal range. Summary statistics of weather variables presented in Table 1 are self-explanatory, showing that data under consideration were highly heterogeneous in nature.

Correlation Analysis
Pearson correlation coefficients between gall midge populations and considered climatological variables are depicted in Table 2. A low positive significant correlation between gall midge population and RHM, RHE and SSH was observed at Warangal. In Pattambi, gall midge populations also showed a low positive significant correlation with RHM. However, correlation with MAXT was of low negative significance. At Jagdalpur, the gall midge population showed a weak significant correlation with RHM and RHE. Similarly, at Maruteru, the correlation between the trapped gall midge individuals and meteorological parameters was weak. Overall, correlation analysis revealed that gall midge population has association with RHM, RHE, RF and SSH of lower magnitude, ascribable to the heterogeneity or high percentage of CV among gall midge populations.

Stepwise Regression Analysis
To identify the climatological factors influencing the incidence of gall midge population buildup, a stepwise regression analysis was carried out with the results depicted in Table 3. Some of generated outputs like: (i) MINT, RHE, SSH at Warangal; (ii) RHM at Maruteru; (iii) MAXT, RHM and SSH at Pattambi, and (iv) MINT, RHM, RHE and SSH at Jagdalpur showed significant influence on the gall midge population. Though the listed variables have significant influence on the gall midge populations, the model R 2 value for the fitted regression in all four of the centers is low, indicating that the model is not a strong fit, for which non-linearity and high heterogeneity in dependent variables may be responsible.

INGARCHX Model
Prior to subjecting the gall midge individual count time series data into the INGARCH model, the presence of autocorrelation was confirmed by employing the Box-Pierce noncorrelation test, and the statistic test revealed a highly significant (p < 0.0001) autocorrelation ( Table 4). The INGARCH model with exogenous climatological variables was fitted and the model parameter was found to be significant, but none of the climatological parameters were significant at all hot spot locations. The over-dispersion parameters obtained per location (7.32, 3.30, 2.23 and 5.47) clearly indicated the heterogeneous and over-dispersed nature of the data, following a negative binomial distribution (Table 4). Diagnostic checking of residuals by the Box-Pierce non-correlation test revealed that the residuals were autocorrelated or non-random (p < 0.0001) at all examined locations, except at Pattambi, where the residuals are un-correlated and random in nature (p = 0.7403) ( Table 4). Inability of the INGARCHX model to capture the heterogeneity and complex nature of the data might have led to the non-significant effect of weather variables and significant residuals of the model.

SVRX Model
The nonlinear SVR model with exogenous variables for the time series of gall midge population count was built with parameter specifications given in Table 5. The diagnostic checking of residuals by the Box-Pierce non-correlation test indicated that the residuals are autocorrelated or non-random (p < 0.0001) ( Table 5).

ANNX Model
The ANNX model parameters were given in Table 5 for all the four locations. The gall midge count time series were subjected to the ANN model with 4, 5, 8 and 3 tapped time delays, six exogenous variables, and 6, 6, 10 and 10 optimum nodes for Warangal, Maruteru, Pattambi and Jagdalpur centers, respectively. A sigmoidal activation function in input to hidden layer, and linear identity function in hidden layer to output layer, was used with feed forward network architecture. The total number of parameters or synaptic weights obtained was 73, 79, 161 and 111 for four centers, respectively (Table 5). After model fitting, diagnostic checking of residuals by Box-Pierce non-correlation test indicated the not correlated or random nature of the residuals (p = 0.55, p = 0.99, p = 0.15 and p = 0.18 for Warangal, Maruteru, Pattambi and Jagdalpur, respectively) ( Table 5). Finally, the model performance in all the four centers in both training and testing sets are given in Table 6.

Discussion
The results of modeling and predictions of the gall midge population at examined study sites obtained by employing different models were compared in terms of MSE and RMSE in both training and testing datasets, and are presented in Table 6. In this study, the fitness of the stepwise regression model was found to be weak (low R2) due to non-linearity and high heterogeneity in the dependent variable. However, Samui et al. (2004) [7] found a strong relationship between temperature, relative humidity, rainfall and sunshine hours on the development of gall midges in successive generations using stepwise regression. Amongst the attempted techniques, the ANNX model superiorly outperformed the INGARCHX and SVRX models in both training and testing data sets, as revealed by the low MSE and RMSE values. Furthermore, the INGARCHX model performed better compared to the SVRX model in both training and testing data sets. Performance hierarchy of these models is as follows: ANNX > INGARCHX > SVRX in both training and validation sets at all four locations. Similar results were obtained in [42][43][44], where the ANN model outperformed the classical autoregressive integrated moving average and SVR models.
In SVR, we considered candidate hyper parameters among several combinations of user defined parameters. The 10-fold cross validation was carried for each model combination of hyper parameters, and the lowest cross validation error obtained is reported in Table 5. In this modelling exercise, we have tuned the model with different combinations of hyper parameters and chosen the optimum parameters based on the lowest training error with a margin of error tolerance epsilon. The ANN model for this exercise was developed with the 'Liebenberg-Marquardt back propagation algorithm' in a feed forward network based on repetitive experimentation. The learning rate and momentum terms was 0.03 and 0.01, respectively. To tune the model, the network was repeated 25 times with a maximum of 1000 iterations. Different combinations of input lags and hidden nodes were tried, and candidate model parameters were selected based on the fewest training errors.
The predicted population size of the gall midge by the ANNX model is closer to the actual gall midge population size, as compared to both INGARCHX and SVRX models ( Figure 4). The comparison criteria (MSE and RMSE) exhibit only the observed differences between the predicted values of the models. Therefore, the Diebold-Mariano test statistic (DM test) has been used to determine the statistically significant difference between the different models used in this study. The INGARCHX (M1), and SVRX (M2) models are significantly different with respect to the ANNX (M3) model (Table 7), confirming the superior performance of the ANNX model over the other two models.  The outperformance of the ANNX model over the INGARCHX and SVRX models in both training and testing data sets could be due to its ability to capture the heterogeneous, nonlinear and complex nature of the data. In ANN, we applied the sigmoidal activation function to map the input to the hidden layer, whereas the RBF function turns to Gaussian distribution when we increase the value of gamma. As we explained earlier, count time series data are derived from non-Gaussian distributional assumptions, which could be the possible reason why SVR fails to capture the trend in count time series data. Similar results were obtained in [45], where the ANN outperformed the SVR in modelling and forecasting ancillary energy market prices.
In addition, the diagnostic checking of residuals obtained by both INGARCHX and SVRX models were correlated and non-random in nature, whereas the residuals obtained by ANNX models were uncorrelated and random in nature, which conclusively weighs in favor of the ANNX model as a good fit compared to INGARCHX and SVRX models. Inter combinational significances are clearly represented in Table 7. Similar studies conducted by [46] for the yield prediction of early potatoes, by [47] for rice blast, and by [48] regarding the wheat yield in Pakistan revealed that machine learning techniques were superior in their performance of prediction.

Conclusions
In the present study, relevant count time series and machine learning techniques were applied to develop the rice gall midge occurrence models based on climatological input variables. The results showed that the INGRACHX and SVRX models were not suitable for the time series of the gall midge incidence due to the highly nonlinear and heterogeneous nature of the data. On the other hand, the study clearly revealed that the ANNX model is a viable and effective alternative for modeling and predicting the gall midge incidence based on time series data. It can also be inferred that the application of machine learning techniques such as ANN with exogenous variables in modeling and predicting count time series can increase the prediction accuracy. Further DM test statistics confirm the superiority of ANNX models over INGRACHX and SVRX models.
Rice gall midge is a disastrous pest in rice cultivation, causing significant economic losses not only in the examined Indian agroecosystem, but also across numerous other Asian agroecosystems. The on-time warning models developed in this study utilizing machine learning techniques will be of great assistance in predicting the occurrence of gall midge so that the appropriate management measures can be engaged to minimize the yield losses. In the future, it is expected that various machine learning techniques will be intensively used to model the count time series of other various crop pests.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/10 .3390/agronomy12010022/s1, Figure S1: Architecture of the SVR model; Figure S2: Architecture of the ANN model; Figure S3: Year-wise time series plots of gall midge population in Warangal study site; Figure S4

Data Availability Statement:
The data utilized in this study are collected from the All India Co-Ordinated Rice Improvement Project (AICRIP), available at ICAR-Indian Institute of Rice Research, Hyderabad, India. The data presented in this study are available on request from the corresponding author through the AICRIP project.