Early Forecasting of Rice Blast Disease Using Long Short-Term Memory Recurrent Neural Networks

Among all diseases affecting rice production, rice blast disease has the greatest impact. Thus, monitoring and precise prediction of the occurrence of this disease are important; early prediction of the disease would be especially helpful for prevention. Here, we propose an artificial-intelligence-based model for rice blast disease prediction. Historical data on rice blast occurrence in representative areas of rice production in South Korea and historical climatic data are used to develop a region-specific model for three different regions: Cheolwon, Icheon and Milyang. Rice blast incidence is then predicted a year in advance using long short-term memory networks (LSTMs). The predictive performance of the proposed LSTM model is evaluated by varying the input variables (i.e., rice blast disease scores, air temperature, relative humidity and sunshine hours). The most widely cultivated rice varieties are also selected and the prediction results for those varieties are analyzed. Application of the LSTM model to the accumulated rice-blast disease score data confirms successful prediction of rice blast incidence. In all regions, the predictions are most accurate when all four input variables are combined. Rice blast fungus prediction using the proposed LSTM model is variety-based; therefore, this model will be more helpful for rice breeders and rice blast researchers than conventional rice blast prediction models.


Introduction
Rice is not only the most important crop in Asia but also a staple food in many other countries. Currently, approximately 5.6 billion people (80% of the world's population) have rice as their staple food and, according to the World Agricultural Supply and Demand Estimates (WASDE), the annual worldwide rice consumption in 2015 was 4.842 million tons, with rice production being 4.71 million tons. Rice consumption is expected to reach 8 million tons in 2030 owing to population and economic growth. It is predicted that in 2050, 9 billion people will be dependent on rice for food [1]. Rice blast disease (also simply called "blast"), which affects most of the rice-producing countries, has already spread to approximately 85 countries [2], leading to loss of food that could feed 60 million people per year. The rice blast pathogen Magnaporthe oryzae is a fungus that causes disease throughout the rice growth period and develops in the nodes, leaves, collars, necks, panicles and seeds. The phases of the disease are generally divided into leaf blast and panicle blast. However, depending on the onset area and onset time, the blast may progress through leaf, collar and neck node and panicle blast phases.
Sustainability 2018, 10, 34; doi:10.3390/su10010034; www.mdpi.com/journal/sustainability
The rice variety, environmental factors and pathogens influence the occurrence of blast. In particular, when the environmental conditions are favorable to rice blast fungus during harvest season, an extremely large loss of yield can occur. In rice-blast-susceptible varieties, yield reductions of up to 65% have been reported [3]. Environmental factors associated with disease development include weather conditions, fertilizer management and soil and fungicide control. Because of recent weather conditions and fertilizer management methods, conditions have not been favorable to the occurrence of rice blast and there has been a decrease in the occurrence of leaf blast. In particular, higher temperatures during the growing season due to global warming and the disappearance of the monsoon in July [4], along with the spread of cultivation techniques using less nitrogen-based fertilizers, are considered to be the main reasons for the decreased leaf blast occurrence. In addition, cultivation and propagation of blast-resistant rice varieties were initiated in the 1960s and this change seems to have made the most significant contribution to the reduction of rice blast [5]. However, despite the ongoing spread of new blast-resistant rice varieties each year, a number of cases have been reported in which resistant varieties were converted into susceptible varieties within a few years [6]. The National Institute of Crop Science, Rural Development Administration (RDA), Suwon, Republic of Korea, is continuously monitoring the occurrence of rice blast; in addition, it has been monitoring the responses of both newly bred and existing rice varieties to rice blast in test sites throughout the country. The aim is to develop methods to prevent the outbreak of rice blast. However, as many farmers prefer rice cultivars with superior taste to disease-resistant varieties, it is highly likely that the damage caused by the disease will increase. Therefore, it is desirable to prevent damage by introducing various resistance genes into the cultivars that are primarily distributed in farms [7].
In recent years, the use of chemical fertilizers has been reduced to meet consumer demand for environmentally friendly agricultural products; however, this can lead to an increase in the occurrence of rice blast. Further, in 2016 and 2017, rice blast disease became a serious problem in Bangladesh and India. Moreover, blast disease affects not only rice but also wheat. Therefore, prevention of blast disease is highly necessary and early prediction of its occurrence will be very helpful in achieving this aim.
Most existing studies on rice blast fungus prediction are based on meteorological variables such as the leaf wetness duration, temperature, relative humidity and rainfall associated with blast occurrence and progression; these studies also consider other factors such as the use of host varieties and nitrogen fertilization. The correlations among the considered variables are then analyzed in these studies [8][9][10][11]. The most commonly used weather variables are air temperature, relative humidity and rainfall, and studies have been conducted to construct blast simulation systems based on such weather information [12]. However, those studies were hindered by difficulties in identifying the mathematical relationship between blast disease and environmental factors and by the complexity of the mechanism involving the plant, pathogen and environmental factors; thus, practical application of the developed models is difficult.
In addition to mathematical formulations of the mechanism involving blast and environmental factors, attempts are also being made to predict rice blast disease using data-based machine learning methods. Previously, Kaundal et al. used weather variables such as temperature (max, min), relative humidity (max, min), rainfall and the number of rainy days per week for multiple regression, back propagation neural networks, generalized neural networks and support vector machine (SVM) methods, with the aim of predicting blast occurrence in India [13]. The prediction results were compared with the actual blast occurrence; based on the results, the SVM method was found to be the most suitable machine learning method for blast disease prediction. These researchers now provide an SVM-based rice blast prediction web server. However, in the study by Kaundal et al. [13], the historical patterns of the input variables (such as the temperature, relative humidity and rainfall) were not considered in the modeling. In addition, non-meteorological variables were not included.
Malicdem and Fernandez have proposed a blast prediction model for the northern Philippines using an artificial neural network and SVM [14]. Under given weather conditions, models for classifying possible rice blast occurrence in specific rice growth stages and for predicting onset severity were generated. Principal component analysis (PCA) of the effect of weather data on the rice blast disease onset showed that precipitation had the greatest influence (48%), followed by the temperature minimum (31%), temperature maximum (17%) and humidity (3%). In that study, however, no non-weather-related variables were used and the temporal patterns of the weather variables were not considered during modeling. In addition, the developed model predicts blast disease occurrence based on the weather conditions of the same year; thus, the capability for preemptive blast disease prevention is limited.
A potentially effective approach to blast disease prediction involves the use of artificial intelligence techniques. In artificial neural networks, a shallow neural network with one hidden layer has mainly been employed, because of data and computational power limitations. Recently, deep-learning-based algorithms called "deep neural networks" have been used for image classification, face recognition and speech recognition and have been attracting attention because of their high accuracy in many fields [15][16][17]. Deep-learning-based models perform feature extraction and classification simultaneously in deep neural networks, unlike traditional machine learning methods, which extract hand-crafted features and then feed the extracted features into simple classifiers such as SVMs. The ability to learn important features from data is their most important advantage; therefore, deep neural network-based algorithms have high performance if provided with sufficient data [18].
Among deep-learning algorithms, recurrent neural networks (RNNs) can effectively learn sequential patterns from data containing temporal or sequential information. In particular, long short-term memory networks (LSTMs), a kind of RNN, are becoming prominent [17,19]. LSTMs are designed to solve the vanishing gradient/exploding gradient problem of conventional RNNs and can efficiently capture long-term dependencies through memory cells and gates [20]. Many recent studies have shown that LSTMs perform better than basic deep feedforward neural networks for time series data [21][22][23].
In this study, to establish a system for early prediction of rice blast disease occurrence, we apply artificial intelligence techniques to past data on the degree of blast onset and to historical climatic data, unlike previous studies, which require climate information from the same year as the forecast.
A region-specific model that can predict the incidence of rice blast in the target area for the coming year is developed; this is achieved by applying LSTMs to data on the past rice blast disease scores in four representative rice-producing regions in South Korea and to the historical climatic data of the target area. In the proposed model, the data for the year to be predicted are not used as an input; instead, only data for the previous three years are used. In addition, the influences of the LSTM input factors, i.e., rice blast disease scores, air temperature, relative humidity and sunshine hours, on the predictive accuracy are analyzed. Among the various rice cultivars cultivated in Korea, the rice varieties cultivated over the largest area (according to the Korean RDA) are selected for evaluation of the rice blast prediction results.
The Rural Development Administration conducts inspections and forecasting regarding the occurrence of diseases and insect pests, including rice blast, in South Korea. The decision of whether or not to control pests is based on the information obtained from these inspections and forecasts. This information is sent to the municipal agricultural technology centers to help farmers perform timely pest control. The proposed system for the early prediction of the occurrence of rice blast disease can contribute to the disease management carried out by the related administrative departments. The proposed forecasting system is developed by region and cultivar; hence, a web-server-based system in which a user simply chooses a cultivar and a region could support individual rice growers by providing the rice blast prediction result for the selected cultivar in the selected region.
The remainder of this paper is organized as follows. Section 2 describes the data and LSTMs, while Section 3 describes the experimental procedures. In Section 4, we present and discuss the experimental results and draw conclusions.

Data
In this study, we used rice blast disease monitoring data and climatic data to generate a rice blast disease early prediction model. Field blotting tests were conducted to monitor rice blast disease occurrence; the method and data are described in Section 2.1.1. Details of the climatic data are given in Section 2.1.2.

Rice Blast Disease Score Data
The National Institute of Crop Science, RDA, conducted field trials in 12 regions (Icheon, Suwon, Cheolwon, Jinbu, Iksan, Unbong, Gyehwa, Milyang, Sangju, Yeongdeok, Yesan and Naju) of South Korea from 2003 to 2016. In each field trial, 358 kinds of primarily cultivated and reference cultivars were planted. The distances between cultivars were 10 cm (widthwise) and 20 cm (lengthwise) for each variety and sowing was performed at the end of June. In every 10 acres, 24, 9 and 9 kg of nitrogen, phosphoric acid and chlorine chloride were used, respectively. The highly susceptible cultivars (i.e., the Nakdong and Hopyeong cultivars) were sown using a spreader. Thirty days after sowing, the degree of rice blast onset was investigated according to the field test method. The rice blast disease incidence was graded on a scale of 0-9, where ratings of 0-3 indicate resistance, 4-6 indicate moderate resistance and 7-9 indicate susceptibility [24]. For each variety, disease scores were awarded based on the incidence of rice blast disease in each year and region. Table 1 presents selected data from 2013, i.e., information on the rice blast incidence for the Nampyung, Ilpum and Dongang cultivars in six provinces subjected to the RDA test. Note that no information is available for the Dongang cultivars in the Sangju region, because this test was not performed.
The purpose of the present study is to develop a model to estimate the blast incidence a year in advance using data from the previous three years in the same area. Therefore, data on the blast disease incidence in the same area and for the same variety for at least four consecutive years were required (where the final-year data are used to evaluate the prediction). However, the data collected through the field blotting tests discussed above indicate that the varieties planted each year and within each region differ. Therefore, it was necessary to reduce the number of regions considered so that as much data as possible could be used. Tests were conducted in the four major rice-growing regions of South Korea: Cheolwon in the north, Icheon in the center and Milyang and Naju in the south. We constructed an observation system using the rice blast monitoring data for those four areas. Figure 1a,b show the geographical locations of those four areas on the map of South Korea and the latitude and longitude of each region, respectively. Note that a regional prediction model was created in this study. Because there were two southern regions among the four regions, only Milyang was selected for development of this model; thus, the blast-disease predictive model was created for application to three regions, namely, Cheolwon, Icheon and Milyang. The data preparation methods for the LSTM-based rice blast prediction model are described in detail in Section 3.

Historical Climatic Data
The rice blast occurrence factors include environmental factors such as climatic and soil conditions, along with fertilizer application methods [25,26]. Therefore, incorporation of information on environmental factors in the prediction model helps improve the prediction accuracy. Data on soil conditions and fertilizer application methods are unavailable; however, past climate data can be obtained from the Korean Meteorological Agency. Therefore, this study considered data on rice blast incidence scores as well as climate from three areas: Cheolwon, Icheon and Milyang. Climatic data were obtained for June and July in the period between 2003 and 2016, which is consistent with the rice-blast data collection period. Among the climate data, various factors influence rice blast disease, such as humidity and rainfall [12,27,28]. Here, three major factors were selected, i.e., average temperature (°C), relative humidity (%) and sunshine hours, and the daily data were obtained. Sample results are listed in Table 2. These raw climatic data were subjected to data preprocessing. Further details are given in Section 3.1.2.

Long Short-Term Memory (LSTM) Recurrent Neural Networks (RNNs)
To develop a model for rice blast incidence prediction in the following year using past data on climate and rice blast incidence in four regions, we cast this multivariate time-series problem as a three-class classification. Recently, RNN-based methods for learning temporal patterns have been applied to time-series data analysis models rather than deep feed-forward neural networks [21][22][23]. Among the RNN methodologies, a methodology based on LSTM networks, which can prevent vanishing gradient problems [29] and reflect long-term dependency, has been attracting attention [17,19,22]. Therefore, in this study, LSTMs were employed to determine the temporal patterns in the rice blast disease incidence data. In this section, the RNNs and LSTMs used in this study are explained.
For data incorporating time series or sequences, it should be possible for sequential and temporal patterns to be learned; however, deep feed-forward neural networks are difficult to train. To overcome this drawback, RNNs, which are networks of neurons with recurrent connections, have been developed [30]. These recurrent connections or internal loops (called "feedback connections") can reflect temporal information during training. The RNN concept is illustrated in detail in Figure 2.
Here, $x_t$ is the input layer, $h_t$ is the hidden layer and $y_t$ is the output layer. Figure 2a shows a feedforward neural network with one hidden layer. This neural network is computed in one direction from $x_t$ to $h_t$ and finally to $y_t$ and can be mathematically expressed as follows:
$$h_t = f_1(U x_t + b_1), \tag{1}$$
$$y_t = f_2(V h_t + b_2). \tag{2}$$
Here, $U$ and $V$ are weight matrices and $b_1$ and $b_2$ are bias vectors. Further, $f_1$ and $f_2$ are activation functions such as the sigmoid function or hyperbolic tangent function.
Figure 2b shows an RNN structure similar to the feedforward network shown in Figure 2a but with a feedback loop in the hidden layer. In this case, $h_t$ at time $t$ is calculated by receiving input information at time $t$ ($x_t$) and using the previous hidden state vector ($h_{t-1}$), such that
$$h_t = f_1(U x_t + W h_{t-1} + b_1), \tag{3}$$
where $W$ is the weight matrix. The calculated $h_t$ value is used to calculate the $y_t$ value as shown in Equation (2) and, at the same time, to calculate the hidden state $h_{t+1}$ at the next time $t+1$. Hence, the temporal pattern can be learned. This RNN can be considered in the unfolded state shown in Figure 2c [31]. As it is spread over time, it is apparent that the hidden state information is continuously reflected in time. In addition, it is apparent that the RNN in the time direction is a very deep neural network.
RNNs are mainly trained using backpropagation through time [32]. This is a similar mechanism to standard backpropagation, except that the error is backpropagated through time rather than through layers. As described above, as RNNs are very deep neural networks in the time direction, vanishing or exploding gradients can occur [29]. In addition, RNNs can store short-term memory but are vulnerable to long-term dependency in terms of time. To overcome these shortcomings, LSTMs have been designed to hold short-term memory for a longer period. The LSTM network structure is composed of LSTM memory blocks similar to the hidden neurons in the RNN described above. These LSTM memory blocks are composed of memory cells and gates and play an important role in training long-range dependency, while controlling information storage and flow. Basic vanilla LSTM networks are described here, having LSTM memory blocks with one memory cell ($c_t$) and three gates, i.e., the input ($i_t$), forget ($f_t$) and output gates ($o_t$), as shown in Figure 3.
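As an illustration only, the recurrent hidden-state update of the basic RNN described above can be sketched in plain Python. The function names, the choice of tanh for the hidden activation and of the identity for the output activation, and the list-based matrices are assumptions made for this sketch; they are not taken from the implementation used in this study.

```python
import math


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a list vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def rnn_step(x_t, h_prev, U, W, V, b1, b2):
    """One time step of a basic RNN.

    h_t = tanh(U x_t + W h_prev + b1)   (tanh is an assumed activation)
    y_t = V h_t + b2                    (identity output is an assumed activation)
    """
    pre_activation = [
        u + w + b for u, w, b in zip(matvec(U, x_t), matvec(W, h_prev), b1)
    ]
    h_t = [math.tanh(p) for p in pre_activation]
    y_t = [v + b for v, b in zip(matvec(V, h_t), b2)]
    return h_t, y_t


def run_rnn(inputs, h0, U, W, V, b1, b2):
    """Unfold the RNN over a sequence, carrying the hidden state forward."""
    h, outputs = h0, []
    for x_t in inputs:
        h, y_t = rnn_step(x_t, h, U, W, V, b1, b2)
        outputs.append(y_t)
    return h, outputs
```

The loop in the hypothetical `run_rnn` corresponds to the unfolded view of the network: the same weights are reused at every time step and only the hidden state changes.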

The LSTM layer is calculated using the following equations:
$$i_t = \sigma(U_i x_t + W_i h_{t-1} + b_i), \tag{4}$$
$$f_t = \sigma(U_f x_t + W_f h_{t-1} + b_f), \tag{5}$$
$$g_t = \tanh(U_g x_t + W_g h_{t-1} + b_g), \tag{6}$$
$$c_t = f_t \otimes c_{t-1} + i_t \otimes g_t, \tag{7}$$
$$o_t = \sigma(U_o x_t + W_o h_{t-1} + b_o), \tag{8}$$
$$h_t = o_t \otimes \tanh(c_t). \tag{9}$$
In these equations, $U$ and $W$ are weight matrices and $b$ indicates bias. Further, $\sigma(\cdot)$ is the sigmoid function and the $\otimes$ symbol indicates element-wise multiplication. Equations (4), (5) and (8) are formulas for calculating the $i_t$, $f_t$ and $o_t$ gates at time $t$. The three gates take $x_t$ and $h_{t-1}$ as inputs, which are multiplied by the weight matrices. The sum of the results is added to the bias term and the sigmoid function of that result is taken. Their outputs range from 0 to 1 and a gate output close to zero indicates gate closure, i.e., the gate does not accept information. Conversely, if the gate output is close to 1, the information is fully entered. Therefore, the information input/output is controlled through these three gates, i.e., these gates are used to calculate $c_t$ and $h_t$, which are the main computational components of the LSTM memory blocks. First, $c_t$ is calculated as shown in Equation (7), where the previous cell state ($c_{t-1}$) is multiplied by $f_t$ and the newly input information ($g_t$) is multiplied by $i_t$. Therefore, $f_t$ decides the amount of information on $c_{t-1}$ to be forgotten and $i_t$ decides how much newly added information is added to $c_t$. Then, $h_t$ is calculated as shown in Equation (9), where $o_t$ is multiplied by the activation function applied to $c_t$ at time $t$. The $h_t$ values are not unconditionally transmitted but are controlled by $o_t$.
The calculated $c_t$ and $h_t$ at time $t$ are transmitted to the calculation at the next time step, as shown in Figure 3b. In the $h_t$ computation of the basic RNN, only the hidden state of the previous timestep ($h_{t-1}$) is transmitted. However, in the $h_t$ calculation of the LSTM, both $c_{t-1}$ and $h_{t-1}$ are transmitted. In the output ($y_t$) calculation, by contrast, $c_t$ is not transferred and only $h_t$ is transmitted.
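To make the gate computations concrete, a single step of a vanilla LSTM memory block can be sketched as follows. This is a minimal plain-Python sketch under stated assumptions: the function names, the `params` dictionary layout and the use of tanh for the candidate input and the cell output are illustrative choices, not the implementation used in this study.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(U, W, b, x, h):
    """Return U x + W h + b for list-of-rows matrices and list vectors."""
    return [
        sum(u * xj for u, xj in zip(u_row, x))
        + sum(w * hj for w, hj in zip(w_row, h))
        + b_k
        for u_row, w_row, b_k in zip(U, W, b)
    ]


def lstm_step(x_t, h_prev, c_prev, params):
    """One time step of a vanilla LSTM block.

    params maps each of "i", "f", "g", "o" to a (U, W, b) tuple
    (a hypothetical layout chosen for this sketch).
    """
    i_t = [sigmoid(z) for z in affine(*params["i"], x_t, h_prev)]    # input gate
    f_t = [sigmoid(z) for z in affine(*params["f"], x_t, h_prev)]    # forget gate
    g_t = [math.tanh(z) for z in affine(*params["g"], x_t, h_prev)]  # candidate input
    o_t = [sigmoid(z) for z in affine(*params["o"], x_t, h_prev)]    # output gate
    # New cell state: keep part of the old state, add part of the candidate.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, g_t)]
    # New hidden state: the output gate controls how much of the cell is exposed.
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t
```

Both returned vectors are passed to the next time step, whereas only the hidden state would be passed on to an output layer.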

Experiments
To develop early prediction models for rice blast incidence, we conducted experiments to predict the rice blast disease score for the fourth year based on rice blast disease score data and climate data from the previous three years. As described in Section 2.1.1, the rice blast disease scores were divided into three classes to indicate resistance (0-3), moderate resistance (4-6) and susceptibility (7-9). The experimental and analytical procedures were implemented in the order shown in Figure 4.
The experiments and analysis procedure are briefly described here. First, data were generated for training of LSTM models. With these generated data, region-specific models for three regions, Cheolwon, Icheon and Milyang, were developed. Experiments were performed for five combinations of input variables, where the considered input variables were the rice blast disease scores, temperature, relative humidity and sunshine hours. Hence, the input variables important for prediction of blast disease incidence were determined. After training the regional models, the test results yielded by each model were analyzed. In addition, the trained region-specific models were used to predict rice blast disease in the next year using only the three-year rice blast disease scores for any cultivar. Because the rice blast prediction accuracy for the cultivars, which possess different resistance genes, is also significant, rice varieties popularly cultivated in South Korea were selected. The test results of the region-specific model trained through cultivar-agnostic study were then examined and compared. Details of the data preparation and model training methods are presented in Sections 3.1 and 3.2. The comparative analysis results of the model validation experiment by region and input variable combination are presented in Section 4, along with those for the popularly cultivated varieties.

Data Preparation for LSTMs
The raw data used in this study are described in Section 2.1. To create LSTM models, the model input and output must be defined and the data must be cleansed and preprocessed. Section 3.1.1 describes the preparation of the rice blast disease score data for prediction model generation, while Section 3.1.2 describes the climate data preparation.

Blast Disease Score Data Preparation
LSTM is a data-driven approach in which the algorithm itself learns the features that are useful for predicting blast disease, so the amount of data has a significant impact on its performance [18]. The raw data acquired for the various rice varieties in the 12 Korean regions in the period between 2003 and 2016 include data from Cheolwon, Icheon and Milyang, the regions targeted in this study. However, as the raw data differ among regions and the varieties planted each year also vary, using data from all regions would significantly reduce the amount of usable data because of the many missing values. Therefore, in order to use the maximum possible amount of data and to employ representative data, we narrowed the sample range to data for Cheolwon, Icheon, Milyang and Naju.
As explained above, data for four consecutive years for the same cultivar in each of the four regions were required. This is because the data for the initial three years are used as the model input and the data for the fourth year are used as the target for the model output; see Figure 5 for an example. Using this approach, if any data were missing for a particular variety in any of the four regions in the four-year period, that sample was removed from the overall experimental dataset. Hence, a dataset containing a total of 1191 elements was obtained for each region following removal of missing data. Then, 70% of those elements were used as training data (833), 10% as validation data (119) and 20% as test data (239).
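The windowing and split described above can be illustrated with a minimal Python sketch. It is not the authors' code: the dictionary layout keyed by (variety, region, year), the use of overlapping (sliding) four-year windows and the names `build_samples` and `split_counts` are all assumptions made for illustration. The split uses integer arithmetic, which happens to reproduce the reported 833/119/239 counts for 1191 elements.

```python
REGIONS = ["Cheolwon", "Icheon", "Milyang", "Naju"]

def build_samples(scores, varieties, years, target_region):
    """Build (input, target) pairs from four consecutive years of scores.

    scores: dict mapping (variety, region, year) -> blast score (0-9).
    Assumption: windows may overlap; a window is dropped if any of the
    16 values (4 regions x 4 years) is missing.
    """
    samples = []
    for variety in varieties:
        for start in years:
            window = [start + k for k in range(4)]
            keys = [(variety, r, y) for y in window for r in REGIONS]
            if any(k not in scores for k in keys):
                continue  # incomplete window: discard
            # Three time steps, each holding the scores of the four regions.
            inputs = [[scores[(variety, r, y)] for r in REGIONS]
                      for y in window[:3]]
            # Fourth-year score in the target region is the label source.
            target = scores[(variety, target_region, window[3])]
            samples.append((inputs, target))
    return samples

def split_counts(n):
    """Return (train, validation, test) sizes for a 70/10/20 split."""
    n_train = n * 70 // 100
    n_val = n * 10 // 100
    return n_train, n_val, n - n_train - n_val
```

For example, `split_counts(1191)` returns `(833, 119, 239)`, matching the counts given in the text.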
The target values of the LSTM model were grouped into Classes 0-2, corresponding to resistance (scores 0-3), moderate resistance (scores 4-6) and susceptibility (scores 7-9), respectively. Table 3 lists the number of data elements in the training, validation and test datasets for each region used in the experiment. Note that the training, validation and test elements listed for each region in Table 3 together make up the total number of data elements used for the regional model generation. Figure 6 shows the class distribution of all data in each region.
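The score-to-class mapping stated above is simple enough to write down directly. The following Python sketch is illustrative only; the function names `score_to_class` and `class_distribution` are hypothetical and do not come from the study.

```python
from collections import Counter

def score_to_class(score):
    """Map a blast disease score (0-9) to Class 0, 1 or 2."""
    if not 0 <= score <= 9:
        raise ValueError("blast score must lie between 0 and 9")
    if score <= 3:
        return 0  # resistance (0-3)
    if score <= 6:
        return 1  # moderate resistance (4-6)
    return 2      # susceptibility (7-9)

def class_distribution(scores):
    """Return the fraction of samples falling into each class."""
    counts = Counter(score_to_class(s) for s in scores)
    total = len(scores)
    return {c: counts.get(c, 0) / total for c in (0, 1, 2)}
```

A distribution computed this way per region is the kind of summary that Figure 6 reports.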
The distributions differ from region to region. Cheolwon had a high percentage of Class 0 elements (59%), followed by Classes 1 (24%) and 2 (17%). In Icheon, Class 0 was again the most common (44%), with Classes 1 and 2 having smaller and identical proportions (28% each). Finally, unlike in Cheolwon and Icheon, Class 1 was the most common in Milyang (45%), followed by Classes 0 (29%) and 2 (26%).
In addition, the numerical ranges of the data used in this study varied (see Tables 1 and 2), because the model considers different input variables, such as the air temperature, relative humidity and sunshine hours. Therefore, data normalization was required [33]. The rice blast disease incidence values were incorporated into the model as integer data with values between 0 and 9. These values were rescaled to the range from 0 to 1 using the min-max normalization method [34], expressed as
$$x_{norm} = \frac{x_{raw} - x_{min}}{x_{max} - x_{min}}, \tag{10}$$
where $x_{raw}$ is the original data value, $x_{min}$ and $x_{max}$ are the minimum and maximum values of the variable, respectively, and $x_{norm}$ is the normalized value. In the model developed in this study, the blast disease score data as well as the average temperature, relative humidity and sunshine hour values were normalized using Equation (10).
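As an illustration of min-max normalization, the following minimal Python sketch rescales a list of values to the range from 0 to 1. The function name `min_max_normalize` and the optional `lo`/`hi` arguments (for fixing the bounds, e.g., 0 and 9 for blast scores) are assumptions for this example rather than details taken from the study.

```python
def min_max_normalize(values, lo=None, hi=None):
    """Rescale values to [0, 1] via (x - min) / (max - min).

    lo and hi default to the observed minimum and maximum; they can be
    fixed explicitly when the valid range of the variable is known.
    """
    lo = min(values) if lo is None else lo
    hi = max(values) if hi is None else hi
    if hi == lo:
        raise ValueError("cannot normalize a constant variable")
    return [(v - lo) / (hi - lo) for v in values]
```

With fixed bounds of 0 and 9, a blast score of 0 maps to 0.0 and a score of 9 maps to 1.0.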

Climate Data Preparation
In order to perform the experiment with additional climatic factors, as described in Section 2.1.2, the daily average temperatures, humidities and sunshine hours for June and July in the period between 2003 and 2016 were obtained for the target regions (Cheolwon, Icheon and Milyang); this was the same period for which the rice blast incidence data were obtained. Therefore, 61 daily values (from 1 June to 31 July) of temperature, humidity and sunshine hours were obtained for each year. The average temperature, average relative humidity and average sunshine hours from June to July are shown graphically in Figure 7, to allow identification of climate characteristics and climate change trends by region.

As shown in Figure 7, the average June-to-July temperature in Cheolwon, which is located in the northern part of South Korea, is lower than those in Icheon (center) and Milyang (south). The temperatures required for the onset of blast disease have been confirmed to lie within the range of 22-26 °C. Rice blast is promoted by the high humidity that follows the monsoon season in Korea, i.e., at the end of June. Since 2014, rainfall in Korea has decreased as a result of global climate change and humidity has also been lower than in previous years. These meteorological conditions have yielded low blast incidence.
The rice blast disease scores used as inputs for the LSTM predictive model in each time step were four-dimensional, comprising the scores for the Cheolwon, Icheon, Milyang and Naju regions. However, the raw climatic data were 61-dimensional for each variable. Using them directly made model training difficult, because the rice blast scores would occupy only a very small part of the input. Therefore, the temperature, relative humidity and sunshine hours were averaged over 15-day periods (1-15 June, 16-30 June, 1-15 July, 16-31 July), giving four-dimensional data per time step for each climatic variable. These variables were then normalized using Equation (10).
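The reduction from 61 daily values to four period averages can be sketched as follows. This is a minimal Python illustration, assuming the daily values are stored as a list ordered from 1 June to 31 July; the names `PERIODS` and `period_means` are hypothetical. Note that the last period (16-31 July) spans 16 days.

```python
# Index ranges (start inclusive, end exclusive) into a 61-day list that
# begins on 1 June: 1-15 June, 16-30 June, 1-15 July, 16-31 July.
PERIODS = [(0, 15), (15, 30), (30, 45), (45, 61)]

def period_means(daily):
    """Average 61 daily values into four half-month means."""
    if len(daily) != 61:
        raise ValueError("expected 61 daily values (1 June to 31 July)")
    return [sum(daily[a:b]) / (b - a) for a, b in PERIODS]
```

Applying `period_means` to the temperature, humidity and sunshine series of one year yields three four-dimensional vectors, which can then be normalized.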

Model Design
The purpose of this study is to construct an artificial intelligence system that can predict the onset of rice blast based on historical blast disease scores and climatic data. Environmental factors such as climate and soil differ among regions; thus, region-specific models were created, as described above. Three regions, Cheolwon (northern), Icheon (central) and Milyang (southern), which are active rice farming regions in Korea, were set as target prediction regions and localized LSTM model variations were created. To investigate the factors that may contribute to blast disease incidence prediction, various combinations of input variables (rice blast disease score, average air temperature, relative humidity and sunshine hours) were used. Table 4 lists the model variations and input variables used in this study.
In order to determine the usefulness of past rice blast information for predicting blast disease incidence a year ahead, a model variation using only the blast score variable to generate predictions was developed: the first model variation listed in Table 4, Blast_LSTM. This regional prediction model uses not only the rice blast scores of the target region but also the characteristics of the target variety regarding the incidence of blast disease in other regions. That is, the blast disease outbreak scores for all four regions (Cheolwon, Icheon, Milyang and Naju) in the first three years were used as the input variables of Blast_LSTM. Thus, for this model variation, the size of the input variable for each time step is 4.
The second through fourth model variations in Table 4 incorporate climate information together with the rice blast disease incidence data for the four regions. In each case, one climate variable is added at a time to analyze the impact of each variable on the rice blast disease incidence prediction accuracy. The second model variation, BlastT_LSTM, incorporates the input variables of Blast_LSTM and the average temperature of the target region. As described in Section 3.1.2, the target regions have very different climatic characteristics; thus, only local information is used for the climate variables. For example, the Cheolwon-area BlastT_LSTM prediction model was created by considering the degree of occurrence of rice blast in the four regions in the past three years and the temperature of the Cheolwon area as the target area. As described in Section 3.1.2, the June and July data for each year, i.e., the average temperature, relative humidity and sunshine hours, were averaged over 15-day periods; thus, each of these variables has a size of 4 and the size of the input variable for each time step of BlastT_LSTM is 8.
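To make the per-time-step input sizes concrete, the sketch below assembles the input vector of one time step for each model variation. It is a hypothetical Python illustration: the dictionary `VARIATIONS`, the feature keys and the function `build_time_step` are invented for this example. Only the sizes 4 (Blast_LSTM) and 8 (BlastT_LSTM) are stated in the text; the sizes 12 and 16 follow from adding further four-dimensional climate variables, and the sketch assumes that Climate_LSTM uses all three climate variables.

```python
# Which four-dimensional feature blocks each model variation consumes.
VARIATIONS = {
    "Blast_LSTM": ("blast",),
    "BlastT_LSTM": ("blast", "temperature"),
    "BlastTH_LSTM": ("blast", "temperature", "humidity"),
    "BlastTHS_LSTM": ("blast", "temperature", "humidity", "sunshine"),
    "Climate_LSTM": ("temperature", "humidity", "sunshine"),  # assumption
}

def build_time_step(variation, features):
    """Concatenate the feature blocks of one year into one input vector.

    features: dict with keys "blast" (scores of the four regions) and
    "temperature", "humidity", "sunshine" (four period means each).
    """
    vector = []
    for name in VARIATIONS[variation]:
        block = features[name]
        if len(block) != 4:
            raise ValueError(name + " must have exactly four values")
        vector.extend(block)
    return vector
```

A full model input is then a sequence of three such vectors, one for each of the three preceding years.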
By adding the relative humidity of the target region to BlastT_LSTM, the BlastTH_LSTM model variation was obtained. Similarly, BlastTHS_LSTM was developed by adding the sunshine hour data to BlastTH_LSTM. Therefore, the latter model variation takes the rice blast score information and all the climatic variables considered in this study as input. Finally, in order to analyze the efficacy of blast disease prediction using past climate information only, we created a model variation considering only climatic variables (i.e., excluding the rice blast scores), which we called "Climate_LSTM." For each combination of input variables, we developed a rice blast prediction model using an LSTM network structure with a single LSTM hidden layer and three time steps, as shown in Figure 8.

In Figure 8, $x_{t-2}$, $x_{t-1}$ and $x_t$ are the input vectors for each of the preceding three years, composed of the rice blast disease scores ($B_t$) and the average air temperature ($T_t$), relative humidity ($H_t$) and sunshine hour ($S_t$) values. These are the input values comprising the combinations used in the model variations described in Table 4. The inputs pass through the LSTM layer in sequence up to time $t$, the most recent time step, with the LSTM layer computed according to Equations (4)-(9). The hidden state $h_t$ produced by the LSTM layer at the last time step $t$ is used to predict the class of the degree of rice blast occurrence in the next year. This prediction is output as $y_t$ through the softmax layer. Recall that the output classes are numbered 0-2, as defined above. The value of $y_t$ is calculated as follows:
$$z_t = W_z h_t + b_z, \tag{11}$$
$$y_{t,i} = \frac{\exp(z_{t,i})}{\sum_{j=0}^{2} \exp(z_{t,j})}, \tag{12}$$
where $W_z$ is the weight matrix, $b_z$ is a bias term and $z_t$ is a three-dimensional vector to which the softmax function in Equation (12) is applied. In that equation, $z_{t,i}$ is the $i$-th unit value of $z_t$, called a "logit." The final $y_t$ value gives the probability of each class.
To train this model, we used the following cross-entropy loss function:
$$L = -\frac{1}{N} \sum_{n=1}^{N} \sum_{i=0}^{2} p_{t,i,n} \log y_{t,i,n},$$
where $N$ is the total number of training data elements, $p_t$ is the target value, $p_{t,i,n}$ is the $i$-th value of the $p_t$ of the $n$-th sample and $y_{t,i,n}$ is the $i$-th value of the $y_t$ of the $n$-th sample. With this loss function, we trained the proposed LSTM models using the Adam optimizer [35] and implemented them using TensorFlow version 1.3.0 [36], which is one of the leading deep learning platforms.
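The output layer and loss described above can be sketched in plain Python as follows. This is an illustrative re-implementation under stated assumptions, not the authors' TensorFlow code: the function names `dense`, `softmax` and `cross_entropy` are hypothetical, the targets are assumed to be one-hot vectors over the three classes and the loss is assumed to be averaged over the N samples.

```python
import math

def dense(h, W, b):
    """Compute the logits z = W h + b for a hidden state h."""
    return [sum(w * x for w, x in zip(row, h)) + bias
            for row, bias in zip(W, b)]

def softmax(z):
    """Turn logits into class probabilities (numerically stable form)."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(targets, predictions):
    """Mean cross-entropy between one-hot targets and probabilities."""
    loss = 0.0
    for p, y in zip(targets, predictions):
        # Only classes with non-zero target probability contribute.
        loss -= sum(pi * math.log(yi) for pi, yi in zip(p, y) if pi > 0)
    return loss / len(targets)
```

The predicted class is the index of the largest probability returned by `softmax`; with uniform probabilities over three classes, the loss equals log 3.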

Results and Discussion
Through the process described in this paper, a rice blast prediction model applicable to the Cheolwon, Icheon and Milyang areas and supporting different input variable combinations was developed. The model can be adjusted to predict the rice blast occurrence in each region in the next year, based on the rice blast scores for the previous three years in Cheolwon, Icheon, Milyang and Naju and/or the temperature, humidity and sunshine hours data for each region. Table 5 lists the prediction performance for rice blast occurrence in the Cheolwon, Icheon and Milyang regions for each of the different model variations (see Table 4). To assess their performance, we used standard performance metrics, i.e., the accuracy and F1-score. Here, the accuracy is defined as the proportion of correctly classified data elements among the total number of test data elements and the F1-score is the harmonic mean of the precision and recall.
We first discuss the performance of Blast_LSTM, which predicts the incidence of rice blast one year ahead from past rice blast occurrence alone. Recall that the trained model yields predictions classified into three groups: resistant, moderately resistant and susceptible. Thus, if the information on past rice blast occurrence did not contribute to the prediction of rice blast occurrence in the following year, the accuracy would be approximately 33%. However, as apparent from Table 5, the Blast_LSTM model yielded 62.3% accuracy (F1-score: 59.6%) for the Cheolwon region and 61.5% (59.8%) and 46.9% (44.9%) accuracies for the Icheon and Milyang regions, respectively, all of which are considerably higher than the chance level of 33%. In particular, the accuracy for the Cheolwon region was quite high, at 62.3%. Thus, the results indicate that information on past rice blast occurrence helps predict the occurrence of rice blast a year ahead. Based on this finding, Blast_LSTM was used as the base model for comparison with the results of the other model variations.
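The two metrics can be computed as in the following minimal Python sketch. The function names are hypothetical. The text defines the F1-score as the harmonic mean of precision and recall but does not say how the per-class scores were combined for the three-class problem, so the unweighted (macro) average used here is an assumption.

```python
def accuracy(y_true, y_pred):
    """Fraction of test elements whose class was predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_for_class(y_true, y_pred, cls):
    """Harmonic mean of precision and recall for a single class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    if tp == 0:
        return 0.0  # no correct predictions for this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred, classes=(0, 1, 2)):
    """Unweighted mean of the per-class F1-scores (assumed averaging)."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)
```

Macro averaging weights all three classes equally, which matters here because the class distributions differ considerably among regions.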
In comparison with the Blast_LSTM base model, BlastT_LSTM, in which the past average temperature information is combined with that of past rice blast incidence, achieved a relative accuracy improvement of 5.5% (from 62.3% to 65.7%) for the Cheolwon region model (with a 2.3% improvement in F1-score), 2.1% (F1-score improvement: 1.5%) for the Icheon area model and 8.7% (F1-score improvement: 13.8%) for the Milyang model. Therefore, the average temperature information from the preceding three years was found to be helpful for predicting rice blast disease occurrence. Note that, for the Milyang area model in particular, the Blast_LSTM accuracy was lower than that for the other regions. For this region, the addition of temperature information in BlastT_LSTM helped improve the prediction performance. As shown in Figure 7a, the average temperatures of Cheolwon, Icheon and Milyang increased gradually from 2003 to 2016, changing slowly over several years. Therefore, even without knowledge of the temperature of the next year, the temperature information from the past three years is thought to help improve the accuracy of blast disease occurrence prediction.
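The quoted improvement for Cheolwon (from 62.3% to 65.7%) corresponds to a relative change rather than a difference in percentage points, as the short Python sketch below illustrates. The function names `relative_change` and `point_difference` are hypothetical and serve only to make the arithmetic explicit.

```python
def relative_change(baseline, value):
    """Change of value with respect to baseline, in percent."""
    return (value - baseline) / baseline * 100.0

def point_difference(baseline, value):
    """Plain difference between two percentages, in percentage points."""
    return value - baseline
```

For the Cheolwon accuracies, `relative_change(62.3, 65.7)` rounds to 5.5, whereas `point_difference(62.3, 65.7)` is about 3.4 percentage points.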
Compared with the BlastT_LSTM results, the model variation that additionally incorporates the relative humidity input variable, i.e., BlastTH_LSTM, showed improvements of 3.9%, 1.2% and 2.9% in the F1-score for the Cheolwon, Icheon and Milyang regions, respectively. Compared with Blast_LSTM, which is the model having only rice blast disease scores as the input variables, the F1-scores improved by 6.4%, 2.7% and 17.1% for Cheolwon, Icheon and Milyang, respectively. It is not possible to accurately determine the influence of each input variable on the prediction by examining the correlation between input variables. However, as BlastTH_LSTM exhibits superior accuracy to the preceding model variations, it is apparent that the past relative humidity is also an important factor affecting rice blast prediction.
Comparison of the prediction results of BlastTHS_LSTM, the model variation also incorporating the sunshine hours, with those of BlastTH_LSTM indicates that the F1-scores were improved by 0.3%, 1.1% and 1.5% for the Cheolwon, Icheon and Milyang regions, respectively, and the accuracy was improved by 0.7%, 0.6% and 4.0%, respectively. For all three regions, BlastTHS_LSTM had the highest accuracy among the examined model variations; this model variation incorporated the rice blast incidence scores and the average temperature, relative humidity and sunshine hour data.
Finally, we discuss the prediction results of Climate_LSTM, which considered only the climate data of the previous three years. This model variation excludes the information on past rice blast incidence and exhibits the lowest prediction accuracy for all three regions, ranging between 44.4% and 55.2%. However, because these accuracies still exceed the chance level of approximately 33%, the results confirm that past climate information can be helpful for prediction of rice blast disease incidence. Note that this approach differs from methods forecasting the rice blast occurrence of the current year based on the previous year's climate information only [8][9][10][11].
In comparison with the base model, Blast_LSTM, which considers past rice blast disease scores only without climate information, the prediction accuracy of the Climate_LSTM model was lower by 11.4% for Cheolwon and by 14.3% for Icheon. For the Milyang region, the accuracy was also lower, by 5.3% (although the size of the decline differs, the F1-scores were also considerably inferior).
Figure 9 shows a comparison of the prediction accuracies of the Climate_LSTM, Blast_LSTM and BlastTHS_LSTM model variations by region. From Figure 9 and Table 5, it is apparent that the prediction performance of each of these model variations differed from region to region. For all model variations, the highest prediction performance was obtained for the Cheolwon region, followed by the Icheon region, for which the accuracy was 0.8-4.3 percentage points lower than that for the Cheolwon region. The lowest prediction performance was obtained for the Milyang region for all model variations, with the accuracy being lower than that of the Cheolwon area by 10.8-15.4 percentage points.
As shown in Figure 10, the class distributions of the training and test data are similar for the Cheolwon region but differ significantly for the Milyang region. It is thought that this mismatch between the training and test data for the Milyang region may have caused the significant reduction in prediction accuracy.
The prediction system proposed in this study was developed in a cultivar-agnostic manner. Therefore, regardless of the cultivar, the rice blast incidence in the following year can be predicted using data from the past three years and the model developed in this study. To examine this feature, the rice varieties most widely cultivated in Korea were selected and the predictions yielded by the developed model variations were analyzed.
According to the Korean RDA, rice varieties with high yield and superior taste are primarily grown each year. The most popular rice varieties for cultivation in Korea are Jopyeong, Samdeok, Onnuri, Nampyeong, Hwanggeumnuri, Koshihikari, Saenuri, Unkwang, Ilpum, Chucheong, Dongjinchal, Ilmi, Odae, Daean, Samgwang, Hopum, Chilbo, Saeilmi, Wungwang and Sindongjin. Data for 17 of these varieties (excluding Chilbo, Saeilmi and Wungwang) were included in the test dataset. Then, the effectiveness of each proposed prediction model variation was analyzed by examining the rice blast disease incidence prediction results. The average prediction accuracies for the 17 varieties cultivated in Cheolwon, Icheon and Milyang are listed in Table 6. Among the three regions, the most accurate results were obtained for Cheolwon. The BlastTHS_LSTM model variation, which is the LSTM model incorporating the past blast occurrence rate, average temperature, relative humidity and sunshine hours as input variables, yielded the highest accuracies among all variations. For this model, accuracies of 79.4%, 64.7% and 55.6% were obtained for the Cheolwon, Icheon and Milyang regions, respectively. The prediction accuracies for each of the cultivars given by BlastTHS_LSTM are shown in Figure 11. It is apparent that the accuracies for some individual cultivars were better than those for others. However, the sample size was small for each cultivar, ranging from 1 to 4.
These results indicate that rice blast information for cultivars grown in Korea over the previous three years, in conjunction with climate information, can be used to predict the occurrence of rice blast a year later. These findings will be helpful for preventing blast disease in the future. If weather conditions such as humidity vary within an area, the blast population in the area is expected to vary also and the response of rice to the blast is also expected to differ. The accuracies of the proposed models are not very high; however, this work is a meaningful starting point, being the first attempt at an LSTM-based rice blast prediction model. The performance of the models can be further improved by adding more training data and optimizing the LSTM architecture.
Rice blast fungus is a representative model phytopathogenic fungus for which gene-for-gene interactions with the host are applicable. To date, more than 40 resistance genes have been identified in host rice and nine of the corresponding pathogen avirulence genes have been identified through molecular biology and functional genetics [37][38][39][40][41][42]. As a strategy to control rice blast fungus, introduction of a resistance gene through breeding is considered to be the most effective. To understand the race or pathotype of rice blast fungus in South Korea and to collect information for introducing resistance genes to rice varieties, the distribution and transposon of nonpathogenic genes of rice blast fungus at the group level have been determined through DNA-fingerprinting studies using molecular biomarkers [43]. According to the results of such studies, the pathogenic race pathotype is becoming more diverse than in the past, despite the decrease of rice blast throughout South Korea [44]. These different types of pathogenesis are presumed to be caused by genetic variation of the rice blast fungus, which may lead to an increase in strains compatible with resistant cultivars [6]. Such breakdown of resistance, in which resistant varieties become susceptible, has been reported for many crops, including rice [45].
The rice blast disease prediction system presented in this study can provide results for specific rice varieties; therefore, it will be of considerable assistance to rice blast researchers, especially in comparison with conventional rice blast predictions. Providing these data to rice breeders can also suggest directions for breeding resistant varieties. Rice blast resistance genes are constantly being studied and hence using all this information together will be helpful for resistance breeding.
In this study, we incorporated different combinations of input variables, i.e., the rice blast occurrence scores and the temperature, humidity and sunshine hours data, into the developed model variations. For all regions, the predicted results were most accurate when the rice blast occurrence scores, temperature, humidity and sunshine hours data were all considered. Note that other studies involving artificial intelligence have indicated that plant disease occurrence is related to the combination of pathogens, environmental conditions and host plants.
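The idea of varying the input variables can be sketched as an enumeration of input sets, each consisting of the blast score plus some subset of the three climate variables. This is only an illustration under stated assumptions: the text names only BlastTHS_LSTM, so the other variation names below are formed by analogy and are hypothetical, as are the feature labels, and the study may have evaluated only a selection of these combinations.

```python
from itertools import combinations

# Letter codes follow the "THS" suffix of BlastTHS_LSTM; the feature labels
# on the right are hypothetical column names, not taken from the study.
CLIMATE_VARS = {"T": "temperature", "H": "relative_humidity", "S": "sunshine_hours"}

def model_variations():
    """Map each variation name to its list of input variables.

    Every variation includes the blast score; climate variables are added
    in all possible subsets (including none and all three).
    """
    variations = {}
    for k in range(len(CLIMATE_VARS) + 1):
        for subset in combinations(CLIMATE_VARS, k):
            name = "Blast" + "".join(subset) + "_LSTM"
            variations[name] = ["blast_score"] + [CLIMATE_VARS[c] for c in subset]
    return variations

variations = model_variations()
for name, inputs in variations.items():
    print(name, inputs)
```

Training one model per entry and comparing accuracies on the same test set is one straightforward way to measure how much each climate variable contributes.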
In addition, we found that early prediction of rice blast occurrence based on climate data for the past three years is possible and that blast disease prevention can be further facilitated by incorporating knowledge of the rice blast occurrence for each of those years. In this study, we developed models for three representative regions in South Korea to analyze the feasibility of the LSTM-based methodology and analyzed the prediction results for 17 varieties. However, in addition to the three regions, rice blast monitoring data exist for 9 regions and 358 varieties of rice in South Korea. Therefore, the framework used in this study can be extended to a prediction system for the remaining regions and all varieties. The utility of the proposed LSTM models is expected to be high. In addition, because the deep learning method used in this study is capable of transfer learning, it can readily be applied to data from other countries or regions. Thus, although this study was based on data from South Korea, the findings and developed system will be helpful for the various countries in which rice is grown as a primary crop.

Figure 1. Four major rice-growing regions in South Korea. (a) Map of South Korea showing four regions used in this study; (b) Latitudes and longitudes of four selected regions.


Figure 4. Flowchart of procedures employed in this study.

Figure 5 shows the Nampyung cultivar's 2003-2006 rice blast disease scores for the Cheolwon, Icheon, Milyang and Naju regions. (The score range is set from 0 to 9, as described earlier.) The blast disease scores for each region between 2003 and 2005 were used as inputs and the blast disease scores from 2006 were used as the output values. For example, the 2006 score from Cheolwon was 4, which falls in the middle resistance class.
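The pairing of three years of scores with the following year's score can be sketched as a sliding window over a yearly series. This is a minimal illustration, not the authors' preprocessing code. The function name `make_windows` and the 2003-2005 scores are hypothetical, and only the 2006 Cheolwon score of 4 is taken from the text.

```python
def make_windows(scores_by_year, window=3):
    """Pair each run of `window` consecutive yearly scores with the next year's score.

    `scores_by_year` maps year -> disease score (0 to 9). Windows that span
    a missing year are skipped so inputs are always consecutive years.
    """
    years = sorted(scores_by_year)
    samples = []
    for i in range(len(years) - window):
        block = years[i:i + window + 1]
        # A gap-free block of window + 1 years spans exactly `window` years.
        if block[-1] - block[0] != window:
            continue
        inputs = [scores_by_year[y] for y in block[:-1]]
        target = scores_by_year[block[-1]]
        samples.append((inputs, target))
    return samples

# Hypothetical 2003-2005 scores; the 2006 value of 4 is the one cited in the text.
cheolwon = {2003: 3, 2004: 5, 2005: 2, 2006: 4}
samples = make_windows(cheolwon)
print(samples)
```

For a longer record, the same function yields one sample per target year, so a series covering 2003 to 2016 produces eleven input and output pairs for a given cultivar and region.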

Figure 7. Average climate trends for target areas from June to July and from 2003 to 2016: (a) Average temperature; (b) Average relative humidity; (c) Average sunshine hours.


Figure 8. LSTM architecture of blast disease score prediction model.


Figure 9. Comparison of prediction accuracies of selected LSTM model variations by region.

The deep learning model used in this study implements a data-driven approach, in which the training data are very important because the LSTM model itself learns the features needed for prediction from those data. Note that we divided the data from 2003 to 2016 into training (70%), validation (10%) and test (20%) sets in chronological order. Therefore, the test data were the most recent data and the training data were the oldest. The distribution of Classes 0-2 varies with time; Figure 10 shows the class distributions for the training (historical) and test (most recent) data for each region.
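A chronological 70/10/20 split of this kind can be sketched as follows. The function name and the rounding rule (fractions rounded down, with the remainder assigned to the test set) are assumptions. Under that rule, the 1191 elements per region reported in Table 3 divide into 833, 119 and 239 elements.

```python
def chronological_split(records, train_frac=0.7, val_frac=0.1):
    """Split time-ordered records into train, validation and test sets.

    `records` must already be sorted from oldest to newest. No shuffling is
    done, so the test set always holds the most recent records.
    """
    n = len(records)
    n_train = int(n * train_frac)  # rounded down (assumed rounding rule)
    n_val = int(n * val_frac)
    train = records[:n_train]
    val = records[n_train:n_train + n_val]
    test = records[n_train + n_val:]  # remainder, i.e. the newest records
    return train, val, test

# Placeholder records standing in for one region's 1191 time-ordered elements.
records = list(range(1191))
train, val, test = chronological_split(records)
print(len(train), len(val), len(test))
```

Splitting by time rather than at random avoids training on records that are newer than those being predicted, at the cost of a possible shift in class distribution between the training and test periods.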

Figure 10. Class distributions of training (historical) and test (most recent) data by region.


Figure 11. Prediction results for 17 most popular cultivars yielded by BlastTHS_LSTM model.


Table 1. Rice blast disease monitoring data example from 2013.

Table 3. Data elements by training, validation and test class for each region. There are 1191 elements for each region, with 833 (70%), 119 (10%) and 239 (20%) elements in the training, validation and test classes, respectively.

Table 4 lists the model variations and input variables used in this study.

Table 4. LSTM model variations developed in this study and used in the experiment, with input variable lists.

Table 5. Prediction results of proposed variations for Cheolwon, Icheon and Milyang regions.

Table 6. Average model prediction accuracies for 17 most popular cultivars.