Advanced Data Analysis as a Tool for Net Blotch Density Estimation in Spring Barley
Agriculture 2020, 10, 179

Abstract: A novel data analysis method for the evaluation of plant disease risk that utilizes weather information is presented in this paper. This research considers two different datasets: open weather data from the Finnish Meteorological Institute and long-term (1991–2017) plant disease severity observations in different hardiness zones in Finland. Historical net blotch severity data on spring barley were collected from official variety trials carried out by the Natural Resources Institute Finland (Luke), and the analysis was performed with existing data without additional measurements. Feature generation was used to combine different datasets and to enrich the information content of the data. The t-test was applied to validate features and select the most suitable one for the identification of datasets with high net blotch risk. Based on the analysis, the selected daily measured variables for the estimation of net blotch density were the average temperature, minimum temperature, and rainfall. The results strongly indicate that thorough data analysis and feature generation methods enable new tools for plant disease prediction. This is crucial when predicting the disease risk and optimizing the use of pesticides in modern agriculture. Here, the developed system resolves the correlation between weather measurements and net blotch observations in a novel way.


Introduction
Barley, Hordeum vulgare L., is a cereal plant of the grass family Poaceae. It is the fourth largest grain crop, and it was grown globally on 47 million hectares in 2016 [1]. Barley is primarily grown as animal fodder and as a source of malt for alcoholic beverages, but it is also commonly used in food products (e.g., breads, soups, and stews) and health products. However, barley production is challenged by several biotic and abiotic pressure factors. Plant diseases caused by microbes can decrease the annual average yield of the barley crop by up to 20% [2]. One of the most widely distributed fungal diseases in barley is net blotch, which is caused by the ascomycete Pyrenophora teres Drechsler. In Finland, net blotch was present in 86% of barley fields investigated in 2009 [3]. The pathogen overwinters on barley debris or seed. During the growing season, it reproduces asexually on barley leaves. The symptoms start as small brown lesions, which elongate and produce dark brown streaks across the leaf blades, creating a net-like pattern surrounded by a yellow margin. Environmental conditions play a significant role in disease development. The leaf wetness period that is required for conidium germination depends on the temperature. Van den Berg and Rossnagel [4] showed that the minimum leaf wetness period required for P. teres infection was halved as the temperature was doubled in degrees Celsius. Martin and Clough [5] reported that the spore release of P. teres correlated positively with temperature but negatively with relative humidity. [...] even in the early stage of the growing season using existing measurements. This information forms the basis for predicting net blotch occurrence that can be used in deciding on the use of pesticides. To avoid complex model structures, multiple models, and the costs of extra measurement arrangements, this research aims solely at combining existing data from different sources, public and private.
The resulting methodology is available for future routine analysis without any specific tests. In addition to data analysis, the usability of open weather data is demonstrated and discussed.

Materials and Methods
This study combines information from two different datasets: weather measurements and the prevalence of net blotch at the observation fields. During the research, no extra measurements were arranged; instead, the available data were mathematically combined for a new purpose. The principle of this study is presented in Figure 1.
Net blotch data had been collected and pre-processed by the Natural Resources Institute Finland (Luke) during the years 1991–2017. The numerical data used in this research are stored in an Oracle database. Measurements included information about the observation year, field location (municipality), cultivated barley genotype, and the disease severity of net blotch. The test fields were located in Central and Southern Finland. In this research, the net blotch observation data from hardiness zones I-IV were utilized. The approximate locations of the observation fields can be seen in Figure 2. The data analysis, feature generation, and evaluation steps were exactly the same in each of the four cases.
Field experiments had been conducted in 1991–2017 in different locations in Finland, representing areas where spring barley is typically grown. Experiments were included as part of the Official Variety Trials and they all followed the standard procedures specified for that purpose [25]. These were managed by Luke Finland at its numerous regional research units and by plant breeding companies and private agricultural research stations. All experiments were arranged as randomized complete block designs or incomplete block designs. The number of replicates varied from three to four depending on the location and year. In each year, the set of cultivars and breeding lines changed, but only partly; long-term check cultivars were also used. A typical trial included 30 cultivars. Long-term check cultivars ensure that, in any well-defined linear model analysis, the effects of cultivars and environments can always be estimated [26].
The plots were 7-10 m × 1.25 m, depending on the location and year. The seeding rate was 450-550 viable seeds per square metre, conforming to the commonly used seeding rates in Finland. Fertilizer use depended on cropping history, soil type, and fertility. Weeds and pests were chemically controlled with the active ingredients largely used in commercial farming. However, diseases were not controlled with fungicides.
The disease pressure, a risk index depending on environmental factors and the genotype of the cereal, is quantified by Luke Finland by means of Equation (1) and the following steps. The effects of the environment and genotype were separated by the following statistical model, based on the structure of the data collection:

$$y_{ijk} = \mu + b_{l(jk)} + g_i + e_{jk} + ge_{ijk} + \varepsilon_{ijkl} \qquad (1)$$

where $y_{ijk}$ is the observed value for the $i$th cultivar in the $j$th year and the $k$th experimental site.
In addition, all experiments have 3 or 4 replications, and the replication is a nested factor: replication $l$ is nested in the environmental effect of the $j$th year and $k$th experimental site. Parameter $\mu$ is the intercept, $b_{l(jk)}$ is the random effect of the $l$th replication, $g_i$ is the effect of the $i$th genotype, $e_{jk}$ is the effect of the environment, $ge_{ijk}$ is the error term for the environmental effect, and $\varepsilon_{ijkl}$ is the residual. For the incomplete block design, the effect of the block was divided into two parts: variance between incomplete and complete blocks. In this research, the estimated values of the environment, $\hat{e}_{jk}$, are mutually comparable estimates, i.e., although the set of genotypes (cultivars) varied between trials and disease resistance varies between genotypes, trials can be put into order according to the disease pressure. This is important because modern genotypes have a higher disease resistance than older genotypes. The estimated values (per year and location) were scaled into three categories: 0 (maximum value 0.5%), 1 (0.6-5%), and 2 (over 5.1%). One example of the scale for appraising plant disease severity in cereals is presented in Saari and Prescott [27].
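The scaling of the estimated values into categories 0, 1, and 2 can be illustrated with a minimal Python sketch. It is not the procedure used by Luke; the function name `net_blotch_category` is hypothetical, and the handling of values that fall between the stated ranges (between 0.5% and 0.6%, and between 5% and 5.1%) is an assumption made here only so that every value receives a category.

```python
def net_blotch_category(severity_percent: float) -> int:
    """Map an estimated disease severity (in percent) to category 0, 1, or 2.

    Category limits follow the text: 0 (at most 0.5%), 1 (0.6-5%), 2 (above 5%).
    Assumption: values in the small gaps between the stated ranges are
    assigned to the next higher category.
    """
    if severity_percent < 0:
        raise ValueError("severity must be non-negative")
    if severity_percent <= 0.5:
        return 0
    if severity_percent <= 5.0:
        return 1
    return 2


# Illustrative values only, not data from the study.
for value in (0.3, 2.0, 7.5):
    print(value, "->", net_blotch_category(value))
```
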
The weather data were obtained from the open database of the Finnish Meteorological Institute (FMI). More information about FMI open weather data is available in the report by Honkola et al. [28]. In every presented case, the distance between the local weather station and the observation field was the same throughout the observation years. The information content of the weather data was compared over the whole period under review. The loaded data were in the .xlsx format and usable in MATLAB®. The variables analyzed in this study were the daily rainfall, $R$ [mm], and the daily average, maximum, and minimum temperatures, $T_{av}$, $T_{max}$, and $T_{min}$.
The FMI data included some missing information and required further pre-processing. First, the FMI data were arranged into datasets according to the year of observation. The datasets that included consecutive missing observations were discarded at this stage. The FMI datasets were then grouped according to the observation place and the net blotch category (0-2). Hereafter, the datasets in category 0 are referred to as the reference data, and the datasets from categories 1 or 2 were compared to them. Four years of independent weather observations from each hardiness zone and each net blotch category were utilized, except for hardiness zone IV and category 0, where measurements from three years were available. Brief information about the utilized data is presented in Table 1. It is important to notice that the different datasets were later indexed both spatially and temporally. The particular years and weather stations related to the data used are presented in Appendix A. The net blotch observation data included one value per year, while the weather data consisted of daily observations. The number of weather variables was four in each analysis, and the number of tested feature candidates was 1760.
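The rule of discarding yearly datasets with consecutive missing observations could be sketched as follows. This is an illustration in Python rather than the MATLAB code used in the study; the helper names and the encoding of a missing observation as `None` are assumptions.

```python
from typing import Dict, List, Optional, Sequence


def has_consecutive_missing(values: Sequence[Optional[float]]) -> bool:
    """Return True if at least two neighbouring daily observations are missing."""
    return any(a is None and b is None for a, b in zip(values, values[1:]))


def keep_usable(datasets: Dict[int, List[Optional[float]]]) -> Dict[int, List[Optional[float]]]:
    """Keep only the yearly datasets without consecutive missing observations."""
    return {year: obs for year, obs in datasets.items()
            if not has_consecutive_missing(obs)}


# Illustrative values only: daily rainfall [mm] keyed by observation year.
rainfall = {
    1995: [0.0, 1.2, None, 0.4, 0.0],    # one isolated gap: kept
    1996: [0.0, None, None, 2.1, 0.3],   # two gaps in a row: discarded
}
print(sorted(keep_usable(rainfall)))
```

How isolated single gaps were treated is not stated in the text, so the sketch simply leaves them in place.
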
Because of the different weather conditions, the beginning of the growing season and the sowing date varied according to the year and the observation field. This must be taken into account when deciding the starting point of the analyzed period ($t_0$). Two variants were compared in this study. The first one defined the starting point as the beginning of the growing season, i.e., the time when the mean temperature remained above +5 °C for five consecutive days. In the second variant, the sowing date was used as the starting point. The data before this starting point and after the growing season were omitted. The analyzed period was 14 days from the starting point.
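The first variant of the starting point can be sketched in Python as below. The function names are hypothetical, and one detail is an assumption: the text does not say whether the starting point is the first or the last day of the five-day run, so the first day is used here.

```python
from typing import List, Optional, Sequence


def growing_season_start(t_av: Sequence[float],
                         threshold: float = 5.0,
                         run: int = 5) -> Optional[int]:
    """Index of the first day of the first run of `run` consecutive days
    whose mean temperature is above `threshold` (degrees Celsius).

    Returns None if no such run exists in the series.
    """
    streak = 0
    for day, temp in enumerate(t_av):
        streak = streak + 1 if temp > threshold else 0
        if streak == run:
            return day - run + 1
    return None


def analysis_window(series: Sequence[float], start: int, length: int = 14) -> List[float]:
    """Return the `length`-day window beginning at index `start`."""
    window = list(series[start:start + length])
    if len(window) < length:
        raise ValueError("series too short for the requested window")
    return window


# Illustrative daily mean temperatures, not data from the study.
temps = [3.0, 6.0, 6.0, 4.0, 6.0, 7.0, 8.0, 6.0, 9.0, 10.0]
print(growing_season_start(temps))
```

For the second variant, the index of the sowing date would be passed to `analysis_window` directly instead of the detected start.
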
All of the data analysis and result evaluation were performed in the MATLAB® programming environment. First, the statistical values of the weather measurements were analyzed to find out whether the reference data differed from the datasets in category 1 or 2. For the datasets starting from the beginning of the growing season, the mean value of daily rainfall, $R$ [mm], increased with the net blotch category. In most cases with the datasets starting from the sowing date, $R$ also increased with the net blotch category, but in the case of hardiness zone III, categories 0 and 2 had the same mean $R$ value. The statistical characteristics of the variables are presented in Table 2.
The feature generation was performed because it was not possible to classify the datasets into different net blotch categories with the initial calculated statistical values. This means that new computational variables were generated from the original data by mathematical operations and the features with the highest information content were selected by using the t-test. More information about the feature generation methods is published, for example, in [29][30][31][32][33].
The feature generation method used in this study is presented by Ruusunen [34] (p. 50). The method composes new variables from the original ones ($R$, $T_{av}$, $T_{max}$, and $T_{min}$) with different mathematical operations, such as addition, subtraction, multiplication, division, involution (raising to a power), logarithm, square root, and combinations of them. A list of the possible feature prototypes generated in this way is presented in detail in [34] (Appendix A). All of those candidates were tested in this study, and the feature validation was performed with the t-test (Equation (2)). The t-test was carried out in the MATLAB® environment with the function ttest2 at the 70% confidence level. Two-sample t-tests were selected with the assumption that the data vectors were from independent random samples with unknown variance. The selected features were then the candidates with which categories 1 or 2 could be separated with most certainty from the reference datasets (category 0). The data analysis procedure is presented in Figure 3.
The two-sample t-test was applied to evaluate the analysis results. In this case, where two data samples are assumed to be from populations with unequal variances, the test statistic

$$t = \frac{\bar{x} - \bar{y}}{\sqrt{\dfrac{s_x^2}{n} + \dfrac{s_y^2}{m}}} \qquad (2)$$

where $\bar{x}$ and $\bar{y}$ are the sample means, $s_x$ and $s_y$ are the sample standard deviations, and $n$ and $m$ are the sample sizes, has under the null hypothesis an approximate Student's t distribution with a number of degrees of freedom given by Satterthwaite's approximation [35]. This arrangement can also be called Welch's t-test.
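The combination of feature generation and Welch's t-test can be illustrated for a single day of the observation window. The sketch below is written in standard-library Python, not in MATLAB as in the study. The four feature prototypes are arbitrary examples rather than the candidate list of [34], and the names `FEATURES`, `welch_t`, and `day_statistics` are hypothetical. The sketch stops at the test statistic and the Satterthwaite degrees of freedom, because the final decision requires the Student's t distribution, which the Python standard library does not provide.

```python
import math
from statistics import mean, variance
from typing import Callable, Dict, Sequence, Tuple

# One row per year for a single day: (a, b, c, d) = (R, T_av, T_max, T_min).
Row = Tuple[float, float, float, float]

# A few illustrative prototypes only; the tested candidate list is in [34].
FEATURES: Dict[str, Callable[[float, float, float, float], float]] = {
    "a + b": lambda a, b, c, d: a + b,
    "c - d": lambda a, b, c, d: c - d,
    "b / d + a": lambda a, b, c, d: b / d + a,            # assumes d (T_min) != 0
    "sqrt(a) * b": lambda a, b, c, d: math.sqrt(a) * b,   # assumes a (rainfall) >= 0
}


def welch_t(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Welch's t statistic and Satterthwaite degrees of freedom."""
    n, m = len(x), len(y)
    if n < 2 or m < 2:
        raise ValueError("each sample needs at least two values")
    vx, vy = variance(x) / n, variance(y) / m   # s_x^2 / n and s_y^2 / m
    if vx + vy == 0:
        raise ValueError("both samples are constant; t is undefined")
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (n - 1) + vy ** 2 / (m - 1))
    return t, df


def day_statistics(ref_rows: Sequence[Row],
                   cat_rows: Sequence[Row]) -> Dict[str, Tuple[float, float]]:
    """Apply every feature prototype to both groups and compare them."""
    results = {}
    for name, func in FEATURES.items():
        ref = [func(*row) for row in ref_rows]
        cat = [func(*row) for row in cat_rows]
        results[name] = welch_t(ref, cat)
    return results


# Illustrative values only: four reference years and four category 2 years.
reference = [(0.0, 8.0, 12.0, 4.0), (1.0, 9.0, 14.0, 5.0),
             (0.5, 7.0, 11.0, 2.0), (2.0, 10.0, 15.0, 5.0)]
category2 = [(3.0, 9.0, 13.0, 6.0), (4.0, 11.0, 16.0, 7.0),
             (2.5, 8.0, 12.0, 5.0), (5.0, 12.0, 17.0, 8.0)]
for name, (t, df) in day_statistics(reference, category2).items():
    print(f"{name:12s} t = {t:7.3f}, df = {df:5.2f}")
```

Repeating `day_statistics` for each of the 14 days and counting the days on which the null hypothesis is rejected would give day sums of the kind used for feature selection in this study.
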

Results and Discussion
The statistical characteristics (mean value, standard deviation, and median) of the weather data are listed in Table 2. The characteristics are indexed by variables, locations, and net blotch categories, and the analysis was performed for two alternatives of the starting point $t_0$: the beginning of the growing season and the sowing time. From the statistical point of view, the weather conditions were quite similar in the selected years. The temperature increased from the beginning of the growing season to the sowing time, which is understandable since the beginning of the growing season was typically two to four weeks earlier than the sowing time.
All of the generated features were tested. The weather data belonging to net blotch categories 1 and 2 were compared with the reference data (category 0) using the t-test and the following hypotheses: under the null hypothesis, the daily feature values of the two compared datasets have equal means; under the alternative hypothesis, the means are unequal. In this case (hardiness zone III), the null hypothesis was rejected nine times and not rejected five times during the 14-day observation period. Thus, according to the feature generation technique and the t-test at the 70% confidence level, the two tested datasets differed statistically on nine days.
The results for the analyzed locations and for categories 0 compared to 1 and 0 compared to 2 are presented in Table 3. As can be seen from Table 3, the separation ability of the most suitable features with the datasets where $t_0$ equals the beginning of the growing season was at least sufficient and in several cases statistically stronger than with the datasets where $t_0$ equals the sowing date. Consequently, the following results are presented only for the datasets where $t_0$ equals the beginning of the growing season. It seems that the information content of the data varies during the growing season, and the optimal starting point for the analyzed time window has to be studied carefully.
Several features were generated from every spatial dataset with which the best separation results were achieved. The features that were the most suitable for separating the reference data and categories 1 and 2 are listed in Table 4. The original variables are marked as a, b, c, and d and are $R$, $T_{av}$, $T_{max}$, and $T_{min}$, respectively. According to the t-test, the daily feature values had unequal means 8-11 times (out of 14) when comparing the reference data and category 1 data, and 9-11 times (out of 14) when comparing the reference data and category 2 data. The separation ability increased or remained the same when comparing categories 0 vs. 1 and 0 vs. 2, except for hardiness zone III.
Table 4. The most suitable features for separating between the reference data (category 0) and category 1 and 2 datasets. The original variables are denoted as a, b, c, and d, namely $R$, $T_{av}$, $T_{max}$, and $T_{min}$.
The cumulative summed feature values (features of Table 4 for each hardiness zone) calculated for hardiness zones I-IV and categories 0 vs. 1 are presented in Figure 5 and those for categories 0 vs. 2 in Figure 6. The idea of testing the cumulative sum was based on the assumption that the growth of net blotch is a dynamic phenomenon. Thus, the effects of rainfall, for example, were assumed to accumulate during the growing period. With the cumulative sum applied to the time series of the listed features, the effectiveness of these features can be demonstrated visually.

The separation ability of the presented features is shown in Figures 5 and 6. The results show that the infected years can be potentially separated from the reference data using the weather measurements and the feature generation technique. Feature selection was based on summing up the number of days in a certain time window when the two datasets differed statistically at the 70% confidence level. Thus, for example, in Table 3 and in hardiness zone I, the day sums of 9 and 11 both indicate full classification capability with the method at the respective time. This way, the numbers in Table 3 are related to the robustness of the features against uncertainties in the measured data.
Figure 5. The cumulative summed feature values (y-axis) generated from weather data of the different hardiness zones. The category 0 data (four years' data in hardiness zones I, II, and III, and three years' data in hardiness zone IV) are marked with a solid line and the category 1 data (four years in each hardiness zone) with a dashed grey line. The observation period is 14 days, $t_0$ is the beginning of the growing season, and the time step is one day (x-axis).
However, the most suitable features (selected by the t-test) depend on the hardiness zone, and the estimation should be extended to a more general form in order to increase the practical usability of the analysis. For that reason, the datasets of hardiness zones I-IV were merged and the analysis steps described earlier were then performed. This new dataset included the weather measurements from 15 reference years, 16 category 1 years, and 16 category 2 years. The cumulative summed feature values for both cases are presented in Figure 7. The feature selected by the analysis, both for the reference data vs. category 1 data and for the reference data vs. category 2 data, is $T_{av}/T_{min} + R$.
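The selected feature and its cumulative sum over the observation window can be sketched in Python as follows. The function names are hypothetical, the expression is read with the usual operator precedence as $(T_{av}/T_{min}) + R$, and the handling of a day with $T_{min} = 0$ is an assumption, since the text does not say how such days were treated.

```python
from itertools import accumulate
from typing import Iterable, List, Tuple


def merged_feature(r: float, t_av: float, t_min: float) -> float:
    """Daily value of the feature T_av / T_min + R."""
    if t_min == 0:
        # Assumption: such days are rejected here rather than silently skipped.
        raise ZeroDivisionError("feature is undefined when T_min is zero")
    return t_av / t_min + r


def cumulative_feature(days: Iterable[Tuple[float, float, float]]) -> List[float]:
    """Cumulative sum of the daily feature values over (R, T_av, T_min) tuples."""
    return list(accumulate(merged_feature(r, t_av, t_min)
                           for r, t_av, t_min in days))


# Illustrative values only: (R [mm], T_av, T_min) for three days.
window = [(0.0, 6.0, 2.0), (1.0, 8.0, 4.0), (0.5, 9.0, 3.0)]
print(cumulative_feature(window))  # [3.0, 6.0, 9.5]
```

Plotting one such curve per year, with reference years and category 1 or 2 years in different line styles, gives a visual comparison of the kind described for Figures 5 to 8.
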
Figure 7. The cumulative summed feature values (y-axis) for the cases reference data vs. category 1 data (above) and reference data vs. category 2 data (below). The weather data used included hardiness zones I-IV.
The classification task was repeated with a new independent dataset, applying the same feature as above. Seven years of category 2 data were analyzed as described and compared to the original reference data (category 0). The classification results with the new dataset are presented in Figure 8. The lower graphs in Figure 7 and the graphs in Figure 8, in particular, show that there is a difference between the cumulative summed feature values of the reference data and the category 2 data. Nevertheless, the classification accuracy needs to be improved, and therefore the generalization potential of the method needs further study.

Conclusions
Thorough statistical analyses of weather measurements and net blotch observations were performed, and the results are presented in this article. This research confirms that weather conditions have a significant effect on net blotch density. Using advanced data analysis, the information content of the existing weather measurements was enriched, and extra measurement campaigns were unnecessary. The feature generation and validation results show that the most suitable features were combinations of the original measurements, which supports the assumption that the influence of the weather on the infection of plants is a complex phenomenon.
The analysis was performed with data from four different hardiness zones, each zone separately, and also jointly as one set of data to test the generalization ability of the developed method. Each spatial dataset was also analyzed from the temporal point of view in a time window of 14 days using two datasets: one where the starting point, $t_0$, is the very early stage of the growing season and the other where $t_0$ is the sowing date. According to the analysis, the separation ability of the datasets where $t_0$ equals the beginning of the growing season was at least sufficient and, in several cases, statistically stronger than the separation ability of the datasets where $t_0$ equals the sowing date. However, the information content of the data varies during the growing season, and the optimal choice of $t_0$ still needs thorough research.
The datasets were categorized according to the yearly net blotch density. Category 0 (no net blotch) was used as the reference data, and the datasets from categories 1 and 2 were compared with it. The aim was to develop a method that can identify the increasing risk for barley net blotch and to verify it with existing data. This method is valuable when predicting net blotch occurrence and the possible need for pesticide use. The most suitable features were evaluated by the t-test. Here, the t-test was a sufficient evaluation method; however, the feature evaluation step still needs more research.
The reliable identification of the weather conditions that led to a net blotch infection can be utilized for modeling and eventually for the optimization of pesticide use. The FMI open database includes reliable and usable weather measurements, and the applicability of public data has been demonstrated in this paper.
This study demonstrates the effectiveness of data analysis and offers a new perspective for net blotch estimation. Accurate plant disease prediction is a valuable tool for optimizing pesticide use and minimizing its harmful effects on the environment. To achieve a reliable model for net blotch forecasting, the combined data from the four hardiness zones need more research. The estimation accuracy and the generalization of the presented method need to be tested with new datasets. In addition, new measurements such as air humidity should be considered. In conclusion, justified and optimized chemical protection saves money and the environment in the long run and is part of sustainable agriculture.