Spatial Layout of Multi-Environment Test Sites: A Case Study of Maize in Jilin Province

Variety regional tests based on multiple environments play a critical role in understanding the high yield and adaptability of new crop varieties. However, the current approach depends mainly on the experience of breeding experts and is difficult to promote because of the inconsistency between test results and actual conditions. We propose a spatial layout method based on the existing systematic regional test network. First, spatial clustering was used to cluster the planting environment. Then, we used spatial stratified sampling to determine the minimum number of test sites in each type of environment. Finally, taking into account factors such as the convenience of transportation and the planting area, we used spatially balanced sampling to generate the layout of multi-environment test sites. We present a case study for maize in Jilin Province and show the utility of the method, with an accuracy of about 94.5%. The experimental results showed that 66.7% of sites are located in the same county and that the unbalanced layout of the original sites is improved. Furthermore, we conclude that the set of operational technical ideas presented in this paper for laying out multi-environment test sites for crop varieties can be applied to future research.


Introduction
Variety regional testing, as a key to assessing the performance and market prospects of new crop varieties, has an irreplaceable role in breeding [1]. Since 2000, the United States has constructed a regional test network of hundreds of test sites that represents almost all types of planting environments [2]. China has also built a systematic regional test network [3][4][5]. To accurately assess each variety within 2-3 years, every test site must be highly representative of the planting environments, which cover several elements such as weather, soil, terrain, and biological factors, collectively called multi-environments. However, regional test results are still inconsistent with actual crop performance. An important reason is that neither the number nor the locations of test sites adequately represent the multi-environments.
Clustering, a fundamental method of regionalization based on multi-environments, has been applied to maize planting environments for different purposes. To deal with sparse data from observation stations, such as meteorological data, previous studies fall into two major categories: (1) studies that cluster the environments of the sites themselves, for instance, by ecological and climatic factors limiting maize production, including drought [6][7][8], sunshine [9], insufficient accumulated temperature [10], high-temperature heat damage [11], and terrain conditions [12]; and (2) studies that build an index system for clustering by converting observation station data, mostly organized by calendar time, into regular grid data through spatial interpolation [13][14][15][16][17][18]. In fact, however, the environmental conditions during each phenophase have a greater impact on crop varieties than conditions defined by calendar time.
The accuracy of predictions for new varieties increases as data accumulate from the same test sites over the years. Therefore, selecting suitable test sites is a prerequisite for successful testing. The factors affecting the selection of test sites are not limited to multi-environments [19,20], but also include flat terrain, uniform fertility, good irrigation and drainage conditions, and convenient transportation [21]. Currently, the existing system is mainly concerned with the representativeness, stability, and yield of every site. Results from studies that focus mainly on the representativeness, stability, and area discrimination of every site and rarely consider the planting area [22] are difficult to apply.
Planting environmental factors, as geographical objects, usually have a certain spatial correlation. Traditional sampling methods, which assume independent samples, therefore do not apply well to planting environments. Spatial sampling algorithms, although very popular [23][24][25], are seldom used in test site layout. In some previous practical applications [26][27][28][29], sites were selected based on expert knowledge. However, as theoretical research on spatial sampling and spatial autocorrelation algorithms has deepened, researchers have begun to use spatial sampling models to optimize the sampling results, to ensure accuracy and to avoid the bias caused by subjective judgment.
In this paper, we propose a three-stage spatial layout method: (1) based on meteorological data, soil nutrient data, and topographical data during the phenophase period, we cluster the planting environment with a spatial clustering method; (2) we use spatial stratified sampling to determine the minimum number of test sites in each type of environment; and (3) taking into account factors such as the convenience of transportation and the planting area, we construct the layout of multi-environment test sites with a spatially balanced sampling method. We take maize in Jilin Province, one of the main maize producing areas, as a case study. The experimental results were compared with existing sites to verify the applicability of our method.

Index System Construction
The selection of indicators is the fundamental step in environmental recognition. In this paper, the indicators serve as the theoretical basis for the layout of maize variety regional test sites, which are used to select new varieties that are not only suitable for planting in different environments but also stable and high yielding. The first principle of indicator selection is to choose ecological factors that have a great impact on maize production but are independent of each other. The limiting factors of maize production mainly include cultivation and planting techniques, ecological and climatic factors, seed variety, soil, and biological stress [30]. Here, we use the following four ecological and climatic factors to build the indicator system. Let n denote the number of days in the growth period and i the index of a day within it.
1. Accumulated temperature (AT), which refers to the accumulation of the daily average temperature $t_i$ from sowing to maturity, is formulated as:
$$AT = \sum_{i=1}^{n} t_i \tag{1}$$
2. Accumulated precipitation (AP), which refers to the accumulation of precipitation during the whole growth period, is formulated as:
$$AP = \sum_{i=1}^{n} P_i \tag{2}$$
where $P_i$ refers to the precipitation on day $i$ (unit: mm).
3. Cumulative sunshine hours (CSH), which refers to the accumulation of sunshine hours during the whole growth period, is formulated as:
$$CSH = \sum_{i=1}^{n} H_i \tag{3}$$
where $H_i$ refers to the sunshine hours on day $i$.
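The three indicators are plain sums of daily values over the growth period, which can be sketched as follows. This is a minimal illustration, not the authors' processing code: the `DailyRecord` type, its field names, and the sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class DailyRecord:
    """One day of the growth period (hypothetical container for illustration)."""
    mean_temp_c: float   # daily average temperature (deg C)
    precip_mm: float     # daily precipitation (mm)
    sunshine_h: float    # daily sunshine hours


def growth_period_indicators(days: Sequence[DailyRecord]) -> dict[str, float]:
    """Sum each daily series over the n days from sowing to maturity."""
    return {
        "AT": sum(d.mean_temp_c for d in days),
        "AP": sum(d.precip_mm for d in days),
        "CSH": sum(d.sunshine_h for d in days),
    }


# Three made-up days, for illustration only.
sample_days = [
    DailyRecord(18.0, 2.5, 7.0),
    DailyRecord(20.5, 0.0, 9.5),
    DailyRecord(19.5, 4.0, 3.5),
]
indicators = growth_period_indicators(sample_days)
```

In practice `days` would hold one record per day between the observed sowing and maturity dates of the phenological data, so that the sums follow the phenophase rather than calendar time.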

Data Pre-Processing
In this study, meteorological data were provided by the National Meteorological Center of China. The phenological data of spring maize and summer maize were obtained from the China crop growth and farmland soil moisture data provided by the National Meteorological Information Center of China and the China Ministry of Agriculture. In addition, we collected survey data from agro-seed industry companies. The national geography data used in this study were 1:4 million scale provincial and county administrative division vector data. The slope data came from the SRTM (Shuttle Radar Topography Mission) 90 m resolution DEM data [31]. The distribution of test sites and weather stations in Jilin Province [32] is shown in Figure 1. We used the global Moran's I index and Z-scores to measure the spatial autocorrelation of each indicator. Positive and negative Moran's I values represent positive and negative spatial correlation of the indicator, respectively. If the Z-score > 1.96, the spatial object is clustered. If the Z-score < −1.96, the spatial object is dispersed [33]. We used the normal QQ plot to check whether the data are normally distributed, and normalized the data using Equation (4).
$$Z = \frac{x - \min(X)}{\max(X) - \min(X)} \tag{4}$$
where $Z$ denotes the normalized value of each pixel $x$, and $\min(X)$ and $\max(X)$ denote the minimum and maximum values of all pixels $X$ before normalization, respectively.
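The two computations described above, min-max normalization and the global Moran's I, can be sketched in plain Python. This is an illustrative sketch under stated assumptions: it uses the standard definition I = (n / W) · Σ_ij w_ij (x_i − x̄)(x_j − x̄) / Σ_i (x_i − x̄)², a hand-written binary neighbor matrix, and toy values. The Z-score test is omitted, and the function names are hypothetical.

```python
from typing import Sequence


def min_max_normalize(values: Sequence[float]) -> list[float]:
    """Rescale values to [0, 1] using the minimum and maximum of the input."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; min-max scaling is undefined")
    return [(v - lo) / (hi - lo) for v in values]


def global_morans_i(values: Sequence[float],
                    weights: Sequence[Sequence[float]]) -> float:
    """Global Moran's I for values with a spatial weight matrix weights[i][j]."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    total_weight = sum(sum(row) for row in weights)
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    return (n / total_weight) * cross / sum(d * d for d in dev)


# Four cells in a row; each cell neighbors the cells directly beside it.
toy_values = [1.0, 2.0, 3.0, 4.0]
toy_weights = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
normalized = min_max_normalize(toy_values)
morans_i = global_morans_i(toy_values, toy_weights)
```

Because the toy values increase steadily along the row, neighboring cells are similar and the index comes out positive, which is the situation the text calls positive spatial correlation.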

Spatial Clustering
We propose an integrated clustering algorithm for spatial attributes based on the ISODATA method [34,35], implemented in three phases. (1) Cluster pedigree maps, together with the $R^2$ and semi-$R^2$ statistics, are used to evaluate the clustering results and determine the number of clusters. (2) The ISODATA clustering algorithm is applied to the planting environment. (3) Spatial continuity adjustments are made according to spatial adjustment rules. The resulting partitions have large differences between classes, small differences within classes, and spatial continuity.
Four criteria for determining the number of clusters based on pedigree maps are as follows: (1) The distances between the centers of gravity of the classes should be as large as possible. (2) The number of classes must meet practical purposes. (3) Practical purposes lead to the number of clusters. (4) The results obtained by different clustering methods should contain the same classes.
Assume that a sample of size $n$ is divided into $k$ classes, $C_1, C_2, \ldots, C_k$, and let $n_t$ denote the number of samples in class $C_t$. Let $\bar{X}_t$ and $X_{ti}$ denote the center of gravity and the $i$-th ($i = 1, \ldots, n_t$) sample of $C_t$, and let $\bar{X}$ denote the center of gravity of all $n$ samples. $R_k^2$ is defined as follows:
$$R_k^2 = 1 - \frac{\sum_{t=1}^{k} \sum_{i=1}^{n_t} \lVert X_{ti} - \bar{X}_t \rVert^2}{\sum_{t=1}^{k} \sum_{i=1}^{n_t} \lVert X_{ti} - \bar{X} \rVert^2} \tag{5}$$
We use $R_k^2$ to evaluate the performance of the clustering with $k$ clusters. The larger $R_k^2$ is, the better the clustering with $k$ clusters performs. We also use the semi-$R^2$ in this paper:
$$\text{semi-}R_k^2 = R_{k+1}^2 - R_k^2 \tag{6}$$
A larger semi-$R_k^2$ means that $k + 1$ clusters perform better than $k$ clusters.
Two spatial adjustment rules for the raster data are defined as follows. Scenario 1: cells of other clusters are scattered sporadically within a certain cluster. We use an area threshold to determine whether the sporadic areas are retained. Scenario 2: an area neighbors multiple clusters. Here, we calculate the difference between this area and each neighboring cluster and merge the area with the nearest one. The difference value $D$ is defined as:
$$D = \sqrt{\sum_{i=1}^{n} \left( \bar{X}_i - \bar{Y}_i \right)^2} \tag{7}$$
where $\bar{X}_i$ denotes the mean of the $i$-th attribute of this area, $\bar{Y}_i$ denotes the mean of the $i$-th attribute of one of the neighboring clusters, and $n$ here is the number of attributes of one class.
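The two evaluation statistics can be illustrated with a small sketch. It assumes the usual definitions from cluster analysis: $R_k^2$ is one minus the ratio of the within-class sum of squares to the total sum of squares, and semi-$R_k^2 = R_{k+1}^2 - R_k^2$. The helper names and the one-dimensional toy data are hypothetical.

```python
from typing import Sequence

Point = Sequence[float]


def centroid(points: Sequence[Point]) -> list[float]:
    """Center of gravity (per-attribute mean) of a set of points."""
    count = len(points)
    return [sum(p[d] for p in points) / count for d in range(len(points[0]))]


def sum_of_squares(points: Sequence[Point], center: Point) -> float:
    """Sum of squared Euclidean distances from each point to the center."""
    return sum((x - c) ** 2 for p in points for x, c in zip(p, center))


def r_squared(clusters: Sequence[Sequence[Point]]) -> float:
    """R^2 of a partition: 1 - within-class SS / total SS."""
    everything = [p for cluster in clusters for p in cluster]
    total = sum_of_squares(everything, centroid(everything))
    within = sum(sum_of_squares(c, centroid(c)) for c in clusters)
    return 1.0 - within / total


# Toy data: three well-separated groups on a line.
three_classes = [[(0.0,), (1.0,)], [(10.0,), (11.0,)], [(20.0,), (21.0,)]]
# The same data after merging the first two groups (k = 2).
two_classes = [three_classes[0] + three_classes[1], three_classes[2]]

r2_k3 = r_squared(three_classes)
r2_k2 = r_squared(two_classes)
semi_r2_k2 = r2_k3 - r2_k2  # loss of R^2 when going from 3 classes to 2
```

Merging two genuinely distinct groups causes a large drop in $R^2$, so a large semi-$R^2$ at $k$ indicates that the partition with $k + 1$ classes should be kept.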

Sample Strategy
We used the spatial sampling model proposed by Zhao et al. [32]. In this study, the relationship between the number of test sites $x$ and the sampling accuracy is defined as:
The number of samples in each layer is calculated with the following formula:
$$n_h = n \cdot \frac{W_h S_h / \sqrt{C_h}}{\sum_{h} W_h S_h / \sqrt{C_h}} \tag{9}$$
where $N$ is the total number of grid samples, $n$ is the number of test sites, $h$ is the planting environment class, $N_h$ is the number of samples in the $h$-th type of planting environment, $W_h = N_h / N$ is the weight of the $h$-th type of planting environment, $S_h$ is the true standard deviation of the $h$-th type of planting environment, and $C_h$ is the cost of investigating a single sample of that planting environment.
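The allocation of sites to layers can be sketched as below. The sketch assumes the classical cost-weighted optimum allocation from stratified sampling theory, in which $n_h$ is proportional to $W_h S_h / \sqrt{C_h}$. The rounding step (largest remainder) is an added assumption so that the integer counts sum to $n$, and the stratum sizes and standard deviations are invented for illustration.

```python
import math
from typing import Sequence


def optimal_allocation(n: int,
                       stratum_sizes: Sequence[int],
                       stratum_sds: Sequence[float],
                       unit_costs: Sequence[float]) -> list[int]:
    """Split n sites across strata in proportion to W_h * S_h / sqrt(C_h)."""
    total_size = sum(stratum_sizes)
    scores = [
        (size / total_size) * sd / math.sqrt(cost)
        for size, sd, cost in zip(stratum_sizes, stratum_sds, unit_costs)
    ]
    score_sum = sum(scores)
    exact = [n * s / score_sum for s in scores]

    # Round down, then hand the leftover sites to the largest remainders
    # (assumed rounding rule, not taken from the paper).
    counts = [math.floor(e) for e in exact]
    leftover = n - sum(counts)
    by_remainder = sorted(range(len(exact)),
                          key=lambda h: exact[h] - counts[h],
                          reverse=True)
    for h in by_remainder[:leftover]:
        counts[h] += 1
    return counts


# Invented strata: a small variable class and two larger, more uniform ones.
allocation = optimal_allocation(
    n=10,
    stratum_sizes=[100, 300, 600],
    stratum_sds=[4.0, 2.0, 1.0],
    unit_costs=[1.0, 1.0, 1.0],
)
```

With equal unit costs the rule reduces to allocating in proportion to $W_h S_h$, so a small but highly variable class can receive nearly as many sites as a large uniform one.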

Data Processing
Moran's I is positive for all three indicators, which means that all of them have positive spatial autocorrelation. The Z-scores are all greater than 1.96, indicating that the spatial distributions of the factors are clustered and that the reliability is high (Table 1). Hence, we use statistical spatial interpolation for all three indicators. Figure 2 shows that all three factors are close to the normal distribution, and their mathematical expectations are unknown. Thus, we used the Ordinary Kriging method to interpolate AP and CSH. As the temperature decreases by about 0.6 °C for each 100 m increase in elevation, we used the Cokriging method to take elevation into account for AT. The normal transformation parameters were not set, and the grid resolution is 5000 m (Figure 3).

Multi-Environments Clustering
We divided the planting environments into 2-9 categories and calculated the $R^2$ to evaluate the effect of clustering. The $R^2$ and semi-$R^2$ for different numbers of classes are shown in Table 2. We found that the $R^2$ values are all around 0.9, with no significant difference when more than five clusters are used. Furthermore, the semi-$R^2$ is largest at four clusters, which favors one more cluster; thus, we selected five as the number of clusters to divide the planting environments (Figure 4). We then applied the spatial continuity adjustment rules to the planting environment of Jilin Province. Both adjustment methods were used (Figure 5).
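The two adjustment rules can be sketched as follows. This is a hedged illustration: it assumes that the difference value is the Euclidean distance between attribute means and that a patch below the area threshold is treated as sporadic and merged. The cluster labels, threshold, and attribute values are hypothetical.

```python
import math
from typing import Mapping, Sequence


def attribute_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed difference value: Euclidean distance between attribute means."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def is_sporadic(patch_cells: int, min_cells: int) -> bool:
    """Scenario 1: a patch smaller than the area threshold is not retained."""
    return patch_cells < min_cells


def nearest_cluster(patch_means: Sequence[float],
                    neighbor_means: Mapping[str, Sequence[float]]) -> str:
    """Scenario 2: pick the neighboring cluster with the smallest difference."""
    return min(
        neighbor_means,
        key=lambda name: attribute_distance(patch_means, neighbor_means[name]),
    )


# A small patch described by normalized (AT, AP, CSH) means, with two neighbors.
patch = (0.50, 0.40, 0.60)
neighbors = {
    "A": (0.90, 0.80, 0.20),
    "B": (0.45, 0.50, 0.55),
}
merge_target = nearest_cluster(patch, neighbors) if is_sporadic(3, 10) else None
```

Here the patch covers 3 cells against a threshold of 10, so it is merged, and cluster "B" is chosen because its attribute means are much closer to those of the patch.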

Test Sites Layout
The layout of test sites based on multi-environments in Jilin Province mainly considers the following three issues: (1) the minimum number of test sites; (2) the number of sites for each type of planting environment; and (3) the site locations. The test site layout consists of four steps:
Step 1. In Jilin Province, in addition to each site's ability to fully represent the different regional planting environments, the following factors should be considered: the distance to roads and the total planted area. We use the national and provincial road data to calculate the degree of traffic convenience.
Step 2. Using Equation (8), we conclude that at least 25 sites should be deployed to meet the sampling accuracy requirement (error = 0.05). In this study, we assumed that the cost of surveying a single sample is equal for all types. According to Equation (9), we obtained the number of sites for each type: N(HAFCM) = 5, N(CPA) = 5, N(LAFCM) = 4, N(WAP) = 5, and N(EAF) = 6.
Step 3. A probability grid, which was used for site location, was constructed (Figure 7) based on the following three factors: (1) representation of the planting environment (pdist), i.e., the distance from the sample to the cluster center; (2) planting area (area); and (3) road distance (roaddist). Based on expert knowledge, the weights of pdist, area, and roaddist in Jilin Province were set to w(pdist) = 0.1, w(area) = 0.8, and w(roaddist) = 0.1. The calculation formula for the probability raster (prb) is:
$$prb = \big( [pdist] \times w(pdist) + [area] \times w(area) + [roaddist] \times w(roaddist) \big) \times slope \tag{10}$$
Step 4. We used spatially balanced sampling to set up test sites for each type of planting environment (Figure 8). The relative sampling accuracy calculated using Equation (8) is 94.5% when the number of test sites x = 25.
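Steps 3 and 4 can be sketched as follows. The weights are those reported for Jilin Province, but everything else is an assumption made for illustration: each factor is taken to be a score already scaled to [0, 1] with higher values being more favorable, the slope term is treated as a multiplier in [0, 1], and a simple weighted random draw without replacement stands in for spatially balanced sampling, which it does not implement. The `Cell` type and the sample cells are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Sequence

# Weights reported in the text for Jilin Province.
WEIGHTS = {"pdist": 0.1, "area": 0.8, "roaddist": 0.1}


@dataclass(frozen=True)
class Cell:
    """One grid cell of a planting environment class (hypothetical structure)."""
    cell_id: int
    pdist_score: float   # assumed: 1 = at the cluster center, 0 = farthest
    area_score: float    # assumed: normalized planting area
    road_score: float    # assumed: 1 = next to a road, 0 = farthest
    slope_factor: float  # assumed: multiplier in [0, 1], 0 = unsuitable slope


def probability(cell: Cell) -> float:
    """Weighted sum of the three factor scores, scaled by the slope term."""
    weighted = (cell.pdist_score * WEIGHTS["pdist"]
                + cell.area_score * WEIGHTS["area"]
                + cell.road_score * WEIGHTS["roaddist"])
    return weighted * cell.slope_factor


def draw_sites(cells: Sequence[Cell], k: int, seed: int = 0) -> list[int]:
    """Weighted draw of k distinct cells (a stand-in, not balanced sampling)."""
    rng = random.Random(seed)
    pool = [(c.cell_id, probability(c)) for c in cells if probability(c) > 0]
    chosen: list[int] = []
    for _ in range(min(k, len(pool))):
        probs = [p for _, p in pool]
        pick = rng.choices(range(len(pool)), weights=probs, k=1)[0]
        chosen.append(pool.pop(pick)[0])
    return chosen


cells = [
    Cell(1, 0.5, 1.0, 0.0, 1.0),
    Cell(2, 0.9, 0.2, 0.8, 1.0),
    Cell(3, 1.0, 1.0, 1.0, 0.0),  # excluded: slope factor is zero
    Cell(4, 0.3, 0.6, 0.5, 0.5),
]
sites = draw_sites(cells, k=2)
```

A real layout would replace `draw_sites` with a spatially balanced design that also spreads the selected cells over space, and would run it once per planting environment class with the site counts from Step 2.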

Clustering Attribute Statistics
The statistics of the five types of planting environment are shown in Table 3. In HAFCM, the terrain fluctuates greatly, the maize area is small, and the crop is prone to frost damage because of the earlier frost period. In contrast, the AP and CSH of LAFCM are relatively small. EAF, a semi-mountainous valley region, consists mostly of basins and plains, where the arable land is relatively extensive and suitable for agricultural development. The climate of this area is mild; for example, AT is between 2330 °C and 2770 °C, with abundant AP. CPA is the main maize belt in Jilin Province. Most of the cultivated land is concentrated and contiguous, which is suitable for mechanized farming. It has abundant photothermal resources; for instance, AT is between 2600 °C and 2900 °C and AP is between 240 mm and 330 mm.

Planting Environmental Representation
We further clustered each type of planting environment into sub-clusters using the same clustering method and compared the sub-clusters with the test site layout results.
The test sites proposed in this paper cover 19 of 25 sub-clusters (Figure 9). However, the number of test sites is still unbalanced across sub-clusters. Two reasons may cause this situation: (1) the randomness of spatially balanced sampling; and (2) the planting areas in some sub-clusters are too small.

Comparison of the Number of Test Sites
The number of original test sites in Jilin Province is 26 (Table 4). The spatial distribution of the existing test sites is not well balanced. For example, CPA has 13 test sites, about half of the total. The total number of established test sites in HAFCM and LAFCM cannot reflect the actual complexity of these environments, compared with the nine test sites allocated by our method. In addition, EAF and WAP also need two additional sites according to our method.

Conclusions
To tackle the problems of sparse data from multi-environments and the inconsistency between regional tests and actual promotion, we propose a spatial layout method with the following two novel features: (1) It constructs a clustering index system of the planting environment with test site layout as the application purpose. (2) It derives an appropriate spatial distribution of test sites in the different clusters by integrating the complexity of each planting environment type. The experiment was run in Jilin Province to simulate the layout of maize variety test sites. The results show that the proposed method not only meets the requirements for the number and spatial distribution of test sites, but also provides a set of operational technical ideas for the layout of multi-environment test sites for crop varieties.