Using Canopy Measurements to Predict Soybean Seed Yield

Abstract: Predicting soybean [Glycine max (L.) Merr.] seed yield is of interest to crop producers making important agronomic and economic decisions. Evaluating the soybean canopy across a range of common agronomic practices, using canopy measurements, provides a large inference for soybean producers. The individual and synergistic relationships among fractional green canopy cover (FGCC), photosynthetically active radiation (PAR) interception, and normalized difference vegetative index (NDVI) measurements taken throughout the growing season were investigated in 12 environments in North Dakota, USA, to predict soybean seed yield. Canopy measurements were evaluated across early and late planting dates, 407,000 and 457,000 seeds ha⁻¹ seeding rates, 0.5 and 0.8 relative maturities, and 30.5 and 61 cm row spacings. The single best yield predictor was an NDVI measurement at R5 (beginning of seed development) with a coefficient of determination of 0.65, followed by an FGCC measurement at R5 (R² = 0.52). Stepwise and Lasso multiple regression methods were used to select the best prediction models using the canopy measurements, explaining 69% and 67% of the variation in yield, respectively. Including plant density, which can be easily measured by a producer, with an individual canopy measurement did not improve the explanation of yield. Using FGCC to estimate yield across the growing season explained 49% to 56% of yield variation, with a single FGCC measurement at R5 (R² = 0.52) being the most efficient and practical method for a soybean producer to estimate yield.


Introduction
Soybean is a major crop in the north-central USA region, with the states of North Dakota, South Dakota, and Minnesota producing about 20% of the total soybean production [1]. Predicting the yield of crops such as soybean provides crucial information to producers, consultants, and economists for improving crop management decisions and subsequent profit. Early crop yield estimation using non-destructive measures can provide additional benefits to scientists and breeding programs by furthering the identification of advantageous agronomic practices or high-yielding genotypes. Predicting soybean yield through handheld or remote sensing canopy measurements is of high interest, as the soybean canopy often reflects the progress and development of the crop. However, remote sensing images often suffer from various degradation, noise effects, or variability in image processing [2], and most methods used to quantify or estimate canopy cover are impractical for soybean producers to use or require expensive equipment.
The normalized difference vegetative index (NDVI) is calculated from reflectance measurements in the red and near-infrared spectra [3]. These reflectance measurements have been shown to indicate environmental and nutrient inadequacies [4][5][6]. A flexible Fourier transform model to predict yield using NDVI has been used previously but did not account for NDVI on a pixel basis [7]. Using NDVI to predict yield has provided variable results. Ma et al. [8] found NDVI had R² values between 0.65 and 0.80 for soybean yield during the first year of an experiment and then explained 45 and 70% of the variation in the subsequent year. Similar coefficients of determination of 0.63 and 0.65 were reported in Wisconsin and Indiana, USA, by Mourtzinis et al. [9]. Several studies have shown NDVI explaining high amounts of variation for corn (Zea mays L.), cover crops, rice (Oryza sativa L.), and wheat (Triticum aestivum L. emend. Thell.) [10][11][12][13][14].
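As a brief illustration of how NDVI is derived from the two reflectance bands, the sketch below uses the standard normalized difference form, (NIR − red) / (NIR + red). The function name and the reflectance values are hypothetical and chosen only for illustration; they are not measurements from this study.

```python
def ndvi(red: float, nir: float) -> float:
    """Normalized difference index from red and near-infrared reflectance."""
    total = nir + red
    if total == 0:
        raise ValueError("red + nir reflectance must be non-zero")
    return (nir - red) / total


# Hypothetical reflectance fractions, for illustration only.
dense_canopy = ndvi(red=0.05, nir=0.50)  # green canopy: low red, high NIR
sparse_canopy = ndvi(red=0.25, nir=0.30)  # mostly soil: similar red and NIR

print(f"dense canopy NDVI:  {dense_canopy:.2f}")
print(f"sparse canopy NDVI: {sparse_canopy:.2f}")
```

A dense green canopy absorbs red light and reflects near-infrared light, so its index approaches one, whereas bare soil gives values near zero.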
Canopy cover is a useful proxy measurement for light interception potential and crop productivity. Maximum photosynthesis is achieved when plants maximize light interception and utilization of photosynthetic radiation [15,16]. Light interception can be quantified with quantum line sensors [17], approximated by fractional green canopy cover (FGCC) from pictures using the Canopeo app, as demonstrated by Patrignani and Ochsner [18], or estimated from the leaf area index (LAI). Light interception measurements using quantum line sensors are considerably more time-consuming than measuring FGCC with the Canopeo app. In addition, precise LAI measurements require plant destruction to avoid overestimating the LAI in dense canopies [19]. The NDVI can also be used to approximate fractional canopy coverage but is not a useful substitute for above-ground biomass measurements [20]. Ma et al. [8] reported that plant density had no effect on the relationship between yield and NDVI during the soybean reproductive growth stages. However, it would be useful to measure NDVI throughout the entire growing season to determine if early-season plant density and NDVI measurements can improve seed yield prediction.
Estimating and predicting crop yields using canopy cover measurements is of interest to producers. Crop growth stage [21], row spacing [22], and canopy structure [23] can affect light interception, FGCC, and NDVI. Measurements such as NDVI and FGCC can predict yields for wheat [21], rice [24], and soybean [8]. Although previous research has evaluated the relationship between established stand, canopy measurements, and seed yield, this research evaluated combining measurements of established stand, light interception, green canopy cover, and NDVI to potentially allow for better yield prediction and provide a useful application in soybean production.
The objective of this experiment was to determine if measurements of canopy development could be used to predict soybean yield and, if so, to determine the most accurate and most practical strategy for yield prediction.

Materials and Methods
Data were collected in the 2019 and 2020 cropping seasons at Casselton, Prosper, and Fargo, North Dakota, USA. Location and soil characteristics can be found in Table 1. Location and year were combined and are termed "environment", for a total of 12 environments. At the Fargo location, there were four experiments each year. Two experiments were on a tile-drained soil and the other two were on a non-tile-drained soil, and each soil drainage type had a 30.5 cm and a 61 cm row spacing experiment. Each experiment was a randomized complete block with a split-plot arrangement and four replicates. The whole plot was planting date, and the sub-plots were a factorial combination of the two cultivars and the two seeding rates. Planting dates were an optimal date in mid-May and a late planting date two weeks thereafter. Seeding rates were 407,500 and 457,000 germinable seeds ha⁻¹, based on previous research results [25,26].
Cultivars used were AG 05X9 (0.5 maturity group) and AG 08X8 (0.8 maturity group) (Asgrow, Bayer Crop Science, Creve Coeur, MO, USA), which are adapted to the region [27]. Two row spacings (30.5 and 61 cm) were evaluated to represent the common range of row spacings used in North Dakota, USA [28]. A range of planting dates, seeding rates, relative maturities, and row spacings was used to create a large inference of common production practices in the North Dakota and northwestern Minnesota, USA, and Manitoba, Canada, region. The planting date, relative maturity, and seeding rate data were combined across all 12 environments. Early planting date treatments were seeded once soil temperatures reached 10 °C in early to mid-May, but not earlier than five days prior to the last historical projected frost date, using a Hege 1000 no-till planter (Hege Company, Waldenberg, Germany), with 30.5 cm row spacing at Casselton and Prosper and 30.5 and 61 cm row spacings at the Fargo location. The second planting date was two to three weeks after the first planting, depending on field conditions. The plot size for the experimental unit was 1.52 m by 5.47 m. Soil test data are provided in Table 2. Fertility was not considered a limiting factor for yield [29], and experiments were kept weed-free using the herbicide glyphosate [N-(phosphonomethyl) glycine] (Bayer Crop Science, St. Louis, MO, USA).
Table 2. Planting date and soil test results for soybean environments in 2019 and 2020.

After planting, established plants were counted at the V2 stage (two trifoliolates [30]) along a 0.91 m length of the middle soybean rows. During the growing season, percent soil cover was recorded with Canopeo (Oklahoma State University, Stillwater, OK, USA). Canopy pictures were taken approximately 1.5 m above the soil surface in the center of each plot using an iPad (Apple, Cupertino, CA, USA). The photos were processed to provide the canopy coverage percentage [18], with MATLAB software (MathWorks, Inc., Natick, MA, USA) used to calculate FGCC.
Canopy photosynthetically active radiation (PAR) interception measurements were collected randomly in the front, middle, and back third of each experimental unit using an Accupar LP-80 (METER Group, Inc., Pullman, WA, USA), with the sensor perpendicular to the plot at a height of 2 cm above the soil surface and the above-canopy PAR measured at 1.5 m. The above- and below-canopy PAR values were averaged for each experimental unit. The PAR interception was calculated by dividing the below-canopy PAR by the above-canopy PAR, subtracting that value from one, and multiplying by 100. The NDVI was recorded at a height of 0.5 m above the canopy using a RapidSCAN CS-45 (Holland Scientific, Lincoln, NE, USA), with NDVI being averaged across the experimental unit. The standard NDVI was calculated using the 670 nm (red light) and 780 nm (near-infrared) wavelengths.
Fractional green canopy coverage, Accupar, and RapidSCAN measurements were recorded when the soybean plants in the early planting were at the V2, V4, R1, R3, R5, and R7 growth stages (two trifoliolates, four trifoliolates, beginning of flowering, beginning of pod formation, beginning of seed development, and pod and leaf yellowing, respectively), for a total of 18 different canopy measurements per experimental unit.
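The PAR interception calculation can be sketched as follows, taking interception as one minus the ratio of below-canopy to above-canopy PAR, expressed as a percentage. The function name and the sensor readings are hypothetical; the three readings per position stand in for the front, middle, and back thirds of an experimental unit.

```python
from statistics import mean


def par_interception(above_readings, below_readings):
    """Percent of incoming PAR intercepted by the canopy.

    Readings are averaged per experimental unit before the ratio is taken.
    """
    above = mean(above_readings)
    below = mean(below_readings)
    if above <= 0:
        raise ValueError("above-canopy PAR must be positive")
    return (1.0 - below / above) * 100.0


# Hypothetical readings (umol m-2 s-1) from the front, middle, and back
# thirds of one plot; not data from this study.
above_canopy = [1500.0, 1520.0, 1480.0]
below_canopy = [210.0, 190.0, 200.0]

pct = par_interception(above_canopy, below_canopy)
print(f"PAR interception: {pct:.1f}%")
```

With this form, a closed canopy that lets little light reach the soil gives values near 100%, and bare soil gives values near 0%.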
The late-planted soybean plants were only slightly behind in growth stage, and data were averaged across planting dates, as soybean growth stages in a production field also vary slightly. Soybean seed was harvested after physiological maturity, once the seed was harvestable, using a Wintersteiger Classic plot combine (Wintersteiger AG, Ried im Innkreis, Austria), and yield was corrected to 13% moisture content.
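The text does not give the formula used for the moisture correction. A minimal sketch of one common approach is shown below: plot seed weight is scaled to a per-hectare basis and then adjusted to 13% moisture by the ratio of dry-matter fractions. The harvested area is assumed to equal the full 1.52 m by 5.47 m plot, and the seed weight and harvest moisture are hypothetical values.

```python
# Plot dimensions come from the text; treating the whole plot as the
# harvested area is an assumption made for this sketch.
PLOT_AREA_M2 = 1.52 * 5.47


def plot_yield_kg_ha(seed_weight_kg, area_m2=PLOT_AREA_M2):
    """Scale plot seed weight (kg) to kg per hectare."""
    if area_m2 <= 0:
        raise ValueError("plot area must be positive")
    return seed_weight_kg / area_m2 * 10_000.0


def to_standard_moisture(yield_kg_ha, moisture_pct, standard_pct=13.0):
    """Adjust yield from harvest moisture to a standard moisture content."""
    if not 0.0 <= moisture_pct < 100.0:
        raise ValueError("moisture_pct must be in [0, 100)")
    return yield_kg_ha * (100.0 - moisture_pct) / (100.0 - standard_pct)


# Hypothetical plot: 3.2 kg of seed harvested at 11% moisture.
raw = plot_yield_kg_ha(3.2)
corrected = to_standard_moisture(raw, moisture_pct=11.0)
print(f"{raw:.0f} kg/ha at harvest moisture, {corrected:.0f} kg/ha at 13%")
```

Seed harvested drier than 13% is adjusted upward and seed harvested wetter is adjusted downward, so plots are compared on the same moisture basis.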
Fractional green canopy cover, PAR interception, and NDVI measurements at each growth stage and the established plant density recorded at all environments were used in multiple linear regression analysis, for a total of 6688 data points. Measurements more than 3 standard deviations from the mean in each environment for each measurement type and stage combination were removed from the data (165 data points). The multiple regression approach to predicting yield using stepwise and Lasso regression was similar to Kumar et al. [31]. The REG procedure in SAS (SAS Institute Inc., Cary, NC, USA) was used to analyze the relationship between each individual measurement and yield. Variance inflation factors (VIF) were reviewed to ensure VIF values were below 5 [32]. The GLMSELECT procedure in SAS was used for stepwise and Lasso multiple linear regression methods. Models were compared using the lowest root mean square error (RMSE), the highest adjusted R² [31], and the lowest Akaike information criterion (AIC) [33]. Stepwise regression, using a p-value entry criterion of 0.15 [34], was used to build a model to best predict soybean seed yield using the 18 canopy measurement variables for 6523 total data points from 768 total experimental units. The "validate" statement was used to randomly select 20% of the data, and adjusted R² was averaged over 50 iterations to validate the model for both regression methods.
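The outlier screening step can be illustrated with a short sketch. The analysis itself was run in SAS; the Python function below is only an illustration of the stated rule (drop values more than 3 standard deviations from the mean within each environment, measurement type, and stage combination). The record layout, field names, and example values are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def drop_outliers(records, n_sd=3.0):
    """Keep records within n_sd sample SDs of their group mean.

    Each record is a dict with 'environment', 'measure', 'stage', 'value'.
    Groups are environment x measurement type x growth stage.
    """
    groups = defaultdict(list)
    for rec in records:
        key = (rec["environment"], rec["measure"], rec["stage"])
        groups[key].append(rec["value"])

    # Mean and SD per group; groups with one value cannot be screened.
    limits = {k: (mean(v), stdev(v)) for k, v in groups.items() if len(v) > 1}

    kept = []
    for rec in records:
        key = (rec["environment"], rec["measure"], rec["stage"])
        if key not in limits:
            kept.append(rec)
            continue
        centre, sd = limits[key]
        if sd == 0 or abs(rec["value"] - centre) <= n_sd * sd:
            kept.append(rec)
    return kept


# Hypothetical FGCC readings (%) for one environment and stage, with one
# implausible value of 5 mixed in.
values = [80 + (i % 5) for i in range(19)] + [5]
records = [
    {"environment": "Fargo-2019", "measure": "FGCC", "stage": "R5", "value": v}
    for v in values
]
cleaned = drop_outliers(records)
print(len(records), "records before,", len(cleaned), "after screening")
```

Because the standard deviation is computed with the suspect value included, this rule can only flag a point in groups that are reasonably large; with very few observations no value can lie more than 3 standard deviations from the mean.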

Results and Discussion
Individual canopy measurements combined across environments explained a moderate amount of the variation in yield, with FGCC and NDVI being better descriptors on average (Table 3). Through the R1 to R3 growth stages, soybean is actively producing more trifoliolates and producing seed, and FGCC was consistently (R² from 0.43 to 0.52) related to yield at these stages. Canopy PAR interception was poorly related to seed yield throughout the season (R² from 0.01 to 0.30). The best time to record FGCC and NDVI is at R5. At R5, the relationship between PAR interception and yield is considerably weaker than at the other reproductive stages. This is likely due to most experimental units having similar PAR interception values regardless of the yield potential of the unit. Narrow row spacing (30.5 cm) typically improves PAR interception capacity and yield potential compared to wide rows for soybean [28]. The R5 NDVI measurement was the best single observation (R² = 0.65) explaining yield differences. The relationship between NDVI and yield was expected to increase from planting until R6. The poor relationship between NDVI and yield (R² = 0.05) at R3 may have been due to experimental units absorbing comparable amounts of visible light at that stage. Ma et al. [8] found that soybean NDVI and seed yield relationships improved from the R2 (full flowering) to R5 stages, and that NDVI measured at R5 can discern between high- and low-yielding genotypes. The NDVI is strongly related to soybean above-ground biomass [35], and soybean seed production potential increases as plant growth increases [36]. Board [37] found that total dry mass at R5, plant height, and length of the seed-filling period were highly correlated with seed yield (R² = 0.86). Christenson et al. [38] found no differences between canopy reflectance and yield estimation across growth stages, although maturity was more accurately predicted during the seed-filling period.
Therefore, the R5 NDVI results, which best describe yield in this study, are similar to those found by Ma et al. [8] and Hoyos-Villegas and Fritschi [39]. However, genetic differences have a high impact on vegetative indices and vary depending on development and growing conditions [40,41].
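The R² values for individual measurements come from simple linear regressions of yield on one canopy measurement. A minimal sketch of that calculation is given below using ordinary least squares; the paired NDVI and yield values are hypothetical and do not reproduce Table 3.

```python
def simple_linear_fit(x, y):
    """Ordinary least squares fit of y on x.

    Returns (slope, intercept, r_squared).
    """
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("x and y must have equal length of at least 3")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must both vary")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical R5 NDVI readings and plot yields (kg/ha), illustration only.
ndvi_r5 = [0.70, 0.75, 0.78, 0.80, 0.82, 0.84, 0.86]
yield_kg_ha = [2800.0, 3100.0, 3000.0, 3400.0, 3500.0, 3723.0, 3900.0]

slope, intercept, r2 = simple_linear_fit(ndvi_r5, yield_kg_ha)
print(f"yield = {intercept:.0f} + {slope:.0f} * NDVI, R2 = {r2:.2f}")
```

An R² of 0.65, as reported for NDVI at R5, means that about two-thirds of the plot-to-plot variation in yield is accounted for by a straight-line relationship with that single measurement.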
The stepwise and Lasso regression model parameters were comparable, with stepwise having a slight advantage through smaller deviation from the regression line (Table 4). The primary difference between the two methods is the variables used in the models, with Lasso variable selection typically minimizing overfitting compared to stepwise regression [42]. Within the two models, the importance of NDVI at R1, R3, and R5 and FGCC at R3 is similar (Table 4). In this case, the variable combination produced by the stepwise model (Adj. R² = 0.69) is similar to that of the Lasso model (Adj. R² = 0.67). These results are similar to linear multiple regression for soybean yield prediction using yield components (R² = 0.70) [43]. However, the Lasso variable selection is more practical, as only NDVI and FGCC measurements are necessary, with a relatively negligible reduction in Adj. R². Previous soybean canopy measurement studies related yield to canopy reflectance [8,9] and PAR interception [28]. However, incorporating NDVI, FGCC, and PAR interception across environments with early and late planting dates, early and late relative maturities, 408,000 and 457,000 germinable seeds ha⁻¹ seeding rates, and narrow (30.5 cm) and wide (61 cm) row spacings allows for a yield prediction model with a greater inference and, in this case, a greater explanation of yield compared to a single canopy measurement. In practice, these models can give researchers different estimates depending on which equation is used. For example, Ma et al. [8] suggest measuring NDVI between the R4 and R5 stages to screen and rank soybean genotypes. Although the models were validated, we demonstrate the applied usage of the models from Table 5 using sensor data from an experimental unit managed with the farming practices recommended for the north-central USA region [44].
Using the stepwise model in Table 5 with measurement data from an experimental unit with an actual yield of 3723 kg ha⁻¹ and measurements of 0.72, 57.8, 0.82, 92.8, 0.84, and 85.7 for NDVI at R1, PAR at R1, NDVI at R3, FGCC at R3, NDVI at R5, and PAR at R5, respectively, the estimated yield is 3711 kg ha⁻¹. Using the Lasso model in Table 5 with data from the same experimental unit and values of 0.72, 0.82, 92.8, 0.84, and 93.3 for NDVI at R1, NDVI at R3, FGCC at R3, NDVI at R5, and FGCC at R5, respectively, the estimated yield is 3554 kg ha⁻¹. The yield predictions show how the stepwise model can provide higher yield values than the Lasso model, although Lasso regression can have comparatively smaller prediction errors [31]. The behavior of these models is important to note to better understand the yield prediction from regression equations.
Table 5. Stepwise and Lasso regression equations to best predict soybean yield.
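To put the two worked estimates in perspective, the short sketch below computes how far each prediction falls from the actual yield of the example experimental unit. The yields are the values quoted in the text; the function and variable names are illustrative only.

```python
def prediction_error(actual, predicted):
    """Return (absolute difference, percent difference) of predicted vs actual."""
    if actual == 0:
        raise ValueError("actual yield must be non-zero")
    diff = predicted - actual
    return diff, 100.0 * diff / actual


ACTUAL_YIELD = 3723.0  # kg/ha, experimental unit from the worked example
estimates = {"stepwise": 3711.0, "Lasso": 3554.0}  # kg/ha, from the text

for model, predicted in estimates.items():
    diff, pct = prediction_error(ACTUAL_YIELD, predicted)
    print(f"{model}: {diff:+.0f} kg/ha ({pct:+.1f}%)")
```

For this one unit, the stepwise estimate is 12 kg ha⁻¹ (about 0.3%) below the actual yield and the Lasso estimate is 169 kg ha⁻¹ (about 4.5%) below it. A single unit says little about average model error, which is why the validation results matter more than this example.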

Combining established plant density with a canopy measurement could be a simple means to improve yield prediction. However, established plant density was not a strong predictor of yield within the range of plant densities we encountered in these experiments and did not improve R² values (Table 6) compared to canopy measurements alone (Table 3), similar to the study by Ma et al. [8]. Our FGCC results are comparable to Yu et al. [45], who reported a correlation coefficient of 0.56 between canopy cover and yield at R3. A single canopy measurement, especially FGCC at R3 or R5, is a more effective use of time to predict yield. Yu et al. [45] reported that FGCC explained three times greater variation than the best vegetative index in Illinois, USA. Our results show that NDVI explains more yield variation than FGCC at R5; however, vegetative indices are more sensitive to genetic and environmental conditions and may not provide consistent results from year to year [40,45]. The model that best predicts yield may not always be the most practical to use. Instruments for NDVI and PAR measurement can be expensive and may not be practical for every soybean producer or agronomist to own. However, FGCC can easily be obtained in a few seconds using a cell phone with the Canopeo application. Table 3 provides R² values for the relationship between FGCC and yield at several growth stages, and Table 7 displays the simple and multiple regression equations that were most predictive of yield. The relationship between FGCC and yield slightly improves when both the R3 and R5 measurements are included. The best yield prediction model using FGCC includes observations at V2, R1, R3, and R5, with the R5 observation having the greatest effect on seed yield. It is important to note that FGCC only explains at most 56% of the variation in yield encountered.
Using a single FGCC measurement at the R5 growth stage is likely the most efficient way to collect data that give a reasonable prediction. However, given the only moderate R² values, FGCC should primarily be used to monitor soybean progress throughout the season rather than to predict yield per se. Using aerial sensors to derive similar measurements would likely be less time-consuming, less laborious, and more precise than using handheld sensors and, in turn, could improve the models. Our results fall within the expected values of other soybean prediction modeling methods, including spectral and canopy measurements, which range from 56 to 85% of yield variation explained [45][46][47]. Improving the soybean prediction model would benefit from a wider array of genotypes and additional years of data to expand inferences beyond the two-week planting window, relative maturity groups, seeding rates, and row spacings used in the study. Hyperspectral data and analyses similar to Hong et al. [48] and Yao et al. [49] may improve upon the results from this study in future soybean prediction studies.

Conclusions
Canopy cover data have been widely used in agricultural research to estimate or predict both biomass and seed yield. Results from this study suggest that a single canopy measurement prior to the soybean reproductive phase does not provide a high level of seed yield estimation. The multiple linear regression techniques used in this study suggest that most of the soybean seed yield can be explained by canopy measurements taken throughout the growing season, whereas an NDVI measurement at the R5 stage is the single observation that most closely predicts yield. Predicting yield at the R5 stage allows soybean producers to estimate production and provides a marketing advantage. Not all soybean producers or consultants have access to reflectance or light quantification instruments. Therefore, measuring the amount of green canopy cover at R5 is an easily accessible and reasonable measurement provided by a free application, explaining half of the yield variation. Further research is needed to evaluate whether other measurements or other technologies, such as aerial sensors, can improve the prediction of soybean yield.
Author Contributions: Literature review, conceptualization, methodology, research, formal statistical analysis, writing-original draft preparation, and writing-review and editing, P.K.S.; funding acquisition, project administration, conceptualization, investigation, supervision, and writing-review and editing, H.J.K. All authors have read and agreed to the published version of the manuscript.