Soybean Yield Estimation and Its Components: A Linear Regression Approach

Soybean yield estimation is based either on yield monitors or on agro-meteorological and satellite imagery data, but both present several limitations for on-farm decision making. Given that machine learning approaches have been widely applied to estimate soybean yield and that data on soybean yield and its components (number of grains (NG) and thousand grains weight (TGW)) are available, there is an opportunity to study their relationships. The objective was to explore the relationships between soybean yield and its components, generate equations to estimate yield and evaluate their prediction accuracy. The training dataset was composed of data on soybean yield and its components published from 2010 to 2019. Linear regression models based on NG, TGW and yield were fitted on the training dataset and applied to a validation dataset composed of 58 samples collected on-field. Globally, TGW and NG presented weak (r = 0.50) and strong (r = 0.92) linear relationships with yield, respectively. In addition, when the fitted models were applied to the validation dataset, the model based on NG presented the highest accuracy, with a coefficient of determination (R²) of 0.70, a mean absolute error (MAE) of 639.99 kg ha⁻¹ and a root mean squared error (RMSE) of 726.67 kg ha⁻¹.


Introduction
Yield is a quantitative measurement of the crop and it is an important feature that can benefit decision makers by supporting and improving their crop management [1]. The common approach to estimate grain yield with higher spatial resolution providing farm level data is using harvesters coupled to yield monitors [2]. Yield is computed through the convergence of yield monitor flow, global navigation satellite system (GNSS) receiver and moisture sensor data [3]. Grain yield monitor estimation accuracy is affected by many factors such as: (i) mass flow sensor, (ii) cleanness of the system, (iii) wiring harnesses and (iv) moisture sensor. Hence, constant inspection of the sensor system and calibration is required before the start of harvesting to achieve higher accuracy rates and obtain reliable data of estimated yield [3].
Other approaches have also been developed to estimate grain yield, but they have a lower spatial resolution, which limits their application at farm level. Among these approaches, the most common is the application of remote sensing techniques based on agro-meteorological [4] and satellite imagery data [5,6]. Efforts have also been made to apply advanced algorithms to this kind of data to estimate yield [7].
Worldwide, researchers are making efforts to apply machine learning (ML) techniques in different databases to estimate crop yield. Most of these databases are composed of spectral band/vegetation indices from crop canopies [8,9], chlorophyll index [9], fusion of chemical and physical soil properties, historical weather conditions and historical crop yield [10]. For example, Syngenta Crop Challenge [11] supplied a database with 2267 experimental corn hybrids planted in different locations across Canada and the United States between 2008 and 2016. Using that database to feed a deep learning neural network, the corn hybrid yield of 2017 was successfully predicted with a 12% root mean squared error compared to the average yield [12].
However, supporting on-farm decision making requires methods that gather data at higher spatial resolution. New sensor technologies and remote sensing techniques that rely on the concepts of high-throughput phenotyping research could be suitable for this situation [3]. In this sense, some researchers have already applied ML techniques to analyze yield and its components. For example, Romero et al. [13] predicted durum wheat yield from plant height, peduncle length, spikelet number per ear, grain number per ear and grain weight per ear by applying different ML decision trees. Aggelopoulou et al. [14] predicted apple tree yield from one of its yield components, in their case, full bloom. The prediction was made by applying an image processing-based algorithm that learned texture patterns and correlated them with yield.
Soybean yield components can be affected by several factors, such as shading and temperature [15], genotype and environmental conditions [16], seed vigor [17], and seeding date, environment and cultivar [18]. Despite that, yield prediction through its components is becoming more popular in crops whose yield component is the product to be marketed, for example, mango [19] and rice [20], which highlights the possibility of applying it to soybean crops, since the final product is grain.
Therefore, given that crop yield can be predicted through its components (cotton [21] and rice [20]), that ML techniques are widely applied in agriculture, and that data on soybean yield and its components are available, there is an opportunity to investigate whether soybean yield can be forecast from its components on a global scale, unlike the common approach, which relies on vegetation indices. In this sense, this paper aimed to (i) explore the relations between soybean yield and its components using data published around the world, (ii) generate a global equation to estimate soybean yield and (iii) evaluate the prediction accuracy of linear regressions.

Materials and Methods
The training dataset was composed of data published from 2010 to 2019 that contained soybean yield and thousand grains weight (TGW) or hundred grains weight (HGW). Where HGW was reported instead of TGW, HGW values were multiplied by 10 to convert them to TGW. A training dataset built from published data was chosen because it allows the relations between soybean yield and its components to be analyzed from a global perspective: the data come from different study locations, treatments, soybean varieties and years.
Based on literature reviews [22,23], Equation (1) was used to estimate the number of soybean grains (grains m⁻²) because normally this data is not available and it is known that soybean yield (kg ha⁻¹) is a function of the number of grains in a certain area (grains m⁻²) and the average weight per grain [22]; in this case, it was adapted to thousand grains weight (g 1000 grains⁻¹):

$$\mathrm{NG}\ (\text{grains m}^{-2}) = \frac{100 \times \mathrm{Yield}\ (\text{kg ha}^{-1})}{\mathrm{TGW}\ (\text{g 1000 grains}^{-1})} \quad (1)$$

where NG is the number of grains (grains m⁻²) and TGW is the weight of a thousand grains (g 1000 grains⁻¹).
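The HGW conversion and Equation (1) can be sketched in a few lines. The paper reports that its analysis was run in R; Python is used here only for illustration, and the function names and input values are hypothetical rather than taken from the paper's datasets.

```python
def hgw_to_tgw(hgw_g: float) -> float:
    """Convert hundred grains weight (g per 100 grains) to TGW (g per 1000 grains)."""
    return hgw_g * 10.0


def number_of_grains(yield_kg_ha: float, tgw_g: float) -> float:
    """Equation (1): grains per m^2 from yield (kg/ha) and TGW (g per 1000 grains)."""
    if tgw_g <= 0:
        raise ValueError("TGW must be positive")
    return 100.0 * yield_kg_ha / tgw_g


# Illustrative values only (not from the training or validation datasets).
tgw = hgw_to_tgw(15.0)                 # 150.0 g per 1000 grains
ng = number_of_grains(3000.0, tgw)     # 100 * 3000 / 150
print(tgw, ng)                         # 150.0 2000.0
```

The factor 100 follows from the units: 1 kg ha⁻¹ equals 0.1 g m⁻², and one grain weighs TGW/1000 g.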
A descriptive analysis (number of observations, minimum, median, mean and maximum values, standard deviation, sample variance and coefficient of variation) and scatterplot (yield versus country, yield versus TGW and yield versus NG) were applied to the training dataset to provide better data visualization and inference. In addition to the scatterplots, Pearson's correlation (r) between soybean yield and its components was also calculated.
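As a rough illustration of the descriptive analysis and Pearson's correlation described above, the following sketch uses only the Python standard library. The helper names and the toy values are assumptions for illustration, and the sample (n − 1) form of the standard deviation and variance is also an assumption, since the text does not say which form was used.

```python
import math
import statistics


def describe(values):
    """Descriptive statistics listed in the text for a single variable."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)          # sample standard deviation (n - 1)
    return {
        "n": len(values),
        "min": min(values),
        "median": statistics.median(values),
        "mean": mean,
        "max": max(values),
        "sd": sd,
        "variance": statistics.variance(values),
        "cv_percent": 100.0 * sd / mean,   # coefficient of variation
    }


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Toy values only (hypothetical, not the published observations).
yield_kg_ha = [1800.0, 3000.0, 3240.0, 3500.0, 3520.0]
ng_per_m2 = [1500.0, 2000.0, 1800.0, 2500.0, 2200.0]
print(describe(yield_kg_ha))
print(pearson_r(yield_kg_ha, ng_per_m2))
```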
Linear regression models were used to predict yield because the number of observations was larger than the number of variables [24]. All yield observations, TGW and NG from the training dataset were used to fit three linear regression models: (A) Yield as function of TGW and NG, (B) yield as function of NG and (C) yield as function of TGW.
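The three model structures (A, B and C) can be sketched as ordinary least squares fits. This is a minimal standard-library illustration that solves the normal equations directly; it is not the authors' R code, the toy data are hypothetical, and the resulting coefficients are not the paper's fitted equations.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(predictors, y):
    """Least-squares fit; predictors is a list of columns.

    Returns [intercept, slope_1, slope_2, ...].
    """
    rows = [[1.0] + list(p) for p in zip(*predictors)]  # design matrix
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


def predict(coefs, predictors):
    """Apply fitted coefficients to new predictor columns."""
    return [coefs[0] + sum(c * v for c, v in zip(coefs[1:], p))
            for p in zip(*predictors)]


# Toy data only (hypothetical values).
tgw = [120.0, 150.0, 180.0, 140.0, 160.0]
ng = [1500.0, 2000.0, 1800.0, 2500.0, 2200.0]
yld = [n * t / 100.0 for n, t in zip(ng, tgw)]

models = {"A": [tgw, ng], "B": [ng], "C": [tgw]}   # predictor sets from the text
for name, cols in models.items():
    print(name, fit_ols(cols, yld))
```

Solving the normal equations is adequate for two predictors; a QR-based solver, as used by most statistical packages, is numerically safer for larger problems.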
Because linear regression provides an explicit equation for the predicted parameter (in our case, yield), the fitted equation of each model was also recorded. The linear regression models generated from the training dataset were then applied to the validation dataset to predict soybean yield.
The validation dataset was composed of 58 randomly selected on-field samples of 1 m² from a soybean field (cultivar TMG 7062) collected at Piracicaba, Brazil (22.7356° S, 47.6479° W) in March 2019. The number of grains (grains m⁻²) was counted manually, grain by grain; thousand grains weight (g 1000 grains⁻¹) and yield (kg ha⁻¹) were obtained from these samples after drying (105 °C for 72 h in a forced circulation air oven).
The overall performance of the fitted equations was evaluated by comparing their root mean squared error (RMSE), mean absolute error (MAE) and coefficient of determination (R²) for both the training and validation datasets. All statistical analyses were performed in the R environment [25].
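The three evaluation metrics can be sketched as follows. The text does not state which definition of R² was used; this sketch assumes the squared Pearson correlation between observed and predicted values, which, unlike 1 − SS_res/SS_tot, is unaffected by a constant bias in the predictions. Function names and values are illustrative only.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def mae(observed, predicted):
    """Mean absolute error."""
    n = len(observed)
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / n


def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and predicted (assumed definition)."""
    mo = sum(observed) / len(observed)
    mp = sum(predicted) / len(predicted)
    sxy = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    sxx = sum((o - mo) ** 2 for o in observed)
    syy = sum((p - mp) ** 2 for p in predicted)
    return sxy ** 2 / (sxx * syy)


# Hypothetical values: predictions carry a constant bias of 600 kg/ha.
obs = [3000.0, 3500.0, 4000.0]
pred = [3600.0, 4100.0, 4600.0]
print(rmse(obs, pred), mae(obs, pred), r_squared(obs, pred))  # 600.0 600.0 1.0
```

The toy case shows why the error metrics are read alongside R²: a model can track the variation in observed yield closely while still being offset from it.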

Results and Discussion
The training dataset comprised 707 observations of soybean data containing at least "yield" and "thousand grains weight" or "hundred grains weight", drawn from 56 eligible papers across the world, as presented in Table 1. Large differences in soybean yield and its components are observed between countries. Yield ranged from 167.1 to 10,170 kg ha⁻¹. The different yield values found across the world highlight the regional variability associated with environment and genotype. Furthermore, the coefficients of variation of yield and number of grains are higher than 50%, whereas that of thousand grains weight is approximately 24%. These values indicate that TGW varies less than yield and NG worldwide. The validation dataset, also shown in Table 1, likewise presents a higher coefficient of variation for yield and NG than for TGW.

Table 1. Descriptive analysis of the soybean training dataset (obtained from the 56 eligible papers) and the validation dataset.

Soybean yield variability among countries, regardless of year, genotype and treatment, can be seen in Figure 1. Figure 1A (worldwide soybean yield) shows the distribution of soybean yield among countries. Within each country there are large yield differences, which represent yield variability at country level. This is expected and can be seen in several governmental reports, for example, those of the United States Department of Agriculture (USDA) [82] in the US and the Brazilian Agricultural Research Corporation (EMBRAPA) [83] in Brazil, which are responsible for forecasting and presenting harvested yield by location.

Figure 1B (soybean yield versus TGW) shows a weak linear relationship (r = 0.50) between these two variables, which negatively affects the prediction accuracy of the linear regression model. On the other hand, Figure 1C (soybean yield versus NG) presents a strong linear relationship between yield and NG (r = 0.92). Comparing TGW and NG (Figure 1D), there is a weak linear relationship (r = 0.19), indicating that there is no collinearity between them.
Figure 1 therefore points to an opportunity to further investigate the application of machine learning techniques to estimate soybean yield based on its components. Consistent with the view that yield prediction is one of the most important topics in precision agriculture and that ML techniques have been widely applied in agriculture, this paper also applied ML techniques to forecast yield [84], but with yield components as predictor variables.
Yield is a function of the number of pods (np), grain number per pod (gnp) and grain weight (gw), as presented in Equation (2), adapted from Egli and Zhen-wen [22] and Lindsey [23]. Combining np and gnp generates a new variable, the number of grains (NG), which gives Equation (3). Soybean yield is therefore a direct function of NG and gw. Of these two variables, the number of grains (grains m⁻²) has more influence on yield than grain weight/thousand grains weight [34,85-87], which corroborates the findings in Figure 1.
$$\mathrm{Yield} = np \times gnp \times gw \quad (2)$$

$$\mathrm{Yield}\ (\text{kg ha}^{-1}) = \mathrm{NG}\ (\text{grains ha}^{-1}) \times gw\ (\text{kg grain}^{-1}) \quad (3)$$

where NG and gw are, respectively, the number of grains (grains ha⁻¹) and grain weight (kg grain⁻¹).

Figure 2 presents the relation between observed and predicted soybean yields. The linear models were generated from the training dataset. Model A (Figure 2A) presents the highest R² (0.97) and lowest RMSE (288.78 kg ha⁻¹), model B (Figure 2B) presents an R² of 0.85 and an RMSE of 455.98 kg ha⁻¹, and model C (Figure 2C) presents the lowest R² (0.26) and the highest RMSE (1343.17 kg ha⁻¹), as expected given the weak linear relationship between yield and TGW. Higher R² and lower MAE values are found for models A and B than for model C because NG has more global influence than TGW on yield, as seen previously in Figure 1. Hence, models A and B are expected to perform better overall than model C when applied to the validation dataset.
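As a unit check, Equation (3) can be evaluated from the quantities used in Equation (1), namely NG in grains m⁻² and TGW in g per 1000 grains. The sketch below is illustrative only; the function name and the input values are hypothetical.

```python
import math


def yield_from_components(ng_per_m2: float, tgw_g: float) -> float:
    """Equation (3) with unit conversions; returns yield in kg/ha."""
    ng_per_ha = ng_per_m2 * 10_000.0       # 1 ha = 10,000 m^2
    gw_kg = tgw_g / 1000.0 / 1000.0        # g per 1000 grains -> kg per grain
    return ng_per_ha * gw_kg


# Hypothetical sample: 2000 grains per m^2 at a TGW of 150 g.
y = yield_from_components(2000.0, 150.0)
print(round(y, 2))

# Inverting Equation (1) gives the same value: Yield = NG * TGW / 100.
print(math.isclose(y, 2000.0 * 150.0 / 100.0))
```

Because NG in the training dataset is derived from yield and TGW through Equation (1), the two equations agree by construction; only the units differ.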
The results of applying the fitted models A, B and C to the validation dataset are shown in Figure 3. The R² values for models A (Figure 3A), B (Figure 3B) and C (Figure 3C) are 0.66, 0.70 and 0.02, respectively. Model B presents the highest R² and the lowest MAE (639.99 kg ha⁻¹) and RMSE (726.67 kg ha⁻¹), followed by model A (MAE of 3420.93 kg ha⁻¹ and RMSE of 3449.80 kg ha⁻¹) and model C (MAE of 5267.52 kg ha⁻¹ and RMSE of 5391.08 kg ha⁻¹). These results indicate that model B predicts soybean yield more accurately from the number of grains in data collected in a single field. Therefore, in this case, yield prediction based on NG can support on-farm decision making, which is the scale of precision agriculture actions that deal with within-field variability [88]. Note that, to achieve good accuracy, predicted and observed yields should present not only a high R² and low RMSE and MAE, but also an amplitude ratio between predicted and observed yield values as close as possible to one (R² = 1.00), which would mean a perfect prediction without error or unexplained variance [89].
This paper presents a novel model to predict soybean yield based on its components and on data published worldwide. It applies linear regression models that differ from published models based on soil properties [90], historical data about soil, yield and soybean variety [91] and vegetation indices [92]. Model B, based on the number of grains per m², presented the highest R² on the validation dataset and demonstrated its potential to predict soybean yield.
Currently, soybean yield estimations are based either on agro-meteorological/crop models or on yield monitors. Agro-meteorological/crop models are restricted by their spatial resolution and, most of the time, do not provide farm-level support for on-farm decision making. Yield monitors, on the other hand, have a higher spatial resolution than agro-meteorological/crop models, but they are limited by operational constraints and several sources of error, such as miscalibration, sensor system errors and many others [3].
As stated by Patrício and Rieder [93], the combined application of computer vision, machine learning and high-performance computing is considered one of the main drivers for solving different problems in agriculture. Options based on these applications should therefore be sought to deliver better data and information to farmers and to support and improve their crop decisions at farm level.
Efforts have been made in digital image processing techniques related to plant diseases [94], and reviews have covered computer vision systems to evaluate grain quality [95]. Several researchers are working on grain quality and plant disease diagnosis by applying computer vision, which is expected, since these evaluations are based on visual inspection. Hence, given that there is a strong linear relation between the number of grains per m² and soybean yield (r = 0.92) worldwide, and that computational processing techniques are becoming more efficient, there is an opportunity to apply computer vision techniques to gather these data and estimate soybean yield directly, instead of relying on indirect factors.
The proposed model can be suitable for different approaches, from on-farm yield sampling for genetic plot trials to point-based yield scouting. Applying this model requires acquiring the predictor variable (NG), which is not easy at the moment. However, there are researchers working on high-throughput phenotyping, specifically on estimating the number of grains per pod of soybean plants [96] and the number of grains [97], both under controlled environmental conditions.

Figure 1. Soybean yield (kg ha⁻¹) versus (A) country, (B) thousand grains weight (g 1000 grains⁻¹), (C) number of grains (grains m⁻²) and (D) thousand grains weight (g 1000 grains⁻¹) versus number of grains (grains m⁻²). r = correlation coefficient.
Uzal et al. [96] developed an approach based on convolutional neural networks and support vector machines to estimate the number of seeds per pod of soybean plants through image processing, with R² ranging from 0.90 to 0.95 for validation and from 0.50 to 0.86 for test. Li et al. [97] estimated the number of soybean seeds with deep learning techniques, in their case two convolutional neural networks, reaching MAE and MSE values of 13.21 and 17.62 seeds per image, respectively.
Considering non-controlled environmental conditions and targeting a yield component such as grain, Reza et al. [20] estimated rice yield from one of its components, in their case rice grain area. They applied a graph-cut algorithm with K-means clustering to images obtained from an unmanned aerial vehicle and found a coefficient of determination of 0.98 between the ground truth and the proposed method. Thus, computer vision in agriculture has the potential to provide yield component data, as for rice grain, and this can be extrapolated to soybean grain. Figure 4 shows an example of a computer vision technique applied to obtain the number of grains of a sample, which illustrates the potential of such techniques for soybean. Note that the focus of this paper is not to demonstrate how to implement computer vision, but to highlight that soybean yield can be accurately predicted through its components (in our suggested model, the number of grains per m²) and that one possible way to obtain these component data is computer vision. More research applying computer vision to acquire the soybean number of grains is required, and it can rely on technological and computational advances in agriculture, which have been progressing rapidly.
Given that yield monitors require several time-consuming calibration steps to improve their estimation accuracy [3] and that soybean yield can be accurately estimated by model B (based on the number of grains per m²), efforts should be directed at obtaining accurate grain counts on-field, without harvesting the crop, so that model B can be applied to forecast soybean yield. Since computer vision techniques are already being applied to estimate crop components related to yield [98-101], and considering the example in Figure 4, computer vision is a promising method to gather the soybean number of grains on-field in a fast, reliable and accurate way.

Conclusions
In this work, considering the dataset used, it can be concluded that, globally, soybean yield presents a strong linear relationship with the number of grains, even across different varieties, countries and treatments, and a weak linear relationship with thousand grains weight. It was possible to estimate soybean yield with two equations, one dependent on thousand grains weight and number of grains (model A) and another dependent only on the number of grains (model B). Model B, yield as a function of number of grains, presented the best prediction accuracy on the validation dataset compared to models A and C, with the highest R² (0.70) and the lowest MAE (639.99 kg ha⁻¹) and RMSE (726.67 kg ha⁻¹).
Despite the goodness of fit of model B for predicting soybean yield, it is worth noting that obtaining the NG data required by the proposed model is currently difficult and time-consuming. However, as the implementation of technology in agriculture advances, the proposed method supports the use and development of computer vision applications to estimate crop yield, in this case, soybean yield.