Prediction of Soybean Plant Density Using a Machine Learning Model and Vegetation Indices Extracted from RGB Images Taken with a UAV

Soybean plant density is an important factor in successful agricultural production. Due to the high number of plants per unit area, early plant overlapping, and eventual plant loss, the estimation of soybean plant density in the later stages of development should enable the determination of the final plant number and reflect the state of the harvest. In order to assess soybean plant density in a digital, nondestructive, and less labor-intensive way, analysis was performed on RGB images (containing three channels: RED, GREEN, and BLUE) taken with a UAV (Unmanned Aerial Vehicle) on 66 experimental plots in 2018 and 200 experimental plots in 2019. Mean values of the R, G, and B channels were extracted for each plot, then vegetation indices (VIs) were calculated and used as predictors for the machine learning model (MLM). The model was calibrated in 2018 and validated in 2019. For validation purposes, the predicted values for the 200 experimental plots were compared with the real number of plants per unit area (m²). Model validation resulted in a correlation coefficient R = 0.87, mean absolute error (MAE) = 6.24, and root mean square error (RMSE) = 7.47. The results of the research indicate the possibility of using the MLM, based on simple values of VIs, for the prediction of plant density in agriculture without using human labor.


Introduction
Field-based phenotyping can be promising for gathering information about the number of plants per plot, which is important for the description of plant traits in agricultural practices [1]. Soybean plant architecture is highly influenced by plant density. An insufficient number of plants per unit area results in more branches and more pods per plant [2]. The number of plants per unit area at emergence is often different from the number of plants at harvest. Plant losses are usually caused by technological operations (machinery, inter-row cultivation, mechanical weed control, and pesticide application), severe weather conditions (hail, flooding, and frost), and biotic factors (pests and diseases). Due to plant loss, knowledge of the number of plants in the later stages of canopy development should enable a precise approximation of the final number of plants at harvest. The number of plants per unit area will provide information about emergence and potential losses in plant density, which is important for both agricultural science and production. Furthermore, the later stages of crop development provide an extended timeframe for plant density estimation, which is not limited to the early period of crop emergence. The usual way of obtaining data about plant density on experimental plots can be tiring and labor-intensive. The main objective of this study was to develop an MLM based on values of simple VIs obtained from RGB images for the prediction of soybean plant density in mid-development stages. An additional objective of this study was to validate the model in an independent environment to test its robustness.

Trial Description
The trial was conducted in 2018 and 2019 on the experimental fields of the Institute of Field and Vegetable Crops in Rimski Šančevi, Serbia. In both years, the trial was performed on chernozem soil with homogeneous texture across the entire experimental site, which was deep and well drained. In both years, standard cultivation practices were applied to the experimental field, with the sowing dates, row spacing, and seed spacing in soybean planting recorded (Table 1). In 2018, 66 soybean genotypes, each sown on its own 8 m² plot, were used for calibration of the MLM, while in 2019, 200 soybean genotypes, each sown on its own 10 m² plot, were used for model validation. In both years, the trials were planted on different fields. In total, the analysis included 266 different soybean genotypes, which were experimental lines from the soybean breeding programs. The genotypes included in the trial represented different maturity groups and different plant architectures.

UAV Description
The UAV platform used for collecting the RGB photos was a Phantom 4 drone (DJI, Shenzhen, China), powered by four propellers and operated with a remote controller running at 2.4 GHz. Soybean plots were filmed using an integrated camera with the following characteristics: 1/2.3" CMOS (Complementary Metal Oxide Semiconductor) sensor with 12.4 megapixels, focal length of 8.8 mm, and 1.84 cm/pixel resolution. The maximum wind speed that allows for image acquisition with the UAV is 10 m/s. To determine the geographic position of each image, the UAV used the global positioning systems GPS/GLONASS (Global Positioning System/Global Navigation Satellite System).

Field Based and Remote Data Collection
In 2018, the number of plants was first counted manually for each of the 66 experimental plots and then, by dividing the total number of plants per plot by the plot area (8 m²), we calculated the number of plants per unit area (m²) for each plot. After that, RGB photos were taken with a UAV in the soybean development phases of four unfolded trifoliolate leaves (V4) (Figure 1a) and beginning pod (R3) (Figure 1b). In 2019, the same data collection procedure was repeated for 200 plots for validation of the model developed in 2018.
In both trial years, photos were taken on a sunny day, at a wind speed equal to or less than 10 m/s, and between 10:00 and 14:00; more details about the UAV photo acquisition are given in Table 2.
In 2018, after acquisition of all the individual images in phase V4, an orthophoto of the entire trial was created, and the same process was repeated with the images taken in the R3 phase. The same procedure was repeated in 2019. The creation of the orthophoto involved a process of stitching together all the individual photos, which was carried out using the open-source software WebODM [27].
The next step was the analysis of the individual plots on the RGB orthophoto using the open-source image analysis software Fiji [28]. First, in Fiji, a region of interest (ROI) was created on the RGB images for each of the 66 plots from the 2018 trial and the 200 experimental plots from the 2019 trial. After the creation of the ROI for each plot, the RGB images were separated into the individual channels R, G, and B using Fiji's stack-to-images function (Figure 2). The following step was the extraction of the mean value of each individual channel (R, G, and B) for every plot (ROI) in both years (2018 and 2019) and for both development phases (V4 and R3). This was accomplished using Fiji's measure tool, applied to each of the three individual channels. The values obtained for all experimental plots were used for the calculation of the simple VIs, which served as predictors for the MLM.
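The calculation of simple VIs from the per-plot channel means can be sketched as follows. This is an illustration only: the exact set of eight indices and their formulas is given in Table 3, and the indices below (ExG, ExR, ExGR, NGRDI) are standard examples of such simple RGB indices, shown with their commonly published formulas; the example input values are hypothetical.

```python
# Sketch: a few widely used RGB vegetation indices computed from the
# mean channel values of one plot (ROI). The actual eight indices used
# in the study are listed in Table 3; these formulas are the commonly
# published definitions, not necessarily the ones in that table.

def rgb_indices(R: float, G: float, B: float) -> dict:
    """R, G, B are the mean channel values of one plot (0-255 scale)."""
    total = R + G + B
    # Chromatic (normalized) coordinates remove overall brightness.
    r, g, b = R / total, G / total, B / total
    exg = 2 * g - r - b              # excess green
    exr = 1.4 * r - g                # excess red
    return {
        "ExG": exg,
        "ExR": exr,
        "ExGR": exg - exr,           # excess green minus excess red
        "NGRDI": (G - R) / (G + R),  # normalized green-red difference
    }

# Hypothetical mean channel values for one plot:
vis = rgb_indices(R=90.0, G=120.0, B=60.0)
```

Because the indices are computed from per-plot channel means rather than per-pixel values, each plot contributes one row of predictors to the MLM.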

Vegetation Indices Calculated from UAV Images
Eight VIs were used for the prediction model of the number of soybean plants/m² (Table 3).
Agronomy 2020, 10, 1108
Table 3. Vegetation indices (VIs) used as predictors for the machine learning model (MLM). G = mean value of the GREEN channel, R = mean value of the RED channel, and B = mean value of the BLUE channel.

(Table 3 gives each vegetation index's name, formula, and reference; the indices include, among others, excess green red and modified excess green.)

Prediction Model
The R implementation of the random forest (RF) algorithm was used for the prediction of soybean plant density, with the following settings: maxnodes = 50 and ntree = 100 [36,37]. To work properly, the machine learning algorithm requires training and test data sets as input.
In the first trial year, the prediction model was based on the number of plants/m² and the VIs calculated for the 66 experimental plots. The VIs and the manually counted number of plants from 80% of the randomly selected plots were used as the training set, while the VIs for the remaining 20% of the plots were used as the test set. This 80/20 data partition is performed by code included in the RF model. After running the model and obtaining the predicted values of the number of plants/m² for the 20% test plots, these values were compared with the manually counted values to assess the quality of the model.
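The 80/20 random partition described above can be sketched as follows. This is a minimal illustration in Python, not the R code used in the study; the plot identifiers and the fixed seed are assumptions for reproducibility of the example.

```python
import random

# Sketch: the 80/20 random split of plots into training and test sets,
# as used for model calibration. The study performed this step with the
# R random forest package (maxnodes = 50, ntree = 100); this Python
# version only illustrates the partition itself.

def split_80_20(plot_ids, seed=42):
    """Randomly assign 80% of plots to training and 20% to testing."""
    rng = random.Random(seed)        # fixed seed so the split is repeatable
    shuffled = list(plot_ids)
    rng.shuffle(shuffled)
    n_train = round(0.8 * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]

# The 2018 calibration trial had 66 plots:
train, test = split_80_20(range(66))
```

The training rows (VIs plus manually counted plants/m²) fit the model; the held-out 20% provide the predictions that are compared against the manual counts.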

Results
The RGB images collected with the UAV in both soybean trial years were of good quality: after the stitching procedure, they provided sharp orthophotos of the experimental trials and were usable for the calculation of the simple VIs. The results showed a high correlation between the real and predicted number of plants/m², with a relatively low mean absolute error and root mean square error (Table 4). The model showed good results in 2018; however, to accept it as a tool for predicting the number of plants per unit area, it is important that the model provides similar quality in different years. The model from 2018 was therefore evaluated the following year on the 200 experimental plots (Table 5).
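The validation statistics used throughout (correlation coefficient R, MAE, and RMSE) can be computed from paired real and predicted counts as sketched below; the data values in the example are hypothetical, not the study's measurements.

```python
import math

# Sketch: the three validation statistics reported in the study,
# computed from paired real (manually counted) and predicted plants/m².

def validation_metrics(real, pred):
    """Return Pearson correlation R, MAE, and RMSE."""
    n = len(real)
    mean_r, mean_p = sum(real) / n, sum(pred) / n
    cov = sum((x - mean_r) * (y - mean_p) for x, y in zip(real, pred))
    var_r = sum((x - mean_r) ** 2 for x in real)
    var_p = sum((y - mean_p) ** 2 for y in pred)
    r = cov / math.sqrt(var_r * var_p)                       # Pearson R
    mae = sum(abs(x - y) for x, y in zip(real, pred)) / n    # mean abs. error
    rmse = math.sqrt(sum((x - y) ** 2
                         for x, y in zip(real, pred)) / n)   # root mean sq. error
    return r, mae, rmse

# Hypothetical real vs. predicted counts for four plots:
r, mae, rmse = validation_metrics([50, 62, 45, 70], [52, 60, 48, 65])
```

Squaring the errors in RMSE penalizes large misses more than MAE does, which is why the two are reported together.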
Comparing the predicted and real values of the number of plants/m², the high correlation coefficient and R² between these two variables indicate the possibility of using this model as a tool for the digital counting of plants on experimental plots (Figure 3). A lower R² and higher errors were observed for the model validation in 2019 compared to the results obtained in the model calibration in 2018. This was expected and illustrates the effect of uncontrolled factors (Table 6). The results show higher variability between the individual plots for the real values than for the predicted values of the number of plants/m². This is indicated by the higher standard deviation, which resulted in a higher standard error for the real values compared to the predicted ones. Also, the range between the maximum and minimum number of plants/m² per plot is narrower for the predicted values than for the real ones, mainly because of the lower maximum number of plants/m² per plot calculated by the prediction model.

(Table 6 reports the descriptive statistics of the number of plants/m² for the 200 plots, for both the real and the predicted values.)
The number of plants/m² manually counted on the 200 plots in 2019 and the number of plants/m² obtained for the same plots using the prediction model were subtracted; the differences are shown in the box plot (Figure 4).

Discussion
The main reason why the model underestimated the real number of plants/m 2 lies in the overlap between the plants on the denser plots, so the model could not distinguish every soybean plant individually. For those plots that had a higher number of plants/m 2 in reality, the model predicted lower values.
On the other hand, one possible reason why the model overestimated the real values for some plots was the presence of weeds on those plots, which the model counted as soybean plants, resulting in a mismatch between the real and predicted values. This indicates that, for higher precision of the results, plots must be kept free of weeds so as not to disturb the prediction model.
A higher value of IncNodePurity indicates a greater influence of the variable on the final result of the prediction [21]. On average, the values of the indices extracted from the later (R3) phase of soybean development had a greater influence as predictors than the values of the same indices extracted from the earlier (V4) phase. This is because the plants were more robust in the R3 phase, with more leaves, which increased the number of green pixels per plot and thus improved the index efficiency. Still, the values of the VIs calculated from phase V4 also had an important role in the final results of the prediction, especially the indices NGRD and ExGR, as verified by their high IncNodePurity values.
In this study, the middle phases (V4 and R3) of soybean development were used for plant density prediction. In contrast, a two-year experiment on safflower proposed a density estimation model that analyzed the green pixel ratio of plants in the 2-4 leaf stage [38]. That study obtained better prediction values (R² = 0.88 in 2017; R² = 0.86 in 2018) during the early growth stages because of less overlapping between the plants. This suggests that the soybean prediction model would be more accurate if the VIs were calculated from images of plants in earlier growth stages rather than in stages V4 and R3, when the plants overlap. The reason for using the middle development stages in the soybean trials was to avoid two potential sources of error in the prediction. First, if the analysis is done too early, some plants are not taken into account due to the uneven emergence of the individual plants in the plots. Second, plant losses can occur later as a consequence of inter-row cultivation.
A study on maize pointed out that the simple number of green pixels extracted from the images is not enough for the digital counting of plants [1]. This is reflected in the low value of R² = 0.023 calculated between the number of green pixels and the number of plants counted manually; only after additional image transformations did the results improve significantly (R² = 0.89). Excessive and often complicated image transformations were avoided in the soybean research: only the pixel values were used to calculate simple VIs, which were imported into the MLM, resulting in a precise prediction of the number of plants per unit area (R² = 0.80 in 2018 and R² = 0.76 in 2019).

Conclusions
Plant density is one of the most important factors in successful agricultural production, not only for soybean but for all crops, and it is of key importance for obtaining high yields. Therefore, information about the number of plants per unit area is necessary. This information can be obtained by means of a nondestructive method, using MLMs and VIs derived from RGB images. A favorable time for the extraction of the VIs is the middle phase of plant development, because all plants that emerge earlier or later are taken into account in the calculation. Although more research is needed, the predictions calculated using RF in the soybean trials gave good results and showed that the method can be used as a new tool for gathering significant information about crops.