Estimating Cotton Yield in the Brazilian Cerrado Using Linear Regression Models from MODIS Vegetation Index Time Series

.


Introduction
Cotton holds significant economic importance in Brazil, ranking as the world's fourth largest producer and the second largest exporter.In the 2022-2023 season, Brazil cultivated cotton across 1.66 million hectares, resulting in a total lint production of 3.03 million tons [1].
The majority of the cotton cultivation occurs within the Brazilian Cerrado, characterized by flat and arable lands with favorable rainfall ranging from 750 to 2000 mm/year across the biome [2].This region employs a highly mechanized, rainfed production system, with approximately 92% occurring without irrigation [3].Typically, cotton is cultivated from January to July, following the soybean harvest [4].The primary cotton producing states are Mato Grosso and Bahia, jointly contributing to around 90% of Brazil's total output [5].
Estimating in-season cotton yield over large areas at a regional or national level provides critical information for farmers, policymakers, governments, crop insurers, and commodity traders.Yield estimations involve a combination of field surveys, remote sensing, statistical regression, and crop simulation models [6,7].Mid-season crop forecasting is particularly vital for farmers, as it informs management decisions regarding harvesting, storage, transport logistics, and planning for subsequent crops [8,9].Early forecasting cotton yield, before 90-100 days after sowing, provides valuable management information for implementing corrective interventions, such as adjusting input applications (e.g., topdressing fertilizer and growth regulators), and comparing predicted yields to historical or expected yields for specific fields.
While crop yield models have demonstrated adequate performances at the field scale, their applications across large areas face challenges due to the substantial data volume and computational processing costs [10].To address them, statistical regression models such as multiple linear regression and machine learning techniques have been employed to integrate spectral indices from satellite data with climate variables at regional or national levels [10][11][12].Table 1 presents various studies on cotton yield estimation utilizing statistical regression models or crop models combined with remote sensing across the regional and field scale under different platforms (satellite, airplane, unmanned aerial vehicle (UAV), and ground sensors) in diverse countries.Three of these studies developed regression models, including Random Forest, to forecast cotton yield at a regional scale in China [10], India [11] and Australia [12], using satellite-derived vegetation indices (VIs) and climate variables as covariates.The models incorporated precipitation, temperature, evapotranspiration, vapor pressure, soil moisture, and other climate variables.They utilized hundreds of data instances from thousands of hectares across different farm fields over multiple crop years for training and testing, resulting in root mean square errors (RMSE) for seed cotton yield of 157 kg ha −1 [10], 375 kg ha −1 [11], and 976 kg ha −1 [12] in China, India, and Australia, respectively.
Simple linear regression models have also been developed to predict cotton yield using solely satellite-derived vegetation indices as independent variables [13][14][15][16], unmanned aerial vehicle (UAV) [17][18][19][20], airplane [21], and ground sensor [22].While vegetation indices alone have limitations in capturing the complex relationships among production, plant physiology, climate, soil nutrition, soil water parameters, pest and disease infestation, and crop management characteristics [10], several studies have demonstrated their ability to predict crop yield with reasonable accuracy [7,23,24].On average, the RMSEs for seed cotton yield for the linear regressions were comparable to those assessed by using machine learning techniques and crop model approaches (Table 1), indicating the potential advantages of linear regression models.These models operate with a smaller set of welldefined independent variables, require less data manipulation, and utilize smaller data instances compared to other approaches.Moreover, the greater mathematical explainability inherent in linear equations for specific satellite products, vegetation indices, and crop phenological stages allows for direct transferability and easy adaptation to other locations, facilitating clearer understandings and comparisons among different models.
The current study assesses the feasibility of employing simple linear models to correlate cotton yield with VIs derived from satellite data obtained from the Moderate-Resolution Imaging Spectroradiometer (MODIS) sensors aboard the Terra and Aqua satellites.The aim is to forecast in-season cotton yield within the Brazilian Cerrado region.Reflectance time series data from multiple MODIS spectral bands were extracted to assess 15-day interval averages spanning from sowing to harvest of widely recognized VIs.These VI averages, computed over various 15-day intervals, were accessed as predictors (independent variables) within linear regression models to estimate cotton yield.

Experimental Areas and Dataset
The regression models utilized data from 281 commercial cotton farm fields (hereinafter referred to as plots) across the states of Mato Grosso (125 plots), Goiás (83 plots), and Bahia (73 plots) during the growing seasons from 2016 to 2022.Data from 2016-2018 seasons were gathered from surveys conducted by the Mato Grosso Cotton Institute [32,33] and the Brazilian Agricultural Research Corporation [34], while data from the 2019 to 2022 seasons were obtained directly from cotton farmers.Figure 1 illustrates the spatial distribution of the 281 plots.Most plots in Mato Grosso, Goiás, and Bahia were situated at the western, southern, and western parts of the states, respectively.
Daily accumulated precipitations and extreme air temperatures (maximums and minimums) data were obtained for each plot by mean of the Brazilian agrometeorological monitoring system (web application) AGRITEMPO [36], which collects data from conventional and automatic weather stations, by taking the closest station to each plot and downloading the daily climate data from 0 to 180 days after sowing (DAS).The data were then averaged for month 1 (0-30 DAS) to 6 (150-180 DAS) and then correlated to the observed yield only to evaluate the effects of the climate variables on yield and to help explain possible bias between observed and predicted yields.

Satellite Data Acquisition and Preprocessing
With geographic co-ordinates of all plots, contour maps (kml format) were generated using the Google Earth Pro (Google, Mountain View, CA) and then converted to shapefile format using the QGIS software [37].Spectral reflectance data and VIs were obtained as time series from the EOS-MODIS MCD43A4 V6.1 product [38].This product provides daily surface reflectance data (NBAR) adjusted for Nadir bidirectional reflectance distribution function (BRDF) effects.It offers smoothed data with a 500 m spatial Daily accumulated precipitations and extreme air temperatures (maximums and minimums) data were obtained for each plot by mean of the Brazilian agrometeorological monitoring system (web application) AGRITEMPO [36], which collects data from conventional and automatic weather stations, by taking the closest station to each plot and downloading the daily climate data from 0 to 180 days after sowing (DAS).The data were then averaged for month 1 (0-30 DAS) to 6 (150-180 DAS) and then correlated to the observed yield only to evaluate the effects of the climate variables on yield and to help explain possible bias between observed and predicted yields.

Satellite Data Acquisition and Preprocessing
With geographic co-ordinates of all plots, contour maps (kml format) were generated using the Google Earth Pro (Google, Mountain View, CA, USA) and then converted to shapefile format using the QGIS software [37].Spectral reflectance data and VIs were obtained as time series from the EOS-MODIS MCD43A4 V6.1 product [38].This product provides daily surface reflectance data (NBAR) adjusted for Nadir bidirectional reflectance distribution function (BRDF) effects.It offers smoothed data with a 500 m spatial resolution.The product uses 16-day compositions with the best pixel selection (maximum value composite) to create daily time series.The product includes spectral bands for blue (0.459-0.479 µm), green (0.545-0.565 µm), red (0.62-0.67 µm), near infrared (NIR) (0.841-0.876 µm), short-wave infrared (SWIR1) (1.23-1.25 µm), SWIR2 (1.628-1.656µm), and SWIR3 (2.105-2.155µm).Google Earth Engine (GEE) platform [39], and its JavaScript code editor, was used to extract these time series.The GEE code selects pixels within each plot contour area, excluding those within 250 m of the boundary using a buffer command.This step ensured focus on the core area of each plot.Finally, the code calculated and exported average reflectance values for each band across the entire growing season (from sowing to harvest).The reflectance image and the time series graph were visually inspected before exporting the representative time series for each plot in comma-separated values (csv).
Spectral bands were then combined in an Excel ® , (Tokyo, Japan) worksheet and the VIs were calculated, averaged in 15-day intervals from 0 to 240 DAS, and correlated to cotton yield for the each mutually exclusive interval.Reflectance data are plotted in a logarithmic scale for better comparisons among the spectral bands.In Table 2 is shown a list of the VIs and their definitions.A subset of plots (167) was randomly selected (balanced by yield) and used for generating the regression models (training set).The remaining dataset (114 plots) was used for the independent model validation (test set), employing a wellknown cross-validation method set as 60-40% (train-test).Linear regression were fitted for each combination of VI (Table 2) and time interval.Models' accuracies were measured by the Root Mean Square Error (RMSE) and determination coefficient (R 2 ) estimated from the test set.Linear regressions were performed, and the RMSE, R 2 , and statistical significance (Analysis of Variance-ANOVA, New Providence, NJ, USA) of the fittings were calculated using a spreadsheet program (Microsoft ® Excel ® 2019, version 2403).

Vegetation Indices Formulation Reference
Green Index (GI) G: green, R: red, NIR: near infrared, B: blue.

Results and Discussion
Examples of MODIS images (NIR band) and time series of all spectral reflectance bands are shown in Figure 2 for a plot with very low cotton yield (1665 kg ha −1 in season 2016; Figure 2a, left) and high yield (4965 kg ha −1 in season 2017; Figure 2a, right).The original smoothed daily time series are shown in Figure 2b and 15-day averaged time series in Figure 2c.Reflectance data are plotted in a logarithmic scale for better comparisons among the spectral bands.In general, lower reflectance values (greater absorption of electromagnetic radiation) are observed for the blue, red, and green bands due to the absorption of these light wavelengths by leaves through photosynthesis.Higher reflectance in this range can be associated to plant stresses caused by biotic and/or abiotic factors.The spectral range between 0.4 and 0.7 µm (visible region) is the photosynthetically active radiation (PAR) that is the most efficient portion of the electromagnetic spectrum used by the plant for photosynthesis.Healthy plants present high light absorption in the visible range (0.4-0.7 µm) mainly by the leaf pigments like chlorophyll and carotenoids, relatively high reflectance in the NIR range (0.7-1.1 µm) due to leaf and cell structure scattering effects, and relatively low reflectance in the SWIR range (1.1-2.5 µm) due to water and chemicals into the leaf structure [48,49].For an actual instance, the low cotton yield (left graph in Figure 2) was mainly caused by low precipitation (water stress) in the 2016 season [35], which is noted by slight increases in red and blue reflectance (lower absorbance by cotton plants), especially between 70 and 170 DAS, and decreasing NIR reflectance, as compared to the high-yield plot (right).
Figure 3a shows the average time series of the seven MODIS spectral bands for five classes of cotton yield (lower than 3, 3-3.75, 3.75-4.5,4.5-5.25,and greater than 5.25 ton ha −1 ).As expected, the reflectance of NIR and SWIR1 bands increased with higher cotton yield.Conversely, reflectance in the red, blue, SWIR2, and SWIR3 decreases as yield increases, particularly from around 60-80 DAS until harvest.This trend is confirmed by the correlation coefficients (r) between cotton yield and reflectance displayed in Figure 3b.Notably, the green band reflectance, while higher than red and blue as expected, exhibits no clear visual relationship with yield in Figure 3a. Figure 3b highlights the strongest correlations between spectral reflectance bands and cotton yield.Positive correlations are observed for NIR and SWIR1 bands, while negative correlations are found for SWIR3, red, SWIR2, and blue bands.These findings, obtained using broadband satellite sensors, are consistent with studies employing narrow-band ground measurements with spectrora-diometers.For example, Thenkabail et al. [50] reported a negative correlation between cotton biophysical parameters (leaf area index, wet biomass, plant height, and crop yield) for reflectance in the visible light range (350 nm to 700 nm) and positive correlations in the near-infrared to short-wave infrared range (730 nm to 1050 nm).
to leaf and cell structure scattering effects, and relatively low reflectance in the SWIR range (1.1-2.5 μm) due to water and chemicals into the leaf structure [48,49].For an actual instance, the low cotton yield (left graph in Figure 2) was mainly caused by low precipitation (water stress) in the 2016 season [35], which is noted by slight increases in red and blue reflectance (lower absorbance by cotton plants), especially between 70 and 170 DAS, and decreasing NIR reflectance, as compared to the high-yield plot (right).Figure 3a shows the average time series of the seven MODIS spectral bands for five classes of cotton yield (lower than 3, 3-3.75, 3.75-4.5,4.5-5.25,and greater than 5.25 ton ha −1 ).As expected, the reflectance of NIR and SWIR1 bands increased with higher cotton yield.Conversely, reflectance in the red, blue, SWIR2, and SWIR3 decreases as yield increases, particularly from around 60-80 DAS until harvest.This trend is confirmed by the correlation coefficients (r) between cotton yield and reflectance displayed in Figure 3b.Notably, the green band reflectance, while higher than red and blue as expected, exhibits no clear visual relationship with yield in Figure 3a. Figure 3b highlights the strongest correlations between spectral reflectance bands and cotton yield.Positive correlations are observed for NIR and SWIR1 bands, while negative correlations are found for SWIR3, red, SWIR2, and blue bands.These findings, obtained using broadband satellite sensors, are consistent with studies employing narrow-band ground measurements with spectroradiometers.For example, Thenkabail et al. [50] reported a negative correlation between cotton biophysical parameters (leaf area index, wet biomass, plant height, and crop yield) for reflectance in the visible light range (350 nm to 700 nm) and positive correlations in the near-infrared to short-wave infrared range (730 nm to 1050 nm).Cotton yield prediction for each 15-day interval during the growing season was attempted using simple linear regression models.The nine vegetation indices listed in Table 2 were used as independent variables and the training dataset (167 plots) was employed for model fitting.Figure 4b displays the determination coefficients (R 2 ) achieved for these VIs across DAS intervals.The highest R 2 values were obtained for the TVI (0.69), EVI (0.68), SAVI (0.67), and NDVI (0.61) during intervals between 120 and 165 DAS.This period corresponds to the cotton late-season stage, encompassing boll opening and defoliation.Figure 4a presents the average time series of these VIs for different cotton yield classes.A general trend of saturation was observed for all VIs at the peak in the time series for the highest yield classes.This effect was most pronounced for NDVI, which is consistent with previous studies reporting NDVI saturation at high levels of green biomass or leaf area index [51,52].Cotton yield prediction for each 15-day interval during the growing season was attempted using simple linear regression models.The nine vegetation indices listed in Table 2 were used as independent variables and the training dataset (167 plots) was employed for model fitting.Figure 4b displays the determination coefficients (R 2 ) achieved for these VIs across DAS intervals.The highest R 2 values were obtained for the TVI (0.69), EVI (0.68), SAVI (0.67), and NDVI (0.61) during intervals between 120 and 165 DAS.This period corresponds to the cotton late-season stage, encompassing boll opening and defoliation.Figure 4a presents the average time series of these VIs for different cotton yield classes.A general trend of saturation was observed for all VIs at the peak in the time series for the highest yield classes.This effect was most pronounced for NDVI, which is consistent with previous studies reporting NDVI saturation at high levels of green biomass or leaf area index [51,52].
The results suggest that TVI, EVI, and SAVI hold promise for in-season cotton yield estimation within the training dataset, as they yielded the highest determination coefficients for yield prediction. Table 3 presents the linear regression models for the four VIs with the highest R² and statistically significant overall fit (analysis of variance, ANOVA, p-value< 0.05) of the adjusted models for different 15-day intervals, ranging from 75-90 to 180-195 DAS.The table also includes models for the average VI from 75 DAS to harvest (75-195 DAS) and the peak of the observed VI values throughout the season.Figure 5 illustrates the relationships between cotton yield and TVI for different DAS intervals.While the highest R 2 (0.73) was achieved for the 75-195 DAS interval for TVI (Figure 5) and the other VIs (Table 3), these equations would only be practical at the very end of the season since they require data accumulation until 195 DAS for averaging.For in-season forecasting, models derived from VI data between 105-120 DAS and 165-180 DAS yielded the most promising results.However, intervals like 150-165 and 165-180 DAS are too close to harvest and may not provide sufficient lead time for farmers to implement management interventions effectively.Peak VI values have been explored in various studies for in-season yield estimation [14,53].In this study, the peak models exhibited lower R 2 values compared to those obtained between 105-120 and 165-180 DAS (Table 3) but remained higher than those for 90-105 DAS interval.This suggests potential for earlier in-season forecasts considering that peak VI values for TVI, EVI, SAVI, and NDVI typically occur around 80 to 100 DAS, as seen in Figure 4.The results suggest that TVI, EVI, and SAVI hold promise for in-season cotton yield estimation within the training dataset, as they yielded the highest determination coefficients for yield prediction. Table 3 presents the linear regression models for the four VIs with the highest R 2 and statistically significant overall fit (analysis of variance, ANOVA, p-value < 0.05) of the adjusted models for different 15-day intervals, ranging from 75-90 to 180-195 DAS.The table also includes models for the average VI from 75 DAS to harvest (75-195 DAS) and the peak of the observed VI values throughout the season.Figure 5 illustrates the relationships between cotton yield and TVI for different DAS intervals.While the highest R 2 (0.73) was achieved for the 75-195 DAS interval for TVI (Figure 5) and the other VIs (Table 3), these equations would only be practical at the very end of the season since they require data accumulation until 195 DAS for averaging.For in-season forecasting, models derived from VI data between 105-120 DAS and 165-180 DAS yielded the most promising results.However, intervals like 150-165 and 165-180 DAS are too close to harvest and may not provide sufficient lead time for farmers to implement management interventions effectively.Peak VI values have been explored in various studies for in-season yield estimation [14,53].In this study, the peak models exhibited lower R 2 values compared to those obtained between 105-120 and 165-180 DAS (Table 3) but remained higher than those for 90-105 DAS interval.This suggests potential for earlier in-season forecasts considering that peak VI values for TVI, EVI, SAVI, and NDVI typically occur around 80 to 100 DAS, as seen in Figure 4. Model estimates for cotton yields using TVI equations (Table 3) are compared to observed yields alongside their respective RMSE in Figure 6. Figure 7 illustrates variations in RMSE (Figure 7a) and R 2 (Figure 7b) across different DAS intervals for the four best VIs.The lowest RMSE and highest R 2 were achieved by EVI and TVI between 90-105 and 135-150 DAS (four DAS intervals), enabling in-season yield forecasting with RMSE of approximately 750 kg ha −1 (Figure 7a).Averaged VIs for the period 75-195 DAS and peak models displayed comparable RMSE values to the best models (EVI, TVI, and SAVI) for 15-day intervals (90-105 to 135-150 DAS) (Figure 7c).However, for EVI, the peak model exhibited the lowest RMSE (726 kg ha −1 ). Figure 7d presents the average yield predictions for the 114 testing plots at 105-120 DAS, 75-195 DAS, and peaks.The average observed yield for the 114 plots stood at 4372 kg ha −1 (represented by the dotted horizontal line in Figure 7d), with most linear models tending to overestimate yield by approximately 160 kg ha −1 on average.Overall, there were tendencies of overestimation for lower-yield plots and underestimation for higher-yield plots, as evidenced in Figure 6 for TVI.However, the 75-195 DAS model appears to reduce such bias.Although NDVI provided a closer average yield forecast for the 114 plots compared to the measured average yield (Figure 7d), its RMSE was significantly higher than the other VIs (EVI, TVI, and SAVI) for all intervals between 75 and 165 DAS.Therefore, the best performances were achieved with EVI and TVI for the 105-120, 120-35, and 135-150 DAS intervals.If an earlier prediction is required, the 90-105 DAS or peak models are preferred, as the RMSE for 75-90 DAS and earlier DAS intervals were excessively high, and the R 2 values were insignificant.Linear equations shown in Table 3.  3.
The observed trend of overestimation for low-yield plots and underestimation for high yields (Figure 6) can be further examined by correlating the residual error (yield estimated − yield observed ; estimated by the peak model with TVI) for the observed yield of each plot (Figure 8a).In this analysis, negative residual errors (indicating underestimation) are evident for yields higher than approximately 4000 kg ha −1 , while positive residual errors (indicating overestimation) are observed for lower yield values.Several factors may contribute and explain the variation in low and high yield plots, such as climate conditions, soil type, soil fertility, diseases, cultivars, and others.However, in rainfed cotton production, one of the primary driven factors affecting yield has consistently been climatic condition, particularly variations in precipitation and temperature throughout the season.Consequently, the choice of sowing date and cycle duration, determined based on the recommended cultivar and local climatic patterns, is influenced by water demands and temperature fluctuations during each stage of cotton development, ultimately impacting the expected cotton yield.For the testing dataset, the earliest sowing date was 19 November.Therefore, observed yield was correlated with sowing dates after 1 November as a proxy for sowing data, independent of the year (Figure 8b), revealing a negative influence on yield.Additionally, the cotton cycle duration, primarily dictated by the cultivar used, exhibited a positive correlation with yield (Figure 8c).These two parameters are intrinsically linked to climatic conditions and both significantly influence observed yields, with a discernible trend of increasing yield associated with earlier sowing dates and longer cotton cycle durations.3).3).
The most significant correlations between monthly accumulated precipitation and yield were observed in the third and fourth months after sowing (r = 0.59 for 60-90 DAS and 0.39 for 90-120 DAS; Figure 8d), corresponding to the mid-season of cotton growth, from canopy closure until flowering and boll development stages.In the first month, rainfall demonstrated an adverse effect on yield (r = −0.29),as high-intensity rainfall during the early season can potentially damage the germination process.Monthly averaged maximum and minimum temperatures, along with their differences (T max − T min ) also exerted an influence on yield, particularly after the third month (r = −0.49).
Several studies, such as those listed in Table 1, have investigated the optimal timing for acquiring satellite and UAV images or ground measurements for cotton yield predictions.Some studies have relied on single-date images, while others have assessed images at specific phenological stages or analyzed complete time series data.Table 4 provides a summary of the optimal periods, expressed in DAS, as inferred from the literature cited in Table 1.The most favorable correlations between cotton yield and the VIs have been reported across various stages, ranging from the flowering and boll development period [12,18,22,[26][27][28][29][30][31] to boll opening, maturation, and defoliant application [10,[15][16][17]19,21,27,30].Notably, these findings align with the results obtained in the present study.Specifically, the study conducted by Lang et al. [10] in Xinjiang Province, China, spanning from 2012 to 2019 and encompassing 355 plots, yielded results strikingly similar to ours, with the lowest RMSE and highest R 2 observed during the fourth and fifth months after sowing (90-120 to 120-150 DAS).3).The observed trend of overestimation for low-yield plots and underestimation for high yields (Figure 6) can be further examined by correlating the residual error (yieldestimated − yieldobserved; estimated by the peak model with TVI) for the observed yield of each plot (Figure 8a).In this analysis, negative residual errors (indicating underestimation) are evident for yields higher than approximately 4000 kg ha −1 , while positive residual errors (indicating overestimation) are observed for lower yield values.Several factors may contribute and explain the variation in low and high yield plots, such as climate conditions, soil type, soil fertility, diseases, cultivars, and others.However, in rainfed cotton production, one of the primary driven factors affecting yield has consistently been climatic condition, particularly variations in precipitation and temperature throughout the season.Consequently, the choice of sowing date and cycle duration, determined based on the recommended cultivar and local climatic patterns, is influenced by water demands and temperature fluctuations during each stage of cotton development, ultimately impacting the expected cotton yield.For the testing dataset, the earliest sowing date was 19 November.Therefore, observed yield was correlated with sowing dates after 1 November as a proxy for sowing data, independent of the year (Figure 8b), revealing a negative influence on yield.Additionally, the cotton cycle duration, primarily dictated by the cultivar used, exhibited a positive correlation with yield (Figure 8c).These two parameters are intrinsically linked to climatic conditions and both significantly influence observed yields, with a discernible trend of increasing yield associated with earlier sowing dates and longer cotton cycle durations.
The most significant correlations between monthly accumulated precipitation and yield were observed in the third and fourth months after sowing (r = 0.59 for 60-90 DAS and 0.39 for 90-120 DAS; Figure 8d), corresponding to the mid-season of cotton growth, from canopy closure until flowering and boll development stages.In the first month, rainfall demonstrated an adverse effect on yield (r = −0.29),as high-intensity rainfall during the early season can potentially damage the germination process.Monthly averaged maximum and minimum temperatures, along with their differences (Tmax−Tmin) also exerted an influence on yield, particularly after the third month (r = −0.49).UAV 90-120 [29] Active Sensor 60-100 [16] Satellite 105 [22] Spectroradiometer 75 * single date applied.
The most effective forecasting models for the mid to late-season period (90 to 150 DAS) on a regional scale can serve as valuable advance information for various stakeholders, including commodity traders, policymakers, governments, and, in some instances, farmers.This information can aid farmers in logistical planning and preparing for the next crop season, such as determining fertilizer needs based on the nutrient exportation from the previous crop.However, for certain in-season interventions like topdressing fertilizer applications, some forecasting models may not be practical.This is because most topdressing fertilizers (both macro and micronutrients) are typically applied before 100 DAS.Nonetheless, earlier prediction using peak models still offers a window of opportunity for topdressing fertilizer intervention.For instance, by around 80 DAS, cotton plants have absorbed 45%, 50%, and 80% of sulfur, potassium, and nitrogen, respectively [54].

Conclusions
The forecasting methodology, employing simple regression models and time series intervals for cotton yield prediction using the MODIS product MCD43A4 V6.1 allowed in-season prediction with an RMSE of approximately 750 kg ha −1 at a regional scale across three cotton-producing states in the Brazilian Cerrado.This straightforward approach offers ease of application and can be readily utilized for predicting seed cotton yield at farm, region, and national levels.One limitation of the approach is the coarse spatial resolution of the MODIS product, which restricts its applicability primarily to large commercial plots, characteristic of crop production systems in the Brazilian Cerrado.
Among the nine VIs evaluated, EVI and TVI emerged as the most effective individual predictors.However, it is important to note that accuracies, as assessed by mean of RMSEs, were notably low up to 75 DAS, presenting a limitation to the application of this approach for estimating cotton yield during the initial stages of cotton development.The most reliable in-season predictions were achieved through 15-day intervals ranging from 90-105 to 135-150 DAS, corresponding to the mid to late stages of cotton development (including boll development, open boll, and fiber maturation).For earlier stages (below 90-105 DAS), the most accurate forecasts were obtained from the model fitted for peaks using EVI and TVI, typically occurring around 80-90 DAS.
Future validation experiments should focus on assessing the accuracies and practical utility of various models in upcoming cotton seasons across different sub-regions.This evaluation should include their ability to predict cotton net production at various scales, ranging from individual farms to states, regions, and even at the national level.When considering the applicability of models at the farm level for specific management interventions, such as within-season topdressing fertilizer application or planning for the next crop based on exported nutrients, caution is warranted.Additional validations or complimentary approaches may be necessary to ensure the reliability and effectiveness of these models in practical agricultural decision making.
The proposed approach can be extended to corn, which is also commonly grown as a second harvest in the Brazilian Cerrado.Additionally, it could be adapted for soybean, the primary first crop cultivated in Brazil, despite facing challenges such as increased cloud interference in satellite images due to its cultivation during the rainy season.

Figure 1 .
Figure 1.Spatial location of the 281 plots cultivated with cotton in the states of Mato Grosso (MT), Goiás (GO), and Bahia (BA) (typical Brazilian Cerrado biome) used to train (yellow dots) and test the models (blue dots) and three cotton plots zoomed (Plots 1, 2, and 3).

Figure 1 .
Figure 1.Spatial location of the 281 plots cultivated with cotton in the states of Mato Grosso (MT), Goiás (GO), and Bahia (BA) (typical Brazilian Cerrado biome) used to train (yellow dots) and test the models (blue dots) and three cotton plots zoomed (Plots 1, 2, and 3).

Figure 2 .
Figure 2. MODIS product MCD43A4 V6.1: (a) NIR images obtained for two commercial cotton production plots; (b) original extracted time series (daily data) for red, NIR, blue, green, and SWIR spectral bands; (c) time series in 15-day interval averages.Plot on left presented low cotton yield on right high cotton yield.

Figure 2 .
Figure 2. MODIS product MCD43A4 V6.1: (a) NIR images obtained for two commercial cotton production plots; (b) original extracted time series (daily data) for red, NIR, blue, green, and SWIR spectral bands; (c) time series in 15-day interval averages.Plot on left presented low cotton yield on right high cotton yield.

Figure 3 .
Figure 3. Average time series of reflectance for seven MODIS spectral bands for five cotton yield classes (a) and the linear correlation coefficients (r) between cotton yield and reflectance for the different 15-day days after sowing (DAS) intervals in the training dataset (167 plots) (b).

Figure 3 .
Figure 3. Average time series of reflectance for seven MODIS spectral bands for five cotton yield classes (a) and the linear correlation coefficients (r) between cotton yield and reflectance for the different 15-day days after sowing (DAS) intervals in the training dataset (167 plots) (b).

Figure 4 .
Figure 4. Time series of NDVI, EVI, SAVI, and TVI from MODIS sensor for five cotton yield classes (a) and determination coefficients (R 2 ) of linear correlations between cotton yield and 9 vegetation indices (Table 2) (b) for the different 15-day days after sowing (DAS) classes for the training dataset (167 plots).

Figure 4 .
Figure 4. Time series of NDVI, EVI, SAVI, and TVI from MODIS sensor for five cotton yield classes (a) and determination coefficients (R 2 ) of linear correlations between cotton yield and 9 vegetation indices (Table 2) (b) for the different 15-day days after sowing (DAS) classes for the training dataset (167 plots).

Figure 4 .
Figure 4. Time series of NDVI, EVI, SAVI, and TVI from MODIS sensor for five cotton yield classes (a) and determination coefficients (R 2 ) of linear correlations between cotton yield and 9 vegetation indices (Table 2) (b) for the different 15-day days after sowing (DAS) classes for the training dataset (167 plots).

Figure 5 .
Figure 5. Linear correlations between TVI and seed cotton for different 15-day DAS for the training dataset, from 75-90 to 180-195 DAS, 75-195 DAS (averaged period) and the time series peak value.Linear equations shown in Table3.

Figure 5 .
Figure 5. Linear correlations between TVI and seed cotton for different 15-day DAS for the training dataset, from 75-90 to 180-195 DAS, 75-195 DAS (averaged period) and the time series peak value.Linear equations shown in Table3.

Figure 7 .
Figure 7. Root mean square error (RMSE) (a) and linear determination coefficients (R 2 ) (b) between predicted and observed yield for the 114 plots of the testing dataset for different 15-day DAS; comparison of RMSE (c) and predicted yield (d) for 105-120 DAS, 75-195 DAS, and peak equations for TVI, EVI, SAVI, and NDVI.Dotted horizontal line in d) represents the average observed yield for the 114 plots.

Figure 7 .
Figure 7. Root mean square error (RMSE) (a) and linear determination coefficients (R 2 ) (b) between predicted and observed yield for the 114 plots of the testing dataset for different 15-day DAS; comparison of RMSE (c) and predicted yield (d) for 105-120 DAS, 75-195 DAS, and peak equations for TVI, EVI, SAVI, and NDVI.Dotted horizontal line in d) represents the average observed yield for the 114 plots.

Figure 7 .
Figure 7. Root mean square error (RMSE) (a) and linear determination coefficients (R 2 ) (b) between predicted and observed yield for the 114 plots of the testing dataset for different 15-day DAS; comparison of RMSE (c) and predicted yield (d) for 105-120 DAS, 75-195 DAS, and peak equations for TVI, EVI, SAVI, and NDVI.Dotted horizontal line in (d) represents the average observed yield for the 114 plots.

Figure 8 .
Figure 8. Relationships between the residual error (Yieldestimated−Yieldobserved; estimated by TVI peak model) and observed cotton yield (a); observed yield and sawing day from 1 November (b); observed yield and cotton cycle (c); and correlation coefficients (r) between observed yield and

Figure 8 .
Figure 8. Relationships between the residual error (Yield estimated − Yield observed ; estimated by TVI peak model) and observed cotton yield (a); observed yield and sawing day from 1 November (b); observed yield and cotton cycle (c); and correlation coefficients (r) between observed yield and monthly accumulated precipitation, monthly averaged minimum (Tmin) and maximum (Tmax) temperatures, and Tmax − Tmin (d) for month 1 to 6 after sowing.

Table 1 .
A review of previous studies for estimation of seed cotton yield using remote sensing.

Table 2 .
Vegetation indices used in this study.

Table 3 .
Linear regression models for predicting cotton yield from averaged 15-day TVI, EVI, SAVI, and NDVI vegetation indices, averaged 75-195 DAS and peak (maximum value in the time series), using the training dataset.
Y: seed cotton yield (kg ha −1 ); Peak: maximum value of the VI in the time series; p-value determined by ANOVA.

Table 4 .
Optimal periods for remote sensing (RS) image or data acquisition, expressed in days after sowing (DAS), considering best correlations between yield and VIs from different studies.