1. Introduction
Monitoring and predicting crop phenology, growth, and yield is crucial for global food security, market dynamics, policy making, and decision making [
1]. Accurate early-season estimates of crop yield give farmers an indication of their production, enabling them to assess risks, determine insurance premiums, and evaluate input costs [
2]. In addition to supporting individual farmers, yield forecasting contributes to a broader understanding of the complex interplay between environmental factors and management practices in agriculture [
3]. This understanding facilitates the development of more effective and flexible within-season management strategies [4, 5] and enables the anticipation of market demand by forecasting supply [6].
Traditional methods such as manual sampling and field campaigns are labor-intensive and provide limited insight into the spatial variability of crop yield. These limitations have led to the development of alternative approaches for estimating yields during the growing season [
7]. Using advanced technologies and data-driven methods, these yield forecasts not only reduce the work and time spent on measurements, but also improve the spatial coverage and accuracy of the obtained information [
1]. In recent years, there have been remarkable advances in yield forecasting methods that use Earth Observation (EO) data, i.e., satellite remote sensing imagery acquired by modern missions. This technology has become a valuable source of information for monitoring agricultural practices. By exploiting EO data, within-season (or early season) crop yield forecasts can now be provided, enabling farmers and stakeholders to make informed decisions leading to optimal production [
8]. This advancement primarily stems from the availability of open near real-time (NRT) EO datasets acquired by optical instruments on satellites such as SPOT, MODIS, PROBA-V, and Landsat, which have played a crucial role in operational crop monitoring [
9]. These instruments offer several benefits, including daily revisit cycles, global coverage, long-term data archives, and low- or no-cost accessibility. However, there is still a need for generic time series analysis methods for crop mapping and monitoring that can be deployed at a large scale and that take advantage of the global coverage and the high spatial and temporal resolution provided by modern Earth observation missions.
The launches of Sentinel-2A (S2A) in late 2015 and Sentinel-2B (S2B) in early 2017 have significantly enhanced crop monitoring capabilities. S2A provides a 10-day revisit period over Europe and Africa, and 20 days elsewhere, while the addition of S2B has ensured a 5-day revisit period worldwide since February 2018 [
10]. This unprecedented revisit time is particularly suitable for in-season crop monitoring. Unlike previous EO missions, Sentinel-2 enables the derivation of red-edge-based vegetation indices, which exhibit stronger correlations with agronomic parameters than red-based indices [
11]. The combination of Copernicus’ free and open access policy with the high resolution of Sentinel-2 images allows for the construction of dense and consistent time series throughout the crop growth cycle in most regions of the world [
12]. Consequently, Sentinel-2 satellite imagery, with its spectral bands (visible, near infrared, red-edge, and short-wave infrared) and spatial resolutions (10 m, 20 m, and 60 m), has been successfully exploited in recent years for modeling crop grain yield at field and within-field scales [1, 13, 14].
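To make the distinction between red-based and red-edge-based indices concrete, the following minimal sketch computes two widely used normalized-difference indices from surface reflectance values. The choice of NDVI and NDRE, and the function names, are illustrative assumptions; the text above does not state which indices are used in this study.

```python
def normalized_difference(a, b):
    """Return (a - b) / (a + b), or None when the denominator is zero."""
    denom = a + b
    if denom == 0:
        return None
    return (a - b) / denom


def ndvi(nir, red):
    """Red-based index: near-infrared vs. red reflectance."""
    return normalized_difference(nir, red)


def ndre(nir, red_edge):
    """Red-edge-based index: near-infrared vs. red-edge reflectance."""
    return normalized_difference(nir, red_edge)


# Hypothetical reflectance values for a single pixel (not real data).
pixel = {"red": 0.1, "red_edge": 0.3, "nir": 0.5}
print(ndvi(pixel["nir"], pixel["red"]))
print(ndre(pixel["nir"], pixel["red_edge"]))
```

Both indices share the same functional form and differ only in the band paired with the near infrared, which is why a sensor with red-edge bands adds information without changing the processing chain.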
Before the advent of Sentinel-2, Landsat satellite data played a crucial role in accurately mapping crop types and predicting yields at the field level in agricultural landscapes worldwide [
15]. With a combined temporal resolution of 8 days, Landsat 7 and 8 (before 2022) and Landsat 8 and 9 (after 2022) images offer a tangible advantage in crop monitoring when coupled with Sentinel-2, complementing the latter with the information provided by the Landsat thermal bands [
16]. The Landsat missions are currently the only constellation equipped with thermal bands provided at a spatial resolution (100 m) adequate for precise monitoring at the scale of individual fields. Land surface temperature (LST), derived from these thermal bands, is used to monitor heat stress and drought, which can explain some of the variability in yields between years [2, 17]. Indeed, optical remote sensing data mainly capture information on the upper canopy, so relying solely on them early in the season can make it difficult to detect the onset of drought. Because drought symptoms tend to appear earlier in the lower leaves, the negative effects of drought on yield may be underestimated [
18]. At the county scale, studies have shown a negative correlation between MODIS diurnal LST and mid-summer corn yield forecasts [
2]. However, satellite-based agricultural modeling studies mainly focus on vegetation indices derived from the visible and near-infrared parts of the electromagnetic spectrum, and the potential of LST data is often neglected [
19]. We therefore considered it important to evaluate LST as an input variable in our task of early season crop yield prediction.
When it comes to yield estimation methods, mechanistic crop growth models are the standard choice, as they are designed and calibrated to simulate yield formation processes using soil information, climate and farm management practices. They enable crop yields to be predicted at any time and in any place [
5], but the need for extensive (and often costly) data on field-specific biotic and abiotic factors limits the large-scale deployment of these approaches over the ongoing season [2, 15, 20]. In contrast, machine learning (ML) algorithms can handle complex relationships between predictors and the target variables, leading to their increased use in the agriculture domain [21, 22, 23].
A key limitation of ML methods for crop yield prediction is their dependence on data acquired under specific local conditions, which may result in inaccurate forecasts when confronted with data acquired under unseen conditions not included in the model’s training data [
24]. This may be partly explained by possible differences in crop progress and climate/environmental condition changes from season to season, or from location to location, where these seasonal changes in phenology are primarily influenced by temperature and water regimes [
25]. To address such generalization issues in the context of yield prediction, a possible solution is to resample satellite image time series over periods defined in thermal time, i.e., the number of growing degree-days accumulated from the sowing date, rather than in calendar time [
26], with the aim of mitigating possible shifts in crop phenology at a given date due to different temperature regimes.
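As an illustration of resampling over thermal time, the sketch below accumulates growing degree-days (GDD) from the sowing date and averages satellite observations within fixed-width thermal-time bins. It is a minimal example under stated assumptions, not the procedure used in this study: the base temperature of 10 °C, the cap of 30 °C, the bin width, and all function names are hypothetical choices made for the example.

```python
from itertools import accumulate

T_BASE = 10.0  # assumed base temperature in deg C (illustrative choice)
T_CAP = 30.0   # assumed upper temperature cap in deg C (illustrative choice)


def daily_gdd(t_min, t_max, t_base=T_BASE, t_cap=T_CAP):
    """Growing degree-days for one day, with both temperatures clamped
    to the interval [t_base, t_cap] before averaging."""
    t_min = min(max(t_min, t_base), t_cap)
    t_max = min(max(t_max, t_base), t_cap)
    return (t_min + t_max) / 2.0 - t_base


def cumulative_gdd(t_mins, t_maxs):
    """Thermal time accumulated since sowing (index 0 is the sowing day)."""
    return list(accumulate(daily_gdd(lo, hi) for lo, hi in zip(t_mins, t_maxs)))


def resample_on_thermal_time(obs_days, obs_values, cum_gdd, bin_width):
    """Average observations that fall in the same thermal-time bin.

    obs_days   -- days since sowing of each satellite observation
    obs_values -- observed values (e.g. a vegetation index) on those days
    cum_gdd    -- cumulative GDD per day since sowing
    bin_width  -- width of each thermal-time bin, in GDD
    Returns {bin_index: mean value}; bins without observations are absent.
    """
    sums, counts = {}, {}
    for day, value in zip(obs_days, obs_values):
        b = int(cum_gdd[day] // bin_width)
        sums[b] = sums.get(b, 0.0) + value
        counts[b] = counts.get(b, 0) + 1
    return {b: sums[b] / counts[b] for b in sorted(sums)}


# Hypothetical four-day temperature record and three observations.
gdd = cumulative_gdd([10, 10, 10, 10], [30, 30, 30, 30])
features = resample_on_thermal_time([0, 1, 3], [0.2, 0.4, 0.8], gdd, bin_width=25)
```

With this indexing, two fields sown on the same date but exposed to different temperatures place the same calendar-day observation in different bins, so each bin corresponds more closely to a comparable stage of crop development.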
This study uses a forecast cut-off date of mid-August, approximately two months prior to the corn harvest. This timeframe corresponds to the initial stages of maize reproduction, including heading and pollination, with regional variations influenced by planting dates [
27]. At this point, grain filling begins: kernel size starts to develop, while the kernel count has already been established in prior stages. Consequently, crop conditions and weather patterns before this stage can significantly impact yield potential. The objective of this study is to contribute to understanding how remote sensing data can be used to estimate in-season corn yield with machine learning methods.
From a methodological point of view, our approach is based on a real-life deployment scenario of a machine learning framework, in which a model is trained using reference data collected during previous seasons. With regard to the selection and processing of predictors, our methodology includes (1) the incorporation of thermal time to account for differences in the phenological progress of crops, which depends on temperature regimes rather than on a fixed number of days, (2) the integration of Land Surface Temperature (LST) and agroclimatic stress indicators to address the limitation of optical imagery, which does not reach its full potential during the growing season, and (3) the exploitation of comprehensive historical information covering the whole season to improve predictions for the current early season forecasting task.
In this study, a multi-year proprietary dataset is utilized, comprising data from 1319 corn production fields predominantly situated across 29 counties within the U.S. Corn Belt spanning the years 2017 to 2022. The dataset includes the average yield of each field, and for evaluation purposes, each year is assessed independently, with the remaining years serving as the training dataset for the model.
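The year-wise evaluation described above can be sketched as a leave-one-year-out split, in which each year is held out in turn and the remaining years form the training set. This is a minimal illustration, assuming each field record carries its harvest year; the function name and the sample years are hypothetical and do not come from the dataset.

```python
def leave_one_year_out(years):
    """Yield (held_out_year, train_indices, test_indices) for each distinct year.

    years -- sequence giving the year of each field record
    """
    for held_out in sorted(set(years)):
        train = [i for i, y in enumerate(years) if y != held_out]
        test = [i for i, y in enumerate(years) if y == held_out]
        yield held_out, train, test


# Hypothetical years for six field records (not real data).
field_years = [2017, 2017, 2018, 2019, 2019, 2020]
splits = list(leave_one_year_out(field_years))

for year, train_idx, test_idx in splits:
    print(year, len(train_idx), len(test_idx))
```

Splitting by year rather than at random keeps all fields of the evaluated season out of training, which mirrors the deployment scenario where the model only sees reference data from previous seasons.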
The rest of this paper is organized as follows. Section 2 introduces the data available on the study site, and Section 3 presents the proposed framework and the associated experimental settings. The results are reported and discussed in Section 4. Finally, Section 5 concludes the paper.