1. Introduction
The global use of renewable energy sources has expanded quickly in recent years, mainly due to efforts to achieve net-zero emissions by 2050, the global commitment to reduce greenhouse gas emissions to close to zero [
1]. Solar, wind, hydro, and biomass are the transition drivers to an energy system with lower carbon emissions [
2]. Solar photovoltaic (solar PV) is considered crucial to world energy transition and is leading the growth of renewables worldwide. In 2021, 56% of the new renewable generating capacity in the world came from solar PV. In 2022, solar PV achieved a significant milestone: more than 1 terawatt of solar capacity, breaking its annual installation records for the ninth consecutive year [
3]. Despite these achievements and the expectation of new records in the coming years, the energy transition to lower emission levels is still far from complete [
4], so new project deployment must accelerate to meet the 2050 targets.
In terms of installed solar PV capacity by country at the end of 2022, China was in first place, followed by the United States and Japan, while Brazil was in eighth position [
5]. In 2022, renewable energy sources accounted for 88% of the Brazilian electricity mix, with hydro, wind, and solar energy being the main contributors [
6]. Solar PV rose from 2455 MW in 2018 to 47,033 MW in August 2024, an increase of 1816% [
7].
Brazil has great potential for photovoltaic power generation, especially in the Northeast, Midwest, and part of the Southeast regions, which receive very high values of solar irradiance (annual average between 5000 and 6200 Wh/m²) [
8]. However, solar power generation is intermittent due to climatic factors such as solar irradiance, temperature, cloudiness, and precipitation [
8], making accurate forecasting challenging. Therefore, simulations and forecasts of solar power generation are crucial for operational planning, where better use of existing resources is desired, and for electrical system expansion, energy transition, and medium or long-term planning.
With the growing interest in renewable energy, several key issues have emerged as crucial for the energy sector to facilitate the transition to sustainable energy solutions. These include improving energy efficiency in homes to reduce greenhouse gas emissions [
9] and understanding the stochastic nature of renewable energy sources for effective planning, which requires advanced methodologies to analyze the interdependence between different renewable sources [
10].
Recent studies have presented applications with meteorological variables in forecasting models [
11,
12], analyzing and comparing the impacts of using solar irradiance and wind speed as the input data [
13,
14] to demonstrate how close solar irradiance is to PV output. Furthermore, irradiance can be an input variable in simulation and forecasting models to help identify patterns and solve problems with missing values in historical PV outputs [
15,
16,
17,
18], as shown in the systematic review of Ahmed et al. (2020) [
19] on photovoltaic solar energy forecasting. In other studies, regression models are used to optimize the tilt and azimuth of solar collectors using climate variables [
20,
21].
Local climatic variables can be obtained from measurements at observation points. Although weather stations are distributed throughout Brazilian territory, in many localities the data are scarce, with significant missing values or even a complete absence of time series for extended periods. In this context, climate reanalysis datasets, which combine historical observations with weather models through data assimilation to recreate past weather patterns [
22,
23], serve as an alternative to replace or supplement the measured data [
15,
24].
A growing interest in reanalysis datasets has been seen in recent studies applying climate data to the energy sector. Despite being considered to have lower accuracy than satellite data [
25,
26], these datasets are globally available, easy to access, free of charge, and provide long-term hourly historical records. Many studies focus on applying meteorological data, such as wind speed in wind power generation models [
27,
28,
29] or solar radiation in photovoltaic generation models [
15,
17,
30,
31,
32].
Other researchers have focused on checking the quality of datasets by comparing them with locally measured data [
33,
34,
35,
36,
37,
38,
39,
40] to determine whether the database can be used for a specific purpose [
26]. In addition, there are studies applying methods to reduce the bias of reanalysis data [
38,
41] and test the possibility of using reanalysis data to complete missing values in climate time series [
15].
Specifically for solar energy, much of the current literature on reanalysis datasets pays particular attention to data quality by comparing them with satellite-based data or ground measurements. The Baseline Surface Radiation Network (BSRN) is often used. BSRN is a solar radiation monitoring network whose data are centralized at the World Radiation Monitoring Center (WRMC). It has 76 stations distributed worldwide, of which only 51 are active, recording at a resolution of 1 to 3 min [
42,
43]. Of the stations currently available from the BSRN, only four active stations are located in Brazil.
However, in searching the literature, the authors found few studies analyzing solar irradiance from reanalysis datasets in Brazilian territory [
26,
33,
44,
45]. Some have studied reanalysis in a global approach, comparing these products to BSRN stations [
26,
33].
Therefore, a gap remains given the limited number of studied locations since only the BSRN database has been considered. As discussed in [
44], the results presented in one location should not be regarded as true for all places. Thus, to have a broader view of the Brazilian case, data from the National Institute of Meteorology (INMET), an agency under the Brazilian Ministry of Agriculture, can be used. INMET is responsible for providing national meteorological information, and its data serve as measured observations for comparison with reanalysis datasets. The data are available in a digital Meteorological Database (
Banco de Dados Meteorológicos do INMET—BDMEP) and follows the World Meteorological Organization’s technical measurement standards [
46]. So far, no studies have been found comparing GHI from reanalysis databases with INMET data. This study offers a valuable starting point for expanding knowledge by exploring a larger territory and providing key inputs for decision-making in solar energy applications.
The primary objective of this study is to assess the suitability of global horizontal irradiance (GHI) data from reanalysis datasets, comparing them with ground-based measurements across multiple locations in Brazil. This expands the analysis beyond the limited locations typically examined in the literature [
26,
33,
44,
45]. By analyzing the performance of climate reanalysis datasets, this study aims to provide insights into their applicability for developing models that help the electricity sector make better decisions about how to distribute electricity and overcome the challenges of operating the interconnected Brazilian electricity system.
Although numerous reanalysis datasets are available, not all of them provide hourly data or have full coverage of the Brazilian territory. The datasets selected for this study were chosen specifically for their global coverage and demonstrated utility in previous studies [
24,
26,
28,
29,
30,
31,
32,
33,
34,
35,
36,
37,
38,
39,
40,
41]. These datasets include: (a) the Modern-Era Retrospective Analysis for Research and Applications version 2 (MERRA-2), developed by NASA [
22]; (b) the Fifth Generation European Reanalysis (ERA5), developed by the Copernicus Climate Change Service (C3S) at the European Centre for Medium-Range Weather Forecasts (ECMWF) [
23]; (c) ERA5-Land, also developed by C3S at ECMWF [
47]; and (d) the Climate Forecast System version 2 (CFSv2), developed by the National Centers for Environmental Prediction (NCEP) [
48].
To achieve the research objective, GHI time series from reanalysis datasets are compared with hourly ground-based measurements from BDMEP. As a secondary objective, this study examines each reanalysis dataset and its characteristics to expand knowledge about their use in Brazilian energy applications.
This article is divided into five sections, including this introduction.
Section 2 describes the methodology and data used in the research;
Section 3 presents the results;
Section 4 discusses the results; and finally, in
Section 5, the main conclusions are drawn.
3. Results
Figure 3 depicts the hourly aggregated GHI behavior of all the reanalysis datasets compared to the GHI observed at the INMET meteorological stations after the data treatment. CFSv2 shows values that are far from the average of the other datasets in the first and last hours of the day, as well as greater dispersion. MERRA-2 shows the least dispersion across the hours. The average values from ERA5 and ERA5-Land follow a very similar pattern and are higher than the hourly average of the measured data, while the MERRA-2 values are lower.
Table 2 presents the descriptive statistics of all databases. Concerning the means and standard deviations, ERA5 and ERA5-Land are close to the observed values, but the median values are different. MERRA-2 has a lower standard deviation, which corroborates what was visualized in the hourly boxplot (
Figure 3).
Concerning the calculated metrics,
Figure 4 presents boxplots of the averaged relative metrics to allow comparison of the results between the datasets. MERRA-2 presents the smallest errors for rMBE, rMAE, and rRMSE. The relative metrics express each error as a percentage of the average of the measured values, which facilitates comparison between reanalysis databases. All the tables and graphs in this section show the relative metrics.
Table 3 summarizes the results of rMBE, rMAE, rRMSE, and PCC. For rMBE, MERRA-2 has the lowest value when the mean is considered, while ERA5 has the best result when the median is considered. For rMAE, MERRA-2 has the best value for both the mean and the median, which is also observed for rRMSE and PCC.
A negative value of MBE and rMBE means that the reanalysis dataset underestimates the observed GHI, and a positive value indicates that it overestimates it. The MAE and the rMAE, on the other hand, consider the modulus of the difference between the observed and the estimated (reanalysis) data. Unlike MBE and rMBE, MAE and rMAE do not disguise the error, because negative values do not cancel positive ones, which avoids the false impression that the error is smaller than it is. To complete the error metrics, the RMSE and rRMSE give greater weight to the largest deviations, since the errors are squared before the mean and the square root are calculated. The RMSE and rRMSE increase considerably when the data variation is high and when there are outliers in the series. For GHI, the values during the day have great amplitude, and if the temporal alignment is not done properly, the bias that exists in the reanalysis databases can increase, affecting this metric.
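The behavior of these metrics can be illustrated with a minimal sketch. It assumes the usual definitions (error taken as estimated minus observed, relative metrics normalized by the mean of the observed values); the function name `error_metrics` and the sample values are hypothetical and are not taken from the study data.

```python
import math

def error_metrics(observed, estimated):
    """Return MBE, MAE, RMSE, their relative versions (%), and PCC.

    Assumption: error = estimated - observed, so a negative MBE means
    the reanalysis underestimates the observations.
    """
    if len(observed) != len(estimated) or not observed:
        raise ValueError("series must be non-empty and of equal length")
    n = len(observed)
    diffs = [e - o for o, e in zip(observed, estimated)]
    mean_obs = sum(observed) / n
    mean_est = sum(estimated) / n

    mbe = sum(diffs) / n                             # signed errors can cancel
    mae = sum(abs(d) for d in diffs) / n             # no cancellation
    rmse = math.sqrt(sum(d * d for d in diffs) / n)  # large errors weigh more

    # Pearson correlation coefficient between the two series
    cov = sum((o - mean_obs) * (e - mean_est) for o, e in zip(observed, estimated))
    var_obs = sum((o - mean_obs) ** 2 for o in observed)
    var_est = sum((e - mean_est) ** 2 for e in estimated)
    pcc = cov / math.sqrt(var_obs * var_est)

    return {
        "MBE": mbe, "MAE": mae, "RMSE": rmse,
        "rMBE": 100 * mbe / mean_obs,
        "rMAE": 100 * mae / mean_obs,
        "rRMSE": 100 * rmse / mean_obs,
        "PCC": pcc,
    }

# Hypothetical hourly GHI values (W/m^2): errors of +10, -10, +30, -30
metrics = error_metrics([100, 200, 300, 400], [110, 190, 330, 370])
print(metrics)
```

In this toy series the positive and negative errors cancel exactly, so the MBE is zero while the MAE (20 W/m²) and the RMSE (about 22.4 W/m²) are not, which is the masking effect described above.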
Table 4 shows the five best results (in blue), and
Table 5 the five worst (in red). Note that the best rMBE values are those closest to zero, that is, the smallest in absolute value, because the purpose is to find the smallest difference between the observed and the estimated datasets. For PCC, the best result is the highest value among the four reanalysis datasets for each station analyzed.
In
Table 4, the lowest rMBE (−0.09%) was observed at station A306 for the ERA5 reanalysis dataset. For the rMAE, station A402 presents the lowest error (21.01%), for MERRA-2; station A429 has the lowest rRMSE value (43.48%), also for MERRA-2; and station A336 presents the highest PCC (0.9488), again for MERRA-2.
Analyzing the worst results obtained (
Table 5), station A705 had the worst rMBE (39.32%), for the CFSv2 reanalysis dataset. Station A428 had the worst rMAE (62.89%) and the worst rRMSE (114.99%). For rMBE, station A428 returned errors above 34% for all reanalysis databases.
Figure 5 shows the graphical representation of the rMBE from
Table 4 and
Table 5, giving a better idea of the difference between the best (
Figure 5a) and worst (
Figure 5b) results observed. By visual examination, station A402 exhibits the smallest errors, while station A428 has the greatest errors for all reanalysis datasets. All the calculated metric values are in
Table A2 and
Table A3.
MERRA-2 had the lowest MBE and rMBE at 14 stations, followed by ERA5 at 11 stations, ERA5-Land at seven stations, and CFSv2 at only three stations. The results obtained for RMSE and rRMSE are very similar to those for MAE and rMAE. MERRA-2 presented the lowest rRMSE in 33 localities and ERA5-Land in the other 2. For rMAE, MERRA-2 presented the lowest value in 32 localities and ERA5-Land in the other 3.
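A tally of this kind can be sketched as follows. The function `count_best`, the station labels, and the metric values are hypothetical placeholders, not the values of Table A2 or Table A3; the only assumption carried over from the text is that rMBE is ranked by absolute value, whereas PCC is ranked by the highest value.

```python
from collections import Counter

def count_best(metric_table, score=abs):
    """Count, per dataset, the stations where it has the best metric value.

    metric_table maps station -> {dataset: value}; the dataset with the
    smallest score(value) wins the station. Use score=abs for rMBE and
    score=lambda v: -v for PCC (higher is better).
    """
    wins = Counter()
    for by_dataset in metric_table.values():
        best = min(by_dataset, key=lambda name: score(by_dataset[name]))
        wins[best] += 1
    return dict(wins)

# Hypothetical rMBE values (%) for three made-up stations and two datasets
rmbe = {
    "S1": {"A": -2.0, "B": 5.0},
    "S2": {"A": 8.0, "B": -1.0},
    "S3": {"A": 0.5, "B": 3.0},
}
print(count_best(rmbe))

# Hypothetical PCC values: the highest value is the best one
pcc = {"S1": {"A": 0.93, "B": 0.90}, "S2": {"A": 0.88, "B": 0.91}}
print(count_best(pcc, score=lambda v: -v))
```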
Pearson’s correlation coefficient helps to identify how close the observed values are to the corresponding reanalysis datasets. For all datasets, the PCC values were strongly positive, over 0.81 (
Table A3). MERRA-2 leads with the highest PCC at 33 stations, with values ranging from 0.8999 to 0.9488, and ERA5-Land is in second place with only two best results, with values ranging from 0.8261 to 0.9427. Although ERA5 and CFSv2 do not have the best value at any station, their coefficients are still high: ERA5 ranges from 0.8144 to 0.9268, and CFSv2 from 0.8112 to 0.8975.
To get an overview of where the best-performing stations are geographically located,
Figure 6 shows the PCC. For MERRA-2, the best results are located next to the coast in the Northeast and Southeast regions of Brazil. For ERA5-Land, ERA5, and CFSv2, the best PCC was obtained in a more central band within the analyzed area.
In summary, the best rRMSE values range from 43.48% to 78.76% (a difference of 35.28 percentage points), showing that for some stations the error between the observed GHI and the reanalysis is rather large. The rMAE shows less variation than rMBE and rRMSE, with best values ranging from 21.01% to 41.32% (20.31 percentage points). For rMBE, the range is −6.02% to 34.95% (40.97 percentage points). The error amplitude decreases when the five stations with the worst results are removed.
Looking at the monthly error helps to ascertain whether the reanalysis databases reflect GHI behavior for 2020. Since Brazil is located mainly in the Southern Hemisphere, with large areas near the Equator, the seasons are not as well defined as in some countries in the Northern Hemisphere.
Figure 7 displays the behavior of the metrics in one month of each season: February (summer), May (autumn), August (winter), and November (spring). MERRA-2 shows the smallest error range in all months analyzed.
Table A4 shows the mean and the median metrics for all months. ERA5 and ERA5-Land do not represent the GHI well for August, September, and October.
According to the results for the analyzed metrics, MERRA-2 performed better, but ERA5 and ERA5-Land also presented good results.
Table A5 and
Table A6 show the comparison of each reanalysis with the observed GHI, considering the mean and the standard deviation of the data by month.
Table 6 summarizes the average of all stations’ data. The mean variation is no more than 5.25% for MERRA-2, ERA5, and ERA5-Land in the aggregated view. But when we analyzed the values month by month (
Table A5 and
Table A6), ERA5 and ERA5-Land showed more significant variations in September and October. Note that the values of reanalysis and measured GHI are very close, with percentage errors ranging between −6.04% and 20.10% for the mean and between −9.71% and 9.23% for the standard deviation. When CFSv2 is excluded, the range for the mean narrows to between −6.04% and 15.32%, indicating that the remaining reanalysis datasets replicate the GHI measured by INMET; for the standard deviation, excluding CFSv2 makes no difference.
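A percentage error of this kind can be computed as in the sketch below. It assumes that the deviation is expressed relative to the measured value and that the sample standard deviation is used; the text does not state either choice, and the function name and sample values are hypothetical.

```python
import statistics

def percent_deviation(observed, estimated):
    """Percentage deviation of the mean and of the standard deviation
    of the estimated series relative to the observed series.

    Assumption: sample standard deviation (statistics.stdev).
    """
    mean_obs = statistics.fmean(observed)
    mean_est = statistics.fmean(estimated)
    std_obs = statistics.stdev(observed)
    std_est = statistics.stdev(estimated)
    mean_dev = 100 * (mean_est - mean_obs) / mean_obs
    std_dev = 100 * (std_est - std_obs) / std_obs
    return mean_dev, std_dev

# Hypothetical monthly GHI means: the estimate is 10% higher throughout
print(percent_deviation([100, 200, 300], [110, 220, 330]))
```

A positive result indicates that the reanalysis mean (or spread) is above the measured one, and a negative result indicates the opposite.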
Figure 8a shows a good example of the monthly GHI for station A402, located in the municipality of Barreiras in the state of Bahia. This municipality had only 27 days of precipitation in 2020, concentrated mainly in January and February, and an average temperature of 25 °C [
46]. MERRA-2, ERA5, and ERA5-Land all represented the behavior of the GHI in this locality well; the exception was the CFSv2 database, which showed a significant difference in the first and last daylight hours (
Figure 8b).
Station A428, which showed higher errors in all metrics, is an example where none of the reanalysis databases represented the locally measured GHI well (
Figure 9a,b). This station is also located in Bahia, in the municipality of Senhor do Bonfim. The monthly average temperature was 23 °C in 2020 and there were 146 rainy days in that year. The wettest month was June with 23 rainy days. None of the reanalysis datasets was able to represent the locally measured GHI for this station properly.
The difference between these two examples is quite significant: the station with the best result (A402) is in a region with few rainy days, while the station with the worst result (A428) is in a region with many rainy days. This may indicate that the reanalysis datasets do not reproduce the GHI properly on days with overcast skies. It is also possible that station A428 had sensor calibration problems. This cannot be confirmed, since INMET’s website provides no information about poorly calibrated sensors, but the possibility cannot be ignored.
4. Discussion
The primary goal of this study was to compare global horizontal irradiance (GHI) from different hourly reanalysis datasets with ground-based data from a public database. MERRA-2 consistently showed the lowest error and the highest correlation in most locations when compared to the GHI measured at 35 INMET stations across Brazil. These findings suggest that MERRA-2 provides the most reliable representation of GHI for Brazilian applications, although future studies could further validate this by examining more years of data.
Although ERA5-Land has a finer grid, higher accuracy of the GHI values was not observed for the studied localities. A limitation of this study was the use of the GHI obtained at the grid coordinate nearest to each meteorological station, since the closest point will not always adequately represent the region studied due to climate and geographical characteristics. An alternative approach would be to interpolate the GHI values from the four closest grid coordinates and check whether this reduces the error found.
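One common way to combine the four surrounding grid points is bilinear interpolation. The sketch below is only an illustration of that alternative, not a method applied in this study; the function name, the grid coordinates, and the GHI values are hypothetical.

```python
def bilinear_ghi(lat, lon, lat0, lat1, lon0, lon1, v00, v01, v10, v11):
    """Bilinear interpolation of GHI at (lat, lon) inside one grid cell.

    Corner values: v00 at (lat0, lon0), v01 at (lat0, lon1),
                   v10 at (lat1, lon0), v11 at (lat1, lon1).
    Assumes lat0 != lat1 and lon0 != lon1.
    """
    ty = (lat - lat0) / (lat1 - lat0)  # fractional position in latitude
    tx = (lon - lon0) / (lon1 - lon0)  # fractional position in longitude
    along_lat0 = v00 * (1 - tx) + v01 * tx
    along_lat1 = v10 * (1 - tx) + v11 * tx
    return along_lat0 * (1 - ty) + along_lat1 * ty

# Hypothetical cell (0.5 x 0.625 degrees) with a station at its center
ghi = bilinear_ghi(-12.25, -44.6875,
                   -12.5, -12.0, -45.0, -44.375,
                   100.0, 200.0, 300.0, 400.0)
print(ghi)
```

At a cell corner the function returns that corner's value, so a station located exactly on a grid point yields the same result as the nearest-coordinate approach.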
It was also observed that ERA5 and ERA5-Land had higher errors during August, September, and October—late winter and early spring in the Southern Hemisphere—when many regions of Brazil experience cloudy weather and frequent rainfall. This indicates that these databases may encounter difficulties in accurately modelling cloud cover and precipitation during these periods, emphasizing the necessity of accounting for seasonal variability in GHI predictions.
Another significant observation is that MERRA-2 underestimated GHI in 28 out of the 35 stations, while ERA5 (23 stations), ERA5-Land (27 stations), and CFSv2 (34 stations) overestimated GHI, which is consistent with the general behavior of reanalysis models in predicting solar irradiance.
According to Yang and Bright (2020) [
26], ERA5 outperforms the results found for MERRA-2 in almost all the stations compared to BSRN stations, in contrast to what was found here. Looking specifically at the Petrolina and Brasilia stations located close to the stations studied by [
26], for Petrolina (A307), ERA5 outperformed MERRA-2 for the rMBE, but for the rRMSE, this was not observed. In relation to the Brasilia station, again in [
26], ERA5 outperformed MERRA-2 in both metrics, contrary to the results found here. It is important to note that our study used the year 2020 while [
26] used all available data, and for these locations, there is no data available for 2020 in the BSRN dataset.
Salazar et al. (2020) [
44], who only studied the BSRN Petrolina station, found that ERA5 outperformed MERRA-2, differing from our results. However, as mentioned by these authors [
44], the results cannot be assumed to be true for other regions of Brazil without conducting additional studies (the same applies here). This is especially because Petrolina (A307) is one of the INMET stations with the worst results for all the databases studied in 2020.
The other difference observed in relation to the work conducted by [
26] regards the preprocessing performed on the datasets. There, the authors considered that the data were aligned in relation to the timestamps, which was not the case here, as shown in
Figure 2. This mismatch in MERRA-2 led to our decision to align the timestamps of all the time series, which may explain the difference in results since, before the adjustment, MERRA-2 had the worst results; this can be considered a limitation. Further exploration of the differences between the databases and their impact on reducing bias should be the subject of future studies.
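The effect of a timestamp mismatch on the pairing of observed and reanalysis values can be sketched as follows. The 30-minute offset, the function names, and the sample values are hypothetical and serve only as an illustration; the actual adjustment applied in this study is the one shown in Figure 2.

```python
from datetime import datetime, timedelta

def shift_timestamps(series, offset):
    """Return a copy of a {datetime: value} series shifted by offset."""
    return {t + offset: v for t, v in series.items()}

def paired(observed, reanalysis):
    """Pair values that share exactly the same timestamp."""
    common = sorted(observed.keys() & reanalysis.keys())
    return [(observed[t], reanalysis[t]) for t in common]

# Hypothetical hourly series: observations stamped on the hour,
# reanalysis stamped on the half hour
obs = {datetime(2020, 1, 1, 13): 500.0, datetime(2020, 1, 1, 14): 600.0}
rea = {datetime(2020, 1, 1, 12, 30): 480.0, datetime(2020, 1, 1, 13, 30): 590.0}

print(paired(obs, rea))  # no common timestamps before alignment
print(paired(obs, shift_timestamps(rea, timedelta(minutes=30))))
```

Without the shift no pairs are formed in this toy case. In a real series a mismatch of a whole hour would instead pair each observation with the wrong hour, which inflates the error metrics because GHI changes quickly during the day.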
Further research should focus on understanding why certain locations, such as A015, A207, A307, A428, and A705, exhibited larger errors. Nonetheless, our results indicate that GHI data from reanalysis databases, particularly MERRA-2, can be a valuable tool for Brazilian regions lacking ground-based measurements. Although our study is limited to the year 2020, the findings demonstrate the potential for these datasets to enhance solar energy forecasting, especially in areas where observational data are sparse or unavailable.
5. Conclusions
There are several reanalysis databases available on a global scale with different grids and temporal granularity. This study prioritized hourly reanalysis databases, always using the latest version available from each organization that produces global reanalysis data. From the results obtained, the CFSv2 dataset was not suitable for applications with GHI in Brazil, since its errors were very high compared to the other databases. We observed a pattern of high variations in the first and last hours of daylight that did not adequately describe the locally measured GHI. In contrast, MERRA-2 emerged as the most accurate in most cases, with ERA5 and ERA5-Land presenting good results except for June to October.
These findings contribute to the improvement of forecasting in the context of solar energy, offering a viable alternative in regions with limited climate variable measurements. Climate variables from reanalysis are a valuable source of data, not only for solar generation forecasting models but also for the imputation of missing data in incomplete time series.
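As an illustration of the imputation use case, the sketch below fills gaps in a measured series with reanalysis values scaled by a simple ratio of means. This ratio-based correction is an assumption chosen for brevity, not a method evaluated in this study, and the function name and values are hypothetical.

```python
def fill_gaps(observed, reanalysis):
    """Fill missing observations (None) with scaled reanalysis values.

    The scale is the ratio between the observed and reanalysis sums over
    the hours where an observation exists (a simple bias correction).
    """
    valid = [(o, r) for o, r in zip(observed, reanalysis) if o is not None]
    sum_rea = sum(r for _, r in valid)
    scale = sum(o for o, _ in valid) / sum_rea if sum_rea else 1.0
    return [o if o is not None else scale * r
            for o, r in zip(observed, reanalysis)]

# Hypothetical series: the reanalysis overestimates by 10% and the
# second observation is missing
print(fill_gaps([100.0, None, 300.0], [110.0, 220.0, 330.0]))
```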
Based on this, future studies could focus on selecting the reanalysis dataset that minimizes the error for the location under investigation. This approach would involve considering MERRA-2, ERA5, and ERA5-Land, not only as stand-alone options but also as a possible combination to obtain a more accurate representation of global horizontal irradiance (GHI).
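A minimal sketch of this idea is given below. It assumes that the candidates are compared by rRMSE and that the combination is an equal-weight average; both are illustrative choices rather than results of this study, and the series names and values are hypothetical.

```python
import math

def rrmse(observed, estimated):
    """Relative RMSE (%), normalized by the mean of the observed values."""
    n = len(observed)
    mse = sum((e - o) ** 2 for o, e in zip(observed, estimated)) / n
    return 100 * math.sqrt(mse) / (sum(observed) / n)

def equal_weight_blend(series_list):
    """Average several equally long series value by value."""
    return [sum(values) / len(values) for values in zip(*series_list)]

def best_candidate(observed, candidates):
    """Return the name of the candidate with the lowest rRMSE and all scores."""
    scores = {name: rrmse(observed, s) for name, s in candidates.items()}
    return min(scores, key=scores.get), scores

# Hypothetical station: one dataset underestimates, the other overestimates
observed = [100.0, 200.0, 300.0]
low = [90.0, 180.0, 270.0]
high = [110.0, 220.0, 330.0]
candidates = {"low": low, "high": high,
              "blend": equal_weight_blend([low, high])}
print(best_candidate(observed, candidates))
```

In this constructed case the opposite biases cancel, so the blend has the lowest error; with real data the gain, if any, would have to be verified station by station.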
When observing the monthly aggregated values, it is possible to notice seasonality, probably associated with the rainy and dry seasons and the cloudiness in each region. In further research, calculating monthly deviations and correlating the data with clearness and precipitation indices can help to analyze data quality and to choose the reanalysis dataset that best reproduces the uncertainty of GHI according to regional characteristics.
To extend the study and verify the applicability of the reanalysis databases by applying GHI as an input variable in forecasting models, it would be interesting to test all the datasets to forecast solar generation and compare the results obtained with the counterpart forecasts generated via GHI from ground measurements.