1. Introduction
The National Oceanic and Atmospheric Administration (NOAA)’s National Centers for Environmental Information (NCEI) provide one of the most widely used monthly land surface temperature datasets, the Global Historical Climatology Network (GHCN) [1,2,3,4]. As can be seen from Table 1, it is either the primary (1°) or a major secondary (2°) instrumental data source for each of the current global and hemispheric land surface temperature estimates [3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19].
The first version of the dataset (released in 1992 by NOAA’s National Climatic Data Center—which became NCEI in 2015) did not attempt to correct for non-climatic biases in the station records [1]. However, starting with Version 2 (released in 1997), NOAA has provided two alternate variants of the dataset [2]. The first variant contains the original (henceforth, “raw” or “unhomogenized”) temperature records without any attempt to correct for non-climatic biases. For the other variant (henceforth, “homogenized”), all station records are processed using an automated statistical homogenization program.
Since Version 2, the GHCN has included a separately processed dataset for the contiguous United States (i.e., all states except Alaska and Hawai’i) called the United States Historical Climatology Network (USHCN) [20,21]. This is a high-density network of relatively long station records (~1200 stations, spanning the period 1895–present) for which NOAA maintains detailed “station history metadata”, i.e., records describing the details and dates of any documented changes in the types of thermometer used, observation practices, station location, etc. As we will discuss later, access to such station history metadata can potentially be very useful for identifying and correcting for non-climatic biases in the station records. However, for the rest of the GHCN stations, NOAA did not acquire any of this station history metadata. Instead, the station metadata for the rest of the GHCN stations are limited to “station environment metadata”, i.e., details on the most recent station location, e.g., current station name, latitude, longitude and elevation [2,3,4]. Given this extra information for the contiguous U.S. component, NOAA processes and homogenizes the USHCN dataset separately from the rest of the GHCN dataset and merges the two components later [2,3,4].
For Version 2 of the GHCN, NOAA applied very different homogenization techniques to the two components. The USHCN component was homogenized using the station history metadata, with empirically derived corrections for changes in “time of observation” (TOB) [22] and changes in instrumentation [23]. Non-climatic biases arising from other documented changes were estimated using the relative homogeneity adjustment algorithm of Karl & Williams (1987) [24]. A separate population-based correction for potential urbanization bias was also applied [25], along with some infilling of data gaps [20]. In contrast, for the rest of the GHCN, NOAA applied the Easterling and Peterson (1995) [26] relative homogeneity adjustment algorithm. This algorithm was designed to statistically identify both the date and magnitude of any non-climatic jumps (“breakpoints”) in the station record relative to a reference time series constructed from the average of five neighboring stations [2,26]. As a result, it does not require or use any station metadata other than latitude and longitude.
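To make the general idea concrete, the sketch below is a minimal, schematic illustration of this style of reference-series breakpoint detection. It is not NOAA’s implementation: the composite-reference construction, the simple two-sample statistic and the 24-month minimum segment length are our own illustrative assumptions.

```python
import numpy as np

def candidate_breakpoint(target, neighbors):
    """Illustrative reference-series breakpoint search (schematic only).

    target    : 1-D array of monthly temperatures/anomalies for the station.
    neighbors : 2-D array (n_neighbors x n_months) for the nearby stations.

    Returns the month index that maximises a simple two-sample t statistic on
    the target-minus-reference difference series, together with the implied
    step size and the statistic itself.
    """
    reference = neighbors.mean(axis=0)      # composite reference series
    diff = target - reference               # shared climate signal largely cancels

    best_idx, best_t = 24, 0.0
    for k in range(24, len(diff) - 24):     # require >= 24 months on each side
        a, b = diff[:k], diff[k:]
        pooled = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
        t = abs(b.mean() - a.mean()) / pooled
        if t > best_t:
            best_idx, best_t = k, t
    step = diff[best_idx:].mean() - diff[:best_idx].mean()
    return best_idx, step, best_t
```

In a full algorithm, the statistic would be tested against a significance threshold and multiple breakpoints would be handled iteratively; the pairwise approach discussed below assesses each neighbor comparison separately rather than using a single composite reference.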
In 2009, NOAA developed Version 2 of the USHCN [21], which replaced the previous explicit corrections for changes in instrumentation and urbanization bias, as well as the Karl & Williams (1987) homogenization algorithm, with the new Pairwise Homogenization Algorithm (PHA) described by Menne & Williams (2009) [27]. The PHA was designed to identify non-climatic breakpoints that were either documented (i.e., associated with a known station history metadata event) or undocumented. The algorithm applies less strict statistical thresholds to potential breakpoints occurring shortly before or after a documented metadata event. Menne & Williams (2009) showed that the algorithm fared quite well when applied to synthetic time series to which artificial biases had been intentionally added [27]. Menne et al. (2009) further argued that the PHA also indirectly removed much of the urbanization bias in station records, citing the PHA results for one station (Reno, Nevada, USA) as an example [21]. Therefore, for USHCN Version 2, the previous explicit corrections for urbanization bias and changes in instrumentation were dropped. Instead, it was argued that the PHA adjustments for both documented and undocumented events should be adequate for correcting all of these non-climatic biases [21], although the TOB adjustments were kept as-is, on the basis of the analysis of Vose et al. (2003) [28]. Later, Menne et al. (2010) [29] argued that the PHA adjustments were also indirectly capable of identifying and removing non-climatic biases arising from degradations over time in the quality of the exposure of many USHCN stations [30,31,32,33], which had been revealed to be widespread and systemic by the voluntary citizen science “Surfacestations” project [33].
Therefore, when NOAA introduced Version 3 of the GHCN in 2011, they decided to switch to also using the PHA for homogenizing the non-USHCN component [3]. This decision was repeated for the current Version 4, which was introduced as a “beta” version in 2015 and formally launched in 2018 [4].
NOAA still has not collected station history metadata for any of the GHCN stations (other than the USHCN component). Therefore, the PHA is applied to the non-USHCN component of the GHCN in “undocumented events only” mode. However, several “benchmarking” evaluations have found that the PHA performs quite well at correcting synthetic time series “blind” when certain artificial biases are introduced [34,35,36], i.e., when run in “undocumented events only” mode. Moreover, as mentioned above, Menne et al. have argued that the PHA is indirectly quite successful at correcting for long-term non-climatic biases from urbanization [29] and reductions in the quality of station exposure [29]. Hausfather et al. (2013) support this claim [37].
On the other hand, Soon et al. argue that much of the apparent “removal” of urbanization biases and poor station exposure biases via homogenization is a statistical artefact of the homogenization process, which leads to the “blending” of non-climatic biases [18,19,38,39]. That is, if several of the reference neighboring stations are affected by gradual multidecadal biases such as urbanization bias, then, when the sign and magnitude of a given breakpoint are being calculated by the homogenization algorithm, some of these biases may inadvertently be added to the homogenized record. This “aliasing” of biases has now been demonstrated to be a systemic statistical artefact of standard homogenization algorithms by several studies [32,38,40].
As a result, the more breakpoints are adjusted for each record, the more the trends of that record will tend to converge towards the trends of its neighbors. Initially, this might appear desirable, since the trends of the homogenized records will be more homogeneous (arguably one of the main goals of “homogenization”), and therefore some have objected to this criticism [41]. However, if multiple neighbors are systemically affected by similar long-term non-climatic biases, then the homogenized trends will tend to converge towards the averages of the station network (including systemic biases), rather than towards the true climatic trends of the region.
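This mechanism can be illustrated with a small synthetic example. The sketch below is ours and purely illustrative: the trend magnitudes, the single breakpoint at month 600 and the adjustment convention are arbitrary assumptions, not a description of any specific algorithm.

```python
import numpy as np

rng = np.random.default_rng(42)
months = np.arange(1200)                    # 100 years of monthly data

true_climate = 0.005 / 12 * months          # shared climatic trend (~0.5 C/century)
target = true_climate + rng.normal(0, 0.2, months.size)

# Five neighbours, each contaminated by a gradual urbanization-style warm bias
urban_bias = 0.02 / 12 * months             # ~2 C/century of non-climatic drift
neighbors = np.array([true_climate + urban_bias + rng.normal(0, 0.2, months.size)
                      for _ in range(5)])
reference = neighbors.mean(axis=0)

# Suppose the algorithm flags one breakpoint in the target at month 600 and
# estimates its size from the target-minus-reference difference series
diff = target - reference
step = diff[600:].mean() - diff[:600].mean()

adjusted = target.copy()
adjusted[:600] += step   # shift the earlier segment to match the later one,
                         # as judged against the (biased) reference

def trend_per_century(series):
    return np.polyfit(months, series, 1)[0] * 1200

print(f"true climatic trend   : {trend_per_century(true_climate):+.2f} C/century")
print(f"raw target trend      : {trend_per_century(target):+.2f} C/century")
print(f"adjusted target trend : {trend_per_century(adjusted):+.2f} C/century")
```

In this toy case, the adjusted target trend ends up much closer to the biased neighbor average than to the true regional trend, which is the “blending” effect described above.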
Soon et al. (2018) suggest that one way to minimize this blending problem of homogenization would be to ensure that the neighbor network used for the homogenization process is not systemically biased relative to the target stations, e.g., rural stations should be homogenized using a mostly rural station network [38]. Indeed, they note that Ren et al. have effectively carried this out for their homogenization of Chinese records for the post-1960 period [42,43]. However, this is a non-trivial challenge for future research and is beyond the scope of this paper.
At any rate, in Soon et al. (2015), three of us noted an additional concern over the PHA adjustments applied by NOAA to the GHCN dataset [18]. While evaluating rural temperature trends for Ireland, Soon et al. (2015) noted several oddities with the homogenized GHCN record (then Version 3) for the longest rural Irish record in the dataset, i.e., Valentia Observatory [18]:
There was a concerning lack of consistency in the breakpoints and adjustments applied by NOAA to the record between each of the five different updates of the GHCN dataset the authors had downloaded (October 2011, January 2012, January 2013, July 2014 and January 2015);
None of the breakpoints identified by NOAA’s PHA for any of those updates corresponded to any of the four documented events in the station history metadata which the Valentia Observatory observers provided;
The PHA homogenization failed to identify, in any of those updates, non-climatic biases associated with the major station move in 1892 or the second station move in 2001 for which parallel measurements showed a −0.3 °C cooling bias.
Independently, another one of us (P.O’N.) had been regularly downloading and archiving updates of the datasets for research purposes from NOAA’s website since May 2011—initially roughly fortnightly, then roughly weekly (from March 2012), and eventually daily (from March 2014) using an automated script. By early 2015, he had also begun noticing the remarkable lack of consistency in PHA adjustments between different updates of the GHCN dataset. Therefore, he continued archiving NOAA’s roughly daily updates to the dataset. This extensive archive of NOAA’s updates to the GHCN dataset is the main basis for the analysis in this paper; it comprises 1877 updates of Version 3 and 1812 updates of Version 4.
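For readers wishing to build a similar archive, the following is a minimal sketch of such a daily download-and-archive job. It is an illustration of the general approach only, not the script actually used; the URL and file naming shown are assumptions that should be verified against NOAA’s current site.

```python
"""Minimal sketch of a daily download-and-archive job for GHCN-m updates.

Illustration only: the URL and file naming are assumptions and may differ
from the files that were actually archived, so check NOAA's site before use.
"""
import datetime
import pathlib
import urllib.request

# Assumed location of NOAA's current GHCN-m v4 adjusted ("qcf") bundle
GHCN_V4_URL = "https://www.ncei.noaa.gov/pub/data/ghcn/v4/ghcnm.tavg.latest.qcf.tar.gz"
ARCHIVE_DIR = pathlib.Path("ghcn_archive")

def archive_todays_update():
    """Fetch today's copy of the dataset unless it has already been saved."""
    ARCHIVE_DIR.mkdir(exist_ok=True)
    stamp = datetime.date.today().isoformat()
    destination = ARCHIVE_DIR / f"ghcnm_v4_qcf_{stamp}.tar.gz"
    if not destination.exists():
        urllib.request.urlretrieve(GHCN_V4_URL, str(destination))

if __name__ == "__main__":
    # Intended to be run once per day, e.g., from a cron entry such as:
    # 0 6 * * * /usr/bin/python3 /path/to/archive_ghcn.py
    archive_todays_update()
```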
Meanwhile, Soon et al. (2015) had fortuitously been able to directly ground-truth the various adjustments to the Valentia Observatory record because the station observers had provided access to both the station history metadata and the accompanying parallel measurements for two of the more recent events [18]. However, at the time, access to publicly archived station history metadata for other stations was extremely limited.
Still, over the last decade or so, particularly in Europe, there has been an increasing realization of the importance of not only tracking down and digitizing early historical weather records [44,45,46,47,48,49,50,51,52,53], but also identifying and digitizing the relevant station history metadata associated with temperature records [46,47,48,49,50,51,52,53,54,55,56,57,58]. Therefore, for quite a few of the GHCN stations, relevant history metadata are now available.
Some of these station history metadata have been published in the scientific literature, e.g., [52,54,58,59,60,61,62,63,64,65], while others are archived by various organizations such as national meteorological services, e.g., [51,66,67,68]. However, much of the collected station metadata are part of ongoing digitizing projects, e.g., [46,47,48,49,50,55,56,57,69,70,71], and are not yet publicly archived. In addition, analyzing and interpreting the relevance of historical metadata for evaluating homogenization adjustments can be somewhat subjective. Therefore, for this collaborative study, the first four listed co-authors sought the assistance of the remaining co-authors in compiling and analyzing the station history metadata from as many GHCN stations as possible for the European region. At the time of writing, we have compiled relevant station history metadata for more than 800 European GHCN stations, and these stations will be the basis for this analysis.
In this paper, we will use this unique combination of an extensive archive of more than 1800 iterations each of Versions 3 and 4 of the GHCN dataset and our compilation of station history metadata to quantify and evaluate the PHA adjustments that have been applied by NOAA to the GHCN dataset since 2011 [3,4]. We appreciate that the PHA [27] has performed quite well over the years in various studies, including benchmarking tests [27,34,35,36]. However, these earlier assessments of the PHA were generally “one-off” assessments, i.e., they did not evaluate the consistency of the breakpoints and adjustments applied across repeated runs of the algorithm.
Nominally, our analysis in this study is confined to Europe, since this is the region for which we have compiled the relevant station history metadata. However, this region comprises many of the longest and most complete station records in the GHCN dataset, and we therefore think the results from this case study are of relevance to all users of the GHCN dataset. We also expect our findings will be of interest to the temperature homogenization community.
The main aims of this study are:
To describe how the PHA adjustments applied to the stations in the widely-used GHCN datasets vary between updates, using this subset of European stations as a detailed case study for the entire global dataset.
To evaluate how closely related (or otherwise) the breakpoints identified by the PHA process are to documented station history metadata events.
To discuss the implications these findings have for the scientific community’s goals of accurately estimating regional climatic temperature trends.
To make recommendations for steps the temperature homogenization community can take to resolve some of the identified problems, moving forward.
3. Results
Figure 4 shows the numbers of distinct sets of adjustments associated with each of the records in Version 3 (green) and Version 4 (blue). For Version 4, we can see that the majority of stations (68%) had more than 25 distinct sets of adjustments over the period of record (2015–2021, with 1812 distinct iterations of the dataset analyzed), and 14% of the stations (including Cheb, described above, which had 269) had more than 200 distinct sets of adjustments applied. For Version 3, the results are a bit more encouraging in that only 44% of the records had more than 25 distinct sets of adjustments and 12% had more than 200 distinct sets. Nonetheless, for both versions, there is clearly a major inconsistency in the homogenization adjustments that NOAA has applied to the GHCN station records over time.
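For clarity, the bookkeeping behind “distinct sets of adjustments” can be expressed as follows. This sketch is illustrative only (the data structure and the station identifier are hypothetical), not our exact processing code: two archived iterations contribute the same set only if they apply identical breakpoints and adjustments to the station.

```python
from collections import defaultdict

def count_distinct_adjustment_sets(iterations):
    """Count, for each station, how many distinct sets of PHA adjustments
    appear across the archived iterations of the dataset.

    `iterations` is a list with one entry per archived update, each mapping a
    station identifier to its list of (breakpoint_date, adjustment) pairs.
    """
    distinct_sets = defaultdict(set)
    for adjustments_by_station in iterations:
        for station, adjustments in adjustments_by_station.items():
            # A "set of adjustments" is the full collection applied to the
            # station in that iteration; the order of breakpoints is irrelevant.
            distinct_sets[station].add(frozenset(adjustments))
    return {station: len(sets) for station, sets in distinct_sets.items()}

# Toy example (hypothetical station ID and values): three archived iterations,
# two of which share the same adjustments, giving two distinct sets.
iterations = [
    {"CHEB_EXAMPLE": [("1950-03", -0.4), ("1971-09", 0.2)]},
    {"CHEB_EXAMPLE": [("1950-03", -0.4), ("1971-09", 0.2)]},
    {"CHEB_EXAMPLE": [("1950-04", -0.5)]},
]
print(count_distinct_adjustment_sets(iterations))   # {'CHEB_EXAMPLE': 2}
```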
In Figure 5, Figure 6 and Figure 7, we assess the consistency of all homogenization adjustments applied to all of the European stations we analyzed for each version. As explained in Table 2 and Table 3, for Version 3, this comprises 1877 PHA iterations applied to 259 stations (from 24 countries). For Version 4, this comprises 1812 PHA iterations applied to 847 stations (again from 24 countries).
Firstly, in Figure 5, we assess the consistency of the PHA adjustments without reference to any station history metadata. Using the same categories for the consistency of breakpoints as in Section 2.3, we see that the majority of PHA adjustments were applied less than or equal to 25% of the time. For Version 3, 38% of adjustments were “very intermittent” (i.e., applied ≤5% of the time) and 23% were “intermittent” (i.e., 5–25% of the time)—Figure 5a. For Version 4, 32% were “very intermittent” and 32% were “intermittent”—Figure 5b.
Only a minority of the PHA adjustments were consistently applied to the station records. For Version 3, only 14% of adjustments were “very consistent” (i.e., applied >95% of the time) and only 21% were “consistent” (i.e., 75–95% of the time)—see Figure 5a. For Version 4, only 9% were “very consistent” and only 16% were “consistent”—see Figure 5b.
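For reference, the consistency categories used throughout this section amount to a simple classification of the fraction of archived PHA runs in which a given breakpoint appears. The function below is an illustrative restatement of those thresholds, not part of our processing pipeline.

```python
def classify_consistency(n_runs_with_breakpoint, n_runs_total):
    """Classify a breakpoint by the fraction of archived PHA runs in which it
    appears, using the same categories as in the text."""
    f = n_runs_with_breakpoint / n_runs_total
    if f <= 0.05:
        return "very intermittent"
    if f <= 0.25:
        return "intermittent"
    if f <= 0.75:
        return "inconsistent"
    if f <= 0.95:
        return "consistent"
    return "very consistent"

# e.g., a breakpoint applied in 1700 of the 1812 archived Version 4 iterations
print(classify_consistency(1700, 1812))   # -> "consistent" (~94% of runs)
```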
As discussed in the introduction, aside from the U.S. component of the dataset (not analyzed here), the PHA is applied by NOAA to the GHCN in the absence of any station history metadata [3,4]. NOAA’s primary justification for this is that, when the station records for the GHCN were initially being compiled, the available station history metadata were very limited [2,3,4], except for the U.S. component (USHCN), which is based on data housed by NOAA [21,22,24,25]. A secondary justification is that some benchmarking simulations involving the introduction of artificial breakpoints to synthetic data have suggested that the PHA algorithm performs well at identifying many of these breakpoints without the use of “metadata” [27,34,35]. Another justification is that several studies (involving NOAA co-authors) of the USHCN dataset have argued that the PHA performs well at identifying both documented and undocumented events [21,27,29,37].
Nonetheless, in recent years, there has been a renewed interest in tracking down and digitizing station history metadata for many countries and individual stations, in the recognition that this can help identify genuine non-climatic breakpoints, e.g., [46,47,48,49,50,51,52,53,54,55,56,57,58]. Unfortunately, the levels of detail and comprehensiveness of the station history metadata currently available vary substantially between countries. Some of this might be a legacy of different practices between different national meteorological services; e.g., Mamara et al. have noted that the station history metadata potentially available in the archives of the Hellenic National Meteorological Service (HNMS) are quite limited for Greece compared to some other countries [55,72]. However, they emphasize the importance of prioritizing the gathering and digitization of this information where available: “These cases demonstrate the necessity that the various meteorological services gather and digitize detailed metadata” [72].
At any rate, for this paper, we have compiled, from multiple sources, station history metadata for 847 of the Version 4 stations and 259 of the Version 3 stations in Europe. In Figure 6, we compare how often the various breakpoints identified by NOAA for these station records coincide with any of the documented events in the corresponding metadata.
Only 3% of the PHA breakpoints for either version corresponded exactly (to the nearest month) to documented events. However, in quite a few cases only the year of an event was recorded. Furthermore, given that the PHA was being run by NOAA without any station history metadata and was relying purely on statistical techniques, it is understandable if a breakpoint is not identified accurately to the nearest month. Nonetheless, even if we consider those breakpoints identified by the PHA within 12 months of the documented event, the matches are disappointingly low: only 18% for Version 3 and 19% for Version 4.
For reference, we also show those breakpoints that loosely “match” documented events in that the dates were within 2 or 3 years. However, we caution that, to give the “best case outcome” for the PHA, we considered a match to have occurred if any documented event coincided with the breakpoint. As a result, if two breakpoints were identified within 2 or 3 years of a given metadata event, both breakpoints might be double-counted in these categories as being “associated with” the same event, even if one breakpoint occurred before the event and the other occurred after it.
Therefore, some of the apparent “matches” within 2 or 3 years are probably spurious. Regardless, 69% of the PHA breakpoints for Version 3 stations and 67% of those for Version 4 stations had no match to any event at all within 3 years.
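The matching convention can be summarized as follows. The sketch below is an illustrative restatement with hypothetical dates, not our processing code; the second call shows how the “best case” convention can count two breakpoints against the same documented event.

```python
from datetime import date

def months_between(d1, d2):
    """Absolute separation between two dates, in whole months."""
    return abs((d1.year - d2.year) * 12 + (d1.month - d2.month))

def match_breakpoints(breakpoints, metadata_events, tolerance_months):
    """Flag each PHA breakpoint as matched if ANY documented event lies within
    the tolerance window (the "best case" convention described above). Note
    that two breakpoints can both be matched against the same event."""
    return [any(months_between(bp, ev) <= tolerance_months
                for ev in metadata_events)
            for bp in breakpoints]

# Hypothetical example: two breakpoints straddling a single documented move
breakpoints = [date(1968, 5, 1), date(1970, 6, 1)]
events = [date(1969, 10, 1)]
print(match_breakpoints(breakpoints, events, tolerance_months=12))   # [False, True]
print(match_breakpoints(breakpoints, events, tolerance_months=36))   # [True, True]
```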
In Figure 7, we cross-reference the results of Figure 5 and Figure 6. That is, we examine whether the consistency of a given breakpoint is related to whether it matches a documented station history event. If the consistency is better for breakpoints associated with documented events, this might offer some optimism that the PHA’s consistency could be improved using station history metadata. Disappointingly, for the Version 3 stations, there does not seem to be a clear improvement in the consistency of breakpoints based on whether they coincided with documented events. However, the results are more encouraging for Version 4, in that 39% of the very consistent breakpoints (occurring >95% of the time) had “a match” within 3 years and 28% within 1 year, whereas for the very inconsistent breakpoints (occurring ≤5% of the time) only 28% had “a match” within 3 years and only 14% had a match within 1 year. Given that the sample size was larger for Version 4 (847 stations vs. 259 in Version 3), this suggests that the consistency of the PHA adjustments to the GHCN might be partially improved if station history metadata were used.
A Case Study of the Ukrainian Results
The above discussion describes the combined results across all 24 analyzed countries. However, the exact results vary slightly from country to country. Therefore, as a case study, let us now briefly discuss the breakdown of results for one of the 24 countries—Ukraine.
Both versions of the GHCN dataset considered here (Versions 3 and 4) contain multiple stations for Ukraine, one of the largest countries in Europe. Version 3 contained only 25 stations, but this has been increased to 119 for the most recent Version 4. Despite this significant increase, we should note that there is still a significant portion of unused Ukrainian temperature data (especially historical data), which, if incorporated, could improve the representativeness of the regional climate of Eastern Europe in the GHCN [56,57].
In order to evaluate the regional peculiarities of the GHCN homogenization algorithm (both its detection and adjustment components), metadata for the Ukrainian stations were collected from their historical descriptions [73,74]. These metadata consist mainly of the dates of station relocations and other non-climatic events that were clearly documented in the stations’ histories.
As discussed above, the results of the homogenization procedure for a particular station can vary significantly from day to day (i.e., from one homogenization run to another). These variations can appear as different sets of detected breakpoints (different timing, different number), different adjustment factors, or both. It is not completely clear why this wide range of variation exists.
At any rate, for Version 3 of the dataset, the total number of breakpoints detected across all homogenization runs is 47, while for Version 4, this quantity is considerably larger, totaling 592 (Table 4). This is understandable, because the higher station density in the latest version of the GHCN has probably increased the average correlation between the time series, allowing the PHA to detect more breaks. In both versions, the majority of the detected breaks (76.6% in Version 3 and 67.4% in Version 4) can be classified as “intermittent” or “very intermittent”, i.e., they occur in less than 25% of the homogenization runs. “Inconsistent” breaks, occurring in 25–75% of the homogenization runs, constitute 10.6% and 19.3% of the total detected breaks in Versions 3 and 4, respectively. Only approximately 13% of the detected shifts can be considered “consistent” or “very consistent” in both versions of the GHCN dataset, i.e., persistently occurring in more than 75% of the homogenization runs (Table 4).
An important question here is how many of the detected breaks can be confirmed by the collected metadata. In our analysis, we assumed that a breakpoint is confirmed by a historical event when the time difference between them is not more than 1 year. The results of this comparison are presented in Table 5. As can be seen from the table, only a small number of detected shifts (around 5–6% in both versions of the GHCN) are associated with documented metadata events.
It is worth noting that, in other studies involving two of us (Oleg and Olesya S.), the Ukrainian monthly air temperature data (collections of 178 time series of mean, maximum and minimum temperature) have been homogenized for the period 1946–2015 [56,57] using the HOMER software [75]. In that independent analysis, for all three datasets, the percentage of detected breakpoints that could be confirmed by historical events was more than 30%. However, these results are difficult to compare directly with the PHA homogenization of the GHCN, because HOMER was run in a semi-automatic mode in which the final decisions regarding detected breaks were made by an expert.
4. Discussion
After comparing the different homogenization adjustments applied by NOAA to their widely used Global Historical Climatology Network (GHCN) monthly temperature dataset, we found a disconcerting inconsistency between the updates to the dataset from day to day. Availing of a substantial archive compiled by regularly downloading the datasets from NOAA’s website over the period 2011–2019 for Version 3 and 2015–present for Version 4, we assembled 1877 distinct iterations of the Version 3 dataset and 1812 of the Version 4 dataset which NOAA has published over the past decade.
From our large sample of station records from 24 European countries, we found that only 16% of the breakpoints and adjustments applied to the GHCN Version 4 station records were repeatedly applied more than 75% of the time; 64% of the adjustments were applied to records less than 25% of the time, with 32% being applied less than 5% of the time. The results were arguably slightly better for Version 3, with 21% being applied more than 75% of the time and 61% being applied less than 25% of the time. However, for Version 3, 38% of the adjustments were applied less than 5% of the time.
This remarkable inconsistency in the results from NOAA’s application of the Pairwise Homogenization Algorithm (PHA) [27] to the GHCN since 2011 [3,4] is quite surprising, since the PHA has performed quite well over the years against various benchmarking tests [27,34,35,36]. However, we note that those earlier assessments of the PHA were generally “one-off” assessments, i.e., they did not evaluate the consistency of the breakpoints and adjustments applied across repeated runs of the algorithm. Therefore, this inconsistency of the PHA adjustments between consecutive runs would have been inadvertently overlooked by those earlier tests.
We stress that, in this paper, we are definitely not criticizing the overall temperature homogenization project. We also stress that the efforts of Menne & Williams (2009) in developing the PHA [27] to try to correct for both undocumented and documented non-climatic biases were commendable. Long-term temperature records are well known to be frequently contaminated by various non-climatic biases arising from station moves [24,61,75], changes in instrumentation [23,61], siting quality [18,19,29,31,32,33], times of observation [22,28], urbanization [18,19,25,37,38,39,42], etc. Therefore, if we are interested in using these records to study regional, hemispheric or global temperature trends, it is important to accurately account for these biases. Indeed, many of us are actively engaged in developing and/or improving the reliability of the available temperature records for Europe [18,19,35,44,45,46,49,56,57,58,59,69,70,71,75,76,77,78,79] and elsewhere [18,19,38,39] through various homogenization techniques and/or the collection of more data.
Rather, we believe these findings should be used as motivation for improving our approaches to homogenizing the available temperature records. Meanwhile, the results raise serious concerns, which do not appear to have been appreciated until now, over the reliability of the homogenized versions of the GHCN dataset and, more broadly, over PHA-based techniques. As shown in Table 1, the homogenized GHCN datasets have been widely used by the community for studying global temperature trends. We note that the PHA has also been used for homogenizing some other climatic datasets [80,81,82,83,84,85], although our analysis here specifically concerns the PHA adjustments that have been applied to the GHCN datasets since 2011 [3,4].
Another major concern is how few of the adjustments are associated with documented station history metadata events—only 19% of Version 4 and 18% of Version 3 breakpoints occur within 1 year of a documented event. We agree that metadata information can sometimes be erroneous and that many events leading to non-climatic breakpoints will go undocumented [26,27,55,56,72,86,87,88,89].
So, we would expect a fraction of the breakpoints to be undocumented. However, particularly when these breakpoints are inconsistent from run to run, this raises the question of how we know whether an undocumented breakpoint that occurs in one run but not another is genuine. We are reminded of an observation by the 20th-century statistician and economist E. F. Schumacher:
“A man who uses an imaginary map, thinking that it is a true one, is likely to be worse off than someone with no map at all.”—E. F. Schumacher, “Small is Beautiful” (1973) [90].
With that in mind, we believe that future efforts to homogenize temperature datasets should involve a greater use of station history metadata. We appreciate that, in recent years, with the commendable compilation of very large climate datasets, there has been an understandable appeal in using automated statistical homogenization techniques that do not require metadata. However, given the findings from this analysis and the above “imaginary map” problem, we suggest that, as a community, we should begin treating these “blind” homogenization results with a bit more skepticism. Accordingly, we welcome efforts to allow the use of metadata in various homogenization packages, e.g., [75,91,92], including the PHA [27].
While we encourage more investigation of the various homogenization techniques via “benchmarking” experiments using synthetic data [27,34,35,36], we believe these findings highlight the importance of also benchmarking techniques against real-world data [93]. The collection and digitization of station history metadata for as many stations as possible should be encouraged. In many cases, this could potentially be combined with the various commendable efforts to track down and digitize early instrumental thermometer records [44,46,47,48,49,53].
Finally, in this study, we have not focused on the impacts of homogenization on long-term trends. However, we note that various studies have demonstrated that the application of homogenization to a series of temperature records can inadvertently lead to the “aliasing” or “blending” of some non-climatic biases if multiple neighbors are systemically affected by similar long-term non-climatic biases, such as urbanization biases or siting biases [32,38,40]. As mentioned in the introduction, Soon et al. (2018) have suggested that one way to minimize this blending problem would be to ensure that the neighbor network used for the homogenization process is not systemically biased relative to the target stations, e.g., rural stations should be homogenized using a mostly rural station network [38].