Evaluation of the Homogenization Adjustments Applied to European Temperature Records in the Global Historical Climatology Network Dataset

: The widely used Global Historical Climatology Network (GHCN) monthly temperature dataset is available in two formats—non-homogenized and homogenized. Since 2011, this homogenized dataset has been updated almost daily by applying the “Pairwise Homogenization Algorithm” (PHA) to the non-homogenized datasets. Previous studies found that the PHA can perform well at correcting synthetic time series when certain artiﬁcial biases are introduced. However, its performance with real world data has been less well studied. Therefore, the homogenized GHCN datasets (Version 3 and 4) were downloaded almost daily over a 10-year period (2011–2021) yielding 3689 different updates to the datasets. The different breakpoints identiﬁed were analyzed for a set of stations from 24 European countries for which station history metadata were available. A remarkable inconsistency in the identiﬁed breakpoints (and hence adjustments applied) was revealed. Of the adjustments applied for GHCN Version 4, 64% (61% for Version 3) were identiﬁed on less than 25% of runs, while only 16% of the adjustments (21% for Version 3) were identiﬁed consistently for more than 75% of the runs. The consistency of PHA adjustments improved when the breakpoints corresponded to documented station history metadata events. However, only 19% of the breakpoints (18% for Version 3) were associated with a documented event within 1 year, and 67% (69% for Version 3) were not associated with any documented event. Therefore, while the PHA remains a useful tool in the community’s homogenization toolbox, many of the PHA adjustments applied to the homogenized GHCN dataset may have been spurious. Using station metadata to assess the reliability of PHA adjustments might potentially help to identify some of these spurious adjustments.


Time Series GHCN (Raw) GHCN (Homogenized) Extra Homogenization
NOAA National Centers for Environmental Information (NCEI), "NOAAGlobalTemp" [3][4][5][6] × The first version of the dataset (released in 1992 by NOAA's National Climatic Data Center-which became NCEI in 2015) did not attempt to correct for non-climatic biases in the station records [1]. However, starting with Version 2 (released in 1997), NOAA have provided two alternate variants of the dataset [2]. The first variant contains the original (henceforth, "raw" or "unhomogenized") temperature records without any attempt to correct for non-climatic biases. For the other variant (henceforth, "homogenized"), all station records are processed using an automated statistical homogenization program.
Since Version 2, the GHCN includes a separately processed dataset for the contiguous United States (i.e., all states except Alaska and Hawai'i) called the United States Historical Climatology Network (USHCN) [20,21]. This is a high-density network of relatively long station records (~1200 stations, spanning the period 1895-present) for which NOAA maintains detailed "station history metadata", i.e., records describing details and dates of any documented changes in the types of thermometer used, observation practices, station location, etc. As we will discuss later, access to such station history metadata can potentially be very useful for identifying and correcting for non-climatic biases in the station records. However, for the rest of the GHCN stations, NOAA did not acquire any of this station history metadata. Instead, the station metadata for the rest of the GHCN stations are limited to "station environment metadata", i.e., details on the most recent station location, e.g., current station name, latitude, longitude and elevation [2][3][4]. Given this extra information for the contiguous U.S. component, NOAA processes and homogenizes the USHCN dataset separately from the rest of the GHCN dataset and merges the two components later [2][3][4].
For Version 2 of the GHCN, NOAA applied very different homogenization techniques to both components. The USHCN component was homogenized using the station history metadata to correct for changes in "time of observation" (TOB) [22] and changes in instrumentation [23] using empirically-derived corrections. Non-climatic biases arising from other documented changes were estimated using the relative homogeneity adjustment algorithm of Karl & Williams (1987) [24]. A separate population-based correction for potential urbanization bias was also applied [25] along with some infilling of data gaps [20]. In contrast, for the rest of the GHCN, NOAA applied the Easterling and Peterson (1995) [26] relative homogeneity adjustment algorithm. This algorithm was designed to statistically identify both the date and magnitude of any non-climatic jumps ("breakpoints") in the station record relative to a reference time series constructed from the average of five neighboring stations [2,26]. As a result, it does not require or use any station metadata other than latitude and longitude.
In 2009, NOAA developed Version 2 of the USHCN [21] which replaced the previous explicit corrections for changes in instrumentation and urbanization bias as well as the Karl & Williams (1987) homogenization algorithm with the new Pairwise Homogeneity Adjustment (PHA) algorithm described by   [27]. This PHA algorithm was designed to identify non-climatic breakpoints that were either documented (i.e., associated with a known station history metadata event) or undocumented. The algorithm requires less strict statistical thresholds for potential breakpoints occurring shortly before or after a documented metadata event.  showed that the algorithm fared quite well when applied to synthetic time series where artificial biases had been intentionally added [27].  later argued that the PHA coincidentally also indirectly removed much of the urbanization biases in station recordsciting the PHA results for one station (Reno, Nevada, USA) as an example [21]. Therefore, for USHCN version 2, the previous explicit corrections for urbanization bias and changes in instrumentation were dropped. Instead, it was argued that the PHA adjustments for both documented and undocumented events should be adequate for correcting all of these non-climatic biases [21], although the TOB adjustments were kept as-is, on the basis of Vose et al. (2003)'s analysis [28]. Later, Menne et al. (2010) [29] argued that the PHA adjustments were also indirectly capable of identifying and removing non-climatic biases arising from degradations over time in the quality of the exposure of many USHCN stations [30][31][32][33] that had been revealed to be widespread and systemic by the voluntary citizen science "Surfacestations" project [33].
Therefore, when NOAA introduced Version 3 of the GHCN in 2011, they decided to switch to also using the PHA for homogenizing the non-USHCN component [3]. This decision was repeated for the current Version 4 that was introduced as a "beta" version in 2015, and formally launched in 2018 [4].
NOAA still has not collected station history metadata for analyzing any of the GHCN stations (other than the USHCN component). Therefore, the PHA is applied to the non-USHCN component of the GHCN in "undocumented events only" mode. However, several "benchmarking" evaluations have found that the PHA performs quite well at correcting synthetic time series "blind" when certain artificial biases are introduced [34][35][36], i.e., when Atmosphere 2022, 13, 285 4 of 21 run in "undocumented events only" mode. Moreover, as mentioned above, Menne et al. have argued that the PHA is indirectly quite successful at correcting for long-term nonclimatic biases from urbanization [29] and reductions in the quality of station exposure [29]. Hausfather et al. (2013) support this claim [37].
On the other hand, Soon et al. argue that much of the apparent "removal" of urbanization biases and poor station exposure biases via homogenization is a statistical artefact of the homogenization process which leads to the "blending" of non-climatic biases [18,19,38,39]. That is, if several of the reference neighboring stations are affected by gradual multidecadal biases such as urbanization bias then, when the sign and magnitude of a given breakpoint is being calculated by the homogenization algorithm, some of these biases may inadvertently be added to the homogenized record. This "aliasing" of biases has now been demonstrated as a systemic statistical artefact of standard homogenization algorithms by several studies [32,38,40].
As a result, the more breakpoints are adjusted for each record, the more the trends of that record will tend to converge towards the trends of its neighbors. Initially, this might appear desirable since the trends of the homogenized records will be more homogeneous (arguably one of the main goals of "homogenization"), and therefore some have objected to this criticism [41]. However, if multiple neighbors are systemically affected by similar long-term non-climatic biases, then the homogenized trends will tend to converge towards the averages of the station network (including systemic biases), rather than towards the true climatic trends of the region. Soon et al. (2018) suggest that one way to minimize this blending problem of homogenization would be to ensure that the neighbor network used for the homogenization process is not systemically biased relative to the target stations, e.g., rural stations should be homogenized using a mostly rural station network [38]. Indeed, they note that Ren et al. have effectively carried this out for their homogenization of Chinese records for the post-1960 period [42,43]. However, this is a non-trivial challenge for future research, which is beyond the scope of this paper.
At any rate, in Soon et al. (2015), three of us noted an additional concern over the PHA adjustments applied by NOAA to the GHCN dataset [18]. While evaluating rural temperature trends for Ireland, Soon et al. (2015) noted several oddities with the homogenized GHCN record (then Version 3) for the longest rural Irish record in the dataset, i.e., Valentia Observatory [18]: 1. There was a concerning lack of consistency in the breakpoints and adjustments applied by NOAA to the record between each of the five different updates of the GHCN dataset the authors had downloaded (October 2011, January 2012, January 2013, July 2014 and January 2015); 2. None of the breakpoints identified by NOAA's PHA for any of those updates corresponded to any of the four documented events in the station history metadata which the Valentia Observatory observers provided; 3. The PHA homogenization failed to identify, in any of those updates, non-climatic biases associated with the major station move in 1892 or the second station move in 2001 for which parallel measurements showed a −0.3 • C cooling bias. Independently, another one of us (P.O'N.) had been regularly downloading and archiving the updates of the datasets for research purposes from NOAA's website since May 2011-initially roughly fortnightly, but later updated to roughly weekly (March 2012) and eventually (March 2014) daily using an automated script. By early 2015, he had also begun noticing the remarkable lack of consistency in PHA adjustments between different updates of the GHCN dataset. Therefore, he continued archiving NOAA's roughly daily updates to the dataset. This extensive archive of NOAA's updates to the GHCN dataset is the main basis for the analysis in this paper. This comprises 1877 updates of Version 3 and 1812 updates of Version 4.
Meanwhile, Soon et al. (2015) had fortuitously been able to directly ground-truth the various adjustments to the Valentia Observatory record because the station observers had provided access to both the station history metadata and accompanying parallel measurements for two of the more recent events [18]. However, at the time, access to publicly archived station history metadata for other stations was extremely limited.
Some of these station history metadata have been published in the scientific literature, e.g., [52,54,[58][59][60][61][62][63][64][65], while others are archived by various organizations such as national meteorological services, e.g., [51,[66][67][68]. However, much of the collected station metadata are part of ongoing digitizing projects, e.g., [46][47][48][49][50][55][56][57][69][70][71], and are not yet publicly archived. In addition, analyzing and interpreting the relevance of historical metadata for evaluating homogenization adjustments can be somewhat subjective. Therefore, for this collaborative study, the first four listed co-authors sought the assistance of the remaining co-authors in compiling and analyzing the station history metadata from as many GHCN stations as possible for the European region. At the time of writing, we have compiled relevant station history metadata for more than 800 European GHCN stations, and these stations will be the basis for this analysis.
In this paper, we will use this unique combination of the extensive archive of more than 1800 iterations each of Version 3 and 4 of the GHCN datasets along with our compilation of station history metadata to quantify and evaluate the PHA adjustments that have been applied by NOAA to the GHCN dataset since 2011 [3,4]. We appreciate that the PHA [27] has performed quite well over the years in various studies, including benchmarking tests [27,[34][35][36]. However, these earlier assessments of the PHA were generally "one-off" assessments, i.e., they did not evaluate the consistency of the breakpoints and adjustments applied with repeated runs of the algorithm.
Nominally, our analysis in this study is confined to Europe, since this is the region for which we have compiled the relevant station history metadata. However, this region comprises many of the longest and most complete station records in the GHCN dataset, and we therefore think the results from this case study are of relevance to all users of the GHCN dataset. We also expect our findings will be of interest to the temperature homogenization community.
The main aims of this study are: 1. To describe how the PHA adjustments applied to the stations in the widely-used GHCN datasets vary between updates, using this subset of European stations as a detailed case study for the entire global dataset. 2. To evaluate how closely related (or otherwise) the breakpoints identified by the PHA process are to documented station history metadata events. 3. To discuss the implications these findings have for the scientific community's goals of accurately estimating regional climatic temperature trends. 4. To make recommendations for steps the temperature homogenization community can take to resolve some of the identified problems, moving forward.

Timeline of How Our Archive of GHCN Datasets Was Compiled
NOAA NCEI update the GHCN-monthly datasets and rerun the PHA homogenization program approximately daily and upload it to their website (currently https://www.ncei.noaa. gov/products/land-based-station/global-historical-climatology-network-monthly [accessed on 24 December 2021]).
As discussed above, each time the PHA program is run, the homogenized GHCN dataset generated is different from the previous iteration. Initially, we might expect the adjustments applied to each station record to remain fairly consistent from run to run. For instance, we might expect the adjustments to vary occasionally as each of the still-active stations' records are updated with the latest month's data. Also, as the dataset is updated, the "40 out of 100 nearest neighbors" used by the PHA for homogenizing an individual station could change from run to run [27]. However, as we will discuss in detail later, the adjustments applied to each station record will often change quite dramatically between runs as NOAA updates the dataset and re-runs their homogenization program. Each run will often identify new breakpoints throughout the entire record and often will drop breakpoints which had been identified on previous runs.
As a result, the homogenized temperatures for any given station in, e.g., the mid-19th century might be very different in Tuesday's dataset than they were in Monday's dataset, for instance. For other stations, the homogenized records might remain essentially the same every time the homogenization program is run over the years.
Fortunately, as mentioned in the introduction, one of us (P.O'N.) began regularly downloading and archiving the latest versions of the GHCN datasets from NOAA's website in 2011-initially roughly fortnightly, but daily since March 2014. Table 2 shows the number of different iterations of the datasets in his archive that were downloaded for each year. The timeline of how frequently the datasets have been downloaded is described below: March 2014: An automated script was set up to download the dataset daily. However, NOAA's updates appear to have been only approximately every 24 hours. Therefore, on some days, the datasets downloaded were identical to the preceding day. We have removed these identical copies from our analysis. Furthermore, on some days, the download was not carried out due to P.O'N's computer being offline for maintenance or travel. Hence, the annual totals for each dataset in Table 2 are less than 365 for all years.  Figure 1 shows the locations of all the European GHCN stations for which we compiled station history metadata (filled red circles) as well as the remaining stations for which we had not yet identified station history metadata at the time of writing. Table 3 lists the total station counts for each country. Details on the sources for these station history metadata are provided in the Supplementary Materials.  Figure 1 shows the locations of all the European GHCN stations for which we compiled station history metadata (filled red circles) as well as the remaining stations for which we had not yet identified station history metadata at the time of writing. Table 3 lists the total station counts for each country. Details on the sources for these station history metadata are provided in the Supplementary Materials.

Case Study of Cheb, Czech Republic as an Example of Our Analysis Framework
As a case study, Figures 2 and 3 show a detailed example of how we analyzed each station. The results are for one station record-the Version 4 record for Cheb, Czech Republic (GHCN Version 4 station ID = "EZM00011406"). Figure 2a plots the net adjustment applied to the homogenized record (as a linear trend in • C/century) for each day's version of the GHCN dataset. It can be seen that this metric changes quite erratically from day to day. This is a surprising result to us. We might have expected some occasional variations in the exact adjustments applied to a given station over the years, e.g., due to changes in the stations used as neighbors and monthly updates to the most recent temperature values. However, we would have still expected that the homogenization adjustments calculated by the PHA for any given station should remain fairly similar every time the algorithm is re-run.    It can be seen that this metric changes quite erratically from day to day. This is a surprising result to us. We might have expected some occasional variations in the exact adjustments applied to a given station over the years, e.g., due to changes in the stations used as neighbors and monthly updates to the most recent temperature values. However, we would have still expected that the homogenization adjustments calculated by the PHA for any given station should remain fairly similar every time the algorithm is re-run.
Every time the PHA is run, the program calculates a series of breakpoints for each station record along with the magnitudes of the adjustments applied to the station record from that breakpoint to the end of the record. However, each time the PHA is re-run, the breakpoints and adjustments applied to each station record can change. On the other Every time the PHA is run, the program calculates a series of breakpoints for each station record along with the magnitudes of the adjustments applied to the station record from that breakpoint to the end of the record. However, each time the PHA is re-run, the breakpoints and adjustments applied to each station record can change. On the other hand, the PHA will often recalculate the same breakpoints and adjustments that it identified on a previous run. Figure 2b plots a histogram of the "distinct sets of homogenization adjustments" that occurred for the Cheb record over the 1770 runs we analyzed. Over the 1770 runs (2015-2021), 34% of the sets of adjustments to Cheb were repeated more than 20 times, but 7% of the sets of adjustments only occurred once during the entire period. Figure 2c presents the information conveyed in Figure 2a in histogram form, i.e., it shows the distribution of the different net homogenization adjustments applied to the Cheb Atmosphere 2022, 13, 285 11 of 21 record by NOAA over the 1770 runs. In this case, the average net adjustment for Cheb was relatively similar, with 59% of the net adjustments falling in the range of −0.8 • C/century to −0.5 • C/century. As will be discussed below, the breakpoints identified for this station were relatively consistent. However, even for this relatively consistent example, the net adjustments were often several tenths of a degree higher or lower. Figure 3 provides more details on the exact adjustments applied to this station record. Three documented station history events are associated with our metadata for this station. The dates of these three events are indicated by vertical (blue) dashed lines in each of the panels except Figure 3b, and details on the events are provided in the figure caption. Figure 3a shows the original unhomogenized station record. For this paper, we define five categories of "breakpoints" based on how frequently the PHA identifies the date as a breakpoint: 1. "Very consistent": the breakpoint was identified for >95% of the PHA runs; 2. "Consistent": the breakpoint was identified >75%, but ≤95% of the runs; 3. "Inconsistent": the breakpoint was identified for >25%, but ≤75% of the runs; 4. "Intermittent": the breakpoint was identified >5%, but ≤25% of the runs; 5. "Very intermittent": the breakpoint was identified for ≤5% of the runs; For Cheb, the 18 breakpoints comprise: 2 "very consistent"; 5 "inconsistent"; 4 "intermittent"; and 7 "very intermittent". Given that two of the breakpoints were "very consistent" (indeed, they occurred in 100% of the runs), the adjustments for this station are quite similar between different runs. However, as discussed above, there is quite a lot of variability in the exact set of adjustments applied in each run. In Figure 3c-i, we plot the 7 most frequently applied sets of adjustments. The adjustments of Figure 3c were applied to 107 of the 1770 runs (6.0%), Figure 3d was applied 28 times (1.6%) and the other five sets were applied 27 times each (1.5%). Therefore, they collectively describe 15.1% of the adjustments applied to the Cheb record. While the shapes of the adjustments are broadly similar, and all of the iterations include the two breakpoints mentioned above (one of which corresponds to a documented event), as mentioned before, there is still considerable variability in the overall net adjustments to the record.
The above discussion was just for one of the GHCN Version 4 stations, but should give a concrete feeling for the type of information revealed by the analysis. In the next section, we will summarize the results from all of the analyzed stations, i.e., 847 Version 4 and 249 Version 3 stations. Figure 4 shows the numbers of distinct sets of adjustments associated with each of the records in Version 3 (green) and Version 4 (blue). For Version 4, we can see that the majority of stations (68%) had more than 25 distinct sets of adjustments over the period of record (2015-2021, with 1812 distinct iterations of the dataset analyzed), and 14% of the stations (including Cheb described above, with 269) had more than 200 distinct sets of adjustments applied. For Version 3, the results are a bit more encouraging in that only 44% of the records had more than 25 distinct sets of adjustments and 12% had more than 200 distinct sets. Nonetheless, for both versions, there is clearly a major inconsistency in the homogenization adjustments that NOAA has applied to the GHCN station records over time.

Results
record (2015-2021, with 1812 distinct iterations of the dataset analyzed), and 14% of the stations (including Cheb described above, with 269) had more than 200 distinct sets of adjustments applied. For Version 3, the results are a bit more encouraging in that only 44% of the records had more than 25 distinct sets of adjustments and 12% had more than 200 distinct sets. Nonetheless, for both versions, there is clearly a major inconsistency in the homogenization adjustments that NOAA has applied to the GHCN station records over time.  In Figures 5 to 7, we assess the consistency of all homogenization adjustments applied to all of the European stations we analyzed for each Version. As explained in Tables  2 and 3

, for Version 3, this comprises 1877 PHA iterations applied to 259 stations (from 24 countries). For Version 4, this comprises 1812 applied to 847 stations (again from 24 countries)
Firstly, in Figure 5, we assess the consistency of the PHA adjustments without reference to any station history metadata. Using the same categories for the consistency of breakpoints as in Section 2.3, we see that the majority of PHA adjustments were applied less than or equal to 25% of the time. For Version 3, 38% of adjustments were "very intermittent" (i.e., applied ≤5% of the time) and 23% were "intermittent" (i.e., 5-25% of the time) - Figure 5(a). For Version 4, 32% were "very intermittent" and 32% were "intermittent - Figure 5  In Figures 5-7, we assess the consistency of all homogenization adjustments applied to all of the European stations we analyzed for each Version. As explained in Tables 2 and 3 Figure 5. Breakdown of how consistent the PHA adjustments applied to the European Global H torical Climatology Network (GHCN) stations are when considered over all iterations analyz "Very consistent" adjustments were repeatedly identified by the PHA for >95% of the iteratio "Consistent" was ≤95%, but >75%; "Inconsistent" was ≤75%, but >25%; "Intermittent" was ≤2 but >5%; "Very intermittent" was ≤5%. Only a minority of the PHA adjustments were consistently applied to the station r ords. For Version 3, only 14% of adjustments were "very consistent" (i.e., applied >95% the time) and only 21% were "consistent" (i.e., 75-95% of the time) -see Figure 5(a). Version 4, only 9% were "very consistent" and only 16% were "consistent" -see Fig  5( As discussed in the introduction, aside from the U.S. component of the dataset ( analyzed here), the PHA is applied by NOAA to the GHCN in the absence of any stat history metadata [3,4]. NOAA's primary justification for this is that, when the station r ords for the GHCN were initially being compiled, the available station history metad were very limited [2][3][4], except for the U.S. component (USHCN), which is based on d "Very consistent" adjustments were repeatedly identified by the PHA for >95% of the iterations; "Consistent" was ≤95%, but >75%; "Inconsistent" was ≤75%, but >25%; "Intermittent" was ≤25%, but >5%; "Very intermittent" was ≤5%. At any rate, for this paper, we have compiled together, from multiple sources, station history metadata for 847 of the Version 4 stations and 259 of the Version 3 stations in Europe. In Figure 6, we compare how often the various breakpoints identified by NOAA for these station records coincide with any of the documented events in the corresponding metadata.
(a) (b) Figure 6. Breakdown of how often the PHA adjustments for a station corresponded to a documented event in the station history metadata. "Exact match" refers to breakpoints that occurred for the same month as a documented event, whereas "No match" indicates that there was no documented event within 3 years of the breakpoint. Note that if the date of a documented event was only known to the nearest year, the date was approximated to June of that year, i.e., month 6. Only 3% of the PHA breakpoints for either version corresponded exactly (to the nearest month) to documented events. However, in quite a few cases only the year of an event was recorded. Furthermore, given that the PHA was being run by NOAA without any station history metadata and was relying purely on statistical techniques, it is understandable if a breakpoint is not identified accurately to the nearest month. Nonetheless, even if we consider those breakpoints identified by the PHA within 12 months of the documented event, the matches are disappointedly low: only 18% for Version 3 and 19% for Version 4.
For reference, we also show those breakpoints that loosely "match" to documented events in that the dates were within 2 years or 3 years. However, we caution that to give the "best case outcome" for the PHA, we considered a match to have occurred if any documented event coincided with the breakpoint. As a result, if two breakpoints were identified within 2 or 3 years of a given metadata event, both breakpoints might be doublecounted in these categories as being "associated with" the same event even if one breakpoint occurred before the event and the other occurred after the event.
Therefore, some of the apparent "matches" within 2 or 3 years are probably spurious. Regardless, 69% of the PHA breakpoints for Version 3 stations and 67% of those for Version 4 stations had no match to any event at all within 3 years.
In Figure 7, we cross-reference the results of Figures 5 and 6. That is, we compare how the consistency of a given breakpoint compares to whether it matches with documented station history events. If the consistency is better for breakpoints associated with documented events, this might offer some optimism that the PHA's consistency could be improved using station history metadata. Disappointingly, for the Version 3 stations, there does not seem to be a clear improvement in the consistency of breakpoints based on Figure 6. Breakdown of how often the PHA adjustments for a station corresponded to a documented event in the station history metadata. "Exact match" refers to breakpoints that occurred for the same month as a documented event, whereas "No match" indicates that there was no documented event within 3 years of the breakpoint. Note that if the date of a documented event was only known to the nearest year, the date was approximated to June of that year, i.e., month 6.  whether they coincided with documented events. However, the results are more encouraging for Version 4, in that 39% of the very consistent breakpoints (occurring >95% of the time) had "a match" within 3 years and 28% within 1 year, whereas for the very inconsistent breakpoints (occurring ≤5% of the time) only 28% had "a match" within 3 years and only 14% had a match within 1 year. Given that the sample size was larger for Version 4 (847 stations vs. 259 in Version 3), this suggests that the consistency of the PHA adjustments to the GHCN might be partially improved if station history metadata were used.

A case study of the Ukrainian results
The above discussion describes the combined results across all 24 analyzed countries. However, the exact results vary slightly from country to country. Therefore, as a case study, let us now briefly discuss the breakdown of results for one of the 24 countries -Ukraine.
In both considered versions of the GHCN dataset (Versions 3 and 4), there are multiple stations for Ukraine, one of the largest countries in Europe. Version 3 only contained 25 stations, but this has been increased to 119 for the most recent version (4). Despite the significant increase in stations in Version 4, we should note that there is still a significant portion of unused Ukrainian temperature data (especially, historical), which could increase the representativeness of the regional climate of Eastern Europe in the GHCN [56,57].
In order to evaluate the regional peculiarities of the GHCN homogenization algorithm (its detecting and adjustment parts), metadata of Ukrainian stations were collected from their historical descriptions [73,74]. For the Ukrainian metadata, this consists mainly of dates of station relocations and other non-climatic events which were clearly docu- Firstly, in Figure 5, we assess the consistency of the PHA adjustments without reference to any station history metadata. Using the same categories for the consistency of breakpoints as in Section 2.3, we see that the majority of PHA adjustments were applied less than or equal to 25% of the time. For Version 3, 38% of adjustments were "very intermittent" (i.e., applied ≤5% of the time) and 23% were "intermittent" (i.e., 5-25% of the time)- Figure 5a. For Version 4, 32% were "very intermittent" and 32% were "intermittent- Figure 5b.
Only a minority of the PHA adjustments were consistently applied to the station records. For Version 3, only 14% of adjustments were "very consistent" (i.e., applied >95% of the time) and only 21% were "consistent" (i.e., 75-95% of the time)-see Figure 5a. For Version 4, only 9% were "very consistent" and only 16% were "consistent"-see Figure 5b.
As discussed in the introduction, aside from the U.S. component of the dataset (not analyzed here), the PHA is applied by NOAA to the GHCN in the absence of any station history metadata [3,4]. NOAA's primary justification for this is that, when the station records for the GHCN were initially being compiled, the available station history metadata were very limited [2][3][4], except for the U.S. component (USHCN), which is based on data housed by NOAA [21,22,24,25]. A secondary justification is that some benchmarking simulations involving the introduction of artificial breakpoints to synthetic data have suggested that the PHA algorithm performs well at identifying many of these breakpoints without the use of "metadata" [27,34,35]. Another justification is that several studies (involving NOAA co-authors) of the USHCN dataset have argued that the PHA performs well at identifying both documented and undocumented events [21,27,29,37].
Nonetheless, in recent years, there has been a renewed interest in tracking down and digitizing station history metadata for many countries and individual stations in the recognition that this can help identify genuine non-climatic breakpoints, e.g., [46][47][48][49][50][51][52][53][54][55][56][57][58]. Unfortunately, the levels of detail and comprehensiveness of the station history metadata currently available vary substantially between countries. Some of this might be a legacy of different practices between different national meteorological services; e.g., Mamara et al. have noted that the station history metadata potentially available in the archives of the Hellenic National Meteorological Service (HNMS) are quite limited for Greece compared to some other countries [55,72]. However, they emphasize the importance of prioritizing the gathering and digitization of this information where available: "These cases demonstrate the necessity that the various meteorological services gather and digitize detailed metadata" [72].
At any rate, for this paper, we have compiled together, from multiple sources, station history metadata for 847 of the Version 4 stations and 259 of the Version 3 stations in Europe. In Figure 6, we compare how often the various breakpoints identified by NOAA for these station records coincide with any of the documented events in the corresponding metadata.
Only 3% of the PHA breakpoints for either version corresponded exactly (to the nearest month) to documented events. However, in quite a few cases only the year of an event was recorded. Furthermore, given that the PHA was being run by NOAA without any station history metadata and was relying purely on statistical techniques, it is understandable if a breakpoint is not identified accurately to the nearest month. Nonetheless, even if we consider those breakpoints identified by the PHA within 12 months of the documented event, the matches are disappointedly low: only 18% for Version 3 and 19% for Version 4.
For reference, we also show those breakpoints that loosely "match" to documented events in that the dates were within 2 years or 3 years. However, we caution that to give the "best case outcome" for the PHA, we considered a match to have occurred if any documented event coincided with the breakpoint. As a result, if two breakpoints were identified within 2 or 3 years of a given metadata event, both breakpoints might be double-counted in these categories as being "associated with" the same event even if one breakpoint occurred before the event and the other occurred after the event.
Therefore, some of the apparent "matches" within 2 or 3 years are probably spurious. Regardless, 69% of the PHA breakpoints for Version 3 stations and 67% of those for Version 4 stations had no match to any event at all within 3 years.
In Figure 7, we cross-reference the results of Figures 5 and 6. That is, we compare how the consistency of a given breakpoint compares to whether it matches with documented station history events. If the consistency is better for breakpoints associated with documented events, this might offer some optimism that the PHA's consistency could be improved using station history metadata. Disappointingly, for the Version 3 stations, there does not seem to be a clear improvement in the consistency of breakpoints based on whether they coincided with documented events. However, the results are more encouraging for Version 4, in that 39% of the very consistent breakpoints (occurring >95% of the time) had "a match" within 3 years and 28% within 1 year, whereas for the very inconsistent breakpoints (occurring ≤5% of the time) only 28% had "a match" within 3 years and only 14% had a match within 1 year. Given that the sample size was larger for Version 4 (847 stations vs. 259 in Version 3), this suggests that the consistency of the PHA adjustments to the GHCN might be partially improved if station history metadata were used.

A Case Study of the Ukrainian Results
The above discussion describes the combined results across all 24 analyzed countries. However, the exact results vary slightly from country to country. Therefore, as a case study, let us now briefly discuss the breakdown of results for one of the 24 countries -Ukraine.
In both considered versions of the GHCN dataset (Versions 3 and 4), there are multiple stations for Ukraine, one of the largest countries in Europe. Version 3 only contained 25 stations, but this has been increased to 119 for the most recent version 4. Despite the significant increase in stations in Version 4, we should note that there is still a significant portion of unused Ukrainian temperature data (especially, historical), which could increase the representativeness of the regional climate of Eastern Europe in the GHCN [56,57].
In order to evaluate the regional peculiarities of the GHCN homogenization algorithm (its detecting and adjustment parts), metadata of Ukrainian stations were collected from their historical descriptions [73,74]. For the Ukrainian metadata, this consists mainly of dates of station relocations and other non-climatic events which were clearly documented in the stations' histories.
As discussed above, the results of the homogenization procedure for a particular station can vary significantly from day to day (i.e., from one homogenization run to another). These variations can appear as different sets of detected break points (different timing, different number), different adjustment factors, or both. It is not completely clear why this wide range of variations exists.
At any rate, for Version 3 of the dataset, the total number of break points detected in all homogenization runs is 47, while for Version 4, this quantity is considerably larger, totaling 592 (Table 4). This is physically explainable because in the latest version of GHCN, the higher station density has probably increased the average correlation between the time series, allowing the PHA to detect more breaks. In both versions, the majority of the detected breaks (76.6% in Version 3 and 67.4% in Version 4) can be reported as "intermittent" or "very intermittent". They occur in less than 25% of the homogenization runs. "Inconsistent" breaks, occurring between 25 and 75% of the performed homogenizations, constitute 10.6% and 19.3% of the total detected breaks in Version 3 and 4, respectively. Only approximately 13% of the detected shifts can be considered as "consistent" or "very consistent" in both versions of the GHCN dataset, i.e., persistently occurring in more than 75% of the homogenization runs (Table 4). An important question here is how many of the detected breaks can be confirmed by the collected metadata. In our analysis, we assumed that a break point is confirmed by some historical event when a time shift between them is not more than 1 year. The results of the comparison analysis are presented in Table 5. As can be seen from the table, only a small number of detected shifts (around 5-6% in both versions of GHCN) are associated with documented metadata events. It is worth noting that, in other studies involving two of us (Oleg and Olesya S.), the Ukrainian monthly air temperature data (collections of 178 time series of mean, maximum and minimum temperature) has been homogenized for the period of 1946-2015 [56,57] with the HOMER software [75]. In this independent analysis, for all three datasets, the percentage of detected break points, which can be confirmed by historical events, was more than 30%. However, these results are difficult to directly compare with the PHA homogenization procedure in GHCN, because HOMER was run in a semi-automatic mode where final decisions regarding detected breaks were made by an expert.

Discussion
After comparing the different homogenization adjustments applied by NOAA to their widely used Global Historical Climatology Network (GHCN) monthly temperature dataset, we found a disconcerting inconsistency between the updates to the dataset from day to day. Availing of a substantial archive compiled by regularly downloading the datasets from NOAA's website over the period 2011-2019 for version 3 and 2015-present for version 4, we assembled 1877 distinct iterations of the version 3 dataset and 1812 of the version 4 dataset which NOAA has published over the past decade.
From our large sample of station records from 24 European countries, we found that only 16% of the breakpoints and adjustments applied to the GHCN version 4 station records were repeatedly applied more than 75% of the time; 64% of the adjustments were applied to records less than 25% of the time, with 32% being applied less than 5% of the time. The results were arguably slightly better for version 3, with 21% being applied more than 75% of the time and 61% being applied less than 25% of the time. However, for version 3, 38% of the adjustments were applied less than 5% of the time.
This remarkable inconsistency in the results from NOAA's application of the Pairwise Homogenization Algorithm (PHA) [27] to the GHCN since 2011 [3,4] is quite surprising since the PHA has performed quite well over the years against various benchmarking tests [27,[34][35][36]. However, we note that those earlier assessments of the PHA were generally "one-off" assessments, i.e., they did not evaluate the consistency of the breakpoints and adjustments applied with repeated runs of the algorithm. Therefore this inconsistency of the PHA adjustments between consecutive runs would have been inadvertently overlooked by those earlier tests.
Rather, we believe these findings should be used as motivation for improving our approaches to homogenizing the available temperature records. Meanwhile, the results raise serious concerns over the reliability of the homogenized versions of the GHCN dataset, and more broadly over the PHA techniques, which do not appear to have been appreciated until now. As shown in Table 1, the homogenized GHCN datasets have been widely used by the community for studying global temperature trends. We note that PHA has also been used for homogenizing some other climatic datasets [80][81][82][83][84][85], although our analysis here is specifically on the PHA adjustments that have been applied to the GHCN datasets since 2011 [3,4].
Another major concern is how few of the adjustments are associated with documented station history metadata events-only 19% of Version 4 and 18% of Version 3 breakpoints occur within 1 year of a documented event. We agree that metadata information can sometimes be erroneous and that many events leading to non-climatic breakpoints will go undocumented [26,27,55,56,72,[86][87][88][89].
So, we would expect a fraction of the breakpoints to be undocumented. However, particularly when these breakpoints are inconsistent from run to run, it raises the question, how do we know if an undocumented breakpoint that occurs for one run but not another is genuine? We are reminded of this observation by the 20th century statistician and economist, Schumacher: "A man who uses an imaginary map, thinking that it is a true one, is likely to be worse off than someone with no map at all."-Ernst P. Schumacher, "Small is Beautiful" (1973) [90].
With that in mind, we believe that future efforts to homogenize temperature datasets should involve a greater use of station history metadata. We appreciate that in recent years, with the commendable compilation of very large climate datasets, there has been an understandable appeal of using automated statistical homogenization techniques which do not require metadata. However, given the findings from this analysis and the above "imaginary map" problem, we suggest that, as a community, we should begin treating these "blind" homogenization results with a bit more skepticism. With that in mind, we welcome efforts to allow the use of metadata in various homogenization packages, e.g., [75,91,92], including PHA [27].
While we encourage more investigation of the various homogenization techniques via "benchmarking" experiments using synthetic data [27,[34][35][36], we believe these findings highlight the importance of also benchmarking techniques against real-world data [93]. The collection and digitizing of station history metadata for as many stations as possible should be encouraged. In many cases, this could potentially be combined with the various commendable efforts to track down and digitize early instrumental thermometer records [44,[46][47][48][49]53].
Finally, in this study, we have not focused on the impacts of homogenization on longterm trends. However, we note that various studies have demonstrated that the application of homogenization to a series of temperature records can inadvertently lead to the "aliasing" or "blending" of some non-climatic biases if multiple neighbors are systemically affected by similar long-term non-climatic biases, such as urbanization biases or siting biases [32,38,40]. As mentioned in the introduction, Soon et al. (2018) have suggested that one way to minimize this blending problem would be to ensure that the neighbor network used for the homogenization process is not systemically biased relative to the target stations, e.g., rural stations should be homogenized using a mostly rural station network [38].