Distributional trends in the generation and end-use sector of low-carbon hydrogen plants

This paper uses established and recently introduced methods from the applied mathematics and statistics literature to study trends in the end-use sector and capacity of low-carbon hydrogen projects in recent and upcoming decades. First, we examine distributions in plants over time for various end-use sectors and classify them according to metric discrepancy, observing clear similarity across all industry sectors. Next, we compare the distribution of usage sectors among different continents and examine the changes in sector distribution over time. Finally, we judiciously apply several regression models to analyse the association between various predictors and the capacity of global hydrogen projects. Across our experiments, we see a welcome exponential growth in the capacity of zero-carbon hydrogen plants and significant growth of new and planned hydrogen plants in the 2020's across every sector.


Introduction
Hydrogen has great potential as an alternative fuel source to existing fossil fuels and may play a role in the world's coordinated attempt to reach net-zero carbon emissions during this century. Hydrogen production has already been used for numerous purposes, including production of liquid fuels; water and space heating; direct injection into the gas grid; transport; and other industrial applications. Unlike petrol or natural gas, the combustion of hydrogen does not produce any greenhouse gas (GHG) emissions, which contribute to anthropogenic global warming. Hydrogen in its molecular form (H$_2$ gas) does not exist naturally on Earth; it must be synthesised via a variety of different procedures.
The production of hydrogen is classified by colours according to its mode of preparation and associated emissions. "Green hydrogen" refers to production techniques that do not generate any GHG emissions, the most desirable outcome from a sustainability perspective. Typically, green hydrogen plants use renewable sources of energy (such as solar) to extract hydrogen via the electrolysis of water; the acquired hydrogen may then be stored and subsequently used for numerous applications. Black, brown, and grey hydrogen refer to production techniques that use black coal, brown coal, and natural gas, respectively; these are increasingly obsolete and costly energy sources with considerable amounts of emissions and harmful byproducts [1]. Blue hydrogen is a middle ground, defined as the production of hydrogen using fossil fuels followed by carbon capture and storage (CCS). This is not a truly zero-emissions process, as only a proportion of all generated GHGs can be captured, and harmful byproducts may remain.
With the increasing global interest in alternative energy sources in recent years, a great deal of research has focused on the viability and underlying science of hydrogen production. Early research into hydrogen fuel cells dates back several decades [2][3][4]. Since then, research on hydrogen production has been substantial: [5,6] each provided a review article on different electrolysis technologies, [7] detailed alkaline electrolysis (ALK), and [8][9][10] described successive advances in proton exchange membrane electrolysis (PEM). Reference [11] compared and contrasted ALK and PEM in detail.
arXiv:2301.08457v2 [physics.soc-ph] 9 Mar 2023
More recent research has explored sophisticated means for integrating different renewable energy sources. In [12], the authors explored the use of geothermic energy to power electrolysis, and [13] investigated the use of micro-organisms (biocatalysed electrolysis). Then, [14] discussed further advances in such microbial electrolysis cells. More recently, [15] described how electronic waste may be used to generate metallic components for a process called "chemical looping reforming".
There have also been many technological advances to enhance the efficiency of hydrogen production, both novel and incremental. In [16], the authors examined numerous means to reduce energy consumption during electrolysis. In [17], the authors compared the efficiency of different electrodes, with further advances made by [18]. In particular, [19] analysed electrode overpotential during ALK electrolysis production, whereas [20] explored optimal configurations of electrolysis under different conditions. In [21], the authors investigated state-of-the-art electrocatalysts, and [22] analysed the use of hybrid structures for increasingly efficient water electrolysis, combining both morphological features and electrochemical properties, whereas various researchers have studied other cutting-edge catalysts [23][24][25][26][27]. Numerous advances in blue hydrogen plants have also been made, including pyrolysis of plastic [28] and catalytic decomposition of methane [29] and other hydrocarbons [30]. Finally, numerous articles have examined different means for transportation and storage of hydrogen, an essential component of its widespread use [31][32][33][34].
In addition, numerous articles have taken a geopolitical focus, analysing both the policy environment and natural resources of various countries and their suitability for hydrogen production plants. Different papers have investigated the significant variability in countries' adoption of hydrogen technologies, including in the United States [35], the United Kingdom [36], China [37], South Korea [36], the Philippines [38], Mexico [39], Morocco [40], and across the European Union [41]. Our previous work analysed the geographic rollout of low-carbon plants across different continents [42]. Finally, numerous authors have discussed the end-use sectors [43,44] or points of end-use [45] of hydrogen energy usage, including [46], who comprehensively reviewed the same low-carbon plants we mathematically analyse.
Whereas the existing literature has been more technological or geopolitical in focus, our paper is a mathematical study of trends in the rollout and prevalence of low-carbon hydrogen plants (green and blue) with a focus on the end-use sector. We make use of time series analysis and metrics that have been extensively applied to various fields such as epidemiology [47][48][49][50][51][52][53], environmental sciences [54][55][56], finance [57][58][59][60][61][62][63][64][65][66], cryptocurrencies [67][68][69][70][71], crime [72][73][74] and other fields [75][76][77][78][79]. We are unaware of any instance where time series or distance analysis has been applied to the rollout of hydrogen plants over time and by the end-use sector. We study the changing usage and energy capacity of low-carbon hydrogen plants over time, with a particular interest in the increasing potential of green hydrogen plants, which emit zero carbon. Our main finding is promising: an exponential increase in the capacity of green plants over time, regardless of usage sector, and a dramatic closing of the gap between the capacity of green and non-green plants.

Data
All data analysed in this paper are drawn from the International Energy Agency (IEA) [80] and consist of plants built in (or projected to be built in) 1975-2043. This dataset records all low-carbon hydrogen plants, namely those that are either green (zero emissions) or blue (incorporating fossil fuels with carbon capture and storage (CCS)), as discussed in Section 1. We will refer to blue hydrogen plants as "Fossil" throughout the manuscript to highlight this distinction. The dataset contains four different technologies of green plants: alkaline electrolysis (ALK), proton exchange membrane electrolysis (PEM), solid oxide electrolysis cells (SOEC), and other electrolysis. The technologies of blue plants are coal gasification, natural gas reforming, and oil-based processes, in each case followed by CCS. Throughout the paper, we aggregate the four green electrolysis technologies and the three blue CCS production methods to classify each plant as green or blue/fossil. We also make use of the IEA estimated zero-carbon hydrogen capacity, measured in nm$^3$/h of hydrogen for each plant. This is either quoted directly from the plant or estimated according to the stated power consumption of the plant and its technology. It is a measure of how much hydrogen the plant produces (or will produce upon completion).
The most important aspect of the dataset we analyse is the end-use sector of each plant. These are: refining (oil refining), ammonia (ammonia production), methanol (methanol production), iron and steel (steelmaking and other high-temperature iron processes), other industry (other high-temperature heat industrial applications), mobility (use in vehicles), power (use in the supply of power to the electricity grid), grid injection (injection of hydrogen into the natural gas grid), CHP (combined heat and power fuel cells), domestic heat (water and space heating), biofuels (biofuel production), synfuels (synthetic liquid fuels other than methanol), CH4 grid injection (injection of synthetic methane into the gas grid), and CH4 mobility (use of synthetic methane in vehicles). A small number of plants have more than one end-use sector, but the majority have only one.

Distributions of end-use sector over time
In this section, we study the evolution of the propagation of low-carbon hydrogen plants for each sector and investigate the similarities in these trends. Our primary mathematical object of study is the cumulative distribution function (CDF) $F_S(t)$ of the number of plants over time for each end-use sector $S$. This is defined as the proportion of plants with end-use sector $S$ indexed with time $\leq t$ out of all plants with that sector. As such, it is a non-decreasing function taking values from 0 to 1. We remark that a small number of plants have more than one usage sector, and thus will contribute to more than one CDF. This poses no problem for our analysis. The temporal CDFs for eight sectors $S$ are displayed in Figure 1. As only four plants in total are indexed before $t = 2000$, we exclude these from our figures.
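As a minimal sketch of this definition, the following computes the empirical CDF of each sector on a grid of years. The record layout (a year paired with a list of sectors), the function name `sector_cdfs`, and the toy records are assumptions made for illustration; they are not the IEA data or the authors' code.

```python
from collections import defaultdict


def sector_cdfs(plants, years):
    """Return {sector: [F_S(t) for t in years]} from (year, sectors) records."""
    by_sector = defaultdict(list)
    for year, sectors in plants:
        # A plant with several end-use sectors contributes to each of their CDFs.
        for sector in sectors:
            by_sector[sector].append(year)
    cdfs = {}
    for sector, plant_years in by_sector.items():
        total = len(plant_years)
        # Proportion of this sector's plants indexed with time <= t.
        cdfs[sector] = [sum(y <= t for y in plant_years) / total for t in years]
    return cdfs


# Hypothetical toy records (year, end-use sectors); illustrative only.
toy_plants = [
    (2005, ["power"]),
    (2015, ["power"]),
    (2025, ["power", "ammonia"]),
    (2024, ["ammonia"]),
    (2026, ["ammonia"]),
    (2028, ["ammonia"]),
]
years = list(range(2000, 2031, 10))  # 2000, 2010, 2020, 2030
cdfs = sector_cdfs(toy_plants, years)
print(cdfs["power"])
print(cdfs["ammonia"])
```

In this toy example, the power CDF rises steadily across the grid, whereas the ammonia CDF stays at zero until the final decade, mimicking the two qualitative shapes discussed in this section.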
In Figure 2, we perform hierarchical clustering on these CDFs using the $L^1$ metric between them, defined as
$$d(F_S, F_T) = \|F_S - F_T\|_1 = \int |F_S(t) - F_T(t)| \, dt. \tag{1}$$
As all CDFs take comparable values between 0 and 1, it is appropriate to directly compare them as such.
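When the CDFs are only available on a common grid of years, the $L^1$ distance can be approximated by a sum of absolute differences multiplied by the grid spacing. The sketch below assumes such a grid; the function name `l1_distance` and the sample values are hypothetical.

```python
def l1_distance(cdf_a, cdf_b, step=1.0):
    """Discrete L^1 distance: sum of |F_S(t) - F_T(t)| over the grid, times the spacing."""
    if len(cdf_a) != len(cdf_b):
        raise ValueError("CDFs must be evaluated on the same time grid")
    return step * sum(abs(a - b) for a, b in zip(cdf_a, cdf_b))


# Hypothetical CDF values on a common grid (illustrative only).
uniform_like = [0.25, 0.5, 0.75, 1.0]  # steady growth
late_growth = [0.0, 0.0, 0.25, 1.0]    # growth concentrated at the end
d = l1_distance(uniform_like, late_growth)
print(d)
```

The distance is zero only when the two CDFs agree at every grid point, and it grows with the area enclosed between the two curves.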
Hierarchical clustering [81,82] is an iterative clustering technique that seeks to build a hierarchy of similarity between elements where there is some way to measure distance between them. In this case, our elements are the cumulative distribution functions equipped with the distance above. This is opposed to a technique such as k-means, which ultimately specifies k discrete groupings of elements. Hierarchical clustering is either agglomerative, where each element (CDF in our case) begins in its own cluster and branches between them are successively built, or divisive, where all elements begin in one cluster and are successively split. The results of hierarchical clustering are commonly displayed in dendrograms, which resemble branching trees. In this paper, we implement agglomerative hierarchical clustering with the average linkage method [83]. Figure 2 reveals numerous similarities and differences in the temporal distributions of new plants, which can also be seen in Figure 1.

Figure 1.
Cumulative distribution functions $F_S$ for eight sectors $S$: (a) refining, (b) ammonia, (c) synfuels, (d) methanol, (e) mobility, (f) domestic heat, (g) CHP, (h) power. Sectors are described in Section 2. The greatest collective similarity is observed between industrial applications, with an explosion of planned plants in the 2020's. Power serves as an anomaly with its highly uniform trend of new plants.
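The agglomerative procedure with average linkage can be sketched in a few lines of plain Python. This is an illustrative implementation under stated assumptions, not the authors' code: the function name `average_linkage`, the labels, and the distance matrix are hypothetical, and in practice a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` with `method="average"` would normally be used.

```python
from itertools import combinations


def average_linkage(dist, labels):
    """Agglomerative clustering; returns the merges as (cluster_a, cluster_b, height)."""
    clusters = [(label,) for label in labels]  # every element starts in its own cluster
    index = {label: i for i, label in enumerate(labels)}

    def avg_dist(a, b):
        # Average linkage: mean of all pairwise distances between the two clusters.
        return sum(dist[index[x]][index[y]] for x in a for y in b) / (len(a) * len(b))

    merges = []
    while len(clusters) > 1:
        # Merge the closest pair of current clusters.
        a, b = min(combinations(clusters, 2), key=lambda pair: avg_dist(*pair))
        merges.append((a, b, avg_dist(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


labels = ["refining", "ammonia", "mobility", "power"]
# Hypothetical symmetric distance matrix between four CDFs (illustrative only).
dist = [
    [0.0, 1.0, 4.0, 9.0],
    [1.0, 0.0, 5.0, 8.0],
    [4.0, 5.0, 0.0, 6.0],
    [9.0, 8.0, 6.0, 0.0],
]
merges = average_linkage(dist, labels)
for a, b, height in merges:
    print(a, b, round(height, 3))
```

An element that joins the hierarchy only at the final merge, at a large height, plays the role of an outlier in the resulting dendrogram.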

Figure 2.
Hierarchical clustering on cumulative distribution functions $F_S$ (relative to time) for all 14 end-use sectors $S$ in our database. A strong cluster of similarity is observed for the seven industrial uses, ranging from biofuels to ammonia. A secondary cluster of more 'consumer uses' is revealed from grid injection to domestic heat, whereas power is observed as an outlier due to its highly uniform nature.
Examining Figure 2, the seven end-use sectors ranging from biofuels to ammonia (four of which are shown in Figure 1a-d) are deemed to be nearby. Curiously, all seven of these end-use sectors comprise industrial applications. Examining the plots in Figure 1, we can see these sectors' temporal distributions are dominated by plants to be built in 2020-2030, with a smaller number in the 2010's. Next, there is a moderately similar cluster consisting of mobility, grid injection and domestic heat. The first two, as represented by mobility (Figure 1e), have a very similar concave shape with a gradually increasing number of plants from 2000 to the 2020's, while domestic heat (Figure 1f) features slightly more activity in the 2010's. These three sectors reflect uses of hydrogen production that more directly service regular consumers.
Next, there are some detected outliers. CHP (Figure 1g), CH4 mobility, and CH4 grid injection all share a similar shape, exhibiting a growth in plants earlier relative to all other sectors. The most prominent outlier is power (Figure 1h), with a highly uniform growth in new plants from 2000 to 2030 observed in no other sector. Indeed, power has been one of the most consistently popular uses of hydrogen production since its inception, and this is reflected in the consistent and ongoing construction of new plants in this sector.

Usage distributions
In this section, we investigate the distribution of plants across different sectors, focusing on each continent of location as well as green vs. blue/fossil technology. For each plant in our dataset, the location is recorded as either a country of location or continent. Thus, we first collate the plants on a continent-by-continent basis. We divide plants into seven continental and/or geopolitical regions as follows: Europe, North America (the United States and Canada), Latin America (including Mexico), Oceania, East Asia (China, Japan and Korea), Africa, and other (plants from elsewhere in Asia and the Middle East). We select these regions to combine countries according to geographic proximity, political relations, and economic development. Next, we divide the plants into groups that specify both a continental region and whether the technology is green or blue/fossil. For example, we group all green European plants, or all East Asian fossil plants. There are no African fossil or Latin American fossil plants in our dataset, so this leaves 12 different continental/technological groupings $G$.
For each group $G$, let $p^{(G)} \in \mathbb{R}^{14}$ denote its usage distribution, the probability vector whose entry $p^{(G)}_S$ is the proportion of end-use sector assignments among the plants of $G$ that belong to sector $S$. We define the distance between two groups $G$ and $H$ as
$$d(G, H) = \frac{1}{2} \|p^{(G)} - p^{(H)}\|_1 = \frac{1}{2} \sum_S |p^{(G)}_S - p^{(H)}_S|, \tag{2}$$
where the sum is taken over all 14 sectors $S$. This has the property that $d(G, H) = 0$ if and only if groups $G$ and $H$ have an identical distribution of usages, again allowing multiple end-use sectors in some plants. Furthermore, $d(G, H) \leq 1$, with equality if and only if the usage distributions are disjoint, with no usage sectors in common at all between the two groups. When $G$ is empty, we set $p^{(G)}$ to be the zero vector $0 \in \mathbb{R}^{14}$, which is a distance of 1 from every other distribution and thus always appears as an outlier in hierarchical clustering. While seemingly simple, this distance may be interpreted as possibly the most suitable distance between two distributions on a discrete set, as it can be shown to be equivalent to the discrete Wasserstein metric between distributions on a discrete metric space. More details are provided in [84] and Appendix A.
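The distance between usage distributions can be sketched as follows for non-empty groups. The sector counts, the four-sector toy universe, and the function names are assumptions made for illustration; in particular, normalising by the total number of sector assignments is one plausible way of handling plants with multiple end-use sectors and is not confirmed by the text.

```python
def usage_distribution(sector_counts, sectors):
    """Normalise sector counts of a non-empty group to a probability vector."""
    total = sum(sector_counts.get(s, 0) for s in sectors)
    if total == 0:
        raise ValueError("this sketch only handles non-empty groups")
    return [sector_counts.get(s, 0) / total for s in sectors]


def usage_distance(p, q):
    """Half the L^1 distance between two usage distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical toy universe of four sectors and three groups (not IEA data).
sectors = ["ammonia", "mobility", "power", "refining"]
p_g = usage_distribution({"ammonia": 3, "power": 1}, sectors)   # [0.75, 0, 0.25, 0]
p_h = usage_distribution({"mobility": 2, "power": 2}, sectors)  # [0, 0.5, 0.5, 0]
p_k = usage_distribution({"refining": 5}, sectors)              # [0, 0, 0, 1]

d_gh = usage_distance(p_g, p_h)  # groups sharing the power sector
d_gk = usage_distance(p_g, p_k)  # groups with no sector in common
print(d_gh, d_gk)
```

The second pair illustrates the maximal case: two groups with disjoint usage distributions are at distance exactly 1.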
In Figure 3, we perform hierarchical clustering on the 12 continental/technological groups $G$. Examining the dendrograms in conjunction with the stacked bar plots highlights the key similarities and differences that give rise to the structure of the clustering observed. In Figure 3a, the 12 continental/technological groups divide into two primary subclusters of six groups each, with no outliers observed. A close subcluster of similarity is observed between Oceanian green, European green, North American green, and East Asian green plants. That is, we observe notable similarity in the usage distributions of green hydrogen plants from the four most economically developed continental groups. Examining Figure 4a, we see that these four groups are characterised by a distribution of end-use sectors that crosses (almost) all end-use sectors in each group. Furthermore, these four groups all have a low proportion of plants dedicated to ammonia production, an observation we will return to below. Next, European fossil and East Asian fossil plants are deemed similar and in the same primary cluster. Examining Figure 4a, we observe very low ammonia usage, but not as diverse a range of end-use sectors as the previous subcluster.
Turning to the other main cluster, we see that all six of these groups, North American fossil, Latin American green, African green, other green, other fossil, and Oceanian fossil plants, have high proportions of plants dedicated to ammonia production. The observed subcluster of other fossil and Oceanian fossil plants exhibits more plants dedicated to ammonia than any other group, while the remaining subcluster is slightly more evenly distributed. Thus, the primary division of the 12 groups into two main clusters is almost entirely explained by a dramatic difference in just a single sector of usage: ammonia production.
There are numerous insights to be gleaned from examining the dendrograms and bar plots on a decade-by-decade basis. In the decade 2000-2009, a subcluster of similarity is observed between six of seven continents' green power plants: Latin American green, East Asian green, other green, Oceanian green, European green and North American green (Figure 3b). Turning to Figure 4b, we can see that this similarity can be primarily explained by the prevalence of power generation plants among these green plants. In the next decade (Figure 3c), the subcluster structure of North American green, Latin American green, other green, and East Asian green plants (with European fossil, Oceanian green, and European green plants being slightly less similar) is primarily explained with a high proportion of mobility and power among these plants. The decade 2020-2029's clustering (Figure 3d) is nearly identical to that of the full period, as the counts of the full period are dominated by this decade, so the distributional differences between groups of 2020-2029 closely resemble those over all time. Finally, the decade 2030-2039 is mostly dominated by missing data or continental/technological groups that have only planned to build plants dedicated to ammonia production. As this is quite far into the future, it will of course be subject to change as more plants and their end-use sector are planned and built.

Trends in capacity over time and relative to technology and end-use sector
In this section, we examine the increase in capacity of low-carbon hydrogen plants over time while examining differences between technology and end-use sector. In Figure 5, we display all plants with an identified technology, end-use sector, and capacity in our dataset. Displaying the logarithm of the capacity against the year (or projected year) of construction, we see an approximate linear trend between the log of capacity and time, suggesting an exponential increase in capacity over time. We separate plants' hydrogen technology into green and blue/fossil and separate them into end-use sectors. For graphical readability, we have chosen to separate sectors into the four clusters discussed in Section 3 that were determined in Figure 2. These are: the seven industrial applications, the three consumer/domestic sectors (domestic heat, grid injection, mobility), the CHP/CH4 outlier cluster, and the single outlier (power).
For robustness, we also repeated this plot with some slightly different groupings. For example, we separated CHP from the two CH4 uses, and we separated domestic heat from mobility/grid injection (as domestic heat is not quite as similar as the latter two in Figure 2). No substantial differences were observed in the log capacity vs. year scatter plot, so we proceed with the groupings from Figure 2 as they are.
To confirm this exponential fit, we implemented a series of linear regressions relating the capacity (or log capacity) to time, technology, and usage. Linear regression is a commonly used and celebrated statistical model. This method seeks to model a response or target variable (in this case capacity or log capacity) as a linear function of other variables, including time. The most suitable linear coefficients are determined by a process called least squares estimation [85], where the line of best fit is defined by the property that the sum of squared deviations from that line is minimised. Linear regression is perhaps the simplest of all statistical or machine learning models; however, it is by no means trivial. In its simplicity, it can provide greater interpretability than much more complex models, and sometimes perform just as well. That is our aim here. We remark that our purpose is not predictive (predicting future trends in hydrogen plants) but to describe trends observed in the data as of now (including planned plants for future construction). Our linear regression models take one of the following forms:
$$y_i = \beta_0 + \beta_1 t_i + \sum_j \gamma_j u_{ij} + \delta g_i + \varepsilon_i,$$
$$\log y_i = \beta_0 + \beta_1 t_i + \sum_j \gamma_j u_{ij} + \delta g_i + \varepsilon_i,$$
where $y_i$ is the capacity, $t_i$ is the year for each plant, and we include dummy variables $u_{ij}$ for end-use sectors and $g_i$ for technology (green or fossil), the latter being omitted in the models without the technology variable. With only two technologies, only one dummy variable is necessary; for the end-use sectors, we require one fewer dummy variable than the number of sectors. We run eight linear regressions across three binary choices: capacity or log capacity; all 14 end-use sectors or the four grouped usages as in Figures 2 and 5; and the inclusion of the technology variable or not. We record the adjusted $R^2$, a measure of goodness of fit, for all eight regressions in Table 1.

Figure 5.
Logarithm of capacity against year of construction for all plants in our dataset with available data. We classify plants by both their technology as well as their usage, using the clusters of end-use sectors from Section 3 Figure 2. We can see the early dominance of blue plants (with a few early exceptions) by several orders of magnitude, but this is closing with time.

The results confirm the main finding visible in Figure 5: a clearly superior fit exists when considering the log capacity model across all other choices. This strongly suggests a rather regular exponential growth in the capacity of plants over time, including when separating across usage sectors. That is, we have used goodness-of-fit measures to reveal that the increase in capacity over time is better represented as exponential than linear, and rather well represented as exponential in its own right. Viewing Figure 5 across the usage groups tells a promising story. With the exception of the industrial sectors, which saw a relatively early onset of high-capacity plants (though few in number), the other groups are growing at a commensurate rate, as can be seen by the relative collinearity of domestic heat/mobility/grid injection, CHP/CH4, and power. We can also see the welcome closing of the gap between green and blue/fossil hydrogen plants' capacities.
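The comparison between the capacity and log capacity models can be illustrated with a simplified sketch that regresses on the year alone and compares the adjusted $R^2$ of the two fits. The dummy variables for end-use sector and technology are deliberately omitted here, and the data are synthetic, so this is an illustration of the technique under those assumptions and not a reproduction of the reported regressions.

```python
import math


def fit_line(t, y):
    """Ordinary least squares for y = b0 + b1 * t; returns (b0, b1, adjusted R^2)."""
    n = len(t)
    t_mean, y_mean = sum(t) / n, sum(y) / n
    sxx = sum((ti - t_mean) ** 2 for ti in t)
    sxy = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))
    b1 = sxy / sxx
    b0 = y_mean - b1 * t_mean
    ss_res = sum((yi - (b0 + b1 * ti)) ** 2 for ti, yi in zip(t, y))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    r2 = 1 - ss_res / ss_tot
    k = 1  # number of predictors (year only in this simplified sketch)
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    return b0, b1, adj_r2


# Synthetic capacities growing roughly exponentially (hypothetical, not IEA data).
years = list(range(2000, 2031, 3))
noise = [0.2, -0.1, 0.15, -0.2, 0.05, 0.1, -0.15, 0.2, -0.05, 0.0, -0.1]
capacity = [math.exp(0.3 * (yr - 2000) + e) for yr, e in zip(years, noise)]

_, _, adj_linear = fit_line(years, capacity)
_, slope, adj_log = fit_line(years, [math.log(c) for c in capacity])
print(f"adjusted R^2, capacity model:     {adj_linear:.3f}")
print(f"adjusted R^2, log capacity model: {adj_log:.3f}")
print(f"estimated growth rate per year:   {slope:.3f}")
```

For data of this kind, the log capacity model attains a markedly higher adjusted $R^2$, and its slope estimates the exponential growth rate. A fuller version would add the sector and technology dummy variables as extra columns of the design matrix.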

Conclusion
This paper studies the capacity and end-use sectors of low-carbon hydrogen technologies over time, between green (zero-carbon) or blue/fossil (captured and stored carbon) technologies, and differences across the world. Combining the findings from investigating the distinct phenomena explored in this paper provides a unique, holistic overview regarding the state of low-carbon hydrogen projects globally. We are unaware of any similar work in the application field of low-carbon hydrogen.
In Section 3, we investigate the distributions of new hydrogen plants over time. Our results show a welcome growth in new and planned plants in recent years and the 2020's, consistent across sectors. By analysing the discrepancy between temporal cumulative distribution functions, we reveal interesting structural similarity, including a strong cluster of industrial applications, where the number of plants is presently increasing significantly, and a smaller cluster of domestic applications. Some outlier end-use sectors are also revealed, particularly power, which has been much more consistent and uniform in its increase in new plants over time, heralding a more reliable and proven usage sector. Simply put, we classify end-use sectors as under uniform growth (power), concave-up growth (domestic applications), and explosive growth (industrial applications).
In Section 4, we study usage distributions of end-use sector across continental and technological groupings of hydrogen plants, both across the entire period of analysis and on a decade-by-decade basis. We reveal additional structural similarity, including the greatest diversity of usages among green power plants of the four most economically advanced continents, a welcome sign of the widespread utility and diverse applications of hydrogen energy. Interestingly, single key end-use sectors can almost entirely explain the broad-level cluster structure, such as ammonia across the entire period and the 2020-2029 decade, power in the 2000-2009 decade, and mobility and power in the 2010-2019 decade.
Finally, in Section 5, we demonstrate the changing relationship between green and blue hydrogen project capacity across a range of sectors (and the groupings determined by Section 3). We observe both graphically and analytically an exponential growth in the capacity of green hydrogen projects across all sectors, both grouped and ungrouped. This is an excellent sign for the future of this renewable fuel and suggests that investing in green hydrogen projects will likely provide returns in the coming years, regardless of end-use sector. Indeed, the diversity of the end-use applications, especially with exponentially increasing capacity of plants, means that the market for hydrogen technology is both growing and inherently diversified. This reflects well on the potential returns for governments and private entities investing in this technology, and suggests less risk than a technology with more limited uses.
The future development of hydrogen plants must consider safety, profitability, and widespread utility of hydrogen production, all to further the key aim of decarbonisation. While the blue hydrogen plants discussed in this manuscript attempt to sequester their carbon emissions, the use of oil, coal, or natural gas in the process carries a significant risk of harmful byproducts, and the capture of emissions is imperfect [86]. Thus, we hope to see growth in the number of green hydrogen plants across all end-use sectors, providing useful benefits to a diverse range of societal sectors, and associated increases in their capacity. Monitoring these trends may be crucial to the future of decarbonisation for many industries as diverse societies drive towards a sustainable existence.
Author Contributions: Both authors contributed equally in every aspect of the paper. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.

Data Availability Statement:
The data analysed in this article are publicly available at [80]. A cached copy is available at https://github.com/MaxMenzies/HydrogenData (accessed on 26 November 2022).

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Probability distribution distance
In this appendix, we give a technical explanation of why our distance (Equation (2)) can be interpreted as perhaps the most appropriate distance between two distributions on a discrete set. Let $(X, d)$ be a metric space, $\mu, \nu$ two probability measures on $X$, and $q \geq 1$ a real number. The Wasserstein metric between distributions $\mu, \nu$ is defined as
$$W_q(\mu, \nu) = \inf_{\gamma} \left( \int_{X \times X} d(x, y)^q \, d\gamma(x, y) \right)^{1/q}, \tag{A1}$$
where the infimum is taken over all probability measures $\gamma$ on $X \times X$ with marginal distributions $\mu$ and $\nu$, respectively. From here, let $q = 1$. By the Kantorovich-Rubinstein formula [87], there is an alternative formulation of Equation (A1) when $X$ is compact (for example, finite):
$$W_1(\mu, \nu) = \sup_{F} \left( \int_X F \, d\mu - \int_X F \, d\nu \right), \tag{A2}$$
where the supremum is taken over all 1-Lipschitz functions $F : X \to \mathbb{R}$, meaning $|F(x) - F(y)| \leq d(x, y)$ for all $x, y$.
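Proposition A1. Let $(X, d)$ be a finite set with the discrete metric, with $d(x, y) = 1$ for $x \neq y$ and 0 otherwise. Let $W_1(\mu, \nu)$ be the $L^1$-Wasserstein metric between two probability measures $\mu, \nu$ on $X$, as expressed in Equation (A2). That is,
$$W_1(\mu, \nu) = \sup_{F} \left( \int_X F \, d\mu - \int_X F \, d\nu \right). \tag{A3}$$
Associate to $\mu$ and $\nu$ corresponding distribution functions or probability vectors $f$ and $g$, respectively. Then, the supremum written above is optimised by the following choice of $F$:
$$F(x) = \begin{cases} 1, & f(x) \geq g(x), \\ 0, & f(x) < g(x). \end{cases} \tag{A4}$$
Thus, $W_1(f, g)$ reduces to the same form of Equation (2), namely
$$W_1(f, g) = \frac{1}{2} \|f - g\|_1. \tag{A5}$$
Proof. Let $F$ be an arbitrary 1-Lipschitz function on $X$ with its discrete metric. Thus, $F : X \to \mathbb{R}$ and $|F(x) - F(y)| \leq d(x, y) \leq 1$ for all $x, y$. We define $M$ and $m$ by $M = \sup_{x \in X} F(x)$ and $m = \inf_{y \in X} F(y)$. By taking the supremum over elements $x$ and the infimum over $y$, the Lipschitz condition implies that $M - m \leq 1$. So
\begin{align}
\int_X F \, d\mu - \int_X F \, d\nu &= \sum_{x \in X} F(x) (f(x) - g(x)) \notag \\
&= \sum_{x \in X} (F(x) - m)(f(x) - g(x)) + m \sum_{x \in X} (f(x) - g(x)) \tag{A9} \\
&= \sum_{x \in X} (F(x) - m)(f(x) - g(x)), \notag
\end{align}
using the fact that $\sum_x f(x) = \sum_x g(x) = 1$ to eliminate the second summand of Equation (A9). Next, we set
$$P = \sum_{x : f(x) \geq g(x)} (f(x) - g(x)), \qquad N = \sum_{x : f(x) < g(x)} (g(x) - f(x)).$$
So $P - N = \sum_{x \in X} (f(x) - g(x)) = 0$, whereas $P + N = \sum_{x \in X} |f(x) - g(x)| = \|f - g\|_1$. It follows that $P = N = \frac{1}{2} \|f - g\|_1$. Moreover, since $0 \leq F(x) - m \leq M - m \leq 1$ for all $x$, the final sum in the display above is at most $P$, and so $\int_X F \, d\mu - \int_X F \, d\nu \leq P$. Taking the supremum over $F$, we determine $W_1(f, g) \leq P$. Finally, let $F$ be the function defined in the proposition statement (A4). Then, $\int_X F \, d\mu - \int_X F \, d\nu$ coincides with $P$ by definition. Thus, the supremal value $W_1(f, g)$ coincides exactly with $P$, and $P = \frac{1}{2} \|f - g\|_1$, as required.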
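As a small numerical sanity check of the proposition, the sketch below maximises $\sum_x F(x)(f(x) - g(x))$ over all 0/1-valued functions $F$ on a four-point set, each of which is 1-Lipschitz for the discrete metric, and compares the result with $\frac{1}{2}\|f - g\|_1$. The probability vectors and function names are hypothetical, and restricting to 0/1-valued $F$ is a simplification justified by the proof above rather than a general-purpose Wasserstein solver.

```python
from itertools import product


def half_l1(f, g):
    """Half the L^1 distance between two probability vectors."""
    return 0.5 * sum(abs(a - b) for a, b in zip(f, g))


def dual_value_over_indicators(f, g):
    """Maximise sum_x F(x) * (f(x) - g(x)) over all 0/1-valued functions F on X."""
    diffs = [a - b for a, b in zip(f, g)]
    return max(
        sum(value * d for value, d in zip(candidate, diffs))
        for candidate in product((0, 1), repeat=len(diffs))
    )


# Hypothetical probability vectors on a four-point set (illustrative only).
f = [0.5, 0.25, 0.25, 0.0]
g = [0.0, 0.25, 0.25, 0.5]
dual = dual_value_over_indicators(f, g)
print(dual, half_l1(f, g))
```

The maximum is attained by the indicator of the points where $f(x) \geq g(x)$, matching the choice of $F$ in the proposition.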