Article

Application of Machine Learning and Data Augmentation Algorithms in the Discovery of Metal Hydrides for Hydrogen Storage

1 Department of Chemistry, NIS and INSTM, University of Turin, Via Pietro Giuria 7, 10125 Turin, Italy
2 Sandia National Laboratories, Livermore, CA 94551, USA
* Author to whom correspondence should be addressed.
Metals 2025, 15(11), 1221; https://doi.org/10.3390/met15111221
Submission received: 20 August 2025 / Revised: 17 October 2025 / Accepted: 19 October 2025 / Published: 4 November 2025
(This article belongs to the Section Metallic Functional Materials)

Abstract

The development of efficient and sustainable hydrogen storage materials is a key challenge for realizing hydrogen as a clean and flexible energy carrier. Among various options, metal hydrides offer high volumetric storage density and operational safety, yet their application is limited by thermodynamic, kinetic, and compositional constraints. In this work, we investigate the potential of machine learning (ML) to predict key thermodynamic properties—equilibrium plateau pressure, enthalpy, and entropy of hydride formation—based solely on alloy composition using Magpie-generated descriptors. We significantly expand an existing experimental dataset from ~400 to 806 entries and assess the impact of dataset size and data augmentation, using the PADRE algorithm, on model performance. Models including Support Vector Machines and Gradient Boosted Random Forests were trained and optimized via grid search and cross-validation. Results show a marked improvement in predictive accuracy with increased dataset size, while data augmentation benefits are limited to smaller datasets and do not improve accuracy in underrepresented pressure regimes. Furthermore, clustering and cross-validation analyses highlight the limited generalizability of models across different material classes, though high accuracy is achieved when training and testing within a single hydride family (e.g., AB2). The study demonstrates the viability and limitations of ML for accelerating hydride discovery, emphasizing the importance of dataset diversity and representation for robust property prediction.

1. Introduction

The development of sustainable and efficient hydrogen storage solutions is important for harnessing hydrogen’s potential as a bidirectional energy carrier. Renewable energy sources such as solar, wind, and hydroelectric are promising, and hydrogen can be used to store excess electricity in times of overproduction. In times of high energy demand, or in emergencies, hydrogen can react chemically or electrochemically with oxygen to produce only water while releasing substantial energy. For instance, 1 kg of hydrogen has the same energy content as 2.4 kg of methane or 2.8 kg of gasoline [1]. Various techniques, including electrolysis, biogas reforming, and photoelectrochemical processes, can be used to produce hydrogen. This versatile energy carrier can be transformed back into power and heat using fuel cells or turbines. Metal hydrides can play a significant role in storing hydrogen, offering a safer and more compact alternative to conventional gas compression and liquid storage methods [2]. These materials chemically bond with hydrogen, allowing for high volumetric storage densities and the advantage of releasing hydrogen upon demand, making them particularly suitable for applications ranging from transportation fuel to power generation [3]. However, the practical application of metal hydrides is hindered by challenges related to their thermodynamic stability, kinetics, and initial activation. Ideally, these materials would absorb and release hydrogen at pressures below 30 bar close to room temperature, exhibit rapid kinetics, minimal hysteresis, a negligible need for activation, and resistance to contaminating gases [4]. Moreover, they should avoid the use of critical raw materials and remain cost-effective. Recently, high entropy alloys have extended the range of candidate materials in the field, showing promise due to their large composition space and potential for high hydrogen storage capacities [5,6]. These materials, characterized by a mixture of multiple principal elements, offer a new avenue for creating metal hydrides with tailored properties, as the vast compositional space presents both a challenge and an opportunity [7].
Despite their promise, the search for metallic hydrides combining all the properties necessary for applications is a daunting task. In this respect, the integration of machine learning (ML) in materials science represents a transformative shift, enabling the acceleration of new materials development through data-driven insights [8,9,10,11]. The dramatic increase in its usage can be attributed to its remarkable ability to capture complex correlations in large datasets and to efficiently predict materials properties. This capability is crucial for discovering novel metal hydrides with optimized hydrogen storage properties, where co-optimizing various properties is essential, as highlighted in recent studies published on this topic [5,12,13,14,15,16,17,18,19]. The first studies showed the feasibility of an ML approach, but had limited predictive power as they used measured properties of the hydrides as predictors in training the models [12,13]. Shortly afterwards, starting from the HYDPARK database and using the Magpie software (https://wolverton.bitbucket.io/, accessed on 15 July 2025), a model trained on features obtained from the material composition made the approach fully predictive and easy to apply to estimate the equilibrium plateau pressure, enthalpy, and entropy of hydrogenation of new hydrides [6]. Some later studies focused on predicting other relevant properties (enthalpy of formation, hydrogen capacity, etc.) for AB2 hydrides [14,15,16]. A significant number of recent studies focus on high entropy alloys for hydrogen storage [5,17,18,19], using both experimental and Density Functional Theory (DFT) results to generate the datasets needed for training ML models, which can then be used to navigate this large compositional space and identify alloys with optimal properties. Using DFT-generated data can help overcome the limited availability of experimental data on metallic hydrides; however, some quantities related to the absorption/desorption process are challenging to obtain from theoretical calculations. Hence, the small size of the available datasets for hydrides, often restricted to hundreds of items when using experimental data, remains a critical issue.
In this work, we focus on the effect of increasing the dataset size on ML prediction quality. To this end, we have collected additional experimental data published in the literature and nearly doubled the size of the dataset compared to an earlier publication [20]. In parallel, we tested the benefits and limitations of a data augmentation technique (Pairwise Difference Regression, PADRE) [21] that increases the size of the dataset without using new real experimental data, but rather by recombining the original data. Our ML approach is based on predictors generated using the Magpie platform [22,23], which is structure agnostic and based only on the hydride composition. Several models were tested, using a grid search for extensive hyperparameter optimization. The models were trained to make predictions of the plateau pressure at 25 °C, as well as the enthalpy and entropy of absorption/desorption, with the goal of refining our previous results. These quantities provide a comprehensive metric of the thermodynamic stability of metallic hydrides and hence can be used to select promising candidates for real applications.

2. Methodology

2.1. Datasets

Initially, we used the same dataset (DS0) as in our earlier publication [20], containing approximately 400 entries for different hydride compositions. For each entry, experimental data for the plateau pressure at 25 °C ($P_{eq}^{0}$) and its logarithm, the enthalpy ($\Delta H_{for}$) and entropy ($\Delta S_{for}$) of hydride formation are available. These quantities also allow a complete description of the variation in plateau pressure as a function of temperature (see the first section in Supplementary Materials for further details). These entries are the result of a careful selection carried out in our previous work from over two thousand entries available in the HYDPARK database (details are reported in Supplementary Materials in Ref. [20]). This dataset was used for comparison and to test a few additional models compared to our earlier work. Gathering data from the literature [24], we built an expanded dataset (DS1) containing a total of 806 entries. Most of the added data refer to AB2 compounds, i.e., Laves phases (C14, C15, C36). In both datasets, we discarded a few entries with $\ln P_{eq}^{0} < -20$ pertaining to complex hydrides, as their number is too small for ML models to satisfactorily capture the complex correlations between predictors and target properties in this pressure range.
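The link between these three quantities is the standard van ’t Hoff relation; the form below is the textbook expression (the exact sign and unit conventions used for this dataset are detailed in the Supplementary Materials):

```latex
% Van 't Hoff relation: the plateau pressure at any temperature T follows from
% the formation enthalpy and entropy (R is the gas constant, P° = 1 bar).
\ln \frac{P_{eq}}{P^{\circ}} \;=\; \frac{\Delta H_{for}}{R\,T} \;-\; \frac{\Delta S_{for}}{R}
```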
Following our previous approach [20,25], we employed the Magpie software [22,23] to generate a machine learning (ML) dataset from the compositions of the alloys collected in the original (DS0) and expanded (DS1) datasets. This process resulted in the creation of 145 features for each entry. These features are statistics of elemental thermo-physical properties (mean, standard deviation, maximum, minimum, mode, and maximum difference) computed over the elements in the composition formula. More details on the features generated by Magpie can be found in Table S1 in Supplementary Materials and in the original documentation of the software package [22,23].
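As an illustration of this kind of composition-only featurization, the sketch below uses matminer’s “magpie” preset, which reproduces the Magpie element-property statistics in Python; this is an assumed stand-in for the original Magpie software used here, and the preset yields 132 features rather than the 145 produced in this work:

```python
# Sketch: composition-based (structure-agnostic) featurization with matminer's
# Magpie preset, an assumed equivalent of the original Magpie software.
import pandas as pd
from pymatgen.core import Composition
from matminer.featurizers.composition import ElementProperty

df = pd.DataFrame({"formula": ["TiFe", "LaNi5", "TiMn2"]})  # illustrative alloys
df["composition"] = df["formula"].apply(Composition)

featurizer = ElementProperty.from_preset("magpie")  # mean, min, max, range, mode, avg_dev
df = featurizer.featurize_dataframe(df, col_id="composition")
print(df.shape)  # one row per alloy, 132 element-property statistics as columns
```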

2.2. Machine Learning

Several ML models (or algorithms) are usually employed and tested to find the best one for a specific dataset. For this study, we experimented with various algorithms, including k-nearest neighbors (kNN), Random Forest (RF), Random Forest with Gradient Boosting (RFGB) and Support Vector Machines (SVM), as implemented in the scikit-learn library [26]. The k-nearest neighbors (kNN) algorithm predicts a data point’s label by looking at the labels of its k closest points in the training set and choosing the majority (for classification) or the average (for regression). A Random Forest algorithm is an ensemble learning method that builds many decision trees on random subsets of the data and features, then combines their outputs to make a final prediction. This reduces overfitting and improves accuracy compared to a single decision tree. While a Random Forest builds many independent decision trees on random subsets of the data and averages their predictions, Random Forest with Gradient Boosting builds trees sequentially, each one correcting the errors of the previous, making it usually more accurate but also more complex and slower to train. A Support Vector Machine (SVM) model for regression predicts continuous values by finding a line (or hyperplane) that fits the data within a specified margin of tolerance, while minimizing errors beyond that margin. These ML models are well known to perform effectively with medium-size datasets, such as DS0 and DS1. In each model, the features used were those obtained from Magpie, whereas the target quantities to predict were the experimental values of $P_{eq}^{0}$, $\Delta H_{for}$ and $\Delta S_{for}$.
Before training, each dataset was converted into a matrix using the Python library Pandas (2.0.3) [27] and randomly shuffled to mitigate potential grouping of similar data. For each ML algorithm, the best values of its hyperparameters were determined to minimize the chosen loss function (mean absolute error, MAE). This optimization was performed using the GridSearchCV function from the scikit-learn library (1.3.2) on a suitable grid of values. The algorithms were trained for each combination of hyperparameter values on the grid, and the corresponding loss function was calculated. Each training iteration utilized 10-fold cross-validation, evaluating the model 10 times with different configurations of train/test sets to obtain more robust statistics on the error. Cross-validation was also used to evaluate the final generalization error using the best values of the hyperparameters determined in the grid search. As performance metrics, we used the mean absolute error (MAE) and the coefficient of determination (R2) both during training and for the final performance evaluation of each model.
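A minimal sketch of this training loop is shown below, assuming random stand-ins for the Magpie feature matrix and the $\ln P_{eq}^{0}$ targets; the hyperparameter grids are illustrative, not the exact grids used in this work:

```python
# Sketch: grid search with 10-fold CV and MAE as the loss, as described above.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = np.random.rand(200, 145)   # stand-in for the Magpie feature matrix
y = np.random.rand(200)        # stand-in for ln Peq0 values

models = {
    "SVM": (make_pipeline(StandardScaler(), SVR()),
            {"svr__C": [1, 10, 100], "svr__gamma": ["scale", 0.01, 0.1]}),
    "RFGB": (GradientBoostingRegressor(random_state=0),
             {"n_estimators": [200, 500], "max_depth": [3, 5],
              "learning_rate": [0.05, 0.1]}),
}

for name, (estimator, grid) in models.items():
    search = GridSearchCV(estimator, grid, scoring="neg_mean_absolute_error", cv=10)
    search.fit(X, y)
    print(name, -search.best_score_, search.best_params_)  # best MAE and hyperparameters
```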
In order to check for the existence of hidden patterns, structures, or groupings in our datasets, we applied two well-known unsupervised ML algorithms, k-means++ and DBSCAN (Density-Based Spatial Clustering of Applications with Noise), as implemented in the scikit-learn library, to test both distance-based and density-based methods. k-means is a clustering algorithm that partitions data into k groups by repeatedly assigning each point to the nearest cluster center and then updating the centers as the mean of their assigned points. It aims to minimize the variance within clusters and maximize the separation between them. k-means++ uses an improved initialization method for the k-means clustering algorithm that selects the starting centroids more strategically, spreading them out to reduce the chances of poor clustering results. This leads to faster convergence and better overall cluster quality compared to random initialization. DBSCAN is a clustering algorithm that groups together points that are closely packed (high density) while marking points in low-density regions as outliers. Unlike k-means, it does not require specifying the number of clusters in advance and can find arbitrarily shaped clusters. Finally, we note that cluster analysis in machine learning is about grouping data points based on similarity to reveal structure in unlabeled data. This differs from the traditional cluster expansion in physics, which refers to a mathematical technique for approximating properties of many-particle systems by systematically expanding interactions in terms of clusters of particles.
For the first method, we determined the best number of clusters (k) by running the algorithm for different values of k (from 2 to 8) and evaluating the results based on the resulting silhouette diagrams. For DBSCAN, the optimal values of the hyperparameters “epsilon” and “min_samples” were also determined according to silhouette diagrams (see the Supplementary Materials for details, Figures S1–S3).
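A sketch of this selection procedure is given below; the hyperparameter values are illustrative assumptions, and in practice the silhouette diagrams (Figures S1–S3), rather than the bare scores, guided the final choice:

```python
# Sketch: choosing k for k-means++ via silhouette scores, plus a DBSCAN run.
import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(np.random.rand(300, 145))  # stand-in features

for k in range(2, 9):  # k from 2 to 8, as in the text
    labels = KMeans(n_clusters=k, init="k-means++", n_init=10,
                    random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))

# DBSCAN: eps and min_samples are illustrative and would be tuned the same way.
db_labels = DBSCAN(eps=5.0, min_samples=5).fit_predict(X)  # label -1 marks outliers
print("DBSCAN clusters:", len(set(db_labels)) - (1 if -1 in db_labels else 0))
```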

2.3. Data Augmentation (PADRE Algorithm)

PADRE [21] is a data augmentation algorithm that constructs all possible pairs (i, j) between entries of the training and test sets and, for each pair, treats the differences of the original features as new features. Each entry of the augmented dataset ($x_{augmented}$) is formed by concatenating the original features of the pair and their differences, as in Equation (1) [21]:
$x_{augmented} = \left[\, x_i ,\; x_j ,\; x_i - x_j \,\right]$ (1)
where $x_i$ and $x_j$ are the features of two entries of the original dataset. Furthermore, the target variable is also modified in the augmented dataset to represent the difference between the pairs. This procedure results in a significant increase in the size of the original dataset (in both the number of features and the number of entries), as detailed in Table 1.
These augmented datasets, hereafter referred to as DSA0 and DSA1, were used to train ML models to assess the effectiveness of the data augmentation procedure.
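The sketch below is a minimal reading of Equation (1), not the reference implementation of Ref. [21]: every ordered pair (i, j) becomes one augmented row whose features are the concatenation [x_i, x_j, x_i − x_j] and whose target is y_i − y_j:

```python
# Sketch: PADRE-style pairwise augmentation as in Equation (1).
import numpy as np

def padre_augment(X: np.ndarray, y: np.ndarray):
    n = X.shape[0]
    i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    i, j = i.ravel(), j.ravel()                    # all n*n ordered pairs
    X_aug = np.hstack([X[i], X[j], X[i] - X[j]])   # 3 * n_features columns
    y_aug = y[i] - y[j]                            # pairwise target differences
    return X_aug, y_aug

# float32 keeps the augmented array modest in memory for this DS0-sized stand-in.
X = np.random.rand(398, 145).astype(np.float32)
y = np.random.rand(398).astype(np.float32)
X_aug, y_aug = padre_augment(X, y)
print(X_aug.shape)  # (158404, 435), matching the DS0 row of Table 1
```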

3. Results and Discussion

3.1. Effect of Increasing the Dataset Size with New Real Data

At first, we trained a few additional models with the logarithm of the equilibrium pressure ($\ln P_{eq}^{0}$) as the target property and using the original dataset DS0, as in Ref. [20], to test possible improvements. For each model, the best combination of hyperparameters was first determined and then used with 10-fold cross-validation to obtain the final MAE. The results are reported in Tables S2–S5 in the Supplementary Materials. While the kNN model showed the worst performance (MAE = 1.99) and will not be considered further, the lowest MAE (1.35) was obtained with the SVM model. This model performed slightly better than the best result in our previous work (MAE = 1.52) [20], which was obtained with an RFGB model.
An attempt at retraining an RFGB model in this work led to a similar MAE (small differences can be attributed to slight changes in the optimal hyperparameters and to statistical fluctuations in the 10-fold cross-validation results). Despite the slight improvement obtained with the SVM model, it is difficult to achieve further significant gains in prediction quality by simply testing other ML algorithms or refining the grid search for the optimal hyperparameters.
Hence, in the following, we describe the results obtained with the same ML models when the size of the dataset is increased, both with new real data and with data augmentation techniques. First, we compared the distribution of the target pressure in the old dataset (DS0) and the new dataset (DS1), as illustrated in Figure 1. There is a substantial increase in the number of entries in DS1, particularly in the region near $\ln P_{eq}^{0} = 0$, between −5 and 5. It is also noteworthy that new data are now available in the region above $\ln P_{eq}^{0} = 5$, whereas only a few data points are added below $\ln P_{eq}^{0} = -5$.
As with DS0, we carried out hyperparameter optimization and trained different models on DS1. As before, the RFGB and SVM models show the best performance, leading to MAE values of 1.07 and 1.13, respectively. These values demonstrate a significant improvement compared with the mean errors obtained with DS0 and in our earlier work, clearly showing the benefit of increasing the dataset size. All results are shown in Table 2.
The low performance of the kNN model on both the DS0 and DS1 datasets was somewhat expected. Although the model is simple and intuitive, it can struggle with noisy or high-dimensional data; its predictions are often unstable since they depend heavily on local neighbors, and its performance usually lags behind more sophisticated models. In contrast, RF captures nonlinear patterns and interactions, is more robust to noise, and generalizes well even with modest dataset sizes like 400 or 800. Gradient boosting is particularly effective at further improving its performance. Finally, SVM can perform very well if the kernel and parameters are properly tuned, as was done here, especially on smaller datasets such as DS0 and DS1. However, it does not scale well with larger datasets or a high number of features (see the section on data augmentation).
As already pointed out in our earlier work [20], it is important to rationalize the results of ML models with feature importance analysis, which provides explainable insight into the inner workings of the models and helps understand correlations between the features and the target properties. Feature importance was evaluated for the RFGB model both with the SHAP methodology of Lundberg and Lee [28], based on Shapley values (SHAP values in Figure 2), which can be applied to all ML models, and with the native method for random forests implemented in the Scikit-learn library (Figures S4 and S5). With both methodologies, the most important descriptor turns out to be mean_GSvolume_pa ($\nu_{pa}^{Magpie}$), the average ground-state volume of the alloy, calculated from the ground-state volumes per atom of the elemental solids ($\nu_{pa,i}$) as $\nu_{pa}^{Magpie} = \sum_i f_i \nu_{pa,i}$, where $f_i$ are the atomic fractions. This confirms our previous findings (for a detailed analysis and the physical interpretation of this predictor, the reader is referred to our previous work [20]) and is not affected by the increase in size of the dataset. In addition, the other predictors (among the top ten most important ones) are also similar, though their exact order is slightly different.
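A sketch of the SHAP analysis behind Figure 2 is shown below; the model and data are placeholders standing in for the trained RFGB model and the Magpie feature matrix:

```python
# Sketch: SHAP values for a trained gradient-boosted tree model.
import numpy as np
import shap
from sklearn.ensemble import GradientBoostingRegressor

X = np.random.rand(300, 145)   # stand-in for Magpie features
y = np.random.rand(300)        # stand-in for ln Peq0
model = GradientBoostingRegressor(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)     # tree-specific, fast SHAP estimator
shap_values = explainer.shap_values(X)    # one attribution per feature per sample
shap.summary_plot(shap_values, X)         # beeswarm summary, as in Figure 2
```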

3.2. Effect of Increasing the Dataset Size with Data Augmentation

Producing and collecting new data with PCT measurements is time-consuming; hence, it is desirable to enhance model performance in other ways, if possible. As data augmentation could be helpful to this aim, we present here the results of the application of the PADRE method to our datasets. The new datasets DSA0 and DSA1 were generated by applying the PADRE method, and the RFGB model was then retrained on both of them with the same approach adopted for the original datasets (DS0 and DS1). We did not retrain the SVM model due to its poor scalability with dataset size, and we also excluded the kNN and RF models because of their modest performance in earlier training. Using 10-fold cross-validation on the test set, the average MAE obtained was 1.36 on DSA0 and 1.24 on DSA1. Comparing these results with the previous ones, we note that the PADRE method led to an improvement in the prediction quality for DS0, but not for DS1. To better understand this outcome, we systematically applied the PADRE method to datasets of different sizes, ranging from 100 to 806 entries, i.e., the size of DS1. These datasets were obtained by randomly sampling DS1; the PADRE method was then applied to each newly created dataset of size 100, 200, 300, etc., and an RFGB model was trained on each one. The results are shown in Figure 3. It is evident that models trained on datasets generated with data augmentation produce better results than those trained on the original datasets when the size of the latter is less than 500 elements. In contrast, when the original dataset is sufficiently large, data augmentation is less effective and leads to slightly worse results. Moreover, the effect of data augmentation on the MAE value appears to be limited but not negligible on small datasets. The fact that data augmentation improves the quality of prediction on small datasets more than on large ones agrees with the original findings of Ref. [21], though the crossover point (in our case, 500 elements) critically depends on the type of data and features used. The feature importance values evaluated on the augmented datasets show that the most important feature is now the difference in ground-state volumes (Figure S4 in the Supplementary Materials), which is consistent with the previous results without data augmentation.
Another interesting point to consider is the quality of the predictions across the range of target equilibrium-pressure values. We already pointed out in earlier work (Ref. [20]) that ML models do not generalize well in the wings of the $\ln P_{eq}^{0}$ distribution, i.e., at very low/high values, because of the correspondingly low number of samples in the training dataset. To investigate the effect of data augmentation on this issue, we split the data in datasets DS1 and DSA1 into bins of size 1 in $\ln P_{eq}^{0}$ (e.g., −20 to −19, −19 to −18, etc.) and calculated the MAE in each bin, with and without data augmentation, using 10-fold cross-validation (Figure 4). The red points in Figure 4 represent the MAE for each fold in a given bin (10 folds per bin; cross-validation was applied to each bin separately) when training was performed on the dataset DS1 (without data augmentation), whereas the red line represents the average MAE in each bin without augmentation. With respect to our earlier work (Ref. [20]), we can notice a certain overall reduction in MAE values, but still significantly higher values in the wings of the distribution. After applying data augmentation and training on dataset DSA1, MAE values were evaluated in the same way for each bin (gray squares and gray line). It is evident that data augmentation with PADRE does not lead to a decrease in MAE values across the distribution, apart from a few isolated bins. In some bins at the extremes of the distribution, the application of PADRE generates significantly worse predictions (i.e., higher MAE values). It is unfortunately evident that this approach to data augmentation does not improve the accuracy of predictions at high/low pressure values, where data are sparser than in the middle of the range. Hence, the benefits of data augmentation appear to be limited to small datasets and are only marginal compared to the benefits of introducing new real data in the dataset.
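The binned error analysis can be sketched as follows, with placeholder arrays standing in for the measured $\ln P_{eq}^{0}$ values and the cross-validated predictions:

```python
# Sketch: MAE computed inside unit-wide bins of the target ln Peq0.
import numpy as np
import pandas as pd

y_true = np.random.uniform(-20, 10, 800)           # stand-in for measured ln Peq0
y_pred = y_true + np.random.normal(0, 1.2, 800)    # stand-in for CV predictions

bins = np.arange(-20, 11, 1)                       # -20 to -19, -19 to -18, ...
df = pd.DataFrame({"y": y_true, "abs_err": np.abs(y_true - y_pred)})
df["bin"] = pd.cut(df["y"], bins)
print(df.groupby("bin", observed=True)["abs_err"].mean())  # per-bin MAE
```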

3.3. Clustering and Further Analysis

To gain a deeper understanding of dataset DS1, we employed two clustering algorithms, kmeans++ and DBSCAN, from the Scikit-learn library. Both were applied to the entire feature vector, and we searched for the optimal number of clusters (k) and other hyperparameters according to the silhouette score, in addition to inertia values (more details, silhouette scores, and diagrams are reported in Supplementary Materials, Figures S1 and S2). With kmeans++, the optimal number of clusters was found to be k = 4, while with DBSCAN the corresponding number was k = 8. The clusters determined according to kmeans++ are shown in different colors in Figure 5.
The clustering results are shown in Figure 5, where the target quantity $\ln P_{eq}^{0}$ is plotted against the feature $\nu_{pa}^{Magpie}$, defined as the average ground-state volume of the alloy calculated from the ground-state volumes per atom of the elemental solids ($\nu_{pa,i}$) as $\nu_{pa}^{Magpie} = \sum_i f_i \nu_{pa,i}$, where $f_i$ is the atomic fraction. This choice is motivated by our earlier and current results, which clearly indicate a correlation between these quantities. However, as Figure 5 shows, the different clusters of materials do not appear to be well separated in this plot. Similar results were obtained with DBSCAN (see Supplementary Materials for details, Figure S3). These limitations of the clusters in the $\ln P_{eq}^{0}$ vs. $\nu_{pa}^{Magpie}$ plane are similar to previous results on the DS0 dataset [20] and were already discussed there in detail. In the following, we go beyond the current clustering results to conduct a demanding test designed to assess the generalization capability of the trained ML models for hydrides under the most challenging conditions.
It is well established that it is difficult for an ML model to generalize to new instances (in the present case, new possible hydrides) that differ significantly from the data on which it was trained. To quantify this difficulty in the worst possible case, we used the kmeans++ clustering results to split the dataset DS1 into training and test sets in the following way: the first of the four clusters obtained from kmeans++ is set aside and used as a test set, while the other clusters (2, 3, 4) are used for training an RFGB model; after training, the model is evaluated by calculating the MAE and R2 values on the test set (cluster 1) [29]; this step is repeated iteratively, choosing each time a different cluster for testing and the others for training (cluster 2 for testing and clusters 1, 3, 4 for training, and so on). In this way, assuming the kmeans++ clustering has successfully split the dataset into homogeneous clusters, the test set each time contains hydrides which are somewhat different from those used in training. The results are shown in Table 3 and are rather unsatisfactory for most clusters, except when cluster 4 is used for testing and clusters 1–3 for training.
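A compact sketch of this leave-one-cluster-out evaluation is given below, with random stand-ins for the DS1 features and targets:

```python
# Sketch: hold out each kmeans++ cluster in turn, train an RFGB model on the rest.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, r2_score

X = np.random.rand(800, 145)   # stand-in for the DS1 feature matrix
y = np.random.rand(800)        # stand-in for ln Peq0
labels = KMeans(n_clusters=4, init="k-means++", n_init=10,
                random_state=0).fit_predict(X)

for c in np.unique(labels):
    train, test = labels != c, labels == c
    model = GradientBoostingRegressor(random_state=0).fit(X[train], y[train])
    pred = model.predict(X[test])
    print(f"cluster {c + 1}: MAE = {mean_absolute_error(y[test], pred):.2f}, "
          f"R2 = {r2_score(y[test], pred):.2f}")
```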
The above results are, however, dependent on how successful the applied clustering algorithms are in splitting the original dataset into homogeneous subsets. To further confirm these findings, we used a different clustering of the dataset, based on the material classes defined in the HYDPARK database, i.e., according to different types of hydrides (AB, AB2, etc.; see Supplementary Materials for more details, Figures S6 and S7). We then repeated the previous approach using these material classes as clusters; the results are reported in Table 4.
As observed, the results are generally unsatisfactory across most material classes, although for some classes the limited number of instances may significantly affect the reliability of the results. This is probably the case for A2B hydrides (only 10 instances in the class) and possibly for Mg and MIC (miscellaneous types), with 32 and 52 instances, respectively. However, it does not explain the poor results obtained when testing on AB5 hydrides or even on the largest class, AB2. In this latter case, which is the second best among all classes, the results show that training on hydrides belonging to different material classes, but not including AB2, leads to performance significantly lower than that obtained in Section 3.1. Although another factor that may affect the present results is the imbalance between the sizes of the material classes, it is clear from the results in both Table 3 and Table 4 that ML models for hydrides cannot be expected to perform well unless they have been trained on datasets containing a relevant number of data on similar hydrides.
On the contrary, models trained on similar hydrides are expected to perform well when applied to make predictions within the same material class. To verify this point, we created a new dataset DS2 containing only hydrides of AB2 type (the most numerous class), for a total of 454 instances. We then retrained both an RFGB and an SVM model using 10-fold cross-validation to evaluate the results, as in Section 3.1. The full results for each fold are shown in Table S5; they give an average MAE = 0.83 and R2 = 0.89 for RFGB, and MAE = 1.12 and R2 = 0.72 for SVM, with MAE values lower (markedly so for RFGB) than the corresponding values obtained on the whole DS1 dataset.

3.4. Validation of Model Predictions

Though the quality of the predictions based on the trained ML models is still limited at extremely low/high pressures, they can capture trends in the equilibrium plateau pressures across the composition space and can help guide the exploration of new alloy compositions. For example, Figure 6 shows the variation in the predicted plateau equilibrium pressures at room temperature starting from the binary equimolar FeTi phase and partially substituting iron or titanium with other elements (Ni, Mn, Cu, Al, V). In almost all cases, the predicted $\ln P_{eq}^{0}$ decreases, in agreement with experimental findings [24], with one notable exception when V substitutes Ti, which is also consistent with experiments.
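A sketch of this kind of screening is shown below: partially substituted TiFe compositions are featurized and passed to a trained regressor. The matminer featurizer and the `model` variable are assumptions standing in for the Magpie pipeline and a trained RFGB model, not the exact code used here:

```python
# Sketch: predict ln Peq0 along a substitution series Ti(Fe1-x Mnx).
import numpy as np
from pymatgen.core import Composition
from matminer.featurizers.composition import ElementProperty

featurizer = ElementProperty.from_preset("magpie")
xs = np.arange(0.05, 0.35, 0.05)                      # substitution fractions
formulas = [f"TiFe{1 - x:.2f}Mn{x:.2f}" for x in xs]  # Fe partially replaced by Mn
X = np.array([featurizer.featurize(Composition(f)) for f in formulas])
# ln_peq_pred = model.predict(X)  # "model" = a regressor trained as in Section 3.1
```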

4. Conclusions

This work investigates the effectiveness and versatility, as well as the limitations, of machine learning (ML) approaches in the design of metal hydrides and in predicting their equilibrium plateau pressure. Building on our previous research, we first checked for possible improvements using different models on the same dataset (DS0) and found that only limited enhancements are possible; this was the case with an SVM model not tested before. However, we found that new experimental data were considerably more effective in enhancing the performance of the trained models. Specifically, when doubling the size of the dataset (DS1), the improvement in the MAE and R2 values of the trained models was significant. Feature importance results on this larger dataset essentially confirm our previous findings.
In contrast, the tested data augmentation technique (PADRE) did not consistently enhance model performance: enhancements were evident only for small original datasets (before data augmentation is applied) and diminished as the dataset size increased. This possibly points to the fact that, when the quantity of real data is large enough, the improvement that can be obtained by these techniques is limited or null, as the model is already capable of capturing the complex correlations present in the data. Furthermore, the application of these techniques does not lead to any improvement in the quality of the predictions at pressure values which are not well represented in the dataset (the high/low pressure regimes), an improvement that would have represented a significant advantage for certain applications. The choice to test a data augmentation method was motivated by the improvements these techniques have demonstrated on other datasets, particularly in image analysis; however, our results indicate that this may not always be the case. Although other data augmentation methods might prove more successful, the present results make it clear that the effectiveness of real new data is significantly higher.
We also verified that the generalization ability of these trained ML models is largely limited to the types of hydrides that are well represented in the training set. Attempts at predicting properties for hydrides of different classes lead to poor results. Conversely, models trained on well-represented material classes such as AB2 can perform very well on similar new materials. We emphasize that this limitation is not reflected by the standard metrics commonly adopted in the ML community (e.g., MAE, R2, or similar measures), as many ML results are reported solely on the basis of these values using train/test splits or cross-validation. While such metrics are beneficial in that they enable straightforward comparisons across different published studies and provide a uniform way to approach the generalization problem in ML models, they fail to capture the full extent of a model’s generalization ability—particularly in regions of the data space that may be critical for specific applications and materials design.
Despite the above noted limitations, the framework developed here facilitates the interpretation of ML models and can support the rapid screening of new materials with the desired capacity and thermodynamic properties for specific hydrogen storage use cases [3]. Nevertheless, experimental validation remains essential.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/met15111221/s1, Table S1: Examples of Magpie generated attributes (features) and their meaning; Table S2: Mean Absolute Error (MAE) and R2 values obtained for various ML algorithms with $\ln P_{eq}^{0}$ as target using the dataset DS0; Table S3: Mean Absolute Error (MAE) and R2 values obtained for ML algorithms with $\Delta H_{for}$ as target using the dataset DS1; Table S4: Mean Absolute Error (MAE) and R2 values obtained for ML algorithms with $\Delta S_{for}$ as target using the dataset DS1; Table S5: Results of 10-fold cross-validation on DS2 using RFGB and SVM; Figure S1: Plots showing silhouette coefficient values for different clusters for the kmeans++ algorithm. Each plot is obtained by fixing the number of clusters k and then running the kmeans++ algorithm. In each diagram, each cluster has a different color, and clusters are labelled from 0 to n, where n = k − 1. The dashed vertical line represents the silhouette score for each diagram; Figure S2: Plots showing silhouette coefficients for different clusters for the DBSCAN algorithm. Each diagram has been obtained by fixing the values of the hyperparameters “epsilon” and “min_samples”, whereas the number of clusters is automatically determined by the algorithm. In each diagram, each cluster has a different color, and clusters are labelled from 0 to n, where n = k − 1. The dashed vertical line represents the silhouette score for each diagram; Figure S3: The DBSCAN clustering on dataset DS1 reported on a $\ln P_{eq}^{0}$ vs. $\nu_{pa}^{Magpie}$ plot. Points of different colors belong to different clusters. The feature $\nu_{pa}^{Magpie}$ is defined as the average ground-state volume of the alloy calculated as $\nu_{pa}^{Magpie} = \sum_i f_i \nu_{pa,i}$, where $\nu_{pa,i}$ are the ground-state volumes per atom of the elemental solids and $f_i$ are the atomic fractions. $P_{eq}^{0}$ is the equilibrium plateau pressure at 25 °C; Figure S4: Feature importance for the RFGB model trained on the DS0 (A) and DS1 (B) datasets with $\ln P_{eq}^{0}$ as target property, calculated using the Scikit-learn library; Figure S5: Distribution of entries in DS2 (only AB2 entries) with respect to $\ln P_{eq}^{0}$; Figure S6: Count plot showing the presence of a particular element in the alloys included in the dataset DS1 (count > 4); Figure S7: The number of instances in the DS1 dataset broken down by material classes as defined in the HYDPARK database.

Author Contributions

Conceptualization, G.B.; Data curation, G.B., V.S. and M.P.; Formal analysis, G.B., E.M.D., V.S. and M.P.; Funding acquisition, P.R. and M.B.; Investigation, G.B.; Methodology, G.B. and M.P.; Project administration, E.M.D.; Resources, E.M.D., P.R. and M.P.; Software, V.S. and M.P.; Supervision, E.M.D., M.B. and M.P.; Validation, E.M.D., V.S., M.B. and M.P.; Writing—original draft, G.B. and M.P.; Writing—review & editing, E.M.D., V.S., M.B., P.R. and M.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by EX-MACHINA project, MUR program “PNNR M4C2 Initiative 1.2: Young Researcher-Seal of Excellence” (CUP: D18H22002040007); Project CH4.0 under the MUR program “Dipartimenti di Eccellenza 2023-2027” (CUP: D13C22003520001); U.S. Department of Energy (DOE), Office of Energy Efficiency and Renewable Energy (EERE), Hydrogen and Fuel Cell Technologies Office through the Hydrogen Materials Advanced Research Consortium (HyMARC).

Data Availability Statement

The original contributions presented in this study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding author(s).

Acknowledgments

Spoke 7 “Materials and Molecular Sciences” of ICSC—Centro Nazionale di Ricerca in High-Performance Computing, Big Data and Quantum Computing, funded by the European Union—NextGenerationEU. We would like to thank the “Centro di Competenza sul Calcolo Scientifico” for allowing us to use the OCCAM supercomputer. Sandia National Laboratories is a multi-mission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC (NTESS), a wholly owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy’s National Nuclear Security Administration (DOE/NNSA) under contract DE-NA0003525. This written work is authored by an employee of NTESS. The employee, not NTESS, owns the right to, title of, and interest in the written work and is responsible for its contents. Any subjective views or opinions that might be expressed in the written work do not necessarily represent the views of the U.S. Government. The publisher acknowledges that the U.S. Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this written work or allow others to do so, for U.S. Government purposes.

Conflicts of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Ball, M.; Wietschel, M. The Future of Hydrogen—Opportunities and Challenges. Int. J. Hydrogen Energy 2009, 34, 615–627. [Google Scholar] [CrossRef]
  2. Züttel, A. Materials for Hydrogen Storage. Mater. Today 2003, 6, 24–33. [Google Scholar] [CrossRef]
  3. Allendorf, M.D.; Stavila, V.; Snider, J.L.; Witman, M.; Bowden, M.E.; Brooks, K.; Tran, B.L.; Autrey, T. Challenges to Developing Materials for the Transport and Storage of Hydrogen. Nat. Chem. 2022, 14, 1214–1223. [Google Scholar] [CrossRef]
  4. Hirscher, M.; Yartys, V.A.; Baricco, M.; Bellosta von Colbe, J.; Blanchard, D.; Bowman, R.C.; Broom, D.P.; Buckley, C.E.; Chang, F.; Chen, P.; et al. Materials for Hydrogen-Based Energy Storage—Past, Recent Progress and Future Outlook. J. Alloys Compd. 2020, 827, 153548. [Google Scholar] [CrossRef]
  5. Witman, M.; Ling, S.; Wadge, M.; Bouzidi, A.; Pineda-Romero, N.; Clulow, R.; Ek, G.; Chames, J.; Allendorf, E.; Agarwal, S.; et al. Towards Pareto Optimal High Entropy Hydrides via Data-Driven Materials Discovery. J. Mater. Chem. A 2023, 11, 15878–15888. [Google Scholar] [CrossRef]
  6. Witman, M.; Ek, G.; Ling, S.; Chames, J.; Agarwal, S.; Wong, J.; Allendorf, M.D.; Sahlberg, M.; Stavila, V. Data-Driven Discovery and Synthesis of High Entropy Alloy Hydrides with Targeted Thermodynamic Stability. Chem. Mater. 2021, 33, 4067–4076. [Google Scholar] [CrossRef]
  7. Marques, F.; Balcerzak, M.; Winkelmann, F.; Zepon, G.; Felderhoff, M. Review and Outlook on High-Entropy Alloys for Hydrogen Storage. Energy Environ. Sci. 2021, 14, 5191–5227. [Google Scholar] [CrossRef]
  8. Liu, X.; Zhang, J.; Pei, Z. Machine Learning for High-Entropy Alloys: Progress, Challenges and Opportunities. Prog. Mater. Sci. 2023, 131, 101018. [Google Scholar] [CrossRef]
  9. Han, G.; Sun, Y.; Feng, Y.; Lin, G.; Lu, N. Artificial Intelligence Guided Thermoelectric Materials Design and Discovery. Adv. Electron. Mater. 2023, 9, 2300042. [Google Scholar] [CrossRef]
  10. Chen, C.; Zuo, Y.; Ye, W.; Li, X.; Deng, Z.; Ong, S.P. A Critical Review of Machine Learning of Energy Materials. Adv. Energy Mater. 2020, 10, 1903242. [Google Scholar] [CrossRef]
  11. Butler, K.T.; Davies, D.W.; Cartwright, H.; Isayev, O.; Walsh, A. Machine Learning for Molecular and Materials Science. Nature 2018, 559, 547–555. [Google Scholar] [CrossRef]
  12. Rahnama, A.; Zepon, G.; Sridhar, S. Machine Learning Based Prediction of Metal Hydrides for Hydrogen Storage, Part I: Prediction of Hydrogen Weight Percent. Int. J. Hydrogen Energy 2019, 44, 7337–7344. [Google Scholar] [CrossRef]
  13. Rahnama, A.; Zepon, G.; Sridhar, S. Machine Learning Based Prediction of Metal Hydrides for Hydrogen Storage, Part II: Prediction of Material Class. Int. J. Hydrogen Energy 2019, 44, 7345–7353. [Google Scholar] [CrossRef]
  14. Suwarno, S.; Dicky, G.; Suyuthi, A.; Effendi, M.; Witantyo, W.; Noerochim, L.; Ismail, M. Machine Learning Analysis of Alloying Element Effects on Hydrogen Storage Properties of AB2 Metal Hydrides. Int. J. Hydrogen Energy 2022, 47, 11938–11947. [Google Scholar] [CrossRef]
  15. Kim, J.M.; Ha, T.; Lee, J.; Lee, Y.-S.; Shim, J.-H. Prediction of Pressure-Composition-Temperature Curves of AB2-Type Hydrogen Storage Alloys by Machine Learning. Met. Mater. Int. 2023, 29, 861–869. [Google Scholar] [CrossRef]
  16. Maghsoudy, S.; Zakerabbasi, P.; Baghban, A.; Esmaeili, A.; Habibzadeh, S. Connectionist Technique Estimates of Hydrogen Storage Capacity on Metal Hydrides Using Hybrid GAPSO-LSSVM Approach. Sci. Rep. 2024, 14, 1503. [Google Scholar] [CrossRef]
  17. Wen, C.; Zhang, Y.; Wang, C.; Xue, D.; Bai, Y.; Antonov, S.; Dai, L.; Lookman, T.; Su, Y. Machine Learning Assisted Design of High Entropy Alloys with Desired Property. Acta Mater. 2019, 170, 109–117. [Google Scholar] [CrossRef]
  18. Halpren, E.; Yao, X.; Chen, Z.W.; Singh, C.V. Machine Learning Assisted Design of BCC High Entropy Alloys for Room Temperature Hydrogen Storage. Acta Mater. 2024, 270, 119841. [Google Scholar] [CrossRef]
  19. Huang, W.; Martin, P.; Zhuang, H.L. Machine-Learning Phase Prediction of High-Entropy Alloys. Acta Mater. 2019, 169, 225–236. [Google Scholar] [CrossRef]
  20. Witman, M.; Ling, S.; Grant, D.M.; Walker, G.S.; Agarwal, S.; Stavila, V.; Allendorf, M.D. Extracting an Empirical Intermetallic Hydride Design Principle from Limited Data via Interpretable Machine Learning. J. Phys. Chem. Lett. 2020, 11, 40–47. [Google Scholar] [CrossRef]
  21. Tynes, M.; Gao, W.; Burrill, D.J.; Batista, E.R.; Perez, D.; Yang, P.; Lubbers, N. Pairwise Difference Regression: A Machine Learning Meta-Algorithm for Improved Prediction and Uncertainty Quantification in Chemical Search. J. Chem. Inf. Model. 2021, 61, 3846–3857. [Google Scholar] [CrossRef]
  22. Magpie Manual. Available online: https://wolverton.bitbucket.io/ (accessed on 15 July 2025).
  23. Ward, L.; Agrawal, A.; Choudhary, A.; Wolverton, C. A General-Purpose Machine Learning Framework for Predicting Properties of Inorganic Materials. NPJ Comput. Mater. 2016, 2, 16028. [Google Scholar] [CrossRef]
  24. Dematteis, E.M.; Berti, N.; Cuevas, F.; Latroche, M.; Baricco, M. Substitutional Effects in TiFe for Hydrogen Storage: A Comprehensive Review. Mater. Adv. 2021, 2, 2524–2560. [Google Scholar] [CrossRef]
  25. Zhou, P.; Xiao, X.; Zhu, X.; Chen, Y.; Lu, W.; Piao, M.; Cao, Z.; Lu, M.; Fang, F.; Li, Z.; et al. Machine Learning Enabled Customization of Performance-Oriented Hydrogen Storage Materials for Fuel Cell Systems. Energy Storage Mater. 2023, 63, 102964. [Google Scholar] [CrossRef]
  26. Scikit-Learn. Available online: https://scikit-learn.org/stable/ (accessed on 10 June 2025).
  27. Pandas. Available online: https://pandas.pydata.org/ (accessed on 15 July 2025).
  28. Lundberg, S.; Lee, S.-I. A Unified Approach to Interpreting Model Predictions. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar] [CrossRef]
  29. Meredig, B.; Antono, E.; Church, C.; Hutchinson, M.; Ling, J.; Paradiso, S.; Blaiszik, B.; Foster, I.; Gibbons, B.; Hattrick-Simpers, J.; et al. Can Machine Learning Identify the next High-Temperature Superconductor? Examining Extrapolation Performance for Materials Discovery. Mol. Syst. Des. Eng. 2018, 3, 819–825. [Google Scholar] [CrossRef]
Figure 1. Distribution of entries in DS0 and DS1 with respect to $\ln P_{eq}^{0}$.
Figure 2. SHAP values for $\ln P_{eq}^{0}$ predictions on the DS1 dataset for the RFGB model. Note that for Magpie features (adapted from Refs. [20,22]) not explicitly defined here, a detailed explanation can be found in the references.
Figure 3. Effect of dataset size on the results obtained with data augmentation. The red points are MAE values obtained by training RFGB models on the target $\ln P_{eq}^{0}$ without data augmentation on datasets of different sizes, from 100 elements up to the full DS1. The blue points are MAE values obtained by training RFGB models on the same target with data augmentation applied to datasets of different sizes, from 100 elements up to the full DS1. The red and blue shaded areas represent the variance of the MAE obtained from 10-fold cross-validation.
Figure 4. Distribution of entries in DS1 with respect to $\ln P_{eq}^{0}$ (blue histogram) overlaid with the error analysis in each bin using 10-fold cross-validation. The red and gray points/squares represent the MAE for each cross-validation fold obtained with dataset DS1 (no augmentation) and DSA1 (with augmentation), respectively; the red and gray lines show the average MAE over the 10 cross-validation folds in each bin, without and with data augmentation, respectively.
Figure 5. The kmeans++ clustering on dataset DS1 reported on a $\ln P_{eq}^{0}$ vs. $\nu_{pa}^{Magpie}$ plot. Points of different colors belong to different clusters. The feature $\nu_{pa}^{Magpie}$ is defined as the average ground-state volume of the alloy calculated as $\nu_{pa}^{Magpie} = \sum_i f_i \nu_{pa,i}$, where $\nu_{pa,i}$ are the ground-state volumes per atom of the elemental solids and $f_i$ are the atomic fractions. $P_{eq}^{0}$ is the equilibrium plateau pressure at 25 °C.
Figure 6. Predicted variation of $\ln P_{eq}^{0}$ at 25 °C for different amounts and types of substituting elements (Mn, Ni, Cu, Al, V). Predictions for vanadium are shown both when it substitutes titanium (purple line) and when it substitutes iron (brown line); in all other cases, iron atoms are substituted by X atoms.
Table 1. Variation in the dimensions of DS0 and DS1 before and after PADRE.

Dataset   Before PADRE               After PADRE
DS0       398 items × 145 features   158,404 items × 435 features
DS1       806 items × 145 features   649,636 items × 435 features
Table 2. Mean Absolute Error (MAE) and R2 values obtained for ML algorithms with $\ln P_{eq}^{0}$ as target using the dataset DS1.

Model   MAE    R2
SVM     1.13   0.82
RFGB    1.07   0.84
RF      1.47   0.74
kNN     1.82   0.64
Table 3. MAE and R2 values obtained on the cluster reported in column 1 (set aside as test set) when training an RFGB model on all the other clusters obtained from kmeans++.

kmeans++ Cluster Used as Test Set   Number of Instances   MAE    R2
1                                   531                   5.97   −2.68
2                                   43                    2.92   0.10
3                                   69                    3.26   0.01
4                                   176                   1.25   0.82
Table 4. Mean Absolute Error (MAE) and R2 values obtained on the cluster (material class) reported in column 1 (set aside as test set) when training an RFGB model on all the other clusters (material classes) as defined in the HYDPARK database. Mg refers to Mg-based hydrides, SS refers to solid solutions and MIC refers to intermetallic compounds not included in previous clusters (other classes are self-explanatory).

Material Class   Quantity of Data   MAE    R2
A2B              10                 1.65   −4.06
AB               78                 3.90   0.10
AB2              454                2.10   0.37
AB5              106                1.80   −0.50
Mg               32                 2.68   −0.26
MIC              52                 3.74   −0.41
SS               85                 2.25   0.54
