Article

A Machine Learning Approach to Determine the Band Gap Energy of High-Entropy Oxides Using UV-Vis Spectroscopy

1 Direccion Academica, Campus La Paz, Universidad Nacional de Colombia sede de La Paz, Km 9 via Valledupar, La Paz 202017, Cesar, Colombia
2 Institute of Nanotechnology, Karlsruhe Institute of Technology, Kaiserstraße 12, 76131 Karlsruhe, Germany
3 College of Engineering, Material Science & Engineering, University of Arizona, 1235 E. James E. Rogers Way, Tucson, AZ 85721, USA
4 Mechanical and Manufacturing Engineering Technology, Rochester Institute of Technology, Rochester, NY 14623, USA
* Authors to whom correspondence should be addressed.
Eng 2025, 6(12), 340; https://doi.org/10.3390/eng6120340 (registering DOI)
Submission received: 6 October 2025 / Revised: 16 November 2025 / Accepted: 21 November 2025 / Published: 1 December 2025

Abstract

This study introduces a machine learning-based framework for the automated determination of band gap energies in high-entropy oxides (HEOs) using UV-Vis spectroscopy data. Traditionally, band gap energies are obtained from the Tauc plots by manually extrapolating the linear region to the photon energy axis, a process that is time consuming and prone to human error, particularly when dealing with large datasets. To overcome these limitations, we developed a Python-based workflow that automates the band gap evaluation process through key steps, including data preprocessing, data augmentation, hyperparameter tuning, and band gap energy prediction. Various machine learning algorithms were employed to model the relationships between UV-Vis spectra and band gap energies, resulting in significant improvements in both accuracy and efficiency. Among the tested models, Bagging, Extra Trees, and Random Forest exhibited the best predictive performance, achieving mean absolute errors (MAE) as low as 0.26–0.28 eV and coefficients of determination (R2) of 0.73–0.74, substantially outperforming conventional automated methods. Although data augmentation and hyperparameter optimization yielded only modest performance gains, they contributed to improved model robustness. Overall, the proposed ML framework provides a scalable and efficient approach for the rapid characterization of HEOs, minimizing the need for manual analysis and accelerating data-driven materials discovery.

1. Introduction

Single-phase multicomponent alloys containing several elements in near-equiatomic proportions have attracted extensive research interest since 2004, when Yeh et al. and Cantor et al. independently introduced the concepts of high-entropy alloys and multicomponent alloys, respectively [1,2,3]. Since then, the family of multicomponent materials has expanded beyond metallic alloys to oxides, borides, carbides, and many others [4,5,6]. Considerable efforts have been made to categorize these materials under a unified conceptual framework, leading to different terminologies such as complex concentrated alloys [3,7], multi-principal-element alloys [8], high-entropy materials [9,10,11], and multicomponent alloys [12,13,14]. The rapid development of these materials has revealed a vast chemical space that requires systematic exploration. Optimizing material properties for specific applications often demands both structural and compositional modifications, which exponentially increase the number of synthesis and characterization experiments required [15,16,17]. Consequently, research efforts are increasingly focused on combinatorial approaches, high-throughput techniques [14,18,19], and computational modeling and design [20,21,22,23] to efficiently explore the vast compositional space of these materials [24,25,26,27,28,29,30].
High-entropy oxides (HEOs), first introduced by Rost et al., represent a notable extension of this multicomponent materials class [4]. Since 2016, numerous studies have reported advances in various synthesis methods [31,32,33], structural characterizations [34,35,36,37], and potential applications of HEOs [38,39,40]. Deviations from near-equiatomic compositions have motivated combinatorial studies, for which high-throughput synthesis and characterization techniques have become key enablers to navigate the complex compositional space. UV-Vis spectroscopy, a widely used technique to determine the optical band gap, plays an important role in HEO studies [18,29]. Diffuse reflectance spectroscopy (DRS) combined with UV-Vis spectrophotometry provides valuable insight into the optical properties of HEOs, but the analysis of large datasets demands automation. The interpretation of DRS data typically relies on the Kubelka–Munk (K–M) model, which allows the estimation of the optical band gap (Eg) from transformed reflectance spectra. As outlined in previous studies, the method requires both a clear understanding of the model and careful sample preparation [41,42]. Conventionally, Eg is obtained by plotting the spectra transformed by the K–M model and extrapolating the linear portion of the curve to the photon energy axis, a procedure that is manageable for a few samples but becomes impractical for large high-throughput (HT) datasets. A detailed explanation is offered in the methodology section [42,43,44].
In recent years, machine learning has emerged as a powerful tool for understanding complex input–output relationships, making it particularly suitable for automating material characterization tasks [24,45]. Machine learning algorithms can extract intricate correlations between input features, such as spectral data, and output properties [46,47]. Therefore, machine learning approaches can be effectively applied to determining key parameters such as band gap energies. For instance, in organic materials, machine learning models have achieved mean absolute error (MAE) values as low as 0.26 eV for band gap prediction [48]. By training on large datasets containing the UV-Vis spectra and corresponding band gap values, machine learning algorithms can capture the underlying patterns and correlations governing the spectroscopic responses. Once trained, such models can be deployed to predict the band gap energies of unseen materials, offering a streamlined, data-driven approach to optical material characterization [48].
This work presents a machine-learning framework for predicting the band gap energies of HEOs using UV-Vis spectroscopy data. The model was trained on datasets obtained from previous high-throughput experiments [18,29] and benchmarked against conventional analytical methods. The results demonstrate significant improvements in prediction accuracy and computational efficiency, providing a scalable approach for rapid HEO characterization. To the best of our knowledge, this is the first ML-based framework for band gap extraction specifically targeting inorganic high-entropy materials, whose vast compositional complexity poses unique analytical challenges. The proposed approach thus represents a key step toward accelerating the analysis of band gap spectra in HT studies and advancing data-driven materials discovery.

2. Materials and Methods

The methodology employed in this study is based on datasets obtained from previous publications by our research group [18,29]. These datasets were selected to ensure comparable synthesis conditions achieved through the HT experiments and to provide sufficient chemical diversity arising from the combinatorial synthesis of multiple elements. The overall experimental workflow comprises three main stages: data preprocessing, data augmentation, and machine learning-based prediction of band gap energy from UV-Vis spectroscopy data. In the preprocessing stage, raw spectral data were cleaned, normalized, and transformed to remove noise and prepare the input features for ML modeling. Subsequently, synthetic data were generated through data augmentation to improve model generalization and robustness. Finally, various ML algorithms were applied to establish accurate relationships between the processed UV-Vis spectra and the corresponding band gap energies of HEOs.

2.1. Spectra Dataset, Kubelka-Munk Transformation, and Band Gap Calculation

The band gap energies of the UV-Vis spectra datasets were estimated using the K–M transformation [18,29] in combination with the Tauc method [41,42]. The relationship between the transformed reflectance and the photon energy is expressed as follows:

(F(R)·hν)^(1/γ) = C(hν − Eg)

where F(R) = K/S = (1 − R)²/(2R) is the reemission function, R is the reflectance of the sample, and K and S are the absorption and scattering coefficients of the K–M model; h is the Planck constant, ν is the photon frequency, Eg is the band gap energy, and C is a proportionality constant. The exponent γ depends on the type of electronic transition (γ = 1/2 for direct and γ = 2 for indirect band gaps).
An example of estimating the direct and indirect band gaps using the K–M transformation is shown in Figure 1. The UV-Vis spectrum of a CeO2 powder sample [18] is shown in Figure 1a, while Figure 1b,c show the corresponding Tauc plots for the direct and indirect transitions, respectively. Linear fitting was performed to determine the band gap energies using the GapExtractor v1.0 software [43], yielding band gap energies of 3.38 eV (direct) and 3.08 eV (indirect). Although effective, this semi-automated approach requires manual parameter tuning, which becomes increasingly time-consuming and impractical for high-throughput datasets containing hundreds of spectra.
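As a minimal sketch (not the GapExtractor implementation), the K–M transformation and the Tauc extrapolation can be expressed in a few lines of NumPy; the automatic fit-window heuristic below is an illustrative choice, standing in for the window that is normally tuned manually:

```python
import numpy as np

def kubelka_munk(R):
    """Reemission function F(R) = (1 - R)^2 / (2R) from diffuse reflectance."""
    R = np.asarray(R, dtype=float)
    return (1.0 - R) ** 2 / (2.0 * R)

def tauc_band_gap(energy_eV, reflectance, gamma=0.5, fit_window=None):
    """Estimate Eg by fitting a line to (F(R)*h*nu)^(1/gamma) over fit_window
    (indices near the absorption edge) and extrapolating it to y = 0.
    gamma = 1/2 for direct transitions, 2 for indirect ones."""
    y = (kubelka_munk(reflectance) * energy_eV) ** (1.0 / gamma)
    if fit_window is None:
        # illustrative heuristic: fit around the steepest part of the curve
        i = int(np.argmax(np.gradient(y, energy_eV)))
        fit_window = slice(max(i - 5, 0), i + 6)
    slope, intercept = np.polyfit(energy_eV[fit_window], y[fit_window], 1)
    return -intercept / slope  # x-axis intercept gives Eg
```

In the semi-automated workflow, the fit window is one of the parameters tuned by hand for each spectrum, which is precisely the bottleneck the ML framework is designed to remove.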

2.2. Data Pre-Processing

Each sample in the original dataset was measured five times, yielding nearly identical spectra per sample and resulting in a total of 3090 individual measurements. To prevent overfitting and ensure that the machine learning algorithms developed in this study generalize effectively, the original dataset was reduced from 3090 to 618 spectra by retaining only the first measurement for each sample (each spectrum contains 901 data points). This selection effectively eliminates redundant and duplicated information. The preprocessing procedure, described in detail in the Supplementary Information, included cleaning, normalization, and formatting of the spectral data. The resulting 617 processed samples were grouped into nine identifiers (IDs) (see Table S1 in the Supplementary Materials for a detailed description). Representative spectra for each ID are shown in Figure 2.
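The first-measurement selection described above can be sketched as follows; the array layout and identifier format are assumptions for illustration:

```python
import numpy as np

def keep_first_measurement(spectra, sample_ids):
    """Given stacked spectra (n_measurements x 901) and a parallel array of
    sample identifiers, keep only the first measurement of each sample.
    The original row ordering of the retained spectra is preserved."""
    sample_ids = np.asarray(sample_ids)
    _, first_idx = np.unique(sample_ids, return_index=True)
    first_idx.sort()  # restore order of first appearance
    return spectra[first_idx]
```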

2.3. Machine Learning Algorithms

Jupyter Notebooks 7.3.0 served as the main development environment for this study, and the complete source code is available in an open-access GitHub repository (https://github.com/jphoyos/Bandgap-predictor, accessed on 14 July 2025). Model training and evaluation were conducted on a workstation equipped with a 13th-generation Intel Core i5 CPU, a GeForce RTX 3060 graphics card (NVIDIA, China), and 16 GB of DDR4 RAM (Crucial). All scripts were implemented in Python 3 [49] using Scikit-Learn 1.4.2 [50] as the primary machine learning library.
In this study, machine learning models were trained to approximate a function f such that y = f(x), where x ∈ ℝ901 represents the 901-dimensional Tauc spectrum of a sample, and y ∈ ℝ is the corresponding band gap energy (in eV). This input-output structure was maintained across all evaluated models.
A total of sixteen regression algorithms were evaluated to identify the most effective predictors [51]. The evaluated algorithms included both ensemble-based and classical regression approaches employing diverse learning strategies. Ensemble methods enhance predictive performance by combining multiple weak learners [52]. Bagging-based algorithms, such as Random Forest and Extra-Trees, aggregate predictions through averaging, while boosting-based methods like AdaBoost, Gradient Boosting, LightGBM, and XGBoost sequentially adjust the weights of weak learners to minimize the residual error in each iteration [53,54]. Linear models such as Linear Regression, Lasso, and ElasticNet apply regularization to prevent overfitting [55], while RANSAC is a robust regression technique capable of handling outliers [56]. Instance-based approaches such as k-Nearest Neighbors rely on proximity in feature space, whereas probabilistic linear models such as ARD regression and Bayesian ridge introduce Bayesian inference to manage parameter uncertainty [57]. Finally, the Multilayer Perceptron (MLP) is a feedforward neural network capable of capturing nonlinear relationships between the input features and the target [55,57]. To further improve predictive performance, hyperparameters were optimized using a genetic search algorithm [58]. This method simulates the process of natural evolution, in which a population of candidate hyperparameter sets is iteratively improved using selection, crossover, and mutation operators [58,59]. The performance of each candidate is evaluated through cross-validation, and the best-performing combinations are retained and evolved until an optimal set of hyperparameters is found [58,59].
Table 1 summarizes the most relevant hyperparameters identified through the genetic search for the best-performing models. All experiments employed a five-fold cross-validation scheme and a random seed of 123 to ensure reproducibility.
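A minimal sketch of the shared training setup, using an Extra-Trees regressor, five-fold cross-validation, and the fixed seed of 123; the random stand-in data below merely mimics the 901-point spectra and eV-valued targets:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import KFold, cross_val_score

# Stand-in data: in the study, X holds the 901-point Tauc spectra and
# y the reference band gap energies in eV.
rng = np.random.default_rng(123)
X = rng.random((60, 901))
y = 1.5 + 4.0 * rng.random(60)

model = ExtraTreesRegressor(n_estimators=100, random_state=123)
cv = KFold(n_splits=5, shuffle=True, random_state=123)
scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_absolute_error")
mae_cv = -scores.mean()  # cross-validated MAE in eV
```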

2.4. Data Augmentation Methodology

The preprocessed dataset was randomized and split into two subsets: 80% of the data (selected randomly) were used for algorithm training, and the remaining 20% were reserved for testing (Test-DS). An analysis of the band gap distribution in the training dataset revealed concentrations in the intervals [2, 2.25] eV and [3.25, 3.5] eV, as shown in the histogram in Figure 3a. To enhance the machine learning models' ability to generalize and achieve optimal performance, two data augmentation methods were applied.
Method 1: In this approach, the transformed curves from the initial training dataset were shifted within the band gap range of [1.5, 5.5] eV to generate new synthetic samples. The spectra were shifted along the x-axis with a step size of ~0.02 eV to simulate variations in photon energy measurements. These shifts produced new synthetic spectra that increase data diversity. It is important to note that each synthetic spectrum lacks direct physical meaning, i.e., its features do not correspond to real band gap measurements. An Autoregressive Integrated Moving Average (ARIMA) model with parameters p = 50, d = 1, and q = 25 was used to interpolate missing points in the spectral curves [60]. The parameter p (the autoregressive, or AR, term) captures the relationship between an observation and its previous values (lags), d refers to the degree of differencing applied to achieve stationarity, and q indicates the number of moving-average (MA) terms, accounting for dependencies between observations and past forecast errors. The number of autoregressive terms (p) was determined by evaluating the autocorrelation function and averaging the lags for which the autocorrelation coefficient exceeded 0.75. By shifting the spectra in steps of 0.3 eV, a total of 70,631 new synthetic spectral curves were generated. The resulting band gap histogram for the first augmentation method is shown in Figure 3b.
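A simplified sketch of the shifting step in Method 1; plain linear interpolation stands in here for the ARIMA(50, 1, 25) extrapolation used in the study:

```python
import numpy as np

def shift_spectrum(energy_eV, curve, delta_eV):
    """Shift a Tauc-transformed curve along the photon-energy axis by delta_eV
    and re-sample it on the original grid. Points shifted outside the measured
    range are filled with the edge values; the study instead used an
    ARIMA(50, 1, 25) model to reconstruct the missing portion."""
    return np.interp(energy_eV, energy_eV + delta_eV, curve)

def augment_by_shifting(energy_eV, curves, shifts_eV):
    """Generate one synthetic spectrum per (curve, shift) pair."""
    return np.array([shift_spectrum(energy_eV, c, s)
                     for c in curves for s in shifts_eV])
```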
Method 2: This method addressed the imbalance in the dataset, where certain band gap values were overrepresented. A Synthetic Minority Over-Sampling Technique for Regression with Gaussian Noise (SMOGN) [61] was applied using a rarity threshold of 0.8 to generate additional samples in the underrepresented regions. SMOGN combines oversampling of rare target values with the addition of Gaussian noise to create synthetic samples that follow the local data structure, thereby improving the representation of extreme or infrequent band gap values [61]. This approach yielded 222 new synthetic samples, as shown in the histogram in Figure 3c.
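The core idea behind the noise branch of SMOGN can be sketched as below; this is not the full SMOGN algorithm (which also interpolates between neighbours and derives rarity from a relevance function), and the rarity mask and noise scale are illustrative assumptions:

```python
import numpy as np

def gaussian_oversample(X, y, rare_mask, n_new, noise_scale=0.01, seed=123):
    """Oversample rare samples by adding Gaussian noise scaled to each
    feature's standard deviation, in the spirit of SMOGN's noise branch."""
    rng = np.random.default_rng(seed)
    X_rare, y_rare = X[rare_mask], y[rare_mask]
    idx = rng.integers(0, len(X_rare), size=n_new)
    noise = rng.normal(0.0, noise_scale * X.std(axis=0),
                       size=(n_new, X.shape[1]))
    return X_rare[idx] + noise, y_rare[idx]
```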
As a result of data augmentation, three training datasets were generated: (i) DS1, which corresponds to the training dataset without any augmentation (Figure 3a); (ii) DS2, which corresponds to the training dataset augmented using Method 1 (Figure 3b); and (iii) DS3, which corresponds to the training dataset augmented using Method 2 (Figure 3c). Each algorithm was trained on all three training datasets. Model performance was evaluated using five-fold cross-validation, which allows training on multiple train-test splits, thereby reducing the likelihood of overfitting and providing a more reliable estimate of performance on unseen data [62].
The resulting models were subsequently applied to the Test-DS, and regression performance was quantified using three metrics: MAE, root mean square error (RMSE), and the coefficient of determination (R2) [63,64]. While R2 measures the proportion of variance in the target variable explained by a model, the MAE and RMSE provide a more direct measurement of the prediction accuracy [63,64]. Therefore, model performance was primarily interpreted based on MAE and RMSE, as these metrics more accurately reflect the practical predictive capability even when R2 values are moderate. A detailed description of these metrics is provided in the Supplementary Material. A schematic overview of the methodology used in this study is presented in Scheme 1.
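The three metrics can be computed directly with scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_report(y_true, y_pred):
    """MAE, RMSE, and R2: the three metrics used to score models on Test-DS."""
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    r2 = r2_score(y_true, y_pred)
    return mae, rmse, r2
```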

3. Results and Discussion

3.1. Models Performance Evaluation

The trained algorithms (developed as models during the training phase) were evaluated on the Test-DS, and the eight best-performing models, based on the R2 metric [63], are shown in Table 2 (see Table S2 in the Supplementary Material for all 16 trained algorithms). Most of the best-performing models were decision tree-based, which is expected given the highly nonlinear and complex relationship between the spectral features and band gap values. The Random Forest and Extra-Trees models achieved the lowest MAE values of 0.27–0.29 eV and the highest R2 values of 0.72–0.73. However, these models required higher memory capacity during both training and inference. In contrast, optimized boosting algorithms such as LightGBM and XGBoost reached comparable accuracy (MAE = 0.32–0.33 eV, R2 = 0.63) while maintaining more compact model representations, implying lower memory overhead for deployment. It is important to note that experimental band gap determinations can vary significantly depending on the analytical evaluation method used, with reported deviations of up to 0.87 eV between manual assessments [65,66]. To evaluate the robustness of the models, data standardization [67] was performed on the preprocessed datasets, followed by algorithm training and model testing. However, the performance on the standardized datasets (see Tables S3 and S4) did not exceed that of the models trained on unstandardized data, highlighting the intrinsic robustness of the best-performing models.
As a benchmark, the band gap of the Test-DS was also determined using a public library from the Automatic Bandgap Extractor package [68], which employs the first derivative of the Kubelka–Munk (K–M) transformed data [69,70]. This benchmark yielded an R2 of −0.85, indicating substantial deviations from expected values. Manual adjustments within the software could improve precision; however, they would require considerable human intervention. Models trained on the derivative-transformed dataset (Der-DS1) and evaluated on its test counterpart (Der-Test-DS) generally showed lower performance than their counterparts trained on the original datasets (see Table 2 and Table S5). The Extra-Trees model maintained a similar R2, while the k-nearest neighbors (kNN) model exhibited a slightly reduced MAE on Der-Test-DS.
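The first-derivative idea behind this baseline can be approximated in a few lines (a deliberately simplified sketch; the actual package performs additional smoothing and peak fitting):

```python
import numpy as np

def derivative_band_gap(energy_eV, km_curve):
    """Simple derivative baseline: take Eg as the photon energy at which the
    first derivative of the K-M transformed curve peaks, i.e., at the steepest
    point of the absorption onset."""
    dydx = np.gradient(km_curve, energy_eV)
    return energy_eV[int(np.argmax(dydx))]
```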
To assess the models’ ability to predict band gaps, the top-performing models were used to predict the band gaps for the entire preprocessed dataset and its derivative counterpart. The resulting scatterplots (Figure 4) were analyzed to evaluate whether the models accurately captured the relationships between the spectral features and the band gap values. A random distribution along the dashed line in Figure 4 indicates strong generalization, while systematic deviations suggest underfitting or missing feature relationships. Figure 4(a1–a3) show the model predictions for the preprocessed dataset, and Figure 4(b1–b3) display the derivative-transformed dataset. Among these, the Extra-Trees model consistently achieved the best performance (Figure 4(a2,b2)), with most predicted band gaps closely aligning with the ideal dotted diagonal line. However, significant deviations were observed across all models for the band gaps in the range of 2.9–3.5 eV, especially for Test-DS and Der-Test-DS. The Extra-Trees model also demonstrated improved prediction accuracy for higher band gaps around ~5 eV (Figure 4(a2)).
Finally, Principal Component Analysis (PCA) was applied to reduce the dimensionality of the preprocessed dataset. PCA transforms correlated variables into a smaller set of uncorrelated principal components, while preserving most of the data variance [55,57]. Using the covariance matrix of the standardized features, the number of components was chosen to explain 95% of the total variance [55,57]. The resulting dataset (PCA-DS) and corresponding test set (PCA-Test-DS) were used for model training and evaluation. However, the performance of PCA-based models was significantly poorer than that of models trained on the original data. Consequently, PCA was excluded from further analysis.
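The PCA reduction can be reproduced with scikit-learn's fractional n_components interface (random stand-in data below in place of the spectra):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(123)
X = rng.random((100, 901))  # stand-in for the 901-point spectra

X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95)  # fewest components explaining >= 95% variance
X_pca = pca.fit_transform(X_std)
n_kept = pca.n_components_
```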
In summary, decision tree-based models (e.g., Bagging, Extra-Trees, and Random Forest) demonstrated superior performance compared to the other algorithms, suggesting that the dataset contains complex, nonlinear dependencies effectively captured by ensemble trees. The highest R2 (0.73) represents a solid result, though not yet optimal. The limited performance may be attributed to the restricted training data or non-optimized hyperparameters. The following section explores optimization strategies, namely hyperparameter tuning and data augmentation, to enhance diversity and reduce overfitting.

3.2. Model Hyperparameter Tuning

To enhance the models’ performance (Table 2), hyperparameter tuning was performed using a genetic algorithm implemented using the sklearn-genetic Python library [59]. This method integrates evolutionary algorithms with cross-validation within the scikit-learn framework to efficiently optimize the model parameters. The search space for each model’s relevant hyperparameters was carefully defined, as summarized in Table 1, to guide the genetic optimization process. A total of 10 generations were selected for the genetic algorithm. Increasing the number of generations yielded only marginal performance improvements (on the order of hundredths), which did not justify the substantially higher computational time. Since decision tree-based models initially showed the best performance, their estimator range values were harmonized for consistency across models (except for Random Forest, which used a broader range between 50 and 1000 estimators to assess the influence of this parameter on performance). The optimized hyperparameter values obtained from the genetic search are listed in Table 3, and the corresponding model performance on the Test-DS is presented in Table 4.
The optimization process led to consistent improvements across all evaluated models, as reflected by higher R2 and lower MAE values compared to the results obtained without tuning (Table 2).
The most significant improvement was observed for the k-nearest neighbors model, where the R2 increased by nearly 200%. In contrast, models that initially achieved strong baseline performance, such as the Extra-Trees and Random Forest regressors, showed only slight improvements. This indicates that these algorithms might have already reached their maximum generalization capacity given the provided training dataset (DS1).

3.3. Data Augmentation

The first augmentation method (Method 1) expanded the original dataset (DS1) from 617 to 71,249 spectra (resulting in DS2). The models were retrained on DS2 following the procedure described in Section 3.1, and their performance was tested on the Test-DS. A slight decline in model performance was observed when comparing the results in Table 4 with those in Table 2. This decrease is likely attributable to the limited diversity introduced by Method 1, which increased the dataset volume but not its representational variety.
The second augmentation strategy (Method 2) generated 222 new synthetic spectra that were structurally different from the original DS1 data, resulting in a dataset of 839 spectra (DS3), a 35.9% increase in size. The models were again trained and evaluated on the Test-DS following the same protocol (Section 3.1). However, the model performance (Table 4) showed no significant improvement relative to the original results (Table 2). Specifically, the Random Forest and Extra-Trees models, which initially performed best (MAE = 0.28 eV, R2 = 0.74), showed either a slight decrease in R2 (to 0.71) or negligible changes in MAE. On the other hand, models like Bagging showed slight improvements in both MAE (from 0.30 to 0.26 eV) and R2 (from 0.71 to 0.74), likely due to the ensemble's inherent advantage in leveraging minor variations within the training data. The most notable improvement was observed in the k-Nearest Neighbors model, where R2 increased from 0.1 to 0.4 and the MAE decreased from 0.5 to 0.44 eV. However, these values remain substantially lower than those achieved by the Extra-Trees models.
When considering these results alongside model complexity, a clear trend emerges. Bagging ensembles (e.g., Random Forest, Extra-Trees, and Bagging) consistently achieved the highest absolute accuracy, albeit at the cost of increased computational resources due to their extensive tree depth and ensemble size. Among them, Extra-Trees proved the fastest to train, as it bypasses the exhaustive search for optimal splits by using random thresholds. In contrast, optimized boosting implementations (e.g., XGBoost and LightGBM) offered competitive accuracy with more compact model structures and lower deployment overhead.
Figure 5 shows the scatterplots for the three best-performing models after applying hyperparameter tuning and data augmentation. Training with hyperparameter tuning via the sklearn-genetic framework resulted in improved performance for the three models compared to the same training in Figure 4(a1–a3), meaning the blue squares (training predictions) are closer to the dashed line. However, model performance on the Test-DS (red squares) remained similar to that shown in Figure 4(a1–a3), indicating that generalization to unseen data was largely unchanged.
When analyzing the models trained with DS2 (Method 1, Figure 5(b1–b3)), the performance patterns remained consistent with those in Figure 4(a1–a3), despite the substantial increase in training data. The Extra-Trees model continued to outperform the others, yet without notable improvement on the Test-DS. Training with DS3 (Method 2, SMOGN-generated spectra, Figure 5(c1–c3)) resulted in improved fits to the training dataset, as evidenced by the closer alignment of training predictions (blue squares) with the dashed line. However, test predictions (red squares) remained scattered, suggesting limited generalization and possible overfitting. Quantitatively, the Bagging model improved slightly (MAE from 0.30 to 0.28 eV), the Random Forest improved from an MAE of 0.29 to 0.26 eV, while Extra-Trees remained effectively unchanged. In summary, the Random Forest model achieved the highest R2 (0.74), while Bagging and Extra-Trees yielded the lowest MAE and RMSE of 0.27 eV and 0.39 eV, respectively. These results are consistent with those reported by Asad et al. [48], where comparable accuracy was found to be satisfactory for predicting the band gap of organic materials. However, this study highlights the persistent challenges of generating sufficiently diverse and representative synthetic data to improve the band gap energy prediction of HEOs from UV-Vis spectroscopy. Further research could explore more advanced data augmentation techniques or alternative model architectures that better leverage the available data for enhanced predictive performance.

3.4. Application Development for Visualization and Decision Making

To optimize band gap estimation and facilitate decision making, we developed a user-friendly software tool, available for download at https://github.com/jphoyos/bandgap-predictor/tree/main/bandgap_predictor (accessed on 20 November 2025). The software accepts a properly formatted input dataset and allows users to choose from three machine learning models (Bagging, Extra-Trees, or Random Forest) for band gap prediction, as shown in Figure 6. The tool is particularly advantageous for handling large-scale datasets, as it efficiently processes multiple samples and provides accurate predictions with minimal computational effort. Examples demonstrating the visualization features and band gap prediction capabilities of the software are provided in the Supplementary Material.

4. Conclusions

In this study, we developed a machine learning framework for predicting the band gap energies of high-entropy oxides from UV-Vis spectroscopy data. The approach integrates robust regression models optimized through genetic hyperparameter tuning to identify the most effective models for this task. By systematically comparing several state-of-the-art regression models, we demonstrated that machine learning can substantially improve the accuracy of band gap predictions compared to traditional analytical techniques.
Data augmentation techniques, such as curve shifting and synthetic minority over-sampling, successfully increased the dataset size but had only a minor impact on the model performance. Decision tree-based models, including Bagging, Extra-Trees, and Random Forest, consistently delivered the best results, although further data augmentation led to only marginal or even slightly reduced accuracy. This indicates that these models had already captured the key relationships within the preprocessed dataset, while the augmented data did not introduce sufficient new information to significantly improve predictions. This observation highlights the intrinsic difficulty of creating synthetic yet physically meaningful data for complex multicomponent systems such as HEOs.
Hyperparameter tuning with genetic algorithms yielded modest performance improvements, indicating that while tuning helps refine predictive accuracy, the model architecture itself remains the dominant factor. Expanding the range of estimators did not significantly influence the results, confirming that the models were already effectively exploiting the available information.
Compared with traditional analytical methods, which can exhibit errors up to 0.87 eV depending on the evaluation approach, our best-performing machine learning models achieved a mean absolute error (MAE) of 0.26 eV. When evaluated against an automated baseline method (GapExtractor) applied to the same dataset, the machine learning models, particularly Bagging, Extra-Trees, and Random Forest, achieved substantially higher R2 values with lower errors, underscoring their superior ability to capture complex, nonlinear relationships that may otherwise remain undetected through manual or rule-based analysis. Moreover, these models can streamline high-throughput characterization by automating band gap estimation, significantly reducing the experimental effort and analysis time. This demonstrates the strong potential of machine learning to accelerate materials discovery by efficiently navigating large compositional spaces.
While the present models were trained on combinatorial data involving up to six elements, their performance on other material systems may be limited. Expanding the experimental dataset with additional spectra will be essential to developing more robust and generalizable models. Furthermore, the current framework assumes uniform data dimensionality (i.e., identical spectral resolution), which constrains its direct applicability to datasets with varying wavelength grids.
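One simple way to relax the fixed-dimensionality constraint is to resample every spectrum onto a shared wavelength grid before building feature vectors. A minimal sketch follows; the grid limits and step are assumptions, not values from this work.

```python
import numpy as np

def resample_spectrum(wavelength_nm, signal, grid=np.linspace(250, 800, 551)):
    """Linearly interpolate a spectrum onto a shared wavelength grid so that
    spectra measured with different resolutions yield feature vectors of one
    common length. Points outside the measured range are clamped to the
    nearest edge value (np.interp default)."""
    order = np.argsort(wavelength_nm)
    return np.interp(grid, np.asarray(wavelength_nm, float)[order],
                     np.asarray(signal, float)[order])
```

Linear interpolation is the simplest choice; spline resampling or binning would serve equally well, provided every spectrum ends up on the same grid.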
Future research could explore more advanced data augmentation techniques and alternative model architectures capable of capturing complex spectral dependencies. For instance, incorporating spectra measured under different synthesis conditions or adding controlled noise could enhance model robustness. Additionally, building larger and more diverse datasets, including both direct and indirect bandgap materials, would create a richer feature space, enabling machine learning models to better learn the underlying physics. Advanced deep learning architectures, such as convolutional neural networks (CNNs), transformers, or hybrid models, could further improve predictive power. Ultimately, integrating such models into automated, high-throughput experimental pipelines could enable real-time band gap prediction and significantly advance the practical application of machine learning in materials research.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/eng6120340/s1, Table S1. Description of the dataset, where n is the total number of samples, m is the number of measured points, and id is the dataset identifier. Sample status refers to the preparation method; for IDs 1–8, 900 and 700 are heat-treatment temperatures, while air, argon, and H2 indicate the heat-treatment atmosphere; Figure S1. Dataset identifiers plotted after the Kubelka–Munk transformation; Table S2. Performance of the 16 models on Test-DS; Table S3. Performance of the models on the preprocessed dataset with standardization; Table S4. Performance of the models on the preprocessed dataset with normalization; Table S5. Performance of the models on the Der-Test-DS; Table S6. Pipeline: standard scaler + PCA + estimator over the preprocessed dataset; Figure S2. Feature-importance analysis for the Random Forest model using both permutation importance and SHAP values; Software user's manual.

Author Contributions

Writing—original draft, visualization, data curation, methodology, software: J.P.H.-S.; writing—review and editing, resources, H.H.; writing—review and editing, resources, S.K.J.; writing—original draft, visualization, supervision, resources, S.S.; supervision, data curation, writing—review and editing, conceptualization, resources, L.V. All authors have read and agreed to the published version of the manuscript.

Funding

Juan P. Hoyos and L. Velasco are grateful for the support provided by Universidad Nacional de Colombia (HERMES Project no. 61237). L. Velasco is grateful for the support provided by Universidad Nacional de Colombia (HERMES projects no. 57862, 57683, and 61001). S.S. acknowledges financial support from the KIT via the project Auto.MAP and the Helmholtz Program "Materials Systems Engineering" under program no. 43.31.01. Scheme 1 was created using Canva and is subject to the platform's terms of use.

Data Availability Statement

Data and Python code for this article are available open-source under an Apache-2.0 license on GitHub at https://github.com/jphoyos/bandgap-predictor, accessed on 20 November 2025.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Cantor, B.; Chang, I.T.H.; Knight, P.; Vincent, A.J.B. Microstructural development in equiatomic multicomponent alloys. Mater. Sci. Eng. A 2004, 375–377, 213–218. [Google Scholar] [CrossRef]
  2. Yeh, J.W.; Chen, S.K.; Lin, S.J.; Gan, J.Y.; Chin, T.S.; Shun, T.T.; Tsau, C.H.; Chang, S.Y. Nanostructured high-entropy alloys with multiple principal elements: Novel alloy design concepts and outcomes. Adv. Eng. Mater. 2004, 6, 299–303. [Google Scholar] [CrossRef]
  3. Biswas, K.; Yeh, J.-W.; Bhattacharjee, P.P.; DeHosson, J.T.M. High entropy alloys: Key issues under passionate debate. Scr. Mater. 2020, 188, 54–58. [Google Scholar] [CrossRef]
  4. Rost, C.M.; Sachet, E.; Borman, T.; Moballegh, A.; Dickey, E.C.; Hou, D.; Jones, J.L.; Curtarolo, S.; Maria, J.P. Entropy-stabilized oxides. Nat. Commun. 2015, 6, 8485. [Google Scholar] [CrossRef]
  5. Gild, J.; Zhang, Y.; Harrington, T.; Jiang, S.; Hu, T.; Quinn, M.C.; Mellor, W.M.; Zhou, N.; Vecchio, K.; Luo, J. High-Entropy Metal Diborides: A New Class of High-Entropy Materials and a New Type of Ultrahigh Temperature Ceramics. Sci. Rep. 2016, 6, 37946. [Google Scholar] [CrossRef]
  6. Zhou, J.; Zhang, J.; Zhang, F.; Niu, B.; Lei, L.; Wang, W. High-entropy carbide: A novel class of multicomponent ceramics. Ceram. Int. 2018, 44, 22014–22018. [Google Scholar] [CrossRef]
  7. Gorsse, S.; Nguyen, M.H.; Senkov, O.N.; Miracle, D.B. Database on the mechanical properties of high entropy alloys and complex concentrated alloys. Data Brief 2018, 21, 2664–2678, Erratum in Data Brief 2020, 32, 106216. https://doi.org/10.1016/j.dib.2020.106216. [Google Scholar] [CrossRef]
  8. Tong, C.J.; Chen, Y.L.; Chen, S.K.; Yeh, J.W.; Shun, T.T.; Tsau, C.H.; Lin, S.J.; Chang, S.Y. Microstructure characterization of AlxCoCrCuFeNi high-entropy alloy system with multiprincipal elements. Metall. Mater. Trans. A 2005, 36, 881–893. [Google Scholar] [CrossRef]
  9. Yeh, J.W.; Lin, S.J. Breakthrough applications of high-entropy materials. J. Mater. Res. 2018, 33, 3129–3137. [Google Scholar] [CrossRef]
  10. Oses, C.; Toher, C.; Curtarolo, S. High-entropy ceramics. Nat. Rev. Mater. 2020, 5, 295–309. [Google Scholar] [CrossRef]
  11. Tsai, M.H.; Yeh, J.W. High-entropy alloys: A critical review. Mater. Res. Lett. 2014, 2, 107–123. [Google Scholar] [CrossRef]
  12. George, E.P.; Raabe, D.; Ritchie, R.O. High-entropy alloys. Nat. Rev. Mater. 2019, 4, 515–534. [Google Scholar] [CrossRef]
  13. Zhang, Y.; Zuo, T.T.; Tang, Z.; Gao, M.C.; Dahmen, K.A.; Liaw, P.K.; Lu, Z.P. Microstructures and properties of high-entropy alloys. Prog. Mater. Sci. 2014, 61, 1–93. [Google Scholar] [CrossRef]
  14. Wang, Q.; Velasco, L.; Breitung, B.; Presser, V. High-Entropy Energy Materials in the Age of Big Data: A Critical Guide to Next-Generation Synthesis and Applications. Adv. Energy Mater. 2021, 11, 2102355. [Google Scholar] [CrossRef]
  15. Xiang, X.D.; Sun, X.; Briceño, G.; Lou, Y.; Wang, K.A.; Chang, H.; Wallace-Freedman, W.G.; Chen, S.W.; Schultz, P.G. A combinatorial approach to materials discovery. Science 1995, 268, 1738–1740. [Google Scholar] [CrossRef] [PubMed]
  16. Potyrailo, R.; Rajan, K.; Stoewe, K.; Takeuchi, I.; Chisholm, B.; Lam, H. Combinatorial and high-throughput screening of materials libraries: Review of state of the art. ACS Comb. Sci. 2011, 13, 579–633. [Google Scholar] [CrossRef]
  17. Gebhardt, T.; Music, D.; Takahashi, T.; Schneider, J.M. Combinatorial thin film materials science: From alloy discovery and optimization to alloy design. Thin Solid Films 2012, 520, 5491–5499. [Google Scholar] [CrossRef]
  18. Velasco, L.; Castillo, J.S.; Kante, M.V.; Olaya, J.J.; Friederich, P.; Hahn, H. Phase–Property Diagrams for Multicomponent Oxide Systems toward Materials Libraries. Adv. Mater. 2021, 33, 2102301. [Google Scholar] [CrossRef] [PubMed]
  19. Guilmard, M. Effects of aluminum on the structural and electrochemical properties of LiNiO2. J. Power Sources 2003, 115, 305–314. [Google Scholar] [CrossRef]
  20. Ceder, G.; Chiang, Y.-M.; Sadoway, D.R.; Aydinol, M.K.; Jang, Y.-I.; Huang, B. Identification of cathode materials for lithium batteries guided by first-principles calculations. Nature 1998, 392, 694–696. [Google Scholar] [CrossRef]
  21. Pollice, R.; Gomes, G.D.P.; Aldeghi, M.; Hickman, R.J.; Krenn, M.; Lavigne, C.; Lindner-D’Addario, M.; Nigam, A.; Ser, C.T.; Yao, Z.; et al. Data-Driven Strategies for Accelerated Materials Design. Acc. Chem. Res. 2021, 54, 849–860. [Google Scholar] [CrossRef] [PubMed]
  22. Zhang, R.Z.; Gucci, F.; Zhu, H.; Chen, K.; Reece, M.J. Data-Driven Design of Ecofriendly Thermoelectric High-Entropy Sulfides. Inorg. Chem. 2018, 57, 13027–13033. [Google Scholar] [CrossRef] [PubMed]
  23. Cole, J.M. A Design-to-Device Pipeline for Data-Driven Materials Discovery. Acc. Chem. Res. 2020, 53, 599–610. [Google Scholar] [CrossRef]
  24. Jordan, M.I.; Mitchell, T.M. Machine learning: Trends, perspectives, and prospects. Science 2015, 349, 255–260. [Google Scholar] [CrossRef]
  25. Chen, A.; Zhang, X.; Zhou, Z. Machine learning: Accelerating materials development for energy storage and conversion. InfoMat 2020, 2, 553–576. [Google Scholar] [CrossRef]
  26. Saeki, A.; Kranthiraja, K. A high throughput molecular screening for organic electronics via machine learning: Present status and perspective. Jpn. J. Appl. Phys. 2020, 59, SD0801. [Google Scholar] [CrossRef]
  27. Kirklin, S.; Saal, J.E.; Meredig, B.; Thompson, A.; Doak, J.W.; Aykol, M.; Rühl, S.; Wolverton, C. The Open Quantum Materials Database (OQMD): Assessing the accuracy of DFT formation energies. NPJ Comput. Mater. 2015, 1, 15010. [Google Scholar] [CrossRef]
  28. Hafner, J.; Wolverton, C.; Ceder, G. Toward Computational Materials Design: The Impact of Density Functional Theory on Materials Research. MRS Bull. 2006, 31, 659–668. [Google Scholar] [CrossRef]
  29. Kumbhakar, M.; Khandelwal, A.; Jha, S.K.; Kante, M.V.; Keßler, P.; Lemmer, U.; Hahn, H.; Aghassi-Hagmann, J.; Colsmann, A.; Breitung, B.; et al. High-Throughput Screening of High-Entropy Fluorite-Type Oxides as Potential Candidates for Photovoltaic Applications. Adv. Energy Mater. 2023, 13, 2204337. [Google Scholar] [CrossRef]
  30. Schweidler, S.; Schopmans, H.; Reiser, P.; Boltynjuk, E.; Olaya, J.J.; Singaraju, S.A.; Fischer, F.; Hahn, H.; Friederich, P.; Velasco, L. Synthesis and Characterization of High-Entropy CrMoNbTaVW Thin Films Using High-Throughput Methods. Adv. Eng. Mater. 2023, 25, 2200870. [Google Scholar] [CrossRef]
  31. Sarkar, A.; Wang, Q.; Schiele, A.; Chellali, M.R.; Bhattacharya, S.S.; Wang, D.; Brezesinski, T.; Hahn, H.; Velasco, L.; Breitung, B. High-Entropy Oxides: Fundamental Aspects and Electrochemical Properties. Adv. Mater. 2019, 31, 1806236. [Google Scholar] [CrossRef]
  32. Anand, G.; Wynn, A.P.; Handley, C.M.; Freeman, C.L. Phase stability and distortion in high-entropy oxides. Acta Mater. 2018, 146, 119–125. [Google Scholar] [CrossRef]
  33. Jiang, S.; Hu, T.; Gild, J.; Zhou, N.; Nie, J.; Qin, M.; Harrington, T.; Vecchio, K.; Luo, J. A new class of high-entropy perovskite oxides. Scr. Mater. 2018, 142, 116–120. [Google Scholar] [CrossRef]
  34. Chellali, M.R.; Sarkar, A.; Nandam, S.H.; Bhattacharya, S.S.; Breitung, B.; Hahn, H.; Velasco, L. On the homogeneity of high entropy oxides: An investigation at the atomic scale. Scr. Mater. 2019, 166, 58–63. [Google Scholar] [CrossRef]
  35. Sarkar, A.; Kruk, R.; Hahn, H. Magnetic properties of high entropy oxides. Dalton Trans. 2021, 50, 1973–1982. [Google Scholar] [CrossRef]
  36. Bérardan, D.; Franger, S.; Dragoe, D.; Meena, A.K.; Dragoe, N. Colossal dielectric constant in high entropy oxides. Phys. Status Solidi–Rapid Res. Lett. 2016, 10, 328–333. [Google Scholar] [CrossRef]
  37. Sarkar, A.; Velasco, L.; Wang, D.; Wang, Q.; Talasila, G.; de Biasi, L.; Kübel, C.; Brezesinski, T.; Bhattacharya, S.S.; Hahn, H.; et al. High entropy oxides for reversible energy storage. Nat. Commun. 2018, 9, 3400. [Google Scholar] [CrossRef] [PubMed]
  38. Wang, L.; Hossain, M.D.; Du, Y.; Chambers, S.A. Exploring the potential of high entropy perovskite oxides as catalysts for water oxidation. Nano Today 2022, 47, 101697. [Google Scholar] [CrossRef]
  39. Wang, J.; Stenzel, D.; Azmi, R.; Najib, S.; Wang, K.; Jeong, J.; Sarkar, A.; Wang, Q.; Sukkurji, P.A.; Bergfeldt, T.; et al. Spinel to Rock-Salt Transformation in High Entropy Oxides with Li Incorporation. Electrochem 2020, 1, 60–74. [Google Scholar] [CrossRef]
  40. Sarkar, A.; Eggert, B.; Velasco, L.; Mu, X.; Lill, J.; Ollefs, K.; Bhattacharya, S.S.; Wende, H.; Kruk, R.; Brand, R.A.; et al. Role of intermediate 4 f states in tuning the band structure of high entropy oxides. APL Mater. 2020, 8, 051111. [Google Scholar] [CrossRef]
  41. Landi, S.; Segundo, I.R.; Freitas, E.; Vasilevskiy, M.; Carneiro, J.; Tavares, C.J. Use and misuse of the Kubelka-Munk function to obtain the band gap energy from diffuse reflectance measurements. Solid State Commun. 2022, 341, 114573. [Google Scholar] [CrossRef]
  42. Escobedo-Morales, A.; Ruiz-López, I.I.; Ruiz-Peralta, M.D.; Tepech-Carrillo, L.; Sánchez-Cantú, M.; Moreno-Orea, J.E. Automated method for the determination of the band gap energy of pure and mixed powder samples using diffuse reflectance spectroscopy. Heliyon 2019, 5, e01505. [Google Scholar] [CrossRef]
  43. Morales, A.E.; Ruiz-López, I. GapExtractor, Mendeley Data, V1; Benemerita Universidad Autonoma de Puebla: Puebla, Mexico, 2020. [Google Scholar] [CrossRef]
  44. Mursyalaat, V.; Variani, V.I.; Arsyad, W.O.S.; Firihu, M.Z. The development of program for calculating the band gap energy of semiconductor material based on UV-Vis spectrum using delphi 7.0. J. Phys. Conf. Ser. 2023, 2498, 012042. [Google Scholar] [CrossRef]
  45. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef]
  46. Morgan, D.; Jacobs, R. Opportunities and Challenges for Machine Learning in Materials Science. Annu. Rev. Mater. Res. 2020, 50, 71–103. [Google Scholar] [CrossRef]
  47. Mueller, T.; Kusne, A.G.; Ramprasad, R. Machine Learning in Materials Science. In Reviews in Computational Chemistry; John Wiley & Sons, Ltd.: Hoboken, NJ, USA, 2016; pp. 186–273. [Google Scholar] [CrossRef]
  48. Khan, A.; Tayara, H.; Chong, K.T. Prediction of organic material band gaps using graph attention network. Comput. Mater. Sci. 2023, 220, 112063. [Google Scholar] [CrossRef]
  49. Van Rossum, G.; Drake, F.L. Python 3 Reference Manual: (Python Documentation Manual Part 2); CreateSpace Independent Publishing Platform: North Charleston, SC, USA, 2009. [Google Scholar]
  50. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
  51. Khalifa, R.M.; Yacout, S.; Bassetto, S. Developing machine-learning regression model with Logical Analysis of Data (LAD). Comput. Ind. Eng. 2021, 151, 106947. [Google Scholar] [CrossRef]
  52. Zhang, Y.; Liu, J.; Shen, W. A Review of Ensemble Learning Algorithms Used in Remote Sensing Applications. Appl. Sci. 2022, 12, 8654. [Google Scholar] [CrossRef]
  53. Dunstan, J.; Villena, F.; Hoyos, J.P.; Riquelme, V.; Royer, M.; Ramírez, H.; Peypouquet, J. Predicting no-show appointments in a pediatric hospital in Chile using machine learning. Health Care Manag. Sci. 2023, 26, 313–329. [Google Scholar] [CrossRef]
  54. Dietterich, T.G. Ensemble Methods in Machine Learning. In Multiple Classifier Systems, MCS 2000; Lecture Notes in Computer Science, Proceedings of the First International Workshop, MCS 2000 Cagliari, Italy, 21–23 June 2000; Springer: Berlin/Heidelberg, Germany, 2000; Volume 1857, pp. 1–15. [Google Scholar] [CrossRef]
  55. James, G.; Witten, D.; Hastie, T.; Tibshirani, R. An Introduction to Statistical Learning; Springer: New York, NY, USA, 2021. [Google Scholar] [CrossRef]
  56. Fischler, M.A.; Bolles, R.C. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 1981, 24, 381–395. [Google Scholar] [CrossRef]
  57. Bishop, C.M. Pattern Recognition and Machine Learning, 1st ed.; Information Science and Statistics; Springer: New York, NY, USA, 2006. [Google Scholar]
  58. Fortin, F.-A.; de Rainville, F.-M.; Gardner, M.-A.G.; Parizeau, M.; Gagné, C. DEAP: Evolutionary algorithms made easy. J. Mach. Learn. Res. 2012, 13, 2171–2175. [Google Scholar]
  59. Arenas, R. Rodrigo-Arenas/Sklearn-Genetic-Opt. 2024. Available online: https://github.com/rodrigo-arenas/Sklearn-genetic-opt (accessed on 20 November 2025).
  60. Box, G.E.P.; Jenkins, G.M.; Reinsel, G.C.; Ljung, G.M. Time Series Analysis: Forecasting and Control; John Wiley & Sons: Hoboken, NJ, USA, 2015. [Google Scholar]
  61. Branco, P.; Torgo, L.; Ribeiro, R.P. SMOGN: A Pre-Processing Approach for Imbalanced Regression. In Proceedings of the First International Workshop on Learning with Imbalanced Domains: Theory and Applications, PMLR, Skopje, North Macedonia, 22 September 2017; pp. 36–50. [Google Scholar]
  62. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning; Springer: New York, NY, USA, 2009. [Google Scholar] [CrossRef]
  63. Chicco, D.; Warrens, M.J.; Jurman, G. The coefficient of determination R-squared is more informative than SMAPE, MAE, MAPE, MSE and RMSE in regression analysis evaluation. PeerJ Comput. Sci. 2021, 7, e623. [Google Scholar] [CrossRef]
  64. Chai, T.; Draxler, R.R. Root mean square error (RMSE) or mean absolute error (MAE)?–Arguments against avoiding RMSE in the literature. Geosci. Model Dev. 2014, 7, 1247–1250. [Google Scholar] [CrossRef]
  65. Makuła, P.; Pacia, M.; Macyk, W. How To Correctly Determine the Band Gap Energy of Modified Semiconductor Photocatalysts Based on UV–Vis Spectra. J. Phys. Chem. Lett. 2018, 9, 6814–6817. [Google Scholar] [CrossRef] [PubMed]
  66. Welter, E.S.; Garg, S.; Gläser, R.; Goepel, M. Methodological Investigation of the Band Gap Determination of Solid Semiconductors via UV/Vis Spectroscopy. ChemPhotoChem 2023, 7, e202300001. [Google Scholar] [CrossRef]
  67. Gron, A. Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, 1st ed.; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2017. [Google Scholar]
  68. Siemenn, A.E. PV-Lab/Automatic-Band-Gap-Extractor. 2024. Available online: https://github.com/PV-Lab/Automatic-Band-Gap-Extractor (accessed on 20 November 2025).
  69. Escobedo Morales, A.; Sánchez Mora, E.; Pal, U. Use of Diffuse Reflectance Spectroscopy for Optical Characterization of Un-Supported Nanostructures. Rev. Mex. Fis. 2007, 53, 18–22. [Google Scholar]
  70. Zimnyakov, D.A.; Sevrugin, A.V.; Yuvchenko, S.A.; Fedorov, F.S.; Tretyachenko, E.V.; Vikulova, M.A.; Kovaleva, D.S.; Krugova, E.Y.; Gorokhovsky, A.V. Data on energy-band-gap characteristics of composite nanoparticles obtained by modification of the amorphous potassium polytitanate in aqueous solutions of transition metal salts. Data Brief 2016, 7, 1383–1388. [Google Scholar] [CrossRef]
Figure 1. Eg calculation using the K–M transformation and the GapExtractor software v1.0 [43]. (a) UV-Vis spectrum of CeO2 obtained from reference [18]. (b) (F(R)hν)^2 vs. hν; the direct-gap fit yields Eg = 3.38 eV. (c) (F(R)hν)^(1/2) vs. hν; the indirect-gap fit yields Eg = 3.08 eV.
Figure 2. Pre-processed dataset identifiers plotted after the Kubelka–Munk transformation. (a) ID 0, 106 unique spectra from the dataset in reference [18]. (b–i) IDs 1 to 7 each contain 64 unique spectra, while ID 8 contains 63 unique spectra, from the datasets in references [18,29].
Figure 3. Histograms of the training band gap values. (a) Band gap values obtained from references [18,29]. (b) Band gap values after data augmentation with Method 1. (c) Band gap values after data augmentation with Method 2 (SMOGN).
Scheme 1. Methodology summary.
Figure 4. Scatter plots of the band gaps predicted by the three best models. (a1–a3) Predicted band gap for the preprocessed dataset (i.e., DS1 and Test-DS). (b1–b3) Predicted band gap for the preprocessed dataset after the first derivative (i.e., Der-DS1 and Der-Test-DS). The blue dots represent the training datasets (DS1 and Der-DS1), and the red dots represent the testing datasets (Test-DS and Der-Test-DS). Points closer to the black dashed lines indicate better predictions.
Figure 5. Scatter plots of the band gaps predicted by the three best models. (a1–a3) Predicted band gap for the preprocessed dataset (i.e., DS1 and Test-DS) with hyperparameter tuning. (b1–b3) Predicted band gap for data augmentation using Method 1 (i.e., DS2 and Test-DS) with hyperparameter tuning. (c1–c3) Predicted band gap for data augmentation using Method 2 (i.e., DS3 and Test-DS) with hyperparameter tuning. The blue squares represent the training datasets (DS1, DS2, and DS3), and the red dots represent the testing dataset (Test-DS). Points closer to the black dashed lines indicate better predictions.
Figure 6. Prediction of the bandgap for a given dataset.
Table 1. Hyperparameters for genetic search.
Model / hyperparameter: values range for genetic search

AdaBoost
  n_estimators: Integer(50, 200)
  learning_rate: Continuous(0.01, 1, distribution = 'log-uniform')
  loss: [linear, square, exponential]

Bagging
  n_estimators: Integer(50, 200)
  max_samples: Continuous(0.1, 1.0, distribution = 'log-uniform')
  max_features: Continuous(0.1, 1.0, distribution = 'log-uniform')
  bootstrap: [True, False]
  bootstrap_features: [True, False]

Extra-Trees
  n_estimators: Integer(50, 200)
  bootstrap: [True, False]
  max_depth: [10, 30, 50, None]
  max_features: [sqrt, log2, 1.0]
  min_samples_split: Integer(2, 10)
  min_samples_leaf: Integer(2, 4)

Gradient Boosting
  n_estimators: Integer(50, 200)
  loss: [squared_error, absolute_error]
  max_depth: [10, 30, 50, None]
  max_features: [sqrt, log2, 1.0]
  min_samples_split: Integer(2, 10)
  min_samples_leaf: Integer(2, 4)

LightGBM
  n_estimators: Integer(50, 200)
  learning_rate: Continuous(0.01, 0.3, distribution = 'log-uniform')
  max_depth: Integer(10, 30)
  num_leaves: Integer(20, 100)
  min_child_samples: Integer(5, 20)
  subsample: Continuous(0.7, 1.0, distribution = 'uniform')
  colsample_bytree: Continuous(0.7, 1.0, distribution = 'uniform')
  reg_alpha: Continuous(1 × 10−3, 1.0, distribution = 'log-uniform')
  reg_lambda: Continuous(1 × 10−3, 1.0, distribution = 'log-uniform')

Random Forest
  n_estimators: Integer(50, 1000)
  bootstrap: [True, False]
  criterion: [squared_error, absolute_error, friedman_mse, poisson]
  max_depth: [10, 30, 50, None]
  max_features: [sqrt, log2, 1.0]
  min_samples_split: Integer(2, 10)
  min_samples_leaf: Integer(1, 4)

XGBoost
  n_estimators: Integer(50, 200)
  learning_rate: Continuous(0.05, 0.5)
  subsample: [0.5, 0.75, 1]
  max_depth: [3, 6, 10]

k-Nearest Neighbors
  n_neighbors: Integer(1, 20)
  weights: [uniform, distance]
  p: Integer(1, 2)
Table 2. Metrics of the eight models that presented the best performance.
Trained algorithm (model) | Test-DS: MAE (eV) / RMSE / R2 | Der-Test-DS: MAE (eV) / RMSE / R2
AdaBoost | 0.54 / 0.57 / 0.41 | 0.51 / 0.59 / 0.38
Bagging | 0.30 / 0.40 / 0.71 | 0.35 / 0.53 / 0.50
Extra-Trees | 0.27 / 0.40 / 0.72 | 0.28 / 0.40 / 0.71
Gradient Boosting | 0.32 / 0.43 / 0.67 | 0.35 / 0.51 / 0.53
LightGBM | 0.33 / 0.45 / 0.63 | 0.33 / 0.47 / 0.60
Random Forest | 0.29 / 0.39 / 0.73 | 0.33 / 0.47 / 0.61
XGBoost | 0.32 / 0.46 / 0.63 | 0.35 / 0.56 / 0.45
k-Nearest Neighbors | 0.50 / 0.71 / 0.10 | 0.46 / 0.68 / 0.17
↑↓ Increase/decrease in performance relative to Test-DS. Best metrics are marked in bold.
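For reference, the MAE, RMSE, and R2 values reported in these tables follow their standard definitions; a minimal NumPy sketch (not the paper's code) is:

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """MAE, RMSE and coefficient of determination R^2, as reported in
    Tables 2 and 4 (standard definitions, illustrative implementation)."""
    y_true = np.asarray(y_true, float)
    y_pred = np.asarray(y_pred, float)
    err = y_true - y_pred
    mae = np.mean(np.abs(err))                       # mean absolute error
    rmse = np.sqrt(np.mean(err ** 2))                # root mean squared error
    ss_res = np.sum(err ** 2)                        # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2
```

Note that R2 compares the model against the trivial mean predictor, so a constant prediction scores R2 = 0 even though its MAE can be modest.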
Table 3. Resulting hyperparameter values before and after the genetic search, as well as for the augmentation methods.
Model / hyperparameter: before genetic search | after genetic search | augmented data Method 1 + genetic search | augmented data Method 2 (SMOGN) + genetic search

AdaBoost
  n_estimators: 50 | 139 | 171 | 57
  learning_rate: 1.0 | 0.0141 | 0.1428 | 0.0596
  loss: linear | exponential | square | exponential

Bagging
  n_estimators: 10 | 192 | 151 | 77
  max_samples: 1.0 | 0.7630 | 0.9920 | 0.9549
  max_features: 1.0 | 0.2138 | 0.3106 | 0.2219
  bootstrap: True | False | True | False
  bootstrap_features: False | True | True | True

Extra-Trees
  n_estimators: 100 | 70 | 145 | 128
  bootstrap: False | False | False | False
  max_depth: None | 30 | 30 | None
  max_features: 1.0 | 1.0 | 1.0 | 1.0
  min_samples_split: 2 | 4 | 3 | 2
  min_samples_leaf: 1 | 2 | 1 | 1

Gradient Boosting
  n_estimators: 100 | 163 | 194 | 178
  loss: squared_error | absolute_error | squared_error | absolute_error
  max_depth: 3 | 50 | 10 | None
  max_features: None | 1.0 | log2 | log2
  min_samples_split: 2 | 7 | 7 | 6
  min_samples_leaf: 1 | 1 | 4 | 1

LightGBM
  n_estimators: 100 | 154 | 199 | 163
  learning_rate: 0.1 | 0.03567 | 0.1590 | 0.0950
  max_depth: −1 | 17 | 17 | 17
  num_leaves: 31 | 33 | 42 | 50
  min_child_samples: 20 | 5 | 13 | 9
  subsample: 1.0 | 0.7893 | 0.7272 | 0.8161
  colsample_bytree: 1.0 | 0.9177 | 0.8870 | 0.7374
  reg_alpha: 0.0 | 0.0515 | 0.5198 | 0.0017
  reg_lambda: 0.0 | 0.1838 | 0.0065 | 0.03193

Random Forest
  n_estimators: 100 | 153 | 100 | 885
  bootstrap: True | True | True | False
  criterion: squared_error | poisson | squared_error | absolute_error
  max_depth: None | 30 | None | 30
  max_features: 1.0 | sqrt | 1.0 | sqrt
  min_samples_split: 2 | 2 | 2 | 3
  min_samples_leaf: 1 | 1 | 1 | 1

XGBoost
  n_estimators: None | 91 | None | 136
  learning_rate: None | 0.0702 | None | 0.1022
  subsample: None | 1 | None | 0.5
  max_depth: None | 6 | None | 10

k-Nearest Neighbors
  n_neighbors: 5 | 8 | 5 | 6
  weights: uniform | distance | uniform | distance
  p: 2 | 1 | 2 | 1
Table 4. Performance of data augmentation and genetic search.
Model | Hyperparameter tuning: MAE (eV) / RMSE / R2 | Data augmentation Method 1 + genetic search: MAE (eV) / RMSE / R2 | Data augmentation Method 2 + genetic search: MAE (eV) / RMSE / R2
AdaBoost | 0.39 / 0.48 / 0.58 | 0.56 / 0.62 / 0.30 | 0.41 / 0.48 / 0.59
Bagging | 0.27 / 0.39 / 0.73 | 0.32 / 0.40 / 0.71 | 0.26 / 0.38 / 0.74
Extra-Trees | 0.27 / 0.39 / 0.73 | 0.31 / 0.41 / 0.71 | 0.27 / 0.39 / 0.72
Gradient Boosting | 0.29 / 0.42 / 0.68 | 0.29 / 0.39 / 0.73 | 0.33 / 0.45 / 0.63
LightGBM | 0.28 / 0.42 / 0.68 | 0.32 / 0.43 / 0.67 | 0.29 / 0.42 / 0.69
Random Forest | 0.28 / 0.38 / 0.74 | 0.31 / 0.41 / 0.70 | 0.28 / 0.40 / 0.71
XGBoost | 0.30 / 0.43 / 0.66 | 0.33 / 0.43 / 0.68 | 0.29 / 0.41 / 0.70
k-Nearest Neighbors | 0.47 / 0.62 / 0.31 | 0.47 / 0.61 / 0.33 | 0.44 / 0.58 / 0.40
↑↓ Increase/decrease in performance relative to Test-DS in Table 2. Best metrics are marked in bold.

Share and Cite

MDPI and ACS Style

Hoyos-Sanchez, J.P.; Hahn, H.; Jha, S.K.; Schweidler, S.; Velasco, L. A Machine Learning Approach to Determine the Band Gap Energy of High-Entropy Oxides Using UV-Vis Spectroscopy. Eng 2025, 6, 340. https://doi.org/10.3390/eng6120340

