1. Introduction
Since their discovery through the exfoliation of the MAX phase Ti$_3$AlC$_2$ [1], MXenes, the family of two-dimensional (2D) transition metal carbides, nitrides and carbonitrides with the general formula $M_{n+1}X_nT_x$, have emerged as a versatile platform for energy conversion [2, 3]. The terminal group $T$, ranging from halogens to chalcogens to hydroxyl and imide moieties, governs surface chemistry and in turn the electronic structure, while the transition metal $M$ and the non-metal $X$ fix the inner framework. In photocatalytic water splitting and solar CO$_2$ conversion, these tunable properties translate into a practical design requirement: the electronic bandgap $E_g$ must lie in the window $1.23 \le E_g \le 3.10$ eV, where the lower bound coincides with the redox potential of water splitting and the upper bound marks the edge of the near-ultraviolet region [4, 5, 6]. Identifying compositions that meet this criterion over the combinatorial MXene space is therefore a central problem for next-generation energy materials.
The bandgap is in principle accessible through density functional theory (DFT), but its accuracy depends strongly on the exchange–correlation functional. Semilocal functionals such as Perdew–Burke–Ernzerhof (PBE) systematically underestimate $E_g$, while hybrid functionals such as PBE0 correct this bias at a substantially higher computational cost [
7]. For a class as large as MXenes, screening thousands of compositions at the PBE0 level is therefore impractical and has motivated the development of machine learning (ML) surrogates. Early work by Rajan et al. [
8] demonstrated that kernel and Gaussian process models could predict functionalised MXene bandgaps from simple elemental descriptors. More recently, data-driven studies have leveraged neural network surrogates for crystal property prediction [
9] and applied them to MXenes in particular [
10].
The most directly relevant benchmark in this space is the MXgap framework of Ontiveros et al. [
11], which released an openly available dataset of 4356 MXene structures and a classifier–regressor pipeline with a reported classification accuracy of 92% and a mean absolute error (MAE) of 0.17 eV for bandgap prediction. Their feature set blends periodic table indices, elemental descriptors and the PBE-level density of states (DOS), totalling more than one hundred variables. Around this benchmark, several adjacent studies have appeared: Tang et al. [
10] report MAE of 0.14 eV using a deep high-throughput screening pipeline; Zhang [
12] benchmarks gradient boosting, LightGBM and support vector regression on MXene bandgaps, and identifies the binary flag “
” as the most important predictor, which in fact leaks the classification label into the regression inputs; Dolz et al. [
13] extend similar ideas to single-metal-atom adsorption energies; and the broader landscape is surveyed by Hjiri and Mustapha [
14]. A common thread across these studies is a high-dimensional feature set and, in most cases, a lack of mathematical guarantees on physical consistency or predictive coverage.
From a mathematical standpoint, three ingredients are increasingly seen as enabling rigor in data-driven materials modeling. First, physics-informed neural networks embed constraints derived from physical laws directly into the learning objective [
15]. Second, continuous-depth architectures based on neural ordinary differential equations (Neural ODEs) [
16] and their augmented variants [
17] offer a principled alternative to discrete deep stacks, with adaptive depth and smooth representations. Third, split-conformal inference provides distribution-free coverage certificates for regression predictions, independent of model family [
18]. Complementing these ideas, multi-fidelity learning schemes such as Δ learning correct cheap approximations with learned residuals [
19], while Bayesian optimisation [
20,
21] and deep ensembles [
22,
23] support uncertainty-aware search over candidate materials. These tools are now routinely brought to bear on heterogeneous applied problems, from deep learning for semiconductor band structure reconstruction [
24] and physics-informed multi-objective optimisation in power electronics [
25] to frequency learning hybrids for power system forecasting [
26], artificial intelligence pipelines for soil compaction control [
27] and explainable machine learning methods for manufacturing defect prediction [
28].
Each of these tools, however, comes with characteristic strengths and limitations that motivate the specific way in which we combine them in this work. Physics-informed neural networks make physical laws part of the loss but enforce the constraints only softly; there is no a priori guarantee that a trained network respects them at test time. Neural ODEs replace discrete depth by a continuous flow, which yields adaptive resolution and a smooth latent representation, but their training requires repeated calls to a numerical solver and can be sensitive to step size choices. Split-conformal inference delivers a finite-sample, distribution-free coverage guarantee that is agnostic to the underlying model, but it relies on exchangeability between calibration and test residuals and only certifies marginal (rather than conditional) coverage. Δ learning exploits a cheap low-fidelity baseline and is highly data-efficient when the baseline is informative, but it transfers a portion of the bias of the baseline to the surrogate. Finally, Bayesian optimisation and deep ensembles provide principled uncertainty estimates for candidate ranking but at the cost of either a kernel design choice (Bayesian optimisation with Gaussian processes) or a multiplicative training budget (deep ensembles). The PC-NODE framework introduced below is designed so that these strengths reinforce each other while their limitations are mitigated by architecture or by a complementary tool: non-negativity and PBE/PBE0 monotonicity become architectural guarantees rather than soft penalties; the Δ learning baseline is paired with the conformal layer that calibrates the residual; and the deep-ensemble layer is reduced to a fixed budget of ten seeds rather than a sliding hyperparameter.
To the best of our knowledge, this is the first study to integrate a continuous-depth Neural ODE backbone with multi-fidelity Δ learning, distribution-free split-conformal calibration, and uncertainty-aware Pareto screening into a single mathematically grounded pipeline for MXene bandgap prediction. Concretely, we combine a continuous-depth Neural ODE backbone with a softplus-parameterised Δ learning bridge between PBE and PBE0 that delivers two physical properties exactly, a penalised hurdle coupling that drives metal predictions towards zero, split-conformal intervals for distribution-free coverage, and an uncertainty-aware Pareto screening step that operates on a held-out family of lanthanum-based MXenes. Throughout, we use a compact 34-dimensional descriptor set that excludes the DOS, so that predictions can be made from readily available elemental and geometric information alone.
Our main contributions are threefold:
(i) We introduce a physics-constrained neural ODE (PC-NODE) architecture with a classifier–regressor coupling whose softplus-parameterised Δ learning head guarantees, by construction, two physical properties of the bandgap prediction: non-negativity and monotonicity with respect to the low-fidelity estimate. A third condition, the hurdle condition that metal predictions collapse toward zero, is imposed through a quadratic penalty and verified empirically.
(ii) We pair this surrogate with split-conformal prediction to obtain distribution-free coverage certificates, and verify that the empirical coverage of the reported intervals matches their nominal level on a held-out test split.
(iii) We apply the trained model to a held-out set of 396 lanthanum-based MXenes that was never seen during training, validation or calibration, and run an uncertainty-aware Pareto screening step that combines the predicted bandgap, the classifier probability and the Monte Carlo dropout spread to rank 74 candidates predicted to lie inside the photocatalytic window [1.23, 3.10] eV.
2. Materials and Methods
2.1. Dataset
We use the openly available MXgap dataset [11], which collects 4356 MXene structures of the generic stoichiometry $M_{n+1}X_nT_x$. The transition metal $M$ is drawn from {Sc, Y, Ti, Zr, Hf, V, Nb, Ta, Cr, Mo, W}, the non-metal $X$ from {C, N}, and the terminal group $T$ from {F, Cl, Br, I, O, S, Se, Te, H, OH, NH}. The index $n$ ranges over 1–3 and six relative stacking geometries are considered per composition, yielding a full factorial design with eighteen structures per $(M, X, T)$ combination and $11 \times 2 \times 11 \times 18 = 4356$ entries in total. The dataset additionally contains a held-out set of 396 lanthanum-based MXenes, which we reserve exclusively for out-of-distribution screening in Section 4.
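The factorial count quoted above can be checked directly. The snippet below is only a bookkeeping sketch: it enumerates the composition grid from the element lists given in the text and assumes six stacking labels numbered 0 to 5, which are placeholders rather than the identifiers used in the MXgap files.

```python
from itertools import product

metals = ["Sc", "Y", "Ti", "Zr", "Hf", "V", "Nb", "Ta", "Cr", "Mo", "W"]
nonmetals = ["C", "N"]
terminations = ["F", "Cl", "Br", "I", "O", "S", "Se", "Te", "H", "OH", "NH"]
layer_indices = [1, 2, 3]   # the index n
stackings = range(6)        # placeholder labels for the six stacking geometries

# Full factorial grid over (M, X, T, n, stacking).
grid = list(product(metals, nonmetals, terminations, layer_indices, stackings))

# Distinct (M, X, T) chemistries and the number of structures each one contributes.
compositions = {(m, x, t) for m, x, t, _, _ in grid}
per_composition = len(grid) // len(compositions)
```

With these lists the grid contains 242 distinct chemistries, each contributing eighteen structures.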
Each record carries two electronic targets: the binary label $y \in \{0, 1\}$, which distinguishes metals from semiconductors, and the high-fidelity hybrid-functional bandgap $E_g^{\mathrm{PBE0}}$ in electronvolts. The low-fidelity PBE estimate $E_g^{\mathrm{PBE}}$ is also provided and serves as an input in our multi-fidelity scheme (Section 2.4). Across the full set, 3410 records are classified as metals and 946 as semiconductors, producing a class imbalance of approximately 78:22 that motivates a weighted classification loss. The bandgap distribution among semiconductors is strongly right-skewed, with mean
eV and median
eV; 226 of the 946 semiconducting structures (24%) fall inside the photocatalytic window $[1.23, 3.10]$ eV. We also note, in anticipation of the Δ learning analysis of Section 2.4, that only two of the 946 semiconducting records exhibit a marginally negative difference $E_g^{\mathrm{PBE0}} - E_g^{\mathrm{PBE}}$
(minimum
eV), so the PBE ≤ PBE0 inequality holds on 99.8% of the dataset; the softplus parameterisation that we use later enforces this inequality exactly on predictions.
Figure 1 places each transition metal in its periodic table position and colours it by the mean $E_g^{\mathrm{PBE0}}$ of its semiconducting subset. Group III metals (Sc and Y) generate the largest mean bandgap, while group V and VI metals tend to produce narrow-gap semiconductors; lanthanum, which appears only in the held-out subset, is annotated separately. Figure 2 then projects the semiconductor–metal split onto the $M$–$T$ plane, where each cell reports the fraction of semiconducting structures for the corresponding chemistry. The heterogeneous pattern confirms that bandgap existence is a chemistry-dependent phenomenon rather than a uniform effect and justifies the joint classifier–regressor treatment used throughout this work.
The distribution of semiconducting bandgaps per transition metal is shown as a ridgeline plot in Figure 3. Ordering the metals by median $E_g^{\mathrm{PBE0}}$ highlights a clear structure: Sc and Y dominate the upper end of the range and deposit most of their mass inside the photocatalytic window, whereas W, Ta, Mo and the rest cluster near low values. Finally, Figure 4 visualises the relationship between the low-fidelity $E_g^{\mathrm{PBE}}$ and the high-fidelity $E_g^{\mathrm{PBE0}}$ on the semiconductor subset. The Pearson correlation is
and the PBE values lie systematically below the PBE0 values, confirming the well-known underestimation of semilocal functionals and motivating the multi-fidelity Δ learning formulation introduced in Section 2.4.
Two practical details are worth noting. First, the PBE bandgap column of the lanthanum sheet in the raw spreadsheet is corrupted by a string-concatenation artefact in the original export. Because the PBE bandgap is algebraically identical to the difference between the conduction-band minimum (CBM) and the valence-band maximum (VBM), we reconstruct it from the clean CBM and VBM columns. On the training sheet, this reconstruction is accurate to numerical precision (residual ≤ eV), giving us confidence that the same rule applies to the held-out subset. Second, we use the dataset without additional filtering to preserve reproducibility from the original resource.
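The reconstruction rule described above amounts to a single subtraction followed by a consistency check on rows where the stored gap is trustworthy. A minimal sketch follows; the dictionary keys `cbm`, `vbm` and `gap_pbe` and all numerical values are hypothetical and chosen only for illustration, and they are not the column names or entries of the MXgap spreadsheet.

```python
def reconstruct_gap(cbm_ev, vbm_ev):
    """PBE bandgap as the difference CBM - VBM, both in eV."""
    return cbm_ev - vbm_ev

def max_reconstruction_residual(rows):
    """Largest absolute mismatch between stored and reconstructed gaps."""
    return max(abs(reconstruct_gap(r["cbm"], r["vbm"]) - r["gap_pbe"]) for r in rows)

# Made-up training rows on which the stored gap is assumed to be clean.
training_rows = [
    {"cbm": -3.10, "vbm": -4.35, "gap_pbe": 1.25},
    {"cbm": -2.00, "vbm": -2.00, "gap_pbe": 0.00},
]
residual = max_reconstruction_residual(training_rows)

# The same rule applied to a made-up held-out row whose gap column is unusable.
lanthanum_gap = reconstruct_gap(cbm_ev=-2.75, vbm_ev=-3.50)
```

Checking the residual on the clean sheet before applying the rule to the corrupted sheet mirrors the validation step described in the text.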
2.2. Descriptor Set
The MXgap resource provides four naturally distinct feature blocks: periodic table indices (atomic number, period, and group for M, X and T, nine variables in total), elemental descriptors (electronegativity, electron affinity, van der Waals radius and atomic radius for each of the three positions, twelve variables), geometric descriptors (layer count n, stack and hollow indices, in-plane lattice constant a, out-of-plane spacing d, and several M–T and X–T bond distances, thirteen variables), and a 100-dimensional PBE-level DOS vector. Combined, they amount to 134 variables.
Throughout this work we rely on a compact descriptor set that concatenates the periodic table, elemental and geometric blocks and excludes the DOS, giving $9 + 12 + 13 = 34$ input variables.
This choice is motivated by four complementary considerations.
(i) Physical interpretability of the retained blocks. Each of the three retained blocks has a distinct physical meaning. The periodic table block (nine variables) encodes the periodic table position of M, X and T and provides the most coarse-grained chemical fingerprint of an MXene. The elemental block (12 variables) refines this picture with continuous atomic properties (electronegativity, electron affinity, van der Waals and atomic radii) for each of the three positions. The geometric block (13 variables) captures the structural identity of the MXene through the layer count n, the stacking and hollow indices, the in-plane lattice constant a, the out-of-plane spacing d, and the M–T and X–T bond lengths that determine local bonding. Together, these three blocks are the descriptor classes that can be assembled without first running a self-consistent DFT calculation on the candidate structure.
(ii) Computational cost. The DOS block is, by construction, an output of a DFT calculation; using it as input for a surrogate that is meant to reduce DFT cost would partly defeat its purpose. The periodic table, elemental and geometric blocks are, by contrast, available from the chemical formula and a single relaxed geometry, which is inexpensive relative to a full electronic structure run.
(iii) Information overlap with the multi-fidelity baseline. The Δ learning parameterisation of Equation (2) already uses the PBE-level bandgap $E_g^{\mathrm{PBE}}$ as a scalar baseline; this is one of the same quantities that the DOS block contributes through its bandgap edge channels. Part of the information that a DOS feature vector would carry is therefore already exploited by the architecture, but at the cost of a single scalar rather than a 100-dimensional vector.
(iv) Empirical evidence on this dataset. The empirical comparison reported in the PC-NODE ensemble study of Section 4.2 confirms that, on the MXgap data and at the regularisation budget used here, augmenting the compact set with twelve DOS principal components increases the test MAE from 0.186 eV to 0.217 eV, and the full DOS configuration moves further to 0.233 eV. Adding DOS information here therefore degrades rather than improves generalisation, in line with the feature-to-sample-ratio analysis below.
The resulting feature-to-sample ratio is about $34/4356 \approx 0.008$ for the full 4356 structure pool and about $34/946 \approx 0.036$ on the 946-record semiconductor subset that drives the regression target; both ratios are favourable for the small training partition used in this study. A complementary, model-driven validation of the compact set is provided by the permutation feature importance analysis reported in
Section 4.5; it shows that the geometric and elemental blocks together carry the bulk of the predictive signal, while the periodic table indices play a secondary role.
Figure 5 projects the compact descriptors into two dimensions using a Uniform Manifold Approximation and Projection (UMAP) embedding. Because UMAP does not preserve global geometry in a metric sense, we interpret the resulting layout as a suggestive visual motivation rather than a proof of cluster separability; still, the organisation of points by transition metal, and the structured position of semiconducting members within each neighbourhood, is consistent with the view that chemistry already explains a large share of the variance in $E_g^{\mathrm{PBE0}}$.
2.3. Physics-Constrained Neural ODE
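Let $\mathbf{x} \in \mathbb{R}^{34}$ denote the compact descriptor vector for a given MXene structure and $\tilde{E}_g^{\mathrm{PBE}} = \max(E_g^{\mathrm{PBE}}, 0)$ the clamped low-fidelity bandgap. We seek a map that predicts both the classification target $y$ and the high-fidelity bandgap $E_g^{\mathrm{PBE0}}$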
, and that respects three physically motivated properties. Two of them, non-negativity and PBE ≤ PBE0 monotonicity, are obtained exactly by the architectural parameterisation described below; a third, the hurdle condition that metal predictions should collapse towards zero, is imposed through a quadratic penalty and therefore holds only approximately at optimum. Embedding physical constraints inside Neural ODE backbones is a recent and active research direction, with closely related Physics-Informed Neural ODE formulations recently reported for chemical engineering kinetics [
29] and physics-informed ML reviewed at materials science scale by [
30].
The full architecture is summarised in
Figure 6. Two inputs enter the pipeline: the compact descriptor $\mathbf{x}$ and the scalar low-fidelity baseline $E_g^{\mathrm{PBE}}$. The descriptor is mapped through an encoder $\phi$ to a latent state $\mathbf{h}(0)$, evolved over a unit time interval by the Neural ODE block governed by the velocity field $f_\theta$, and read out at $t = 1$. Two linear heads share this terminal latent: the classifier head outputs the metal/semiconductor probability $\hat{p}$ through a logistic sigmoid, while the regression head produces a raw scalar $r$ that is passed through a softplus to give the non-negative Δ learning correction $\delta = \mathrm{softplus}(r)$. The high-fidelity bandgap prediction is then obtained by adding $\delta$ to the clamped baseline $\tilde{E}_g^{\mathrm{PBE}} = \max(E_g^{\mathrm{PBE}}, 0)$. Because $\delta > 0$ and $\tilde{E}_g^{\mathrm{PBE}} \ge 0$, the resulting prediction is automatically non-negative and at least as large as the PBE estimate, the two architectural guarantees of Proposition 1. The remainder of this subsection makes each block of the schematic precise.
Following the Neural ODE formalism of Chen et al. [
16,
17], the model first encodes the descriptor through an affine–nonlinear encoder $\phi$ and then evolves the resulting latent state $\mathbf{h}(t)$ along a continuous-time trajectory defined by a velocity field $f_\theta$, as stated in Equation (1):
$$\mathbf{h}(0) = \phi(\mathbf{x}), \qquad \frac{\mathrm{d}\mathbf{h}(t)}{\mathrm{d}t} = f_\theta(\mathbf{h}(t), t), \qquad t \in [0, 1], \tag{1}$$
where $f_\theta$ is parameterised by a small multi-layer perceptron with parameters $\theta$ and the terminal state $\mathbf{h}(1)$ carries the learned representation. We integrate over a fixed unit interval with a fourth-order Runge–Kutta scheme; the number of internal steps $S$ is a hyperparameter that controls the effective depth of the representation.
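To make the role of the step count $S$ concrete, the fixed-step integration described above can be sketched in a few lines of plain Python. This is an illustrative toy rather than the training code: the one-layer `mlp_velocity`, the two-dimensional latent state and the weight values are all assumptions made for the example, and the encoder output is replaced by a hard-coded vector.

```python
import math

def mlp_velocity(h, t, W, b):
    """Toy velocity field f_theta(h, t) = tanh(W h + b); t is accepted but unused."""
    return [math.tanh(sum(w * x for w, x in zip(row, h)) + bias)
            for row, bias in zip(W, b)]

def rk4_integrate(f, h0, steps):
    """Integrate dh/dt = f(h, t) from t = 0 to t = 1 with `steps` classical RK4 steps."""
    def shifted(base, k, a):
        return [x + a * ki for x, ki in zip(base, k)]

    h, t, dt = list(h0), 0.0, 1.0 / steps
    for _ in range(steps):
        k1 = f(h, t)
        k2 = f(shifted(h, k1, dt / 2), t + dt / 2)
        k3 = f(shifted(h, k2, dt / 2), t + dt / 2)
        k4 = f(shifted(h, k3, dt), t + dt)
        h = [x + dt / 6 * (a + 2 * b + 2 * c + d)
             for x, a, b, c, d in zip(h, k1, k2, k3, k4)]
        t += dt
    return h

# Hypothetical weights and initial latent state, for illustration only.
W = [[0.5, -0.3], [0.2, 0.1]]
b = [0.0, 0.1]
h0 = [1.0, -1.0]  # stands in for the encoder output phi(x)
h_terminal = rk4_integrate(lambda h, t: mlp_velocity(h, t, W, b), h0, steps=8)
```

Increasing `steps` refines the trajectory at the cost of four velocity-field evaluations per step, which is the sense in which $S$ acts as an effective depth.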
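The terminal state is passed through two linear heads. The classifier head produces a logit $z$, and the regression head outputs a raw scalar $r$ that is transformed into a non-negative correction $\delta = \mathrm{softplus}(r)$. Equation (2) shows how the predicted bandgap is assembled by summing the clamped low-fidelity estimate with the softplus-transformed correction:
$$\hat{p} = \sigma(z), \qquad \delta = \mathrm{softplus}(r) = \log\left(1 + e^{r}\right), \qquad \hat{E}_g^{\mathrm{PBE0}} = \max\left(E_g^{\mathrm{PBE}}, 0\right) + \delta, \tag{2}$$
where $\sigma$ is the logistic sigmoid and $\mathrm{softplus}(r) > 0$ for all $r \in \mathbb{R}$. Writing $\tilde{E}_g^{\mathrm{PBE}} = \max(E_g^{\mathrm{PBE}}, 0)$, we have $\tilde{E}_g^{\mathrm{PBE}} \ge 0$ and $\delta > 0$, so the architecture yields $\hat{E}_g^{\mathrm{PBE0}} > \tilde{E}_g^{\mathrm{PBE}} \ge 0$ for every input. This establishes Proposition 1 below, which makes the two architectural guarantees explicit.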
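Proposition 1 (Architectural guarantees). Under the parameterisation of Equation (2), with $\delta = \mathrm{softplus}(r)$, the following properties hold for every input $\mathbf{x}$ and every $E_g^{\mathrm{PBE}} \in \mathbb{R}$:
- (i) Non-negativity. $\hat{E}_g^{\mathrm{PBE0}} \ge 0$.
- (ii) PBE ≤ PBE0 monotonicity. $\hat{E}_g^{\mathrm{PBE0}} \ge E_g^{\mathrm{PBE}}$.
Proof. By definition $\mathrm{softplus}(r) > 0$ for every $r \in \mathbb{R}$; therefore $\delta > 0$. Claim (i) follows because $\max(E_g^{\mathrm{PBE}}, 0) \ge 0$ and $\delta > 0$. For Claim (ii), if $E_g^{\mathrm{PBE}} \ge 0$ then $\max(E_g^{\mathrm{PBE}}, 0) = E_g^{\mathrm{PBE}}$ and $\hat{E}_g^{\mathrm{PBE0}} = E_g^{\mathrm{PBE}} + \delta > E_g^{\mathrm{PBE}}$; if $E_g^{\mathrm{PBE}} < 0$, the clamp gives $\max(E_g^{\mathrm{PBE}}, 0) = 0$ and $\hat{E}_g^{\mathrm{PBE0}} = \delta > 0 > E_g^{\mathrm{PBE}}$. □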
Remark 1 (empirical activation of the clamp). Although the proof of Proposition 1 accommodates the case $E_g^{\mathrm{PBE}} < 0$ for completeness, in the MXgap database every reported PBE bandgap is non-negative: $E_g^{\mathrm{PBE}} \ge 0$ across all 4356 training records (both 3410 metals and 946 semiconductors) and all 396 records of the lanthanum-based held-out subset of Section 4.9, with no exceptions. Consequently the clamp reduces to the identity on every observed input and introduces no bias to the semiconductor predictions of this study. The clamp is retained in the parameterisation as a defensive mechanism; it preserves the architectural guarantees of Proposition 1 on hypothetical out-of-distribution inputs whose PBE bandgap might be returned as a small negative value by some DFT implementations as a numerical band overlap artefact. A separate point concerns the inequality $E_g^{\mathrm{PBE}} \le E_g^{\mathrm{PBE0}}$, which is violated for two of the 946 semiconducting records (with eV); these cases are clamped by the strictly positive softplus on δ rather than by the $\max(\cdot, 0)$ operation, and we discuss them further in Section 2.4. The hurdle condition $\hat{E}_g^{\mathrm{PBE0}} = 0$ for metals is not an architectural guarantee, because $\delta$ is strictly positive and $E_g^{\mathrm{PBE}}$ can be strictly positive for metallic inputs; it is instead enforced as a penalty by the loss term introduced below, and its empirical efficacy is reported in Section 4.6.
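The two guarantees of Proposition 1 follow from the output parameterisation alone, so they can be checked numerically without any trained weights. The sketch below implements only the read-out of Equation (2); the function name `pc_node_heads` and the test values are illustrative assumptions, and the logit `z` and raw scalar `r` are passed in directly rather than produced by a network.

```python
import math

def softplus(r):
    """Numerically stable log(1 + exp(r))."""
    return max(r, 0.0) + math.log1p(math.exp(-abs(r)))

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def pc_node_heads(z, r, e_pbe):
    """Return (p_hat, delta, e_hat) for a logit z, raw scalar r and PBE gap e_pbe in eV."""
    p_hat = sigmoid(z)
    delta = softplus(r)               # non-negative correction
    e_hat = max(e_pbe, 0.0) + delta   # clamped baseline plus correction
    return p_hat, delta, e_hat

# Sweep over raw outputs and baselines, including a hypothetical negative PBE value.
checks = [pc_node_heads(0.3, r, e_pbe)
          for r in (-30.0, -1.0, 0.0, 5.0, 30.0)
          for e_pbe in (-0.5, 0.0, 1.2)]
```

In floating point, `softplus` can underflow to exactly zero for very negative `r`, so the numerical statement is $\hat{E}_g^{\mathrm{PBE0}} \ge E_g^{\mathrm{PBE}}$ rather than the strict inequality that holds in exact arithmetic.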
The training objective combines a weighted binary cross-entropy (BCE) term on the classification label with a regression term that is applied only on the semiconductor subset, a hurdle penalty that pushes the predicted bandgap towards zero for metals, and a small regulariser that favours compact corrections. The composite loss is given in Equation (3):
$$\mathcal{L} = \mathcal{L}_{\mathrm{BCE}}(y, \hat{p}; w_{+}) + \lambda_{\mathrm{reg}}\,\mathcal{L}_{\mathrm{reg}} + \lambda_{\mathrm{h}}\,\mathcal{L}_{\mathrm{hurdle}} + \lambda_{\delta}\,\mathcal{L}_{\delta}, \tag{3}$$
where $w_{+}$ is the positive-class weight that compensates for the 78:22 imbalance, the $\lambda_{\mathrm{reg}}$ term is the loss on the regression output restricted to semiconductors, the $\lambda_{\mathrm{h}}$ term is the quadratic hurdle penalty that drives metal predictions toward zero, and the $\lambda_{\delta}$ term is a weak shrinkage on the correction $\delta$ that prefers small Δ learning updates whenever they are not required by the data. The multipliers $\lambda_{\mathrm{reg}}$, $\lambda_{\mathrm{h}}$ and $\lambda_{\delta}$ are small positive constants whose values are listed in Section 3.
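The four-term structure of the objective can be illustrated with a short sketch. Several choices below are assumptions made for the example and are not taken from the paper: the regression term is written as an L1 loss, the shrinkage as a mean of squared corrections, semiconductors are labelled `y = 1`, and the multiplier values and toy batch are arbitrary. Only the quadratic form of the hurdle penalty is stated in the text.

```python
import math

def composite_loss(batch, w_pos, lam_reg, lam_hurdle, lam_shrink):
    """batch: list of dicts with keys y (1 = semiconductor), p_hat, e_hat, e_true, delta."""
    n, eps = len(batch), 1e-12
    # Weighted binary cross-entropy on the metal/semiconductor label.
    bce = -sum(w_pos * s["y"] * math.log(s["p_hat"] + eps)
               + (1 - s["y"]) * math.log(1.0 - s["p_hat"] + eps)
               for s in batch) / n
    semis = [s for s in batch if s["y"] == 1]
    metals = [s for s in batch if s["y"] == 0]
    # Regression term on semiconductors only (L1 form is an assumption).
    reg = sum(abs(s["e_hat"] - s["e_true"]) for s in semis) / max(len(semis), 1)
    # Quadratic hurdle penalty on the predicted gap of metals.
    hurdle = sum(s["e_hat"] ** 2 for s in metals) / max(len(metals), 1)
    # Weak shrinkage on the correction (quadratic form is an assumption).
    shrink = sum(s["delta"] ** 2 for s in batch) / n
    return bce + lam_reg * reg + lam_hurdle * hurdle + lam_shrink * shrink

toy_batch = [
    {"y": 1, "p_hat": 0.8, "e_hat": 1.9, "e_true": 2.1, "delta": 0.4},
    {"y": 0, "p_hat": 0.2, "e_hat": 0.1, "e_true": 0.0, "delta": 0.1},
]
w_pos = 3410 / 946  # inverse class ratio from the dataset counts, one common choice
loss_value = composite_loss(toy_batch, w_pos, lam_reg=1.0, lam_hurdle=0.1, lam_shrink=0.01)
```

Because the hurdle term acts only on records with `y = 0`, it cannot pull semiconductor predictions toward zero, which is why the regression and hurdle terms are masked by class.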
Figure 7 summarises the resulting constraint space. The predicted pair
always lies in the upper-right quadrant as a consequence of Proposition 1; the hurdle term pulls metal predictions toward the horizontal axis, and only predictions that additionally fall inside the photocatalytic band are of practical interest for water splitting.
2.4. Multi-Fidelity Δ Learning
The parameterisation of Equation (2) is a neural instantiation of the Δ learning idea of Ramakrishnan et al. [19], an approach that has continued to gain traction in quantum chemical property prediction; for example, Ghosh et al. [31] have recently used Δ machine learning to correct triplet excitation energy predictions of organic chromophores against TDDFT references. In our setting, the cheaper low-fidelity signal $E_g^{\mathrm{PBE}}$ is used as a baseline and the network is asked to learn only the non-negative residual $\delta$
that reconciles PBE with the more expensive PBE0 value. This construction carries three practical benefits. First, the learning problem is easier because the baseline already captures most of the variance (the Pearson correlation between PBE and PBE0 on the semiconductor subset is
, as shown in
Figure 4). Second, the constraint $\hat{E}_g^{\mathrm{PBE0}} \ge E_g^{\mathrm{PBE}}$ is consistent with the general tendency of hybrid functionals to produce bandgaps that are at least as large as those of their semilocal counterparts; in our training set this inequality is violated in only two out of 946 semiconducting records, both with
eV, and these narrow cases are forced to
$\delta > 0$ in prediction by the softplus clamp, introducing a small but bounded bias. Third, the same parameterisation applies verbatim to the lanthanum-based held-out subset, where $E_g^{\mathrm{PBE}}$ is available but $E_g^{\mathrm{PBE0}}$ is not.
2.5. Split-Conformal Prediction
A point prediction of the bandgap is rarely sufficient in materials design, where downstream decisions involve costly synthesis or characterisation. We supplement the PC-NODE surrogate with split-conformal intervals [
18] that deliver finite-sample marginal coverage regardless of the underlying estimator. Conformal prediction has been increasingly adopted in scientific machine learning for the same reason: a recent example in molecular ML is the interpretable conformal-and-counterfactual framework of Ghamsary et al. [
32] for selective inhibitor design. Concretely, we partition the training data into a model-fitting subset and a calibration subset; after training on the former, we evaluate absolute residuals $R_i = |E_{g,i}^{\mathrm{PBE0}} - \hat{E}_{g,i}^{\mathrm{PBE0}}|$ on the latter and obtain the empirical quantile as shown in Equation (4):
$$\hat{q}_{1-\alpha} = R_{(k)}, \qquad k = \left\lceil (n_{\mathrm{cal}} + 1)(1 - \alpha) \right\rceil, \tag{4}$$
where $n_{\mathrm{cal}}$ denotes the number of calibration residuals and $R_{(k)}$ is the $k$-th order statistic. The prediction interval at a new input $\mathbf{x}$ then follows from Equation (5):
$$C_{1-\alpha}(\mathbf{x}) = \left[\max\left(0,\ \hat{E}_g^{\mathrm{PBE0}}(\mathbf{x}) - \hat{q}_{1-\alpha}\right),\ \hat{E}_g^{\mathrm{PBE0}}(\mathbf{x}) + \hat{q}_{1-\alpha}\right], \tag{5}$$
where the lower bound is clipped at zero to respect the physical non-negativity of bandgaps. Under exchangeability of calibration and test residuals, the resulting interval covers the true $E_g^{\mathrm{PBE0}}$ with probability at least $1 - \alpha$. We emphasise that this guarantee holds only under exchangeability between the calibration and test residuals: it transfers to a held-out test subset that is drawn from the same distribution as the calibration set, but it does not automatically transfer to the lanthanum-based out-of-distribution subset of
Section 4.9, where the chemistry of the inputs lies outside the training pool. The consequences of this distinction are made operational in
Section 4.9, where a formally correct treatment would require weighted or nonexchangeable conformal variants [
33], and revisited in the limitations paragraph of
Section 5.
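The split-conformal recipe of this subsection reduces to sorting the calibration residuals and reading off one order statistic. A minimal sketch under the standard construction is given below; the function names and the residual values are illustrative assumptions, and returning infinity when the calibration set is too small for the requested level is the usual convention rather than something stated in the text.

```python
import math

def conformal_quantile(residuals, alpha):
    """k-th smallest absolute residual with k = ceil((n + 1) * (1 - alpha))."""
    n = len(residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this level: the interval is unbounded.
        return math.inf
    return sorted(residuals)[k - 1]

def conformal_interval(e_hat, q):
    """Symmetric interval around e_hat with the lower bound clipped at zero (eV)."""
    return max(0.0, e_hat - q), e_hat + q

# Made-up calibration residuals in eV, for illustration only.
calibration_residuals = [0.05, 0.31, 0.12, 0.08, 0.22, 0.17, 0.40, 0.03, 0.27]
q_hat = conformal_quantile(calibration_residuals, alpha=0.2)
interval = conformal_interval(e_hat=0.15, q=q_hat)
```

The clipping step only ever shortens the interval on the unphysical negative side, so it does not reduce coverage of a non-negative target.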
2.6. Uncertainty-Aware Pareto Screening
Given a trained PC-NODE surrogate, we frame the screening of photocatalytic candidates as a two-objective optimisation problem. We describe this step as
uncertainty-aware Pareto screening rather than Bayesian optimisation because no sequential acquisition loop is involved: the surrogate is trained once and is then queried on a candidate library, and the two objectives are ranked jointly through Pareto dominance and a scalar tie-breaker. For a candidate MXene structure
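$\mathbf{x}$ with PBE estimate $E_g^{\mathrm{PBE}}$, the window distance is defined in Equation (6):
$$d_{\mathrm{w}}(\mathbf{x}) = \max\left(E_{\mathrm{lo}} - \hat{E}_g^{\mathrm{PBE0}}(\mathbf{x}),\ 0,\ \hat{E}_g^{\mathrm{PBE0}}(\mathbf{x}) - E_{\mathrm{hi}}\right), \tag{6}$$
where $E_{\mathrm{lo}} = 1.23$ eV and $E_{\mathrm{hi}} = 3.10$ eV delimit the photocatalytic band. The second objective is the negative classifier probability $-\hat{p}(\mathbf{x})$, so that minimising both simultaneously favours candidates whose predicted bandgap falls inside the water splitting window $[E_{\mathrm{lo}}, E_{\mathrm{hi}}]$ and for which the model is confident that they are semiconducting. The Pareto front over the candidate library is obtained by standard non-dominated sorting. A secondary ranking inside the front is provided by the augmented Chebyshev scalarisation [21], given in Equation (7):
$$s(\mathbf{x}) = \max_{j \in \{1, 2\}} w_j f_j(\mathbf{x}) + \rho \sum_{j=1}^{2} w_j f_j(\mathbf{x}), \tag{7}$$
where $f_1 = d_{\mathrm{w}}$ and $f_2 = -\hat{p}$ are the two objectives, $w_j$ are weights, $\rho$ is a small augmentation term and low values of $s$ indicate preferred candidates. Epistemic uncertainty on the two objectives is propagated from the PC-NODE surrogate through Monte Carlo (MC) dropout [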
23], following common practice for uncertainty estimation in deep models [
22]. Two methodological caveats apply. First, MC dropout estimates only epistemic uncertainty, that is, the variability of the network output under random masks of its hidden units, and does not attempt to capture aleatoric uncertainty arising from intrinsic noise in the data. Second, in out-of-distribution regimes such as the lanthanum-based subset of Section 4.9, the variance of the dropout ensemble is known to be a poor proxy for the true predictive uncertainty and tends to underestimate it [34, 35]; we therefore use the MC dropout spread as a qualitative tie-breaker rather than as a calibrated error bar. The split-conformal intervals of
Section 2.5 therefore remain the primary, distribution-free quantitative coverage tool of the framework, and the MC dropout spread is reported throughout the paper only as a qualitative diagnostic. The framework therefore combines three uncertainty tools, each selected for a specific role: a ten-seed deep ensemble (in the sense of Lakshminarayanan et al. [
22]; see the PC-NODE ensemble study of
Section 4.2) reduces seed-induced variance in the point predictions; the split-conformal step delivers calibrated, distribution-free intervals; and MC dropout produces a cheap per-candidate epistemic spread for use as a tie-breaker on the secondary screening objective. We adopt MC dropout rather than an additional independent-ensemble layer or a Gaussian-process surrogate for two reasons. First, the headline PC-NODE numbers already exploit a deep ensemble of ten independently trained PC-NODE models (
Section 4.2; per-seed standard deviation
eV) and an additional ensemble layer dedicated to the per-candidate spread would be largely redundant given that scale. Second, an exact Gaussian process surrogate scales as $\mathcal{O}(N^3)$ in the training set size $N$ and would, on the 4356-structure MXgap pool, require either a kernel design that the present continuous-depth backbone deliberately avoids or sparse approximations whose own coverage behaviour would have to be re-examined; the MC dropout pass, in contrast, integrates seamlessly with the dropout layers already present in the PC-NODE architecture and adds no training cost. Quantitative coverage guarantees on $E_g^{\mathrm{PBE0}}$ come from the split-conformal intervals of
Section 2.5; the MC dropout spread is reported alongside them as a diagnostic signal. We apply the pipeline to the 396-structure lanthanum-based held-out subset in
Section 4.9.
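The screening step can be sketched end to end on a toy candidate list. The example below makes several assumptions that are not taken from the paper: the window distance is taken to be the distance to the interval [1.23, 3.10] eV, the scalarisation is applied to objectives shifted by an ideal point $(0, -1)$ so that both terms are non-negative, and the weights, $\rho$ and candidate values are arbitrary.

```python
def window_distance(e_hat, lo=1.23, hi=3.10):
    """Distance in eV from e_hat to the interval [lo, hi]; zero inside the window."""
    return max(lo - e_hat, 0.0, e_hat - hi)

def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(objectives):
    """Indices of non-dominated objective vectors (all objectives minimised)."""
    return [i for i, a in enumerate(objectives)
            if not any(dominates(b, a) for j, b in enumerate(objectives) if j != i)]

def augmented_chebyshev(f, weights, ideal, rho=0.05):
    """Weighted max plus a small weighted sum, measured from an ideal point."""
    shifted = [w * (v - z) for w, v, z in zip(weights, f, ideal)]
    return max(shifted) + rho * sum(shifted)

# Made-up candidates as (predicted gap in eV, semiconductor probability).
candidates = [(2.0, 0.90), (1.0, 0.95), (3.5, 0.60), (2.5, 0.70)]
objectives = [(window_distance(e), -p) for e, p in candidates]

front = pareto_front(objectives)
ranked = sorted(front, key=lambda i: augmented_chebyshev(
    objectives[i], weights=(0.5, 0.5), ideal=(0.0, -1.0)))
```

The quadratic-time dominance check is adequate for a library of a few hundred structures such as the lanthanum subset; larger libraries would call for a sorted sweep.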
4. Results
We first report classical baselines across the four descriptor sets of
Section 2.2, then describe the tuned PC-NODE ensemble, audit the physics constraints, report the split-conformal coverage, and finally apply the inverse design pipeline first to the full library and then to the lanthanum-based held-out subset.
4.1. Baseline Comparison Across Descriptor Sets
As a reference for later comparisons, we train Random Forest (RF), Gradient Boosting (GBT) and Multi-Layer Perceptron (MLP) regressors on each of the four descriptor sets defined in
Section 2.2. To enable an apples-to-apples comparison with the PC-NODE ensemble, all baselines use the same stratified 70/15/15 split introduced in
Section 3.1—each model is trained on the PC-NODE training partition, and evaluated on the PC-NODE test partition. The regression task is restricted to the semiconductor subset of each partition, following the MXgap recipe.
Figure 8 summarises the test MAE of each model–descriptor combination and juxtaposes it against two reference values from the literature: the 0.17 eV MAE reported by the MXgap framework [
11] and the 0.14 eV MAE reported by the deep learning screening pipeline of Tang et al. [
10].
Two observations stand out. First, the periodic and elemental-only sets are too shallow for any of the baselines to reach a competitive MAE, with all three models settling between 0.46 and 0.49 eV. Second, the compact set already brings a marked improvement, with the MLP reaching 0.250 eV on this shared test split and the tree-based models hovering near 0.32 eV. The full + DOS set brings a further improvement for the tree-based baselines (GBT reaches 0.244 eV and RF 0.265 eV) but makes the MLP noticeably worse (0.370 eV), which we attribute to the much larger feature-to-sample ratio in that regime. Already at this stage, the best baseline MAE on the shared test split is 0.244 eV, which leaves room for the PC-NODE ensemble reported in the next subsection to close further toward the MXgap reference.
4.2. Tuned PC-NODE Ensemble
With the baseline picture in place,
Table 2 reports the PC-NODE ensemble results over three descriptor settings. Each row aggregates ten training runs with different random seeds, averaged at test time; we report both the per-seed mean and the ensemble-level metric. On the compact set, the ensemble attains a test MAE of 0.186 eV and
, which closes most of the gap to the MXgap reference of 0.17 eV while using only 34 input features and no DOS information. Extending the input by twelve DOS principal components degrades the ensemble MAE to 0.217 eV, and the full + DOS configuration does worse still (0.233 eV). This is a reversal of the typical pattern: for PC-NODE, adding DOS features degrades generalisation on this dataset. The reversal likely stems from two complementary factors. First, the regularisation budget used for the compact set may be insufficient for a 134-dimensional input when only a few thousand training samples are available. Second, the DOS vector is an auxiliary output of the same PBE calculation that produces the PBE-level bandgap, so at least part of the target information is already encoded in the low-fidelity baseline used by our Δ-learning parameterisation, which reduces the marginal value of the DOS block. We therefore use the compact configuration for all subsequent analyses. A cross-descriptor ensemble that averages the three settings gives an intermediate MAE of 0.200 eV, confirming that the compact set dominates the averaging.
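The difference between the per-seed mean and the ensemble-level metric can be illustrated with a minimal sketch. The data below are synthetic and the helper names (`mae`, `ensemble_vs_per_seed`) are hypothetical; the sketch only assumes that the ensemble prediction is the arithmetic mean of the per-seed predictions. Under that assumption, the triangle inequality guarantees that the ensemble MAE cannot exceed the mean per-seed MAE.

```python
import random
from statistics import mean


def mae(pred, true):
    """Mean absolute error between two equally long sequences."""
    return mean(abs(p - t) for p, t in zip(pred, true))


def ensemble_vs_per_seed(seed_preds, true):
    """Return (mean of per-seed MAEs, MAE of the seed-averaged prediction)."""
    per_seed = [mae(p, true) for p in seed_preds]
    # Average the predictions sample by sample across seeds.
    avg_pred = [mean(col) for col in zip(*seed_preds)]
    return mean(per_seed), mae(avg_pred, true)


# Synthetic stand-in: ten "seeds" that differ only by independent noise.
rng = random.Random(0)
true = [rng.uniform(0.0, 4.0) for _ in range(200)]
seed_preds = [[t + rng.gauss(0.0, 0.3) for t in true] for _ in range(10)]

per_seed_mae, ensemble_mae = ensemble_vs_per_seed(seed_preds, true)
print(f"per-seed mean MAE: {per_seed_mae:.3f}  ensemble MAE: {ensemble_mae:.3f}")
```

With independent noise, as in this toy setting, the ensemble error is considerably smaller than the per-seed mean; with strongly correlated seeds the two values would be closer.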
Figure 9 visualises the parity between predicted and true PBE0 bandgaps for the compact ensemble on the held-out test split. Metallic structures, whose PBE0 reference value is zero by construction, cluster along the vertical axis at zero reference gap, with the predicted gaps pulled toward zero by the hurdle term in Equation (
3); they appear as the dark cluster of points hugging the
y-axis in the lower left of the figure. Semiconductors scatter along the diagonal with residuals that remain compact across the whole range
eV. Conformal error bars (see
Section 4.7) are overlaid on a random subset of forty semiconductor points to give a visual sense of the predictive uncertainty.
To quantify the residual structure that is visible on the parity plot, we compute the MAE, bias and residual standard deviation in five bins of the reference gap, together with pooled values over all bins, on the 142 semiconductors of the test split.
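The per-bin diagnostics can be computed along the following lines. This is a minimal sketch on toy numbers: the bin edges, the sample values and the function name `binned_residual_stats` are illustrative assumptions, and the residual is taken as prediction minus reference so that a positive bias means over-prediction.

```python
from statistics import mean, pstdev


def binned_residual_stats(true, pred, edges):
    """Per-bin count, MAE, bias and residual std for bins [edges[i], edges[i+1])."""
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Residual convention (assumed): prediction minus reference value.
        res = [p - t for t, p in zip(true, pred) if lo <= t < hi]
        if not res:
            rows.append({"bin": (lo, hi), "n": 0})
            continue
        rows.append({
            "bin": (lo, hi),
            "n": len(res),
            "mae": mean(abs(r) for r in res),
            "bias": mean(res),
            "std": pstdev(res),
        })
    return rows


# Toy reference and predicted gaps in eV (not values from the study).
true = [0.4, 0.8, 1.1, 1.6, 2.2, 2.9]
pred = [0.5, 0.9, 1.0, 1.4, 2.0, 3.0]
rows = binned_residual_stats(true, pred, edges=[0.0, 1.0, 2.0, 3.0])
for row in rows:
    print(row)
```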
Three observations emerge that nuance the visual reading of
Figure 9. First, the largest per-bin MAE (
eV) and the strongest systematic bias (
eV) lie in the
eV bin, that is, at the lower edge of the photocatalytic window of Equation (
6), and
not at small ground-truth values. Two factors contribute: this bin contains only 24 of 142 test semiconductors, so the empirical residual is the noisiest, and the bin sits in the right tail of the training Δ distribution of Figure 4, where the Δ-learning correction is asked to produce its largest values from a comparatively sparse training signal. We flag this regime as a natural target for future data-augmentation efforts. Second, in the small-gap region (
eV) the residual is positive on average (
eV bias), reflecting a mild tendency of the surrogate to push narrow-gap semiconductors away from zero by the same hurdle pressure that drives metals towards it. This bias is small relative to the overall MAE and does not affect the metal/semiconductor separation at the classifier level. Third, the visual impression that the conformal error bars grow at small ground-truth values is an artefact of the constant-width construction of Equation (
5); the calibrated half-width
eV is identical for every point, but the lower bound is clipped at zero to respect the physical non-negativity of bandgaps, which makes the bars
appear asymmetric near zero even though their calibrated half-width is unchanged. We therefore interpret the figure not as “higher uncertainty at small gaps” but as “constant-width conformal intervals on a heteroscedastic residual landscape whose dominant deviation is concentrated near the photocatalytic-window boundary.”
4.3. Classifier Performance
While the regression MAE of
Table 2 is the headline metric, the classifier head of PC-NODE matters in two distinct ways: it drives the hurdle penalty that collapses metal predictions toward zero, and it produces the
signal that is used as the second objective in the Pareto screening of
Section 4.8.
Table 3 reports its performance on the shared test split. The ensemble reaches
accuracy and
ROC-AUC, with a macro F1 of
and a recall of
on the minority semiconductor class. The lower precision (
) corresponds to 70 false positives out of 654 test structures; the positive-class weight
used in the binary cross-entropy loss deliberately trades some precision for recall to counteract the 78:22 class imbalance. For reference, the MXgap framework [
11] reports a classification accuracy of 92%; our classifier head gives up some accuracy relative to that reference but does not rely on the DOS features.
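For readers who wish to recompute the classifier metrics from a confusion matrix, a minimal sketch follows. The counts passed in the example are toy numbers, not the entries of Table 3, and the function name `binary_metrics` is hypothetical; the semiconductor class is treated as the positive class.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, positive-class precision/recall and macro F1 from confusion counts."""
    total = tp + fp + fn + tn
    # F1 per class, written directly in terms of the confusion counts.
    f1_pos = 2 * tp / (2 * tp + fp + fn)
    f1_neg = 2 * tn / (2 * tn + fn + fp)
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "macro_f1": 0.5 * (f1_pos + f1_neg),
    }


# Toy confusion counts (illustrative only).
m = binary_metrics(tp=40, fp=10, fn=5, tn=45)
print(m)
```

The sketch makes the trade-off explicit: raising the positive-class weight in the loss tends to convert false negatives into true positives (higher recall) at the price of additional false positives (lower precision).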
4.4. Ablation Study
To isolate the contribution of each ingredient of PC-NODE, we train four variants on the same 70/15/15 stratified split over five random seeds and report the resulting per-seed mean and standard deviation of the test MAE on the semiconductor subset. The variants are:
A1—Plain MLP (regression only). A feed-forward network trained directly on the semiconductor subset, without a classifier head, a learning parameterisation or a hurdle penalty. This is the closest conceptual analogue of classical baselines that regress PBE0 in a single shot.
A2—MLP + softplus hurdle. Same feed-forward backbone, but the regression head is replaced by the softplus
Δ-learning parameterisation of Equation (
2) and trained jointly with the classifier using the full loss of Equation (
3).
A3—Neural ODE without physics. The continuous-depth backbone is restored but both the softplus non-negativity constraint and the hurdle penalty are removed; the regression head outputs an unconstrained scalar.
A4—Full PC-NODE. The configuration of
Table 2, with both the continuous-depth backbone and the physics constraints.
Table 4 reports the MAE and the constraint diagnostics for each variant. Three observations stand out. First, the single largest MAE reduction comes from replacing the classical regression-only MLP (A1,
eV) with a joint classifier–regressor whose regression head uses the softplus
Δ-learning parameterisation of Equation (
2) (A2,
eV). Second, removing the physics constraints from the Neural ODE backbone (A3) degrades the MAE only marginally relative to the full PC-NODE (A4), but it produces 6 to 66 positivity violations per seed and a minimum
down to
eV—that is, the unconstrained NODE actively violates the physics of the problem, which is exactly what motivates the architectural design. Third, the difference between the hurdle-based MLP (A2,
eV) and the full PC-NODE (A4,
eV) lies within the per-seed standard deviations of the two variants, so we do not read it as a statistically significant gap; both configurations satisfy Proposition 1 by construction, whereas A3 does not. Taken together, the ablation identifies the softplus
Δ-learning parameterisation as the single most consequential architectural choice. The Neural ODE backbone then provides a continuous-depth, smooth representation that preserves these physical guarantees at comparable predictive accuracy. The headline ensemble result of
Table 2 (MAE
eV over ten seeds) sits below both the A2 and A4 per-seed averages because ensemble averaging reduces variance across seeds.
The qualitative meaning of the constraint diagnostics in
Table 4 can be summarised by a side-by-side comparison of the predicted PBE0 bandgap distributions of variants A3 (unconstrained) and A4 (full PC-NODE) over the same five seeds, illustrated in the Discussion (
Section 5). The unconstrained variant produces a left tail that crosses zero and accumulates 184 predictions in the physically forbidden region of negative predicted gaps
(with a minimum of
eV), whereas the PC-NODE places no mass in that region; every test prediction is non-negative as a consequence of the softplus parameterisation of Equation (2). The narrow A4 spike near zero corresponds to metal samples that are pulled toward zero by the hurdle penalty. The two histograms therefore make the same point that
Table 4 reports numerically, but in a form that does not require reading the diagnostics column-by-column.
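The contrast between the constrained and unconstrained regression heads can be reproduced in a few lines. Equation (2) is not restated here, so the sketch assumes one parameterisation that is consistent with the guarantees described in the text: the clamped PBE gap plus a softplus-transformed network output. The function names, the random stand-in for the network output `z` and the form of the unconstrained head are all illustrative assumptions.

```python
import math
import random


def softplus(z):
    """Numerically stable log(1 + exp(z)); non-negative for every finite z."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def constrained_gap(e_pbe, z):
    # ASSUMED form: clamped low-fidelity gap plus a non-negative correction.
    return max(e_pbe, 0.0) + softplus(z)


def unconstrained_gap(e_pbe, z):
    # Hypothetical unconstrained head: the raw correction may be negative.
    return e_pbe + z


# Random stand-ins for (PBE gap in eV, raw network output).
rng = random.Random(1)
samples = [(rng.uniform(0.0, 2.0), rng.gauss(0.0, 1.5)) for _ in range(1000)]

violations_free = sum(unconstrained_gap(e, z) < 0.0 for e, z in samples)
violations_constrained = sum(constrained_gap(e, z) < 0.0 for e, z in samples)
monotone = all(constrained_gap(e, z) >= e for e, z in samples)
print(violations_free, violations_constrained, monotone)
```

Under this assumed form, non-negativity and the ordering of the prediction above the PBE baseline hold for any network output, which is why no empirical violation can occur for the constrained head.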
To complement this architectural ablation with a feature budget-matched comparison against the MXgap reference of Ontiveros et al. [
11], we additionally evaluated the same PC-NODE backbone on the larger feature regime that they use—the full descriptor set augmented with the 100-dimensional DOS block, totalling 134 features—on the identical
70/15/15 stratified split used throughout this paper. This provides a direct comparison in which only the feature regime changes between configurations: the model architecture, the train/validation/test split, the optimiser and the seed protocol are held fixed. On this MXgap-equivalent feature regime, the PC-NODE ensemble reaches a test MAE of
eV with a classification accuracy of
, i.e.,
worse than the compact configuration (0.186 eV MAE,
accuracy) reported in
Table 2, which uses only 34 features. The compact configuration therefore wins the comparison both on accuracy and on simplicity in our fixed-architecture, fixed-split regime; the residual gap to the published MXgap MAE of 0.17 eV reflects differences in their classifier–regressor pipeline (RF/GBT/SV/MLP cascade) and split protocol (
), as we discuss further in the Discussion (
Section 5). We note that a strict identical-checkpoint comparison would require the trained MXgap models in a redistributable form, which is not available from the public Zenodo record; the present feature-regime-matched comparison is the closest faithful proxy that we are able to run.
The mechanism behind the deterioration that we observe when DOS information is added to the PC-NODE input is best understood as a combination of two effects, of which the first is dominant. (i) Information overlap with the Δ-learning baseline. The PBE-level bandgap is one of the most informative DOS-derived quantities, and our architecture already exploits it: it enters Equation (2) as the Δ-learning baseline rather than as an additional feature. Once that scalar is accounted for, the remaining DOS channels add dimensionality faster than they add new predictive signal. (ii) Curse of dimensionality on neural-network surrogates. The same direction of change is visible in the plain-MLP baseline of Section 4.1 (Figure 8), whose test MAE deteriorates from 0.250 eV on the compact set to 0.370 eV on the full + DOS set; the tree-based baselines, with their built-in feature-selection inductive bias, do not suffer from the same effect. The intermediate DOS-PCA configuration mentioned in the limitations paragraph of Section 5 probes the same regime with a moderate dimensionality penalty and reaches the same qualitative conclusion. We therefore read the deterioration not as overfitting in the strict sense, but as the combination of an architectural redundancy specific to Δ-learning surrogates and the standard high-dimensional penalty on neural networks.
4.5. Permutation Feature Importance
To complement the qualitative justification of the compact descriptor set in
Section 2.2, we report a permutation-based feature importance analysis on the trained PC-NODE. For each of the
34 input features, we shuffle the corresponding column of the test split, run inference, and measure the resulting increase in test MAE on the semiconductor subset. The procedure is repeated over five independent permutation seeds for each of five training seeds, so that each reported value is the mean of 25 paired comparisons against the per-seed baseline. The result is summarised in
Figure 10.
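The permutation procedure described above can be sketched as follows for a single trained model. The toy model, the synthetic data and the function names are illustrative assumptions; in the study the same loop would be repeated for each training seed and the results averaged.

```python
import random
from statistics import mean


def mae(model, X, y):
    """Mean absolute error of a callable model on rows X with targets y."""
    return mean(abs(model(row) - t) for row, t in zip(X, y))


def permutation_importance(model, X, y, n_repeats=5, seed=0):
    """Mean increase in MAE when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    base = mae(model, X, y)
    scores = []
    for j in range(len(X[0])):
        deltas = []
        for _ in range(n_repeats):
            col = [row[j] for row in X]
            rng.shuffle(col)  # break the link between feature j and the target
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
            deltas.append(mae(model, X_perm, y) - base)
        scores.append(mean(deltas))
    return scores


# Toy stand-in for a trained surrogate: feature 0 dominates, feature 2 is unused.
def toy_model(row):
    return 2.0 * row[0] + 0.1 * row[1]


rng = random.Random(42)
X = [[rng.random(), rng.random(), rng.random()] for _ in range(300)]
y = [toy_model(row) for row in X]
importance = permutation_importance(toy_model, X, y)
print([round(v, 3) for v in importance])
```

A feature that the model ignores yields an importance of exactly zero in this toy setting, which mirrors the near-zero contribution reported for redundant descriptors.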
Three observations emerge. First, the model’s most informative inputs are the structural ones: the out-of-plane spacing
d, the in-plane lattice constant
a and the
M–
X and
X–
T bond distances dominate the ranking, with permutation-induced MAE increases of
,
,
and
eV respectively. This is consistent with the well-known dependence of the bandgap on local bonding geometry. Second, the elemental properties of the transition metal
M and the terminal group
T form a secondary tier, led by the van der Waals radius of
M (
eV), the electron affinity and electronegativity of
T (
and
eV), and the atomic radius of
M (
eV); these are the continuous descriptors that fine-tune the chemical fingerprint. Third, the periodic table indices for the non-metal
X contribute very little (permutation increase of ≲0.01 eV), which is expected because
X takes only two values (C or N) and the same information is already captured at finer resolution by the elemental properties of
X. Taken together, these observations support the descriptor set choice made in
Section 2.2: the geometric and elemental blocks carry the bulk of the predictive signal, the periodic table block contributes redundant but inexpensive context, and the DOS block is not required for the task.
4.6. Physics-Constraint Audit
The two architectural guarantees of Proposition 1 are structural properties of the parameterisation and therefore require no empirical check: for every test-set prediction, positivity violations and negative
values are exactly zero. What remains to be audited empirically is the hurdle condition, which is only a penalised term in the loss of Equation (
3) and is not guaranteed by the architecture because
and
can be strictly positive even for metallic inputs. On the test split, the mean absolute predicted bandgap on metal structures is approximately
eV, compared with the dataset-wide semiconductor mean of
eV. This is about twenty times smaller than the regression target scale, which we interpret as evidence that the hurdle penalty is sufficiently strong to collapse metal predictions towards zero in practice, without being so strong as to distort the semiconductor regression.
4.7. Conformal Prediction Intervals
Split-conformal calibration is carried out as described in
Section 2.5, with the training subset halved evenly into a model-fitting portion and a calibration portion.
Table 5 reports the calibrated quantile
, the mean interval width and the empirical coverage at the three nominal levels 0.80, 0.90 and 0.95.
Figure 11 visualises both panels. At the 0.90 nominal level, the empirical coverage is 0.905, within 0.5 percentage points of the target, at an interval width of approximately 1.05 eV. The 0.80 level is slightly over-covered at 0.847, and the 0.95 level slightly under-covered at 0.926; both patterns are consistent with a finite calibration set.
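A minimal sketch of split-conformal calibration with constant-width intervals is given below. It assumes the standard absolute-residual score and the usual finite-sample quantile index; the exact form of the equations in Section 2.5 is not reproduced here, and the synthetic residuals, the noise scale and the function names are illustrative assumptions. The lower bound is clipped at zero to respect the non-negativity of bandgaps.

```python
import math
import random


def conformal_quantile(cal_residuals, alpha):
    """ceil((n + 1)(1 - alpha))-th smallest absolute calibration residual."""
    scores = sorted(abs(r) for r in cal_residuals)
    n = len(scores)
    k = math.ceil((n + 1) * (1.0 - alpha))
    # If the calibration set is too small for this level, the interval is unbounded.
    return scores[k - 1] if k <= n else float("inf")


def clipped_interval(pred, q):
    """Constant half-width interval with the lower bound clipped at zero."""
    return max(pred - q, 0.0), pred + q


rng = random.Random(7)


def noise():
    return rng.gauss(0.0, 0.3)  # synthetic residual, in eV


q90 = conformal_quantile([noise() for _ in range(500)], alpha=0.10)

hits, n_test = 0, 2000
for _ in range(n_test):
    pred = rng.uniform(1.0, 4.0)
    truth = max(pred + noise(), 0.0)  # reference gaps cannot be negative
    lo, hi = clipped_interval(pred, q90)
    hits += lo <= truth <= hi
coverage = hits / n_test
print(f"half-width: {q90:.3f} eV  empirical coverage: {coverage:.3f}")
```

Because the calibrated half-width is the same for every point, any apparent asymmetry of the intervals near zero comes from the clipping step alone.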
4.8. Uncertainty-Aware Pareto Screening on the Full Library
Turning the trained surrogate into a design tool, we apply the pipeline of
Section 2.6 to the complete 4356-structure training library as a re-scoring exercise: every entry becomes a candidate that the surrogate places on the plane of the two screening objectives. We emphasise that this step is a screening application of a model that has already seen a large fraction of these entries during training, so the resulting candidate counts are not a test of generalisation; an honest out-of-sample check of the screening output is offered by the held-out lanthanum subset of
Section 4.9.
Figure 12 presents the joint distribution of the two objectives, highlights the Pareto front and annotates the five top-ranked candidates according to the augmented Chebyshev scalarisation of Equation (
7). Of the 4356 structures, 198 are predicted to lie strictly inside the photocatalytic window
eV; three of those also lie on the Pareto front. The top Chebyshev-ranked candidate is an Sc₄N₃O₂ composition with a predicted gap of
eV and classifier confidence
. Because this structure sits inside the training library, its reference PBE0 value is also available:
eV, so the PC-NODE prediction deviates by
eV, consistent with the magnitude of typical in-sample residuals produced by the ensemble. The next four candidates are structurally closely related (Y₄N₃O₂, Zr₂C₁(NH)₂ with two geometries, Zr₂C₁O₂), reflecting that both chemistry and stacking geometry modulate the gap inside this family.
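The two-objective ranking can be illustrated with a short sketch. Several ingredients are assumptions rather than the definitions of Section 2.6: the objectives are taken as the distance of the predicted gap to the window and one minus the classifier confidence (both minimised, with the ideal point at the origin), the scalarisation is the textbook augmented Chebyshev form, and the window bounds, weights, `rho` and candidate values are placeholders.

```python
def window_distance(gap, lo, hi):
    """Distance (eV) from a gap to the interval [lo, hi]; zero inside the window."""
    return max(lo - gap, 0.0, gap - hi)


def pareto_front(points):
    """Indices of points not dominated by any other point (minimisation)."""
    front = []
    for i, a in enumerate(points):
        dominated = any(
            all(bk <= ak for ak, bk in zip(a, b))
            and any(bk < ak for ak, bk in zip(a, b))
            for j, b in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append(i)
    return front


def augmented_chebyshev(f, weights, rho=0.05):
    """max_i w_i f_i + rho * sum_i w_i f_i, with the ideal point at the origin."""
    terms = [w * v for w, v in zip(weights, f)]
    return max(terms) + rho * sum(terms)


# Placeholder window bounds (eV) and toy candidates: (label, gap, confidence).
LO, HI = 1.23, 3.0
candidates = [("A", 2.0, 0.90), ("B", 3.2, 0.99), ("C", 0.8, 0.95), ("D", 3.6, 0.60)]

objectives = [(window_distance(g, LO, HI), 1.0 - p) for _, g, p in candidates]
front = pareto_front(objectives)
ranked = sorted(
    candidates,
    key=lambda c: augmented_chebyshev(
        (window_distance(c[1], LO, HI), 1.0 - c[2]), weights=(1.0, 1.0)
    ),
)
print("Pareto front:", [candidates[i][0] for i in front])
print("Ranking:", [c[0] for c in ranked])
```

The small `rho` term breaks ties between candidates that share the same worst objective, which is what gives the scalarisation a deterministic order.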
4.9. Lanthanum-Based Out-of-Distribution Screening
The lanthanum-based subset of 396 structures was not used during training, validation or calibration and offers an informative out-of-distribution (OOD) test; lanthanum sits in Group III alongside Sc and Y (already seen in training) but its 4f–5d electronic character is absent from the training pool. We run the trained PC-NODE surrogate on this subset with Monte Carlo dropout enabled, so that a per-sample standard deviation is produced alongside the point prediction. Of the 396 structures, 132 are classified as semiconductors with and 74 of those fall inside the photocatalytic window.
Two cautions accompany these numbers. First, MC dropout captures only epistemic, model-level uncertainty; its tendency to underestimate predictive uncertainty in OOD regimes [
34,
35] means that the reported mean standard deviation of
eV should be read as an internal-consistency diagnostic rather than as a calibrated error bar. Second, the split-conformal guarantee of
Section 4.7 rests on exchangeability between calibration and test residuals and therefore does not automatically transfer to the lanthanum subset; a formally correct treatment of this case calls for weighted or nonexchangeable conformal variants [
33], which we flag as a natural extension of the present work.
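The mechanics of Monte Carlo dropout can be shown on a toy network. The sketch below is not PC-NODE: it uses a hand-written two-layer ReLU network with fixed, hand-picked weights, and the dropout rate, the number of stochastic passes and the function names are illustrative assumptions. Dropout stays active at inference, and the spread over repeated passes is reported as the per-sample standard deviation.

```python
import random
from statistics import mean, pstdev


def forward(x, w1, w2, p_drop, rng):
    """One stochastic forward pass with inverted dropout on the hidden layer."""
    hidden = [max(0.0, sum(wi * xi for wi, xi in zip(row, x))) for row in w1]
    keep = 1.0 - p_drop
    # Kept units are rescaled by 1/keep; dropped units are set to zero.
    hidden = [h / keep if rng.random() < keep else 0.0 for h in hidden]
    return sum(w * h for w, h in zip(w2, hidden))


def mc_dropout_predict(x, w1, w2, p_drop=0.1, n_passes=100, seed=0):
    """Mean and standard deviation over repeated stochastic forward passes."""
    rng = random.Random(seed)
    draws = [forward(x, w1, w2, p_drop, rng) for _ in range(n_passes)]
    return mean(draws), pstdev(draws)


# Hand-picked toy weights: 8 hidden units, 3 inputs, 1 output.
w1 = [[0.1 * (i + 1), 0.05, 0.0] for i in range(8)]
w2 = [0.5] * 8
x = [1.0, 1.0, 0.0]

mu, sigma = mc_dropout_predict(x, w1, w2, p_drop=0.1)
mu0, sigma0 = mc_dropout_predict(x, w1, w2, p_drop=0.0)
print(f"with dropout: {mu:.3f} +/- {sigma:.3f}; without: {mu0:.3f} +/- {sigma0:.3f}")
```

The spread produced this way reflects only the variability induced by the dropout masks, which is one reason it should not be read as a calibrated error bar.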
Table 6 lists the five highest-ranked candidates by classifier confidence;
Figure 13 visualises the full subset on the
plane. The top candidates combine a tellurium, chlorine or selenium termination with a carbide or nitride backbone, echoing the Sc/Y-rich behaviour observed in the training library. We note that the classifier probabilities saturate at
for all five candidates; this is a consequence of sigmoid saturation rather than an independent signal of confidence, and it is another reason why predictive uncertainty on the OOD subset should be interpreted with care. The La₄N₃(NH)₂ entry in particular combines a zero PBE gap with a predicted
eV, so the required Δ correction sits in the upper tail of the training distribution and the associated MC dropout spread (
eV) is roughly twice that of the other entries; we mark this case explicitly in
Table 6 and flag its predicted gap as requiring independent validation.
5. Discussion
Figure 14 visualises the ablation contrast introduced numerically in
Section 4.4 and
Table 4: the unconstrained Neural ODE variant A3 places mass in the physically forbidden
region, while the full PC-NODE variant A4 does not, in agreement with Proposition 1.
Our headline result, a test MAE of 0.186 eV with a compact 34-dimensional descriptor set, sits close to the MXgap reference of 0.17 eV [11] and behind the 0.14 eV value reported by Tang et al. [10], a gap of at most 0.046
eV. The broader MXene–ML literature is itself moving rapidly: 2026 contributions extend ML-driven design to other functional regimes such as ML-guided microwave absorption optimisation of magnetic MXene composites [
37], indicating that bandgap prediction is one node within a larger and active design ecosystem. The more informative comparison, however, concerns not only the headline number but also what each pipeline gives up or preserves. Ontiveros et al. retain 33 elemental and 103 DOS-derived features (the latter including the PBE-level bandgap, band edges and averaged density of states), for a total of 136 inputs; Tang et al. train a deeper neural network inside a high-throughput screening pipeline. In contrast, PC-NODE uses only the 34 compact descriptors as model inputs and incorporates the PBE bandgap as the Δ-learning baseline of Equation (2) rather than as an additional feature; this also avoids any post hoc monotonicity or positivity correction, because both properties are guaranteed by the architectural parameterisation, as stated in Proposition 1. A point on which our setting differs from the Zhang benchmark [12], where the classifier label is reported as the most important regression feature, is that our hurdle construction couples the two heads through the softplus parameterisation of Equation (
2) rather than by feeding the label back as an input, which avoids the circularity of using the target as a covariate.
To position our framework more explicitly within the existing landscape of bandgap prediction methods,
Table 7 compares PC-NODE with representative studies that span the major model families. The most directly comparable benchmark is Ontiveros et al. [
11] on the same MXgap dataset and the same PBE0 target: their classifier–regressor pipeline (RF/GBT/MLP, with 33 elemental and 103 DOS-derived features—136 inputs in total—and a
split) attains a test MAE of 0.17 eV with 92% classification accuracy. Our PC-NODE ensemble reaches 0.186 eV with only 34 compact features as model inputs (no DOS or PBE-derived bandgap features) on a stratified 70/15/15 split, while additionally providing the architectural guarantees of Proposition 1. Tang et al. [10] use a deep CNN with feature fusion on a 23,857-structure aNANt+C2DB MXene set and report a bandgap MAE of 0.14 eV against HSE06 targets, so the discrepancy in dataset and functional precludes a strict shared-set comparison. The Zhang benchmark [
12] reports MAE
eV and
with a Gradient-Boosted Decision Trees (GBDT) model whose feature-importance analysis attributes
of the total importance to the binary
flag itself; this confirms our earlier observation that feeding the classification label back as a regression input creates a target leakage path that PC-NODE deliberately avoids through the softplus hurdle parameterisation of Equation (
2). Kernel methods are represented by Rajan et al. [
8], whose Gaussian Process Regression on a curated subset of 70 semiconducting MXenes (drawn from a 23,870-structure functionalised MXene library) delivers a test RMSE of
eV against GW targets; the small semiconductor sample size and the GW-level rather than PBE0-level target prevent a strict shared-set comparison. Graph neural networks (GNNs) are represented by the Crystal Graph Convolutional Neural Network of Xie and Grossman [
9], which reports a bandgap MAE of
eV on 16,458 Materials Project crystals across the periodic table; to our knowledge, no GNN benchmark dedicated to the open MXgap resource has been published, so an MXgap-specific GNN study is a natural follow-up. Finally, with respect to deep ensembles in the sense of Lakshminarayanan et al. [
22], we note that the headline figures of this paper are themselves obtained from a ten-seed PyTorch ensemble; the per-seed standard deviation of
eV reported in
Table 2 indicates that the ensemble is internally consistent.
The conformal results of
Section 4.7 deserve a closer look. At the 0.90 nominal level, the empirical coverage is 0.905, a deviation of 0.5 percentage points that is smaller than the typical statistical fluctuation on a test set of a few hundred semiconductors. The 0.95 level slips to 0.926, which is compatible with finite-sample fluctuation: the split-conformal guarantee is marginal, holding on average over draws of the calibration and test sets, so the coverage realised on a single finite split scatters around the nominal level. The mean interval width at 0.90 is roughly 1.05
eV, which is wide relative to the natural range of gaps in the dataset (median
eV among semiconductors) but is a faithful picture of the residuals the ensemble actually produces on unseen data. Importantly, these intervals are distribution-free: they do not rely on assumptions about the residual distribution, and they hold for any base estimator [
18]. The complementary MC dropout spread reports the model-level, epistemic component of the uncertainty and is therefore expected to be narrower than the conformal interval, which captures aleatoric variation as well; as noted in
Section 4.9, MC dropout is also known to underestimate predictive uncertainty on out-of-distribution inputs [
34,
35], so it should be read as an internal diagnostic rather than as a calibrated error bar.
The uncertainty-aware Pareto screening identifies Sc₄N₃O₂ as the top candidate in the training library and then returns 74 lanthanum-based MXenes inside the photocatalytic window, led by La₂CTe₂, La₂CCl₂ and La₄N₃(NH)₂. Our analysis restricts itself to the window criterion and the classifier confidence and does not attempt to validate these candidates with additional first-principles calculations. We also emphasise that the photocatalytic window is a necessary but not sufficient criterion: practical photocatalysis additionally requires appropriate band-edge alignment relative to the H⁺/H₂ and O₂/H₂O redox levels, a supportive excitonic structure, and chemical stability of the terminal-group chemistry. The PC-NODE prediction on La₄N₃(NH)₂, in particular, corresponds to a Δ correction of
eV starting from a zero PBE estimate, which sits in the upper tail of the training Δ distribution and is accompanied by the largest MC dropout spread among the five top candidates in Table 6; we therefore flag this prediction as requiring independent validation.
Several limitations are worth making explicit. First, our training pool contains only eleven transition metals; the lanthanum-based screening provides an informative out-of-distribution test but is not fully adversarial, because lanthanum sits in Group III alongside Sc and Y, which are already in the training pool. A genuine adversarial test would require hybrid-functional calculations on compositions outside this family. Second, the compact descriptor set excludes the DOS by design; while we observed that adding DOS features or their PCA-compressed form did not improve the PC-NODE ensemble on this dataset, we cannot rule out that a larger training pool or a different regularisation strategy would benefit from DOS information. Third, the conformal guarantees rely on the exchangeability of calibration and test residuals, which holds under i.i.d. sampling but is not automatic for the lanthanum subset; weighted or nonexchangeable conformal variants [
33] are a natural remedy that we leave to future work. Fourth, only two of the three physical properties we consider are enforced by architecture; the hurdle condition is penalised and therefore approximate, and we interpret the empirical metal prediction magnitude of
eV as acceptable rather than guaranteed. Fifth, the screening counts reported in
Section 4.8 are obtained on the full 4356-structure pool, which overlaps with the training data, and should be read as a library re-scoring exercise; the lanthanum results of
Section 4.9 provide the honest out-of-sample view.
To consolidate the discussion above into a single decision-oriented synthesis, we summarise the framework in the SWOT analysis of
Table 8. The strengths group the architectural and statistical properties of PC-NODE that are verified within the present study; the weaknesses gather the items that the limitations paragraph has just enumerated; the opportunities translate the future work directions of the conclusions into actionable extensions; and the threats record the external factors that could constrain the practical impact of the framework.
6. Conclusions
We have introduced PC-NODE, a physics-constrained neural ODE framework for the prediction and screening of MXene bandgaps on the MXgap dataset. The architecture couples a classifier head for the metal/semiconductor decision with a softplus-parameterised Δ-learning regression head for the gap magnitude. Two physically motivated properties are obtained exactly by the parameterisation, as stated in Proposition 1: non-negativity of the predicted bandgap and monotonicity between the low-fidelity PBE estimate and the high-fidelity PBE0 prediction. A third property, the hurdle condition that metal predictions collapse towards zero, is enforced through a penalty term and verified empirically. This two-by-construction, one-approximately-enforced split is the formulation used consistently throughout the abstract, the methodology and the present conclusion. On a compact 34-dimensional descriptor set that excludes the DOS, a ten-seed ensemble reaches a test MAE of 0.186 eV and a coefficient of determination of , placing the model close to the reference MXgap benchmark with a substantially smaller feature footprint. Split-conformal calibration delivers prediction intervals whose empirical coverage matches the 0.90 nominal level within 0.5 percentage points, and the uncertainty-aware Pareto screening returns 74 lanthanum-based candidates inside the photocatalytic window eV.
The mathematical content of the framework rests on three elements that are simple to state and verifiable in closed form. First, the architectural guarantees of Proposition 1 follow directly from the strict positivity of the softplus function and the clamping of the low-fidelity input, which together place every prediction in the admissible region
of
Figure 7. Second, the split-conformal quantile of Equation (
4) furnishes finite-sample, distribution-free coverage at the nominal
level, independent of the underlying neural surrogate. Third, the augmented Chebyshev scalarisation of Equation (
7) ranks candidates by an explicit trade-off between the window distance and the classifier confidence, yielding a deterministic order on the candidate set. The empirical study of
Section 4 supports each of these elements within the scope of the MXgap data.
Three directions appear natural for future work. The first is a more ambitious out-of-distribution programme, in which an independent hybrid functional dataset outside the present transition-metal family would be generated to stress-test the coverage of both the point predictions and the conformal intervals, together with the adoption of weighted or nonexchangeable conformal variants [
33] that relax the exchangeability assumption. The second is the integration of symmetry-aware descriptors or graph-based encoders into the PC-NODE backbone, with the goal of making better use of the full-factorial design of the MXgap library. The third is experimental validation of the top-ranked lanthanum-based candidates through targeted synthesis and spectroscopic characterisation. Taken together, these directions would turn the present surrogate into a closed design loop for photocatalytic MXene discovery, with mathematical guarantees propagated end-to-end from the parameterisation to the ranked candidate list.