1. Introduction
Accurate prediction of molecular potential energies and atomic forces from 3D atomic coordinates is fundamental to computational chemistry, enabling molecular dynamics, conformational sampling, and modeling of chemical and material processes [1, 2, 3]. Density functional theory (DFT) is widely used to generate reference energies and forces for small- and medium-sized molecular systems, but its rapidly increasing cost with system size limits routine use in long trajectories or large molecular ensembles. Machine learning (ML) interatomic potentials offer an attractive alternative. Once trained on reliable reference data, such models can approximate DFT-level energy and force evaluations at much lower inference cost [4, 5, 6, 7, 8, 9, 10]. Nevertheless, the cost of generating reference data and training stable force models makes sample-efficient learning a central requirement for practical molecular force-field development.
Among ML interatomic potentials, equivariant graph neural networks have become a leading approach for molecular energy and force prediction [11, 12, 13, 14, 15, 16, 17]. Their appeal comes from encoding the symmetries of molecular physics directly in the model: predicted energies should be invariant to translations and rotations of the coordinate frame, whereas predicted forces should transform equivariantly with the atoms. Beyond pairwise distances, however, accurate molecular force fields also depend on angular and higher-body geometry. Distance-based representations can fail to distinguish configurations that differ by angular rearrangement [18]. Therefore, many successful molecular models incorporate three-body or higher-order information through radial-angular basis functions, tensor features, spherical harmonics, equivariant attention, or vector–scalar interaction blocks [19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]. This leads to a central design question for practical force-field learning: how can a model expose useful angular many-body information while keeping training stable and computationally affordable?
These developments expose a practical tension in equivariant force-field design. Scalar-invariant models are usually efficient and stable to train, but their access to molecular geometry depends on the invariant radial, angular, or higher-body quantities supplied to the network; related expressivity limits have been analyzed for both graph and invariant geometric models [30, 31]. Tensor-based models provide a systematic route to richer equivariant representations by propagating spherical or tensor features [32], but their cost grows steeply with angular order and channel width, which constrains depth, model size, and routine hyperparameter exploration under fixed compute budgets [33]. In practice, many high-performing models rely on low-order truncations or carefully engineered geometric channels rather than on unconstrained higher-order tensor representations. This motivates a more targeted question for small-molecule force-field learning: can useful three-body geometric information be exposed to an equivariant backbone through lightweight scalar mechanisms, without relying on Clebsch–Gordan tensor products?
Recent work has sought to reduce or avoid explicit Clebsch–Gordan (CG) tensor products while retaining strong equivariant-model accuracy [34]. EST [35] uses attention in a spherical spatial domain, Graph ACE [36] provides a cluster-expansion view of equivariant message passing, Geodite [37] removes CG tensor products from a GotenNet-style backbone using high-degree inner products and physical priors, and MARA [38] introduces continuous SE(3)-equivariant spherical attention as an efficient approximation to equivariant interactions without high-order tensor products. From a different angle, HEGNN [33] shows that increasing the steerable-feature degree (within a no-CG scalarization scheme) can recover angular information without expensive tensor coupling. This finding is complementary to our own: if intrinsic high-degree capacity is already adequate, then supervision becomes the dominant limiting factor. Beyond no-CG molecular force-field designs, the broader literature offers several adjacent points of comparison. Triplet Graph Transformer (TGT) [39] introduces triplet attention into a 2D molecular graph transformer with an auxiliary interatomic distance prediction stage, demonstrating that explicit three-body channels improve property prediction on graph-level benchmarks (PCQM4Mv2, QM9). The Scalarization-compatible Triplet Cross-Attention (SCTA) branch used in this work can be viewed as the equivariant 3D-scalar instance of the same intuition, except that it is evaluated for its incremental value over an already-competent equivariant backbone rather than as a standalone architecture. Several recent equivariant designs avoid or restructure the CG transform: FreeCG [40] performs CG on permutation-invariant abstract edges to widen the CG design space, E2Former [41] uses a linear-scaling tensor-product attention via Wigner-6j convolution, GeoMFormer [42] couples invariant and equivariant transformer streams via cross-attention, HotPP [43] performs E(n)-equivariant Cartesian-tensor message passing for high-order outputs (dipole, polarizability), and TACE [44] provides a unified irreducible Cartesian-tensor framework. Similarly, CEITNet [45] performs many-body coupling in Cartesian channel space and targets high-order crystal tensors. These designs are largely complementary to ours; they expand or restructure architectural capacity for many-body and tensor information, whereas we ask whether adding three-body supervision is a more effective lever at fixed backbone capacity (GotenNet) than adding further three-body capacity. We do not benchmark against these models directly because their targets and design budgets differ, but we do situate them in the design space when interpreting our findings. Existing studies in this area mainly target architectural expressiveness or inference efficiency. We ask a narrower, experimentally controlled question for small-molecule force-field training: when added to a strong no-CG backbone, do lightweight scalar geometric priors change convergence or final accuracy, and where do their benefits fail to transfer?
We use GotenNet [46] as this backbone because it achieves competitive revised MD17 (rMD17) [47] accuracy without explicit CG tensor products while implicitly encoding angular information through hierarchical tensor refinement and equivariant feed-forward updates. Beyond comparing add-on mechanisms, we ask whether the backbone’s treatment of three-body geometry is limited by representational capacity or by training supervision. We separate the two factors with three controlled probes. Two of them add representational capacity: the frame projection probe projects tensor features onto a triplet-local frame, while SCTA provides a zero-initialized residual branch that attends over neighbor pairs using scalar features and a cosine-angle basis. The third probe adds supervision rather than capacity, in the form of a graph-level auxiliary loss on bond-angle and dihedral statistics. The main experiments form a controlled single-seed ablation on rMD17 aspirin; limited ethanol, uracil, and salicylic acid probes are reported separately to delimit the aspirin findings rather than to generalize them.
Our main contributions are as follows:
Capacity vs. supervision. We frame the problem of adding three-body geometry to a no-Clebsch–Gordan backbone as a choice between representational capacity and training supervision, and separate the two using three controlled probes on rMD17 aspirin. The main finding is that added three-body capacity does not move converged accuracy, whereas low-cost three-body supervision does.
Frame projection vs. SCTA. We establish two structural properties of a frame-projection scalar branch: the diagonal frame-projected pairwise feature collapses exactly to a frame-independent tensor inner product, and the only genuinely frame-dependent channel is a parity-odd pseudoscalar. At pilot scale the two branches achieve comparable validation mean absolute error (MAE), so we present the analysis as a design rationale: SCTA’s cos-angle input is a true -invariant three-body scalar that requires no frame construction.
SCTA as a capacity control. A correctly designed scalar triplet branch with complexity and a cosine-angle basis matches GotenNet’s converged force MAE within ∼0.4% on aspirin, but yields no robust independent gain. We interpret this neutral result as evidence that the GotenNet backbone’s implicit angular pathway already supplies the relevant three-body capacity, so adding more representational capacity is not the binding constraint.
Auxiliary supervision as the effective lever. A graph-level auxiliary loss on bond-angle and dihedral statistics, which adds three-body supervision rather than capacity, gives the best force MAE in our ablation ( kcal/mol/Å) while preserving energy accuracy and reducing the epochs to selected validation targets by 26–55%. Limited ethanol, uracil, and salicylic acid probes serve only to delimit the scope of this finding at single-seed precision, and are not put forward as a molecule-independent claim.
2. Results and Discussion
2.1. Main Results on rMD17 Aspirin
We first compare the reproduced vanilla GotenNet baseline, GotenNet with SCTA, and their auxiliary-supervised variants on the aspirin split of rMD17. All in-house runs use the paper-aligned configuration (
hidden channels, 16 interaction layers,
steerable features), 3000 training epochs, AdamW, and a fixed random seed.
Table 1 reports test set MAE on the held-out 1000-configuration test split, using kcal/mol for energy and kcal/mol/Å for force.
The force column provides the clearest outcome. Auxiliary-supervised variants form a narrow 0.1280–0.1292 kcal/mol/Å band, improving over the corresponding no-auxiliary runs at 0.1298–0.1303 kcal/mol/Å. The lowest force MAE in our ablation is obtained by GotenNet + aux (physics) at kcal/mol/Å, compared with kcal/mol/Å for reproduced vanilla GotenNet. By contrast, adding the SCTA capacity branch changes converged force MAE by only about at the fixed auxiliary setting. Therefore, the auxiliary geometric supervision is associated with a small but directionally favorable shift in the converged force number, whereas the added three-body representational capacity of SCTA is not. At single-seed precision, the kcal/mol/Å gap (1.8%) should not on its own be treated as a statistically established effect; the basis for our conclusion is the consistent pattern across the six controlled configurations rather than any single pairwise difference.
The energy column highlights a separate tradeoff in which GotenNet + aux (physics) preserves the low energy MAE of the vanilla GotenNet baseline (
vs.
kcal/mol), whereas SCTA + aux (hybrid) increases energy MAE to
kcal/mol. We attribute this to an interaction between the SCTA residual branch and the hybrid auxiliary target rather than to a failure of auxiliary supervision in general. Because all in-house aspirin comparisons are single-seed results on one molecule, the narrow ranking within the 0.1280–0.1292 kcal/mol/Å force band should not be over-interpreted; the more robust conclusions are the component-level pattern analyzed below and the transfer behavior reported in
Section 2.5.
Analytic Test Set Confidence Intervals
To place a finite-sample uncertainty around the point estimates in
Table 1, we compute analytic Wald-type 95% confidence intervals on each test MAE, under the assumption that residuals are approximately independent across the 1000 test configurations for energy (n = 1000) and across all force components (n = 63,000 for 21 atoms × 3 components × 1000 configurations):
. This is a test set CI, not a seed-to-seed CI; it captures finite-sample variation under a fixed model and an independence assumption that may understate the true error if within-trajectory residuals are correlated. A block bootstrap over trajectory segments more faithfully reflects the temporal structure of the rMD17 test split. We recover the per-configuration residuals from the released best-validation checkpoints and report such a configuration-level paired block bootstrap in the analyses below. With the above caveat, the resulting analytic intervals are (energy in kcal/mol, force in kcal/mol/Å):
GotenNet (reproduced): ,
SCTA: ,
GotenNet + aux (hybrid): ,
GotenNet + aux (physics): ,
SCTA + aux (hybrid): ,
SCTA + aux (physics): ,
On force, the
kcal/mol/Å gap between GotenNet (
) and GotenNet + aux (physics) (
) corresponds to a two-sample
z-statistic of ≈2.5 (using
under independence), and the two 95% intervals overlap only at their boundary. However, this analytic interval treats the 63,000 force components as independent, and as such is optimistic. A configuration-level paired block bootstrap that resamples the ∼1000 correlated test configurations yields intervals roughly three times wider, under which the single-seed force gap is no longer distinguishable from test set sampling noise. The seed-level robustness of the effect is examined in
Section 2.2. On energy, the intervals are wider (
kcal/mol for vanilla GotenNet) and the GotenNet vs. GotenNet + aux (physics) intervals overlap substantially, consistent with the paper-level claim that GotenNet + aux (physics) preserves energy accuracy rather than improving it. The same caveat as in the previous paragraph applies: this is finite-test-sample uncertainty, not seed-to-seed uncertainty, and the GotenNet reference values cited in Table 1 are five-split averages reported in [46] without per-split variance, so a direct seed-level comparison is not available.
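The analytic interval and the two-sample z-statistic discussed above can be sketched as follows. This is a minimal illustration that assumes the textbook Wald form (MAE ± 1.96 × sample standard deviation of the absolute residuals / √n) and an independent-samples standard error for the difference; the function names and the toy residuals are hypothetical and are not taken from the released code.

```python
import math
from statistics import mean, stdev


def wald_ci_mae(abs_residuals, z=1.96):
    """Wald-type interval on a mean absolute error.

    abs_residuals: per-sample absolute errors (assumed independent).
    Returns (mae, lower, upper, standard_error).
    """
    n = len(abs_residuals)
    mae = mean(abs_residuals)
    se = stdev(abs_residuals) / math.sqrt(n)  # standard error of the mean
    return mae, mae - z * se, mae + z * se, se


def two_sample_z(mae_a, se_a, mae_b, se_b):
    """z-statistic for the difference of two MAEs under independence."""
    return (mae_a - mae_b) / math.sqrt(se_a ** 2 + se_b ** 2)


# Toy absolute residuals (hypothetical values, for illustration only).
toy = [0.10, 0.14, 0.12, 0.16, 0.13, 0.11, 0.15, 0.12]
mae, lower, upper, se = wald_ci_mae(toy)
print(f"MAE = {mae:.4f}, 95% CI = [{lower:.4f}, {upper:.4f}]")
```

Because the standard error shrinks as 1/√n, treating 63,000 force components as independent gives a much narrower interval than resampling 1000 correlated configurations, which is the reason the analytic force interval is described as optimistic.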
2.2. Multi-Seed Validation on Aspirin
The aspirin comparison above is single-seed. To test whether the auxiliary force-MAE improvement is robust to initialization, we retrained vanilla GotenNet and GotenNet + aux (physics) under three random seeds on the same paper-aligned split (splits_0) and training protocol, then evaluated each model at its best-validation checkpoint on the held-out test set.
Table 2 reports the test MAE as mean ± standard deviation over the three seeds.
The auxiliary loss lowers the mean force MAE (
kcal/mol/Å); more robustly, it reduces the seed-to-seed standard deviation roughly threefold (
). A configuration-level paired block bootstrap of the force residuals finds the per-seed difference significant on seed 2 (
kcal/mol/Å,
) but not on seed 3 (
,
), where the vanilla baseline itself converges to its lowest force MAE; therefore, the size of the effect is comparable to the baseline’s own seed-to-seed variability. Accordingly, we read the auxiliary supervision as a modest variance-reducing lever, not as a substantial gain in converged accuracy, and do not over-interpret the single-seed force band of
Section 2.1. Energy accuracy is preserved, with both models reaching a test energy MAE of about
kcal/mol across seeds.
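A configuration-level paired block bootstrap of the kind referred to above can be sketched as follows. The sketch assumes a moving-block scheme over per-configuration paired differences (auxiliary minus vanilla force MAE per configuration) with a percentile interval; the block length, number of resamples, and function name are hypothetical choices, since the exact settings are not stated in this section.

```python
import random


def paired_block_bootstrap(diffs, block_len=20, n_boot=2000, seed=0):
    """Percentile 95% interval for the mean paired difference.

    diffs: per-configuration differences in trajectory order.
    Contiguous blocks are resampled with replacement so that
    short-range correlation between neighboring frames is kept.
    Returns (point_estimate, lower, upper).
    """
    n = len(diffs)
    if n < block_len:
        raise ValueError("need at least one full block")
    rng = random.Random(seed)
    n_blocks = -(-n // block_len)           # ceil(n / block_len)
    starts = range(0, n - block_len + 1)    # admissible block starts
    boot_means = []
    for _ in range(n_boot):
        sample = []
        for _ in range(n_blocks):
            s = rng.choice(starts)
            sample.extend(diffs[s:s + block_len])
        sample = sample[:n]                 # trim to original length
        boot_means.append(sum(sample) / n)
    boot_means.sort()
    lower = boot_means[int(0.025 * n_boot)]
    upper = boot_means[int(0.975 * n_boot) - 1]
    return sum(diffs) / n, lower, upper


# Synthetic differences (hypothetical), only to show the call pattern.
gen = random.Random(1)
toy_diffs = [gen.gauss(-0.002, 0.02) for _ in range(1000)]
print(paired_block_bootstrap(toy_diffs))
```

A difference would be read as distinguishable from test set sampling noise when the resulting interval excludes zero; because pairing is done per configuration, the interval can exclude zero even when the two marginal MAE intervals overlap.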
2.3. Sample Efficiency: Auxiliary Loss and SCTA Components
Because the converged test errors above differ only slightly in force MAE, we next examine whether the scalar geometric mechanisms change how quickly a usable model is obtained. We logged validation MAE every epoch and report the earliest epoch at which each model reaches a specified validation threshold (“epoch-to-threshold”) on the 50-configuration rMD17 aspirin validation split. This analysis measures training dynamics rather than final test ranking; per-epoch test evaluation is avoided at the paper-aligned scale. Results are summarized in
Table 3 and visualized in
Figure 1 and
Figure 2.
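The epoch-to-threshold metric can be computed from a logged validation curve in a few lines. This is a minimal sketch that assumes epochs are counted from 1 and that the threshold is met when the validation MAE is less than or equal to it; the function names are hypothetical.

```python
def epoch_to_threshold(val_mae_by_epoch, threshold):
    """Earliest 1-indexed epoch with validation MAE <= threshold.

    Returns None if the threshold is never reached.
    """
    for epoch, mae in enumerate(val_mae_by_epoch, start=1):
        if mae <= threshold:
            return epoch
    return None


def relative_reduction(baseline_epochs, model_epochs):
    """Fraction of epochs saved relative to the baseline."""
    return 1.0 - model_epochs / baseline_epochs


# Epoch counts quoted in this section for SCTA + aux vs. vanilla GotenNet.
print(f"val_F: {relative_reduction(284, 211):.1%} fewer epochs")
print(f"val_E: {relative_reduction(33, 15):.1%} fewer epochs")
```

With the epoch counts quoted in this section (211 vs. 284 and 15 vs. 33), the two reductions round to 26% and 55%, which matches the range stated in the contributions.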
Relative to vanilla GotenNet, SCTA alone provides a modest and threshold-dependent force acceleration, reaching val_F about 15–16% earlier but showing no consistent benefit at the earliest or tightest force thresholds. SCTA + aux provides the larger headline reductions, reaching val_F in 211 epochs instead of 284 and val_E in 15 epochs instead of 33. These numbers show that scalar geometric priors can shorten the path to usable validation accuracy, but do not by themselves identify which component is responsible.
To separate the SCTA branch from auxiliary supervision,
Table 4 compares SCTA + aux (hybrid) with the strongest auxiliary-only baseline, GotenNet + aux (physics). This is the stricter comparison, as
Section 2.1 showed that GotenNet + aux (physics) is already the best force-MAE configuration in our ablation.
Against this stronger baseline, SCTA + aux does not show a robust force-speed advantage, with the ratios moving above and below
depending on the threshold. The clearest speed benefit appears only at loose energy thresholds, where SCTA + aux reaches val_E
and ≤0.2 kcal/mol 1.5–1.8× faster than GotenNet + aux (physics), before the advantage disappears at the tighter val_E
threshold. Thus, the bulk of the sample efficiency gain relative to vanilla GotenNet should be attributed to the auxiliary geometric loss. At single-seed precision, SCTA contributes a comparable scalar triplet pathway with some early-energy acceleration and a small regression at the tightest energy threshold (≤0.10 kcal/mol), rather than a consistent advantage. Sensitivity studies on angular-basis size, auxiliary-loss weight, and annealing schedule are reported in
Appendix B.
2.4. Test Set Ablation: Base Configuration, Auxiliary Loss, and Target Type
The threshold analysis above measures training speed, but does not fully separate three design choices: the base configuration (vanilla GotenNet vs. GotenNet with SCTA), the presence of auxiliary supervision, and the auxiliary target type. Therefore, we train six configurations under the same paper-aligned protocol and evaluate each at its best-validation checkpoint. In addition to test energy and force MAE,
Table 5 reports the validation-to-test inflation of energy MAE,
, where
is the test energy MAE at the best-validation checkpoint and
is the best (minimum) validation energy MAE reached during training. This helps to identify target choices that appear favorable on the validation split but transfer poorly to the held-out test set.
The force column reproduces the pattern observed in
Section 2.1. Auxiliary supervision improves force MAE for both base configurations, and the lowest value is obtained by GotenNet + aux (physics) at
kcal/mol/Å. Adding the SCTA capacity branch at a fixed auxiliary target changes force MAE by less than
and with inconsistent sign: SCTA is slightly better under the hybrid target, but slightly worse under the physics target. Therefore, the added three-body representational capacity of SCTA is not the source of the best converged force accuracy on this benchmark, which is instead provided by the auxiliary supervision.
The energy column is more informative: GotenNet + aux (physics) preserves the baseline energy accuracy (
vs.
kcal/mol), whereas SCTA + aux (hybrid) increases test energy MAE to
kcal/mol and shows the largest validation-to-test inflation (
). Switching SCTA from the hybrid target to the physics target reduces the energy inflation to
and the test energy MAE to
kcal/mol, with almost no force penalty (
vs.
kcal/mol/Å). Therefore, the hybrid target is the risky choice for SCTA, as it preserves force accuracy but overfits the small validation split on the energy axis. We attribute this to the signed dihedral mean
in the hybrid target. As quantified in
Appendix B.6 (
Table A7), this component is roughly an order of magnitude smaller in terms of mean magnitude than the other two entries of the hybrid target vector (≈−0.04 vs. ≈−0.33 and ≈0.87), since the signed cosines of the many bonded dihedrals largely cancel in the graph mean. While it is not a constant or noise-like target (it retains a conformation-to-conformation CV of ≈45%), its small scale leaves it poorly matched to the other components under the shared mean-squared auxiliary loss. The SCTA branch, which mixes scalar features into the same readout pathway through its zero-initialized residual, appears to amplify this scale mismatch into the energy prediction. The physics target replaces
with the chirality-insensitive magnitude
(≈0.83), restoring a comparable scale; this is consistent with the empirical observation that switching the target stabilizes the energy axis at little force cost. A plausible direct alternative in the form of annealing the hybrid auxiliary weight to zero late in training is examined as a sensitivity probe in
Appendix B rather than as a separate component.
In summary, GotenNet + aux (physics) is the strongest configuration in this ablation for peak force accuracy at minimal energy cost. SCTA remains useful as a scalar triplet design that reaches comparable force accuracy; however, when combined with auxiliary supervision, the physics target is the safer operating point.
2.5. Limited Cross-Molecule Probes on rMD17 Ethanol, Uracil, and Salicylic Acid
The aspirin ablation above identifies a useful operating point for this molecule, but does not provide sufficient evidence of molecule-independent improvement. To delimit the scope of the claim, we ran limited rMD17 ethanol, uracil, and salicylic acid probes and compared them with the corresponding values reported for GotenNet [
46]. For salicylic acid, we additionally trained an in-house vanilla GotenNet under the same paper-aligned protocol and single seed to provide a second paired controlled comparison (alongside aspirin) at single-seed precision; ethanol and uracil are reported only against the GotenNet five-split averages. Where in-house and reported rows are compared (ethanol, uracil), the comparison is indicative rather than a fully paired multi-seed reproduction. Because the controlled aspirin ablation (
Section 2.4) already shows that the SCTA capacity branch does not change converged accuracy at a fixed auxiliary setting, the cross-molecule probes focus on the auxiliary-supervised configuration; SCTA+aux is included only for ethanol as a spot check, while the uracil and salicylic acid rows report the auxiliary-only configuration (
Table 6).
The cross-molecule outcome should be read as scope-limiting evidence, not as a transfer claim; we have one paired in-house comparison (salicylic acid) and two unpaired probes (ethanol, uracil).
On salicylic acid, we have both a paired in-house vanilla GotenNet run and the in-house GotenNet + aux (physics) run under the identical paper-aligned protocol. Auxiliary supervision does not reduce test error on this molecule: force MAE moves
kcal/mol/Å (
, with 95% analytic test set intervals
and
overlapping by about 80% of each interval half-width, so that the difference is not distinguishable from zero at finite-sample test precision), and energy MAE moves
kcal/mol (
, with intervals
and
in partial overlap). Therefore, the direction of the effect on salicylic acid is opposite to the direction observed on aspirin, although both differences sit close to the resolution of single-seed test-MAE comparisons. While these analytic intervals overlap, the configuration-level paired-block bootstrap of
Section 2.1 resamples the correlated test configurations rather than treating force components as independent, thereby resolving this force-MAE degradation as significant (
kcal/mol/Å,
). The per-configuration paired differences are consistent in sign even though the marginal intervals overlap. This significance is established on a single seed, and has not been replicated across seeds.
On uracil and ethanol, no in-house paired vanilla baseline is available, so we compare in-house GotenNet + aux (physics) only against the reported GotenNet five-split averages: uracil energy MAE moves kcal/mol () and force MAE kcal/mol/Å (); ethanol kcal/mol () and kcal/mol/Å (). These unpaired directional differences are within the range expected when a single-split and single-seed in-house run is compared against five-split averages; thus, we report the magnitudes but do not treat the signs as evidence of transfer. SCTA + aux (physics) on ethanol provides vs. for the auxiliary-only run, consistent with the aspirin finding that the SCTA capacity branch does not add converged accuracy.
Across the two paired controlled comparisons (aspirin, salicylic acid) and the two unpaired probes (ethanol, uracil), the auxiliary effect lacks a uniform direction. Under the same configuration-level paired block bootstrap, the aspirin single-seed force improvement is not by itself distinguishable from test set sampling noise (its robust support comes instead from the three-seed variance reduction of Section 2.2), whereas the salicylic acid degradation is significant (
, though likewise on a single seed). Therefore, the directional inconsistency between the two molecules is statistically resolved rather than an artifact of the test set resolution. We treat this directional inconsistency as the cross-molecule finding, and it constrains the conclusion of this paper; that is, the auxiliary loss is the more effective lever on aspirin under the protocol tested, with mixed or null evidence on the other rMD17 molecules examined here. The most likely explanation for the small magnitudes on the unpaired probes is the absolute error scale. The converged force MAE on ethanol (≈0.05 kcal/mol/Å) is already roughly
lower than on aspirin (≈0.13 kcal/mol/Å), and the strongest models in the GotenNet benchmark (NequIP, MACE, and Allegro) all cluster within roughly 0.048–0.065 kcal/mol/Å on ethanol force [
46], leaving little headroom for any low-cost modification. A degenerate supervision signal is not the explanation here; the conformation-to-conformation variance of the auxiliary geometric target on ethanol is comparable to or larger than on aspirin (
Appendix B.6), so the target itself carries usable information.
2.6. Emergent Layer-Wise Self-Gating
SCTA is attached through a zero-initialized LayerScale residual,
, so the trained values of
indicate where the optimizer chooses to engage the triplet branch. Evaluated at the best-validation checkpoint, the learned pattern is strongly depth-dependent (
Figure 3): the earliest layers L0–L2 remain almost off (
), the branch is strongly active across the middle block L6–L11 (
), and the final layers L12–L15 decay back to a moderate level (
) without returning to zero. Therefore, SCTA behaves as a mid-depth geometric correction rather than as a uniformly active replacement for the backbone’s implicit angular pathway. This depth-dependent pattern is specific to the 16-layer paper-aligned configuration: in the pilot configurations with 2–3 layers used in Appendix A and Appendix B.1, all SCTA layers necessarily occupy similar relative depths, meaning that the mid-depth interpretation does not apply.
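The role of the zero-initialized gate can be illustrated with a small sketch. It assumes the common LayerScale form, output = x + γ ⊙ f(x), with a per-channel γ initialized to zero; whether the actual branch uses a per-channel or a single scalar γ is not stated here, and the class name, plain-list representation, and toy branch are hypothetical stand-ins for the tensor implementation.

```python
class LayerScaleResidual:
    """Residual wrapper: output = x + gamma * branch(x), channel-wise.

    gamma starts at zero (assumption: per-channel gate), so the
    wrapped branch contributes nothing until training moves gamma.
    """

    def __init__(self, branch, dim, init_gamma=0.0):
        self.branch = branch
        self.gamma = [init_gamma] * dim

    def __call__(self, x):
        update = self.branch(x)
        return [xi + g * ui for xi, g, ui in zip(x, self.gamma, update)]

    def mean_abs_gamma(self):
        """Summary statistic for how strongly the branch is engaged."""
        return sum(abs(g) for g in self.gamma) / len(self.gamma)


def toy_branch(x):
    """Hypothetical stand-in for the triplet branch."""
    return [2.0 * xi for xi in x]


layer = LayerScaleResidual(toy_branch, dim=3)
features = [1.0, -2.0, 0.5]
print(layer(features))          # identical to the input at initialization
layer.gamma = [0.5, 0.5, 0.5]   # pretend the optimizer opened the gate
print(layer(features), layer.mean_abs_gamma())
```

Under this form the model starts out exactly equal to the backbone, so a nonzero trained γ at a given layer indicates that the optimizer found the branch useful there, which is the reading applied to the layer-wise pattern above.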
2.7. Pilot Ablation: SCTA Does Not Replace Tensor Features
The self-gating pattern raises a narrower architectural question: can the explicit scalar triplet branch compensate for reducing the backbone’s steerable-feature order? To test this, we ran a reduced pilot ablation on aspirin (, 3 interaction layers, 100 epochs, single seed 42), comparing GotenNet with or and with or without SCTA. This pilot is intended as a qualitative architecture check, not as a replacement for the paper-aligned 3000-epoch results.
The pilot result is clear in its direction: reducing the backbone from
to
worsens force MAE by 42% (A → B). Adding SCTA to this weakened backbone does not recover the lost tensor channel; it further degrades force MAE in this short-budget setting (B → C). In contrast, adding SCTA to the full
backbone improves the pilot validation errors (A → D). The conclusion is not that SCTA can simplify the backbone, but that it is useful only when the underlying equivariant representation remains strong enough. At this 100-epoch pilot budget, val_E does not converge and is sensitive to architectural perturbations (compare
Appendix B.5). Thus, the force column should be read as the more stable indicator at pilot scale (
Table 7).
This result matches the design intent of SCTA. The branch supplies an explicit scalar cosine-angle signal, whereas the steerable channel carries higher-order angular information that is not reducible to a single triplet cosine. Therefore, SCTA should be treated as a residual scalar three-body prior on top of a competent equivariant backbone rather than as a substitute for the backbone’s tensor features.
2.8. Note on Frame-Projection Alternatives
A natural control for SCTA is to replace its explicit
input with scalars obtained by projecting equivariant tensor features onto a local triplet frame, following the spirit of LEFTNet [
50], frame averaging [
51], and related inertial-frame designs [
52]. At pilot configuration C, the frame projection branch is empirically competitive with SCTA. In parallel with SCTA, it gives essentially the same force MAE (
vs.
kcal/mol/Å); as a full replacement for SCTA, it matches SCTA on force (
) while noticeably reducing energy MAE (
vs.
kcal/mol; see full table in
Appendix A). Therefore, we do not present frame projection as a failure case at this scale.
The algebraic analysis adds a design rationale rather than an explanation of an empirical failure. Under any orthonormal triplet frame, the diagonal projected pairwise scalar collapses to the ordinary tensor inner product
, so it does not encode the chosen local frame. The remaining genuinely frame-dependent channel is a pseudoscalar triple product, so it is not an
scalar of the kind that a parity-even energy target on achiral molecules can directly exploit. The cos-angle input of SCTA avoids both issues by construction, being a true
-invariant three-body scalar that requires no frame, which is why we adopt it as the main design in this paper. The full setup, empirical table, and proofs are given in
Appendix A.
2.9. Computational Cost
2.9.1. Per-Layer Asymptotic Cost
Because scalability is a central constraint for neural interatomic potentials [
53], we report both asymptotic and measured costs. SCTA adds two operations to each forward pass. First, triplet enumeration is performed once and cached across interaction layers, requiring
triplet index tuples for node degrees
. Second, each layer applies triplet attention: element-wise feature products, softmax over triplets grouped by center, scatter aggregation, and a final linear projection. This layer-wise branch has complexity
in hidden dimension
D. No Clebsch–Gordan coupling appears, so the classical tensor-product scaling factor
is absent.
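The one-time triplet enumeration and the cosine-angle input can be sketched as follows. This is a minimal pure-Python illustration that assumes triplets are unordered neighbor pairs (j, k) around a center i within a radial cutoff; the actual implementation presumably uses batched tensor operations and may count ordered pairs, and all function names here are hypothetical.

```python
import math
from itertools import combinations


def neighbor_lists(positions, cutoff):
    """Indices of all atoms within `cutoff` of each atom (self excluded)."""
    n = len(positions)
    return [
        [j for j in range(n)
         if j != i and math.dist(positions[i], positions[j]) <= cutoff]
        for i in range(n)
    ]


def enumerate_triplets(positions, cutoff):
    """List of (center i, neighbor j, neighbor k, cos of angle j-i-k).

    Computed once per graph; the result can be cached and reused by
    every interaction layer.
    """
    triplets = []
    for i, nbrs in enumerate(neighbor_lists(positions, cutoff)):
        for j, k in combinations(nbrs, 2):
            u = [a - b for a, b in zip(positions[j], positions[i])]
            v = [a - b for a, b in zip(positions[k], positions[i])]
            dot = sum(a * b for a, b in zip(u, v))
            cos = dot / (math.hypot(*u) * math.hypot(*v))
            triplets.append((i, j, k, max(-1.0, min(1.0, cos))))
    return triplets


# Toy geometry (hypothetical): a right angle centered on atom 0.
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
for triplet in enumerate_triplets(coords, cutoff=2.0):
    print(triplet)
```

The cosine depends only on relative positions and inner products, so it is unchanged by translations, rotations, and reflections of the molecule and needs no local frame.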
By comparison, each Graph Attention Transformer Architecture (GATA) layer in the GotenNet backbone has a cost of
on the edges
. Therefore, the SCTA-to-GATA asymptotic ratio is
where
is the average node degree. Equation (1) is obtained in two steps that make the two “≈” explicit. The first step cancels the common factor D in the two per-layer costs,
, and substitutes the exact counts
and
. The second step replaces this ratio of degree-dependent sums by a single function of the mean degree: writing
gives
, which reduces to
when the degree distribution is concentrated around
(i.e.,
). The residual term
is a non-negative degree variance correction, so
is a lower bound on the true triplet-to-edge ratio. On rMD17 aspirin (21 atoms, 5 Å radial cutoff),
, for a nominal triplet-to-edge ratio of ≈6.5×. Numerically, each aspirin graph has
triplets (≈91 per atom) and
directed edges, so the exact ratio
coincides with the approximation
to the quoted precision, confirming that the degree-variance correction is negligible for this near-homogeneous neighborhood. The measured epoch-level overhead is much smaller because GATA also contains steerable-feature updates and heavier pairwise message MLPs, whereas SCTA uses a lightweight scalar attention branch. Therefore, the asymptotic ratio in Equation (1) upper-bounds rather than predicts the realized per-epoch cost measured next.
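The degree-variance argument can be checked numerically. The sketch below assumes that triplets are counted as unordered neighbor pairs, d(d − 1)/2 per center of degree d, and edges as directed, d per atom; this convention is an inference from the quoted ≈91 triplets per atom and ≈6.5× ratio rather than something stated explicitly. Under it, the exact ratio equals (mean degree − 1)/2 plus a non-negative term Var(d)/(2 × mean degree). The degree lists are hypothetical.

```python
def triplet_edge_ratio(degrees):
    """Exact triplet-to-edge ratio and its mean-degree decomposition.

    Returns (exact, approx, correction) with
    exact      = sum_i d_i (d_i - 1) / 2  /  sum_i d_i
    approx     = (mean_d - 1) / 2
    correction = var_d / (2 * mean_d)   (non-negative)
    so that exact == approx + correction up to rounding.
    """
    n = len(degrees)
    edges = sum(degrees)
    triplets = sum(d * (d - 1) // 2 for d in degrees)
    mean_d = edges / n
    var_d = sum((d - mean_d) ** 2 for d in degrees) / n
    return triplets / edges, (mean_d - 1) / 2, var_d / (2 * mean_d)


# Homogeneous case: 21 atoms, each with 14 neighbors (illustrative value
# chosen because 14 * 13 / 2 = 91 triplets per atom).
print(triplet_edge_ratio([14] * 21))

# Heterogeneous case: same mean degree, non-zero variance.
print(triplet_edge_ratio([12, 14, 16] * 7))
```

In the homogeneous case the correction vanishes and the ratio is exactly 6.5; spreading the degrees around the same mean only raises the exact ratio, which illustrates why the mean-degree expression acts as a lower bound.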
2.9.2. Empirical Wall Clock
At the paper-aligned configuration (
, 16 layers) on a single NVIDIA RTX 6000 Ada with batch size 4, one pass through the 950-sample aspirin training split takes approximately 48 s/epoch for vanilla GotenNet and 62 s/epoch for SCTA (single timing run, not repeated), corresponding to a ∼29% per-epoch overhead from the triplet branch. This measured 29% is far below the ≈6.5× asymptotic triplet-to-edge ratio of Equation (
1). This is because the per-epoch wall clock is dominated by the backbone’s steerable-feature (HTR) updates and pairwise message MLPs, against which the added scalar triplet work
is comparatively cheap; the asymptotic ratio counts triplets relative to edges, not relative to the backbone’s full per-edge tensor workload. The cross-molecule probes of
Section 2.5 were trained on a separate single NVIDIA RTX 4090; the hardware choice affects only wall clock timing, not the reported MAE values. Relative to vanilla GotenNet, SCTA + aux can still reduce wall clock time at selected early-to-mid validation thresholds, since it needs fewer epochs: val_F
kcal/mol/Å is reached at approximately
GPU-hours for SCTA + aux vs.
GPU-hours for vanilla GotenNet, and val_E
kcal/mol is reached in
h vs.
h.
This comparison should not be read as a general SCTA throughput advantage.
Section 2.3 shows that the vanilla comparison conflates SCTA and auxiliary supervision and that SCTA + aux does not have a robust force time-to-threshold advantage over the stronger GotenNet + aux (physics) baseline. Therefore, the compute-side conclusion is narrower: auxiliary geometric supervision provides most of the sample efficiency benefit at negligible cost, while SCTA introduces moderate triplet overhead and is justified mainly as a no-CG scalar three-body design and as an analytical probe.
2.9.3. Auxiliary-Loss Overhead
The auxiliary geometric loss adds negligible overhead. The graph-level target is computed under torch.no_grad once per batch, and the two-layer auxiliary-head Multi-Layer Perceptron (MLP) adds ≪1% of GotenNet’s total parameter count. Therefore, the auxiliary loss is practically “free” at inference time (the auxiliary head is not used for energy/force prediction) and contributes only an additional <0.5 s per epoch during training.
2.9.4. Memory
The dominant memory cost of the SCTA branch is the triplet-index cache (precomputed once per forward pass and reused across all interaction layers), which scales as $\mathcal{O}(T)$ in the number of triplets $T$, with the same triplet-to-edge ratio as the FLOP cost. At rMD17 aspirin (≈91 triplets per atom) and batch size 4, the cached triplet tensors occupy a small fraction of activation memory relative to the steerable feature buffers used by the Hierarchical Tensor Refinement (HTR) module. We did not observe a memory-bound regime for this benchmark, although for larger systems with denser neighborhoods $T$ would scale quadratically with the node degree and could become limiting. Concretely, the cache holds only a few integer index tensors per triplet (the center, its two neighbors, and the two contributing edges) shared across all 16 interaction layers, so its footprint is negligible next to the floating-point activation buffers. For large or periodic systems in which $T$ grows quadratically with the node degree, capping the neighbor count or enumerating triplets in blocks would bound this cost without changing the scalar attention itself.
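A back-of-envelope estimate makes the size of the index cache concrete. The sketch below is illustrative only: it assumes 64-bit integer indices and five index tensors per triplet (center, two neighbors, two edges, as listed above), and it uses the approximate figure of 91 triplets per atom; the helper name `triplet_cache_bytes` is hypothetical.

```python
def triplet_cache_bytes(num_triplets, index_tensors=5, bytes_per_index=8):
    """Bytes needed to hold the cached integer index tensors for one batch."""
    return num_triplets * index_tensors * bytes_per_index


# Approximate aspirin figures quoted in the text: 21 atoms, ~91 triplets per atom, batch size 4.
atoms, triplets_per_atom, batch_size = 21, 91, 4
num_triplets = atoms * triplets_per_atom * batch_size
cache_kib = triplet_cache_bytes(num_triplets) / 1024
```

Under these assumptions the cache stays well below one MiB per batch, which is consistent with the statement that it is small next to the activation buffers.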
3. Materials and Methods
Figure 4 provides an overview of the implemented architecture and separates the inference-time energy–force path from the training-only auxiliary supervision path. Along the inference path (top row of
Figure 4), atomic inputs
build a 5 Å neighbor graph that feeds the GotenNet interaction stack; the stack’s scalar features
are read out as the energy
and differentiated to give conservative forces
. The SCTA residual branch sits inside this stack, adding a scalar triplet-attention term to
(
Section 3.3). The bottom row is active only during training: a geometric target on bond-angle and dihedral statistics (
Section 3.4), an auxiliary head that reads
, and the combined objective
(
Section 3.5). The auxiliary head is discarded at inference, so it shapes the representation during training without adding any test-time cost.
3.1. Dataset and Units
We use the
rMD17 dataset [
47], a revised version of MD17 [
5] with DFT energies and forces for small organic molecules at the Perdew–Burke–Ernzerhof (PBE)/def2-SVP level. The main controlled experiments use aspirin ($\mathrm{C_9H_8O_4}$, 21 atoms, 100,000 configurations). Following the GotenNet evaluation protocol [
46], we use the 950/50/1000 train/validation/test split and keep the corresponding split file (
splits_0.npz) fixed for all in-house comparisons.
Section 2.5 adds limited ethanol, uracil, and salicylic acid probes using the same code path and compares them with the corresponding GotenNet reported values; these probes are used only to bound the transfer claim.
During training and evaluation, energies and forces are handled in the dataset’s native kcal/mol and kcal/mol/Å units, matching the convention used in GotenNet. Literature baselines are converted to these units when necessary. Energies are standardized to zero mean and unit variance using the training set; forces are not standardized.
3.2. Backbone: GotenNet
We use GotenNet [
46] as the equivariant backbone. Given atomic numbers
and positions
, the model constructs a 5 Å radial-cutoff graph and maintains scalar node features
together with steerable tensor features
. Unless otherwise stated, the paper-aligned runs use
, 16 interaction blocks, and
.
Each interaction block applies GATA with HTR edge updates followed by Equivariant Feed-Forward (EQFF) updates to the scalar and steerable channels. SCTA and the auxiliary head are added around this backbone without changing the backbone energy readout. The final scalar node features are mapped to atomic energy contributions by a two-layer MLP and summed over atoms, giving the standardized energy readout
$$\tilde{E} = \sum_{i=1}^{N} \phi(h_i), \tag{2}$$
where $\phi$ denotes the two-layer atom-wise MLP acting on the final scalar features $h_i$ and $\tilde{E}$ is the energy in the standardized units of Section 3.1. Forces are computed as the negative gradient of the energy with respect to the atomic positions, $\mathbf{F} = -\nabla_{\mathbf{r}} E$, preserving energy–force consistency.
3.3. Scalarization-Compatible Triplet Cross-Attention (SCTA)
The backbone of
Section 3.2 is held fixed throughout. The two mechanisms examined in this work are introduced around it. The first (this subsection) augments the model with explicit three-body representational capacity through a scalar triplet branch (SCTA); the second (
Section 3.4) instead augments the training supervision through an auxiliary geometric loss. Because both are attached to an identical backbone, the capacity-versus-supervision comparison of
Section 2 is a controlled one.
Each SCTA layer chains five steps. First, the triplet angle is embedded in a fixed angular basis (Equation (
3)). This basis gates a per-triplet attention score (Equation (
4)), which is then normalized over the triplets sharing a center (Equation (
5)). The resulting weights aggregate a symmetric neighbor value (Equation (
6)), and the aggregated message is added back to the scalar channel through a zero-initialized LayerScale residual (Equation (
7)).
SCTA is a scalar residual branch inserted after GATA and EQFF in each interaction block. For each center atom $i$, we enumerate unordered neighbor pairs $\{j, k\}$ and form triplets $(i, j, k)$ with $j \neq k$. The branch uses only scalar node features and the angle cosine $\cos\theta_{ijk}$ at the center $i$; it does not introduce Clebsch–Gordan tensor products or alter the steerable tensor channel directly.
3.3.1. Triplet Enumeration
Triplet indices are built once per forward pass and cached across all interaction blocks. For a node with graph degree $d_i$, the branch enumerates $d_i(d_i-1)/2$ neighbor pairs, giving $T = \sum_i d_i(d_i-1)/2$ triplets in total. On rMD17 aspirin at the 5 Å cutoff, this corresponds to about 91 triplets per atom on average.
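The enumeration step can be sketched in a few lines. This is a minimal illustration, not the cached tensor implementation used in the model: the function name `enumerate_triplets`, the dictionary-based neighbor list, and the toy graph are all assumptions introduced here.

```python
from itertools import combinations


def enumerate_triplets(neighbors):
    """Return (center, j, k) tuples for every unordered neighbor pair j < k of each center."""
    triplets = []
    for center, nbrs in neighbors.items():
        for j, k in combinations(sorted(nbrs), 2):
            triplets.append((center, j, k))
    return triplets


# Toy neighbor list (made-up connectivity): node -> neighbors within the cutoff.
neighbors = {0: [1, 2, 3], 1: [0], 2: [0, 3], 3: [0, 2]}
triplets = enumerate_triplets(neighbors)

# The count matches the closed form: sum over nodes of d * (d - 1) / 2.
expected = sum(len(n) * (len(n) - 1) // 2 for n in neighbors.values())
```

Because the list depends only on the neighbor graph, it can be computed once per forward pass and reused by every interaction block, as described above.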
3.3.2. Angular Basis
The angle is encoded by a fixed Gaussian basis on $[-1, 1]$, the range of the angle cosine:
where
are uniformly spaced and
. We use
throughout. The angular basis is projected to hidden dimension
D by a bias-free linear map
, where
denotes the
c-th channel of
. Equation (
3) is not evaluated globally or pooled over the graph; instead, it is computed independently for each triplet
from that triplet’s own angle cosine
. The resulting per-triplet vector
is exactly the quantity that enters the attention score in Equation (
4) below. Therefore, the angular basis acts only as a local geometric gate on each triplet, not as a graph-level descriptor.
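A per-triplet Gaussian expansion of the angle cosine can be sketched as follows. The functional form, the number of centers, and the width are not recoverable from this text, so everything specific here is an assumption: the sketch uses $\exp(-(x-\mu_k)^2/(2\sigma^2))$ with 8 uniformly spaced centers on $[-1, 1]$ and a width equal to the center spacing, and the name `gaussian_angular_basis` is hypothetical.

```python
import math


def gaussian_angular_basis(cos_theta, num_basis=8, width=None):
    """Expand one angle cosine in Gaussians with uniformly spaced centers on [-1, 1]."""
    centers = [-1.0 + 2.0 * k / (num_basis - 1) for k in range(num_basis)]
    if width is None:
        width = 2.0 / (num_basis - 1)  # assumption: width equals the center spacing
    return [math.exp(-((cos_theta - mu) ** 2) / (2.0 * width ** 2)) for mu in centers]


# One triplet with a straight angle (cos = -1): the first basis function peaks.
basis = gaussian_angular_basis(-1.0)
```

The function takes a single cosine and returns a single vector, which mirrors the point made above: the basis is a local gate evaluated per triplet, not a descriptor pooled over the graph.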
3.3.3. Attention
Following scaled dot-product attention [
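54], we compute query, key, and value projections from scalar node features: $q_i = W_Q h_i$, $k_j = W_K h_j$, and $v_j = W_V h_j$, where $W_Q, W_K, W_V$ are learned weight matrices acting on the scalar node features $h$. The scalar score for triplet $(i, j, k)$ is
$$s_{ijk} = \frac{1}{\sqrt{D}} \sum_{c=1}^{D} q_i^{c}\, k_j^{c}\, k_k^{c}\, a_{ijk}^{c}, \tag{4}$$
where $c$ indexes the $D$ hidden channels, $q_i^{c}$, $k_j^{c}$, $k_k^{c}$ are the $c$-th components of the center query and the two neighbor keys, $a_{ijk}^{c}$ is the $c$-th channel of the projected angular basis of Equation (3), and $1/\sqrt{D}$ is the standard scaled-dot-product normalization that keeps the variance of $s_{ijk}$ approximately constant in $D$. The four-way product $q_i^{c} k_j^{c} k_k^{c} a_{ijk}^{c}$ is the scalar $(j \leftrightarrow k)$-symmetric analogue of a pairwise dot-product attention logit, and is large only when the center, both neighbors, and the triplet angle are mutually aligned in channel $c$. The score is followed by a softmax over all triplets centered at the same atom:
$$\alpha_{ijk} = \frac{\exp(s_{ijk})}{\sum_{(j',k') \in \mathcal{P}(i)} \exp(s_{ij'k'})}, \tag{5}$$
where $\mathcal{P}(i)$ is the set of neighbor pairs around the center $i$, so that $\sum_{(j,k) \in \mathcal{P}(i)} \alpha_{ijk} = 1$ and the branch forms a convex per-center combination of triplet messages.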
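The score and normalization of Equations (4) and (5) can be written out for a single center. This is a plain-Python sketch under stated assumptions, not the batched scatter-softmax used in practice: the function names, the two-channel toy vectors, and the angular gate values are all made up for illustration.

```python
import math


def triplet_score(q_i, k_j, k_k, a_ijk):
    """Scaled four-way channel product: sum_c q_i[c] * k_j[c] * k_k[c] * a_ijk[c] / sqrt(D)."""
    dim = len(q_i)
    total = sum(q * kj * kk * a for q, kj, kk, a in zip(q_i, k_j, k_k, a_ijk))
    return total / math.sqrt(dim)


def softmax(scores):
    """Numerically stable softmax over the triplets that share one center."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


# Toy center with D = 2 channels and three neighbor pairs (all values are made up).
q = [1.0, 0.5]
keys = {1: [0.2, 1.0], 2: [1.5, -0.3], 3: [0.7, 0.7]}
angle_gate = {(1, 2): [0.9, 0.1], (1, 3): [0.4, 0.6], (2, 3): [0.2, 0.8]}
scores = [triplet_score(q, keys[j], keys[k], gate) for (j, k), gate in angle_gate.items()]
weights = softmax(scores)
```

Swapping the two neighbor keys leaves the score unchanged, and the weights are positive and sum to one, so the aggregated message is a convex combination over the triplets of that center.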
3.3.4. Aggregation
The triplet message is aggregated symmetrically over
:
and passed through an output projection before being added to the scalar channel:
The LayerScale vector
is initialized to zero so the SCTA branch is exactly inactive at initialization; therefore, its learned magnitude reveals where the optimizer chooses to engage the explicit triplet signal.
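The effect of the zero-initialized LayerScale can be seen in a toy residual update. The sketch assumes a channel-wise scale applied to an already projected message; the helper name `layerscale_residual` and all numeric values are hypothetical.

```python
def layerscale_residual(h, message, gamma):
    """Add a channel-wise scaled message to the scalar features: h + gamma * message."""
    return [h_c + g_c * m_c for h_c, g_c, m_c in zip(h, gamma, message)]


h = [0.3, -1.2, 0.8]           # scalar features of one atom (made-up values)
message = [5.0, 2.0, -4.0]     # projected triplet message (made-up values)
gamma_init = [0.0, 0.0, 0.0]   # zero initialization: the branch contributes nothing
unchanged = layerscale_residual(h, message, gamma_init)
updated = layerscale_residual(h, message, [0.1, 0.0, 0.0])
```

With the scale at zero the features pass through untouched, so any nonzero learned entry directly indicates a channel in which the optimizer has engaged the triplet signal.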
3.4. Auxiliary Geometric Supervision
Whereas SCTA enlarges the inference-time representational capacity of the model (
Section 3.3), the auxiliary supervision introduced here leaves the inference network unchanged and instead shapes the learned representation through an additional training objective. The auxiliary path follows a chain parallel to that of SCTA: a bonded subgraph is identified (Equation (
8)), reduced to a low-dimensional geometric target in one of two variants (Equations (
9) and (
10)), regressed by a graph-level auxiliary head (Equation (
11)), and trained against that target using a mean-squared error (Equation (
12)).
The auxiliary head applies graph-level supervision to scalar representations using geometric summaries computed from the molecular structure. It is used only during training, and is removed from the energy/force prediction path at inference time. The motivation for the construction below is that the quantities supervised by the auxiliary loss (bond angles and dihedrals) are defined over
covalent bonds, whereas the 5 Å radial-cutoff graph of
Section 3.2 also connects many non-bonded atom pairs. Enumerating angles and torsions on that message-passing graph would mix chemically meaningful bond angles with spurious through-space angles, which we avoid by evaluating the geometric targets on a separate bonded subgraph rather than on the message-passing graph. We first identify a bonded subgraph
which is separate from the 5 Å message-passing graph. The
Å threshold is a conservative upper bound for typical single covalent bonds in the rMD17 molecules considered here (e.g., C–H
Å, C–C
Å, C–O
Å, N–H
Å, O–H
Å). It sits above the longest of these single bonds yet below the shortest non-bonded contact distance, so that the bonded subgraph approximates the chemical bond graph without explicit bond perception, valence assignment, or a chemistry toolkit. The cutoff is used as a fixed hyperparameter rather than being tuned per molecule. Moreover, the threshold is supported empirically: the explicit ablation in
Appendix B.2 (
Table A3) shows that
Å is the best of the three tested values
Å on
both the energy and force axes at pilot configuration C, with a force-MAE spread of only
across the sweep, so the threshold is not finely tuned. The bonded triplets
and quadruplets
entering Equations (
9) and (
10) are enumerated on
, not on the message-passing graph.
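A distance-thresholded bonded subgraph and its angle triplets can be built as sketched below. This is an illustration under assumptions: the function names are hypothetical, the three-atom chain is made up, and `bond_cutoff=2.0` is an arbitrary value chosen for the toy geometry, not the threshold used in the paper.

```python
import math
from itertools import combinations


def bonded_edges(positions, bond_cutoff):
    """Undirected atom pairs closer than bond_cutoff (in Å)."""
    return [
        (i, j)
        for i, j in combinations(range(len(positions)), 2)
        if math.dist(positions[i], positions[j]) < bond_cutoff
    ]


def bonded_angle_triplets(edges):
    """Triplets (j, i, k): center i bonded to both j and k, with j < k."""
    neighbors = {}
    for i, j in edges:
        neighbors.setdefault(i, []).append(j)
        neighbors.setdefault(j, []).append(i)
    return [
        (j, i, k)
        for i, nbrs in sorted(neighbors.items())
        for j, k in combinations(sorted(nbrs), 2)
    ]


# Made-up linear chain with 1.5 Å spacing; bond_cutoff=2.0 is an illustrative value only.
positions = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
edges = bonded_edges(positions, bond_cutoff=2.0)
triplets = bonded_angle_triplets(edges)
```

The end atoms are 3.0 Å apart, so they would be connected in a 5 Å message-passing graph but are not bonded here; only the single angle centered on the middle atom survives.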
We evaluate two three-dimensional target variants. The
hybrid target summarizes signed bond-angle and dihedral statistics
where
ranges over bonded triplets and
over bonded quadruplets. For a quadruplet
,
with
and
. One component of the hybrid target is the signed dihedral mean
, which is poorly scaled; it is small in magnitude (about
on aspirin, roughly an order of magnitude below the other two components) because the signed cosines of the many bonded dihedrals largely cancel in the graph mean (
Appendix B.6). This scale mismatch is the most likely source of the energy-axis instability seen when the hybrid target is combined with SCTA (
Section 2.4). We did not directly isolate the scale from sensitivity to chirality, for example by rescaling
alone, so the attribution rests on the component statistics together with the stabilizing effect of replacing the signed mean by its magnitude in the physics target. The
physics target instead replaces the signed dihedral mean with chirality-insensitive torsion magnitude and bond-length summaries
where
is the minimum bonded distance and
is the mean over atoms of the standard deviation of incident bonded distances.
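The cancellation argument for the signed dihedral mean can be reproduced on a toy geometry. The sketch uses the standard plane-normal definition of the dihedral cosine, which is assumed rather than taken from the paper's equations, and it uses the mean of absolute cosines only as one possible chirality-insensitive magnitude; the helper names and coordinates are made up.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def dihedral_cosine(p0, p1, p2, p3):
    """Cosine of the dihedral angle about the p1-p2 bond, from the two plane normals."""
    b1, b2, b3 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b1, b2), cross(b2, b3)
    return dot(n1, n2) / math.sqrt(dot(n1, n1) * dot(n2, n2))


# Two made-up quadruplets sharing the first three atoms: one cis, one trans.
base = ((0.0, 1.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0))
cis = dihedral_cosine(*base, (1.0, 1.0, 0.0))
trans = dihedral_cosine(*base, (1.0, -1.0, 0.0))
signed_mean = (cis + trans) / 2
magnitude_mean = (abs(cis) + abs(trans)) / 2
```

The cis and trans cosines have opposite signs, so their signed mean vanishes while the magnitude-based mean does not, which is the scale mismatch described above in miniature.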
Both target variants are computed under torch.no_grad; thus, the auxiliary loss supervises the representation but does not add a direct force-gradient term through the geometric target.
A two-layer MLP maps the per-node scalar features to the auxiliary dimension, followed by a graph-level mean pooling
We use a graph-mean pooling head matched to a graph-level target
rather than per-bond or per-triplet local supervision (e.g., histogram-matching on bond-angle distributions or edge-level prediction of
) for two practical reasons: it aligns with the graph-level mean readout already used by the GotenNet backbone for energy, and it does not commit the model to a specific bond/triplet partition during representation learning. This graph-level design is coarse by choice, as it supervises only aggregate geometric statistics, not
where a geometric error occurs. In
Appendix B.7, we discuss more localized, finer-grained alternatives (per-bond or per-triplet targets, distribution matching, and per-atom descriptors) along with their drawbacks and why we leave them to future work. The auxiliary loss is the mean-squared error, with
denoting either target variant:
3.5. Overall Training Objective
The standardized energy prediction is mapped back to physical units and differentiated into conservative forces (Equation (
14)), which enter the combined energy–force objective together with the auxiliary term of
Section 3.4 (Equation (
15)).
The training objective follows the GotenNet energy–force loss, and adds the auxiliary geometric term only when the corresponding head is enabled. Let
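$\tilde{E}$ denote the standardized energy prediction of Equation (2) and $\tilde{E}^{\mathrm{DFT}}$ the standardized DFT energy target, where standardization uses the training set energy mean $\mu_E$ and standard deviation $\sigma_E$ of Section 3.1 through $\tilde{E}^{\mathrm{DFT}} = (E^{\mathrm{DFT}} - \mu_E)/\sigma_E$. The physical energy is recovered by inverting this standardization:
$$E = \sigma_E \tilde{E} + \mu_E, \tag{13}$$
with $\mu_E$ and $\sigma_E$ as the same training set statistics. Forces then follow from this physical energy by energy conservation. Because $\mu_E$ and $\sigma_E$ are constants independent of $\mathbf{r}$, the additive shift $\mu_E$ vanishes under the derivative and the scale $\sigma_E$ factors out:
$$\mathbf{F}_i = -\frac{\partial E}{\partial \mathbf{r}_i} = -\sigma_E \frac{\partial \tilde{E}}{\partial \mathbf{r}_i}, \tag{14}$$
so that energy and force predictions share a single scalar potential and the force scale is fixed by $\sigma_E$ rather than being learned independently. The full loss is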
where
is the reference DFT force in kcal/mol/Å. We use
and
for all in-house runs. For non-auxiliary models, we set
. Auxiliary-supervised models use the constant setting
in the main paper-aligned experiments; the sensitivity study also tests a linear decay during the first
epochs:
The auxiliary head is discarded for energy and force evaluation; thus, the auxiliary target shapes the learned representation during training but does not introduce a separate inference-time prediction path.
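The relation between the standardized energy and the physical forces can be verified numerically on a toy potential. The sketch below replaces automatic differentiation with central finite differences and uses a made-up one-dimensional potential and made-up training statistics; it only illustrates that the shift drops out and the scale factors out of the derivative.

```python
def standardized_energy(x):
    """Toy standardized potential of one coordinate x (stands in for the network output)."""
    return 0.5 * (x - 1.0) ** 2


def physical_energy(x, mu, sigma):
    """Invert the standardization: E = sigma * E_std + mu."""
    return sigma * standardized_energy(x) + mu


def force(energy_fn, x, eps=1e-6):
    """Force as the negative central-difference derivative of the energy."""
    return -(energy_fn(x + eps) - energy_fn(x - eps)) / (2 * eps)


mu, sigma, x0 = -400.0, 6.0, 1.3   # made-up training-set statistics and coordinate
f_physical = force(lambda x: physical_energy(x, mu, sigma), x0)
f_scaled = sigma * force(standardized_energy, x0)
```

Both routes give the same force up to finite-difference error, so no separate force scale has to be learned.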
3.6. Training Protocol
All paper-aligned models are trained with AdamW [
55,
56] at learning rate
, with 1000 warmup steps and a ReduceLROnPlateau schedule that monitors the validation loss (patience 30 epochs, decay factor 0.8, minimum learning rate
). Weight decay is set to zero. An Exponential Moving Average (EMA) of the model weights with decay rate 0.9 is used to compute the validation metrics, and the EMA weights are the ones restored at the best-validation checkpoint for the test evaluation reported in
Table 1 and
Table 5; this choice is fixed across all six in-house configurations, and does not bias the auxiliary-versus-no-auxiliary comparison. Each model is trained for up to 3000 epochs with batch size 4 and inference batch size 4 on a single NVIDIA RTX 6000 Ada GPU with 96 GB memory, and is evaluated at the checkpoint with the best validation loss. Actual training durations range from ∼2100 to 3000 epochs across the six configurations of
Table 1, and the full per-epoch metric logs are released as part of the data package. The train/validation/test split and random seed are fixed across all in-house comparisons.
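The EMA update with decay 0.9 amounts to the following rule. This is a schematic sketch on plain lists, not the training-framework callback used in the experiments; the function name `ema_update` and the weight snapshots are made up.

```python
def ema_update(ema_weights, weights, decay=0.9):
    """One EMA step: ema <- decay * ema + (1 - decay) * weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


ema = [0.0, 0.0]                                            # made-up initial weights
for step_weights in ([1.0, 2.0], [1.0, 2.0], [1.0, 2.0]):   # made-up weight snapshots
    ema = ema_update(ema, step_weights)
```

After three identical snapshots the average has covered a fraction $1 - 0.9^3 \approx 0.27$ of the distance to the new weights, which shows how the EMA smooths step-to-step fluctuations before validation.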
3.6.1. Error Metrics
All energy and force errors reported in this paper, including the values tabulated in
Table 1 and
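Table 5, are Mean Absolute Errors (MAEs) evaluated on the held-out test split. For a set of $N_{\mathrm{test}}$ test configurations indexed by $n$, the energy MAE is
$$\mathrm{MAE}_E = \frac{1}{N_{\mathrm{test}}} \sum_{n=1}^{N_{\mathrm{test}}} \left| E_n - E_n^{\mathrm{DFT}} \right|, \tag{17}$$
where $E_n$ is the model energy and $E_n^{\mathrm{DFT}}$ the DFT reference energy, both in kcal/mol. The force MAE averages over every Cartesian force component of every atom in every test configuration:
$$\mathrm{MAE}_F = \frac{1}{3 A N_{\mathrm{test}}} \sum_{n=1}^{N_{\mathrm{test}}} \sum_{a=1}^{A} \sum_{\alpha \in \{x,y,z\}} \left| F_{n,a,\alpha} - F_{n,a,\alpha}^{\mathrm{DFT}} \right|, \tag{18}$$
where $A$ is the number of atoms ($A = 21$ for aspirin), $F_{n,a,\alpha}$ is the predicted force, and $F_{n,a,\alpha}^{\mathrm{DFT}}$ is the DFT reference in kcal/mol/Å. For aspirin, this yields $3 \times 21 \times 1000 = 63{,}000$ force components, which is the sample size $n$ used for the analytic force confidence intervals in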
Section 2.1. The validation MAE used for checkpoint selection and for the sample efficiency analysis of
Section 2.3 is defined identically but evaluated on the 50-configuration validation split.
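Both error metrics reduce to short averaging routines. The sketch below works on nested Python lists for clarity; the function names and the toy numbers are assumptions, and the component count simply multiplies the three Cartesian directions, the 21 aspirin atoms, and the 1000 test configurations of the split described in Section 3.1.

```python
def energy_mae(pred, ref):
    """Mean absolute energy error over configurations (kcal/mol)."""
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(ref)


def force_mae(pred, ref):
    """Mean absolute error over every force component; inputs are [configuration][atom][xyz]."""
    errors = [
        abs(p - r)
        for conf_p, conf_r in zip(pred, ref)
        for atom_p, atom_r in zip(conf_p, conf_r)
        for p, r in zip(atom_p, atom_r)
    ]
    return sum(errors) / len(errors)


# Toy values (made up): two configurations for energy, one two-atom configuration for forces.
e_mae = energy_mae([1.0, 2.0], [1.5, 1.0])
f_mae = force_mae([[[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]], [[[1.0, 0.0, 0.0], [1.0, 1.0, 3.0]]])

# Number of force components in the aspirin test split: 3 directions x 21 atoms x 1000 configurations.
n_components = 3 * 21 * 1000
```

Note that the force metric averages over components rather than over per-atom vector norms, which is why the sample size counts each Cartesian direction separately.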
The test MAE values in
Table 1 and
Table 5 are evaluated once at the checkpoint with the best validation loss. Per-epoch test metrics are not used for checkpoint selection or sample-efficiency analysis.
Reduced-capacity runs are used only for qualitative ablations and supplementary checks.
Section 2.7 uses hidden dimension 32, three interaction layers, and
to probe tensor-feature complementarity.
Appendix A uses hidden dimension 16 and two interaction layers for frame-projection controls. These pilot configurations are trained for 100 epochs on CPU with seed 42, and otherwise follow the same optimizer and scheduler settings.
3.6.2. Sample Efficiency Metric
To quantify convergence behavior for the auxiliary and SCTA variants (
Section 2.3), we record the validation energy and force MAE at the end of every training epoch during the paper-aligned 3000-epoch runs; for each model, we compute the earliest epoch $t^{*}(\tau)$ at which the best-so-far validation MAE reaches a given threshold $\tau$:
$$t^{*}(\tau) = \min\left\{ t : \min_{s \le t} \mathrm{MAE}_{\mathrm{val}}(s) \le \tau \right\}, \tag{19}$$
where $\mathrm{MAE}_{\mathrm{val}}(s)$ is the validation MAE recorded at epoch $s$. The inner $\min_{s \le t}$ takes the best (lowest) value seen up to epoch $t$, while the outer minimum returns the first epoch index meeting the threshold; thus, $t^{*}(\tau)$ is a single integer in $\{1, \dots, 3000\}$ rather than a continuous quantity. Thresholds are expressed in the same native units as the reported MAE values (kcal/mol for energy and kcal/mol/Å for force). The speedup of model $M$ relative to GotenNet is the ratio of the two integer epoch counts obtained from Equation (19):
$$S_M(\tau) = \frac{t^{*}_{\mathrm{GotenNet}}(\tau)}{t^{*}_{M}(\tau)}. \tag{20}$$
Therefore, the right-hand side involves no integration, as $t^{*}_{\mathrm{GotenNet}}(\tau)$ and $t^{*}_{M}(\tau)$ are the respective integer epochs at which vanilla GotenNet and model $M$ first cross the threshold $\tau$, and $S_M(\tau)$ is simply their quotient. Values above 1 indicate faster convergence than vanilla GotenNet. If a run does not reach a threshold within 3000 epochs, no speedup is reported for that threshold.
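The epochs-to-threshold metric and the resulting speedup of Equations (19) and (20) can be computed from logged validation curves as sketched below. The function names and the two short curves are made up for illustration; epochs are counted from 1, and a missing crossing is represented by `None` so that no speedup is reported.

```python
def epochs_to_threshold(val_mae, tau):
    """First 1-based epoch at which the best-so-far validation MAE is <= tau, else None."""
    best = float("inf")
    for epoch, value in enumerate(val_mae, start=1):
        best = min(best, value)
        if best <= tau:
            return epoch
    return None


def speedup(baseline_curve, model_curve, tau):
    """Ratio of baseline to model epoch counts; None if either run never reaches tau."""
    t_base = epochs_to_threshold(baseline_curve, tau)
    t_model = epochs_to_threshold(model_curve, tau)
    if t_base is None or t_model is None:
        return None
    return t_base / t_model


# Made-up validation curves, one value per epoch.
baseline = [0.5, 0.4, 0.3, 0.2]
model = [0.5, 0.3, 0.2, 0.1]
s = speedup(baseline, model, tau=0.3)
unreached = speedup(baseline, model, tau=0.15)
```

Here the baseline crosses the threshold at epoch 3 and the model at epoch 2, giving a speedup of 1.5; for the tighter threshold the baseline never crosses, so nothing is reported.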
3.7. Implementation
We implement SCTA and the auxiliary loss on top of the public GotenNet codebase using PyTorch 2.5.1, PyTorch Geometric 2.7.0, PyTorch Lightning 2.2.5, e3nn 0.6.0, and Hydra 1.3.2. SCTA is added as a representation module extension after each GATA/EQFF block; triplet indices are generated from the same 5 Å neighbor graph, cached once per forward pass, and reused across interaction layers. The auxiliary target construction is performed under torch.no_grad, and the auxiliary head contributes only to the training loss.
Hydra configuration files specify the backbone, hidden size, number of interaction layers, whether SCTA is enabled, auxiliary target type (hybrid or physics), auxiliary weight, and training budget. Source code, trained checkpoints, configuration files, and experimental logs will be released upon acceptance.
4. Conclusions
This work asks whether the difficulty of capturing three-body geometry in a no-Clebsch–Gordan backbone such as GotenNet is a matter of representational capacity or of training supervision. We address the question with three controlled probes on a single-seed paper-aligned rMD17 aspirin split: two that add representational capacity (frame projection of tensor features and scalarization-compatible triplet cross-attention) and one that adds supervision instead of capacity (a graph-level auxiliary loss on bond-angle and dihedral statistics).
Frame projection is included as a natural alternative scalar three-body branch. At pilot configuration C, it performs comparably to SCTA on aspirin: when used in parallel with SCTA, the two branches give the same force MAE; when used as a replacement for SCTA, the frame branch matches SCTA on force and noticeably reduces energy MAE. Algebraically, however, the frame projection’s two scalar outputs have specific structural limitations: the diagonal projected feature is exactly frame-independent and reduces to an ordinary tensor inner product, and the only genuinely frame-dependent channel is parity-odd. These observations motivate the cos-angle input of SCTA as the more principled scalar three-body choice, since it is a true $O(3)$-invariant of the triplet and requires no frame construction.
As a correctly designed scalar triplet branch that avoids both algebraic limitations of frame projection, SCTA matches the converged force accuracy of the GotenNet backbone to within ∼0.4% on aspirin; however, it does not produce a robust independent gain over the stronger GotenNet + aux (physics) baseline. We read this neutral outcome as evidence that the backbone’s implicit angular pathway already supplies the relevant three-body representational capacity, an interpretation that is supported by the learned LayerScale weights (nearly inactive in early layers, strongly active in the middle interaction blocks at paper-aligned scale). Therefore, adding more capacity is not the binding constraint, even when correctly scalarized. A pathway-level probe (
Appendix B.5) supports the same picture from the opposite side: removing the
inner-product term from GotenNet’s HTR pathway costs almost nothing on pilot val_F and noticeably improves pilot val_E, suggesting that the backbone’s existing angular capacity is, if anything, not monotonically beneficial across both prediction targets at small scale. We leave this observation as an opening for future architectural simplification.
Of the two levers, supervision is the one with a measurable effect on converged accuracy, though a modest and seed-dependent one. On the single-seed ablation, the physics-style auxiliary target gives the lowest force MAE (
kcal/mol/Å vs.
for reproduced GotenNet); across three random seeds (
Section 2.2), it lowers the mean force MAE only slightly (
kcal/mol/Å) while reducing the seed-to-seed standard deviation roughly threefold. Therefore, its most reproducible benefits are a reduction in seed-to-seed variance and faster convergence (epochs to validation targets cut by 26–55%) rather than a large peak-accuracy gain, while energy accuracy is preserved (≈0.0355 kcal/mol). The choice of auxiliary target matters, as the hybrid target preserves force accuracy but interacts poorly with SCTA on the energy axis, whereas the physics target is the safer operating point. The auxiliary effect is scale-dependent: at the 100-epoch pilot configuration C used in the appendices, the auxiliary loss does not improve over SCTA alone (
Appendix A.2), and the headline gain should accordingly be read as a paper-aligned-scale effect rather than a universal advantage at small scale. Limited cross-molecule probes on ethanol, uracil, and salicylic acid are reported only to delimit the scope of this finding: across all three, GotenNet + aux (physics) stays within a few percent of the reported GotenNet force MAE; however, at single-seed precision and without paired in-house baselines, this does not establish a molecule-independent claim. For the small ethanol effect in particular, the most plausible explanation is that its converged force MAE is already comparable to the spread of strong baselines on this molecule (NequIP, MACE, and Allegro all fall within ∼0.048–0.065 kcal/mol/Å per the GotenNet benchmark), leaving little room for any low-cost modification.
Limitations and Future Directions
The present study covers one fully controlled molecule across three random seeds and three limited transfer probes, so the narrow 0.1280–0.1292 kcal/mol/Å single-seed force-MAE band among auxiliary-trained aspirin models should not be read as a definitive ranking. The ethanol, uracil, and salicylic acid comparisons use reported GotenNet numbers (five-split averages) rather than full paired multi-seed reproductions, and should be treated accordingly. Multi-seed validation on aspirin is now provided (
Section 2.2), leaving multi-molecule validation across seeds as the most important remaining step. We evaluate models on held-out test configurations only, and do not assess long-horizon molecular-dynamics stability (energy drift, force-error tail behavior, or RMSD divergence under nanosecond-scale rollouts), which is increasingly used to gauge the practical utility of small MAE improvements. We regard this as the main limitation of our practical perspective: a converged-MAE difference as small as the
kcal/mol/Å we report need not translate into a difference in trajectory stability, and an auxiliary loss that slightly lowers force MAE could help or hurt energy conservation over long rollouts. Therefore, establishing whether the auxiliary supervision delivers a genuine practical benefit would require NVE/NVT rollouts measuring energy drift and RMSD divergence, which we identify as the key practical followup enabled by the released checkpoints. Further work should test whether auxiliary geometric supervision transfers to newer no-CG or inner product-based backbones such as EST [
35], Geodite [
37], and MARA [
38], whether SCTA benefits from force- and energy-specific readouts, and whether four-body or dihedral-aware attention improves the treatment of torsional geometry [
20,
57,
58]. We release the code, configurations, checkpoints, and logs to support these followup studies.