Review Reports - AdamN: Accelerating Deep Learning Training via Nested Momentum and Exact Bias Handling

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The initial motivation seems partially redundant with respect to the state of the art: lines 48–57 argue that there is no principled approach to the so-called momentum-of-momentum and that existing solutions inevitably suffer from bias or warmup-dependence problems. This assertion is not adequately supported by a critical comparison with recent works in the literature that implicitly address similar issues. Likewise, the claim that no prior work provides a principled, bias-corrected momentum-of-momentum (lines 166–169) is not shown to follow from a systematic analysis.

The derivation of the exact debiasing factor f_t(β₁, β₂) relies on the explicit assumption that the gradient statistics are stationary, i.e., E[g_j] ≡ g. Although this assumption is used in the literature to justify bias correction in Adam, it becomes even more restrictive in the specific case of a double EMA. The impact of violating this assumption is not adequately discussed.
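To make the stationarity concern concrete, the following is a minimal sketch and not the manuscript's implementation. It assumes a zero-initialized double EMA of the form m_t = β₁·m_{t-1} + (1-β₁)·g_t and n_t = β₂·n_{t-1} + (1-β₂)·m_t; the paper's exact recursion and its f_t may differ. The function names are hypothetical. Under that assumption, a constant gradient is recovered exactly by n_t / f_t, while a drifting gradient is not.

```python
def double_ema_bias_factor(t: int, beta1: float, beta2: float) -> float:
    """Closed-form f_t for the ASSUMED zero-initialised EMA of an EMA.

    Derived from E[n_t] = g * f_t when E[g_j] = g for all j (stationarity).
    """
    if beta1 == beta2:
        # Limit case: sum_{k=1..t} beta^(t-k) * beta^k = t * beta^t
        cross = t * beta1 ** t
    else:
        # Geometric sum: sum_{k=1..t} beta2^(t-k) * beta1^k
        cross = beta1 * (beta2 ** t - beta1 ** t) / (beta2 - beta1)
    return 1.0 - beta2 ** t - (1.0 - beta2) * cross


def debiased_double_ema(gradients, beta1: float, beta2: float):
    """Run the assumed recursion and return n_t / f_t for every step t."""
    m = n = 0.0
    corrected = []
    for t, g in enumerate(gradients, start=1):
        m = beta1 * m + (1.0 - beta1) * g   # inner EMA of the gradient
        n = beta2 * n + (1.0 - beta2) * m   # outer EMA of the inner EMA
        corrected.append(n / double_ema_bias_factor(t, beta1, beta2))
    return corrected
```

With a constant gradient sequence the corrected value equals that constant at every step. With a steadily increasing sequence the corrected value lags behind the latest gradient, which is the kind of behavior a discussion of non-stationary gradients would need to address.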

Lines 428–433 state that for large values of β₂ the optimizer would take relatively larger steps. It is unclear how this is consistent with the earlier statement (lines 398–404) that freshness tends to zero as β₂ increases, which suggests instead greater inertia and hence more cautious steps.

Algorithm 1 seems to explicitly suggest computing the weights w_{t,j} and the sum over all past gradients, contradicting the claim, reiterated in lines 61–62 and 489–490, that the computational cost remains of the same order as Adam/AdamW.
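The distinction behind this comment can be sketched as follows, under the same assumption of a zero-initialized EMA of an EMA; the definition of w_{t,j} used here is inferred from that assumption and may not match Algorithm 1, and the function names are hypothetical. The explicit form touches every past gradient at step t, whereas the recursive form keeps only two running states and costs a constant amount of work per step.

```python
def explicit_weights(t: int, beta1: float, beta2: float):
    """ASSUMED weights w_{t,j}, j = 1..t, such that n_t = sum_j w_{t,j} * g_j."""
    scale = (1.0 - beta1) * (1.0 - beta2)
    return [
        scale * sum(beta2 ** (t - k) * beta1 ** (k - j) for k in range(j, t + 1))
        for j in range(1, t + 1)
    ]


def direction_explicit(gradients, beta1: float, beta2: float) -> float:
    """Weighted sum over the full gradient history (cost grows with t)."""
    weights = explicit_weights(len(gradients), beta1, beta2)
    return sum(w * g for w, g in zip(weights, gradients))


def direction_recursive(gradients, beta1: float, beta2: float) -> float:
    """Same quantity from two running states (constant cost per step)."""
    m = n = 0.0
    for g in gradients:
        m = beta1 * m + (1.0 - beta1) * g
        n = beta2 * n + (1.0 - beta2) * m
    return n
```

If the two forms agree, as they do under this assumption, stating the recursive update directly in the algorithm listing would support the claim that the cost is of the same order as Adam/AdamW.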

The hyperparameter selection appears strongly biased in favor of AdamN, as indicated in Table 3. AdamN benefits from a ramp on β₂ and the use of AMP, while AdamW and SGD do not appear to benefit from similar systematic measures. This does not make for a fair comparison.

The claim that the best hyperparameters are selected for each optimizer (lines 512–513) must be supported by a description of the search space, the optimization strategy (grid/random/manual), the number of runs per optimizer, the optimization budget, the stopping criterion, and whether multiple seeds were used.

The experiments reported in Tables 4 and 5 lack confidence intervals and statistical tests to assess the robustness of the reported results.
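One way such intervals could be reported is sketched below. This is an illustration only, using a percentile bootstrap over paired per-seed differences with the Python standard library; the function name and parameters are hypothetical, and the authors may prefer a different procedure such as a paired t-test.

```python
import random
import statistics


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=10_000,
                        confidence=0.95, seed=0):
    """Percentile-bootstrap CI for the mean paired difference (a - b).

    scores_a and scores_b hold one metric value per seed, in matching order.
    Returns (mean_difference, lower_bound, upper_bound).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    # Resample the per-seed differences with replacement and record each mean.
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=len(diffs)))
        for _ in range(n_resamples)
    )
    alpha = 1.0 - confidence
    lo_idx = max(0, int(alpha / 2 * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return statistics.fmean(diffs), means[lo_idx], means[hi_idx]
```

Per-seed results for two optimizers would be passed as equal-length lists. An interval that excludes zero would suggest the difference is not explained by seed variation alone, although with only a handful of seeds the bootstrap interval is itself coarse.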

The proposed approach deserves further critical discussion, as the application claims presented in the abstract are addressed only marginally in the manuscript.

Author Response

Please see the uploaded file for the response to the reviewer.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The authors introduce "AdamN," a new adaptive optimizer that incorporates a nested momentum structure (EMA of EMA) combined with an exact double-EMA bias correction factor. The paper claims that this approach creates a smoother, curvature-aware search direction that allows for a "warmup-free" start and faster time-to-quality compared to AdamW.

1. I strongly recommend renaming the paper to something more formal, such as "AdamN: Accelerating Deep Learning Training via Nested Momentum and Exact Bias Handling."
2. While the introduction cites recent competitive optimizers like Lion and Adan, the experimental comparisons in Section 6 are limited to SGD, Adam, and AdamW. Given the claim of "Speeding Up Deep Learning Training," a comparison against other speed-focused optimizers (like Lion) is necessary to validate the superiority of AdamN.
3. The introduction of beta_2 as a nested momentum parameter changes the role of the traditional second-moment coefficient (now beta_3). This adds complexity to the hyperparameter search space for end-users compared to the standard AdamW configuration.
4. The ablation study (Table 19) shows that "nested-on" performance degrades significantly at high beta_2 values (e.g., 0.95), performing worse than "nested-off". The limitations section briefly mentions this, but the result suggests that the method is sensitive to this new hyperparameter.
5. In Figures 5, 6, and 7, the axis labels and legends are quite small. Please increase the font size for better readability.
6. Regarding Table 19, please expand the discussion on why the nested momentum fails at high beta_2. Providing a "safe" default range for beta_2 in the conclusion would be very helpful for practitioners.

Author Response

Please see the uploaded file for the response to the reviewer.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

Please see the attachment.

Comments for author File: Comments.pdf

Author Response

Please see the uploaded file for the response to the reviewer.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

All the concerns were sufficiently addressed.

Reviewer 2 Report

Comments and Suggestions for Authors

I thank the authors for the high quality of their work. They have done an excellent job, and all my comments have been successfully addressed. The paper is now ready for publication.