This section is structured around the three research questions stated in the Introduction. For RQ1, under both fairness regimes, dense historical connectivity does not yield a consistent perplexity improvement over the baseline decoder: on WT2 the baseline remains better, while on PTB the gap is small and regime-dependent. For RQ2, the ablations indicate that most of the performance variation is explained by capacity allocation (in particular depth and feed-forward width) rather than by connectivity alone. For RQ3, Zipf–RQA of long-form generations reveals systematic family-wise shifts in long-range structural descriptors, even when likelihood-based metrics do not improve.
4.1. Experimental Setting
The experimental campaign was designed to systematically assess the performance of the proposed Dense Transformer Decoder against a standard Transformer Decoder baseline. General architectural details, training procedures, and dataset preprocessing steps are provided in Section 3. Here, we summarize the specific configurations, search protocols, and evaluation criteria relevant to the two grid search setups adopted in this study (Table 2).
Table 2.
Compact training configuration (shared across runs unless stated otherwise). This table summarizes the optimization and reproducibility settings used for both baseline and dense models.
| Item | Value |
|---|---|
| Loss | Next-token cross-entropy (Equations (1) and (2)). |
| Optimizer | AdamW (, , , weight decay ). |
| Learning rate | Grid ; best selected by validation loss. |
| LR schedule | StepLR (step size epoch, ); no warmup. |
| Gradient clipping | Global L2 norm clipped at . |
| Dropout | (applied to embedding+positional sum and within attention/FFN sublayers). |
| Context length | tokens (fixed sliding window; no padding). |
| Batching | Batch size 128 (total); contiguous windows from a concatenated token stream. |
| Epochs | 15. |
| Random seed | 42 (controls shuffling and initialization). |
| Precision | AMP for training (autocast + GradScaler); evaluation in standard precision. |
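The batching row above (contiguous, fixed-length windows cut from a concatenated token stream, with no padding) can be illustrated with a minimal sketch. The function name `contiguous_windows`, the `stride` parameter, and the choice to drop the incomplete tail are assumptions made for illustration; the source does not state the stride or how the final partial window is handled.

```python
from typing import Iterator, List, Tuple


def contiguous_windows(
    stream: List[int], context_len: int, stride: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (inputs, targets) pairs of exactly `context_len` tokens each.

    Targets are the inputs shifted by one position (next-token prediction).
    Trailing tokens that cannot fill a full window are dropped (assumption),
    so no padding is ever required.
    """
    if context_len <= 0 or stride <= 0:
        raise ValueError("context_len and stride must be positive")
    # One extra token is needed after the window to form the last target.
    last_start = len(stream) - context_len - 1
    for start in range(0, last_start + 1, stride):
        inputs = stream[start : start + context_len]
        targets = stream[start + 1 : start + context_len + 1]
        yield inputs, targets


# Toy token ids standing in for a concatenated corpus (not real data).
toy_stream = list(range(10))
windows = list(contiguous_windows(toy_stream, context_len=4, stride=4))
for inputs, targets in windows:
    print(inputs, "->", targets)
```

With `stride` equal to `context_len` the windows do not overlap; a smaller stride would give overlapping sliding windows at a higher training cost.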
We evaluated the models on two widely used benchmark datasets for language modeling:
PTB [27]: a syntactically annotated corpus of American English, preprocessed following standard tokenization and rare-word handling conventions [28].
WT2 [29]: a corpus extracted from Wikipedia articles, characterized by a larger vocabulary and longer-range dependencies than PTB.
As noted in Section 3.1, both datasets use the official training, validation, and test splits provided by their authors, without modification.
Performance was measured in terms of test PPL, the standard metric for autoregressive language modeling [12], where lower values indicate better generalization. Each model was trained and evaluated under both the hyperparameters and parameters fairness regimes (Section 3.3). In addition to PPL, we assess linguistic generalization through targeted probes [30,31].
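For reference, test PPL is the exponential of the mean next-token cross-entropy. The sketch below shows that relation under the assumption that per-token log-probabilities are in natural log; the function name `perplexity` and the toy vocabulary size are illustrative and not taken from the paper's code.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Return exp(mean negative log-likelihood) over the given tokens.

    `token_log_probs[i]` is the natural-log probability the model assigned
    to the i-th reference token.
    """
    if not token_log_probs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Sanity check: a model that is uniform over a (hypothetical) vocabulary of
# V words has perplexity V, regardless of the text.
vocab_size = 1000
uniform_ppl = perplexity([math.log(1.0 / vocab_size)] * 50)
print(round(uniform_ppl, 6))
```

This also explains why word-level PPL values are large in absolute terms: the quantity scales with the effective number of word choices per position.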
For the long-form analysis on WT2 (hyperparameters regime), we generate
texts per family from a fixed prompt (
“The future of language models is”) using top-
k sampling (
, temperature
), with a target length of 8000 words per sample. We use the best WT2 checkpoints under this regime, baseline
run_005 and dense
run_059 (
Table 3). Zipf–RQA is computed on word-level Zipf-rank series with windowed RQA over four overlapping windows (20% overlap), fixing the recurrence rate to 20% for comparability. Representative recurrence plots in
Section 4.5 (window 0) are selected by a median-case criterion: within each family, we choose the sample whose DET/LAM ratio is closest to the family median.
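Three steps of the protocol above can be sketched in a few lines: mapping words to Zipf ranks, choosing a recurrence threshold that fixes the recurrence rate, and selecting the median-case sample. This is a simplified illustration, not the pipeline used for the reported results: it omits windowing and the DET/LAM computation, uses absolute rank differences without embedding, and breaks frequency ties by first appearance. All function names and toy inputs are hypothetical.

```python
import math
from collections import Counter
from statistics import median
from typing import Dict, List, Sequence


def zipf_rank_series(words: Sequence[str]) -> List[int]:
    """Replace each word by its frequency rank in the text (1 = most frequent).

    Ties are broken by first appearance (assumption): Counter keeps insertion
    order and sorted() is stable.
    """
    counts = Counter(words)
    ordered = sorted(counts, key=lambda w: -counts[w])
    rank = {w: i + 1 for i, w in enumerate(ordered)}
    return [rank[w] for w in words]


def threshold_for_rate(series: List[int], target_rr: float) -> float:
    """Distance threshold taken as the `target_rr` quantile of pairwise
    distances |x_i - x_j| (i < j), so the recurrence rate is at least
    `target_rr`; ties between discrete ranks can push it higher."""
    dists = sorted(abs(a - b) for i, a in enumerate(series) for b in series[i + 1:])
    if not dists:
        raise ValueError("need at least two points")
    k = max(1, math.ceil(target_rr * len(dists)))
    return dists[k - 1]


def recurrence_rate(series: List[int], eps: float) -> float:
    """Fraction of pairs (i < j) whose distance is within `eps`."""
    n = len(series)
    hits = sum(
        1 for i in range(n) for j in range(i + 1, n)
        if abs(series[i] - series[j]) <= eps
    )
    return hits / (n * (n - 1) / 2)


def median_case(ratios: Dict[str, float]) -> str:
    """Sample id whose DET/LAM ratio is closest to the family median."""
    med = median(ratios.values())
    return min(ratios, key=lambda s: abs(ratios[s] - med))


ranks = zipf_rank_series("the cat saw the dog and the cat ran".split())
eps = threshold_for_rate(ranks, target_rr=0.20)
rr = recurrence_rate(ranks, eps)
# Hypothetical DET/LAM ratios for three samples of one family.
chosen = median_case({"s0": 0.80, "s1": 0.95, "s2": 1.10})
print(ranks, eps, round(rr, 3), chosen)
```

Fixing the recurrence rate instead of the threshold makes line-based descriptors comparable across samples whose rank distributions differ in spread.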
All experiments share the same optimization settings:
, dropout
, epochs 15, batch size 128, and context length
(BPTT truncation length). Architectural parameters vary by grid and fairness regime as detailed below, with the regime nomenclature summarized in
Table 1 (
Section 3.3).
For both regimes, baseline models in Grid A use the fixed architecture given in Equation (18).
The baseline is tuned only over the LR grid above.
Under the hyperparameters regime in Grid A, dense models explore only a limited architectural subset. All configurations shared
and
, while depth and feed-forward width were varied according to
This search space is intentionally narrow: fixing the representation scale and the head setting reduces the degrees of freedom and the computational cost, and it aims to isolate the effect of dense historical connectivity under a controlled training recipe instead of performing an unconstrained architecture search. Potential confounds introduced by these fixed architectural choices (e.g., head count and positional encoding differences in Grid A) are addressed via explicit control checks in the WT2 results (Section 4.2). Grid C then provides a broader capacity-reallocation ablation that also varies the model width and the head count h to contextualize the Grid A findings.
In the
parameters regime in Grid A, dense models are resized to not exceed the baseline parameter count (
Section 3.3). We iteratively reduce
subject to
and
until the dense budget constraint is satisfied.
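The resizing procedure can be pictured as a greedy loop that shrinks the model until its parameter count fits the baseline budget. The sketch below is only an illustration of that idea. Which quantities are reduced, the lower bounds, the step size, and the parameter-count formula are all assumptions (the formula ignores biases, normalization layers, and any projections specific to dense connectivity), and every name is hypothetical.

```python
from typing import Tuple


def approx_param_count(vocab: int, d_model: int, n_layers: int, d_ff: int) -> int:
    """Rough decoder size: token embeddings plus, per layer, attention
    projections (4 * d_model^2) and a two-matrix feed-forward block.
    Simplified on purpose (assumption); not the paper's exact count."""
    per_layer = 4 * d_model * d_model + 2 * d_model * d_ff
    return vocab * d_model + n_layers * per_layer


def shrink_to_budget(
    budget: int, vocab: int, d_model: int, n_layers: int, d_ff: int,
    min_layers: int, min_d_ff: int, ff_step: int,
) -> Tuple[int, int]:
    """Reduce d_ff first, then depth, until the count fits the budget.

    The reduction order is an illustrative choice, not the source's rule.
    """
    while approx_param_count(vocab, d_model, n_layers, d_ff) > budget:
        if d_ff - ff_step >= min_d_ff:
            d_ff -= ff_step
        elif n_layers - 1 >= min_layers:
            n_layers -= 1
        else:
            raise ValueError("budget cannot be met within the given bounds")
    return n_layers, d_ff


# Hypothetical numbers chosen only to exercise the loop.
layers, ff = shrink_to_budget(
    budget=25_000_000, vocab=10_000, d_model=512, n_layers=8, d_ff=3072,
    min_layers=4, min_d_ff=1024, ff_step=512,
)
print(layers, ff, approx_param_count(10_000, 512, layers, ff))
```

Because the loop stops at the first configuration that fits, the resized model can land well below the budget, which is consistent with the dense parameter counts in the parameters regime being smaller than the baseline's.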
Two configurations structure the experimental campaign. Grid A (primary) includes both fairness regimes using the fixed baseline configuration in Equation (18) and the dense settings described above. Grid C (ablation) is a broader capacity-reallocation study under the hyperparameters regime, and its architectural search space is defined by Equation (20).
All optimization hyperparameters coincide with those used in Grid A. For the depth–PPL summary in Grid C (
Figure 2), the
subset uses
, while the
subset spans Equation (
20).
To ensure reproducibility, all runs were initialized with fixed random seeds, which control the sources of stochasticity in dataset shuffling, parameter initialization, and parallel computation, following best practices for deterministic deep learning experiments [18]. This setup provides a controlled and reproducible framework for comparing the two architectures under both the same training recipe and same parameter budget regimes, and a consistent basis for the analyses presented in the following sections.
4.2. Quantitative Results (Grid A)
This subsection addresses RQ1 by comparing baseline and dense decoders under the two fairness regimes, and it supports RQ2 by inspecting how outcomes vary across the explored dense configurations. We report test PPL for the proposed dense decoder and the baseline across datasets and fairness regimes (see Section 3.3 and Table 1). These results come from Grid A, our primary experimental setup, which covers both the hyperparameters regime and the parameters (parameter-budget) regime, using the search space and selection criteria in Section 4.1. Unless stated otherwise, all figures and tables in this subsection refer to the best checkpoint per configuration, selected on validation loss according to Section 3.4.
Figure 2.
Grid C (dense, hyperparameters): test PPL vs. depth L for PTB and WT2, stratified by number of heads (). Panels: (a) PTB, ; (b) WT2, ; (c) PTB, ; (d) WT2, . Points are colored/marked by model width in the panels; in the panels, all points correspond to . Each point corresponds to one trained configuration in the grid.
For an overview across datasets and regimes, Table 3 summarizes the best PPL achieved on PTB and WT2 under both the hyperparameters regime and the parameters regime (see Section 3.3). On PTB, the dense variant yields a small improvement under the hyperparameters regime (322.69 vs. 325.13), while the baseline remains better under the parameter-budget constraint (325.13 vs. 328.61). On WT2, the baseline is better in both regimes, indicating that dense historical connections do not straightforwardly translate into lower perplexity in this decoder-only, fixed-context setting.
From the perspective of RQ2, the Grid A ablations suggest a simple takeaway. Within our controlled comparisons, capacity-allocation knobs such as depth (L) and feed-forward width are the factors that reliably shift perplexity, whereas introducing dense historical connectivity by itself does not yield a consistent perplexity gain across datasets and fairness regimes. This is visible not only in the best-checkpoint comparisons (Table 3), but also in the distribution of outcomes across the explored dense configurations (Figure 3), which shows that the dense decoder remains worse than the tuned baseline on WT2 throughout the explored design space.
Table 3.
Best test PPL per dataset, fairness regime, and architecture. Full configurations are reported in
Appendix A.
| Dataset | Regime | Model | Params | PPL | Best Run ID |
|---|---|---|---|---|---|
| PTB | hyperparams | baseline | 71.95 M | 325.13 | run_001 |
| PTB | hyperparams | dense | 118.68 M | 322.69 | run_023 |
| PTB | params | baseline | 71.95 M | 325.13 | run_003 |
| PTB | params | dense | 46.87 M | 328.61 | run_029 |
| WT2 | hyperparams | baseline | 158.23 M | 726.14 | run_005 |
| WT2 | hyperparams | dense | 204.96 M | 770.74 | run_059 |
| WT2 | params | baseline | 158.23 M | 726.14 | run_007 |
| WT2 | params | dense | 134.69 M | 766.54 | run_071 |
Importantly, Figure 3 complements the best-checkpoint comparison by reporting the full distribution of dense outcomes over the explored design space. Because the PPL difference (dense minus baseline) is referenced to the best baseline setting within Grid A (baseline fixed architecture with a small LR grid), the figure directly tests whether the dense topology can outperform a tuned baseline under the same dataset and regime. On WT2, this difference remains positive across the explored dense design space in both regimes, indicating that the observed gap is not driven by a single outlier configuration. On PTB, a small subset of dense configurations approaches (or slightly improves upon) the tuned baseline under the hyperparameters regime, consistent with the narrower gap observed in Table 3.
Given its larger vocabulary and longer-range dependencies, WT2 is our primary testbed for figures and probes. Since our setup uses word-level tokenization (Section 3.1), the resulting perplexity values are expected to be numerically higher than subword/BPE-based reports and should be interpreted within this evaluation setup.
Table 4 reports absolute and relative differences between dense and baseline under both regimes. The dense decoder is worse in PPL by 44.60 points (6.14%) in the hyperparameters regime, and by 40.40 points (5.56%) under the parameters regime (parameter-budget constraint). These deltas suggest that, within the decoder-only causal setting, exposing every layer to the full representational history (as described in Section 3.2) does not yield a lower language-modeling perplexity on WT2 when controlling either for optimization settings or for a fixed parameter budget.
Control checks for architectural confounds (WT2, hyperparameters): Grid A comparisons may be confounded by architectural differences beyond connectivity, notably the attention-head count (baseline
vs. dense
) and the positional encoding choice (baseline sinusoidal vs. dense learned absolute embeddings). To quantify the magnitude of these effects in isolation, we ran two minimal baseline controls under the same WT2
hyperparameters regime recipe as the original baseline (
Table 4): (i) a head-matched baseline with
, achieving 723.41 test PPL, and (ii) a positional-encoding-matched baseline using learned absolute positional embeddings, achieving 727.06 test PPL. Both controls are close to the original baseline (726.14) and remain substantially below the dense Grid A WT2 perplexity (770.74), suggesting that these specific confounds do not explain the WT2 gap observed for the dense configuration.
Table 4.
WT2 deltas (Dense minus Baseline) under both fairness regimes.
| Regime | Baseline PPL | Dense PPL | ΔPPL | Δ% |
|---|---|---|---|---|
| hyperparams | 726.14 | 770.74 | 44.60 | 6.14 |
| params | 726.14 | 766.54 | 40.40 | 5.56 |
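The two delta columns follow directly from the PPL values in the table: the absolute delta is dense minus baseline, and the relative delta divides that difference by the baseline PPL. The short check below recomputes them from the tabulated values; the function name `ppl_delta` is illustrative.

```python
from typing import Dict, Tuple


def ppl_delta(baseline: float, dense: float) -> Tuple[float, float]:
    """Return (absolute delta, relative delta in percent), dense minus baseline."""
    abs_delta = dense - baseline
    rel_delta = 100.0 * abs_delta / baseline
    return abs_delta, rel_delta


# (baseline PPL, dense PPL) per regime, copied from Table 4.
table4: Dict[str, Tuple[float, float]] = {
    "hyperparams": (726.14, 770.74),
    "params": (726.14, 766.54),
}

deltas = {
    regime: tuple(round(v, 2) for v in ppl_delta(base, dense))
    for regime, (base, dense) in table4.items()
}
for regime, (abs_d, rel_d) in deltas.items():
    print(f"{regime}: +{abs_d:.2f} PPL (+{rel_d:.2f}%)")
```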
To complement likelihood-based evaluation, we also report cloze-style probes on WT2 under the
hyperparameters regime (see
Section 3.4). These probes are intended as lightweight behavioral diagnostics, not as definitive measures of linguistic competence. In particular, Top-
k success (Top-1/Top-5) is a strict criterion in a word-level setup with a large vocabulary, and small lexical or contextual variations can move the expected continuation outside the top ranks even when it receives non-negligible probability mass. Accordingly, we emphasize rank and
as the more informative signals for these probes. As summarized in
Table 5 and
Table 6, Top-1/Top-5 success is zero across the reported probe families at context length
, which is consistent with the rank statistics (median ranks remain above 5). Within this evaluation setting, the baseline tends to assign higher probability to the expected tokens in FACT and CTX, consistent with the PPL gap in
Table 4.
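The difference between Top-k success and rank-based signals can be made concrete with a small sketch. Given a next-token distribution and the expected token, it reports the token's rank, its log-probability, and Top-1/Top-5 success. The function name, the toy distribution, and the use of log-probability as the second signal are assumptions for illustration; they are not the probe implementation from Section 3.4.

```python
import math
from typing import Dict, Sequence, Union


def probe_stats(
    probs: Sequence[float], target: int, ks: Sequence[int] = (1, 5)
) -> Dict[str, Union[int, float, bool]]:
    """Rank (1 = most probable), log-probability, and Top-k success of the
    expected token under a next-token distribution `probs`."""
    p_target = probs[target]
    # Rank = 1 + number of tokens that are strictly more probable.
    rank = 1 + sum(1 for p in probs if p > p_target)
    stats: Dict[str, Union[int, float, bool]] = {
        "rank": rank,
        "log_prob": math.log(p_target),
    }
    for k in ks:
        stats[f"top{k}"] = rank <= k
    return stats


# Toy distribution over a 7-word vocabulary (made-up numbers, sums to 1).
toy_probs = [0.30, 0.25, 0.15, 0.10, 0.08, 0.07, 0.05]
stats = probe_stats(toy_probs, target=5)
print(stats)
```

In this toy case the expected token receives 7% of the probability mass yet fails both Top-1 and Top-5, which is the situation the paragraph above describes: Top-k success can be zero while rank and probability still separate two models.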
Figure 4 compares evaluation loss/PPL summaries of the best baseline (run_005) and dense (run_059) models on WT2 under the
same training recipe (hyperparameters) regime. The baseline achieves a lower test PPL.
Figure 5 qualitatively compares last-layer self-attention heatmaps on WT2 between the two models on the same fixed input and context length. Both allocate substantial mass to recent tokens (consistent with causal masking and local predictive cues). However, the dense model exhibits a more heterogeneous pattern with sharper off-diagonal hotspots, whereas the baseline shows a more regular near-diagonal decay. To complement the illustrative heatmaps with a lightweight summary statistic,
Table 7 reports two batch-averaged metrics computed on the WT2 test split: normalized attention entropy (lower indicates a more peaked distribution) and diagonal mass (higher indicates stronger self-token attention). Under this setting, the dense model shows lower entropy and higher diagonal mass, consistent with a more concentrated attention profile.
Figure 6 shows the training/validation learning curves (loss and perplexity) for the best baseline and dense models on WT2, highlighting the convergence dynamics and stability across epochs.
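The two attention summaries can be sketched for a single causal attention matrix as follows. The normalization used here (row entropy divided by the log of the number of visible keys, skipping the first row) is one plausible choice and an assumption; the source does not spell out its exact normalization or batch averaging. The function name and toy matrices are hypothetical.

```python
import math
from typing import List, Tuple


def attention_summary(attn: List[List[float]]) -> Tuple[float, float]:
    """Return (mean normalized entropy, mean diagonal mass).

    attn[i][j] is the weight query i places on key j under causal masking
    (j <= i), with each row summing to 1. Row i's entropy is divided by
    log(i + 1), its maximum over i + 1 visible keys. Row 0 is skipped for
    entropy because it can only attend to itself.
    """
    n = len(attn)
    entropies = []
    for i in range(1, n):
        h = -sum(p * math.log(p) for p in attn[i][: i + 1] if p > 0.0)
        entropies.append(h / math.log(i + 1))
    mean_entropy = sum(entropies) / len(entropies) if entropies else 0.0
    diagonal_mass = sum(attn[i][i] for i in range(n)) / n
    return mean_entropy, diagonal_mass


n = 4
# Uniform attention over visible keys: maximally spread.
uniform = [[1.0 / (i + 1) if j <= i else 0.0 for j in range(n)] for i in range(n)]
# All mass on the current token: maximally peaked.
peaked = [[1.0 if j == i else 0.0 for j in range(n)] for i in range(n)]

uniform_entropy, uniform_diag = attention_summary(uniform)
peaked_entropy, peaked_diag = attention_summary(peaked)
print(uniform_entropy, uniform_diag, peaked_entropy, peaked_diag)
```

The two toy matrices bracket the possible range: a more concentrated attention profile moves toward lower entropy and, when the concentration is on the current token, higher diagonal mass.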
Across datasets and regimes, dense historical connectivity does not consistently improve perplexity. Under the hyperparameters regime, dense yields a small gain on PTB but degrades on WT2; under the parameters regime (parameter-budget constraint), the baseline remains better on both datasets (Table 3). On WT2, where we focus figures and probes, we find consistent PPL gaps in favor of the baseline (Table 4) and no evidence of improved cloze-style probe behavior in our limited probe set (Table 5 and Table 6). Taken together, these findings from Grid A motivate a more targeted architectural ablation in Grid C to identify which capacity-allocation dimensions most affect dense-decoder performance.
4.3. Ablation Study (Grid C)
This subsection addresses RQ2 by isolating how dense-decoder perplexity varies with capacity-allocation factors (depth, width, feed-forward size, and head count) under a controlled training recipe. We now ablate architectural factors within the Dense Transformer Decoder using the alternative search configuration of Grid C (see
Section 4.1). To isolate the effect of single factors, we vary one dimension at a time under the
hyperparameters regime, keeping all other settings fixed. Test PPL is reported for each sweep. PTB one-factor tables are reported in detail below, and we additionally provide a compact WT2 one-factor summary to better localize the WT2 mechanism under matched controls.
Figure 2 complements these tables with a cross-dataset depth–PPL view over the full grid. It also supports a robustness perspective: across the full grid (not only the best configuration), deeper models tend to yield lower PPL, with a clear dispersion that reflects interactions with width and head count. This motivates emphasizing depth and feed-forward capacity as primary levers for dense-decoder performance under the tested optimization envelope, while treating width-only scaling as less predictive in isolation.
Grid C yields the best models summarized in
Table 8. The top configuration uses
with a deeper stack (
), more heads (
), and a wider feed-forward block (
), achieving 320.85 PPL. Close variants that reduce
or depth remain competitive, indicating that (within this regime) increasing depth and feed-forward capacity is a robust driver of better PTB perplexity for the dense decoder.
Varying width at fixed
,
,
yields only modest and non-monotonic changes (
Table 9). Increasing the width from 512 to 640 slightly improves PPL (331.84 to 330.93), while larger widths do not improve it further under this setting. This suggests that, for the dense decoder under this optimization envelope, depth and feed-forward scaling are more reliable levers than width alone.
Depth produces the strongest and most consistent improvements at
,
,
(
Table 10). Increasing layers from 4 to 6 to 8 to 10 steadily lowers PPL from 354.87 to 343.54 to 333.64 to 322.69, supporting the view that dense historical access benefits from deeper stacks where representations can be progressively composed and reused.
Changing the number of heads while keeping
,
,
yields an improvement when moving from
to
(
Table 11). Despite reducing per-head dimensionality, additional heads may help the dense model represent diverse token dependencies more effectively under this setting.
Finally, scaling the feed-forward block at
,
,
yields steady improvements (
Table 12). Increasing the feed-forward width from 2048 to 3584 lowers PPL from 335.17 to 331.13, consistent with the MLP being a primary channel for token-wise nonlinear transformation in word-level language modeling.
On WT2, Table 13 helps localize the failure mechanism more precisely. Depth remains the strongest improving factor within the dense family (from 857.62 at L = 4 to 770.74 at L = 10 under matched controls), whereas the width and feed-forward sweeps are non-monotonic in this setting and increasing heads from 8 to 12 does not improve PPL (788.07 vs. 792.50). This indicates that WT2 behavior is still primarily governed by capacity-allocation effects rather than by dense connectivity alone. Importantly, even the best one-factor WT2 setting in this controlled slice (770.74) remains above the baseline WT2 result (726.14, Table 3).
Within Grid C, depth is the dominant driver of improved PPL for the dense decoder, with dataset-dependent contributions from feed-forward width and head count, while width-only scaling is weaker and often non-monotonic. These ablations contextualize Grid A by showing that denser cross-layer access can benefit from deeper composition, but does not by itself close the baseline–dense gap on WT2 under comparable training regimes.
Table 13.
WT2 one-factor ablation summary (Grid C, dense, hyperparameters). Each row reports the best PPL obtained for each tested value while fixing the remaining architectural factors as indicated.
| Factor | Fixed Setting | Best PPL by Tested Value |
|---|---|---|
| Width | | 512: 774.51; 640: 774.75; 768: 788.07; 832: 790.55 |
| Depth (L) | | 4: 857.62; 6: 807.13; 8: 788.07; 10: 770.74 |
| Heads (h) | | 8: 788.07; 12: 792.50 |
| Feed-forward width | | 2048: 814.79; 2560: 799.59; 3072: 788.07; 3584: 809.54 |
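A one-factor summary of this kind can be derived from a flat list of grid runs by keeping only the runs that match the fixed setting and then taking the best PPL per tested value. The sketch below illustrates that selection. The record fields (`d_model`, `layers`, `heads`, `d_ff`, `ppl`) and all numbers are synthetic placeholders, not results or settings from the paper.

```python
from typing import Dict, List, Union

Run = Dict[str, Union[int, float]]


def one_factor_sweep(
    runs: List[Run], factor: str, fixed: Dict[str, int]
) -> Dict[int, float]:
    """Best (lowest) PPL per value of `factor`, restricted to runs whose
    other architectural factors equal the `fixed` setting."""
    best: Dict[int, float] = {}
    for run in runs:
        if all(run[name] == value for name, value in fixed.items()):
            value = int(run[factor])
            best[value] = min(float(run["ppl"]), best.get(value, float("inf")))
    return dict(sorted(best.items()))


# Synthetic runs for illustration only.
toy_runs: List[Run] = [
    {"d_model": 512, "layers": 4, "heads": 8, "d_ff": 2048, "ppl": 400.0},
    {"d_model": 512, "layers": 6, "heads": 8, "d_ff": 2048, "ppl": 390.0},
    # Same architecture as the previous run, e.g. a different learning rate.
    {"d_model": 512, "layers": 6, "heads": 8, "d_ff": 2048, "ppl": 385.0},
    # Excluded from the depth sweep because the width differs.
    {"d_model": 640, "layers": 6, "heads": 8, "d_ff": 2048, "ppl": 380.0},
]

depth_sweep = one_factor_sweep(
    toy_runs, factor="layers", fixed={"d_model": 512, "heads": 8, "d_ff": 2048}
)
print(depth_sweep)
```

Filtering on the fixed setting before aggregating is what keeps each row a one-factor comparison; without it, the fourth toy run would be credited to depth 6 even though it also changes the width.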