Figure 1.
Tile-level architecture. Each tile contains a compute core, router, and embedded PPO-based DRL agent (12 KB SRAM, <10 ns inference) placed on the route computation (RC) stage, ensuring per-flit next-hop decisions are available before switch allocation without adding pipeline bubbles. Cardinal electrical links (solid) and diagonal photonic express links (dashed) are shown.
Figure 2.
PPO-DRL agent memory profile (∼12 KB SRAM per router; weights ≈ 60%, activations ≈ 25%, control FSM ≈ 15%).
Figure 3.
Conceptual flow from workloads to traffic, local observation, DRL action, and network-level effects.
Figure 4.
Convergence of decentralized PPO agents. Average episode reward stabilizes as policies mature, reflecting effective diagonal offload and safe fallback behavior.
Figure 5.
Synthetic traffic: latency versus injection rate . Means over five seeds on a mesh with .
Figure 6.
PARSEC 3.0: latency versus injection rate . Means over five seeds on a mesh with .
Figure 7.
SPLASH-2: latency versus injection rate . Means over five seeds on a mesh with .
Figure 8.
AI workloads: latency versus injection rate . Means over five seeds on a mesh with .
Figure 9.
Synthetic traffic: energy per delivered bit (pJ/bit) versus injection rate . Means over five seeds on a mesh with .
Figure 10.
PARSEC 3.0: energy per delivered bit (pJ/bit) versus injection rate . Means over five seeds on a mesh with .
Figure 11.
SPLASH-2: energy per delivered bit (pJ/bit) versus injection rate . Means over five seeds on a mesh with .
Figure 12.
AI workloads: energy per delivered bit (pJ/bit) versus injection rate . Means over five seeds on a mesh with .
Figure 13.
Synthetic traffic: throughput (packets/cycle) versus injection rate . Means over five seeds on a mesh with .
Figure 14.
PARSEC 3.0: throughput (packets/cycle) versus injection rate . Means over five seeds on a mesh with .
Figure 15.
SPLASH-2: throughput (packets/cycle) versus injection rate . Means over five seeds on a mesh with .
Figure 16.
AI workloads: throughput (packets/cycle) versus injection rate . Means over five seeds on a mesh with .
Figure 17.
Steady-state input-buffer occupancy for three representative workloads. Rows correspond to , , and meshes. Columns correspond to the following routing schemes: Proposed (PPO-based DRL), DRLAR, DeepNR, ARCA, West-First, and XY. A single colorbar per workload block is used to enable direct visual comparison across mesh sizes.
Figure 18.
Throughput versus energy per delivered bit (pJ/bit) by suite. Points are class-averaged per suite (Irregular/Bursty, Synchronization-Heavy, Streaming, Low-Contention); means are over five seeds. Relative to the best electronic DRL baseline, the Proposed (PPO-based DRL) method typically achieves lower energy at matched throughput and higher throughput at matched energy (on the order of 2–5 pJ/bit and 0.02–0.04 pkts/cycle, respectively).
Figure 19.
Aggregate throughput versus mesh size (, , ) for PARSEC 3.0 (a), SPLASH-2 (b), and AI (c). Suite-level means over workloads and five seeds; .
Figure 20.
Energy per delivered bit (pJ/bit) at on a mesh for PARSEC 3.0, SPLASH-2, and AI (left to right). Grouped bars per compare Proposed (PPO-based DRL) with DRLAR. DRLAR is electronic-only and therefore W-invariant; values are replicated across W for bars-only visualization. Means are over five seeds.
Table 1.
Taxonomy of HNoC and DRL routing approaches.
| Approach (Ref.) | Topology | Photonic-Aware | Thermal Strategy | Control | Scale Readiness |
|---|---|---|---|---|---|
| Photonic express HNoC [7] | Mesh with optical express | Yes | Device-level tuning assumed | Deterministic/ Adaptive | Prototype scale |
| Butterfly fat tree hybrid [8] | Butterfly fat tree overlay | Yes | Not explicit | Deterministic | Prototype scale |
| HyPPI concepts [11] | Hybrid plasmonic–photonic | Yes | Not explicit | N/A (device focus) | Early-stage |
| TAFT thermal-aware routing [9] | Mesh hybrid | Yes | Thermal-aware routing | Deterministic | Prototype scale |
| Topology-aware scaling [10] | Mesh hybrid | Yes | Topology and thermal scaling | Deterministic | Prototype scale |
| Thermal Q-learning for optical NoC [13] | Optical routing | Yes | Thermal-aware Q-learning | Centralized or LUT-local | Small testbeds |
| Table-free thermal RL [14] | Optical routing | Yes | Thermal-aware, table-free | Local inference | Small testbeds |
| RL for RWA in optical NoC [15] | Optical RWA | Yes | OSNR and latency constraints | Centralized RL | Small to medium |
| DRLAR [3] | Electronic mesh | No | Not modeled | Centralized training | Medium scale |
| DeepNR [5] | Electronic mesh | No | Not modeled | Centralized training | Medium scale |
| Q-RASP [4] | Electronic mesh | No | Not modeled | Region-coordinated | Medium scale |
| RELAR [12] | Electronic mesh | No | Not modeled | Centralized training | Medium scale |
| Proposed decentralized PPO in HNoC | Mesh with diagonal photonic express links | Yes | Thermal/link validity in state | Fully decentralized (per router) | Mesh-scale |
Table 2.
Hardware resource summary for the router-local controller.
| Component | Value | Notes |
|---|---|---|
| MLP parameters | <6000 | Shared trunk; policy/value heads |
| Memory footprint | ∼12 KB | Quantized weights and activations; small FSM state |
| Inference latency | <10 ns | Single-cycle routing stage; fully pipelined |
| Area estimate | ∼35 K GE | Post-synthesis at 28 nm CMOS (controller only) |
| Thermal adaptivity | Enabled | Validity bit in state; masked action sampling |
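As a rough consistency check on the <6000-parameter and ∼12 KB figures above, the parameter count of a small shared-trunk MLP can be sketched as follows. The layer widths (a 12-dimensional observation, two 64-unit hidden layers, 5 actions) are illustrative assumptions, not values reported in the table.

```python
# Parameter count for a small shared-trunk MLP with policy and value heads.
# All layer widths below are assumed for illustration; the table only
# bounds the total at <6000 parameters.
def mlp_params(sizes):
    """Weights plus biases for a fully connected stack of the given widths."""
    return sum(i * o + o for i, o in zip(sizes[:-1], sizes[1:]))

obs_dim, hidden, n_actions = 12, 64, 5         # assumed dimensions
trunk = mlp_params([obs_dim, hidden, hidden])  # shared trunk
policy_head = mlp_params([hidden, n_actions])  # policy head
value_head = mlp_params([hidden, 1])           # value head
total = trunk + policy_head + value_head

print(total)                   # 5382, under the 6000 budget
print(round(total / 1024, 2))  # 5.26, i.e., ~5.3 KB at 8-bit weights
```

With 8-bit quantization, such a network fits comfortably within the weights share of the 12 KB SRAM budget.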
Table 3.
Observation vector components and normalization.
| Feature | Range | Notes |
|---|---|---|
| Per-port queues | | Depth divided by max FIFO entries |
| Queue-delta history | | Signed change per cycle; standardized |
| Local injection avg. | | EWMA of injected flits per cycle |
| Hop distance | | Manhattan distance divided by mesh max |
| Optical validity | | 1 if diagonal tuned/reserved; else 0 |
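The normalizations listed above can be sketched as a single assembly function. The FIFO depth of 8 matches the input buffering in Table 8; the mesh diameter of 14 hops and the delta statistics are illustrative assumptions.

```python
def normalize_obs(queue_depths, queue_deltas, inj_ewma, hop_dist, optical_valid,
                  max_fifo=8, mesh_max_hops=14, delta_mean=0.0, delta_std=1.0):
    """Assemble the per-router observation vector of Table 3.
    max_fifo follows the 8-flit FIFOs of Table 8; mesh_max_hops and the
    delta statistics are placeholder assumptions."""
    obs = [q / max_fifo for q in queue_depths]                   # per-port queues -> [0, 1]
    obs += [(d - delta_mean) / delta_std for d in queue_deltas]  # standardized signed deltas
    obs.append(inj_ewma)                                         # EWMA of injected flits/cycle
    obs.append(hop_dist / mesh_max_hops)                         # normalized Manhattan distance
    obs.append(1.0 if optical_valid else 0.0)                    # diagonal tuned/reserved bit
    return obs

print(normalize_obs([4, 8], [1.0, -1.0], 0.25, 7, True))
# [0.5, 1.0, 1.0, -1.0, 0.25, 0.5, 1.0]
```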
Table 4.
Representative PPO hyperparameters.
| Parameter | Value or Setting |
|---|---|
| Discount factor | 0.99 |
| GAE parameter | 0.95 |
| Clip parameter | 0.1–0.2 |
| Policy entropy weight | 0.01 |
| Value loss weight | 0.5 |
| Horizon T | 128–256 steps |
| Optimization epochs K | 4–8 per update |
| Minibatch size M | 256 |
| Optimizer and step size | Adam, |
| Advantage normalization | Enabled per update |
| Gradient clipping | Global norm, 0.5–1.0 |
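The clip parameter above enters the standard PPO clipped-surrogate objective. A minimal per-transition sketch (omitting the value-loss and entropy terms, which are weighted as in the table) might look like:

```python
import math

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean clipped-surrogate loss (negated objective). clip_eps sits in
    the 0.1-0.2 range of Table 4; inputs are per-transition log-probs
    and (already normalized) advantages."""
    losses = []
    for lp_n, lp_o, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_n - lp_o)                          # pi_new / pi_old
        clipped = max(1 - clip_eps, min(ratio, 1 + clip_eps))  # clamp the ratio
        losses.append(-min(ratio * adv, clipped * adv))        # pessimistic bound
    return sum(losses) / len(losses)

print(ppo_clip_loss([0.0], [0.0], [1.0]))  # -1.0 (ratio 1, no clipping)
print(ppo_clip_loss([1.0], [0.0], [1.0]))  # -1.2 (ratio e, clipped at 1.2)
```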
Table 5.
Representative training budget and computational cost (PPO; mesh). Unless noted, the same episode budget and feasibility masking are used across algorithms in stability comparisons.
| Quantity | Value |
|---|---|
| Episodes | 3000 |
| Horizon per episode (T) | 256 steps |
| Routers/agents (mesh ) | 256 |
| Transitions per agent | ≈7.7 × 10^5 |
| Aggregate transitions (all agents) | ≈2.0 × 10^8 |
| Optimizer/learning rate | Adam/ |
| Hardware (offline training) | Dual Intel Xeon 6226R; 256 GB RAM |
| Approx. wall clock (offline PPO training) | ≈24 CPU-hours |
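The transition counts follow directly from the rows above: each of the 256 router-local agents collects episodes × horizon transitions, and the aggregate scales by the agent count. A quick check:

```python
episodes, horizon, agents = 3000, 256, 256  # values from Table 5

per_agent = episodes * horizon   # transitions gathered by one agent
aggregate = per_agent * agents   # transitions across all routers

print(per_agent)   # 768000 (~7.7e5)
print(aggregate)   # 196608000 (~2.0e8)
```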
Table 6.
Comparison of DRL algorithms for NoC routing.
| Algorithm | Stability | Sample Eff. | Remarks |
|---|---|---|---|
| Deep Q-Network (DQN) | Low | Low | Value-based; nonstationary multi-agent replay is brittle [23] |
| Advantage Actor–Critic (A2C) | Medium | Medium | Synchronous updates can require coordination [24] |
| Proximal Policy Optimization (PPO) | High | High | Clipped updates and entropy aid decentralized, high-variance control [22] |
| Deep Deterministic Policy Gradient (DDPG) | Medium | High | Continuous control; often central critic and noise sensitivity [25] |
| Soft Actor–Critic (SAC) | High | High | Strong performance; often improved by synchronized actor–critic [26] |
Table 7.
Stability under identical training budgets (five seeds; , ). Lower is better for variance and mean policy KL; fewer episodes indicate faster stabilization.
| Algorithm | Episodes to Plateau | Post-Plateau Reward Variance | Mean Policy KL Divergence |
|---|---|---|---|
| PPO | ≈2800 | ≈0.9 | ≈0.012 |
| DQN | ≈4100 | ≈3.4 | ≈0.031 |
| A2C | ≈3700 | ≈2.7 | ≈0.024 |
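The mean policy KL column can be read as the average KL divergence between consecutive policy snapshots over the discrete action distribution. A sketch with illustrative probabilities (not measured values):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete action distributions; the policy-shift
    metric reported per update in Table 7."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

print(kl_divergence([0.5, 0.5], [0.5, 0.5]))            # 0.0 (identical policies)
print(round(kl_divergence([0.7, 0.3], [0.6, 0.4]), 4))  # 0.0216 (small shift)
```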
Table 8.
Architectural parameters used across all experiments.
| Parameter | Setting |
|---|---|
| Topology | 2D electronic mesh with diagonal photonic express links |
| Mesh sizes | , , |
| Flow control | Wormhole, credit-based |
| Virtual channels/input | 2 |
| Input buffering | 8-flit FIFOs |
| Electrical datapath | 128-bit flit; 1-hop link latency |
| Technology/freq. | 28 nm CMOS, 1 GHz |
| Photonic WDM | per direction per diagonal |
| Photonic hop semantics | 1-hop equivalent (long-reach bypass) |
| Optical validity | Per-cycle tuned/unavailable bit; enforced via feasibility mask |
| Energy model (electrical) | Orion-style, activity-count based (router + links) |
| Energy model (photonic) | Laser bias, tuning events, link traversal |
| Loss/tuning model | 0.5 dB/cm, 1.2 dB/coupler, 0.1 nm/°C, 1.5 pJ/event |
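The loss/tuning row combines four device constants (0.5 dB/cm waveguide loss, 1.2 dB per coupler, 0.1 nm/°C ring drift, 1.5 pJ per tuning event). How these compose for one diagonal can be sketched as follows; the 2 cm span and two couplers per link are illustrative assumptions, not values from the table.

```python
def link_insertion_loss_db(length_cm, n_couplers,
                           loss_per_cm=0.5, loss_per_coupler=1.2):
    """Static insertion loss of one photonic diagonal (Table 8 constants)."""
    return loss_per_cm * length_cm + loss_per_coupler * n_couplers

def tuning_energy_pj(n_events, pj_per_event=1.5):
    """Thermal tuning energy charged per retuning event."""
    return pj_per_event * n_events

def ring_drift_nm(delta_temp_c, nm_per_c=0.1):
    """Resonance drift for a temperature excursion; past the tuning range,
    the per-cycle optical-validity bit would be deasserted."""
    return nm_per_c * delta_temp_c

print(link_insertion_loss_db(2.0, 2))  # 3.4 dB (assumed 2 cm span, 2 couplers)
print(tuning_energy_pj(4))             # 6.0 pJ for 4 retuning events
print(ring_drift_nm(5.0))              # 0.5 nm for a 5 degC excursion
```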
Table 9.
Workload taxonomy used in analysis.
| Workload | Suite | Comm. Pattern | Mem./Compute Bound | Expected NoC Stress |
|---|---|---|---|---|
| canneal | PARSEC | Irregular, bursty | Memory-bound | Hotspot-prone |
| fluidanimate | PARSEC | Sync bursts | Mixed | Temporal congestion |
| blackscholes | PARSEC | Uniform, low-contention | Compute-bound | Low stress |
| dedup | PARSEC | Irregular | Memory-bound | Burst-heavy |
| ferret | PARSEC | Irregular + bursts | Memory-bound | High entropy |
| swaptions | PARSEC | Uniform moderate | Compute-bound | Balanced |
| vips | PARSEC | Streaming | Memory-bound | Sustained load |
| streamcluster | PARSEC | Burst-sustained | Memory-bound | High sustained |
| bodytrack | PARSEC | Sync + irregular | Mixed | Medium stress |
| freqmine | PARSEC | Irregular | Memory-bound | Hotspot-prone |
| facesim | PARSEC | Streaming | Compute-bound | Balanced |
| barnes | SPLASH-2 | Irregular | Memory-bound | Hotspot |
| cholesky | SPLASH-2 | Irregular | Mixed | Congestion |
| fft | SPLASH-2 | Regular pattern | Compute-bound | Uniform load |
| fmm | SPLASH-2 | Irregular | Memory-bound | Bursty |
| lu | SPLASH-2 | Regular block | Mixed | Balanced |
| ocean | SPLASH-2 | Structured | Memory-bound | Sustained load |
| radix | SPLASH-2 | Irregular bursty | Memory-bound | Hotspot |
| raytrace | SPLASH-2 | Irregular + sync | Mixed | Congestion |
| volrend | SPLASH-2 | Streaming | Memory-bound | Balanced |
| water-nsquared | SPLASH-2 | Irregular sync | Memory-bound | Medium stress |
| water-spatial | SPLASH-2 | Regular spatial | Mixed | Balanced |
| MLPerf Tiny inf. | AI | Streaming | Compute-bound | Sustained |
| PageRank (GraphBIG) | AI/Graph | Irregular | Memory-bound | Irregular spikes |
Table 10.
Uniform traffic at (means over seeds).
| Routing Algorithm | Latency (Cycles) | Throughput (Pkts/Cycle) | Energy per Delivered Bit (pJ/Bit) | Congestion (%) |
|---|---|---|---|---|
| XY | 47.8 | 0.79 | 89.5 | 64 |
| West-First | 44.2 | 0.82 | 87.0 | 61 |
| ARCA | 39.1 | 0.86 | 83.8 | 56 |
| DeepNR | 36.0 | 0.88 | 81.6 | 53 |
| DRLAR | 34.3 | 0.89 | 80.9 | 51 |
| Proposed (PPO-based DRL) | 29.5 | 0.93 | 76.8 | 44 |
Table 11.
Ablation at (uniform traffic). Means over seeds; lower is better for latency/energy/congestion, higher is better for throughput.
| Model Variant | Latency (Cycles) | Throughput (Pkts/Cycle) | Energy per Delivered Bit (pJ/Bit) | Congestion (%) |
|---|---|---|---|---|
| Full Model | 29.5 | 0.93 | 76.8 | 44 |
| w/o GAE | 31.7 | 0.91 | 77.7 | 47 |
| w/o Entropy Bonus | 31.0 | 0.91 | 77.4 | 46 |
| w/o Validity Mask | 33.2 | 0.90 | 78.7 | 49 |
| w/o Photonic Links | 35.0 | 0.88 | 80.1 | 52 |
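The "w/o GAE" variant replaces generalized advantage estimation with one-step TD advantages; the full model's estimator, with the discount (0.99) and GAE parameter (0.95) from Table 4, can be sketched as:

```python
def gae_advantages(rewards, values, last_value=0.0, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one trajectory, using the
    Table 4 parameters. Setting lam=0 recovers the one-step TD
    advantages of the 'w/o GAE' ablation."""
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # one-step TD error
        gae = delta + gamma * lam * gae                      # discounted TD-error sum
        advantages[t] = gae
        next_value = values[t]
    return advantages

print([round(a, 4) for a in gae_advantages([0.0, 1.0], [0.0, 0.0])])
# [0.9405, 1.0]
```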
Table 12.
Latency change (%, proposed versus baseline; negative is better).
| Class | XY | West-First | ARCA | DRLAR | DeepNR |
|---|---|---|---|---|---|
| Irregular/Bursty | | | | | |
| Synchronization-Heavy | | | | | |
| Streaming | | | | | |
| Low-Contention | | | | | |
Table 13.
Energy per delivered bit (pJ/bit) change (%, proposed versus baseline; negative is better).
| Class | XY | West-First | ARCA | DRLAR | DeepNR |
|---|---|---|---|---|---|
| Irregular/Bursty | | | | | |
| Synchronization-Heavy | | | | | |
| Streaming | | | | | |
| Low-Contention | | | | | |
Table 14.
Throughput change (%, proposed versus baseline; positive is better).
| Class | XY | West-First | ARCA | DRLAR | DeepNR |
|---|---|---|---|---|---|
| Irregular/Bursty | | | | | |
| Synchronization-Heavy | | | | | |
| Streaming | | | | | |
| Low-Contention | | | | | |
Table 15.
PARSEC latency—proposed versus baseline (paired tests; aggregated across meshes and W). p-values are Holm–Bonferroni corrected; Cohen’s d is uncorrected.
| Baseline | p (t-Test) | p (Wilcoxon) | Cohen’s d |
|---|---|---|---|
| XY | < | < | 1.12 |
| West-First | < | < | 0.98 |
| ARCA | < | < | 0.80 |
| DRLAR | < | < | 0.72 |
| DeepNR | | | 0.58 |
Table 16.
SPLASH-2 latency—proposed versus baseline (paired tests; Holm–Bonferroni corrected).
| Baseline | p (t-Test) | p (Wilcoxon) | Cohen’s d |
|---|---|---|---|
| XY | < | < | 1.05 |
| West-First | < | < | 0.93 |
| ARCA | < | < | 0.77 |
| DRLAR | < | < | 0.70 |
| DeepNR | | | 0.54 |
Table 17.
AI latency—proposed versus baseline (paired tests; Holm–Bonferroni corrected).
| Baseline | p (t-Test) | p (Wilcoxon) | Cohen’s d |
|---|---|---|---|
| XY | < | < | 0.86 |
| West-First | < | < | 0.79 |
| ARCA | | | 0.55 |
| DRLAR | | | 0.49 |
| DeepNR | | | 0.41 |
Table 18.
Cross-generalization—mean percent change relative to in-domain performance (negative indicates degradation).
| Setup | Latency | Throughput | Energy/bit | Latency |
|---|---|---|---|---|
| Train–A → Test–A | | | | |
| Train–B → Test–B | | | | |
| Train–C → Test–C | | | | |