Article

Photonic-Aware Routing in Hybrid Networks-on-Chip via Decentralized Deep Reinforcement Learning

1 Intelligent Systems Laboratory, Department of Computer Science, Neapolis University Pafos, Pafos 8042, Cyprus
2 Department of Electrical Engineering, and Computer Science and Engineering, Cyprus University of Technology, Limassol 3036, Cyprus
Submission received: 31 December 2025 / Revised: 22 January 2026 / Accepted: 1 February 2026 / Published: 9 February 2026

Abstract

Edge artificial intelligence (AI) workloads generate bursty, heterogeneous traffic on Networks-on-Chip (NoCs) under tight energy and latency constraints. Hybrid NoCs that overlay electronic meshes with silicon photonic express links can reduce long-path latency via wavelength-division multiplexing, but thermal drift and intermittent optical availability complicate routing. This study introduces a decentralized, photonic-aware controller based on Deep Reinforcement Learning (DRL) with Proximal Policy Optimization (PPO). The policy uses router-local observables—per-port buffer occupancy with short histories, hop distance, a local injection estimate, and a per-cycle optical validity signal—and applies action masking so chosen outputs are always feasible; the controller is co-designed with the router pipeline to retain single-cycle decisions and a modest memory footprint. Cycle-accurate simulations with synthetic traffic and benchmark-derived traces evaluate mean packet latency, throughput, and energy per delivered bit against deterministic, adaptive, and recent DRL baselines; ablation studies isolate the roles of optical validity cues and locality. The results show consistent improvements in congestion-forming regimes and on long electronic paths bridged by photonic links, with robustness across mesh sizes and wavelength concurrency. Overall, the evidence indicates that photonic-aware PPO provides a practical, thermally robust control plane for hybrid NoCs and a scalable routing solution for AI-centric manycore and edge systems.

1. Introduction

Edge artificial intelligence (AI) applications, including autonomous robotics, smart healthcare, and industrial Internet of Things (IoT), require on-chip interconnects that can sustain low latency and high throughput under tight energy budgets. Networks-on-Chip (NoCs) provide a scalable substrate for manycore processors; however, conventional electronic meshes degrade under bursty and synchronization-heavy traffic, where limited link bandwidth, restricted path diversity, and congestion sensitivity cap performance.
Hybrid Networks-on-Chip (HNoCs) augment electronic routers with high-bandwidth photonic express links. By exploiting wavelength-division multiplexing (WDM), photonic paths offer long-reach transfers with favorable latency and energy characteristics across diverse loads [1,2]. This heterogeneous fabric introduces nontrivial control challenges: runtime routing must arbitrate between electronic and photonic planes in the presence of dynamic traffic, thermally sensitive device behavior, and transient link availability, where deterministic or static policies often prove brittle.
Deep Reinforcement Learning (DRL) has emerged as a promising basis for adaptive NoC control. Prior work, including DRLAR [3], Q-RASP [4], and DeepNR [5], has demonstrated that router-embedded policies can improve throughput, latency, and fairness relative to heuristic schemes. Related efforts in traffic prediction [6] and photonic mesh integration [1] further support the feasibility of intelligent routing. Nevertheless, most DRL-based approaches either disregard photonic channels, depend on centralized training or global state, or rely on reward designs that do not reflect thermal dynamics and heterogeneous link validity. These limitations hinder scalability and generality, particularly for edge and embedded deployments. Unlike prior DRL NoC controllers, the policy encodes an optical validity observable and applies feasibility masking so that every sampled action is immediately realizable; paired with a quantized, router-local PPO agent that meets the route computation timing budget, this yields a photonic-aware controller suited to hybrid meshes.
This article presents a fully decentralized DRL routing framework based on Proximal Policy Optimization (PPO) tailored to HNoCs with thermally adaptive diagonal photonic express links. Each router hosts a lightweight agent that observes only local signals (per-port buffer occupancy and short congestion history, hop distance, a local injection estimate, and an optical validity bit that reflects per-cycle photonic usability) and selects the next hop using feasibility-aware action masking to ensure immediately actionable choices. The controller is co-designed with the router pipeline to preserve single-cycle decision latency (sub-10 ns) and a modest memory footprint (approximately 12 KB SRAM per agent), enabling mesh-scale deployment without global coordination.
Evaluation covers synthetic traffic, the full PARSEC 3.0 and SPLASH-2 suites, and two AI-inspired workloads that represent streaming and irregular graph communication. Baselines include deterministic schemes (XY, West-First), an adaptive router (ARCA), and recent DRL methods (DRLAR [3], DeepNR [5]). The results indicate improvements in latency, throughput, and energy per delivered bit at a high offered load and under thermal variability.
The contributions are the following:
(i) A structured positioning of decentralized PPO within the HNoC and DRL routing literature, emphasizing photonic awareness and deployment realism;
(ii) A photonic-aware, feasibility-masked PPO controller integrated into the router control path with quantized, router-local inference that satisfies timing and memory constraints;
(iii) A reproducible, cycle-accurate evaluation framework of synthetic and application traces, with analyses explaining where gains arise and how they scale.
Scope of novelty. This study’s contribution is a photonic-aware, feasibility-constrained routing scheme with hardware-conscious DRL integration tailored to hybrid electronic–photonic NoCs, rather than a new learning algorithm. Proximal Policy Optimization (PPO) is adopted for its stability under partial observability and its compatibility with single-cycle, router-local inference; the novelty lies in encoding photonic-link validity, enforcing feasibility-aware action masking, and meeting on-path timing/area budgets in a decentralized execution model.
The remainder of this article is organized as follows. Section 2 reviews background and related work on HNoC architectures and learning-enabled routing, including a concise contrast between photonic-only and hybrid fabrics (Section 2.8). Section 3 presents the system architecture and controller integration. Section 4 details the PPO formulation and decentralized training protocol. Section 5 describes the experimental methodology, benchmarks, and metrics. Section 6 reports the results and analysis. Section 7 outlines future directions, and Section 8 concludes this study.

2. Background and Related Work

2.1. Foundations and Motivation

NoCs provide scalable communication for manycore systems and remain a cornerstone of high-performance processor design. Mature electronic mesh topologies deliver efficient energy and latency characteristics under regular traffic patterns; however, AI- and data-intensive applications frequently generate bursty, spatially skewed, and synchronization-heavy traffic that stresses link bandwidth, reduces path diversity, and exposes bisection bottlenecks. HNoCs mitigate these limitations by integrating electronic routers with photonic express links that exploit WDM and long-reach, low-latency optical paths [1,2]. The resulting heterogeneity offers opportunities to alleviate long-distance contention but introduces additional control complexity, including photonic device physics, thermal tuning, and real-time arbitration across electronic and photonic planes.

2.2. Hybrid NoC Architectures

HNoC proposals vary along three principal axes. (i) Topology and express layout: path diversity and hop reduction depend on the placement of optical overlays, with examples including diagonal express paths [7], butterfly fat tree organizations [8], and cluster-aware layouts. (ii) Devices and thermal management: microring-resonator-based links are thermally sensitive, motivating thermal-aware routing [9] and topology-aware scaling techniques [10]. (iii) Routing and arbitration: policies determine when and how optical lanes are exercised under dynamic load. Beyond silicon photonics, hybrid plasmonic–photonic integration (HyPPI) has been explored to reduce device footprint and switching energy [11], further broadening the design space.

2.3. Photonic Devices and Thermal Issues

Thermal drift in microring resonators shifts resonant wavelengths, producing transient invalidity for specific channels and requiring continuous tuning. As a result, runtime routing must account for per-cycle optical availability and evaluate the latency–energy trade-offs of engaging photonic paths under thermal dynamics [9,10]. Designs that assume photonic links are always available risk misrouting, queuing delays, or packet replays; conversely, conservative policies leave valuable bandwidth idle. An effective photonic-aware controller must therefore (i) sense link validity at a fine temporal granularity, (ii) prevent infeasible actions when devices are off-resonance, and (iii) arbitrate between electronic and photonic planes without relying on centralized coordination.

2.4. Learning-Enabled Routing in NoCs (Electronic and Optical)

DRL has demonstrated performance gains in electronic NoCs. DRLAR employs latency-informed rewards to reduce delay under congestion [3]; DeepNR propagates feedback to improve fairness and throughput [5]; and region-aware methods such as Q-RASP and RELAR facilitate experience sharing and multi-objective optimization [4,12]. Despite these advances, most electronic DRL schemes ignore photonic channels, assume centralized training or a global state, and omit thermal validity from their learning formulation. In optical contexts, reinforcement learning has been investigated for thermal-aware routing [13], table-free thermal approximations [14], and reinforcement-based wavelength/path assignment under optical signal-to-noise ratio (OSNR) and latency constraints [15]. These efforts, however, typically target simplified optical fabrics rather than mesh-scale HNoCs requiring per-hop, router-local decisions.

2.5. Embedded DRL and TinyML Feasibility

Recent progress in TinyML and efficient actor–critic learning supports the feasibility of embedding lightweight DRL agents in resource-constrained environments [16,17,18,19]. Kilobyte-scale policies implemented with compact multilayer perceptrons have demonstrated the ability to meet strict latency and memory budgets, enabling per-flit decisions in nanosecond-scale control loops without centralized oversight. This evidence motivates router-local inference in HNoCs with thermally aware DRL agents.

2.6. Taxonomy of Related Work

Table 1 categorizes representative approaches across fabric type, topology, photonic awareness, thermal strategy, control centralization, and evaluation scale. The intersection of photonic awareness, thermal validity, and fully decentralized per-router control remains unexplored in prior research, motivating the architecture presented in Section 3.

2.7. Identified Gaps

Three research gaps can be identified. First, DRL methods that perform well in electronic meshes do not encode optical availability or thermal validity, limiting their transfer to HNoCs. Second, learning-based optical studies emphasize wavelength and path assignment in small-scale fabrics rather than mesh-scale, per-hop routing across heterogeneous planes. Third, demonstrations of router-local agents that satisfy stringent SRAM and sub-cycle timing constraints are scarce. The proposed decentralized PPO framework addresses these gaps by embedding photonic validity into the agent’s state, applying feasibility masking at action selection, and deploying compact, quantized policies that align with router control path constraints.

2.8. Photonic-Only vs. Hybrid Fabrics: Practical Trade-Offs

All-photonic interconnects excel at long-reach, high-concurrency transfers but face chip-scale constraints: thermal tuning and static laser power scale with device count; waveguide loss and crosstalk limit dense crossings; and many photonic NoCs are circuit-switched with limited fine-grained buffering under a bursty load [2,9,11]. A hybrid fabric—an electronic mesh for local/short paths plus silicon photonic express links for long-distance bypass—trades a modest on-chip photonic footprint for a reduced average hop count and lower energy per delivered bit when optics are exercised selectively [1,7,10]. The photonic-aware controller studied here exploits this structure by (i) engaging diagonal links when valid and (ii) falling back immediately when optics detune via feasibility masking, thereby avoiding stalls and wasted attempts (see Section 3 and Section 4). The empirical sections quantify gains relative to electronic mesh baselines and are complementary to photonic-only designs that target different operating regimes.

3. Proposed Architecture

3.1. Design Goals

The HNoC targets edge AI and data-intensive workloads that exhibit burstiness, spatial skew, and synchronization phases. The architectural objectives are the following: (i) shorten paths for distant flows, (ii) relieve mesh bisection hotspots under load, (iii) react locally to nonstationary conditions without global coordination, and (iv) tolerate thermal variation on photonic links while keeping per-router overheads small.

3.2. Interconnect Topology and Photonic-Express Organization

The chip-scale interconnect is a 2D electronic mesh with bidirectional N/S/E/W electrical links overlaid with diagonal photonic express links that provide long-reach bypass across the mesh (Figure 1). Photonic channels use WDM to aggregate bandwidth and reduce per-bit latency [1,2]. In the prototype, each diagonal express link supports up to eight wavelengths per direction. The overlay is motivated by three traffic-level effects: (1) distance reduction, where a diagonal hop advances in both x and y, lowering Manhattan distance and average hop count for long routes; (2) hotspot relief, where diagonal paths bypass bisection pressure during synchronized phases and spread load [7]; and (3) high-injection resilience, where multiple wavelengths increase effective path diversity at a high offered load, smoothing queue growth on electrical links. Selection between electrical and photonic planes is resolved locally at each router.
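The distance-reduction effect can be made concrete with a small sketch (Python; helper names are illustrative, not part of the simulator). It compares the Manhattan hop count on the electrical mesh with the best-case hop count when every hop may advance in both x and y via a diagonal:

```python
def electrical_hops(src, dst):
    """Manhattan distance on the 2D mesh using N/S/E/W links only."""
    return abs(dst[0] - src[0]) + abs(dst[1] - src[1])

def diagonal_hops(src, dst):
    """Best-case hop count when a diagonal hop is always usable:
    each diagonal advances one step in both x and y (Chebyshev distance)."""
    dx, dy = abs(dst[0] - src[0]), abs(dst[1] - src[1])
    return max(dx, dy)

# Corner-to-corner on an 8x8 mesh: 14 electrical hops vs. 7 with diagonals.
```

In practice only a staggered subset of diagonals is provisioned and per-cycle validity varies, so realized hop counts fall between these two bounds.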

3.3. Photonic Devices, Thermal Tuning, and Link Validity

Photonic switching uses thermally tuned microring resonators whose resonance drifts with temperature. Following thermal-aware optical NoC practices [9,10], each router maintains a binary optical validity signal per incident diagonal, indicating whether the link is tuned and usable in the current cycle. Validity is treated as a feasibility constraint: when an express link is invalid, selection is disallowed and traffic remains on the electrical links. Integrated heaters compensate drift; tuning energy is on the order of 1–2 pJ per event and is amortized by selective use of long-reach photonic paths.

3.4. Tile Microarchitecture and Controller Placement

Each tile integrates a processing core, a wormhole router with four electrical ports, and a router-local controller embedded on the router control path. Where diagonals are provisioned, the router exposes optical endpoints to the controller. Per-flit routing decisions rely only on local signals, preserving cycle time and enabling mesh-scale deployment without global synchronization. Figure 1 shows the block view: each tile presents a Core–DRL agent–router stack; solid inter-tile lines depict the electrical mesh; and dashed slanted segments depict diagonal photonic express links. A staggered subset of diagonals is drawn for legibility; density is configurable.

3.5. Router Pipeline Integration

The controller resides on the control path between route computation and switch allocation. For each flit: (i) sensing: update local observables, including per-port queue occupancy, a short history of queue deltas, a moving average of local injection, Manhattan hop distance to the destination, and optical validity for any incident diagonal; (ii) feasibility mask: disable unavailable options, including thermally invalid diagonals, nonprovisioned endpoints, or physically blocked outputs; (iii) decision: select a next hop from {N, S, E, W, Diagonal} and, if a diagonal is present and valid, optionally emit a small discrete wavelength index; the feasibility mask is applied before allocation and the distribution is renormalized so the final choice is actionable; (iv) allocation and traversal: compete in switch allocation and, on success, traverse the crossbar onto the selected electrical or photonic link. This placement preserves single-cycle route decisions and avoids futile retries under changing optical availability.

3.6. Flow Control, Safety, and Liveness

Electrical links use credit-based flow control and photonic transfers use reservation. The feasibility mask enforces three invariants: (i) an unavailable or invalid diagonal is never selected, (ii) blocked outputs are not chosen, and (iii) at least one electrical escape path remains available under backpressure. These invariants maintain forward progress during transient blockages and thermal unavailability and reduce the chance of livelock under bursty hotspot traffic.

3.7. Expected Traffic-Level Effects

By construction, the architecture yields the following: (i) fewer hops for distant flows via diagonal shortcuts, (ii) lower queuing at the mesh bisection through optical offload during synchronized phases, (iii) resilience at high injection as WDM increases effective path diversity, and (iv) elimination of wasted cycles on detuned optical links through validity-aware arbitration. The learning formulation that realizes these behaviors is detailed in Section 4.

3.8. Hardware Integration and Overhead

The router-local controller is a compact two-layer multilayer perceptron (MLP) co-synthesized with the router control path. Weights and activations are quantized to a fixed point and stored in tile-local SRAM; the datapath is fully pipelined so per-flit decisions meet the cycle budget without off-chip access. The controller uses fewer than 6000 parameters, occupies about 12 KB of SRAM, and completes inference in under 10 ns. These budgets keep overhead modest while enabling deployment across large meshes.
For the physical design context, synthesis at a 28 nm CMOS node yields a footprint of roughly 35 K gate equivalents (GE) per tile for the controller logic and small control FSMs. Thermal adaptivity is handled in-band via the optical validity bit in the observation vector; when optics detune, the feasibility mask prevents diagonal selection. This integration preserves forward progress, avoids wasted arbitration on unavailable links, and bounds energy overhead by using photonic express paths selectively.
As summarized in Table 2, the controller cost is modest relative to router and link logic, and its placement next to route computation and arbitration (Figure 1) enables feasibility-aware decisions without impacting the critical path.

3.9. Timing and Hardware Feasibility

The router-local policy is implemented as a quantized, two-layer MLP (8-bit weights/activations) with a shared trunk and separate policy/value heads. For an input dimension of 36–48 features (per-port queues with a short history, hop distance, local injection EWMA, and an optical validity bit), the forward pass performs approximately
$N_{\mathrm{MAC}} = (48 \times 64) + (64 \times 64) + (64 \times 6) = 7552 \approx 7.5\,\mathrm{k}$
eight-bit MACs and ReLU activations. Post-synthesis at 28 nm CMOS (0.9 V, standard-cell) achieves sub-10 ns inference latency for the full forward pass, fitting within a single route computation (RC) stage at a 1 GHz router. For higher router frequencies (e.g., 2–3 GHz), the block is retimed across two short stages (RC → RC′) or computed with one-cycle lookahead while the header flit resides in input buffering; in both cases, the per-flit throughput remains one decision per header without stalling switch allocation. A deterministic escape path is retained and guarantees progress should any timing guardrail be violated. The controller footprint is approximately 12 KB SRAM and 35 kGE per tile; these budgets are stable across mesh sizes, as the agent is per router and does not rely on global state or centralized inference. A per-agent SRAM breakdown (weights, activations, and control FSM) is shown in Figure 2.
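The MAC count and weight storage can be checked mechanically; a minimal sketch assuming the 48–64–64 trunk with a 6-way policy head and one byte per 8-bit weight (biases and the value head omitted for brevity):

```python
def mlp_macs(d_in=48, hidden=64, d_out=6):
    """MACs for one forward pass of the two-layer trunk plus policy head."""
    return d_in * hidden + hidden * hidden + hidden * d_out

def weight_bytes(d_in=48, hidden=64, d_out=6):
    """Weight storage at 8 bits per parameter: one byte per weight."""
    return mlp_macs(d_in, hidden, d_out)

# mlp_macs() yields 7552, i.e., roughly 7.5k eight-bit MACs; the 12 KB SRAM
# budget leaves headroom for activations, the value head, and control state.
```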
Implementation assumptions and portability. The results assume 8-bit fixed-point weights and activations with ReLU nonlinearity and a 28 nm CMOS standard-cell flow at 0.9 V. At 1 GHz, the forward pass fits within a single RC stage; at 2–3 GHz it is retimed across RC → RC′ or computed with one-cycle lookahead while the header resides in input buffers, preserving one decision per header without stalling switch allocation. The reference footprint (per tile) is ≈12 KB SRAM and ≈35 kGE logic at 28 nm; further quantization or light channel pruning can reduce these budgets with a small accuracy impact. Across smaller technology nodes (e.g., 14 nm/7 nm), total area and dynamic energy generally improve while timing headroom increases; conversely, pushing clocks significantly beyond 3 GHz may require shallow additional retiming or modest hidden layer width reductions. These trade-offs do not alter the decentralized execution model, feasibility-masked action semantics, or the RC path placement of the policy.

4. Learning Methodology

The DRL formulation targets HNoCs in which each router chooses between long-reach photonic bypass and conventional electrical paths under rapidly changing load and temperature. Optical availability varies with thermal drift and reservation state, while congestion evolves quickly during bursty or synchronized phases. To meet these constraints without global coordination, every router embeds a lightweight PPO agent that observes only local signals and selects per-flit actions in real time.

4.1. Conceptual Flow

Figure 3 summarizes the end-to-end control loop. Workloads induce traffic patterns that shape per-router observations; the local agent emits a feasible next hop (optionally selecting a photonic diagonal and a wavelength when available), which then impacts latency, congestion, and energy.

4.2. State Representation

Routing is modeled as a Markov decision process. The observation $s_t$ anticipates near-term congestion and link usability while keeping inference compact:
  • Per-port buffer occupancy: normalized queue levels.
  • Recent congestion history: a short window of queue deltas to capture rising or falling pressure.
  • Local injection rate: an exponentially weighted moving average (EWMA) at the current tile.
  • Hop distance: Manhattan distance to the destination, which preserves shortest-path behavior at low load.
  • Optical validity bit: whether the incident diagonal express link is tuned and usable.
A four-cycle history balances temporal fidelity and input size. Table 3 lists ranges and normalization.
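A minimal sketch of how such an observation vector might be assembled (Python; the normalization constants and helper name are illustrative, not the authors' exact implementation):

```python
import numpy as np

def build_observation(queues, queue_deltas, inj_ewma, hops, optical_valid,
                      buf_depth=8, max_hops=30):
    """Pack router-local signals into a fixed-length, normalized vector.

    queues:        per-port occupancies (flits)
    queue_deltas:  per-port deltas over the last 4 cycles, flattened
    inj_ewma:      EWMA of local injection, already in [0, 1]
    hops:          Manhattan distance to the destination
    optical_valid: True if the incident diagonal is tuned and usable
    """
    obs = [q / buf_depth for q in queues]            # occupancy in [0, 1]
    obs += [d / buf_depth for d in queue_deltas]     # short congestion history
    obs.append(inj_ewma)                             # local injection estimate
    obs.append(min(hops / max_hops, 1.0))            # normalized hop distance
    obs.append(1.0 if optical_valid else 0.0)        # optical validity bit
    return np.asarray(obs, dtype=np.float32)
```

With five ports and a four-cycle history this yields a vector in the 36–48 range discussed in Section 3.9, depending on which per-port signals are retained.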

4.3. Action Space and Policy Network

The agent chooses from the discrete set {N, S, E, W, Diagonal}. When a diagonal port is present and valid, and multiple wavelengths are available, it also emits a discrete wavelength index. A feasibility mask is applied at decision time to remove unavailable actions (detuned or unprovisioned diagonals; blocked electrical ports); remaining probabilities are renormalized so the final choice is always actionable. Policy and value share a compact two-layer multilayer perceptron (MLP) with a common trunk and separate heads, dimensioned to meet router control path timing and per-router SRAM budgets.
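Feasibility masking with renormalization can be realized by forcing infeasible logits to negative infinity before the softmax, so infeasible actions receive exactly zero probability; a sketch (numpy, hypothetical helper name):

```python
import numpy as np

ACTIONS = ("N", "S", "E", "W", "Diagonal")

def masked_distribution(logits, feasible):
    """Softmax restricted to feasible actions: infeasible logits become
    -inf, exp() maps them to zero, and the remainder renormalizes to 1."""
    masked = np.where(feasible, logits, -np.inf)
    z = np.exp(masked - masked[feasible].max())  # numerically stable softmax
    return z / z.sum()
```

The escape-path invariant of Section 3.6 guarantees at least one feasible action, so the normalization above is always well defined.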

4.4. Reward Design and Decentralized Training

Agents optimize a scalar reward that balances delay, congestion, and energy as follows:
$R_t = -\alpha\,\mathrm{Latency}_t - \beta\,\mathrm{Congestion}_t - \gamma\,\mathrm{Power}_t, \quad (1)$
where Latency is end-to-end packet delay, Congestion aggregates buffer occupancy, and Power includes electrical switching and photonic tuning overhead. The coefficients ( α , β , γ ) are tuned once per topology and reused across workloads; simulation-based or Bayesian search is applicable [20].
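With illustrative (untuned) coefficient values, the per-step reward reduces to a one-liner that makes the sign convention explicit:

```python
def step_reward(latency, congestion, power, alpha=1.0, beta=0.5, gamma=0.1):
    """R_t = -(alpha*Latency_t + beta*Congestion_t + gamma*Power_t).
    Coefficient values here are placeholders, not the tuned (alpha, beta,
    gamma) used in the experiments."""
    return -(alpha * latency + beta * congestion + gamma * power)
```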
Each router learns independently by collecting local transitions $(s_t, a_t, r_t, s_{t+1})$ over horizon $T$ without replay sharing. This decentralized, replay-free design aligns with privacy- and locality-preserving learning paradigms in distributed systems [21]. Advantages use generalized advantage estimation (GAE), and PPO applies the following clipped surrogate:
$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t\right)\right], \quad (2)$
with $r_t(\theta) = \pi_\theta(a_t \mid s_t)\,/\,\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$. The total loss combines policy fit, value accuracy, and exploration as follows:
$L_{\mathrm{total}} = L^{\mathrm{CLIP}} - c_1\,\mathbb{E}_t\!\left[\big(V_\phi(s_t) - \hat{R}_t\big)^2\right] + c_2\,\mathbb{E}_t\!\left[H\big(\pi_\theta(\cdot \mid s_t)\big)\right]. \quad (3)$
Clipped updates bound per-update policy change to remain stable under shifting traffic, while the entropy term sustains exploration so the agent exploits diagonal offload when beneficial and falls back to electrical paths when not.
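The clipped surrogate in Equation (2) can be sketched in a few lines of numpy (batch form; the ratio is computed in log space for stability, and maximizing this objective corresponds to minimizing its negative in Equation (3)):

```python
import numpy as np

def ppo_clip(logp_new, logp_old, adv, eps=0.2):
    """Mean clipped surrogate over a batch of transitions.
    ratio r_t = pi_theta(a|s) / pi_theta_old(a|s), taken per sample."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    # Element-wise min of the two terms, averaged over the batch.
    return float(np.mean(np.minimum(unclipped, clipped)))
```

When the policy is unchanged the ratio is 1 and the objective equals the mean advantage; a ratio pushed beyond 1 ± eps contributes only its clipped value, which is what bounds the per-update policy change.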
Decentralized execution and offline training. Agents are trained offline in a shared simulator that advances many routers in parallel while collecting local rollouts per router; optimization proceeds per policy without any runtime cross-agent synchronization. Deployment is fully decentralized: each router executes the quantized policy using only local observations, with no global state exchange, inter-agent communication, or centralized critic. Policies are frozen for evaluation. This design choice targets realistic NoC deployments where route decisions must be made on-path under strict per-hop timing guarantees.

4.4.1. Training Loop and Hyperparameters

Algorithm 1 summarizes the decentralized training loop. Inputs are standardized with running per-feature statistics; advantages are normalized each update; and gradients are clipped by a global norm threshold. Before sampling, action masking with renormalization enforces feasibility. Training runs for a fixed budget of episodes; policies typically stabilize after several thousand episodes (Figure 4). Representative hyperparameters appear in Table 4.
Algorithm 1 PPO training loop for decentralized HNoC routing
1: Initialize policy $\pi_\theta$ and value $V_\phi$
2: for each training episode do
3:   for each router agent in parallel do
4:     Collect $(s_t, a_t, r_t, s_{t+1})$ over horizon $T$ with feasibility masking
5:     Compute GAE advantages $\hat{A}_t$ and returns $\hat{R}_t$
6:     Update $\pi_\theta$ and $V_\phi$ by minimizing $L_{\mathrm{total}}$ in (3)
7:   end for
8: end for
9: Deploy the trained policy $\pi_\theta$ for per-flit routing
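The GAE step (line 5 of Algorithm 1) is a backward recursion over one local rollout; a sketch assuming standard discounts γ and λ (not the authors' exact implementation):

```python
import numpy as np

def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Backward recursion: delta_t = r_t + gamma*V(s_{t+1}) - V(s_t),
    A_t = delta_t + gamma*lam*A_{t+1}; returns R_t = A_t + V(s_t)."""
    T = len(rewards)
    adv = np.zeros(T)
    next_value, running = last_value, 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
        next_value = values[t]
    return adv, adv + np.asarray(values)  # advantages, bootstrapped returns
```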

4.4.2. Training Budget and Computational Cost

To make the training setup reproducible, Table 5 summarizes a representative budget and host configuration used to train the reported PPO policies. Unless stated otherwise, runs use identical episode budgets and feasibility masking across algorithms. Holding episodes and horizon fixed, aggregate transitions scale approximately linearly with the number of agents (routers) in the mesh.

4.5. Algorithmic Justification and DRL Choice

Decentralized HNoC control benefits from four properties: stability under distribution shift caused by bursty traffic and thermal events, no cross-agent synchronization, compact networks for router-local inference, and predictable parameter updates that respect timing budgets. PPO satisfies the following criteria: Clipped updates bound policy change and tolerate partial observability and nonstationarity [22]. Entropy regularization sustains exploration of diagonals when valid and encourages immediate electrical fallback when optics detune. Alternatives have drawbacks in this setting: DQN relies on stationary replay and is brittle in multi-agent regimes [23]; A2C introduces synchronization demands [24]; DDPG often requires a central critic and is noise sensitive [25]; and SAC typically benefits from synchronized actor–critic updates [26]. Table 6 summarizes these trade-offs.

Empirical Stability: PPO Versus DQN/A2C

To substantiate the algorithmic choice, a stability comparison was conducted under identical training budgets, feasibility masking, and observation/action spaces on a 16 × 16 mesh with W = 8 wavelengths (training hyperparameters as in Table 4). “Episodes to plateau” denotes the first episode index at which a 200-episode moving average of reward remains within 2% of its final mean for the remainder of training. “Post-plateau reward variance” is computed over that terminal window. “Mean policy Kullback–Leibler (KL) divergence per update” reports the average $\mathrm{KL}(\pi_{\theta_{\mathrm{old}}} \,\|\, \pi_\theta)$ between successive policy distributions after plateau; for DQN, a Boltzmann (softmax) policy $\pi(a \mid s) \propto \exp\!\big(Q(s,a)/\tau\big)$ with $\tau = 1$ over the feasibility-masked action set is used to compute the KL divergence. As summarized in Table 7, PPO stabilizes faster and exhibits lower post-plateau variance and smaller mean policy KL divergence than DQN or A2C under identical budgets and masking.
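The per-update policy KL for the DQN baseline can be reproduced as follows (softmax over Q with τ = 1 on the masked action set; an illustrative sketch, not the evaluation harness):

```python
import numpy as np

def boltzmann(q, tau=1.0):
    """Softmax policy pi(a|s) proportional to exp(Q(s,a)/tau), computed
    over the feasibility-masked action set (pass masked Q-values)."""
    z = np.exp((np.asarray(q, dtype=float) - np.max(q)) / tau)
    return z / z.sum()

def policy_kl(q_old, q_new, tau=1.0):
    """KL(pi_old || pi_new) between successive Boltzmann policies."""
    p, q = boltzmann(q_old, tau), boltzmann(q_new, tau)
    return float(np.sum(p * np.log(p / q)))
```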

5. Experimental Setup

This section describes the simulator and architectural model, scaling dimensions, traffic sources and injection schedules, seeds and metrics, baselines, and post-processing used in this study. The evaluation spans synthetic traffic, the complete PARSEC 3.0 and SPLASH-2 suites, and two AI-inspired workloads representative of streaming and irregular graph communication. Comparative baselines include deterministic schemes (XY, West-First), an adaptive router (ARCA), and recent learning-based methods (DRLAR, DeepNR).

5.1. Execution Platform

All simulations run on a Linux workstation with dual Intel Xeon Gold 6226R CPUs (2.9 GHz) and 256 GB RAM (Ubuntu 22.04 LTS). Each experiment executes a 20,000-cycle warm-up followed by a 100,000-cycle measurement window. Jobs are single-process and CPU-bound to preserve determinism across seeds.

5.2. Simulator and Architectural Model

A cycle-accurate NoC simulator models a 2D electronic mesh overlaid with diagonal photonic express links. Routers implement wormhole flow control with credit return, two virtual channels per input, and 8-flit input FIFOs. The electrical plane provides bidirectional N/S/E/W links. The photonic plane offers chip-scale diagonals with WDM; each photonic endpoint exports a per-cycle validity bit reflecting thermal tuning and reservation state. A feasibility mask enforces validity during route selection.
Electrical energy is estimated using an Orion-style model calibrated to a 28 nm reference at 1 GHz. Photonic energy accounts for laser bias, tuning events, and link traversal. Physical parameters follow common silicon photonics assumptions: waveguide loss 0.5 dB/cm, 1.2 dB/coupler, microring detuning 0.1 nm/°C, and 1.5 pJ per tuning event.
Photonic thermal tuning and validity model. Microring resonators are parameterized with a thermal detuning coefficient of approximately 0.1 nm/°C. Heater-assisted (re)tuning and hold operate on 5–20 μs time scales; at a 1 GHz router clock, this corresponds to 5–20k cycles. A photonic lane is marked invalid whenever the estimated detuning exceeds a guardband of ≈0.3 nm (about 3 °C under the 0.1 nm/°C coefficient) or while a retune is in progress; it returns to valid after the retune latency elapses. Validity dynamics combine (i) a slow background component due to ambient/packaging thermal fields (piecewise-constant plateaus on the order of 50–200 μs) and (ii) a faster workload-driven component coupled to recent injection/queue activity (EWMA over 1–5 μs). Each photonic endpoint exports a per-cycle optical validity bit derived from these processes, which is consumed by the feasibility mask so optical actions are sampled only when immediately realizable.
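Under the stated coefficients, the per-cycle validity bit reduces to a guardband test; a sketch using the parameter values from the text (helper and constant names are hypothetical):

```python
NM_PER_DEG_C = 0.1   # microring thermal detuning coefficient (nm/°C)
GUARDBAND_NM = 0.3   # validity guardband, ~3 °C under the coefficient above

def optical_validity(delta_temp_c, retune_in_progress):
    """A photonic lane is valid only while the estimated detuning stays
    inside the guardband and no heater-assisted retune is underway."""
    detuning_nm = NM_PER_DEG_C * abs(delta_temp_c)
    return detuning_nm <= GUARDBAND_NM and not retune_in_progress
```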
The key architectural constants are summarized in Table 8 and remain fixed unless explicitly swept.
Scaling dimensions. Two orthogonal axes are explored: spatial scale (mesh sizes 8 × 8, 16 × 16, and 32 × 32) and optical concurrency (wavelengths per diagonal W ∈ {4, 8, 16}). Larger meshes increase average path length and bisection pressure; a higher W exposes both the benefits and the diminishing returns of photonic parallelism under learned control.

5.3. Traffic Sources, Pre-Processing, and Injection Schedules

Synthetic patterns. Uniform random, transpose, bit-complement, hotspot (10% weighted sinks), and on/off bursty source models are used. For uniform traffic, the injection rate ρ is swept from 0.02 to 0.90 in steps of 0.02; for structured patterns, ρ is swept up to 0.70 to emphasize knee regions without saturation artifacts. This schedule exercises both shortest-path behavior at low load and congestion management at high load, which is essential for assessing diagonal offload and validity-aware masking.
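As a reference for the structured patterns, the standard destination functions can be sketched as follows (row-major coordinates and node ids are assumed; the hotspot weighting is shown as a simple probability mix, and the function names are illustrative):

```python
import random

def transpose_dest(x, y):
    """Transpose pattern: source (x, y) sends to (y, x)."""
    return y, x

def bit_complement_dest(node_id, n_bits):
    """Bit-complement: destination id is the bitwise complement of the source id."""
    return (~node_id) & ((1 << n_bits) - 1)

def hotspot_dest(n_nodes, hotspots, weight=0.10, rng=random):
    """Hotspot: with probability `weight`, send to a designated sink;
    otherwise pick a uniform random destination."""
    if rng.random() < weight:
        return rng.choice(hotspots)
    return rng.randrange(n_nodes)
```

On a 16 × 16 mesh (256 nodes, 8-bit ids), node 0 would bit-complement to node 255, which is why these patterns stress the full bisection.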
Application traces. Regions of interest are extracted per suite, timestamps are normalized to the target cycle time, and task clustering is preserved where present. When scaling between 8 × 8 and 32 × 32 , inter-arrival times are adjusted to maintain per-task rates while naturally increasing path lengths and bisection pressure:
  • PARSEC 3.0: canneal, fluidanimate, blackscholes, dedup, ferret, swaptions, vips, streamcluster, bodytrack, freqmine, and facesim.
  • SPLASH-2: barnes, cholesky, fft, fmm, lu, ocean, radix, raytrace, volrend, water-nsquared, and water-spatial.
  • AI-inspired: MLPerf Tiny inference (streaming) and GraphBIG PageRank (irregular, graph-like).
Workload taxonomy. Table 9 classifies all traces by communication type, compute/memory skew, and expected NoC stress. This taxonomy underpins the class-level aggregates in Section 6: Irregular/Bursty, Synchronization-heavy, Streaming, and Low-contention.

5.4. DRL Agent and Training Protocol

Each router embeds a compact PPO agent (two-layer MLP, 64 hidden units per layer, fixed-point quantization) with ∼12 KB SRAM footprint and sub-10 ns inference latency. The detailed memory profile is provided in Figure 2. Observations include per-port queue occupancy with a short history, an EWMA of local injection, Manhattan hop distance, and an optical validity bit. The action set is { N , S , E , W , Diagonal } ; when the diagonal is valid, the agent also emits a wavelength index. A feasibility mask disables blocked ports and invalid optics prior to sampling.
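The masking step ahead of sampling can be illustrated with a minimal softmax-with-mask routine; this is a floating-point sketch over a five-way logit vector, whereas the actual controller uses a fixed-point datapath:

```python
import math
import random

ACTIONS = ["N", "S", "E", "W", "Diagonal"]

def masked_sample(logits, feasible, rng=random):
    """Sample an action index from a softmax over `logits`, with infeasible
    actions (blocked ports, invalid optics) masked to -inf before sampling,
    so they receive exactly zero probability."""
    masked = [l if ok else float("-inf") for l, ok in zip(logits, feasible)]
    m = max(masked)                       # stabilize the softmax
    exps = [math.exp(l - m) if l != float("-inf") else 0.0 for l in masked]
    z = sum(exps)
    r = rng.random() * z
    acc = 0.0
    for i, e in enumerate(exps):
        acc += e
        if r <= acc:
            return i
    return len(exps) - 1
```

Because infeasible entries carry zero probability mass, the chosen output is always realizable, matching the guarantee stated above.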
Agents are trained offline using Adam (learning rate 3 × 10⁻⁴), discount γ = 0.99, GAE (λ = 0.95), and the PPO clipped objective with entropy regularization. Trajectories are collected locally; each agent computes advantages and updates independently. Trained policies are frozen for all evaluation runs to ensure strict determinism and fair comparison against non-learning baselines.
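Under these settings, the per-trajectory advantage and clipped-objective computations reduce to a few lines. This is a self-contained sketch of standard GAE and the PPO clip for a single sample; the per-router fixed-point implementation is not shown:

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory; `values`
    carries one extra bootstrap entry for the state after the last step."""
    adv, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

def ppo_clip_loss(ratio, advantage, eps=0.2):
    """PPO clipped surrogate for one sample (returned as a loss,
    i.e., the negated objective); `ratio` is pi_new(a|s) / pi_old(a|s)."""
    clipped = max(min(ratio, 1 + eps), 1 - eps)
    return -min(ratio * advantage, clipped * advantage)
```

With γ = λ = 1 the advantages collapse to reward-to-go minus the value estimate, which is a convenient sanity check for the recursion.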

5.5. Measurement Protocol, Seeds, and Metrics

Unless noted otherwise, the results are averaged over five fixed seeds {41, 137, 7331, 9001, 271828}. For synthetic traffic, these seeds drive per-cycle injection and destinations (and on/off phases for bursty sources). For traces, seeds resolve coincident events and any randomized placement within clusters. The reported metrics are the following: (i) mean packet latency (cycles), (ii) throughput (packets/cycle), (iii) energy per delivered bit (electrical router+links and photonic laser/tuning/traversal), (iv) congestion (time-averaged FIFO occupancy per router, and p99 where shown), (v) packet loss rate, and (vi) photonic engagement (fraction of hops on diagonals; tuning-event counts) (cf. [27]). Where uncertainty is visualized, shaded 95% confidence bands are used; otherwise, seed means are reported.

5.6. Offered Load, Control of Injection Rate, and Anchor Point

Traffic intensity is parameterized by the injection rate ρ, in other words, the expected number of packets injected per router per cycle, as follows:
ρ ≜ (1/N) ∑_{i=1}^{N} E[X_i],
where X_i ∈ {0, 1} equals 1 when router i injects in a cycle, and equals 0 otherwise.
For synthetic sources, each router injects with Bernoulli probability p each cycle (subject to backpressure), yielding ρ = p in steady state. For trace-based workloads, temporal order is preserved, but a global thinning/dilation factor α ∈ (0, 1] is applied to inter-arrivals so that the measured mean over tiles, ρ_meas, matches the target ρ within ±0.01 while retaining burst structure.
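The dilation step can be sketched in a few lines; the helper names and the closed-form calibration are hypothetical simplifications of the simulator's calibration loop:

```python
def dilate_interarrivals(gaps, alpha):
    """Scale inter-arrival gaps by 1/alpha: alpha in (0, 1] lowers the
    injection rate by the factor alpha while preserving temporal order
    and burst structure (alpha = 1 leaves the trace unchanged)."""
    assert 0.0 < alpha <= 1.0
    return [g / alpha for g in gaps]

def measured_rho(num_injections, num_routers, cycles):
    """Mean injections per router per cycle over the measurement window."""
    return num_injections / (num_routers * cycles)

def calibrate_alpha(base_rho, target_rho):
    """Choose alpha so the dilated trace's mean rate matches the target:
    scaling gaps by 1/alpha multiplies the rate by alpha."""
    return min(1.0, target_rho / base_rho)
```

For example, a trace whose native rate is ρ = 0.8 would be dilated with α = 0.75 to hit the target ρ = 0.6, then re-measured to confirm the ±0.01 tolerance.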
Unless otherwise specified, ρ is swept broadly for uniform traffic (0.02–0.90 in steps of 0.02). For structured patterns, the sweep extends to 0.70 to highlight knee regions without saturation artifacts. The same grid is applied to trace workloads via α .
Anchor operating point. For crisp head-to-head comparisons, summary tables are reported at a sub-saturation anchor ρ = 0.60 on 16 × 16 meshes with W = 8 wavelengths. This point lies below the knee for all suites in the configured setup. Load-sweep figures show full ρ curves.
Training splits. Class membership follows the taxonomy in Table 9. Three train/test splits are used for cross-generalization:
  • Train–A → Test–A: Train on PARSEC 3.0 + SPLASH-2 with exemplars from all four traffic classes (Irregular/Bursty, Synchronization-Heavy, Streaming, Low-Contention); test on the AI workloads.
  • Train–B → Test–B: Train on PARSEC 3.0 + AI with coverage of all four classes; test on SPLASH-2.
  • Train–C → Test–C: Reduced-diversity training using only Streaming and Low-Contention traces from PARSEC 3.0; test on unseen Irregular/Bursty and Synchronization-Heavy workloads drawn from SPLASH-2 and AI.
Policies are frozen prior to evaluation; cross-generalization outcomes are reported in Section 6.7.

5.7. Baselines and Comparison Methodology

All mesh-based schemes share identical microarchitectures, buffer depths, virtual channels, arbitration, timing, and energy accounting, as follows:
  • XY (Dimension-Order). Minimal routing with an escape VC for deadlock freedom.
  • West-First. Turn-restricted deterministic routing that prioritizes westward motion to prevent channel dependency cycles.
  • ARCA (Adaptive). A local congestion-aware scheme that steers traffic based on queue feedback while preserving minimal path preferences.
  • DRLAR [3]. A distributed RL policy for electronic meshes without photonic awareness; included to isolate the benefit of optical validity and diagonal actions.
  • DeepNR [5]. A DRL method with network-level feedback and centrally trained parameters; representative of learning with broader context but no optical plane.

5.8. Post-Processing and Statistical Analysis

The per-workload results are first averaged over seeds and then aggregated to class-level means using the taxonomy in Table 9 (Irregular/Bursty, Synchronization-Heavy, Streaming, Low-Contention). Uncertainty is shown as 95% Student's t confidence intervals over seeds (n = 5).
Head-to-head comparisons (latency, throughput, and energy per delivered bit) use two-sided paired t-tests on seed-matched runs (n = 5). When normality is uncertain, Wilcoxon signed-rank tests are additionally reported as a nonparametric check. Multiple comparisons across baselines are controlled with Holm–Bonferroni at α = 0.05 . Effect sizes are reported as paired Cohen’s d, computed from within-pair differences.
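The correction and effect-size steps can be reproduced with standard statistics. The sketch below uses the stdlib only; the paired t and Wilcoxon tests themselves (e.g., via scipy.stats.ttest_rel and scipy.stats.wilcoxon) are omitted:

```python
import statistics

def paired_cohens_d(xs, ys):
    """Paired Cohen's d: mean of within-pair differences divided by
    the sample standard deviation of those differences."""
    diffs = [x - y for x, y in zip(xs, ys)]
    return statistics.mean(diffs) / statistics.stdev(diffs)

def holm_bonferroni(pvals, alpha=0.05):
    """Holm-Bonferroni step-down: test p-values in ascending order against
    alpha/(m - rank); once one fails, all larger p-values are retained."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break
    return reject
```

Holm's procedure controls the familywise error rate at α without the full conservatism of a plain Bonferroni split, which is why it is the conventional choice for a small family of baselines.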
Unless otherwise noted, each workload contributes equally to its class average to avoid bias from trace length or message volume. Reporting conventions are as follows: (i) latency in cycles (mean and p 99 where applicable); (ii) throughput in packets/cycle; (iii) energy per delivered bit in pJ/bit; (iv) congestion as time-averaged FIFO occupancy per router (and p 99 where shown). The analysis produces the following: per-workload load-sweep grids (latency, energy per delivered bit, throughput); class-average Pareto fronts (throughput vs. energy per delivered bit); mesh- and wavelength-scaling plots; spatial congestion heatmaps at matched ρ ; and ablations isolating diagonal links, feasibility masking, GAE, and entropy regularization.

6. Results and Discussion

This section reports the end-to-end performance across synthetic traffic and three benchmark suites (PARSEC 3.0, SPLASH-2, and AI). Load sweeps of latency, throughput, and energy per delivered bit (pJ/bit; hereafter energy/bit) are presented versus the offered injection rate ρ . This section then provides spatial congestion maps, throughput–energy Pareto views, scalability with mesh size and wavelength count, and ablation studies. Unless otherwise stated, each point is the mean over five seeds after a 20k-cycle warm-up and a 100k-cycle measurement window. Shaded bands indicate 95% confidence intervals. Baselines include XY, West-First, ARCA, DRLAR, and DeepNR; the proposed controller is denoted Proposed (PPO-based DRL).

6.1. Performance Versus Offered Load

6.1.1. Latency Versus Injection Rate for Synthetic Traffic

Across uniform, transpose, bit-complement, hotspot (10%), and bursty on/off (Figure 5), latency curves coincide at light load and diverge after each pattern’s knee, when feasibility-aware diagonal shortcuts become active. The knee then shifts right by Δρ ≈ 0.04–0.08. Mid-load reductions are largest for long-path and pressure-forming traffic: for Irregular/Bursty (hotspot, bursty) at ρ ≈ 0.55, latency is typically 10–16% below DRLAR and 20–28% below ARCA, with XY and West-First higher still; for the Long-Path surrogates (transpose, bit-complement), gaps open as diagonals shorten the effective distance and DeepNR narrows but does not close the margin; for Low-Contention (uniform), gains are the smallest yet persistent after the knee. At the uniform anchor ρ = 0.6 (Table 10), latency is 29.5 cycles versus 34.3 (DRLAR), 36.0 (DeepNR), 39.1 (ARCA), and 44.2/47.8 (West-First/XY).

6.1.2. Latency Versus Injection Rate for PARSEC 3.0

Across PARSEC 3.0 workloads (Figure 6), the most pronounced latency reductions occur where diagonal shortcuts are exercised frequently and feasibility masking is impactful—namely canneal and streamcluster (irregular/bursty) and fluidanimate (synchronization-heavy). In these cases, the knee advances by Δρ ≈ 0.05–0.10, and mid-load performance (ρ ≈ 0.55–0.65) improves by roughly 12–20% relative to DRLAR and 20–30% relative to ARCA, with deterministic XY/West-First trailing further. Streaming workloads (vips, facesim) exhibit smaller but consistent post-knee reductions, while low-contention blackscholes and swaptions remain tightly clustered across schemes, with the proposed policy best but close. DeepNR typically lies between ARCA and DRLAR yet remains above the proposed curve throughout the sweep.

6.1.3. Latency Versus Injection Rate for SPLASH-2

Across SPLASH-2 (Figure 7), the clearest separations appear in the synchronization-heavy raytrace and water-nsquared and in the irregular radix. After each workload’s knee, the proposed curve remains below ARCA/DRLAR, with typical mid-load reductions of 10–18% versus DRLAR and 18–26% versus ARCA; the knee advances by Δρ ≈ 0.04–0.08. Regular/spatial kernels (fft, lu, water-spatial) align with baselines at low ρ and show modest but consistent gains near and beyond the knee. DeepNR usually narrows the gap relative to ARCA yet remains above the proposed curve; deterministic XY and West-First trail throughout.

6.1.4. Latency Versus Injection Rate for AI Workloads

Across the AI traces (Figure 8), PageRank (irregular/graph-like) shows a pronounced rightward knee and a sustained mid–high-load margin. For ρ ≈ 0.5–0.7, latency is typically 8–15% below DRLAR and 15–25% below ARCA, and the separation widens toward the plateau as long-range routes engage diagonals more often. DeepNR narrows but does not close the gap; deterministic XY and West-First remain highest. MLPerf Tiny (streaming) shows restrained diagonal use and smaller but consistent post-knee reductions: curves track baselines at light load, then improve by roughly 5–9% versus DRLAR and 10–14% versus ARCA through the mid load, with no regressions across the sweep.

6.1.5. Energy per Delivered Bit Versus Injection Rate for Synthetic Traffic

Figure 9 reports energy per delivered bit (pJ/bit) versus injection rate for uniform, transpose, bit-complement, hotspot (10%), and bursty on/off. Curves overlap at light load, then separate beyond each pattern’s knee as feasibility-aware diagonal offload becomes frequent. Beyond the knee, the proposed policy is typically 5–8% lower than DRLAR and 8–12% lower than ARCA; DeepNR closes part of the gap but remains above the proposed policy across all patterns. Long-path surrogates (transpose, bit-complement) exhibit the largest savings due to path length reduction, while uniform stays closest at low ρ when optical engagement is sparse. At the uniform anchor ρ = 0.6 (Table 10), the proposed policy achieves 76.8 pJ/bit versus 80.9 for DRLAR and 83.8 for ARCA; deterministic baselines reach up to 89.5 pJ/bit.

6.1.6. Energy per Delivered Bit Versus Injection Rate for PARSEC 3.0

Across PARSEC 3.0 (Figure 10), vips and facesim (streaming) show clear reductions in energy/bit with restrained diagonal engagement, while canneal and streamcluster (irregular/bursty) sustain larger gains as load increases. At mid–high injection (ρ ≈ 0.5–0.8), typical reductions are 3–7% relative to DRLAR and 6–12% relative to ARCA; DeepNR generally lies between ARCA and DRLAR yet remains above the proposed curve. Low-contention blackscholes and swaptions remain tightly clustered across schemes, with the proposed policy best but close.

6.1.7. Energy per Delivered Bit Versus Injection Rate for SPLASH-2

Across SPLASH-2 (Figure 11), the largest reductions in energy/bit appear in volrend (streaming) and in the synchronization-heavy raytrace and water-nsquared, where diagonal offload shortens effective paths during pressure spikes. Around mid–high load (ρ ≈ 0.5–0.7), the proposed policy lowers energy/bit by approximately 4–9% relative to DRLAR and 7–13% relative to ARCA; DeepNR narrows but does not close the gap. Regular/spatial kernels (fft, lu, water-spatial) remain closely grouped across schemes with modest post-knee improvements, while deterministic XY and West-First are consistently highest.

6.1.8. Energy per Delivered Bit Versus Injection Rate for AI Workloads

Across the AI workloads (Figure 12), energy/bit declines once diagonal offload becomes frequent, with the largest benefit in PageRank (irregular/graph-like). At mid–high load, PageRank typically achieves ∼5–10% lower energy/bit than DRLAR, and the separation widens beyond the knee; DeepNR narrows but does not close the margin, while deterministic XY/West-First remain highest. MLPerf Tiny (streaming) engages optics more sparingly and shows a smaller but steady advantage—about ∼3–6% versus DRLAR through the mid-load region—with curves closely aligned at light load and separating modestly after the knee.

6.1.9. Throughput Versus Injection Rate for Synthetic Traffic

Figure 13 shows that the proposed policy attains higher knees and elevated plateaus across all synthetic sources. Gains are most visible under hotspot and bursty on/off, where diagonal offload is exercised frequently and feasibility-aware fallback prevents stalls; in the plateau region, throughput typically improves by +2–4% relative to DRLAR and +5–9% relative to ARCA. Long-path surrogates (transpose, bit-complement) and uniform traffic exhibit smaller but consistent uplifts once the knee is crossed, reflecting distance reduction with restrained optical engagement. The deterministic baselines (XY, West-First) remain lowest throughout the sweep, while DeepNR narrows—but does not remove—the gap with the proposed method.

6.1.10. Throughput Versus Injection Rate for PARSEC 3.0

Across PARSEC 3.0 (Figure 14), canneal, streamcluster, and fluidanimate exhibit a rightward knee (Δρ ≈ 0.04–0.06) and a higher plateau; relative to the best mesh baseline (typically DRLAR), plateau throughput improves by ≈0.02–0.04 pkts/cycle. Streaming and low-contention workloads (vips, blackscholes) remain close to the baselines with small positive deltas and no regressions. DeepNR generally narrows the gap to DRLAR yet remains below the proposed curve across the sweep.

6.1.11. Throughput Versus Injection Rate for SPLASH-2

Across SPLASH-2 (Figure 15), the synchronization-heavy raytrace and water-nsquared and irregular radix show sustained throughput uplifts: through mid–high load (ρ ≈ 0.5–0.7), improvements are typically +3–5% relative to DRLAR and +6–9% relative to ARCA, with a modest rightward knee shift (Δρ ≈ 0.03–0.05) and elevated plateaus. Regular/spatial kernels (fft, lu, water-spatial) remain close to baselines at light load and deliver modest, reliable gains near and beyond the knee. No workload exhibits a regression relative to the best mesh baseline; DeepNR narrows but does not eliminate the gap to the proposed curve.

6.1.12. Throughput Versus Injection Rate for AI Workloads

Across the AI traces (Figure 16), PageRank (irregular/graph-like) exhibits a pronounced rightward knee and an elevated plateau: once ρ ≳ 0.6, throughput improves by approximately 0.02–0.03 pkts/cycle relative to DRLAR and 0.05–0.07 pkts/cycle relative to ARCA. MLPerf Tiny (streaming) shows smaller but consistent uplifts—about 0.01–0.02 and 0.03–0.04 pkts/cycle versus DRLAR and ARCA, respectively—with curves overlapping at light load and separating modestly after the knee. DeepNR typically narrows but does not close the gap with the proposed curve, and deterministic XY/West-First remain lowest across the sweep; no regressions are observed.

6.2. Spatial Congestion and Throughput–Energy Pareto Fronts

6.2.1. Spatial Structure for Representative Workloads

Figure 17 presents steady-state input-buffer occupancy at a matched offered load for three representative workloads—canneal (irregular/bursty), raytrace (synchronization-heavy), and MLPerf Tiny (streaming)—across meshes of size 8 × 8 , 16 × 16 , and 32 × 32 . Columns are aligned by routing scheme (from left to right): Proposed (PPO-based DRL), DRLAR, DeepNR, ARCA, West-First, and XY. A single colorbar per workload normalizes the scale across the three mesh sizes.
For canneal and raytrace, the Proposed (PPO-based DRL) maps show the characteristic contraction of the bisection ridge and a load shift toward diagonal endpoints, consistent with photonic offload when optics are valid. Relative to DRLAR (the strongest electronic-only baseline), occupancy decreases by approximately 15–25% on the mean and 20–30% at the p99, with DeepNR reducing pressure modestly yet retaining a bright central ridge. ARCA exhibits persistent ridge lines and corner hotspots, while West-First and XY display the canonical cross-shaped hot zone. The relief strengthens with scale: from 8 × 8 to 32 × 32, ridge brightening recedes most under the proposed controller, mirroring the right-shifted knees and higher plateaus reported elsewhere in this section.
For MLPerf Tiny, all schemes remain uniformly cool with mild brightening near sinks; optical engagement is sparse (typically under 10 % of hops), so inter-scheme differences are correspondingly small. The observed patterns align with two mechanisms: (i) early diagonal offload triggered by rising local queues shortens paths through emerging hot regions; and (ii) feasibility masking enforces immediate fallback during optical invalidity, avoiding futile attempts and preventing secondary hotspots.

6.2.2. Throughput–Energy Trade-Offs

Figure 18 aggregates class-averaged operating points in the throughput–energy plane using a shared energy range (75–95 pJ/bit) across PARSEC, SPLASH-2, and AI. In all suites, the Proposed (PPO-based DRL) points shift the frontier up and left relative to the deterministic (XY, West-First), adaptive (ARCA), and DRL baselines (DRLAR, DeepNR). The largest separations occur for the Irregular/Bursty and Synchronization-Heavy classes; Streaming shows a clear energy advantage with modest throughput gains, while Low-Contention clusters tightly but still shifts positively. These trends accord with the spatial analysis: diagonal shortcuts reduce effective hop count during pressure spikes, and feasibility-aware masking prevents wasted optical attempts during detuning, lowering energy at a given throughput (see Section 6.2.3 for regimes with limited benefit).

6.2.3. Regimes with Limited Benefit

Two regimes exhibit modest improvements by construction. First, under low-contention traffic, electronic minimal paths already maintain short queues and low hop counts, so diagonal engagement is sparse and deltas remain small; this is consistent with the tightly clustered curves reported in the light-load portions of the load sweeps and the uniformly cool maps in Figure 17. Second, when photonic availability is limited (e.g., small wavelength budgets or short validity windows), the feasibility mask forces immediate fallback to electrical routes; in this case, throughput and energy/bit track mesh baselines with similar knees, as reflected in the suite-level Pareto points (Figure 18) and the wavelength scaling behavior (Section 6.3.2). In both regimes, the mask prevents regressions by avoiding non-actionable optical choices; improvements grow as pressure increases and valid optical lanes are present, matching the mechanism established throughout Section 6.

6.3. Scalability with Mesh Size and Optical Concurrency

6.3.1. Throughput Scaling with Mesh Size

Figure 19 plots aggregate throughput versus mesh size ( 8 × 8 , 16 × 16 , 32 × 32 ) for PARSEC 3.0, SPLASH-2, and AI. The proposed policy scales more steeply than all baselines, driven by two effects that intensify with size: (i) larger absolute hop reduction from diagonal shortcuts and (ii) earlier dispersion of nascent queue waves away from the electrical bisection. The gap to learning baselines (DRLAR, DeepNR) widens from 8 × 8 to 32 × 32 , most noticeably in SPLASH-2 and AI, which contain richer burst and synchronization phases; PARSEC shows a steadier but smaller advantage, reflecting a higher share of low-contention segments. Deterministic XY and West-First diverge further as size increases due to growing bisection pressure. No suite exhibits a throughput regression relative to the best mesh baseline at any size.

6.3.2. Energy per Delivered Bit Versus Wavelength Count

Figure 20 reports energy per delivered bit (pJ/bit) at ρ = 0.6 while sweeping photonic wavelength count W ∈ {4, 8, 16}. The proposed controller remains below DRLAR for every W, with the dominant improvement realized when increasing concurrency from W = 4 to W = 8; moving to W = 16 yields only modest additional savings, indicating that residual stalls are increasingly dictated by electrical queuing and short optical validity windows rather than the number of photonic lanes. Because DRLAR is electronic-only and thus W-invariant, its values are replicated across W to enable bars-only, per-W visual comparison. Sensitivity to W is strongest in SPLASH-2 and AI (richer burst and synchronization phases with longer effective paths), whereas PARSEC exhibits smaller but consistent deltas due to more low-contention segments. Design implication: provisioning W = 8 wavelengths per diagonal is a pragmatic operating point; a higher W is warranted only for the burstiest deployments or the largest meshes where overlapping reservations are frequent.

6.4. Ablation Study: Sources of Gain

An ablation at ρ = 0.6 under uniform traffic compares the full controller with four isolated variants: w/o Photonic Links (diagonals disabled at inference), w/o Validity Mask (optical actions permitted during detuning), w/o GAE (no generalized advantage estimation), and w/o Entropy Bonus (no exploration regularization). As summarized in Table 11, removing photonic links causes the largest regression (latency 29.5 → 35.0 cycles, +18.6%; throughput 0.93 → 0.88, −5.4%; energy per delivered bit 76.8 → 80.1 pJ/bit, +4.3%; congestion 44 → 52, +18.2%), confirming that distance reduction via diagonals is the dominant lever. Eliminating the feasibility mask is the next most harmful change (latency +12.5%; throughput −3.2%; energy/bit +2.5%; congestion +11.4%), indicating that masked action selection is critical to avoid futile optical attempts during detuning. Training stabilizers contribute smaller but measurable benefits: removing GAE increases latency by +7.5% (throughput −2.2%, energy/bit +1.2%), and removing the entropy bonus increases latency by +5.1% (throughput −2.2%, energy/bit +0.8%), consistent with tighter queues and reduced variance in the full model. Overall, the performance ordering Full > {w/o Entropy, w/o GAE} > w/o Validity Mask > w/o Photonic Links matches the following mechanism: (i) diagonal shortcuts shorten effective paths, (ii) feasibility masking preserves liveness and prevents stalls when optics detune, and (iii) GAE and entropy primarily stabilize learning and trim latency tails without altering the hardware footprint.
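The percentage regressions quoted above follow the usual relative-change convention (ablated value versus the full controller), which can be checked directly against the Table 11 raw numbers:

```python
def pct_change(ablated, full):
    """Relative change of an ablated variant vs. the full controller, in %.
    Positive means the metric grew (worse for latency, energy, congestion);
    negative means it shrank (worse for throughput)."""
    return (ablated - full) / full * 100.0
```

For example, the latency regression of the w/o Photonic Links variant is pct_change(35.0, 29.5) ≈ +18.6%.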

6.5. Aggregate Results by Traffic Class

Table 12, Table 13 and Table 14 summarize percentage changes of the Proposed (PPO) controller relative to each baseline, averaged across seeds, mesh sizes, and wavelength counts W. The largest gains arise in the Irregular/Bursty and Synchronization-Heavy classes: latency reductions reach 30–36% versus XY, 21–24% versus ARCA, and 10–12% versus DRLAR, with corresponding throughput uplifts of ∼12–14% versus XY and ∼5–6% versus DRLAR. Streaming traffic shows the clearest energy benefit, lowering energy/bit by ∼13% versus XY and ∼7% versus DRLAR while maintaining modest latency and throughput improvements. Low-Contention workloads remain tightly clustered yet still favor the proposed method (latency −10% vs. XY, throughput +3–4% vs. XY, energy −5% vs. XY). The class-averaged trends are consistent across suites and reflect the same mechanisms observed in prior subsections: diagonal shortcuts reduce effective hop count during pressure spikes, and feasibility-aware masking prevents wasted optical attempts during transient detuning.

6.6. Statistical Significance and Effect Sizes

Seed-matched paired testing establishes that latency improvements are statistically significant across suites and baselines. Two-sided paired t-tests and Wilcoxon signed-rank tests are applied on per-workload pairs, aggregated across mesh sizes and wavelength counts W, with familywise error controlled via Holm–Bonferroni across the five baselines. As summarized in Table 15, Table 16 and Table 17, p-values remain significant after correction, and Cohen’s d indicates large effects versus deterministic baselines (XY, West-First; d ≈ 0.9–1.1), moderate-to-large effects versus adaptive/DRL baselines (ARCA, DRLAR; d ≈ 0.7–0.8), and consistent moderate effects versus DeepNR (d ≈ 0.4–0.6). Throughput and energy/bit exhibit analogous, suite-consistent significance patterns, with the strongest effects in the irregular/bursty and synchronization-heavy classes and smaller but positive effects in the streaming and low-contention cases.

6.7. Cross-Generalization to Unseen Workloads

Table 18 quantifies out-of-domain performance for frozen policies trained on one subset and evaluated on unseen suites/classes (Train–A/B/C; see Section 5.6 for definitions). Unless noted otherwise, the values are means over five seeds at the anchor operating point (ρ = 0.60, 16 × 16 mesh, W = 8); policies are frozen with no hyperparameter retuning. Percent changes are reported as (out-of-domain − in-domain)/in-domain × 100%, so a negative value indicates degradation.
When class coverage is preserved (Train–A/B), degradation is modest—mean latency degrades by 3.8% and throughput by 2.1% for Train–A, and by 4.6% and 2.5% for Train–B—with small energy penalties (+1.2% and +1.5%) and limited tail impact (p99 degrades by 4.9% and 5.6%). Reducing class diversity during training (Train–C) incurs larger penalties—latency degrades by 8.9%, throughput by 5.7%, energy by +2.9%, and p99 by 11.4%—indicating that coverage of traffic classes matters more than matching specific benchmark suites. Consistent with Section 6.6, paired tests with Holm–Bonferroni correction confirm that the Train–C degradations are statistically significant, while the Train–A/B shifts remain modest but systematic.
Two practical mitigations emerge: (i) include at least one exemplar from each class (irregular/bursty, synchronization-heavy, streaming, low-contention) during training; and (ii) apply lightweight augmentations—vary burst amplitude and duty cycle, perturb barrier cadence, and introduce stochastic validity dropouts—to improve robustness without changing inference time cost.

6.8. Limitations and Threats to Validity

Key bounds include placement and mapping sensitivity, trace representativeness, photonic plane fidelity (laser dynamics, OSNR, and BER are abstracted), energy calibration (technology dependence), control plane assumptions (reliable reservation and validity signaling), deployment scope (offline-trained, frozen agents), scale coverage (8 × 8 to 32 × 32 meshes, W ∈ {4, 8, 16}), statistical resolution (five seeds), and aggregation choices. These motivate future work on realistic control network modeling, placement/packaging sensitivity, broader device calibration, full back-end timing/area/thermal co-analysis, and safe online adaptation under nonstationarity (Section 7). This study does not claim dominance over all-photonic fabrics; rather, it targets chip-scale hybrids where thermal budgets, device density, and buffering needs favor selective photonic engagement guided by local validity and feasibility-aware routing.
Sensitivity to delayed or noisy validity estimates is a known risk: overly conservative estimates may reduce diagonal engagement, whereas stale positives could trigger failed reservations. In both cases, the feasibility mask prevents regressions by forcing immediate electrical fallback. A systematic sensitivity sweep to estimator delay and noise is left as future work and will be released alongside the simulator scripts.

7. Future Work

Building on the constraints and insights from Section 6, the following extensions are delineated with measurable outcomes and reproducible artifacts.
  • Hardware Prototyping, P&R, and Monitors. Tape out a router macro with the embedded PPO controller; report post-layout timing, area, and power at a target node. Integrate lightweight on-chip monitors (per-port queues, optical validity, tuning events) to validate modeling assumptions and enable hardware-in-the-loop evaluation. Deliverables: netlists, timing/power reports, and monitor register maps.
  • Control Network Realism. Co-simulate reservation and validity propagation with explicit latency, jitter, and contention; quantify sensitivity of latency, throughput, and energy per delivered bit (pJ/bit) to control delays. Compare simple broadcast with routed control paths and assess backpressure coupling. Deliverables: latency–jitter sweeps and ablation scripts.
  • Online Adaptation with Guardrails. Add infrequent, safe policy updates (e.g., trust region or conservative policy iteration) triggered by drift detectors on queue/validity statistics. Enforce invariants via action masking and an always-available escape policy; measure stability over long runs (no QoS regressions). Deliverables: drift detectors, update scheduler, and stability benchmarks.
  • Learning Architecture Variants. Evaluate graph-aware encoders (local attention or 1–2 hop message passing) under the same SRAM/latency budget. Compare PPO with SAC and A2C under identical control path constraints; report accuracy–overhead trade-offs. Deliverables: plug-in encoders and unified training/eval configs.
  • Photonic Co-Design. Jointly tune wavelength count W, reservation granularity, and tuning policies. Model crosstalk/OSNR and BER margins; study adaptive laser bias and opportunistic lane parking. Extend to multi-diagonal or hierarchical optical planes and quantify diminishing returns. Deliverables: device–system co-sweep notebooks and calibrated energy models.
  • Placement, Packaging, and Thermal Co-Analysis. Explore cluster-aware vs. neutral mappings, 2.5D integration, and reticle/wafer-scale stitching. Couple electrical/photonic thermal maps to validity dynamics; report end-to-end impact on hotspots and reliability. Deliverables: placement benchmarks and electro-thermal traces.
  • Reliability and Tail Behavior. Inject transient optical faults and link flaps; evaluate p99/p999 latency and queue excursions under adversarial bursts. Add recovery policies (cool-down timers, diagonal throttling) and quantify tail-risk reduction. Deliverables: fault-injection framework and tail-metric dashboards.
  • Topology and Microarchitectural Options. Sweep express-link density/reach, escape-VC configurations, lookahead/speculative reservation, and lightweight age-/deadline-aware arbitration under matched area/power. Deliverables: Pareto surfaces (throughput–latency–energy) and critical path reports.
  • Broader Benchmarks and Reproducibility. Incorporate emerging AI/graph workloads with tighter synchronization and sparsity. Release anonymized traces, mesh/photonic configs, seeds, and figure-generation scripts to enable third-party replication. Deliverables: open repository with CI-verified artifacts and run logs.
  • Formal Safety and Progress Guarantees. Apply model checking for deadlock/livelock freedom with feasibility masks and escape VCs; verify forward progress under worst-case offered load and bounded control latency. Deliverables: properties/specifications and proof logs.
  • Policy Inspection and Interpretability. Develop per-router saliency/ablation tools to attribute actions to observables (queues, validity, hop distance). Use insights to prune inputs or compress models with no loss. Deliverables: attribution visualizers and compressed policy checkpoints.
  • Energy and Carbon Accounting. Extend the energy model with workload- and ambient-dependent laser budgets and process-scaling factors; report energy per delivered bit (pJ/bit) alongside estimated carbon intensity for system-level comparisons. Deliverables: energy/carbon calculators and scenario studies.
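Several of these directions (guardrailed online updates, formal safety proofs) build on the feasibility-masking primitive the controller already uses. As a minimal illustration of masked action sampling with an always-available escape port — a NumPy sketch under assumed port indexing, not the deployed fixed-point implementation:

```python
import numpy as np

def masked_action(logits, feasible, escape_idx=0):
    """Sample a routing action from masked policy logits.

    logits:     raw policy scores, one per output port.
    feasible:   boolean mask (True = port usable this cycle, e.g.
                optical validity bit set and credits available).
    escape_idx: index of the deterministic escape port, assumed
                always feasible so the mask is never all-False.
    """
    feasible = feasible.copy()
    feasible[escape_idx] = True              # escape path is always available
    masked = np.where(feasible, logits, -np.inf)
    z = masked - masked.max()                # stable softmax over feasible ports
    p = np.exp(z) / np.exp(z).sum()
    return int(np.random.choice(len(logits), p=p))
```

Because infeasible ports receive zero probability, the invariant "never select an untuned diagonal" holds by construction, which is what makes the model-checking extension tractable.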

8. Conclusions

This study introduces a decentralized, feasibility-aware PPO controller for HNoCs with diagonal express links. Router-local agents act on per-port queues, hop distance, and an optical validity signal to select per-flit routes within a single cycle, meeting tight hardware budgets (∼12 KB SRAM, <10 ns inference) while preserving deterministic escape paths.
Across synthetic sources and the PARSEC 3.0, SPLASH-2, and AI suites, the controller consistently advances the throughput–energy frontier and lowers latency relative to deterministic, adaptive, and prior DRL baselines. Improvements are strongest under irregular/bursty and synchronization-heavy regimes, steady for streaming, and non-regressive for low-contention traffic. At a representative sub-saturation operating point ( ρ = 0.60 , 16 × 16 , W = 8 ), the results include a 29.5-cycle mean latency, 0.93 packets/cycle throughput, and 76.8 pJ/bit energy per delivered bit.
Two design takeaways follow. First, encoding optical validity in the observation space and masking infeasible actions is pivotal for stability and efficiency at scale. Second, provisioning W = 8 wavelengths per diagonal captures most of the energy-per-bit benefit, with diminishing returns beyond this point except under severe bursts. These findings, together with spatial congestion relief concentrated away from the electrical bisection, indicate that photonic-aware, decentralized PPO routing is a practical path to lower latency, higher throughput, and reduced energy per bit under diverse Edge-AI and data-centric workloads.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The aggregated simulation results that support the findings of this study, including the minimal dataset submitted with this manuscript, are available from the author upon reasonable request.

Conflicts of Interest

The author declares no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI | Artificial Intelligence
NoC | Network-on-Chip
HNoC | Hybrid Network-on-Chip
WDM | Wavelength-Division Multiplexing
DRL | Deep Reinforcement Learning
PPO | Proximal Policy Optimization
EWMA | Exponentially Weighted Moving Average
HyPPI | Hybrid Plasmonic–Photonic Integration
RWA | Routing and Wavelength Assignment
OSNR | Optical Signal-to-Noise Ratio
p99 | 99th-Percentile (Latency)
CI | Confidence Interval
BER | Bit Error Rate
GE | Gate Equivalents

References

  1. Kakoulli, E.; Soteriou, V.; Koutsides, C.; Kalli, K. Silica-embedded silicon nanophotonic on-chip networks. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2016, 36, 978–991.
  2. Pan, Y.; Kumar, P.; Kim, J.; Memik, G.; Zhang, Y.; Choudhary, A. Firefly: Illuminating future network-on-chip with nanophotonics. In Proceedings of the 36th Annual International Symposium on Computer Architecture, Austin, TX, USA, 20–24 June 2009; Association for Computing Machinery: New York, NY, USA, 2009; pp. 429–440.
  3. Wang, S.; Zhang, X.; Wang, C.; Wu, K.; Li, C.; Dong, D. DRLAR: A deep reinforcement learning-based adaptive routing framework for network-on-chips. Comput. Netw. 2024, 246, 110419.
  4. Khan, K.; Pasricha, S. A reinforcement learning framework with region-awareness and shared path experience for efficient routing in networks-on-chip. arXiv 2023, arXiv:2307.11712.
  5. RS, R.R.; Rohit, R.; Shahreyar, M.S.; Raut, A.; Pournami, P.; Kalady, S.; Jayaraj, P. DeepNR: An adaptive deep reinforcement learning based NoC routing algorithm. Microprocess. Microsyst. 2022, 90, 104485.
  6. Kakoulli, E.; Soteriou, V.; Theocharides, T. Intelligent hotspot prediction for network-on-chip-based multicore systems. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2012, 31, 418–431.
  7. Ahmed, A.B.; Meyer, M.; Okuyama, Y.; Abdallah, A.B. Hybrid photonic NoC based on non-blocking photonic switch and light-weight electronic router. In Proceedings of the 2015 IEEE International Conference on Systems, Man, and Cybernetics, Kowloon Tong, Hong Kong, 9–12 October 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 56–61.
  8. Tan, X.; Yang, M.; Zhang, L.; Wang, X.; Jiang, Y. A hybrid optoelectronic networks-on-chip architecture. J. Light. Technol. 2014, 32, 991–998.
  9. Yang, M.; Ampadu, P. Thermal-aware adaptive fault-tolerant routing for hybrid photonic-electronic NoC. In Proceedings of the 9th International Workshop on Network on Chip Architectures, Taipei, Taiwan, 15 October 2016; Association for Computing Machinery: New York, NY, USA, 2016; pp. 33–38.
  10. Ramezanpour, K.; Liu, X.; Ampadu, P. Improving Scalability in Thermally Resilient Hybrid Photonic-Electronic NoCs. In Proceedings of the 10th International Workshop on Network on Chip Architectures, Boston, MA, USA, 14–15 October 2017; Association for Computing Machinery: New York, NY, USA, 2017; pp. 1–6.
  11. Narayana, V.K.; Sun, S.; Mehrabian, A.; Sorger, V.J.; El-Ghazawi, T. HyPPI NoC: Bringing hybrid plasmonics to an opto-electronic network-on-chip. In Proceedings of the 2017 46th International Conference on Parallel Processing (ICPP), Bristol, UK, 14–17 August 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 131–140.
  12. Wang, C.; Dong, D.; Wang, Z.; Zhang, X.; Zhao, Z. RELAR: A reinforcement learning framework for adaptive routing in network-on-chips. In Proceedings of the 2021 IEEE International Conference on Cluster Computing (CLUSTER), Online, 7–10 September 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 813–814.
  13. Wu, K.; Ye, Y. Q-Learning Based Bi-Objective Deadlock-Free Routing Optimization for Optical NoCs. In Proceedings of the 2022 15th IEEE/ACM International Workshop on Network on Chip Architectures (NoCArc), Chicago, IL, USA, 2 October 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–6.
  14. Zhang, W.; Ye, Y. A table-free approximate Q-learning-based thermal-aware adaptive routing for optical NoCs. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2020, 40, 199–203.
  15. Li, H.; Zhao, J.; Liu, F. Reinforcement Learning (RL)-Based Holistic Routing and Wavelength Assignment in Optical Network-on-Chip (ONoC): Distributed or Centralized? IEEE J. Emerg. Sel. Top. Circuits Syst. 2024, 14, 534–550.
  16. Capogrosso, L.; Cunico, F.; Cheng, D.S.; Fummi, F.; Cristani, M. A machine learning-oriented survey on tiny machine learning. IEEE Access 2024, 12, 23406–23426.
  17. Tsoukas, V.; Gkogkidis, A.; Boumpa, E.; Kakarountas, A. A Review on the emerging technology of TinyML. ACM Comput. Surv. 2024, 56, 1–37.
  18. Szydlo, T.; Jayaraman, P.P.; Li, Y.; Morgan, G.; Ranjan, R. TinyRL: Towards reinforcement learning on tiny embedded devices. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, Atlanta, GA, USA, 17–21 October 2022; Association for Computing Machinery: New York, NY, USA, 2022; pp. 4985–4988.
  19. Luo, X.; Liu, D.; Kong, H.; Huai, S.; Chen, H.; Xiong, G.; Liu, W. Efficient Deep Learning Infrastructures for Embedded Computing Systems: A Comprehensive Survey and Future Envision. ACM Trans. Embed. Comput. Syst. 2024, 24, 1–100.
  20. Wang, H.; Sakhadeo, A.; White, A.; Bell, J.; Liu, V.; Zhao, X.; Liu, P.; Kozuno, T.; Fyshe, A.; White, M. No more pesky hyperparameters: Offline hyperparameter tuning for RL. arXiv 2022, arXiv:2205.08716.
  21. Wu, X.; Zhang, Y.; Shi, M.; Li, P.; Li, R.; Xiong, N.N. An adaptive federated learning scheme with differential privacy preserving. Future Gener. Comput. Syst. 2022, 127, 362–372.
  22. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347.
  23. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533.
  24. Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), New York, NY, USA, 19–24 June 2016; Balcan, M.F., Weinberger, K.Q., Eds.; PMLR: New York, NY, USA, 2016; Volume 48, pp. 1928–1937.
  25. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2015, arXiv:1509.02971.
  26. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), Stockholm, Sweden, 10–15 July 2018; Dy, J., Krause, A., Eds.; PMLR: Stockholm, Sweden, 2018; Volume 80, pp. 1861–1870.
  27. He, Q. A Unified Metric Architecture for AI Infrastructure: A Cross-Layer Taxonomy Integrating Performance, Efficiency, and Cost. arXiv 2025, arXiv:2511.21772.
Figure 1. Tile-level architecture. Each tile contains a compute core, router, and embedded PPO-based DRL agent (12 KB SRAM, <10 ns inference) placed on the route computation (RC) stage, ensuring per-flit next-hop decisions are available before switch allocation without adding pipeline bubbles. Cardinal electrical links (solid) and diagonal photonic express links (dashed) are shown.
Figure 2. PPO-DRL agent memory profile (∼12 KB SRAM per router; weights ≈ 60%, activations ≈ 25%, control FSM ≈ 15%).
Figure 3. Conceptual flow from workloads to traffic, local observation, DRL action, and network-level effects.
Figure 4. Convergence of decentralized PPO agents. Average episode reward stabilizes as policies mature, reflecting effective diagonal offload and safe fallback behavior.
Figure 5. Synthetic traffic: latency versus injection rate ( ρ ) . Means over five seeds on a 16 × 16 mesh with W = 8 .
Figure 6. PARSEC 3.0: latency versus injection rate ( ρ ) . Means over five seeds on a 16 × 16 mesh with W = 8 .
Figure 7. SPLASH-2: latency versus injection rate ( ρ ) . Means over five seeds on a 16 × 16 mesh with W = 8 .
Figure 8. AI workloads: latency versus injection rate ( ρ ) . Means over five seeds on a 16 × 16 mesh with W = 8 .
Figure 9. Synthetic traffic: energy per delivered bit (pJ/bit) versus injection rate ( ρ ) . Means over five seeds on a 16 × 16 mesh with W = 8 .
Figure 10. PARSEC 3.0: energy per delivered bit (pJ/bit) versus injection rate ( ρ ) . Means over five seeds on a 16 × 16 mesh with W = 8 .
Figure 11. SPLASH-2: energy per delivered bit (pJ/bit) versus injection rate ( ρ ) . Means over five seeds on a 16 × 16 mesh with W = 8 .
Figure 12. AI workloads: energy per delivered bit (pJ/bit) versus injection rate ( ρ ) . Means over five seeds on a 16 × 16 mesh with W = 8 .
Figure 13. Synthetic traffic: throughput (packets/cycle) versus injection rate ( ρ ) . Means over five seeds on a 16 × 16 mesh with W = 8 .
Figure 14. PARSEC 3.0: throughput (packets/cycle) versus injection rate ( ρ ) . Means over five seeds on a 16 × 16 mesh with W = 8 .
Figure 15. SPLASH-2: throughput (packets/cycle) versus injection rate ( ρ ) . Means over five seeds on a 16 × 16 mesh with W = 8 .
Figure 16. AI workloads: throughput (packets/cycle) versus injection rate ( ρ ) . Means over five seeds on a 16 × 16 mesh with W = 8 .
Figure 17. Steady-state input-buffer occupancy for three representative workloads. Rows correspond to 8 × 8 , 16 × 16 , and 32 × 32 meshes. Columns correspond to the following routing schemes: Proposed (PPO-based DRL), DRLAR, DeepNR, ARCA, West-First, and XY. A single colorbar per workload block is used to enable direct visual comparison across mesh sizes.
Figure 18. Throughput versus energy per delivered bit (pJ/bit) by suite. Points are class-averaged per suite (Irregular/Bursty, Synchronization-Heavy, Streaming, Low-Contention); means are over five seeds. Relative to the best electronic DRL baseline, the Proposed (PPO-based DRL) method typically achieves lower energy at matched throughput and higher throughput at matched energy (on the orders of 2–5 pJ/bit and 0.02–0.04 pkts/cycle, respectively).
Figure 19. Aggregate throughput versus mesh size ( 8 × 8 , 16 × 16 , 32 × 32 ) for PARSEC 3.0 (a), SPLASH-2 (b), and AI (c). Suite-level means over workloads and five seeds; W = 8 .
Figure 20. Energy per delivered bit (pJ/bit) at ρ = 0.6 on a 16 × 16 mesh for PARSEC 3.0, SPLASH-2, and AI (left to right). Grouped bars per W { 4 , 8 , 16 } compare Proposed (PPO-based DRL) with DRLAR. DRLAR is electronic-only and therefore W-invariant; values are replicated across W for bars-only visualization. Means are over five seeds.
Table 1. Taxonomy of HNoC and DRL routing approaches.
Approach (Ref.) | Topology | Photonic-Aware | Thermal Strategy | Control | Scale Readiness
Photonic express HNoC [7] | Mesh with optical express | Yes | Device-level tuning assumed | Deterministic/Adaptive | Prototype scale
Butterfly fat tree hybrid [8] | Butterfly fat tree overlay | Yes | Not explicit | Deterministic | Prototype scale
HyPPI concepts [11] | Hybrid plasmonic–photonic | Yes | Not explicit | N/A (device focus) | Early-stage
TAFT thermal-aware routing [9] | Mesh hybrid | Yes | Thermal-aware routing | Deterministic | Prototype scale
Topology-aware scaling [10] | Mesh hybrid | Yes | Topology and thermal scaling | Deterministic | Prototype scale
Thermal Q-learning for optical NoC [13] | Optical routing | Yes | Thermal-aware Q-learning | Centralized or LUT-local | Small testbeds
Table-free thermal RL [14] | Optical routing | Yes | Thermal-aware, table-free | Local inference | Small testbeds
RL for RWA in optical NoC [15] | Optical RWA | Yes | OSNR and latency constraints | Centralized RL | Small to medium
DRLAR [3] | Electronic mesh | No | Not modeled | Centralized training | Medium scale
DeepNR [5] | Electronic mesh | No | Not modeled | Centralized training | Medium scale
Q-RASP [4] | Electronic mesh | No | Not modeled | Region-coordinated | Medium scale
RELAR [12] | Electronic mesh | No | Not modeled | Centralized training | Medium scale
Proposed decentralized PPO in HNoC | Mesh with diagonal photonic express links | Yes | Thermal/link validity in state | Fully decentralized (per router) | Mesh-scale
Notes. HyPPI: hybrid plasmonic–photonic integration; RWA: routing and wavelength assignment; OSNR: optical signal-to-noise ratio.
Table 2. Hardware resource summary for the router-local controller.
Component | Value | Notes
MLP parameters | <6000 | Shared trunk; policy/value heads
Memory footprint | ∼12 KB | Quantized weights and activations; small FSM state
Inference latency | <10 ns | Single-cycle routing stage; fully pipelined
Area estimate | ∼35 K GE | Post-synthesis at 28 nm CMOS (controller only)
Thermal adaptivity | Enabled | Validity bit in state; masked action sampling
Table 3. Observation vector components and normalization.
Feature | Range | Notes
Per-port queues | [0, 1] | Depth divided by max FIFO entries
Queue-delta history | [−1, 1] | Signed change per cycle; standardized
Local injection avg. | [0, 1] | EWMA of injected flits per cycle
Hop distance | [0, 1] | Manhattan distance divided by mesh max
Optical validity | {0, 1} | 1 if diagonal tuned/reserved; else 0
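The normalizations in Table 3 can be sketched in NumPy as follows. This is illustrative rather than the deployed fixed-point implementation; `fifo_depth` and `mesh_max_hops` are example constants matching the 8-flit FIFOs and mesh diameters used in the experiments, and the EWMA smoothing factor is an assumed value:

```python
import numpy as np

def ewma(prev, x, alpha=0.1):
    """EWMA update for the local injection estimate (alpha is assumed)."""
    return alpha * x + (1.0 - alpha) * prev

def build_observation(queue_depths, queue_deltas, ewma_inject,
                      hops_to_dest, optical_valid,
                      fifo_depth=8, mesh_max_hops=30):
    """Assemble the router-local observation vector of Table 3."""
    q = np.asarray(queue_depths, dtype=float) / fifo_depth         # [0, 1]
    dq = np.clip(np.asarray(queue_deltas, dtype=float) / fifo_depth,
                 -1.0, 1.0)                                        # [-1, 1]
    inj = min(max(float(ewma_inject), 0.0), 1.0)                   # [0, 1]
    hop = hops_to_dest / mesh_max_hops                             # Manhattan / max
    valid = 1.0 if optical_valid else 0.0                          # {0, 1}
    return np.concatenate([q, dq, [inj, hop, valid]])
```

For a four-port router this yields an 11-element vector: four queue occupancies, four queue deltas, and the three scalar cues.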
Table 4. Representative PPO hyperparameters.
Parameter | Value or Setting
Discount factor γ | 0.99
GAE parameter λ | 0.95
Clip parameter ϵ | 0.1–0.2
Policy entropy weight c2 | 0.01
Value loss weight c1 | 0.5
Horizon T | 128–256 steps
Optimization epochs K | 4–8 per update
Minibatch size M | 256
Optimizer and step size | Adam, 3 × 10⁻⁴
Advantage normalization | Enabled per update
Gradient clipping | Global norm, 0.5–1.0
Notes. (i) For 8 × 8 and 32 × 32 meshes, aggregate transitions are ≈4.9 × 10⁷ and ≈7.9 × 10⁸, respectively, when episodes and T are held fixed. (ii) Wall-clock time is implementation- and hardware-dependent; values reported here are indicative for CPU-only training on the stated host. (iii) The same budget is used for DQN/A2C in the stability comparison (Section Empirical Stability: PPO Versus DQN/A2C).
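The quantities in Table 4 plug into the standard clipped PPO objective with GAE advantages. The sketch below uses plain NumPy with the tabulated coefficient values; it illustrates the math only and is not the quantized on-chip form:

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one horizon.

    values has length T+1 (bootstrap value appended)."""
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    return adv

def ppo_loss(ratio, adv, v_pred, v_targ, entropy,
             eps=0.2, c1=0.5, c2=0.01):
    """Clipped surrogate: minimize -(L_clip - c1 * L_V + c2 * H)."""
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * adv
    l_clip = np.minimum(unclipped, clipped).mean()
    l_value = ((v_pred - v_targ) ** 2).mean()
    return -(l_clip - c1 * l_value + c2 * entropy)
```

Here `ratio` is the per-sample probability ratio π_new(a|s)/π_old(a|s); the clip keeps each update inside the trust region that underpins the stability results of Table 7.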
Table 5. Representative training budget and computational cost (PPO; 16 × 16 mesh). Unless noted, the same episode budget and feasibility masking are used across algorithms in stability comparisons.
Quantity | Value
Episodes | 3000
Horizon per episode (T) | 256 steps
Routers/agents (16 × 16 mesh) | 256
Transitions per agent | 7.7 × 10⁵
Aggregate transitions (all agents) | 2.0 × 10⁸
Optimizer/learning rate | Adam / 3 × 10⁻⁴
Hardware (offline training) | Dual Intel Xeon 6226R; 256 GB RAM
Approx. wall clock (offline PPO training) | ≈24 CPU-hours
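The transition counts in Table 5, and those quoted in the notes to Table 4 for the other mesh sizes, follow directly from the episode budget:

```python
# Training budget arithmetic (Table 5): per-agent transitions are
# episodes * horizon; aggregate counts scale with router count.
episodes, horizon = 3000, 256
per_agent = episodes * horizon            # 768,000 ~ 7.7e5

totals = {}
for n in (8, 16, 32):
    agents = n * n                        # one agent per router
    totals[n] = per_agent * agents
    print(f"{n}x{n}: {totals[n]:.1e} transitions")
```

This reproduces ≈4.9 × 10⁷ for 8 × 8, ≈2.0 × 10⁸ for 16 × 16, and ≈7.9 × 10⁸ for 32 × 32.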
Table 6. Comparison of DRL algorithms for NoC routing.
Algorithm | Stability | Sample Eff. | Remarks
Deep Q-Network (DQN) | Low | Low | Value-based; nonstationary multi-agent replay is brittle [23]
Advantage Actor–Critic (A2C) | Medium | Medium | Synchronous updates can require coordination [24]
Proximal Policy Optimization (PPO) | High | High | Clipped updates and entropy aid decentralized, high-variance control [22]
Deep Deterministic Policy Gradient (DDPG) | Medium | High | Continuous control; often central critic and noise sensitivity [25]
Soft Actor–Critic (SAC) | High | High | Strong performance; often improved by synchronized actor–critic [26]
Table 7. Stability under identical training budgets (five seeds; 16 × 16 , W = 8 ). Lower is better for variance and mean policy KL; fewer episodes indicate faster stabilization.
Algorithm | Episodes to Plateau | Post-Plateau Reward Variance | Mean Policy KL Divergence
PPO | ≈2800 | ≈0.9 | ≈0.012
DQN | ≈4100 | ≈3.4 | ≈0.031
A2C | ≈3700 | ≈2.7 | ≈0.024
Table 8. Architectural parameters used across all experiments.
Parameter | Setting
Topology | 2D electronic mesh with diagonal photonic express links
Mesh sizes | 8 × 8, 16 × 16, 32 × 32
Flow control | Wormhole, credit-based
Virtual channels/input | 2
Input buffering | 8-flit FIFOs
Electrical datapath | 128-bit flit; 1-hop link latency
Technology/freq. | 28 nm CMOS, 1 GHz
Photonic WDM | W ∈ {4, 8, 16} per direction per diagonal
Photonic hop semantics | 1-hop equivalent (long-reach bypass)
Optical validity | Per-cycle tuned/unavailable bit; enforced via feasibility mask
Energy model (electrical) | Orion-style, activity-count based (router + links)
Energy model (photonic) | Laser bias, tuning events, link traversal
Loss/tuning model | 0.5 dB/cm, 1.2 dB/coupler, 0.1 nm/°C, 1.5 pJ/event
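The three photonic energy components listed in Table 8 can be tallied as a simple sketch. Every coefficient here is a hypothetical placeholder except the 1.5 pJ per tuning event quoted in the loss/tuning model; the per-bit traversal cost and laser bias are assumed inputs, not values from the calibrated model:

```python
def photonic_energy_pj(bits, traversal_pj_per_bit, tuning_events,
                       laser_bias_mw, active_cycles, freq_ghz=1.0,
                       tuning_pj=1.5):
    """Illustrative tally of the Table 8 photonic energy terms, in pJ."""
    e_traverse = bits * traversal_pj_per_bit          # link traversal
    e_tuning = tuning_events * tuning_pj              # ring tuning events
    # Static laser bias amortized over active cycles:
    # mW * ns = pJ, and 1 cycle = 1 ns at 1 GHz.
    e_bias = laser_bias_mw * active_cycles / freq_ghz
    return e_traverse + e_tuning + e_bias

# Energy per delivered bit would then be:
#   epb = photonic_energy_pj(...) / bits_delivered
```

The unit bookkeeping (mW times ns equals pJ) is what lets a cycle count stand in for time at the 1 GHz clock of Table 8.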
Table 9. Workload taxonomy used in analysis.
Workload | Suite | Comm. Pattern | Mem./Compute Bound | Expected NoC Stress
canneal | PARSEC | Irregular, bursty | Memory-bound | Hotspot-prone
fluidanimate | PARSEC | Sync bursts | Mixed | Temporal congestion
blackscholes | PARSEC | Uniform, low-contention | Compute-bound | Low stress
dedup | PARSEC | Irregular | Memory-bound | Burst-heavy
ferret | PARSEC | Irregular + bursts | Memory-bound | High entropy
swaptions | PARSEC | Uniform moderate | Compute-bound | Balanced
vips | PARSEC | Streaming | Memory-bound | Sustained load
streamcluster | PARSEC | Burst-sustained | Memory-bound | High sustained
bodytrack | PARSEC | Sync + irregular | Mixed | Medium stress
freqmine | PARSEC | Irregular | Memory-bound | Hotspot-prone
facesim | PARSEC | Streaming | Compute-bound | Balanced
barnes | SPLASH-2 | Irregular | Memory-bound | Hotspot
cholesky | SPLASH-2 | Irregular | Mixed | Congestion
fft | SPLASH-2 | Regular pattern | Compute-bound | Uniform load
fmm | SPLASH-2 | Irregular | Memory-bound | Bursty
lu | SPLASH-2 | Regular block | Mixed | Balanced
ocean | SPLASH-2 | Structured | Memory-bound | Sustained load
radix | SPLASH-2 | Irregular bursty | Memory-bound | Hotspot
raytrace | SPLASH-2 | Irregular + sync | Mixed | Congestion
volrend | SPLASH-2 | Streaming | Memory-bound | Balanced
water-nsquared | SPLASH-2 | Irregular sync | Memory-bound | Medium stress
water-spatial | SPLASH-2 | Regular spatial | Mixed | Balanced
MLPerf Tiny inf. | AI | Streaming | Compute-bound | Sustained
PageRank (GraphBIG) | AI/Graph | Irregular | Memory-bound | Irregular spikes
Table 10. Uniform traffic at ρ = 0.6 (means over seeds).
Routing Algorithm | Latency (Cycles) | Throughput (Pkts/Cycle) | Energy per Delivered Bit (pJ/bit) | Congestion (%)
XY | 47.8 | 0.79 | 89.5 | 64
West-First | 44.2 | 0.82 | 87.0 | 61
ARCA | 39.1 | 0.86 | 83.8 | 56
DeepNR | 36.0 | 0.88 | 81.6 | 53
DRLAR | 34.3 | 0.89 | 80.9 | 51
Proposed (PPO-based DRL) | 29.5 | 0.93 | 76.8 | 44
Table 11. Ablation at ρ = 0.6 (uniform traffic). Means over seeds; lower is better for latency/energy/congestion, higher is better for throughput.
Model Variant | Latency (Cycles) | Throughput (Pkts/Cycle) | Energy per Delivered Bit (pJ/bit) | Congestion (%)
Full Model | 29.5 | 0.93 | 76.8 | 44
w/o GAE | 31.7 | 0.91 | 77.7 | 47
w/o Entropy Bonus | 31.0 | 0.91 | 77.4 | 46
w/o Validity Mask | 33.2 | 0.90 | 78.7 | 49
w/o Photonic Links | 35.0 | 0.88 | 80.1 | 52
Table 12. Latency change (%, proposed versus baseline; negative is better).
Class | XY | West-First | ARCA | DRLAR | DeepNR
Irregular/Bursty | −35.6 | −30.4 | −23.8 | −12.1 | −9.5
Synchronization-Heavy | −32.1 | −28.6 | −21.2 | −10.5 | −8.1
Streaming | −18.9 | −16.7 | −11.6 | −6.0 | −4.8
Low-Contention | −10.4 | −9.1 | −7.1 | −3.4 | −2.6
Table 13. Energy per delivered bit (pJ/bit) change (%, proposed versus baseline; negative is better).
Class | XY | West-First | ARCA | DRLAR | DeepNR
Irregular/Bursty | −10.8 | −9.6 | −7.4 | −4.9 | −4.2
Synchronization-Heavy | −9.8 | −8.7 | −6.1 | −4.4 | −3.6
Streaming | −12.9 | −11.8 | −9.3 | −6.7 | −5.8
Low-Contention | −4.9 | −4.3 | −2.8 | −2.0 | −1.7
Table 14. Throughput change (%, proposed versus baseline; positive is better).
Class | XY | West-First | ARCA | DRLAR | DeepNR
Irregular/Bursty | +13.8 | +11.9 | +8.9 | +6.2 | +4.9
Synchronization-Heavy | +12.1 | +10.4 | +7.6 | +5.5 | +4.2
Streaming | +6.8 | +5.9 | +4.1 | +3.0 | +2.6
Low-Contention | +3.7 | +3.1 | +2.2 | +1.6 | +1.3
Table 15. PARSEC latency—proposed versus baseline (paired tests; aggregated across meshes and W). p-values are Holm–Bonferroni corrected; Cohen’s d is uncorrected.
Baseline | p (t-Test) | p (Wilcoxon) | Cohen's d
XY | <10⁻⁶ | <10⁻⁶ | 1.12
West-First | <10⁻⁶ | <10⁻⁶ | 0.98
ARCA | <10⁻⁶ | <10⁻⁶ | 0.80
DRLAR | <10⁻⁵ | <10⁻⁵ | 0.72
DeepNR | 2.6 × 10⁻⁴ | 3.4 × 10⁻⁴ | 0.58
Table 16. SPLASH-2 latency—proposed versus baseline (paired tests; Holm–Bonferroni corrected).
Baseline | p (t-Test) | p (Wilcoxon) | Cohen's d
XY | <10⁻⁶ | <10⁻⁶ | 1.05
West-First | <10⁻⁶ | <10⁻⁶ | 0.93
ARCA | <10⁻⁶ | <10⁻⁶ | 0.77
DRLAR | <10⁻⁵ | <10⁻⁵ | 0.70
DeepNR | 3.8 × 10⁻⁴ | 4.6 × 10⁻⁴ | 0.54
Table 17. AI latency—proposed versus baseline (paired tests; Holm–Bonferroni corrected).
Baseline | p (t-Test) | p (Wilcoxon) | Cohen's d
XY | <10⁻⁵ | <10⁻⁵ | 0.86
West-First | <10⁻⁵ | <10⁻⁵ | 0.79
ARCA | 1.6 × 10⁻³ | 2.1 × 10⁻³ | 0.55
DRLAR | 3.0 × 10⁻³ | 3.8 × 10⁻³ | 0.49
DeepNR | 7.2 × 10⁻³ | 8.5 × 10⁻³ | 0.41
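The paired-test protocol of Tables 15–17 can be reproduced with SciPy. The helper below is an illustrative sketch: Holm–Bonferroni correction is applied to the p-values across baselines, and Cohen's d for paired data is left uncorrected, as stated in the captions:

```python
import numpy as np
from scipy import stats

def paired_comparison(proposed, baselines):
    """Paired t-test and Wilcoxon signed-rank test of `proposed`
    against each baseline, with Holm-Bonferroni correction of the
    t-test p-values across baselines.

    proposed:  per-run latencies for the proposed scheme.
    baselines: dict name -> matched per-run latencies.
    Returns [name, corrected t-test p, Wilcoxon p, Cohen's d] rows.
    """
    results = []
    for name, base in baselines.items():
        diff = np.asarray(proposed) - np.asarray(base)
        t_p = stats.ttest_rel(proposed, base).pvalue
        w_p = stats.wilcoxon(proposed, base).pvalue
        d = diff.mean() / diff.std(ddof=1)      # paired Cohen's d
        results.append([name, t_p, w_p, d])
    # Holm-Bonferroni: step-down over sorted p-values, monotone.
    order = np.argsort([r[1] for r in results])
    m, running = len(results), 0.0
    for rank, i in enumerate(order):
        running = max(running, results[i][1] * (m - rank))
        results[i][1] = min(running, 1.0)
    return results
```

Note the sign convention: because the differences are proposed minus baseline, a latency improvement yields a negative d here, whereas the tables report effect magnitudes.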
Table 18. Cross-generalization—mean percent change relative to in-domain performance (negative indicates degradation).
Setup | Δ Latency | Δ Throughput | Δ Energy/bit | Δ p99 Latency
Train–A → Test–A | −3.8% | −2.1% | +1.2% | −4.9%
Train–B → Test–B | −4.6% | −2.5% | +1.5% | −5.6%
Train–C → Test–C | −8.9% | −5.7% | +2.9% | −11.4%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Kakoulli, E. Photonic-Aware Routing in Hybrid Networks-on-Chip via Decentralized Deep Reinforcement Learning. AI 2026, 7, 65. https://doi.org/10.3390/ai7020065
