1. Introduction
The rapid growth of internet and mobile traffic has increased the load on content delivery networks (CDNs). As of 2024, about 5.5 billion people (68%) use the internet and roughly 57% are mobile users [1,2]. Over the same period, monthly mobile data traffic rose from 58.2 exabytes in 2020 to 165.7 exabytes in 2024 and is projected to reach 430 exabytes by 2030 [3]. These trends call for caching policies that track rapid, uneven shifts in content popularity without incurring high online overhead.
Reinforcement learning (RL) offers a data-driven alternative to classical policies such as least recently used (LRU) and least frequently used (LFU). Model-free value methods like Deep Q-Network (DQN) [4] and Double DQN (DDQN) [5] can adapt to changing demand, but much prior work assumes a fully observable Markov Decision Process (MDP). In production CDNs, signals are often delayed, partial, or noisy, so the agent observes compressed or lagged statistics rather than the true state. This motivates a partially observable formulation.
We model cache replacement as a Partially Observable MDP (POMDP) and introduce the Miss-Triggered Cache Transformer (MTCT), a Transformer-decoder Q-learning agent that encodes recent observation histories via self-attention. MTCT invokes its policy only on cache misses, aligning updates with informative events and reducing the online inference rate to approximately the cache-miss rate. A delayed-hit reward carries cache-hit information into miss-triggered updates to stabilize learning under partial observability. The action space is a compact rank–evict grid (12 actions by default) that trades off popularity and recency while keeping decision complexity independent of capacity.
We evaluate MTCT on a real trace (MovieLens) and two synthetic workloads (Mandelbrot–Zipf, Pareto). Baselines include modern admission/replacement methods—Adaptive Replacement Cache (ARC) and Windowed TinyLFU (W-TinyLFU)—and classical heuristics. We also compare against RL baselines, including DDQN and a miss-triggered DDQN control that shares the trigger schedule and replay protocol with MTCT. At most cache sizes, MTCT attains the best or statistically comparable hit rates. On MovieLens, it is consistently the top performer across all capacities; on the synthetic workloads, ARC and W-TinyLFU are strong at small to mid capacities, with gaps narrowing or reversing as M grows. Despite using a higher-capacity network, MTCT's mean wall-clock time per training episode is lower than DDQN's, owing to miss-triggered inference and parallelizable decoder execution.
We also study design choices via ablations. Varying the context length reveals four regimes with a stable plateau at intermediate lengths, supporting a mid-range context length as a practical default. Adjusting action-set granularity (3/7/12) shows that finer grids improve final accuracy and reduce run-to-run variance, whereas very coarse grids converge faster but to lower fixed points. Under distribution shift on Mandelbrot–Zipf, MTCT achieves the strongest end-of-training averages among the compared methods; the results also suggest that prioritized or recency-aware replay could speed post-shift adaptation.
Contributions. We formulate cache replacement under partial observability as a POMDP and instantiate it with MTCT. The policy is invoked only on cache misses, focusing computation on informative events and reducing the online inference rate to approximately the cache-miss rate. A delayed-hit reward carries cache-hit feedback into miss-triggered updates, improving stability. We introduce a compact, rank-based action space (12 actions by default) and examine 3/7/12 variants to assess the effect of action granularity. Against Adaptive Replacement Cache (ARC), Windowed TinyLFU (W-TinyLFU), classical heuristics, and both standard and miss-triggered DDQN, MTCT attains higher hit rates on MovieLens and best or statistically comparable results on the synthetic workloads (with rare exceptions noted in Appendix B). It also lowers mean per-episode wall-clock time relative to standard DDQN.
2. Related Work
2.1. Cache Management Policies and Related Work
Research on content caching is commonly grouped by admission, eviction, and replacement. Admission policies decide whether a requested object should enter the cache; eviction policies choose a victim when the cache is full; replacement policies couple admission and eviction into a unified decision.
Admission. RL-Cache learns admission directly from request traces with model-free RL [6]. Somuyiwa et al. [7] formulate proactive mobile caching as an MDP to minimize long-term energy. Wang et al. [8] address long horizons via subsampling and Advantage Actor–Critic (A2C). Niknia et al. [9] combine Double Deep Q-Learning with transfer learning for adaptive edge caching. Srinivasan et al. [10] use LSTM-augmented A2C to cope with non-stationarity in wireless settings.
Eviction. Zhou et al. [11] propose 3L-Cache, applying object-level RL with bidirectional sampling and gradient boosting. Alabed et al. [12] present RLCache, managing admission, eviction, and TTL via multi-task RL. Yang et al. [13] introduce MAT, which mixes heuristics and RL to reduce prediction cost for eviction. Wang et al. [14] develop DeepChunk, a DQN-based scheme for chunk-level caching in wireless systems. Sun et al. [15] report a DQN-driven cache manager with strong adaptability across workloads.
Replacement. Zhong et al. [16] use the Wolpertinger architecture to tackle large discrete action spaces in DRL-based caching. Nguyen et al. [17] (RL-Bélády) unify admission and eviction for CDNs. Zhou et al. [18] propose Catcher, an end-to-end DRL caching system that generalizes across cache sizes and workloads. Wang et al. [19] study multi-agent DRL for cooperative edge caching; Lyu et al. [20] extend this with an actor–critic design for dynamic environments. Abdo et al. [21] present a Q-learning framework that integrates admission and replacement in fog computing.
Building on this line of work, we take a replacement-oriented view and address partial observability directly: we cast cache replacement as a POMDP and learn a policy that couples admission and rank-based eviction. Our formulation differs from fully observable MDP approaches by explicitly encoding recent observation histories and invoking the policy only on cache misses, aligning the decision rate with the cache-miss rate.
2.2. MDP-Based Caching Research and Its Limitations
2.2.1. Markov Decision Processes (MDPs)
Owing to their tractability and well-established solution methods [22], Markov Decision Processes (MDPs) have been widely adopted in content caching research. An MDP is defined by the tuple $(S, A, P, R, \gamma)$, where $S$ is the state space, $A$ the action space, $P$ the state-transition probability, $R$ the reward function, and $\gamma \in [0, 1)$ the discount factor. In this setting, the agent fully observes the current state $s_t \in S$ and selects actions $a_t \in A$ to maximize expected long-term reward. The state-value function under a fixed policy $\pi$ is
$$V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, s_0 = s\right], \qquad (1)$$
and the optimal value function is $V^{*}(s) = \max_{\pi} V^{\pi}(s)$. Here, $r_t$ denotes the immediate reward at step $t$.
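For reference, the Bellman optimality conditions implied by these definitions can be stated in the standard textbook form below; the notation reuses the tuple $(S, A, P, R, \gamma)$ defined above and is not transcribed from the paper's own equations.

```latex
V^{*}(s) \;=\; \max_{a \in A}\Big[\, R(s,a) \;+\; \gamma \sum_{s' \in S} P(s' \mid s, a)\, V^{*}(s') \Big],
\qquad
Q^{*}(s,a) \;=\; R(s,a) \;+\; \gamma \sum_{s' \in S} P(s' \mid s, a)\, \max_{a' \in A} Q^{*}(s', a').
```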
Building on this framework, prior studies have proposed a range of MDP-based caching methods. RL-Cache [6] learns admission policies directly from traces, while Somuyiwa et al. [7] applied an MDP model to proactive caching in mobile networks. To address scalability, Wang et al. [8] introduced subsampling with Advantage Actor–Critic (A2C), and Srinivasan et al. [10] used LSTM-augmented A2C to adapt under non-stationary demand. Zhou et al. [11] further developed 3L-Cache, combining object-level RL with bidirectional sampling for efficient replacement.
2.2.2. Semi-Markov Decision Processes (SMDPs)
While MDPs assume fixed decision intervals, real caching systems often feature variable timing between events. Semi-Markov Decision Processes (SMDPs) extend MDPs by allowing flexible holding times between decisions [23].
Let $\tau_k$ denote the random holding time between the $k$-th and $(k{+}1)$-th decision steps, and define the elapsed time up to decision step $t$ as $T_t = \sum_{k=0}^{t-1} \tau_k$, so that $T_0 = 0$. The state-value function under policy $\pi$ is
$$V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{T_t} r_t \,\middle|\, s_0 = s\right], \qquad (3)$$
where $\gamma \in (0, 1)$ is the discount factor and $r_t$ the immediate reward at decision step $t$. The optimal value function is $V^{*}(s) = \max_{\pi} V^{\pi}(s)$. By raising $\gamma$ to the realized elapsed time $T_t$, this formulation correctly accounts for irregular decision intervals. Such flexibility makes SMDPs particularly suitable for irregular or event-driven caching systems; for example, Niknia et al. [9] applied an SMDP-based approach to edge caching with irregularly timed requests.
2.2.3. Limitations of MDP/SMDP Approaches
While MDPs and SMDPs provide mathematically well-defined objectives (Equations (1) and (3)) and established solution techniques, both rely on the assumption of full state observability. In practice, this assumption rarely holds in CDN environments with distributed caches, delayed telemetry, and noisy logs. As Kaelbling et al. [22] noted, full observability is convenient for analysis but uncommon in real systems.
Empirical evidence from other domains supports this limitation. In partially observable Atari tasks with flickering inputs, Hausknecht and Stone [24] showed that memoryless DQN agents degrade sharply, whereas adding recurrence (DRQN) restores stability under observation aliasing. Across broader POMDP benchmarks, Toro Icarte et al. [25] reported that reactive (memoryless) policies are unreliable and that learned memory is required for stable generalization. From a theoretical perspective, Liu et al. [26] demonstrated that MDP-based learners can be sample-inefficient or unstable under partial observability unless restrictive revealing conditions are satisfied. More recently, Esslinger et al. [27] found that Transformer-based memory (DTQN) is particularly effective, coping with occlusion and noise while outperforming recurrent baselines in learning speed and robustness.
Taken together, these findings indicate that cache control under delayed and partial measurements is better modeled as a POMDP, where incorporating learned memory into the policy is essential. Motivated by this evidence, our work adopts a Transformer decoder as the memory module and trains it with miss-triggered updates tailored to event-driven CDN caching.
2.3. Memory-Based Approximation for POMDP Caching
2.3.1. POMDP Definition and Exact-Solution Complexity
A Partially Observable Markov Decision Process (POMDP) is defined by the tuple $(S, A, P, R, \gamma, O, Z)$, where $S$ is the latent state space, $A$ the action space, $P(s' \mid s, a)$ the state-transition probability, $R$ the reward function, $\gamma$ the discount factor, $O$ the observation space, and $Z(o \mid s', a)$ the observation (emission) model. Because the true state is unobserved, the agent maintains a belief distribution $b_t(s)$ over $S$, updated by Bayes' rule [22]. The value function over belief states is
$$V^{\pi}(b) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, b_0 = b\right].$$
Solving the corresponding continuous-state Bellman equation exactly is computationally intractable: grid-based or point-based methods must quantize the belief simplex, and the number of grid points grows exponentially with $|S|$. This curse of dimensionality renders exact POMDP solvers impractical for real-time CDN caching, where state dimensionality (thousands of objects) and request dynamics are large and fast.
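For completeness, the Bayes update mentioned above can be written in its standard form; the emission model $Z$, transition kernel $P$, and normalizer $\eta$ follow the tuple defined in this subsection rather than notation copied from the paper.

```latex
b_{t+1}(s') \;=\; \eta \; Z(o_{t+1} \mid s', a_t) \sum_{s \in S} P(s' \mid s, a_t)\, b_t(s),
\qquad
\eta^{-1} \;=\; \sum_{s'' \in S} Z(o_{t+1} \mid s'', a_t) \sum_{s \in S} P(s'' \mid s, a_t)\, b_t(s).
```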
2.3.2. Memory-Based Approximation
To avoid intractable belief updates (Section 2.3.1), we adopt a memory-based approximation that compresses the most recent $L$ interaction steps into a learned memory vector. At decision time $t$, we define the observation history
$$h_t = (o_{t-L+1}, \ldots, o_{t-1}, o_t). \qquad (5)$$
Toro Icarte et al. [25] showed that memoryless policies degrade under partial observability, whereas learned memory (e.g., LSTM) restores stability and generalization. Parisotto et al. [28] introduced Gated Transformer-XL (GTrXL), which augments Transformer-XL with GRU-style gating and a revised normalization scheme; on DMLab-30, GTrXL outperformed LSTM-based agents, indicating that gated self-attention can serve as an efficient memory mechanism for POMDPs.
2.3.3. Advantages of Transformer Architectures in RL for Caching
A Transformer decoder applies causal self-attention to the fixed-length observation history $h_t$ (Equation (5)), producing a compact memory embedding $z_t$. Unlike recurrent networks, parallel attention avoids stepwise recurrence, mitigates vanishing gradients, and enables faster convergence. Causal masking preserves temporal order while allowing flexible reference to informative past events, which is useful in sequential decision problems such as caching. We concatenate $z_t$ with the current cache features and approximate the action–value function as
$$Q(h_t, a; \theta) \approx Q^{*}(h_t, a).$$
The policy then selects $a_t = \arg\max_{a \in A} Q(h_t, a; \theta)$.
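As an illustration of this idea, the following PyTorch sketch builds a decoder-style (causally masked) self-attention stack over the last L observations and maps the final embedding to Q-values. It is a minimal example under assumed dimensions (d_obs, d_model, layer counts); it omits the concatenation with cache features and is not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class TransformerQNet(nn.Module):
    """Causal self-attention over the last L observations, then a Q-value head.

    A minimal sketch of the memory module described in Section 2.3.3;
    hyperparameters (d_model, n_heads, n_layers) are illustrative only.
    """
    def __init__(self, d_obs: int, n_actions: int, d_model: int = 64,
                 n_heads: int = 4, n_layers: int = 2, max_len: int = 256):
        super().__init__()
        self.embed = nn.Linear(d_obs, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.q_head = nn.Linear(d_model, n_actions)

    def forward(self, history: torch.Tensor) -> torch.Tensor:
        # history: (B, L, d_obs) -- the most recent L observations
        B, L, _ = history.shape
        pos = torch.arange(L, device=history.device)
        x = self.embed(history) + self.pos(pos)[None, :, :]
        mask = nn.Transformer.generate_square_subsequent_mask(L).to(history.device)
        z = self.encoder(x, mask=mask)       # causal (decoder-style) self-attention
        return self.q_head(z[:, -1, :])      # Q-values from the last step's embedding

# Usage: greedy action from a single (1, L, d_obs) history tensor
net = TransformerQNet(d_obs=16, n_actions=12)
q = net(torch.randn(1, 50, 16))
action = int(q.argmax(dim=-1))
```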
Adaptation for Content Caching
We augment the Transformer inputs with cache-specific metadata (e.g., popularity statistics, categorical tags, timestamps, and derived features), and tailor the reward and decision timing to CDN dynamics. These adaptations guide the learned memory to capture both temporal and structural regularities under partial observability, yielding a practical cache-replacement method for deployment.
3. Problem Formulation
3.1. System Model
We study a single edge-server cache of capacity $M$, where all content items have unit size so that the cache can hold at most $M$ objects. Time is modeled in discrete steps $t = 1, 2, \ldots, T$, each corresponding to the arrival of a request $x_t$ drawn from a predefined workload (see Section 3.2 for formal definitions and Section 5.1 for preparation details).
Cache hit: If $x_t \in \mathrm{CM}_t$, the request is served from the edge cache, minimizing latency. Here, $\mathrm{CM}_t$ (Contents in Memory at time $t$) denotes the set of objects currently stored in the cache.
Cache miss: If $x_t \notin \mathrm{CM}_t$, the requested object is fetched from the CDN, inserted into the cache (evicting one item if $|\mathrm{CM}_t| = M$), and then served, incurring additional response delay and backhaul traffic.
The cache-replacement policy is invoked only on cache misses, so decisions are made precisely when content that is not currently cached arrives. We adopt a model-free reinforcement learning setting in which the next state under the chosen action is determined by the simulator or trace, without explicit transition probabilities.
The effectiveness of a caching policy $\pi$ is primarily evaluated by the long-run cache-hit rate:
$$h(\pi) \;=\; \frac{1}{T} \sum_{t=1}^{T} \mathbb{1}\!\left[x_t \in \mathrm{CM}_t\right], \qquad (7)$$
where $\mathbb{1}[\cdot]$ is the indicator function and $T$ the total number of requests. This metric directly captures how effectively the policy exploits capacity to reduce latency and bandwidth usage. To complement the hit rate, we also report auxiliary metrics (e.g., average miss latency and total backhaul traffic) in our experiments.
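The metric and the miss-only decision point can be made concrete with a small simulator sketch: the policy callable below is consulted only when a miss occurs on a full cache, and every miss admits the fetched object (a simplification; MTCT may also bypass). Names and the toy random-eviction baseline are illustrative.

```python
import random
from typing import Callable, Iterable, Set

def hit_rate(trace: Iterable[int],
             capacity: int,
             choose_victim: Callable[[Set[int], int], int]) -> float:
    """Replay a request trace against a unit-size cache and return h(pi) as in Equation (7).
    choose_victim(cache, requested) is any replacement policy; it is called only on
    misses with a full cache, mirroring the miss-triggered invocation described above."""
    cache: Set[int] = set()
    hits = total = 0
    for x in trace:
        total += 1
        if x in cache:
            hits += 1                       # cache hit: served from the edge cache
        else:
            if len(cache) >= capacity:      # cache full: evict a victim chosen by the policy
                cache.discard(choose_victim(cache, x))
            cache.add(x)                    # admit the fetched object
        # (recency/popularity metadata updates omitted for brevity)
    return hits / max(total, 1)

# Example: a random-eviction baseline on a toy trace
trace = [random.randint(0, 99) for _ in range(10_000)]
h = hit_rate(trace, capacity=10,
             choose_victim=lambda cache, x: random.choice(tuple(cache)))
```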
3.2. Request Workload Models
We evaluate the learned policies under three representative request workloads. Each workload is executed independently (no mixing). Let
N denote the catalog size (the number of unique content items). Detailed workload parameters and preprocessing steps for MovieLens are provided in
Section 5.1.
MovieLens workload [29]: A real-world request trace derived from MovieLens, replayed as a request sequence to reflect realistic user behavior.
Mandelbrot–Zipf workload: A synthetic workload in which each request index $i \in \{1, \ldots, N\}$ is sampled according to
$$P(i) \;\propto\; \frac{1}{(i + q)^{\alpha}},$$
where $\alpha > 0$ is the skewness exponent and $q \ge 0$ is an offset controlling the head–tail balance.
Pareto workload: A synthetic workload in which the request index $i$ follows
$$P(i) \;\propto\; \frac{\alpha\, x_{\min}^{\alpha}}{i^{\alpha + 1}}, \qquad i \ge x_{\min},$$
where $\alpha$ is the shape parameter and $x_{\min}$ the minimum-index cutoff.
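A minimal way to generate such synthetic traces is to normalize the popularity weights over the catalog and sample i.i.d. request indices, as sketched below; the exponents, offsets, and catalog sizes are placeholders rather than the paper's exact parameters, and the user-group structure of Section 5.1 is omitted.

```python
import numpy as np

def mandelbrot_zipf_probs(n_items: int, alpha: float = 1.0, q: float = 2.0) -> np.ndarray:
    """Normalized Mandelbrot-Zipf weights 1 / (i + q)^alpha over ranks 1..N."""
    ranks = np.arange(1, n_items + 1)
    w = 1.0 / np.power(ranks + q, alpha)
    return w / w.sum()

def pareto_probs(n_items: int, alpha: float = 1.2, x_min: float = 1.0) -> np.ndarray:
    """Pareto-shaped weights over ranks, truncated to the catalog and normalized."""
    ranks = np.arange(1, n_items + 1)
    w = np.where(ranks >= x_min, alpha * x_min**alpha / ranks**(alpha + 1), 0.0)
    return w / w.sum()

def sample_trace(probs: np.ndarray, n_requests: int, seed: int = 0) -> np.ndarray:
    """Draw an i.i.d. request trace of 0-based item indices from a popularity profile."""
    rng = np.random.default_rng(seed)
    return rng.choice(len(probs), size=n_requests, p=probs)

# Example: a 10k-item catalog and 100k requests under each workload
zipf_trace = sample_trace(mandelbrot_zipf_probs(10_000), 100_000)
pareto_trace = sample_trace(pareto_probs(10_000), 100_000, seed=1)
```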
By separating workload specification from the POMDP formulation in
Section 3, we ensure that the state, action, observation, reward definitions, and miss-triggered decision timing remain fixed, while workload variation is controlled entirely through the request-generation process.
3.3. State and Observation
In our POMDP formulation, we distinguish between the latent environment state $s_t$, which the agent cannot fully observe, and the partial observation $o_t$ that the agent actually receives and uses for decision making.
3.3.1. Environment State
We represent the true environment state as
$$s_t = \big(\mathrm{CF}_t,\ \mathrm{CP}_t\big),$$
where
Content Features ($\mathrm{CF}_t$): Metadata for all $N$ catalog entries, including each item's identifier, popularity score, and genre tag. Together, these describe the catalog at time $t$.
Content Popularity ($\mathrm{CP}_t$): A full vector of popularity scores for every item, spanning both the catalog and the cache. We distinguish two components: a catalog-level popularity vector over all $N$ items and a cache-level popularity vector over the items currently in $\mathrm{CM}_t$ (cf. Section 4.4).
Although $s_t$ captures the complete system state, including precise popularity statistics, it remains hidden from the agent.
3.3.2. Observation
At each decision step, the agent observes a compressed feature vector
$$o_t = \big(\mathrm{RF}_t,\ \mathrm{CCP}_t\big),$$
where
Current-Request Features ($\mathrm{RF}_t$): Identifier of the requested item $x_t$, its popularity rank, genre tag, and request timestamp.
Compressed Content Popularity ($\mathrm{CCP}_t$): Six summary statistics of $\mathrm{CP}_t$: the mean, top-10% mean, and bottom-10% mean, each computed over (i) the entire catalog and (ii) the cached subset.
Thus, the agent bases its decisions on the compact observation $o_t$ rather than the full latent state $s_t$. This design reflects partial observability while retaining the most informative signals, enabling efficient and stable learning of cache-replacement policies.
3.4. Action Space
We invoke the policy only on cache misses (i.e., $x_t \notin \mathrm{CM}_t$), never on hits. To keep decision complexity independent of capacity $M$, we use a fixed action set $\mathcal{A}$ (12 actions by default). On each miss, the agent selects one action $a \in \mathcal{A}$: either to bypass the cache, or to admit the requested item and evict the resident ranked $k$-th under a popularity- or recency-based ordering.
Let $\ell(x)$ be the last access time (in steps) of item $x$, and define the recency gap $\rho(x) = t - \ell(x)$ (larger means less recent use). Sorting cached items by increasing $\ell(x)$ is equivalent to ordering them by non-increasing $\rho(x)$, so both orderings identify the same eviction candidates. Ties in either ranking are broken deterministically by (i) smaller item ID, then (ii) earlier insertion time. If a requested rank $k$ exceeds $M$, we clamp it to $k = M$ for determinism.
This rank–evict grid subsumes classical heuristics as special cases: evicting the item with the largest recency gap corresponds to LRU, and evicting the item with the lowest popularity corresponds to LFU. Confining decisions to a small, semantically meaningful set that balances long-term popularity and short-term recency reduces exploration variance and stabilizes Q-learning as $M$ grows. We also evaluate reduced action sets with 3 and 7 actions under identical protocols (see Section 5.4).
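The following sketch shows one way such a rank–evict grid can be realized, with a bypass action plus popularity- and recency-ranked evictions and the deterministic tie-breaking described above. The specific (criterion, rank) pairs are an assumption for illustration; the paper's 12-action grid may be composed differently.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass(frozen=True)
class CachedItem:
    item_id: int
    popularity: float   # long-term popularity estimate
    last_access: int    # step of most recent access (smaller = less recently used)
    inserted_at: int    # step of insertion (second tie-breaker)

# Hypothetical 12-action rank-evict grid: action 0 bypasses the cache; the
# remaining actions evict by (criterion, rank). The composition below is illustrative.
ACTIONS: List[Tuple[str, Optional[int]]] = (
    [("bypass", None)]
    + [("popularity", k) for k in (1, 2, 3, 5, 8, 13)]
    + [("recency", k) for k in (1, 2, 3, 5, 8)]
)

def pick_victim(cache: List[CachedItem], action: int) -> Optional[int]:
    """Return the item_id to evict, or None for a bypass action.

    Rank 1 means the least popular item ('popularity') or the least recently used
    item ('recency'); ties break by smaller item_id, then earlier insertion;
    ranks beyond the current occupancy are clamped, as in Section 3.4."""
    criterion, rank = ACTIONS[action]
    if criterion == "bypass":
        return None
    if criterion == "popularity":
        key = lambda it: (it.popularity, it.item_id, it.inserted_at)
    else:
        key = lambda it: (it.last_access, it.item_id, it.inserted_at)
    ordered = sorted(cache, key=key)        # index 0 corresponds to rank 1
    rank = min(rank, len(ordered))          # clamp if the rank exceeds occupancy
    return ordered[rank - 1].item_id
```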
3.5. Reward Function
Let $\{t_k\}_{k \ge 1}$ denote the strictly increasing sequence of cache-miss time steps, a subsequence of the global timeline $t = 1, \ldots, T$. The policy is updated only at these miss steps, but cache hits between misses still contribute to the reward. At miss step $t_k$, the reward is defined as
$$r_k \;=\; r^{\text{hit}}_k \;+\; r^{\text{miss}} \;+\; r^{\text{delay}}_k \;+\; r^{\text{match}}_k \;+\; r^{\text{pop}}_k, \qquad (10)$$
with the following components:
Cache-hit window reward $r^{\text{hit}}_k$: The number of cache hits occurring between the previous miss $t_{k-1}$ and the current miss $t_k$,
$$r^{\text{hit}}_k \;=\; \sum_{t = t_{k-1}+1}^{t_k - 1} \mathbb{1}\!\left[x_t \in \mathrm{CM}_t\right]. \qquad (11)$$
Cache-miss penalty $r^{\text{miss}}$: A fixed negative penalty applied at each cache miss.
Delayed-hit bonus $r^{\text{delay}}_k$: Within a fixed horizon $H$, cache hits on items admitted during the past $H$ miss events yield additional reward, reinforcing effective recent admissions.
Multi-scale request-matching rewards $r^{\text{match}}_k$: Fractions of hits measured over long, medium, and short recent-request windows.
Global-popularity alignment $r^{\text{pop}}_k$: A reward proportional to the overlap between cached items and globally popular content, promoting adaptation to long-term demand.
This composite reward integrates immediate reuse ($r^{\text{hit}}_k$, $r^{\text{delay}}_k$), short- to long-term request matching ($r^{\text{match}}_k$), and popularity alignment ($r^{\text{pop}}_k$), guiding the agent to trade off short-term responsiveness against sustained cache efficiency under partial observability.
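A possible composition of these terms is sketched below. The weights, the averaging of the multi-scale fractions, and the unit miss penalty are assumptions made for illustration; only the component structure follows the description above.

```python
def miss_reward(hits_in_window: int,
                delayed_hits: int,
                match_fracs: tuple,
                popular_overlap: float,
                miss_penalty: float = -1.0,
                w_delay: float = 0.5,
                w_match: float = 1.0,
                w_pop: float = 1.0) -> float:
    """Composite reward credited at a miss step t_k, in the spirit of Equation (10).

    hits_in_window : hits observed since the previous miss (r_hit).
    delayed_hits   : hits on items admitted within the last H misses (r_delay).
    match_fracs    : hit fractions over (long, medium, short) request windows (r_match).
    popular_overlap: fraction of cached items that are globally popular (r_pop).
    All weights are illustrative; the paper's exact coefficients are not given here.
    """
    r_hit = float(hits_in_window)
    r_delay = w_delay * delayed_hits
    r_match = w_match * sum(match_fracs) / len(match_fracs)
    r_pop = w_pop * popular_overlap
    return r_hit + miss_penalty + r_delay + r_match + r_pop

# Example: 7 hits since the last miss, 3 delayed hits, moderate matching/overlap
r = miss_reward(7, 3, (0.42, 0.55, 0.60), 0.35)
```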
3.6. Learning Objective
Building on the miss-triggered reward scheme (Section 3.5), the objective is to learn a policy $\pi$ that maximizes the expected average reward per miss:
$$J(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\frac{1}{K}\sum_{k=1}^{K} r_k\right],$$
where $\{t_k\}_{k=1}^{K}$ are the miss steps in an episode of length $T$ and $r_k$ is given by (10). The optimal policy is then
$$\pi^{*} \;=\; \arg\max_{\pi} J(\pi).$$
This objective is closely related to the cache-hit rate $h(\pi)$ in (7), since
$$h(\pi) \;=\; \frac{H}{H + K} \;=\; 1 - \frac{K}{T},$$
where $H$ and $K$ denote the total numbers of hits and misses, respectively (so $H + K = T$ and $h(\pi) = H / T$). Thus, maximizing $J(\pi)$, which rewards hits accumulated between misses and penalizes misses, is essentially aligned with maximizing $h(\pi)$. This formulation also satisfies the Bellman optimality condition under miss-triggered updates. In practice, we implement this objective using DDQN and perform value updates only at miss steps $t_k$ (see Section 4).
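In practice, the miss-triggered Double-DQN update computes targets over miss-to-miss transitions as sketched below (PyTorch). The networks are any history-to-Q-value mappings, for example the Transformer sketch in Section 2.3.3; batch shapes and the discount value are illustrative.

```python
import torch

@torch.no_grad()
def ddqn_targets(online_q, target_q, next_hist, rewards, dones, gamma: float = 0.99):
    """Double-DQN targets for a batch of miss-to-miss transitions.

    next_hist : (B, L, d_obs) observation histories at the next miss step.
    rewards   : (B,) composite miss rewards r_k.
    dones     : (B,) 1.0 where the episode ended, else 0.0.
    online_q / target_q : callables mapping (B, L, d_obs) -> (B, |A|),
    e.g. the TransformerQNet sketched earlier (an assumption, not the paper's
    exact architecture).
    """
    next_actions = online_q(next_hist).argmax(dim=1, keepdim=True)        # selection: online net
    next_values = target_q(next_hist).gather(1, next_actions).squeeze(1)  # evaluation: target net
    return rewards + gamma * (1.0 - dones) * next_values

# One learner step would then minimize, e.g., a Huber loss between
# online_q(hist).gather(1, actions) and ddqn_targets(...).
```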
4. Proposed Framework: Miss-Triggered Cache Transformer
This section introduces the MTCT, a framework designed to align learning and decision-making with cache-miss events, thereby capturing partial observability in CDN environments more effectively. The framework consists of three components: (1) a Miss-Triggered Decision Mechanism, which activates policy inference and updates only upon cache misses; (2) Miss-Triggered Delayed-Hit Reward Learning, which accumulates hit-window rewards and incorporates delayed-hit bonuses to reinforce effective caching decisions; and (3) Transformer Decoder Memory Approximation, which embeds fixed-length sequences of observations and actions via a lightweight Transformer decoder to approximate POMDP belief updates.
The rest of this section is structured as follows.
Section 4.1 outlines the overall process flow.
Section 4.2 explains the agent’s internal data pipeline.
Section 4.3 details the Miss-Triggered Delayed-Hit Reward Learning scheme. Finally,
Section 4.4 describes aggregation of partial observations into state representations.
4.1. MTCT Framework Process Flow Overview
Figure 1 illustrates the processing steps of a content request under the miss-triggered framework at the edge cache server. When a request $x_t$ arrives, the system follows a pipeline that confines policy inference and model updates to cache-miss events, minimizing computational overhead. The stages of this workflow are summarized in Figure 1 and detailed below.
Request and cache lookup (Steps 1–2). The system registers the user request and checks whether the requested content is present in $\mathrm{CM}_t$.
Cache hit (Step 3-a). If the item is found, it is served immediately with no policy call. Metadata such as recency or popularity may be updated.
Cache miss and CDN fetch (Steps 3-b, 4). If the item is absent, it is retrieved from the CDN/origin and placed in a temporary buffer.
MTCT agent decision (Step 5). After the fetch, the MTCT agent decides whether to admit the item into the cache. If admitted, the agent also selects a victim for eviction.
Delivery and update (Steps 6-a, 6-b). The content is delivered to the user regardless of admission. If admitted, the item replaces the chosen victim in the cache; otherwise, the temporary copy is discarded.
This miss-triggered pipeline ensures that policy inference and updates occur only when necessary, focusing learning on the most informative events while avoiding unnecessary computation during cache hits.
4.2. Agent Data Processing Flow
Figure 2 summarizes the pipeline in five stages, described below.
1. Request and cache lookup (red). A user request arrives and the cache is queried. No policy call occurs at this stage.
2. Internal state update (gray). Popularity and recency metadata in the internal state are updated.
Hit case: if $x_t \in \mathrm{CM}_t$, metadata and counters for Miss-Triggered Delayed-Hit Reward Learning (MTDHRL; Section 4.3) are updated, and the request is served from the cache. No observation construction, policy inference, or replay write occurs.
Miss case: if $x_t \notin \mathrm{CM}_t$, miss-related counters and statistics are updated, and the pipeline proceeds to Stage 3.
3. History integration (brown). A partial observation $o_t$ is constructed via State Aggregation (Section 4.4) and appended to the last $L-1$ observations to form the history $h_t$, represented as a tensor of shape $(L, d)$, where $L$ is the context length and $d$ the observation dimension.
4. Inference and action/transition (green). The policy consumes the history tensor and produces Q-value estimates $Q(h_t, a; \theta)$ for all $a \in \mathcal{A}$. The cache operation is chosen by $a_t = \arg\max_{a} Q(h_t, a; \theta)$ and executed (bypass or admit + evict; rank-based). The transition from the previous miss is written to the replay buffer.
5. Replay learning and target update (apricot). Mini-batches of size $B$ consisting of $L$-step sequences are sampled from replay to compute the temporal-difference loss against a target network. Input tensors have shape $(B, L, d)$. Policy parameters are updated by gradient descent, and the target network is synchronized periodically. Training is miss-triggered, so hit-only intervals incur no forward or backward passes.
This design ensures that computation and learning are concentrated on cache-miss events, reducing overhead during hits while preserving the information needed for stable training under partial observability.
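The history assembly of Stage 3 can be implemented with a simple bounded buffer, as sketched below; zero-padding before L observations exist is an assumption for illustration, and the shapes mirror the (L, d) and (B, L, d) tensors described above.

```python
from collections import deque
import numpy as np

class HistoryBuffer:
    """Keeps the most recent L observations and exposes them as an (L, d) array.
    Before L observations exist, the front is zero-padded (a simplifying assumption)."""
    def __init__(self, context_len: int, obs_dim: int):
        self.L, self.d = context_len, obs_dim
        self.buf = deque(maxlen=context_len)

    def append(self, obs: np.ndarray) -> None:
        self.buf.append(np.asarray(obs, dtype=np.float32))

    def tensor(self) -> np.ndarray:
        pad = [np.zeros(self.d, dtype=np.float32)] * (self.L - len(self.buf))
        return np.stack(pad + list(self.buf))   # shape (L, d)

# Replay stores miss-to-miss transitions of histories; a sampled mini-batch of
# size B therefore has input shape (B, L, d) for both h_k and h_{k+1}.
hist = HistoryBuffer(context_len=50, obs_dim=16)
hist.append(np.random.rand(16))
assert hist.tensor().shape == (50, 16)
```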
Algorithmic overview. The overall MTCT operation is summarized in Algorithm 1. The pseudocode below follows the same miss-triggered loop as Figure 2: by design, hits ($x_t \in \mathrm{CM}_t$) do not invoke the policy, whereas misses trigger observation/history assembly, action selection, a replay write, and a small number of learner steps. For each request, the cache is checked first; on a hit, we refresh metadata and update delayed-reward counters only. On a miss, we form the observation $o_t$ and update the fixed-length history $h_t$ with the most recent $L$ observations. The agent then chooses either to bypass or to admit with a rank-based eviction (e.g., least popular or least recently used ranks). At the miss, the credited reward aggregates immediate reuse since the last miss, a miss penalty, and delayed-hit bonuses that attribute subsequent hits to recent admissions. We append one transition (from the previous miss) to the replay buffer, perform $G$ gradient steps with temporal-difference targets, and synchronize the target network every $K$ misses. This event-driven schedule concentrates computation on informative events while still propagating information from hit intervals via delayed-reward bookkeeping.
Algorithm 1 MTCT online loop (miss-triggered).
1: Initialize online parameters $\theta$, target parameters $\theta^{-} \leftarrow \theta$, replay buffer $\mathcal{D} \leftarrow \emptyset$, history $h \leftarrow \emptyset$, miss counter $m \leftarrow 0$
2: for each request $x_t$ do
3:   if $x_t \in \mathrm{CM}_t$ then ▹ hit
4:     OnHit($x_t$)
5:     continue
6:   else ▹ miss
7:     $o_t \leftarrow$ BuildObs($x_t$, $\mathrm{CM}_t$)
8:     $h \leftarrow$ AssembleHistory($h$, $o_t$) ▹ keep the most recent $L$ observations
9:     $a_t \leftarrow \arg\max_{a} Q(h, a; \theta)$ ($\epsilon$-greedy)
10:    Execute($a_t$) ▹ bypass or admit + evict (rank-based)
11:    $r_m \leftarrow$ ComputeReward(hit window, miss penalty, delayed-hit bonuses)
12:    ReplayWrite($\mathcal{D}$, transition from the previous miss, if one exists)
13:    for $g = 1$ to $G$ do
14:      sample a mini-batch from $\mathcal{D}$; compute temporal-difference targets with $\theta^{-}$
15:      update $\theta$ by gradient descent on the TD loss
16:    end for
17:    $m \leftarrow m + 1$; if $m \bmod K = 0$ then $\theta^{-} \leftarrow \theta$ end if
18:  end if
19: end for
20: function OnHit($x_t$)
21:   serve $x_t$ from cache; update recency/popularity; update delayed-reward counters
22:   ▹ no policy call, no history build, no replay write, no update
23: end function
4.3. Miss-Triggered Delayed-Hit Reward Learning (MTDHRL)
To exploit the high information content of cache misses while minimizing overhead, all policy inference and learning are triggered exclusively at the miss events $\{t_k\}$. Cache hits between misses still carry useful information, so we aggregate them into two delayed-reward components credited at the next miss.
Hit-window reward $r^{\text{hit}}_k$. This term counts the number of cache hits that occurred between the previous miss $t_{k-1}$ and the current miss $t_k$ (see Equation (11), Section 3.5). For example, if the 5th miss occurs at request $t_5$ and the 6th miss at $t_6$, then all hits observed between $t_5$ and $t_6$ are accumulated as $r^{\text{hit}}_6$. This signal reflects short-term reuse efficiency.
Additional delayed-hit bonus $r^{\text{delay}}_k$. Over a fixed horizon of the past $H$ miss events, any cache hits on items admitted during those misses earn extra reward, credited at the current miss $t_k$. For example, with $H = 2$, if the 6th miss occurs at $t_6$ and items admitted at the 4th and 5th misses subsequently hit before $t_6$, those hits are credited to $r^{\text{delay}}_6$. This signal propagates the medium-term impact of admission decisions.
By consolidating hit-window and delayed-hit bonuses into miss-triggered updates, MTDHRL avoids per-hit policy calls while still incorporating both immediate and delayed information from cache hits. These aggregated rewards are stored with their associated transitions in the replay buffer and used during Q-learning updates. The Transformer decoder processes only the observation history for Q-value estimation. This design reduces computational overhead and enables the agent to capture long-term dependencies between replacement actions and their delayed outcomes, leading to efficient and stable cache-management policies under partial observability.
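The bookkeeping that makes this possible amounts to two counters maintained on the hit path and read out at each miss, as in the sketch below; the horizon and data structures are illustrative rather than the paper's exact implementation.

```python
from collections import deque
from typing import Optional, Tuple

class DelayedHitTracker:
    """MTDHRL-style bookkeeping: count hits since the last miss, plus hits on items
    admitted within the last H misses. Horizon H and structure are illustrative."""
    def __init__(self, horizon_misses: int = 2):
        self.hits_since_last_miss = 0
        self.delayed_hits = 0
        self.recent_admissions = deque(maxlen=horizon_misses)  # one set of item ids per miss

    def on_hit(self, item_id: int) -> None:
        self.hits_since_last_miss += 1
        if any(item_id in admitted for admitted in self.recent_admissions):
            self.delayed_hits += 1          # hit on a recently admitted item

    def on_miss(self, admitted_item: Optional[int]) -> Tuple[int, int]:
        """Credit the accumulated counters at the miss, then reset for the next window."""
        credited = (self.hits_since_last_miss, self.delayed_hits)
        self.hits_since_last_miss = 0
        self.delayed_hits = 0
        self.recent_admissions.append({admitted_item} if admitted_item is not None else set())
        return credited
```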
4.4. State Aggregation
In our POMDP formulation, the full environment state $s_t$ includes both catalog-level and cache-level popularity information. At each time step $t$, we maintain two vectors: a catalog-level popularity vector of length $N$ and a cache-level vector of length $M$ for the items in $\mathrm{CM}_t$. Directly using these high-dimensional vectors is computationally expensive, so we summarize them with six scalar statistics that capture overall, head, and tail popularity for both scopes (Table 1).
Using these six statistics, we define the Compressed Content Popularity vector as
$$\mathrm{CCP}_t = \big(\mu^{\text{cat}}_t,\ \mu^{\text{cat,top10}}_t,\ \mu^{\text{cat,bot10}}_t,\ \mu^{\text{cache}}_t,\ \mu^{\text{cache,top10}}_t,\ \mu^{\text{cache,bot10}}_t\big),$$
which replaces the raw popularity vectors. This representation reduces input dimensionality and accelerates both Transformer inference and training updates. It also smooths out noise in item-level popularity while highlighting head (top-10%) and tail (bottom-10%) trends, guiding the agent toward the most impactful cache-management decisions. Combined with Miss-Triggered Delayed-Hit Reward Learning, this state aggregation contributes to stable and efficient policy learning under partial observability.
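A direct way to compute these six statistics is sketched below; the 10% head/tail cut and the popularity estimator are illustrative assumptions consistent with Table 1's description.

```python
import numpy as np

def compressed_popularity(catalog_pop: np.ndarray, cached_ids: np.ndarray) -> np.ndarray:
    """Six-statistic summary used as the Compressed Content Popularity vector:
    overall mean, top-10% mean, and bottom-10% mean, over the catalog and the
    cached subset. A sketch; the paper's exact popularity estimator may differ."""
    def three_stats(p: np.ndarray) -> list:
        p = np.sort(p)
        k = max(1, int(0.10 * len(p)))       # size of the 10% head/tail
        return [float(p.mean()), float(p[-k:].mean()), float(p[:k].mean())]

    cache_pop = catalog_pop[cached_ids]
    return np.array(three_stats(catalog_pop) + three_stats(cache_pop), dtype=np.float32)

# Example: 10k-item catalog, 500 cached items
pop = np.random.pareto(1.5, size=10_000)
cached = np.random.choice(10_000, size=500, replace=False)
ccp = compressed_popularity(pop, cached)     # shape (6,)
```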
In summary, the MTCT framework integrates Miss-Triggered Delayed-Hit Reward Learning, Transformer-based memory approximation, and state aggregation to address the challenges of partial observability and high-dimensional cache states. The next section evaluates these contributions empirically across diverse workloads and baselines.
5. Experiments
5.1. Experimental Environment and Workloads
All experiments were conducted on a workstation with an NVIDIA GeForce RTX 3060 Ti GPU (NVIDIA Corporation, Santa Clara, CA, USA), 32 GB DDR5-5600 RAM (2 × 16 GB; Samsung Electronics, Suwon-si, Republic of Korea), and an Intel Core i5-13500 CPU (Intel Corporation, Santa Clara, CA, USA). We evaluate the proposed framework on three representative workloads:
MovieLens Latest Small [
29]: We construct the request trace by mapping each rating event to a movie request and sorting events by Unix timestamp in ascending order (ties resolved by the original event index). The rating values themselves are not used. Movie metadata is merged with a three-dimensional PCA projection of one-hot genre vectors, and UTC temporal features (date, weekday, time of day) are added. The resulting trace covers requests from 610 users.
Synthetic Mandelbrot–Zipf: We generated a synthetic catalog and simulated requests from 5000 users based on a Mandelbrot–Zipf popularity model. Users were divided into heavy, light, and minor groups. Request counts were drawn from Poisson distributions and scaled to match the total volume. Heavy and light users sampled items with probabilities proportional to the Zipf exponents, while minor users sampled by normalized rank to emphasize long-tail items. The final traces were temporally shuffled and augmented with PCA-reduced genre features, home region, age group, time class, weekend flag, and request location.
Synthetic Pareto: Using the same procedure, we generated another catalog of the same size and a request trace following a Pareto popularity distribution. All other user-group assignments and preprocessing steps matched those of the Mandelbrot–Zipf workload, thereby isolating the effect of the popularity distribution.
Table 2 lists the core hyperparameters used for MTCT and the DDQN baseline. Based on these configurations, we conducted evaluations covering multiple criteria, including cache-hit rate, wall-clock training time, and sensitivity to context length. The results reported in
Section 5.3 are all derived under these controlled experimental settings.
5.2. Comparison Baselines
We evaluate three baseline families to fairly contextualize MTCT.
- (A)
Modern caching/admission policies (non-RL).
We focus in the main text on two widely adopted, high-performing representatives that cover complementary design axes—adaptive replacement and admission filtering:
ARC [30]: An adaptive replacement policy that balances recency and frequency via two main lists (recent/frequent) and two ghost lists remembering evicted items; it self-tunes the balance based on ghost hits ($O(1)$ amortized).
W-TinyLFU [31]: An admission policy combining a small recency window with an approximate frequency filter (count–min sketch, optionally a doorkeeper); paired with an LRU/ARC back end to protect against one-hit wonders.
Additional modern policies (LRU-K [32], 2Q [33], LeCaR [34], etc.) are evaluated in Appendix B, where we also provide method descriptions and comparisons beyond ARC/W-TinyLFU. Full hyperparameter specifications for all modern baselines, including window ratios, back-end choices, and sketch dimensions for W-TinyLFU, are documented in Appendix B.4.
- (B)
Classical heuristics (for continuity).
FIFO: Evicts the oldest resident (queue-based); oblivious to locality (lower-bound reference).
LRU: Removes the least recently used item; strong with short-term locality, slower under long-range gaps.
LFU: Evicts the lowest access count; strong under stable popularity/heavy tails, brittle under abrupt shifts.
Thompson Sampling: Bandit-style admission/eviction; adapts but requires posterior updates and is prior-sensitive.
Random: Uniform victim; very low overhead, typically weakest, used as a sanity check.
- (C)
RL baselines (with and without memory).
DDQN (MLP): A standard value-based agent that approximates caching as an MDP using aggregated features as a proxy state; three actions for bypass/LRU/LFU; invokes the policy at every cache event (hits and misses); no sequence memory or delayed-hit credit assignment.
DDQN (miss-triggered; control): Same MLP, optimizer, replay, targets, and loss as above, but policy calls, replay writes, and updates occur only at miss steps $t_k$, matching MTCT's trigger schedule and miss-window transitions. This isolates the contribution of Transformer-based memory under an identical decision cadence.
Reporting policy. In the main text, we report ARC and W-TinyLFU alongside classical heuristics and RL baselines; additional modern policies and ablations are documented in
Appendix B, with full hyperparameter tables in
Appendix B.4. Wall-clock analyses by cache size appear in
Appendix A.1.
Evaluation setup. Unless otherwise stated, RL baselines use the same observation features, replay sampling, target-update cadence, and evaluation seeds. DDQN variants adopt a three-way action space, whereas MTCT uses the 12-action grid by default; MTCT and the miss-triggered DDQN control invoke the policy only on misses, while the standard DDQN acts at every cache event. We compute the final cache-hit rate as the mean over the last 10 episodes of each run and report results as mean ± s.d. across the number of runs indicated in each study (for example, 10 runs for the context-length sweep; 5 runs for reward and distribution-shift ablations; and 5 runs for the action-set study), and the corresponding numerical summaries are presented in the Results tables and figures where those artifacts are first introduced. End-of-training averages are summarized in the corresponding tables.
5.3. Evaluation
5.3.1. Cache-Hit Rates Across Cache Sizes
We evaluate MTCT under cache capacities $M \in \{100, 200, 300, 400, 500, 600\}$ with a fixed context length (the plateau value identified in Section 5.4.1). For each $M$ and method, we run 10 independent trials and report the cache-hit rate $h(\pi)$ (Equation (7)) averaged over the last 10 evaluation episodes.
Figure 3, Figure 4 and Figure 5 plot the mean cache-hit rate $h(\pi)$ versus cache size on MovieLens (real trace) and two synthetic workloads (Mandelbrot–Zipf, Pareto). Overall, MTCT attains the best or statistically comparable (within one s.d.) hit rates at most cache sizes. On the synthetic workloads, the modern admission/replacement baselines ARC and W-TinyLFU are highly competitive at small to mid capacities, whereas MTCT catches up or leads as $M$ grows. On MovieLens, MTCT is consistently top across all $M$.
Table 3 provides the detailed per-workload results. On MovieLens, MTCT is consistently the top performer across all $M$, ahead of DDQN, ARC, and W-TinyLFU at every capacity. On Mandelbrot–Zipf, W-TinyLFU slightly exceeds MTCT at small $M$, while the gap narrows and reverses at the largest capacities. On Pareto, W-TinyLFU often remains ahead through mid-to-large capacities, with a virtual tie at the largest $M$; exact values are given in Table 3.
Complementing Table 3, Table 4 reports macro-averaged improvements (percentage points, pp) of MTCT over the baselines across the cache sizes $M \in \{100, \ldots, 600\}$. On Pareto, MTCT averages slightly below W-TinyLFU; however, across all workloads and capacities (18 settings), MTCT attains a positive macro-averaged gain over W-TinyLFU, with larger average gains over FIFO, LRU, LFU, Thompson Sampling, Random, ARC, and DDQN (exact figures in Table 4).
Each plot shows mean curves (no error bands) to maximize readability. As $M$ increases, all methods improve monotonically, but the relative ordering varies by workload. On MovieLens, MTCT is consistently the top curve across $M$. On Mandelbrot–Zipf and Pareto, W-TinyLFU leads at small $M$; on Pareto, it remains ahead at mid capacities. At the largest $M$, the curves converge: Pareto is a near tie, while MTCT holds a slight lead on Mandelbrot–Zipf. For variability and per-run dispersion, please refer to Table 3 (mean ± s.d. across 10 runs).
5.3.2. Wall-Clock Efficiency
While our main focus is on cache-hit rates, we also measured the mean wall-clock time per training episode under the hardware/software setup described in
Section 5.1. Detailed breakdowns by cache size are reported in
Appendix A.1. In brief, MTCT consistently requires less training time per episode than DDQN, confirming that miss-triggered decision making reduces policy-call overhead while maintaining high hit rates (see
Table 5 for workload-level means; detailed per-M results in
Appendix A.1).
5.4. Ablation Studies
To further analyze the design of MTCT, we conduct ablation studies along four axes: (i) history length, (ii) reward shaping, (iii) robustness to distributional shifts, and (iv) sensitivity to the action-set design.
5.4.1. Sensitivity to Context Length
To examine the effect of history length, we fixed the cache size at $M = 600$ and varied the context length over the grid in Table A2 (from 1 to 256). Table 6 and Figure 6 summarize representative results. Performance is poor for very short contexts (lengths 1–5), indicating that the agent lacks sufficient temporal information to capture request dynamics. In particular, a context length of 1 effectively collapses the POMDP into a degenerate MDP approximation, serving as a lower bound on achievable performance. As the context length increases, accuracy improves steadily and stabilizes near 0.47 for lengths between roughly 30 and 70, with small standard deviations. Beyond this plateau, performance degrades slightly (e.g., 0.4502 at length 100 and 0.4371 at length 200), suggesting diminishing returns and over-smoothing when the history window is too long. These trends justify adopting a plateau-range context length as the default in the main experiments, balancing accuracy, stability, and computational cost. See Table A2 in Appendix A for the complete grid.
5.4.2. Reward Ablations
To isolate the effect of individual shaping terms, we conducted ablations on MovieLens at a fixed cache size. Table 7 shows that removing the multi-scale request-matching term reduces accuracy, while dropping global-popularity alignment has a larger impact. Eliminating both delayed-hit rewards ($r^{\text{hit}}_k$ and $r^{\text{delay}}_k$) causes a similar loss, highlighting their joint importance. These results suggest that (i) popularity-aware alignment is critical in long-tail workloads like MovieLens, (ii) multi-scale request matching stabilizes short- vs. medium-term reuse signals, and (iii) delayed-hit propagation is necessary to capture temporal effects of admission decisions beyond immediate hits. Overall, the full composite reward yields the most accurate and stable policy. The advantage of delayed-hit rewards is not limited to stationary settings: on the non-stationary Mandelbrot–Zipf trace, MTCT attains a higher hit rate than the variant without delayed-hit rewards, with markedly lower run-to-run s.d.; see Table 8.
5.4.3. Distribution-Shift Robustness (Mandelbrot–Zipf)
We evaluate robustness under evolving access dynamics using a lightweight Mandelbrot–Zipf workload. It mirrors Section 5.1 but runs for fewer requests than the main workloads. We vary the popularity exponent $\alpha$ across segments: in the stationary setting, $\alpha$ is held fixed over the whole trace; in the piecewise non-stationary schedule, the first portion of the trace uses one exponent and the remaining portion a different one, producing an abrupt popularity shift.
Table 8 confirms that MTCT achieves the highest hit rates under both settings, with the largest margins in the non-stationary case where heuristic and DRL baselines degrade more severely.
Figure 7 and
Figure 8 provide per-1k breakdowns: under stationarity, MTCT converges fastest and maintains the highest performance across the full trace. Under non-stationarity, all methods drop after the distribution shift, and MTCT—while still best on average—also exhibits a gradual decline in later requests. This indicates that, although miss-triggered sequence modeling improves robustness, a uniform replay buffer can overweight outdated experience, slowing adaptation to newly shifted popularity. A natural extension is to replace uniform replay with prioritized or recency-aware sampling to better align learning with current request distributions.
5.4.4. Action-Set Sensitivity
To assess robustness of the action-space design, we vary the granularity of rank-based eviction choices on MovieLens at a fixed cache size, keeping the training and evaluation protocol fixed. As reported in Table 9, finer granularity consistently improves terminal performance and reduces variability: the 12-action configuration attains the highest final hit rate, ahead of both the 3- and 7-action variants, and shows the smallest run-to-run spread. By contrast, the 3-action configuration tends to converge earlier to a lower fixed point with larger variance. This pattern is consistent with reduced expressivity and action aliasing—distinct states that require different popularity–recency trade-offs collapse to the same choice—together with coarser credit assignment that limits the diversity of TD targets and weakens the effect of exploration. The 7-action setting offers a middle ground, whereas 12 actions recover the full benefit of rank-based control and stabilize learning. Accordingly, the main results adopt the 12-action grid as the default.
6. Discussion
6.1. Overall Performance and Workload Dependence
Across the three workloads, MTCT is competitive with state-of-the-art heuristics and RL baselines and attains the best or statistically comparable hit rates at most cache sizes. On the real trace (MovieLens), MTCT is consistently the top method across all $M$, highlighting the benefit of miss-triggered sequence modeling and delayed-hit credit assignment under non-stationary, long-tail demand. On the synthetic workloads (Mandelbrot–Zipf, Pareto), modern admission/replacement policies—particularly W-TinyLFU and ARC—are highly competitive at small to mid capacities; gaps narrow or reverse as $M$ increases, with a slight MTCT lead at the largest capacity on Mandelbrot–Zipf and a near-tie on Pareto. We note one exception visible in Appendix B: on MovieLens at $M = 600$, LRU–K slightly exceeds MTCT (0.4765 vs. 0.4703; Table A3). Nonetheless, aggregated over cache sizes and workloads, MTCT remains the most consistent top performer.
6.2. Wall-Clock Efficiency
In addition to accuracy, MTCT reduces episode time relative to a standard DDQN, consistent with the fact that policy-call volume scales with the miss rate. Concentrating computation on informative miss events lowers overhead without sacrificing final quality; detailed breakdowns by cache size confirm this trend (Appendix A.1).
6.3. Ablations: Context Length and Reward Shaping
On MovieLens, varying the context length reveals four regimes. Very short histories (lengths 1–5) underperform and are unstable. Short-to-intermediate windows (roughly 8–20) improve steadily but still show larger variance. A plateau appears for lengths of around 30–70 with minimal variance. Excessively long windows (100 and beyond) yield a mild decline, consistent with over-smoothing. These trends justify adopting a plateau-range context length as a practical default. Reward ablations further show that removing multi-scale request matching or global popularity alignment reduces the hit rate, with the larger drop from removing popularity alignment, and turning off delayed-hit components yields a similar drop, confirming their complementary roles.
6.4. Robustness Under Distribution Shift
On a lightweight Mandelbrot–Zipf variant with a piecewise shift in the popularity exponent, MTCT maintains the strongest end-of-training averages among compared methods in both stationary and non-stationary settings. Learning curves sampled every 1k requests show that all methods suffer a drop at the shift point; MTCT still leads on average but also exhibits a gradual late-stage decline. This suggests that a uniform replay buffer can overweight stale experience; prioritized or recency-aware sampling is a natural extension to accelerate adaptation without sacrificing stability.
6.5. Action-Space Design
A controlled study on MovieLens at a fixed cache size shows that a finer, semantically structured action grid improves terminal accuracy and stability. The 12-action design achieves the highest final hit rate and the smallest run-to-run variance, while the 3-action variant converges faster but to a lower fixed point with larger variance—consistent with reduced expressivity (action aliasing) and coarser credit assignment. We therefore adopt 12 actions as the default, while noting that lower-resolution settings may remain attractive under strict latency/compute budgets.
6.6. Limitations and Future Work
Our study has four primary limitations. (1) Byte-awareness and object-size heterogeneity. We assume unit-size objects and optimize request hit rate; this simplifies analysis but departs from real systems where object sizes vary, byte-hit rate matters, and TTL/invalidations are present. Evaluating MTCT under byte-aware objectives, realistic size distributions, and TTLs—including multi-tier or distributed cache hierarchies—is an important next step. It is also necessary to validate performance on larger-scale industrial datasets to confirm practical applicability. (2) Replay and adaptation under drift. While miss-triggered updates improve efficiency, our uniform replay can overweight stale experience after distributional shifts. Prioritized or recency-aware sampling, as well as adaptive buffers, are natural extensions to accelerate adaptation without sacrificing stability. (3) Scope of RL baselines. We compared against strong modern heuristics (ARC, W-TinyLFU) and MLP-based DDQN (including a miss-triggered control), but we did not undertake a head-to-head comparison among memory-based RL architectures (e.g., DRQN/LSTM-Q, DTQN/GTrXL). This omission is by design: our focus is to establish the benefit of a POMDP formulation with miss-triggered, Transformer-decoder Q-learning over MDP approximations for caching. A comprehensive architecture study across memory-based RL methods is left to future work. (4) Practical deployment considerations. Beyond single-node experiments, real-world deployment requires attention to scalability in distributed settings and cross-cache communication overheads. We emphasize keeping policy invocation miss-triggered to avoid adding latency on the hit path, and leave a systematic, at-scale evaluation of these systems-level choices to future work.
7. Conclusions
We presented MTCT, a cache-replacement framework that marries a POMDP formulation with Transformer-based memory, miss-triggered decision scheduling, compact popularity features, and delayed-hit credit assignment. By invoking the policy only on informative miss events and propagating hit information through delayed rewards, MTCT aligns computation with signal and stabilizes training under partial observability.
Across MovieLens (real trace) and the Mandelbrot–Zipf and Pareto workloads, MTCT achieves the best or statistically comparable hit rates at most cache sizes while reducing per-episode wall-clock time relative to DDQN. On the synthetic workloads, strong modern heuristics (ARC, W-TinyLFU) are highly competitive at small to mid capacities, with gaps narrowing or reversing as M grows; on MovieLens, MTCT remains consistently top across all M, with rare exceptions noted in Appendix B. Ablations corroborate these findings: four context-length regimes support an intermediate context length as a practical default, finer 12-action grids improve terminal accuracy and stability over coarser sets, and MTCT retains an edge under distributional shift, albeit with room to accelerate post-shift adaptation.
Future work proceeds along four axes. First, we move beyond unit-size objects to byte-aware objectives, realistic TTL/invalidations, and multi-tier deployments. Second, we replace uniform replay with prioritized or recency-aware sampling and explore adaptive buffers to improve responsiveness under drift. Third, we broaden comparisons to alternative memory-based RL architectures (e.g., DRQN/LSTM-Q, DTQN/GTrXL) to more fully map the design space. Finally, we will evaluate practical deployment considerations—scalability in distributed settings and cross-cache communication overheads—while keeping policy invocation miss-triggered to preserve hit-path latency. Together, these directions will test MTCT's generality and scalability in realistic CDN settings while deepening our understanding of memory and decision scheduling in learning-based caching.
Author Contributions
Conceptualization, H.K.; Data curation, H.K.; Formal analysis, H.K.; Investigation, H.K.; Methodology, H.K.; Project administration, E.-N.H.; Software, H.K.; Supervision, E.-N.H.; Validation, H.K.; Writing—original draft, H.K. and T.-J.S.; Writing—review and editing, H.K., T.-J.S. and E.-N.H. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (IITP-2025-RS-2023-00258649, Information Technology Research Center Program) and (RS-2022-00155911, Artificial Intelligence Convergence Innovation Human Resources Development (Kyung Hee University)), each contributing 50% of the research funding.
Data Availability Statement
Conflicts of Interest
The authors declare no conflicts of interest.
Appendix A. Additional Experiments
This appendix reports supplementary experiments that support the main findings without being central to them. We focus on two aspects: (i) per-episode training time across cache sizes $M$, and (ii) sensitivity of the cache-hit rate to the context length at a fixed capacity of $M = 600$.
Appendix A.1. Per-Cache-Size Wall-Clock Analysis
Table A1 summarizes the mean wall-clock time per training episode for MTCT and DDQN across MovieLens, Mandelbrot–Zipf, and Pareto. Two patterns are consistent.
First, MTCT is generally faster or comparable to DDQN, with the gap widening at larger M as miss rates decline. This aligns with the miss-triggered design: the expected policy-call rate scales with , so larger caches produce fewer decisions and lower overhead.
Second, MTCT tends to show smaller or comparable standard deviations across runs, indicating more stable episode times. There are isolated exceptions (e.g., MovieLens at $M = 300$) where MTCT is slower, but the overall trend still favors MTCT.
Table A1.
Mean wall-clock time per training episode (seconds) by cache size M for MTCT and DDQN across all workloads. Values are mean ± s.d. over 10 runs.
Workload | M | DDQN (s) | MTCT (s) |
---|---|---|---|
MovieLens | 100 | 131.01 ± 0.97 | 95.20 ± 3.85 |
200 | 142.60 ± 13.44 | 103.85 ± 5.43 |
300 | 154.19 ± 3.93 | 165.32 ± 10.88 |
400 | 166.89 ± 3.27 | 106.14 ± 27.75 |
500 | 164.11 ± 5.36 | 143.27 ± 13.50 |
600 | 149.18 ± 3.75 | 136.42 ± 10.16 |
Mandelbrot–Zipf | 100 | 411.19 ± 47.72 | 330.41 ± 104.17 |
200 | 342.58 ± 123.45 | 244.92 ± 52.65 |
300 | 389.53 ± 121.79 | 190.27 ± 23.32 |
400 | 372.06 ± 129.83 | 182.80 ± 24.61 |
500 | 391.37 ± 130.34 | 192.26 ± 54.97 |
600 | 347.02 ± 128.46 | 155.73 ± 14.93 |
Pareto | 100 | 338.05 ± 9.94 | 209.37 ± 11.90 |
200 | 344.25 ± 21.70 | 191.40 ± 14.26 |
300 | 364.28 ± 13.48 | 182.87 ± 22.81 |
400 | 356.00 ± 27.09 | 171.65 ± 15.71 |
500 | 362.94 ± 16.53 | 180.52 ± 10.58 |
600 | 383.97 ± 31.74 | 179.06 ± 16.44 |
Appendix A.2. Context-Length Sensitivity
We report the full sweep over context lengths from 1 to 256 at $M = 600$. Four regimes are observed. Very short contexts (lengths 1–5) produce low and unstable hit rates, reflecting the cost of ignoring temporal dependencies. Short-to-intermediate contexts (8–20) improve accuracy but still show higher variance. A plateau appears for lengths of roughly 30–70, with means near 0.470 and minimal variance. Very long contexts (100 and above) yield a slight decline, consistent with diminishing returns and over-smoothing. These trends justify our default context length: it sits on the plateau, shows low variability, and is within a small margin of the best observed mean in Table A2. For a concise, main-text summary, see Section 5.4.1.
Table A2.
Cache-hit rate under varying context lengths at . Values are mean ± s.d. across the final 10 evaluation episodes per run.
Context Length | Mean ± s.d. | Context Length | Mean ± s.d. |
---|---|---|---|
1 | 0.3979 ± 0.0014 | 32 | 0.4699 ± 0.0030 |
2 | 0.4025 ± 0.0154 | 50 | 0.4703 ± 0.0004 |
4 | 0.3984 ± 0.0369 | 64 | 0.4700 ± 0.0025 |
5 | 0.4003 ± 0.0344 | 70 | 0.4695 ± 0.0027 |
8 | 0.4182 ± 0.0459 | 100 | 0.4502 ± 0.0379 |
10 | 0.4257 ± 0.0457 | 128 | 0.4492 ± 0.0401 |
16 | 0.4257 ± 0.0520 | 150 | 0.4496 ± 0.0394 |
20 | 0.4340 ± 0.0702 | 200 | 0.4371 ± 0.0456 |
30 | 0.4703 ± 0.0023 | 250 | 0.4443 ± 0.0452 |
| | 256 | 0.4393 ± 0.0460 |
Appendix B. Additional Modern/Heuristic Baselines: Results and Hyperparameters
This section complements the main text by reporting results and setup details for additional modern/heuristic baselines not discussed in depth. Unless otherwise noted, we reuse the evaluation protocol from the paper: the same request traces, the cache-size grid $M \in \{100, 200, 300, 400, 500, 600\}$, 10 independent runs per method, and per-run means averaged over the last 10 evaluation episodes. We denote the cache-hit rate by $h(\pi)$.
Appendix B.1. Protocol Recap
All baselines run on the same simulator with identical metadata updates, and evictions occur only when the cache is full. Heuristic policies have $O(1)$ amortized update cost (e.g., ghost-list maintenance in ARC, count–min sketch updates in W-TinyLFU), though constant factors vary by implementation. Unless stated otherwise, W-TinyLFU uses an LRU back end (window + filter + LRU). Key hyperparameters are summarized below.
Appendix B.2. Additional Heuristic Results
Table A3 reports $h(\pi)$ for LRU–K, 2Q, and LeCaR across workloads and cache sizes. For quick reference, the rightmost column reprints the strongest modern heuristic from the main text (ARC or W-TinyLFU) at the same setting; see Table 3 for the full comparison.
Appendix B.3. Interpreting Table A3
On the synthetic workloads, LRU–K and 2Q improve upon classical LRU/LFU but generally trail W-TinyLFU, consistent with the effectiveness of admission filtering with approximate frequency estimates under heavy-tailed, quasi-stationary demand. On MovieLens, ARC tends to outperform these additional heuristics, reflecting the value of ghost-list feedback under non-stationarity; in the main comparison, MTCT remains above ARC across all cache sizes. This trend is consistent across the full cache-size grid; see
Table 3 for a complete comparison.
Table A3. Appendix—Additional heuristics (overall cache-hit rates). We report LRU–K, 2Q, and LeCaR across workloads and cache sizes M. The rightmost column reprints the best modern heuristic from the main text (ARC or W-TinyLFU) for quick reference; see Table 3 and Table A4. Bold indicates the best mean within each workload–capacity row.
Workload | M | LRU–K | 2Q | LeCaR | Best Modern (Main) |
---|---|---|---|---|---|
Mandelbrot–Zipf | 100 | 0.5027 | 0.5013 | 0.4571 | W-TinyLFU 0.5413 |
200 | 0.5558 | 0.5539 | 0.5183 | W-TinyLFU 0.5961 |
300 | 0.5898 | 0.5868 | 0.5573 | W-TinyLFU 0.6286 |
400 | 0.6172 | 0.6122 | 0.5887 | W-TinyLFU 0.6523 |
500 | 0.6404 | 0.6332 | 0.6149 | W-TinyLFU 0.6716 |
600 | 0.6608 | 0.6520 | 0.6392 | W-TinyLFU 0.6882 |
Pareto | 100 | 0.5217 | 0.5207 | 0.4787 | W-TinyLFU 0.5580 |
200 | 0.5704 | 0.5695 | 0.5359 | W-TinyLFU 0.6094 |
300 | 0.6028 | 0.6004 | 0.5730 | W-TinyLFU 0.6399 |
400 | 0.6290 | 0.6241 | 0.6024 | W-TinyLFU 0.6621 |
500 | 0.6510 | 0.6447 | 0.6282 | W-TinyLFU 0.6803 |
600 | 0.6705 | 0.6626 | 0.6506 | W-TinyLFU 0.6958 |
MovieLens | 100 | 0.1307 | 0.1328 | 0.0761 | ARC 0.1423 |
200 | 0.2297 | 0.2218 | 0.1540 | ARC 0.2344 |
300 | 0.3086 | 0.2894 | 0.2211 | ARC 0.3016 |
400 | 0.3740 | 0.3424 | 0.2837 | ARC 0.3573 |
500 | 0.4279 | 0.3889 | 0.3410 | ARC 0.4076 |
600 | 0.4765 | 0.4311 | 0.3934 | ARC 0.4513 |
Appendix B.4. Hyperparameters for Modern Baselines
Table A4 lists the key hyperparameters used for modern and related heuristics. Conventional defaults are used unless noted. For W-TinyLFU, the window ratio, protected fraction, and count–min sketch dimensions follow common practice; smaller sketches raise false-positive risk for novel/sparse items, whereas very large sketches trade memory for diminishing gains. ARC exposes no explicit knob: its balance parameter
p adapts online from ghost-list hits.
Table A4.
Hyperparameters used for modern caching/admission baselines (defaults unless noted).
| Method | Key Hyperparameters | Values (Ours) |
|---|---|---|
| W-TinyLFU | window_ratio, protected_ratio, CMS (width, depth) | 0.05, 0.80, (4096, 4) |
| ARC | (explicit knobs) | none (adaptive p online) |
| LRU–K | K | |
| 2Q | | 0.25, 1.0 |
| LeCaR | | 0.1, 0.05 |
Appendix B.5. Notes on Sensitivity
W-TinyLFU is sensitive to the window ratio and sketch size: an overly small window or sketch can misclassify novel items, while an overly large one increases memory and may slow reaction; the standard count–min error bounds below make this tradeoff concrete. ARC adapts quickly when the recency/frequency balance drifts, whereas under strongly frequency-dominated regimes, LFU-like behavior can remain competitive.
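As a point of reference (not taken from the paper), the classical count–min sketch guarantees of Cormode and Muthukrishnan bound the overcount by a fraction ε = e/width of the total update count, with failure probability δ = exp(−depth). Plugging in the (4096, 4) configuration from Table A4 gives a rough sense of the memory/accuracy tradeoff:

```python
import math

# Count-min sketch guarantees: an item's estimate exceeds its true count by at most
# eps * (total updates) with probability at least 1 - delta.
width, depth = 4096, 4          # the (width, depth) listed in Table A4
eps = math.e / width            # ~6.6e-4 relative overcount bound
delta = math.exp(-depth)        # ~1.8% failure probability
print(f"eps = {eps:.2e}, delta = {delta:.3f}")
```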
Figure 1. No policy calls on hits. A cache lookup serves hits immediately without invoking the policy. On a miss, the content is fetched from the CDN, the MTCT agent decides admission or eviction, and the cache is updated. Policy inference and learning are triggered exclusively at miss steps; hit-only intervals update metadata/counters without observation construction, replay writes, or forward/backward passes.
Figure 2. Miss-triggered processing pipeline of the MTCT agent in five stages. Policy inference and updates occur only on cache misses, while hit intervals update metadata and delayed-reward counters without invoking the policy.
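The control flow summarized in Figures 1 and 2 can be written as a short request loop. This is a schematic reconstruction from the captions, not the authors' code; names such as `fetch_from_cdn`, `build_observation`, `select_action`, and `finalize_previous_transition` are placeholders.

```python
def process_request(cache, agent, replay, item_id):
    """One request step: hits bypass the policy entirely; misses trigger it."""
    cache.update_metadata(item_id)          # popularity/recency counters, O(1)

    if item_id in cache:                    # cache hit
        cache.record_hit(item_id)           # credited later as a delayed-hit reward
        return cache.get(item_id)           # no observation, no replay write, no forward pass

    # Cache miss: fetch from the origin/CDN, then invoke the policy once.
    content = fetch_from_cdn(item_id)
    obs = agent.build_observation(cache, item_id)     # compressed popularity/recency stats
    action = agent.select_action(obs)                 # Q-values over the rank-evict grid
    cache.apply_action(action, item_id)               # admit/evict according to the action

    # Learning is also miss-triggered: finalize the previous transition,
    # folding in hits accumulated since the last miss (delayed-hit reward).
    replay.push(agent.finalize_previous_transition(obs))
    agent.maybe_train(replay)
    return content
```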
Figure 3. Cache-hit rate versus cache capacity for the MovieLens workload (mean over 10 runs).
Figure 4. Cache-hit rate versus cache capacity for the Mandelbrot–Zipf workload (mean over 10 runs).
Figure 5. Cache-hit rate versus cache capacity for the Pareto workload (mean over 10 runs).
Figure 6. Cache-hit rate vs. context length on MovieLens at M = 600. Points denote per-setting means over 10 runs (each run averaged over its final 10 episodes).
Figure 7. Learning curves of the overall cache-hit rate on the Mandelbrot–Zipf workload, stationary setting. Curves are sampled every 1k requests; end-of-training means (averaged over 5 runs) appear in Table 8.
Figure 8. Learning curves of the overall cache-hit rate on the Mandelbrot–Zipf workload, non-stationary setting. Curves are sampled every 1k requests; end-of-training means (averaged over 5 runs) are reported in Table 8.
Table 1. Summary statistics for the Compressed Content Popularity vector.
| Symbol | Description |
|---|---|
| | Mean popularity of all N catalog items at time t. |
| | Mean popularity of the M items currently stored in the cache. |
| | Mean popularity of the top-ranked fraction of catalog items. |
| | Mean popularity of the top-ranked fraction of cached items. |
| | Mean popularity of the bottom-ranked fraction of catalog items. |
| | Mean popularity of the bottom-ranked fraction of cached items. |
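A compressed popularity vector of this kind can be computed in a few lines. The sketch below is illustrative; the fraction parameters (`top_frac`, `bottom_frac`) are hypothetical, since the exact percentages are not restated in this excerpt.

```python
import numpy as np

def compressed_popularity(pop, cached_ids, top_frac=0.1, bottom_frac=0.1):
    """Summarize per-item popularity into a small feature vector.

    pop:        array of popularity scores for all N catalog items at time t
    cached_ids: indices of the M items currently in the cache
    Returns means over: catalog, cache, and top/bottom slices of each.
    """
    pop = np.asarray(pop, dtype=float)
    cached = pop[np.asarray(cached_ids, dtype=int)]

    def top_mean(values, frac):
        k = max(1, int(len(values) * frac))
        return np.sort(values)[-k:].mean()

    def bottom_mean(values, frac):
        k = max(1, int(len(values) * frac))
        return np.sort(values)[:k].mean()

    return np.array([
        pop.mean(), cached.mean(),
        top_mean(pop, top_frac), top_mean(cached, top_frac),
        bottom_mean(pop, bottom_frac), bottom_mean(cached, bottom_frac),
    ])
```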
Table 2. Core hyperparameters for MTCT and the DDQN baseline. The symbol N/A denotes not applicable.
| Hyperparameter | MTCT | DDQN (MLP) |
|---|---|---|
| Optimizer | Adam | Adam |
| Learning rate | | |
| Batch size | 32 | 32 |
| Target network update interval (steps) | 1000 | 1000 |
| Context/history length | 50 | N/A |
| Model dimension | 64 | N/A |
| Number of layers/blocks | 2 decoder blocks | 3 hidden layers |
| Number of attention heads | 8 | N/A |
| Activation (core) | ReLU (FFN/head) | ReLU |
| Input embedding dimension | 8 | 128 (outer embed) |
| Hidden layer sizes | N/A | |
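To show how the Table 2 settings fit together, here is a minimal decoder-only (causal self-attention) Q-network in PyTorch with a model dimension of 64, 2 blocks, 8 heads, an input embedding of size 8, a context of 50 observations, and 12 actions. It is a sketch under assumptions: the learning rate, feed-forward width, and the way the 8-dimensional embedding is lifted to the 64-dimensional model are not specified in the excerpt, so those choices (e.g., `dim_feedforward=256`, a linear projection) are ours, and the 6-feature toy observation merely echoes Table 1.

```python
import torch
import torch.nn as nn

class DecoderQNet(nn.Module):
    """Causal self-attention Q-network over a history of observations (sketch)."""
    def __init__(self, obs_dim, n_actions=12, context_len=50,
                 embed_dim=8, d_model=64, n_heads=8, n_blocks=2):
        super().__init__()
        self.embed = nn.Linear(obs_dim, embed_dim)        # small input embedding
        self.project = nn.Linear(embed_dim, d_model)      # lift to the model dimension
        self.pos = nn.Parameter(torch.zeros(1, context_len, d_model))
        block = nn.TransformerEncoderLayer(               # used as a decoder-only block
            d_model=d_model, nhead=n_heads, dim_feedforward=256,
            activation="relu", batch_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=n_blocks)
        self.q_head = nn.Linear(d_model, n_actions)

    def forward(self, obs_seq):
        # obs_seq: (batch, context_len, obs_dim)
        x = self.project(torch.relu(self.embed(obs_seq))) + self.pos
        seq = obs_seq.size(1)
        mask = torch.triu(torch.full((seq, seq), float("-inf")), diagonal=1)
        h = self.blocks(x, mask=mask)                     # causal self-attention
        return self.q_head(h[:, -1])                      # Q-values at the latest step

# Example: Q-values for a batch of 4 histories with 6 observation features each.
q = DecoderQNet(obs_dim=6)(torch.randn(4, 50, 6))         # -> shape (4, 12)
```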
Table 3. Cache hit rates by workload and cache size M. For DRL methods (DDQN, MTCT), values are the mean over the last 10 episodes of each run, reported as mean ± s.d. across 10 runs. The best mean value for each workload and cache size is shown in bold.
| Workload | M | FIFO | LRU | LFU | Thompson | Random | ARC | W-TinyLFU | DDQN | MTCT (Ours) |
|---|---|---|---|---|---|---|---|---|---|---|
| Mandelbrot–Zipf | 100 | 0.3895 | 0.4332 | 0.4668 | 0.1136 | 0.3900 | 0.5156 | **0.5413** | 0.5102 ± 0.0002 | 0.5376 ± 0.0011 |
| Mandelbrot–Zipf | 200 | 0.4627 | 0.5029 | 0.5194 | 0.2100 | 0.4622 | 0.5656 | **0.5961** | 0.5666 ± 0.0010 | 0.5960 ± 0.0005 |
| Mandelbrot–Zipf | 300 | 0.5078 | 0.5453 | 0.5494 | 0.3684 | 0.5083 | 0.5956 | 0.6286 | 0.6218 ± 0.0134 | **0.6302 ± 0.0001** |
| Mandelbrot–Zipf | 400 | 0.5437 | 0.5794 | 0.5854 | 0.5799 | 0.5437 | 0.6181 | 0.6523 | 0.6280 ± 0.0034 | **0.6550 ± 0.0000** |
| Mandelbrot–Zipf | 500 | 0.5738 | 0.6079 | 0.6124 | 0.6076 | 0.5733 | 0.6371 | 0.6716 | 0.6591 ± 0.0117 | **0.6749 ± 0.0001** |
| Mandelbrot–Zipf | 600 | 0.6000 | 0.6331 | 0.6421 | 0.6335 | 0.6001 | 0.6541 | 0.6882 | 0.6847 ± 0.0070 | **0.6917 ± 0.0001** |
| Pareto | 100 | 0.4175 | 0.4569 | 0.4887 | 0.0900 | 0.4176 | 0.5341 | **0.5580** | 0.5155 ± 0.0007 | 0.5232 ± 0.0011 |
| Pareto | 200 | 0.4846 | 0.5210 | 0.5317 | 0.1832 | 0.4844 | 0.5802 | **0.6094** | 0.5713 ± 0.0004 | 0.6088 ± 0.0017 |
| Pareto | 300 | 0.5271 | 0.5616 | 0.5668 | 0.5614 | 0.5273 | 0.6082 | 0.6399 | 0.6081 ± 0.0018 | **0.6414 ± 0.0016** |
| Pareto | 400 | 0.5607 | 0.5935 | 0.6001 | 0.5925 | 0.5607 | 0.6293 | **0.6621** | 0.6361 ± 0.0018 | 0.6512 ± 0.0007 |
| Pareto | 500 | 0.5890 | 0.6205 | 0.6213 | 0.6152 | 0.5889 | 0.6476 | **0.6803** | 0.6711 ± 0.0012 | 0.6791 ± 0.0010 |
| Pareto | 600 | 0.6142 | 0.6443 | 0.6549 | 0.6379 | 0.6138 | 0.6639 | 0.6958 | 0.6822 ± 0.0008 | **0.6961 ± 0.0007** |
| MovieLens | 100 | 0.0665 | 0.0692 | 0.0727 | 0.0613 | 0.0726 | 0.1423 | 0.1260 | 0.1033 ± 0.0051 | **0.1504 ± 0.0036** |
| MovieLens | 200 | 0.1388 | 0.1475 | 0.1369 | 0.1118 | 0.1444 | 0.2344 | 0.1970 | 0.1928 ± 0.0052 | **0.2482 ± 0.0047** |
| MovieLens | 300 | 0.1976 | 0.2104 | 0.1750 | 0.1508 | 0.2044 | 0.3016 | 0.2579 | 0.2713 ± 0.0078 | **0.3229 ± 0.0065** |
| MovieLens | 400 | 0.2518 | 0.2681 | 0.2089 | 0.1742 | 0.2575 | 0.3573 | 0.3040 | 0.3370 ± 0.0076 | **0.3797 ± 0.0070** |
| MovieLens | 500 | 0.2987 | 0.3226 | 0.2537 | 0.1904 | 0.3045 | 0.4076 | 0.3476 | 0.3933 ± 0.0089 | **0.4280 ± 0.0074** |
| MovieLens | 600 | 0.3441 | 0.3731 | 0.2914 | 0.2350 | 0.3467 | 0.4513 | 0.3905 | 0.4436 ± 0.0097 | **0.4703 ± 0.0086** |
Table 4. Macro-averaged improvements of MTCT over baselines (percentage points, pp). Per-workload rows average over the six cache sizes of Table 3; Macro Avg averages over all workloads and cache sizes (18 settings total).
| | FIFO | LRU | LFU | Thompson | Random | ARC | W-TinyLFU | DDQN |
|---|---|---|---|---|---|---|---|---|
| Mandelbrot–Zipf | | | | | | | | |
| Pareto | | | | | | | | |
| MovieLens | | | | | | | | |
| Macro Avg | | | | | | | | |
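The numerical entries of Table 4 did not survive extraction, but they are fully determined by Table 3. The snippet below shows the computation for the MovieLens rows against ARC as an example; the values typed in are copied from Table 3 above.

```python
# Macro-averaged improvement of MTCT over a baseline, in percentage points (pp):
# average (MTCT - baseline) * 100 over the six cache sizes of a workload.
movielens = {  # M: (ARC, MTCT) overall hit rates from Table 3
    100: (0.1423, 0.1504), 200: (0.2344, 0.2482), 300: (0.3016, 0.3229),
    400: (0.3573, 0.3797), 500: (0.4076, 0.4280), 600: (0.4513, 0.4703),
}
gain_pp = [(mtct - arc) * 100 for arc, mtct in movielens.values()]
print(sum(gain_pp) / len(gain_pp))   # per-workload macro average vs. ARC (~1.75 pp)
```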
Table 5. Mean wall-clock time (seconds) per training episode across workloads. Detailed results by cache size are reported in Appendix A.1.
| Workload | DDQN (Baseline) | MTCT (Ours) |
|---|---|---|
| Mandelbrot–Zipf | 376.47 ± 115.19 | 216.14 ± 76.09 |
| Pareto | 340.82 ± 102.33 | 201.67 ± 69.24 |
| MovieLens | 298.21 ± 88.11 | 187.92 ± 64.75 |
Table 6. Sensitivity to context length on MovieLens at cache size M = 600. Values are mean ± s.d. of the overall cache-hit rate across 10 runs (each run averaged over its final 10 episodes).
| Context Length | Hit Rate (Mean ± s.d.) |
|---|---|
| 1 | 0.3979 ± 0.0014 |
| 5 | 0.4003 ± 0.0344 |
| 10 | 0.4257 ± 0.0457 |
| 32 | 0.4699 ± 0.0030 |
| 50 | 0.4703 ± 0.0004 |
| 64 | 0.4700 ± 0.0025 |
| 128 | 0.4492 ± 0.0401 |
| 256 | 0.4393 ± 0.0460 |
Table 7. Reward ablation on MovieLens (M = 600). We report the mean cache-hit rate over the final 10 episodes, averaged across 5 runs per variant. "Full" denotes the complete MTCT reward; variants drop one or more delayed/auxiliary terms. Δ is the difference vs. Full in pp.
| Variant | Hit Rate (Mean ± s.d.) | Δ vs. Full (pp) |
|---|---|---|
| Full (MTCT reward) | 0.4703 ± 0.0086 | 0.0 |
| w/o Multi-scale request matching | 0.4519 ± 0.0408 | −1.8 |
| w/o Global popularity alignment | 0.4334 ± 0.0497 | −3.7 |
| w/o All delayed-hit rewards | 0.4341 ± 0.0503 | −3.6 |
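The "delayed-hit" terms ablated in Table 7 credit hits that occur between two consecutive misses back to the decision made at the earlier miss. The sketch below illustrates this bookkeeping for a miss-triggered agent; it is our reconstruction of the idea, and the reward scaling (`hit_bonus`) as well as the multi-scale and popularity-alignment terms are not shown.

```python
class DelayedHitReward:
    """Accumulate hits between misses and credit them to the previous miss decision."""
    def __init__(self, hit_bonus=1.0):
        self.hit_bonus = hit_bonus
        self.hits_since_last_miss = 0
        self.pending = None   # (obs, action, base_reward) recorded at the previous miss

    def on_hit(self):
        self.hits_since_last_miss += 1   # no policy call, counter only

    def on_miss(self, obs, action, base_reward, replay):
        # Finalize the previous miss-time transition, folding in delayed hits.
        if self.pending is not None:
            prev_obs, prev_action, prev_base = self.pending
            reward = prev_base + self.hit_bonus * self.hits_since_last_miss
            replay.append((prev_obs, prev_action, reward, obs))
        self.pending = (obs, action, base_reward)
        self.hits_since_last_miss = 0
```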
Table 8. Stationary vs. non-stationary Mandelbrot–Zipf: overall hit rate for heuristics and DRL methods. Each entry averages the final 10 episodes of each run and is reported as mean ± s.d. across 5 runs.
| Setting (Mandelbrot–Zipf) | ARC | W-TinyLFU | DDQN | Miss-DDQN | MTCT (Ours) | MTCT w/o Delayhit |
|---|---|---|---|---|---|---|
| Stationary (exponent 1.2) | 0.6529 | 0.6740 | 0.6642 ± 0.0198 | 0.6630 ± 0.0198 | 0.6857 ± 0.0000 | N/A |
| Non-stationary (shifted exponent) | 0.4679 | 0.5041 | 0.4893 ± 0.0133 | 0.4600 ± 0.0080 | 0.5166 ± 0.0001 | 0.4881 ± 0.0391 |
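For context, a Mandelbrot–Zipf request stream with a mid-trace popularity shift (as in the non-stationary setting) can be generated as follows. This is an illustrative generator, not the authors' workload code; the catalog size, shift point, shifted exponent, and offset are hypothetical parameters.

```python
import numpy as np

def mandelbrot_zipf_probs(n_items, alpha, q):
    """P(rank k) proportional to 1 / (k + q)^alpha for k = 1..n_items."""
    ranks = np.arange(1, n_items + 1)
    weights = 1.0 / (ranks + q) ** alpha
    return weights / weights.sum()

def request_stream(n_requests, n_items=5000, alpha=1.2, q=10.0,
                   shift_at=None, alpha_after=0.8, seed=0):
    """Yield item ids; optionally change the exponent mid-trace (non-stationary)."""
    rng = np.random.default_rng(seed)
    probs = mandelbrot_zipf_probs(n_items, alpha, q)
    items = rng.permutation(n_items)   # decouple item ids from popularity rank
    for t in range(n_requests):
        if shift_at is not None and t == shift_at:
            probs = mandelbrot_zipf_probs(n_items, alpha_after, q)
            items = rng.permutation(n_items)   # popularity ranking also reshuffles
        yield items[rng.choice(n_items, p=probs)]

# Example: 100k requests with a distribution shift halfway through.
trace = list(request_stream(100_000, shift_at=50_000))
```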
Table 9. Action-set sensitivity on MovieLens at M = 600: last-10-episode mean cache-hit rate (mean ± s.d.) across 5 runs per action-set size.
| Action-Set Size | Hit Rate (Mean ± s.d.) |
|---|---|
| 3 | 0.4197 ± 0.0466 |
| 7 | 0.4532 ± 0.0414 |
| 12 (base) | 0.4703 ± 0.0021 |
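The 3/7/12 action sets index a grid of rank-based eviction rules that trade off popularity against recency. The sketch below shows one plausible realization of such a grid; the weight values and the scoring rule are our assumptions, since the paper's exact grid is not reproduced in this excerpt.

```python
import numpy as np

def make_action_grid(n_actions):
    """Each action is a weight w in [0, 1] mixing popularity rank and recency rank."""
    return np.linspace(0.0, 1.0, n_actions)

def pick_victim(action_w, popularity, last_access):
    """Evict the cached item with the worst weighted rank score.

    popularity:  dict item -> popularity estimate (higher = keep)
    last_access: dict item -> last access time      (more recent = keep)
    """
    items = list(popularity)
    pop_rank = np.argsort(np.argsort([popularity[i] for i in items]))    # 0 = least popular
    rec_rank = np.argsort(np.argsort([last_access[i] for i in items]))   # 0 = least recent
    score = action_w * pop_rank + (1.0 - action_w) * rec_rank
    return items[int(np.argmin(score))]   # lowest combined rank is evicted

# With 12 actions, index 0 behaves LRU-like (pure recency), index 11 LFU-like
# (pure popularity); intermediate actions interpolate between the two.
grid = make_action_grid(12)
victim = pick_victim(grid[3],
                     popularity={"a": 5, "b": 1, "c": 3},
                     last_access={"a": 10, "b": 40, "c": 30})
```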
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).