A Calibrated Relaunch Distance Framework for App Eviction in Smartphone Memory Management

Lee, Jaehwan; Kyung, Yeunwoong

doi:10.3390/electronics15112415


by Jaehwan Lee ¹ and Yeunwoong Kyung ²,*

¹ Department of Computer Science and Engineering, Kongju National University, Cheonan 31080, Republic of Korea
² Department of Electronic Engineering, Seoul National University of Science and Technology, Seoul 01811, Republic of Korea
* Author to whom correspondence should be addressed.

Electronics 2026, 15(11), 2415; https://doi.org/10.3390/electronics15112415

Submission received: 20 April 2026 / Revised: 20 May 2026 / Accepted: 25 May 2026 / Published: 2 June 2026

(This article belongs to the Special Issue Design and Implementation of Embedded Systems for Real-Time Applications)


Abstract

Smartphone operating systems must eventually evict resident apps when memory becomes scarce, yet prior work has focused more on reclaim mechanisms and next app prediction than on the ranking rule that chooses the victim. We study app eviction through relaunch distance and show that generalizing raw relaunch distance prediction is unsafe as a direct policy because small errors among short returns can easily reverse victim ordering, while some resident apps still require fallback handling. Therefore, we propose a calibrated relaunch distance framework that places predicted and fallback candidates on a common scale. In trace-driven fixed capacity app cache simulation on a multi-user smartphone trace, the proposed method remains above LRU from cache capacities $C = 5$ to $C = 13$ on the 279-user evaluation set and improves average hit ratio from 0.8900 to 0.8935. At low cache capacity $C = 5$, it improves hit ratio from 0.7617 to 0.7691, recovering 21.2% of the remaining Oracle–LRU gap, whereas the raw prediction method is below LRU at 0.6283 for the all-user set. The gains are strongest for users with deeper histories, where the margin at $C = 5$ reaches +0.0138 in q4. These results show that calibration is the step that turns relaunch distance prediction into a deployable app eviction policy.

Keywords: mobile operating system; app eviction; relaunch distance prediction; memory management; smartphone traces

1. Introduction

Mobile operating systems keep recently used apps in memory so that users can return to them quickly. Once available memory becomes tight, however, the system must decide which resident app should leave memory. That decision shapes what the user experiences next: an app may restart cold, resume with degraded state, or preserve useful context for a later relaunch [1]. Prior work on smartphone memory management makes clear that user-visible behavior depends not only on reclaim mechanisms such as swap or kill, but also on which resident app is selected as the victim [2,3,4].

To date, most systems work has improved the reclaim process after low memory handling has already begun. Some studies coordinate swap and kill more carefully, some redesign swap on zRAM or secondary storage, and some learn better control of the low memory killer (LMK) [2,3,4,5]. These methods improve which reclaim path is taken or when reclaim is triggered, but do not directly answer a different question that appears once a fixed app cache is full: among the apps already resident in memory, which should be evicted first? The present paper studies this app eviction question rather than page reclaim itself or the policy that decides when the system enters low memory handling.

A separate line of research predicts future app usage from contextual or sequential signals. Next-app prediction, predictive launching, and prefetching studies have shown that smartphone app sequences are partly learnable [6,7,8,9,10,11,12,13,14]. Yet, predicting the next launch and choosing an eviction victim are not the same task. A next-app model ranks future launches, whereas eviction must compare only those apps that are already resident. In practice, the resident set contains a mixture of cases: some apps have informative model outputs, some have sparse history, and some need pure fallback handling. Therefore, a deployable eviction rule needs one score that can order all resident candidates on the same runtime scale.

Relaunch distance is more directly connected to this setting because it asks not only which app returns but how far in the future that return occurs [15]. However, our trace analysis shows that raw relaunch distance prediction is not enough. Because many finite relaunch distances are small, the ordering among near-future returns is compressed and easily perturbed by small prediction errors. At the same time, the resident set also contains apps where return is distant, rare, or absent; these cases are often handled by recency-based fallback. Thus, the practical problem is not that recency already solves eviction, but that raw model outputs and fallback rules are not directly comparable when these cases coexist.

In such traces, a model can reduce its loss on cases that will relaunch soon and still make poor eviction decisions, since even small errors among short distances can reverse the victim order. The central challenge is to turn naive relaunch distance outputs into an eviction score that preserves the strong part of recency while recovering those cases in which prediction provides additional separation. Our goal is not merely to align two signals abstractly, but to make prediction safer than raw model output and more informative than least recently used (LRU) alone.

This observation motivates the central idea of the paper. We formulate smartphone app eviction as a candidate ranking problem that combines clipped relaunch distance prediction with a recency baseline. The proposed calibrated method first constructs a baseline aging scheme that behaves sensibly even when a prediction model has little support for an app. It then shrinks model outputs toward that baseline according to the evidence available for each candidate while keeping exact distinct age updates and explicit treatment of non-returning cases. As a result, apps with strong predictive support and apps that require fallback handling can still be compared by the same runtime score. This preserves the ordering in which recency is preferred when predictive evidence is weak, yet the victim order can still be changed when the model indicates that an app will return later or not at all.

We evaluate four decision rules in the same fixed capacity app cache simulation: an oracle reference built from true relaunch distance, LRU, raw relaunch distance prediction, and the proposed calibrated method. Learned scores are instantiated with both a per-user recurrent backbone and a compact centralized transformer backbone. On the 279-user set with training, validation, and test splits, the proposed method provides the most stable overall calibrated profile on the all-user set: it stays above LRU from $C = 5$ to $C = 13$, where $C$ denotes the number of resident app slots, and improves mean hit ratio from 0.8900 to 0.8935. At $C = 5$, calibration converts the raw transformer margin relative to LRU from $-0.1334$ to $+0.0074$, recovering 21.2% of the remaining oracle–LRU gap. When users are further divided into four groups according to processed launch count, the gains are largest in the group with the deepest history, where the margin at $C = 5$ reaches +0.0138 in q4 (i.e., the top quartile of users by launch count history). These results indicate that the proposed method turns relaunch distance prediction into a usable eviction policy on the full evaluation set.

The main contributions of this paper are as follows:

We formulate smartphone app eviction as a relaunch distance ranking problem and show why pure relaunch distance predictions can still produce unstable victim orderings when many apps return after only a few distinct intervening launches.
We propose a calibrated eviction framework that maps clipped relaunch distance predictions and recency fallback into a single runtime score. The framework uses a distance aging approach based on recency and explicit handling of non-returning cases, allowing mixed candidate sets to be compared consistently at runtime.
We evaluate LRU, oracle relaunch distance, raw prediction, and the proposed calibrated policy under time series splits and fixed-capacity app cache simulation. On the all-user set, the proposed method stays above LRU from $C = 5$ to $C = 13$, while subgroup analysis shows that the largest gains appear for users with deeper histories.

The remainder of this paper is organized as follows: Section 2 reviews the most relevant prior work; Section 3 formalizes the problem and presents the calibrated eviction framework; Section 4 describes the experimental methodology, including preprocessing, training, baselines, and evaluation protocol; Section 5 reports the main results and diagnostic analysis; finally, Section 6 concludes the paper and outlines directions for future work.

2. Related Work

2.1. Smartphone Memory Reclamation

Smartphone memory reclamation has been studied primarily from a systems perspective. Early work recognized that Android cannot rely on secondary storage swap in the same way as desktop systems because naive swap support causes severe response time degradation and excessive write traffic under mobile I/O constraints. This led to a sequence of practical designs centered on how memory should be reclaimed under pressure [16], including LMK-aware reclaim [17], zRAM compression [18], hybrid zRAM and secondary storage schemes [4], and non-volatile memory (NVM) assisted-swap architectures [2,3,19,20]. Representative examples include SmartLMK [17], which improves user perceived launch latency through app-aware reclamation; hybrid swapping schemes that combine zRAM with secondary storage swap; studies that characterize Android swap I/O behavior and motivate NVM assisted designs; and app-aware or selective swap mechanisms that explicitly coordinate reclaim decisions with Android app lifecycle behavior.

Subsequent studies focused more directly on reclaim efficiency under realistic mobile workloads. Recent work has explored how to keep more apps resident without unacceptable hot launch penalties, how to use hot launch history to keep useful apps resident longer, and how to accelerate the reclaim path on modern mobile devices [21,22,23]. Across this literature, the main objective is to reduce launch latency, I/O cost, or write amplification under Android memory pressure. However, most of these studies emphasize reclaim mechanisms and system behavior; when they do encode a victim choice, it is usually driven by practical signals such as recency, page hotness, app class, or launch history rather than explicit prediction of future app reuse.

2.2. Predictive App Usage Context

A separate body of work predicts future app usage from contextual or sequential signals. Early contextual models exploited time, location, and previous app information to predict the next app likely to be used [6]. This predictive view was later extended to systems-oriented prelaunch and prefetch frameworks such as FALCON and PREPP, which used predicted future app usage to reduce perceived launch delay [7,8]. Large-scale next-app prediction was subsequently studied with probabilistic context models such as PTAN/TAN [9]. While these studies established that future app usage is predictable to a meaningful degree, their outputs were designed mainly for next-app ranking, prelaunching, or prefetching rather than for low memory reclamation decisions.

Later work expanded the modeling side through recurrent learning, representation learning, multitask, and graph-based methods. Representative examples include next-app prediction based on LSTM, AppUsage2Vec, NAP, DeepApp, App2Vec, and dynamic app usage graph models [10,11,12,13,14,24,25,26]. Some studies have also brought prediction closer to memory management, for example by applying reinforcement learning to LMK policies [5]. The closest prior work to the present problem is Lee and Park’s memory management study based on relaunch distance, which formulated memory decisions around predicted app relaunch distance rather than just the identity of the next app [15]. Relaunch distance is more directly connected to keep or evict decisions than top-k next app ranking, and as such provides the most relevant starting point for the present paper.

While both groups of prior work are relevant, neither is sufficient by itself for the problem addressed here. Smartphone memory reclamation studies provide strong systems insight into LMK/kswapd interaction, swap overhead, and app launch behavior, but usually rely on heuristic decision rules or app/page hotness surrogates. Predictive app usage studies provide an effective way of modeling future launches, but most approaches target recommendation, prelaunch, or prefetch rather than eviction decisions under memory pressure. Memory management based on relaunch distance narrows this gap by making future reuse more directly relevant to memory decisions, yet one additional step is still required: predicted future usage signals must be translated into stable decision scores that remain meaningful under uncertainty and realistic Android reclaim dynamics. Therefore, the next section formalizes the reclamation problem and introduces the proposed calibrated method.

3. Calibrated Relaunch Distance Prediction

This section formalizes app-level memory reclamation as a victim ranking problem, then derives the calibrated score used throughout the paper. Here, app-level memory reclamation means selecting one resident app state to reclaim after memory pressure has already occurred and the system has already determined that one resident state must be removed. Accordingly, our focus is not on LMK triggering, page reclaim, or the choice between kill and swap paths, but on the ranking rule that determines which resident app should be selected as the victim.

The key difficulty is that the runtime victim set is heterogeneous, since some resident apps are supported by model outputs whereas others must still be ordered by an explicit recency based fallback rule when prediction is unavailable, out of vocabulary, or not yet reliable. Therefore, a useful formulation must do more than predict future use; it must place predictive outputs and fallback candidates on one eviction scale. In this section, calibration refers to this alignment through three components: an explicit non-returning outcome, a recency aging baseline as the fallback rule, and an interpolation that moves a candidate from the baseline toward the model output as app-specific evidence accumulates.

3.1. Problem Formulation

For user $u$, let the launch sequence be $\ell^{(u)} = (\ell_{1}^{(u)}, \ell_{2}^{(u)}, \dots, \ell_{T_{u}}^{(u)})$, where each token is an app identity from the app set $A$. We write $a_{t} = \ell_{t}^{(u)}$ for the app launched at time $t$. This sequence is the basic observation used by both label construction and online victim ranking. Then, we use a fixed capacity cache abstraction. Let $C_{t-1} \subseteq A$ denote the resident set immediately before serving app launch request $a_{t}$, with $|C_{t-1}| \leq C$. From the perspective of Android’s Activity Manager Service (AMS), each cached background app occupies one slot in this fixed capacity abstraction. Therefore, the present section focuses on the case where the mobile operating system must select one victim from the background cached apps under a fixed cache budget.

A request at time $t$ is a relaunch hit if $a_{t} \in C_{t-1}$ and a miss otherwise. If $a_{t} \notin C_{t-1}$ and $|C_{t-1}| = C$, then the request is a miss requiring app eviction, and the policy must choose one victim $v_{t} \in C_{t-1}$. Under this abstraction, the objective is to maximize the hit ratio, i.e., the fraction of launches served from the resident set under the fixed cache budget.
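The hit/miss accounting in this abstraction can be illustrated with a minimal replay loop. The sketch below uses plain LRU as the victim rule only because it is the simplest policy to state; the function name `lru_hit_ratio` and the toy trace are illustrative assumptions, not the authors' simulator.

```python
from collections import OrderedDict


def lru_hit_ratio(launches, capacity):
    """Replay a launch sequence through a fixed-capacity app cache with LRU eviction.

    Returns the fraction of launches that found the app already resident.
    """
    resident = OrderedDict()  # keys ordered from least to most recently used
    hits = 0
    for app in launches:
        if app in resident:
            hits += 1
            resident.move_to_end(app)  # refresh recency on a hit
        else:
            if len(resident) == capacity:
                resident.popitem(last=False)  # evict the least recently used app
            resident[app] = True
    return hits / len(launches) if launches else 0.0


if __name__ == "__main__":
    trace = ["a", "b", "a", "c", "b", "a"]
    print(lru_hit_ratio(trace, capacity=3))
```

Any other policy can be slotted in by replacing the `popitem` call with a different victim choice over the resident set.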

To quantify future usefulness, we adopt the app relaunch distance introduced in prior work and restate it here in the notation used in the previous paper [15]. For an occurrence at time $t$ where app $a_{t}$ is launched, let

$$t^{+}(t) = \min \{ s > t \mid \ell_{s}^{(u)} = a_{t} \} \tag{1}$$

if such an index exists. Here, $t^{+}(t)$ denotes the first future index after $t$ at which the same app is launched again. The relaunch distance at time $t$ is then defined as

$$d_{t} = \begin{cases} |\{ \ell_{s}^{(u)} \mid t < s < t^{+}(t) \}|, & t^{+}(t) < \infty, \\ -1, & t^{+}(t) = \infty. \end{cases} \tag{2}$$

Thus, the finite relaunch distance counts the number of distinct intervening app identities until the next launch of the same app. Evicting that app for which the next relaunch is farther in the future makes an immediate cold launch less likely; therefore, a larger future relaunch distance makes an app a more suitable victim candidate.
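As a concrete reading of the definition in (2), the following sketch labels every position of a launch sequence with its relaunch distance. It is a direct quadratic-time transcription under the assumption of 0-indexed positions; `relaunch_distances` is a hypothetical helper name, and a production labeler would likely use a faster algorithm.

```python
def relaunch_distances(launches):
    """Return d_t for each position t of the launch sequence.

    d_t is the number of distinct app identities launched strictly between
    position t and the next launch of the same app, or -1 if the app never
    returns within the sequence.
    """
    n = len(launches)
    distances = [-1] * n
    for t, app in enumerate(launches):
        seen = set()  # distinct intervening identities
        for s in range(t + 1, n):
            if launches[s] == app:
                distances[t] = len(seen)
                break
            seen.add(launches[s])
    return distances


if __name__ == "__main__":
    print(relaunch_distances(["a", "b", "c", "b", "a", "b"]))
```

In the toy sequence, the first launch of `a` is separated from its next launch by `b, c, b`, which counts as two distinct identities rather than three events.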

However, the raw target in (2) is not used directly for learning or runtime scoring. The relaunch distance has a long-tailed distribution, whereas the resident set budgets of interest in this paper are small and most of the trace mass is concentrated at small finite distances. Preserving arbitrarily large finite values would allocate excessive numerical resolution to cases that are less relevant to the eviction budgets studied here. Accordingly, we clip the finite relaunch distance at a fixed threshold $d_{\mathrm{clip}}$ and use

$$\tilde{d}_{t} = \min(d_{t}, d_{\mathrm{clip}}) \quad \text{for } d_{t} \geq 0. \tag{3}$$

This clipped target preserves the ordering that matters most near the cache budgets of interest while keeping the distance range stable.

However, clipping alone is still not sufficient for runtime victim ranking. First, the relaunch distance is discrete and heavily concentrated at small values, so multiple resident apps can share the same finite clipped distance at an eviction point. Therefore, a deployable policy still needs a recency-consistent order for tied candidates and those that must be ranked without usable predictions. Second, some launched occurrences do not reappear in the remaining train/validation split. These cases should not be absorbed into the top clipped finite bin, because an occurrence that does not return within the current trace horizon should be ordered beyond every finite clipped distance. Therefore, we introduce an explicit non-returning outcome. An occurrence is non-returning if it does not reappear in the remaining split, which corresponds to the case $t^{+}(t) = \infty$ in (2). This is an occurrence label within the current trace horizon, not a claim that the app is globally abandoned by the user. We define the non-return indicator

$$y_{t}^{\mathrm{nr}} = \begin{cases} 1, & d_{t} = -1, \\ 0, & d_{t} \geq 0. \end{cases} \tag{4}$$

Predicting $y_{t}^{\mathrm{nr}}$ separately is useful because it prevents non-returning cases from being blurred with large finite return cases during training and later allows the runtime score to place them strictly beyond the clipped finite range.
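The two training targets in (3) and (4) can be derived from raw relaunch distances in a single pass. The sketch below is illustrative only: `make_labels` is a hypothetical name, and the threshold value used in the example is an assumption, since the text does not state the value of the clipping threshold.

```python
def make_labels(distances, d_clip):
    """Map raw relaunch distances to (non-return indicator, clipped finite target).

    The finite target is None for non-returning occurrences (d = -1), because
    the finite regressor is not trained on those cases.
    """
    labels = []
    for d in distances:
        if d < 0:
            labels.append((1, None))  # non-returning within the split
        else:
            labels.append((0, min(d, d_clip)))  # finite return, clipped
    return labels


if __name__ == "__main__":
    # d_clip = 5 is an arbitrary illustrative value.
    print(make_labels([2, 1, -1, 7], d_clip=5))
```

Keeping the non-returning case as a separate label, rather than folding it into the top clipped bin, mirrors the separation the text argues for.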

At runtime, the policy cannot observe $d_{t}$ directly; it only observes past launches. Therefore, we need an online state that is semantically aligned with (2) and that can provide consistent recency ordering when candidates are tied on the finite target or unsupported by prediction. For any app $a$ that has appeared before time $t$, let

$$\tau_{t}(a) = \max \{ s < t \mid \ell_{s}^{(u)} = a \} \tag{5}$$

be its most recent launch time before $t$, and let

$$H_{t}(a) = \{ \ell_{s}^{(u)} \mid \tau_{t}(a) < s < t \}. \tag{6}$$

We then define the distinct age of $a$ at time $t$ as

$$k_{t}(a) = |H_{t}(a)|. \tag{7}$$

Therefore, $k_{t}(a)$ is the number of distinct app identities launched since the last launch of $a$. The distinct age in (7) is updated online, providing the recency signal that is later used to order tied or fallback candidates. Larger $k_{t}(a)$ means that more distinct launches have intervened since app $a$ was last observed.
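The distinct age in (5) to (7) can be checked against a small trace by computing it straight from the definition. The sketch assumes 0-indexed positions, so `t` here is the number of launches already observed; `distinct_age` is a hypothetical helper, not part of the paper's implementation.

```python
def distinct_age(launches, t, app):
    """Compute k_t(app) from the definition.

    Looks back from position t - 1 for the most recent launch of `app`, then
    counts the distinct identities launched strictly after it and before t.
    Returns None if `app` has not appeared before t.
    """
    for s in range(t - 1, -1, -1):
        if launches[s] == app:
            return len(set(launches[s + 1:t]))
    return None


if __name__ == "__main__":
    history = ["a", "b", "c", "b", "d"]
    for app in ["a", "b", "c", "d"]:
        print(app, distinct_age(history, len(history), app))
```

This backward scan is only meant for verification; at runtime the same quantity would be maintained incrementally.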

Therefore, the runtime problem is a mixed-candidate ranking problem. Some resident apps have usable relaunch distance outputs, some are fallback-only, and even apps with predicted values should not be allowed to immediately override recency when their evidence is still scarce. This distinction matters because smartphone traces are heavily concentrated at small finite relaunch distances, so even a modest mismatch between score scales can invert the victim order at eviction time. Consequently, the formulation must satisfy three requirements: (i) finite relaunch cases and non-returning cases must be represented separately, with non-returning ordered beyond the clipped finite range; (ii) predicted and fallback candidates must be compared on a common relaunch distance scale; and (iii) when predictions are tied, unavailable, or weak, the policy should remain recency-consistent through the aging order.

3.2. Prediction Model

The predictor is designed to provide the quantities required by the calibrated runtime score rather than to solve next-app classification as a separate task. Each launched occurrence is encoded once at launch time, and the resulting outputs are later attached to the corresponding resident app state and consulted only when a miss occurs requiring eviction. We use two outputs: the first is a non-returning logit, which estimates whether the current occurrence has no relaunch within the remaining split suffix; the second is a finite relaunch-distance estimate, which is defined only for occurrences that do return within the current horizon. This two-head decomposition follows the label construction in Section 3.1: boundary detection and finite-distance estimation are learned separately so that non-returning cases do not distort the finite regressor and the later eviction score remains interpretable.

When app $a$ is resident at a later decision time $t$, the policy reads the most recently stored output pair associated with its last launch. Let $z_{t}^{\mathrm{nr}}(a)$ and $\hat{d}_{t}^{\mathrm{fin}}(a)$ denote these values.

The non-returning probability is

$$p_{t}^{\mathrm{nr}}(a) = \sigma(z_{t}^{\mathrm{nr}}(a)), \tag{8}$$

where $\sigma(\cdot)$ is the sigmoid function. To place non-returning cases strictly beyond the clipped finite range used for finite targets, we define

$$d_{\mathrm{nr}} = d_{\mathrm{clip}} + 1, \tag{9}$$

which is the smallest integer value that ranks a non-returning occurrence after every finite returning occurrence. This avoids the scale distortion that would arise from assigning a large constant to non-returning applications. The two outputs are then collapsed into a single model-level future use score:

$$m_{t}(a) = (1 - p_{t}^{\mathrm{nr}}(a)) \, \hat{d}_{t}^{\mathrm{fin}}(a) + p_{t}^{\mathrm{nr}}(a) \, d_{\mathrm{nr}}. \tag{10}$$

Thus, $m_{t}(a)$ is the model-only future use score on the same clipped distance axis used by the labels, with non-returning cases mapped beyond the finite range.
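To make the collapse in (8) to (10) concrete, the sketch below combines a non-returning logit and a finite distance estimate into one score. The function names and the numeric inputs are illustrative assumptions; the plain `math.exp` sigmoid is adequate for moderate logits but would overflow for very large negative ones.

```python
import math


def sigmoid(z):
    """Logistic function; fine for moderate |z|."""
    return 1.0 / (1.0 + math.exp(-z))


def model_score(z_nr, d_fin_hat, d_clip):
    """Collapse the two head outputs into one future-use score.

    The non-returning outcome is placed at d_clip + 1, one step beyond the
    clipped finite range, and mixed with the finite estimate by probability.
    """
    p_nr = sigmoid(z_nr)
    d_nr = d_clip + 1
    return (1.0 - p_nr) * d_fin_hat + p_nr * d_nr


if __name__ == "__main__":
    # With a neutral logit, the score is halfway between 4.0 and d_nr = 10.
    print(model_score(0.0, 4.0, d_clip=9))
```

The score always lies between the finite estimate and the non-returning anchor, so a confident non-return prediction pushes the app toward the top of the victim list.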

Let

$$\ell_{\mathrm{bce}}(z, y) = -y \log \sigma(z) - (1 - y) \log(1 - \sigma(z)), \tag{11}$$

where $z$ is a logit, $y \in \{0, 1\}$, and $\sigma(\cdot)$ is the sigmoid function. For the finite distance residual $r$, we use the Huber loss

$$\rho_{\delta}(r) = \begin{cases} \frac{1}{2} r^{2}, & |r| \leq \delta, \\ \delta \left( |r| - \frac{1}{2} \delta \right), & |r| > \delta, \end{cases} \tag{12}$$

which is less sensitive to large finite distance errors than a purely quadratic loss [27].

The training objective is defined as

$$L_{t} = \lambda_{\mathrm{nr}} \, \ell_{\mathrm{bce}}(z_{t}^{\mathrm{nr}}, y_{t}^{\mathrm{nr}}) + \mathbb{1}[y_{t}^{\mathrm{nr}} = 0] \, \lambda_{\mathrm{rd}} \, w_{t} \, \rho_{\delta}(\hat{d}_{t}^{\mathrm{fin}} - \tilde{d}_{t}). \tag{13}$$

The first term applies a binary cross-entropy loss to the non-returning logit, separating non-returning occurrences from finite-return occurrences. The second term is applied only to finite-return occurrences and regresses the clipped finite relaunch distance. The weight $w_{t}$ increases the influence of errors around the cache budgets of interest, while the Huber threshold $\delta$ limits the effect of large residuals from the long tail of the finite relaunch distances.
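The per-occurrence objective in (11) to (13) can be written out in a few lines of scalar Python. This is a sketch rather than the training code: the helper names are hypothetical, the default `w = 1.0` is an assumption because the exact form of the weight is not given here, and the cross-entropy uses a standard numerically stable rearrangement of (11).

```python
import math


def bce_with_logit(z, y):
    """Stable form of -y*log(sigmoid(z)) - (1-y)*log(1-sigmoid(z))."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))


def huber(r, delta):
    """Quadratic near zero, linear beyond the threshold delta."""
    a = abs(r)
    return 0.5 * r * r if a <= delta else delta * (a - 0.5 * delta)


def occurrence_loss(z_nr, y_nr, d_fin_hat, d_clipped,
                    w=1.0, lam_nr=1.0, lam_rd=1.0, delta=1.0):
    """Loss for one launched occurrence.

    The regression term is added only when the occurrence returns (y_nr == 0),
    so d_clipped may be None for non-returning occurrences.
    """
    loss = lam_nr * bce_with_logit(z_nr, y_nr)
    if y_nr == 0:
        loss += lam_rd * w * huber(d_fin_hat - d_clipped, delta)
    return loss


if __name__ == "__main__":
    print(occurrence_loss(0.0, 0, 5.0, 2.0))  # finite-return case
    print(occurrence_loss(0.0, 1, 3.0, None))  # non-returning case
```

A batched PyTorch version would follow the same structure, with the indicator implemented as a mask over the finite-return occurrences.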

3.3. Fallback and Scoring

The model output $m_{t}(a)$ is still not sufficient for runtime use, as the resident set at a miss can contain apps with usable predictions, apps with no usable predictions, and apps whose predictions are based on too little evidence to trust the output. Therefore, a deployable policy needs a fallback score that is defined for every resident app and that shares the same numerical meaning as $m_{t}(a)$.

We derive this fallback from distinct age rather than from a separate heuristic. Because both distinct age and relaunch distance count distinct intervening app identities, a natural fallback at age $k$ is the expected clipped relaunch distance of finite-return cases with relaunch distance of at least $k$. Therefore, we estimate the fallback baseline

$$b(k) \approx E[\min(D, d_{\mathrm{clip}}) \mid D \geq k, \ D \geq 0]. \tag{14}$$

This training trace baseline is enforced to be monotone nondecreasing in $k$, so a resident app that has been bypassed by more distinct launches never receives a smaller fallback score. The monotonicity also preserves the recency order needed when multiple candidates are indistinguishable on the clipped finite target.
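One way to estimate the baseline in (14) from training labels is an empirical conditional mean per age. The sketch below is an assumption-laden illustration: `fit_fallback_baseline` is a hypothetical name, monotonicity is enforced with a running maximum, and ages with no supporting samples reuse the previous value. The text does not specify how the authors enforce monotonicity or handle empty tails.

```python
def fit_fallback_baseline(distances, d_clip, k_max):
    """Estimate b(k) for k = 0..k_max from raw relaunch distances.

    b(k) is the mean clipped distance over finite-return samples with raw
    distance >= k. A running maximum keeps the table nondecreasing in k.
    """
    finite = [d for d in distances if d >= 0]  # drop non-returning (-1)
    baseline = []
    prev = 0.0
    for k in range(k_max + 1):
        tail = [min(d, d_clip) for d in finite if d >= k]
        value = sum(tail) / len(tail) if tail else prev  # carry forward if empty
        prev = max(prev, value)
        baseline.append(prev)
    return baseline


if __name__ == "__main__":
    print(fit_fallback_baseline([0, 1, 1, 3, -1, 6], d_clip=4, k_max=5))
```

Because the table is indexed by distinct age, a fallback-only app can be scored at runtime with a single lookup.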

For a fallback-only candidate, the score is $r_{t}^{\mathrm{fb}}(a) = b(k_{t}(a))$. Thus, a candidate without prediction is not compared through an unrelated heuristic, but through a calibrated baseline defined on the same clipped future use scale as the predictive score.

Even when a usable prediction exists, raw model outputs should not dominate immediately. For apps with limited history, the model output is best viewed as a correction to the fallback baseline rather than as a standalone decision score. Let $E_{t} \subseteq C_{t-1}$ denote the resident apps for which the most recent launch carries a usable model output pair at decision time $t$. Let $n_{t}(a)$ be the number of past launches of app $a$ observed before time $t$. We use the evidence shrinkage weight

$$\alpha_{t}(a) = \frac{n_{t}(a)}{n_{t}(a) + \tau_{\alpha}}, \tag{15}$$

where $\tau_{\alpha} > 0$ is an evidence scale. Setting $n_{t}(a) = 0$ yields pure fallback, while larger evidence gradually transfers weight to the predictor without gating. The value of $\tau_{\alpha}$ is dataset-dependent and should be selected during the training or validation stage.

The unified score used for eviction is

$$s_{t}(a) = \begin{cases} b(k_{t}(a)) + \alpha_{t}(a) \, (m_{t}(a) - b(k_{t}(a))), & a \in E_{t}, \\ b(k_{t}(a)), & a \notin E_{t}. \end{cases} \tag{16}$$

Equation (16) provides the calibration method. A model predicted candidate starts from the aging baseline and moves gradually towards the model score as app history accumulates. A fallback candidate remains on the baseline. Because both terms are expressed on the same clipped relaunch distance scale, predicted and fallback candidates can be ranked directly in one victim list.
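The shrinkage in (15) and (16) amounts to a few arithmetic steps per candidate. In the sketch below, `model_value=None` stands for a resident app outside the eligible set; the function names and the evidence scale used in the example are illustrative assumptions.

```python
def evidence_weight(n_launches, tau_alpha):
    """Shrinkage weight: 0 with no history, approaching 1 as history grows."""
    return n_launches / (n_launches + tau_alpha)


def unified_score(baseline_value, model_value, n_launches, tau_alpha):
    """Calibrated eviction score for one resident app.

    Fallback-only candidates (model_value is None) stay on the baseline;
    predicted candidates move from the baseline toward the model score.
    """
    if model_value is None:
        return baseline_value
    alpha = evidence_weight(n_launches, tau_alpha)
    return baseline_value + alpha * (model_value - baseline_value)


if __name__ == "__main__":
    # tau_alpha = 5.0 is an arbitrary illustrative evidence scale.
    for n in (0, 5, 50):
        print(n, unified_score(3.0, 7.0, n, tau_alpha=5.0))
```

With five past launches and an evidence scale of five, the score sits exactly halfway between the baseline and the model output.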

The following two properties clarify why the unified score is suitable for runtime use.

Proposition 1. If $b(k)$ is monotone nondecreasing, then the fallback ranking induced by $b(k_{t}(a))$ is equivalent, up to ties, to ranking by distinct age $k_{t}(a)$, meaning that it is recency-consistent on the processed trace.

Proof. If $k_{1} < k_{2}$, monotonicity gives $b(k_{1}) \leq b(k_{2})$. Thus, the ordering induced by $b(k)$ never reverses the ordering induced by distinct age and coincides with it wherever $b$ is strictly increasing. Since $k_{t}(a)$ is the aging recency depth on the processed trace, the fallback ranking is recency-consistent on that trace. □

Proposition 2. The unified score in (16) compares predicted and fallback candidates on one clipped distance scale. In particular, $\alpha_{t}(a) = 0$ yields a pure fallback ranking, whereas $\alpha_{t}(a) = 1$ yields a pure model ranking for predictable candidates.

Proof. Both $b(k_{t}(a))$ and $m_{t}(a)$ are expressed on the same clipped relaunch distance scale; therefore, Equation (16) is a convex interpolation between a calibrated fallback baseline and a relaunch distance estimate, which makes direct comparison between fallback and predicted candidates numerically meaningful. □

3.4. Online Updates and Victim Selection

Because both $d_{t}$ and $k_{t}(a)$ are defined in terms of distinct intervening apps, the online state must evolve over distinct identities. Suppose that request $x = a_{t}$ is served at time $t$. The launched app resets its own age, while every other resident app records $x$ as an intervening identity only once:

$$H_{t+1}(x) = \emptyset, \qquad H_{t+1}(a) = H_{t}(a) \cup \{x\} \ \text{ for each other resident app } a. \tag{17}$$

Hence, $k_{t+1}(x) = 0$. For any other resident app $a$, the distinct age increases only when $x \notin H_{t}(a)$; repeated launches of identities already counted for $a$ do not change its age further. This exact update is necessary because the online state must represent the same number of distinct identities that appears in the relaunch distance.
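The update in (17) maps naturally onto one set per resident app. The sketch below keeps these sets in a dictionary and returns the resulting distinct ages; `serve_request` and the dictionary layout are illustrative assumptions, and eviction is deliberately left out so that only the aging step is shown.

```python
def serve_request(intervening, x):
    """Apply the distinct-age update after app x is launched.

    `intervening` maps each tracked app to the set of distinct identities
    launched since its own last launch. The dictionary is updated in place,
    and the current distinct ages are returned.
    """
    for app, seen in intervening.items():
        if app != x:
            seen.add(x)  # set semantics: a repeated identity adds nothing
    intervening[x] = set()  # the launched app resets its own age
    return {app: len(seen) for app, seen in intervening.items()}


if __name__ == "__main__":
    state = {}
    for launched in ["a", "b", "c", "b", "d"]:
        ages = serve_request(state, launched)
    print(ages)
```

In the toy trace, the second launch of `b` does not increase the age of `a`, because `b` was already counted as an intervening identity for `a`.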

When an eviction required miss occurs, the policy selects the victim

$$v_{t} = \arg\max_{a \in C_{t-1}} s_{t}(a), \tag{18}$$

that is, the resident app with the largest calibrated score (i.e., the app expected to be reused farthest in the future) is reclaimed first.

If predictive confidence is weak or if no usable prediction is available, then (16) reduces to the fallback order over the aging baseline. When each resident app has stored its current distinct age, eligibility state, evidence count, and model outputs, the online decision in (18) becomes a scan over the resident set, i.e., $O(C)$.
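The selection in (18) is a single pass over the resident set. The sketch below assumes the scores have already been computed and breaks exact score ties by larger distinct age; that tie-break is an assumption made here for recency consistency, and `select_victim` is a hypothetical name.

```python
def select_victim(scores, ages):
    """Pick the resident app with the largest calibrated score.

    `scores` and `ages` map each resident app to its unified score and its
    distinct age. Ties on score go to the app with the larger distinct age.
    """
    return max(scores, key=lambda app: (scores[app], ages[app]))


if __name__ == "__main__":
    scores = {"a": 2.0, "b": 3.5, "c": 3.5}
    ages = {"a": 4, "b": 1, "c": 2}
    print(select_victim(scores, ages))
```

The scan touches each resident app once, which matches the linear cost in the number of resident slots noted above.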

4. Experimental Setup

4.1. Dataset Preparation

We derive app launch traces from the LSApp dataset by filtering foreground app openings and removing consecutive duplicate launches of the same app. This preprocessing converts the raw usage records into launch traces whose event semantics match the distinct-identity counting used for relaunch distance and aging in Section 3. After preprocessing, the dataset contains 198,270 launch events from 291 users covering 87 apps. For each user, the processed launch trace is split into training, validation, and test segments with nominal ratios of 70%:10%:20%. Specifically, for a user with $n$ processed launches, we use $\lfloor 0.70n \rfloor$ events for training, $\lfloor 0.10n \rfloor$ events for validation, and assign the remaining events to the test split. Labels that depend on future launches are computed within each split. During evaluation, each policy replays the history to recover its online state, but hit ratio is accumulated only on the test split. This protocol preserves a realistic cache state at the test boundary without exposing information from later events. All reported summaries use the evaluation set of 279 users with non-empty training, validation, and test splits. To study the effect of available history, we additionally report four user groups defined by processed launch count: q1 (11–88 events, $n = 70$), q2 (89–223, $n = 70$), q3 (225–593, $n = 69$), and q4 (597–12,753, $n = 70$), where $n$ here denotes the number of users in each group.
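The chronological split described above can be sketched as follows. The helper name `split_trace` is hypothetical, and integer arithmetic on percentages is used here in place of floating-point floors to avoid rounding surprises; the resulting sizes are the same as the floor expressions in the text.

```python
def split_trace(launches, train_pct=70, val_pct=10):
    """Split one user's processed launch trace chronologically.

    Training gets floor(0.70 n) events, validation floor(0.10 n) events, and
    the test split receives whatever remains.
    """
    n = len(launches)
    n_train = n * train_pct // 100
    n_val = n * val_pct // 100
    train = launches[:n_train]
    val = launches[n_train:n_train + n_val]
    test = launches[n_train + n_val:]
    return train, val, test


if __name__ == "__main__":
    train, val, test = split_trace(list(range(11)))
    print(len(train), len(val), len(test))
```

Under this rule a trace with fewer than ten launches has an empty validation split, which is consistent with restricting evaluation to users whose three splits are all non-empty.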

Table 1 summarizes the dataset and evaluation scope. A resident app is treated as predictable when its identity appears in the training vocabulary of the active user split and the resident state carries the output pair produced at its most recent launch; all other residents are scored by the fallback only.

4.2. Model Benchmark Setup

We compare five policies: LRU, Oracle-RD, LeCaR-APP, Hybrid-RD(LSTM), and the proposed Calibrated-RD(Tx). LRU is the principal recency baseline; Oracle-RD is a reference that uses ground truth relaunch distance when a future relaunch exists and applies the same fallback as the proposed method to ineligible or UNK resident apps; LeCaR-APP is an app level adaptation of the online LRU/LFU expert structure in LeCaR [28]; Hybrid-RD(LSTM) combines a raw LSTM relaunch distance rank [15] with an LRU rank; and Calibrated-RD(Tx) is the proposed method. All direct comparisons use the same 279 evaluation users, the same train/validation/test split protocol, and the same fixed-capacity replay setting. All preprocessing, model implementation, and evaluation were conducted using Python 3.12 (Python Software Foundation, Wilmington, DE, USA), with NumPy 2.4.6, pandas 3.0.3, and PyTorch 2.12 (Meta Platforms, Inc., Menlo Park, CA, USA).

Each launch event requests one app. A request is counted as a hit when the app is already resident and as a miss otherwise. If free capacity is available, the requested app is inserted without eviction; if the resident set is full, the policy selects one resident app as the victim and inserts the requested app. All reported comparisons use every integer capacity from $C = 5$ to $C = 15$. For compact presentation, selected tables additionally report $C \in \{5, 10, 15\}$, the tight capacity average over $C = 5, \dots, 8$, and Avg_C, where Avg_C denotes the per-user mean hit ratio averaged over capacities $5, \dots, 15$. Table 2 lists the evaluated policies and core settings.
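The replay rule above can be illustrated with the LRU baseline. The sketch below is a simplified stand-in for the evaluation harness, not the authors' implementation; `replay_lru`, `avg_c`, and the `test_start` index are hypothetical names. It replays the whole history so that the cache state at the test boundary is warm, but counts hits only on the test portion.

```python
from collections import OrderedDict


def replay_lru(events, capacity, test_start):
    """Replay a launch sequence through a fixed-capacity LRU app cache.

    Hits are accumulated only for event indices >= test_start.
    """
    resident = OrderedDict()  # app -> None, ordered least to most recent
    hits = total = 0
    for i, app in enumerate(events):
        hit = app in resident
        if hit:
            resident.move_to_end(app)         # refresh recency
        else:
            if len(resident) >= capacity:
                resident.popitem(last=False)  # evict least recently used
            resident[app] = None              # insert the requested app
        if i >= test_start:
            total += 1
            hits += hit
    return hits / total if total else 0.0


def avg_c(events, test_start, capacities=range(5, 16)):
    """Mean hit ratio over the integer capacities C = 5, ..., 15 for one user."""
    ratios = [replay_lru(events, c, test_start) for c in capacities]
    return sum(ratios) / len(ratios)
```

Any other policy would replace only the victim choice in the `popitem` step; the hit accounting and the warm-up replay stay the same.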

LeCaR-APP treats each app identifier as one cache object and updates the LRU/LFU expert weights online during replay. Hybrid-RD(LSTM) ranks resident candidates by $λ \cdot \mathrm{rank}_{RD} + (1 - λ) \cdot \mathrm{rank}_{LRU}$. For Calibrated-RD(Tx), the loss in (13) uses $λ_{nr} = 1$, $λ_{rd} = 1$, and Huber threshold $δ = 1$. The transformer backbone [29] is trained with AdamW using a learning rate of 0.001, weight decay of 0.0001, batch size of 512, gradient clipping of 1.0, and early stopping with patience of 5 for up to 50 epochs. During personalization, the backbone is frozen and only the user specific calibration parameters are updated with AdamW using a learning rate of 0.005, L2 regularization of 0.01, and early stopping with patience of 2 for up to 5 epochs.
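To make the Hybrid-RD(LSTM) rank blend concrete, the following sketch combines the two orderings with $λ = 0.3$ from Table 2. The rank direction is an assumption made for illustration: rank 0 is the app most worth keeping in each ordering, and the victim is the app with the largest blended rank. The function name `hybrid_victim` and its inputs are hypothetical.

```python
def hybrid_victim(pred_rd, last_used, lam=0.3):
    """Pick the eviction victim by blending two rank orderings.

    pred_rd:   app -> predicted relaunch distance (larger = returns later)
    last_used: app -> index of the most recent launch (larger = more recent)
    Assumption: rank 0 is the app most worth keeping; the victim has the
    largest value of lam * rank_RD + (1 - lam) * rank_LRU.
    """
    apps = list(pred_rd)
    # Soonest predicted return gets rank 0.
    by_rd = sorted(apps, key=lambda a: pred_rd[a])
    # Most recently used gets rank 0.
    by_lru = sorted(apps, key=lambda a: -last_used[a])
    rank_rd = {a: r for r, a in enumerate(by_rd)}
    rank_lru = {a: r for r, a in enumerate(by_lru)}
    score = {a: lam * rank_rd[a] + (1 - lam) * rank_lru[a] for a in apps}
    return max(apps, key=lambda a: score[a])
```

With $λ = 0.3$ the recency rank carries most of the weight, which is consistent with the observation that this policy stays close to LRU.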

4.3. AOSP Implementation and Emulator Setup

The fixed capacity trace simulation described above is useful for isolating the victim selection behavior of each policy under the same launch sequence; however, it abstracts away several effects that are important in an actual Android system, including memory footprints, launch latency, and the interaction between memory management policies. To address this limitation, we additionally implement the proposed method in AOSP and construct an emulator benchmark. We use this to verify that the proposed ranking can be exercised through Android’s cached process management path. Table 3 summarizes the host and Android build configuration used for the emulator experiments.

The implementation modifies the Android framework layer’s frameworks/base subproject and integrates the proposed policy into the ActivityManager/OomAdjuster path. The implementation adds CalibratedRdRanker, which maintains the per-process calibrated relaunch distance score, and NativeRdTransformer, which provides native inference through JNI/C++ under services/core/jni. The compact transformer predictor is embedded in AOSP as a mixed-INT8 quantized native runtime rather than invoked through an external service. Process creation events in ProcessList and activity launch events in ActivityManagerService trigger one native inference for the launched package. The resulting outputs are converted into a calibrated score and stored in a volatile score table indexed by process/package identity. When OomAdjuster computes cache-process adjustment values, the implementation uses the stored score only to reorder eligible cached app/process candidates. Foreground, perceptible, service, and other higher priority processes remain governed by Android’s existing priority rules. The runtime overhead of this mixed-INT8 implementation is reported in Section 5.3.
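The division of labor described above (scores written at launch time, read later when cached candidates are ordered) can be sketched in a few lines. This is a conceptual illustration only, not AOSP code: the actual implementation lives in the Java/C++ framework layer, and the names `order_cached_candidates`, `scores`, and the `"cached"` state label are hypothetical. It also assumes that a larger score means a later expected relaunch and that a cached package without a stored score is reclaimed first.

```python
def order_cached_candidates(processes, scores, default=float("inf")):
    """Return cached packages in reclaim order (first = reclaimed first).

    processes: list of (package, state) pairs in the stock ordering.
    scores:    package -> calibrated score stored at the last launch.
    Only entries whose state is "cached" are reordered; foreground,
    perceptible, and service processes are left to the existing rules
    and never appear in the result.
    """
    cached = [pkg for pkg, state in processes if state == "cached"]
    # Stable sort: larger score (later expected relaunch) is reclaimed earlier.
    return sorted(cached, key=lambda pkg: -scores.get(pkg, default))


# Scores are refreshed once per launch event and only read during ordering.
scores = {}
scores["com.example.mail"] = 3.0
scores["com.example.maps"] = 12.0
order = order_cached_candidates(
    [("com.example.mail", "cached"),
     ("com.example.player", "foreground"),
     ("com.example.maps", "cached")],
    scores,
)
```

Keeping inference on the launch path and a plain lookup on the ordering path means the ordering step does no model work of its own.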

The original LSApp trace refers to real apps by identifier, and installing and controlling all corresponding Play Store apps would introduce login requirements, network I/O, and irreproducible background behavior. Therefore, the emulator benchmark uses a suite of package distinct synthetic apps. Each synthetic APK shares the same benchmark implementation but has a distinct package name and footprint profile. The benchmark runner deterministically maps LSApp app identifiers to these synthetic packages, preserving the chronological test sequence while allowing the memory behavior of each app to be controlled.

The synthetic apps are designed to create realistic cached process memory pressure. Each app has a mixture of retained Java heap objects, direct/native buffers, file-backed/dirty mmap regions, bitmap memory, SQLite I/O, and a hot working set for app launch. All regions are explicitly touched at page granularity of 4 KiB so that requested memory is reflected in the measured resident footprint. The synthetic apps are distributed into 34 small, 32 medium, 16 large, and 5 huge apps, with their respective PSS memory footprints ranging from 40–85 MB, 90–190 MB, 210–430 MB, and 500–850 MB. After launch, the default policy is to retain memory across onPause() and onStop() so that the process continues to contribute cached-memory pressure. The benchmark logs report the requested footprint, measured PSS/RSS, swap PSS, and launch time.
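The page-touching step matters because untouched allocations may never be backed by physical memory. The sketch below shows the idea on an anonymous mapping; it is an illustration of the technique in Python rather than the synthetic apps' Android code, and `allocate_and_touch` is a hypothetical name.

```python
import mmap

PAGE_SIZE = 4096  # 4 KiB, the granularity stated in the text


def allocate_and_touch(num_bytes):
    """Map anonymous memory and write one byte into every 4 KiB page.

    Writing to each page forces the kernel to back it with physical
    memory, so the allocation is reflected in resident-set measurements.
    Returns the mapping and the number of pages touched.
    """
    buf = mmap.mmap(-1, num_bytes)  # anonymous mapping
    touched = 0
    for offset in range(0, num_bytes, PAGE_SIZE):
        buf[offset] = 1             # dirty the page
        touched += 1
    return buf, touched
```

The caller must keep the returned mapping alive for as long as the memory pressure is needed, which mirrors the default policy of retaining memory across onPause() and onStop().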

5. Performance Evaluation

5.1. Main Results

When the resident set is full and a miss occurs requiring eviction, the policy must choose one resident app as the victim. Figure 1 compares the five policies on the 279 evaluation users and Figure 2 shows the performance deltas of the main policies relative to LRU. At the tightest capacity of $C = 5$, Calibrated-RD(Tx) obtains the highest hit ratio among the non-oracle policies, 0.7691 compared with 0.7617 for LRU. LeCaR-APP also improves over LRU, reaching 0.7672, whereas Hybrid-RD(LSTM) remains close to LRU at 0.7634. When averaged over $C = 5, \dots, 15$, LeCaR-APP gives the largest average, 0.8943, followed by Calibrated-RD(Tx) at 0.8935. The advantage of Calibrated-RD(Tx) is most visible in the tight capacity range. Over $C = 5, \dots, 8$, Calibrated-RD(Tx) reaches 0.8290, compared with 0.8282 for LeCaR-APP, 0.8220 for Hybrid-RD(LSTM), and 0.8212 for LRU.

Table 4 summarizes the same result at selected operating points. Hybrid-RD(LSTM) provides a small margin over LRU, with an Avg_C of 0.8903 compared to 0.8900 for LRU. This suggests that simply combining raw relaunch distance predictions with recency is insufficient. Although LeCaR-APP is a stronger online baseline and provides the best Avg_C, Calibrated-RD(Tx) assigns a comparable score to every resident candidate rather than selecting between LRU and LFU experts. This full resident ranking is most beneficial when the resident set is small and each eviction decision has a larger effect on future relaunches. At larger capacities, the results become saturated. For example, at $C = 15$, the gap between Oracle-RD and LRU is only 0.0009. The practical operating point for the proposed calibrated ranking is the tight memory regime, where the choice of victim remains consequential.

5.2. Quartile Analysis Based on Launch Count

We next localize the all-user results by processed launch quartile. The four groups q1–q4 are the groups defined in Section 4.1. Figure 3 reports the hit ratio margin over LRU for LeCaR-APP, Hybrid-RD(LSTM), and Calibrated-RD(Tx). The short history groups show limited separation. In q1, while the margins are small, Calibrated-RD(Tx) provides the largest $C = 5$ and Avg_C gains over LRU. In q2, LeCaR-APP is stronger than Calibrated-RD(Tx), especially at $C = 5$. This suggests that the online LRU/LFU expert baseline is competitive when the available per user relaunch evidence remains moderate.

The deeper history groups show the strongest tight capacity behavior. In q4 at $C = 5$, Calibrated-RD(Tx) reaches 0.8135, compared to 0.7997 for LRU, 0.8107 for LeCaR-APP, and 0.8015 for Hybrid-RD(LSTM). This corresponds to a $+0.0138$ margin over LRU and recovers 41.3% of the gap between Oracle-RD and LRU. The corresponding LeCaR-APP recovery is 32.9%, while Hybrid-RD(LSTM) recovers 5.1%. Table 5 shows the same pattern in compact form. Averaged over all capacities, LeCaR-APP remains the strongest policy in q2–q4, whereas Calibrated-RD(Tx) is strongest in q1.

5.3. Emulator Benchmark Results

We next report the AOSP emulator benchmark using the implementation and synthetic APK suite from Section 4.3. The comparison uses the stock LRU cached-process ordering and the proposed Calibrated-RD(Tx) ordering. The reported launch time and virtual memory footprint per launch event are measured over the benchmark interval.

Table 6 summarizes the cost of the native predictor. The mixed-INT8 runtime performs 558k MAC operations per inference and takes 0.2 ms on average in the emulator configuration from Table 3. The implementation runs inference at app launch events and stores the resulting scores. During cached process ordering, the Android path reads the stored scores and ranks only the currently eligible cached candidates.

Table 7 reports the emulator benchmark. Calibrated-RD(Tx) reduces per-event mean launch latency by 1.74 ms relative to LRU. This behavior is consistent with the policy changing the average quality of cached-process retention under pressure, while individual launch events can still be slower depending on the resident set and reclaim state at that point in the replay.

The virtual memory counters move in the direction of lower average memory activity. Relative to LRU, Calibrated-RD(Tx) reduces the average pgfault delta by 244.5, pgmajfault by 2.4, pswpin by 1.7, and pswpout by 129.9. In relative terms, the largest reduction is on the write reclaim path: pswpout falls by about 5.8%, whereas each of the other counters changes by less than 1%. These counters do not measure user experience by themselves, but indicate that the calibrated cached-process ordering does not obtain the mean launch time reduction by increasing average swap traffic in this benchmark.

5.4. Discussion

The main takeaway is that capacity determines whether an eviction policy can change the outcome. At large capacities, the resident set already captures most near-relaunches, and the oracle reference nearly coincides with LRU. In our all-user results, the gap between oracle and LRU shrinks from $+0.0351$ at $C = 5$ to only $+0.0009$ at $C = 15$. The merging of the curves at $C = 14$ and $C = 15$ is evidence that the fixed capacity cache reaches a saturated regime where little recoverable headroom remains; conversely, $C = 5$ is the regime where a wrong eviction is most likely to turn a future return into a cold relaunch, and where a better ranking rule can still recover part of the oracle–LRU gap.

This observation changes how the absolute gains should be interpreted. The all-user Avg_C improvement is small in raw hit ratio units, but is obtained against a strong recency baseline and within a narrow remaining headroom. Table 4 reports the hit ratios from which both the margin and the oracle-gap closure follow. On the all-user set, Calibrated-RD(Tx) improves Avg_C from 0.8900 to 0.8935, recovering 29.3% of the remaining oracle–LRU gap. At $C = 5$, it improves the hit ratio from 0.7617 to 0.7691, recovering 21.2% of the remaining gap. The gap closure is more informative than the absolute margin alone because it expresses the gain as a fraction of the headroom that is still available under memory pressure.
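The gap closure can be reproduced approximately from the rounded entries of Table 4. The sketch below assumes the closure is defined as the policy's margin over LRU divided by the oracle–LRU gap; the helper name `gap_recovery` is hypothetical, and the handling of a zero gap is an illustrative choice.

```python
def gap_recovery(policy, lru, oracle):
    """Fraction of the oracle-LRU headroom recovered by a policy."""
    headroom = oracle - lru
    if headroom <= 0:
        return 0.0  # saturated regime: nothing left to recover
    return (policy - lru) / headroom


# Rounded all-user hit ratios from Table 4.
at_c5 = gap_recovery(policy=0.7691, lru=0.7617, oracle=0.7968)
avg_c = gap_recovery(policy=0.8935, lru=0.8900, oracle=0.9017)
print(f"C=5: {at_c5:.1%}, Avg_C: {avg_c:.1%}")
```

With the four-decimal table values this gives roughly 21% at $C = 5$ and roughly 30% for Avg_C. The reported 21.2% and 29.3% fall within the range that the table's rounding allows, so they were presumably computed from unrounded hit ratios.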

The second implication is that relaunch distance prediction should be evaluated as a system-level policy, not just a prediction task. Raw relaunch distance outputs are not directly comparable to recency fallback scores or non-returning cases. Hybrid-RD(LSTM) shows that combining a raw relaunch distance ranking with an LRU ranking is a safe but weak use of the predictive signal. While LeCaR-APP is a stronger baseline, it selects only one victim at a time rather than providing an eviction ranking for all resident candidates. In contrast, Calibrated-RD(Tx) produces a system-consumable eviction ranking for the resident app set; when this ranking is applied to the cached-process oom_score_adj, the operating system can use relaunch evidence directly in memory management, protecting apps that are likely to return while reclaiming lower-priority cached apps earlier. In the emulator benchmark, this ordering is accompanied by a lower mean launch latency and lower average virtual memory counters than the stock LRU ordering (Table 7).

The remaining limits concern generality and measurement scope. The LSApp trace provides chronological foreground app launches and supports per-user training, validation, and test splits; however, the 279-user fixed capacity evaluation does not establish global representativeness or guarantee the same gain magnitude on newer smartphone workloads. Larger and more recent traces may differ in app ecosystems and user switching patterns. The emulator benchmark addresses a different limitation by adding heterogeneous synthetic footprints and Android cached-process execution, but does not replace physical-device energy measurement or evaluation on real installed applications. The proposed calibrated resident ranking is most useful under tight cache capacity and sufficient launch history, and the emulator benchmark shows that the same ranking can be embedded and measured in AOSP. Validation on larger modern app usage traces and physical devices remains necessary in order to establish broader deployment-level generality.

6. Conclusions

This paper studies smartphone app eviction as a victim app selection problem rather than as a next-app classification or lower-layer memory reclamation task. We argue that relaunch distance is the appropriate future use quantity for this setting, and that raw prediction alone is insufficient because eviction decisions must compare predicted and fallback candidates on a common scale. To address this mismatch, we proposed a calibrated relaunch distance framework that combines an explicit non-returning outcome, an aging baseline, and weighted interpolation between prediction and fallback scores.

In our evaluation on the 279-user set, the proposed method stayed above LRU from $C = 5$ to $C = 13$ and became effectively tied with LRU at $C = 14$ and $C = 15$, which is consistent with saturation of the oracle–LRU gap at larger capacities. The most important regime is tight capacity: at $C = 5$, the proposed Calibrated-RD(Tx) improves the hit ratio over LRU from 0.7617 to 0.7691, outperforming both LeCaR-APP and Hybrid-RD(LSTM) in this regime. The AOSP emulator results further show that the proposed method reduces the mean launch time and the measured virtual memory counters per event. These results indicate that calibrated ranking of resident apps can be used by the operating system as an eviction priority signal.

The main limitation of this paper is that the quantitative gains are measured on a single app usage trace, and the system-level results are obtained by mapping that trace onto synthetic APK workloads in an emulator. Broader validation on larger modern app usage traces, real-world apps on physical devices, and energy measurements remains as future work.

Author Contributions

Conceptualization, J.L. and Y.K.; methodology, J.L. and Y.K.; experimental design, J.L. and Y.K.; data curation, J.L.; writing—original draft preparation, J.L.; writing—review and editing, Y.K.; supervision, Y.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by Seoul National University of Science and Technology.

Data Availability Statement

The data analyzed in this study were derived from the publicly available LSApp dataset, available at GitHub: https://github.com/aliannejadi/LSApp (accessed on 1 March 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Huang, J.; Zhang, Y.; Qiu, J.; Liang, Y.; Ausavarungnirun, R.; Li, Q.; Xue, C.J. More Apps, Faster Hot-Launch on Mobile Devices via Fore/Background-aware GC-Swap Co-design. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems; ASPLOS ’24; Association for Computing Machinery: New York, NY, USA, 2024; Volume 3, pp. 654–670.
2. Kim, J.; Bahn, H. Maintaining Application Context of Smartphones by Selectively Supporting Swap and Kill. IEEE Access 2020, 8, 85140–85153.
3. Kim, J.; Bahn, H. Analysis of Smartphone I/O Characteristics—Toward Efficient Swap in a Smartphone. IEEE Access 2019, 7, 129930–129941.
4. Han, J.; Kim, S.; Lee, S.; Lee, J.; Kim, S.J. A Hybrid Swapping Scheme Based On Per-Process Reclaim for Performance Improvement of Android Smartphones (August 2018). IEEE Access 2018, 6, 56099–56108.
5. Li, C.; Bao, J.; Wang, H. Optimizing low memory killers for mobile devices using reinforcement learning. In Proceedings of the 2017 13th International Wireless Communications and Mobile Computing Conference (IWCMC); IEEE: New York, NY, USA, 2017; pp. 2169–2174.
6. Huang, K.; Zhang, C.; Ma, X.; Chen, G. Predicting mobile application usage using contextual information. In Proceedings of the 2012 ACM Conference on Ubiquitous Computing; Ubicomp ’12; ACM: New York, NY, USA, 2012; pp. 1059–1065.
7. Yan, T.; Chu, D.; Ganesan, D.; Kansal, A.; Liu, J. Fast app launching for mobile devices using predictive user context. In Proceedings of the 10th International Conference on Mobile Systems, Applications, and Services; MobiSys ’12; ACM: New York, NY, USA, 2012; pp. 113–126.
8. Parate, A.; Böhmer, M.; Chu, D.; Ganesan, D.; Marlin, B.M. Practical prediction and prefetch for faster access to applications on mobile phones. In Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing; UbiComp ’13; ACM: New York, NY, USA, 2013; pp. 275–284.
9. Baeza-Yates, R.; Jiang, D.; Silvestri, F.; Harrison, B. Predicting The Next App That You Are Going To Use. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining; WSDM 2015; ACM: New York, NY, USA, 2015; pp. 285–294.
10. Xu, S.; Li, W.; Zhang, X.; Gao, S.; Zhan, T.; Zhao, Y.; Zhu, W.W.; Sun, T. Predicting Smartphone App Usage with Recurrent Neural Networks. In Wireless Algorithms, Systems, and Applications; Springer International Publishing: Berlin/Heidelberg, Germany, 2018; Chapter 44; pp. 532–544.
11. Zhao, S.; Luo, Z.; Jiang, Z.; Wang, H.; Xu, F.; Li, S.; Yin, J.; Pan, G. AppUsage2Vec: Modeling Smartphone App Usage for Prediction. In Proceedings of the 2019 IEEE 35th International Conference on Data Engineering (ICDE); IEEE: New York, NY, USA, 2019; pp. 1322–1333.
12. Moreira, G.S.; Jo, H.; Jeong, J. NAP: Natural App Processing for Predictive User Contexts in Mobile Smartphones. Appl. Sci. 2020, 10, 6657.
13. Alruban, A. Prediction of Application Usage on Smartphones via Deep Learning. IEEE Access 2022, 10, 49198–49206.
14. Katsarou, K.; Yu, G.; Beierle, F. WhatsNextApp: LSTM-Based Next-App Prediction with App Usage Sequences. IEEE Access 2022, 10, 18233–18247.
15. Lee, J.; Park, S. An Efficient Memory Management for Mobile Operating Systems Based on Prediction of Relaunch Distance. Comput. Syst. Sci. Eng. 2023, 47, 171–186.
16. Moon, G.; Kang, D. VSwap: A New Extension to the Swap Mechanism for Enabling Swap Memory Space Optimization. Appl. Sci. 2025, 15, 12049.
17. Kim, S.H.; Jeong, J.; Kim, J.S.; Maeng, S. SmartLMK: A Memory Reclamation Scheme for Improving User-Perceived App Launch Time. ACM Trans. Embed. Comput. Syst. 2016, 15, 47.
18. Li, X.; Yang, Z.; Wu, M.; Chen, H.; Zang, B. Object-Aware Memory Compression for Smartphones. ACM Trans. Archit. Code Optim. 2025, 22, 139.
19. Kim, S.H.; Jeong, J.; Kim, J.S. Application-Aware Swapping for Mobile Systems. ACM Trans. Embed. Comput. Syst. 2017, 16, 182.
20. Yoon, H.; Cho, K.; Bahn, H. Storage Type and Hot Partition Aware Page Reclamation for NVM Swap in Smartphones. Electronics 2022, 11, 386.
21. Challa, P.; Song, B.; Jiang, S. MemSaver: Enabling an All-in-memory Switch Experience for Many Apps in a Smartphone. In Proceedings of the 15th ACM/SPEC International Conference on Performance Engineering; ICPE ’24; ACM: New York, NY, USA, 2024; pp. 267–275.
22. Li, W.; Chang, L.P.; Mao, Y.; Shi, L. PMR: Fast Application Response via Parallel Memory Reclaim on Mobile Devices. In Proceedings of the 2025 USENIX Annual Technical Conference; USENIX Association: Berkeley, CA, USA, 2025; pp. 1569–1584. Available online: https://www.usenix.org/conference/atc25/presentation/li-wentong (accessed on 1 March 2026).
23. Sareen, K.; Blackburn, S.M.; Hamouda, S.S.; Gidra, L. Memory Management on Mobile Devices. In Proceedings of the 2024 ACM SIGPLAN International Symposium on Memory Management; ISMM ’24; ACM: New York, NY, USA, 2024; pp. 15–29.
24. Xia, T.; Li, Y.; Feng, J.; Jin, D.; Zhang, Q.; Luo, H.; Liao, Q. DeepApp: Predicting Personalized Smartphone App Usage via Context-Aware Multi-Task Learning. ACM Trans. Intell. Syst. Technol. 2020, 11, 64.
25. Wang, H.; Li, Y.; Du, M.; Li, Z.; Jin, D. App2Vec: Context-Aware Application Usage Prediction. ACM Trans. Knowl. Discov. Data 2021, 15, 112.
26. Ouyang, Y.; Guo, B.; Wang, Q.; Liang, Y.; Yu, Z. Learning Dynamic App Usage Graph for Next Mobile App Recommendation. IEEE Trans. Mob. Comput. 2023, 22, 4742–4753.
27. Huber, P.J. Robust Estimation of a Location Parameter. Ann. Math. Stat. 1964, 35, 73–101.
28. Vietri, G.; Rodriguez, L.V.; Martinez, W.A.; Lyons, S.; Liu, J.; Rangaswami, R.; Zhao, M.; Narasimhan, G. Driving Cache Replacement with ML-based LeCaR. In Proceedings of the 10th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 18), Boston, MA, USA, 9–10 July 2018.
29. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.U.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30.

Figure 1. All user fixed capacity benchmark results over the 279 evaluation users.

Figure 2. Margins over LRU on the all-user set. The calibrated method has the strongest margin on tight capacity, while LeCaR-APP provides broader average improvement over the full capacity range.

Figure 3. Subgroup analysis over four user groups defined by launch count: (a) q1, (b) q2, (c) q3, and (d) q4. Each curve reports the mean hit ratio margin over LRU.

Table 1. Trace profile.

Item	Value
Trace source	LSApp dataset
Events/users/apps	198,270/291/87
Split ratio	train/validation/test = $70\%$ : $10\%$ : $20\%$
Relaunch distance profile	$p_{50} = 2$ , $p_{90} = 7$ , $p_{99} = 16$
Clipping/non-returning	$d_{clip} = 16$ , $d_{nr} = 17$
Evaluation users	279 users with non-empty splits
User groups	q1: 11–88 ( $n = 70$ ); q2: 89–223 ( $n = 70$ ); q3: 225–593 ( $n = 69$ ); q4: 597–12,753 ( $n = 70$ )
Reported capacities	$C = 5, \dots, 15$

Table 2. Benchmarked policies and core evaluation settings.

Item	Setting
Compared policies	LRU, Oracle-RD, LeCaR-APP, Hybrid-RD(LSTM), Calibrated-RD(Tx)
LeCaR-APP	Initial weights $W_{LRU} = W_{LFU} = 0.5$ ; learning rate $0.45$ ; history size $C$ ; discount rate $0.005^{1/C}$
Hybrid-RD(LSTM)	Hidden sizes $[128, 64]$ ; dropout 0.2; $λ = 0.3$
Calibrated-RD(Tx)	Context length 32; model dimension 128; 2 layers; 4 heads; dropout 0.2; $d_{clip} = 16$ ; $d_{nr} = 17$ ; trust coefficient $α = n / (n + 10)$
Loss parameters	$λ_{nr} = 1$ ; $λ_{rd} = 1$ ; Huber threshold $δ = 1$

Table 3. AOSP build and emulator host configuration.

Item	Configuration
Host OS	macOS Tahoe 26.4.1 (25E253), arm64
Host CPU	Apple M2 Max, 12 cores @ 3.70 GHz
Host GPU	Apple M2 Max, 38 cores @ 1.40 GHz, integrated
Host memory	96.00 GiB
AOSP base	AOSP 16.0.0_r4
Build target	sdk_phone64_arm64-userdebug
Guest CPU	4 vCPU cores
Guest memory	3 GiB RAM
Device type	ARM64 Android virtual device

Table 4. All-user summary across the five evaluated policies. Bold values mark the strongest non-oracle policy in each row.

Scope	LRU	Oracle-RD	LeCaR-APP	Hybrid-RD (LSTM)	Calibrated-RD (Tx)
$C = 5$	0.7617	0.7968	0.7672	0.7634	0.7691
$C = 5$ –8 avg.	0.8212	0.8455	0.8282	0.8220	0.8290
$C = 10$	0.9098	0.9174	0.9143	0.9096	0.9120
$C = 15$	0.9541	0.9549	0.9550	0.9540	0.9536
Avg_C	0.8900	0.9017	0.8943	0.8903	0.8935

Table 5. Margins over LRU. Δ values are mean hit ratio differences against LRU, while values in bold mark the largest margin in each group and summary column.

Group	Δ_LeCaR (C = 5)	Δ_Cal (C = 5)	Δ_LeCaR (Avg_C)	Δ_Cal (Avg_C)
q1	+0.0002	+0.0056	+0.0032	+0.0039
q2	+0.0075	+0.0046	+0.0032	+0.0013
q3	+0.0034	+0.0059	+0.0050	+0.0037
q4	+0.0110	+0.0138	+0.0056	+0.0047

Table 6. Runtime cost of the mixed-INT8 quantized Calibrated-RD(Tx).

Item	Value
Quantized format	Mixed-INT8
Residency footprint	37 MiB
MAC operations per inference	558 k
Mean inference time	0.2 ms

Table 7. Emulator benchmark comparison between LRU and Calibrated-RD(Tx). Delta is Calibrated-RD(Tx) minus LRU.

Metric	LRU	Calibrated-RD (Tx)	Δ
Resume hit ratio	0.9556	0.9594	+0.0038
Cold launch ratio	0.1389	0.1367	−0.0022
Mean launch time per event (ms)	235.40	233.66	−1.74
P95 launch time (ms)	602.0	600.0	−2.0
pgfault	31,594.2	31,349.7	−244.5
pgmajfault	705.0	702.6	−2.4
pswpin	676.9	675.3	−1.7
pswpout	2252.8	2122.9	−129.9
