scRL: Utilizing Reinforcement Learning to Evaluate Fate Decisions in Single-Cell Data

Fu, Zeyu; Chen, Chunlin; Wang, Song; Wang, Junping; Chen, Shilei

doi:10.3390/biology14060679


by Zeyu Fu ¹,†, Chunlin Chen ²,†, Song Wang ¹, Junping Wang ¹,* and Shilei Chen ¹,*

¹ State Key Laboratory of Trauma and Chemical Poisoning, Institute of Combined Injury, Chongqing Engineering Research Center for Nanomedicine, College of Preventive Medicine, Army Medical University, Chongqing 400038, China

² Department of Rehabilitation Medicine, The First Affiliated Hospital, Sun Yat-sen University, Guangzhou 510080, China

* Authors to whom correspondence should be addressed.

† These authors contributed equally to this work.

Biology 2025, 14(6), 679; https://doi.org/10.3390/biology14060679

Submission received: 8 May 2025 / Revised: 4 June 2025 / Accepted: 8 June 2025 / Published: 11 June 2025

(This article belongs to the Special Issue AI Deep Learning Approach to Study Biological Questions (2nd Edition))


Simple Summary

Understanding how cells develop into different types during growth and disease is crucial for advancing medicine, but current methods cannot pinpoint exactly when and where cells make these critical decisions. We developed a new artificial intelligence tool called single-cell reinforcement learning that treats cell development like a strategic decision-making game. Just as a chess player learns to make optimal moves, our system learns to identify the precise moments when cells decide their future fate—whether to become blood cells, immune cells or other specialized types. We tested this approach on various biological systems, including normal human blood cell development, cancer cells, mouse organ development and cells responding to radiation damage. Our method consistently outperformed fifteen existing state-of-the-art tools and successfully identified early decision points that occur before cells show obvious signs of commitment to specific lineages. Additionally, we discovered previously unknown regulatory factors that control these decisions. This breakthrough provides scientists with a powerful new way to understand how cells make developmental choices, which could lead to better treatments for diseases like cancer and improved strategies for regenerative medicine. By revealing the hidden decision-making logic of cellular development, this work opens new possibilities for controlling and directing cell fate in therapeutic applications.

Abstract

Single-cell RNA sequencing now profiles whole transcriptomes for hundreds of thousands of cells, yet existing trajectory-inference tools rarely pinpoint where and when fate decisions are made. We present single-cell reinforcement learning (scRL), an actor–critic framework that recasts differentiation as a sequential decision process on an interpretable latent manifold derived with Latent Dirichlet Allocation. The critic learns state-value functions that quantify fate intensity for each cell, while the actor traces optimal developmental routes across the manifold. Benchmarks on hematopoiesis, mouse endocrinogenesis, acute myeloid leukemia, and gene-knockout and irradiation datasets show that scRL surpasses fifteen state-of-the-art methods in five independent evaluation dimensions, recovering early decision states that precede overt lineage commitment and revealing regulators such as Dapp1. Beyond fate decisions, the same framework produces competitive measures of lineage-contribution intensity without requiring ground-truth probabilities, providing a unified and extensible approach for decoding developmental logic from single-cell data.

Keywords: single-cell; reinforcement learning; actor–critic; fate decisions; trajectory inference; dimensionality reduction

1. Introduction

1.1. Single-Cell Sequencing and Dimensionality Reduction

Single-cell sequencing now enables genome-wide measurements at the resolution of individual cells, exposing previously hidden heterogeneity across tissues and disease states [1,2,3]. Each experiment routinely profiles tens of thousands of genes in thousands to millions of cells, yielding data that are not only high-dimensional but also sparse, noisy and burdened by technical artefacts [4]. Consequently, DR (dimensionality-reduction) methods have become indispensable for extracting biological structure from such data [5].

A major downstream application of DR in the single-cell field is pseudotime analysis, which seeks to order cells along putative developmental trajectories [6,7]. Visualization techniques—including PCA (principal component analysis), t-SNE (t-distributed Stochastic Neighbour Embedding) and UMAP (Uniform Manifold Approximation and Projection)—facilitate exploratory analysis and often serve as the basis for trajectory inference [8,9,10]. Indeed, tools such as PAGA, Monocle 3, Slingshot, and CAPITAL exploit UMAP embeddings to reconstruct branching lineages [11,12,13,14]. Yet, despite their success in ordering cells, existing approaches seldom address where and when lineage commitment occurs, nor do they provide a quantitative framework to evaluate these fate decisions. In reality, differentiation is not strictly hierarchical but resembles a rugged landscape in which progenitors become lineage-restricted at context-dependent times [15]. Detecting the continuous states of fate decisions that guide this process therefore remains an open problem.

1.2. Manifold Learning for Dimensionality Reduction

Linear DR techniques, exemplified by PCA, are computationally efficient but fail to capture the nonlinear geometry typical of biological systems [16]. Manifold-learning algorithms address this limitation by assuming that high-dimensional observations reside on, or near, a low-dimensional manifold embedded in ambient space [10,17]. Isomap [18], LLE (Locally Linear Embedding) [19], t-SNE [20] and UMAP [10] each preserve a different aspect of the manifold (geodesic distances, local linear reconstructions or neighbourhood probabilities), thereby recovering structure inaccessible to linear projections. Their theoretical foundations draw on differential geometry, topology and graph theory, providing a flexible toolkit for biological data characterized by nonlinear relationships and multiple developmental branches [16].

1.3. Applications in Single-Cell Data Analysis

Manifold learning has become integral to single-cell workflows because it reveals continuous developmental paths, rare subpopulations and functional gradients that remain hidden in the original space [21,22]. UMAP visualizations, for example, readily separate major immune lineages and their subtypes from $\mathrm{CD45}^{+}$ scRNA-seq profiles [23]. Specialized adaptations align transcriptomic with epigenomic or electrophysiological modalities [24,25], and Gaussian-process latent variable models capture noisy gene expression as smooth functions of latent states [26]. Nevertheless, even state-of-the-art manifold approaches primarily describe trajectories; they do not explicitly evaluate the decisions that drive differentiation.

1.4. Reinforcement Learning for Cellular Differentiation Trajectories

RL (reinforcement learning) offers a principled framework for sequential decision-making in complex, dynamic environments [27]. An RL agent iteratively selects actions, observes transitions and updates its policy to maximize the cumulative reward—attributes that have delivered breakthroughs in robotics, game playing and navigation [28,29,30]. Conceptually, cellular differentiation mirrors this paradigm: (i) a transcriptomic profile represents the state, (ii) regulatory events such as gene activation or repression correspond to actions, (iii) differentiation steps define state transitions, and (iv) developmental constraints or lineage markers provide rewards [31].

RL is therefore well-suited to explore high-dimensional single-cell spaces, balance the discovery of novel routes with optimization of known trajectories and pinpoint branch points where fate decisions occur [32,33]. Deep RL extends these capabilities to raw, high-dimensional inputs by coupling neural networks with value- or policy-based optimization [34]. For scRNA-seq data—often exceeding 10,000 genes—such expressivity is essential [35]. Framing differentiation as an RL task promises to reveal regulatory logic that conventional pseudotime tools overlook, particularly in multi-lineage systems with complex branching topologies [36].

1.5. Our Contribution

Here we introduce scRL, a reinforcement-learning framework that integrates manifold learning, an actor–critic architecture and biologically informed reward functions to decode fate decisions from single-cell transcriptomes. Briefly, we (i) construct an interpretable latent space via Latent Dirichlet Allocation (LDA), (ii) embed this space onto a two-dimensional grid that preserves UMAP topology, and (iii) train an RL agent whose critic learns state values reflecting lineage potential. scRL thereby identifies pre-expression decision states, quantifies lineage and gene decision intensities, and uncovers regulatory factors. Across diverse datasets (human hematopoiesis, acute myeloid leukemia, mouse endocrinogenesis, Dapp1 knockout and irradiation injury), scRL outperformed benchmark fate-inference and pseudotime methods, revealed novel regulators and mapped dynamic fate biases (see Results, Figures 1–10). By viewing differentiation through the lens of sequential decision optimization, scRL provides a rigorous, extensible framework for studying cell-fate decisions, with potential applications in regenerative medicine, oncology and developmental biology.

2. Materials and Methods

2.1. Architecture and Workflow of scRL

scRL couples single-cell manifold learning with an actor–critic algorithm in three sequential stages (Figure 1).

Stage 1: Preprocessing and grid embedding (Figure 1A)
(i) Dimension reduction. The gene count matrix is first projected into a latent space.
(ii) Manifold construction and clustering. A UMAP embedding is built from the latent space matrix and Leiden clustering is applied to annotate coarse cell populations.
(iii) Grid embedding. To obtain a lattice that preserves local topology yet yields fully connected paths, edges in the UMAP k-NN graph are filtered by a Canny-style detector; surviving edges are rasterized into an $m \times n$ grid. This transforms scattered points into a continuum that spans primitive to terminal states.
(iv) Pseudotime assignment. Dijkstra’s shortest-path algorithm is run on the grid graph from a user-specified root cluster to derive a monotonic pseudotime score for every grid node.
(v) Bidirectional projection. Each cell inherits the coordinates, pseudotime and neighbourhood of its nearest grid node; conversely, grid nodes maintain links back to the original transcriptomic space.

Figure 1. The comprehensive architecture of the scRL framework for cellular fate decision analysis. (A) The preprocessing pipeline, which integrates dimensionality reduction, clustering, an edge detection-based grid embedding algorithm, subpopulation mapping, pseudotime computation via Dijkstra’s algorithm and bidirectional projection between embedding spaces. Gene expression data and clustering labels form the basis for constructing reward environments, which operate in two modes: a decision mode (reward diminishing along pseudotime) and a contribution mode (reward intensifying along pseudotime). (B) Core functional modules comprising a gene module (gene potential and decision values), a lineage module (lineage potential and decision values) and a trajectory module (differentiation trajectory analysis). (C) The actor–critic reinforcement learning architecture, implemented as distinct neural network architectures within scRL. The critic (value network) processes the latent space—which we validated to be optimally represented by LDA for robust feature extraction—to output cell-state value, specifically the decision or contribution intensity. Concurrently, the actor (policy network) learns and outputs the optimal differentiation trajectories within the grid environment.

Stage 2: Reward design and module definition (Figure 1B)

Gene expression levels create two complementary reward landscapes:

Decision mode: the reward decays exponentially with pseudotime, emphasizing early decision signals.
Contribution mode: the reward increases with pseudotime, capturing cumulative lineage output.

These landscapes feed the following three functional modules:

Gene module calculates gene-level potential and decision intensity.
Lineage module outputs lineage potential and decision intensity.
Trajectory module simulates differentiation by rolling out the learned policy on the grid.

Definition of Decision and Contribution Intensity

Decision Intensity quantifies the “fate intensity” or “decision value” of a cell state, particularly focusing on its significance in shaping future fate decisions before overt lineage commitment. It is derived from the state value learned by the critic when operating in the “decision mode,” where rewards are structured to emphasize early signals by decaying exponentially along pseudotime. Higher decision intensity therefore indicates a cell state with greater potential to influence downstream lineage choices, capturing the “pre-expression decision states” where crucial developmental commitments are made.
Contribution Intensity quantifies the cumulative “lineage output” or “lineage contribution” from a given cell state over time. It is derived from the state value learned by the critic when operating in the “contribution mode,” where rewards are structured to emphasize cumulative output by increasing along pseudotime. Higher contribution intensity reflects a cell’s accumulated propensity or success in differentiating towards and populating a specific lineage over the course of its developmental trajectory.

Stage 3: Actor–critic learning (Figure 1C)

Critic (value network). A multilayer perceptron ingests latent coordinates and predicts the state value, $V(s)$, for each grid node.
Actor (policy network). A parallel network outputs a stochastic policy, $\pi(a \mid s)$, that favors moves increasing expected cumulative reward.
Optimization. Networks are updated with an advantage formulation and an adaptive learning rate until both cumulative reward and maximal $V(s)$ converge.

Validation workflow (Figure A1)

We benchmarked scRL on human $\mathrm{CD34}^{+}$ haematopoietic cells (S-SUBS8) and mouse endocrinogenesis (GSE132188). Lineage-specific markers (GATA1, IRF8 and EBF1 for hematopoiesis; Ngn3 and Fev for endocrinogenesis) served as external validators of pre-expression decision states. Cluster-specific rewards were assigned to erythroid, myeloid and lymphoid (hematopoiesis) or early/late endocrine trajectories. After training:

(a) State values were back-projected onto the UMAP to visualize decision hotspots;
(b) Training dynamics showed smooth convergence of the cumulative reward and stabilization of the maximum state value, confirming learning efficacy;
(c) Tabular Q-learning on discretized state spaces reproduced the neural results, validating the actor–critic implementation.

2.2. Grid-Based Embedding Representation

To transform two-dimensional cell embeddings into a structured grid representation that preserves topological relationships while enabling systematic trajectory analysis, we developed a grid mapping algorithm that discretizes the continuous embedding space into a regular lattice structure.

Grid Generation: Given a two-dimensional embedding matrix, $X \in \mathbb{R}^{m \times 2}$, representing $m$ cells, we first identify the spatial boundaries:

$$x_{\mathrm{right}} = X[\arg\max_{i} X_{i,1}, :], \quad x_{\mathrm{left}} = X[\arg\min_{i} X_{i,1}, :] \tag{1}$$

$$x_{\mathrm{top}} = X[\arg\max_{i} X_{i,2}, :], \quad x_{\mathrm{bottom}} = X[\arg\min_{i} X_{i,2}, :] \tag{2}$$

We then generate an $n \times n$ regular grid spanning the embedding space:

$$x_{k} = x_{\mathrm{left}} + \frac{k - 1}{n - 1} (x_{\mathrm{right}} - x_{\mathrm{left}}), \quad k = 1, 2, \dots, n \tag{3}$$

$$y_{l} = y_{\mathrm{bottom}} + \frac{l - 1}{n - 1} (y_{\mathrm{top}} - y_{\mathrm{bottom}}), \quad l = 1, 2, \dots, n \tag{4}$$

The grid points are defined as $G = \{(x_{k}, y_{l}) : k, l = 1, 2, \dots, n\}$, yielding $n^{2}$ total grid positions.
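As a minimal sketch of Equations (3) and (4), the lattice can be generated from the bounding box of the embedding. The function name `make_grid` and the toy coordinates are illustrative assumptions, not part of the scRL package; plain Python is used for portability.

```python
def make_grid(points, n):
    """Return the n * n grid positions spanning the bounding box of `points`."""
    if n < 2:
        raise ValueError("n must be at least 2")
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_left, x_right = min(xs), max(xs)
    y_bottom, y_top = min(ys), max(ys)
    # Equations (3) and (4) with zero-based indices: k / (n - 1) plays the
    # role of (k - 1) / (n - 1) in the one-based notation of the text.
    grid_x = [x_left + k / (n - 1) * (x_right - x_left) for k in range(n)]
    grid_y = [y_bottom + l / (n - 1) * (y_top - y_bottom) for l in range(n)]
    return [(x, y) for x in grid_x for y in grid_y]


# Hypothetical 2-D embedding of four cells.
cells = [(0.0, 1.0), (2.0, 3.0), (4.0, 2.0), (1.0, 5.0)]
grid = make_grid(cells, n=5)
print(len(grid))          # 25
print(grid[0], grid[-1])  # (0.0, 1.0) (4.0, 5.0)
```

The first and last grid positions coincide with the lower-left and upper-right corners of the bounding box, so every cell lies inside the lattice.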

Boundary Detection: To identify the data boundary, we compute angular distances from spine points to all cells. For each spine point $s_{i}$, we calculate:

$$\theta_{i,j} = \arctan\left(\frac{s_{i,x} - X_{j,x} + \epsilon}{s_{i,y} - X_{j,y} + \epsilon}\right) \tag{5}$$

$$d_{i,j} = \| s_{i} - X_{j} \|_{2} \tag{6}$$

where $\epsilon = 10^{-6}$ prevents division by zero. Boundary cells are identified as:

$$B_{i} = \left\{ \arg\max_{j \,:\, \theta_{i,j} = \theta} d_{i,j} : \theta \in \Theta_{i} \right\} \tag{7}$$

where $\Theta_{i}$ represents the set of unique angular values from spine point $s_{i}$.

Grid Masking: We apply a masking procedure to remove grid points in regions with insufficient cell density. Using $j$ observer points, we compute angular distances and identify valid grid regions:

$$M_{i} = \left\{ g \in G : \operatorname{rank}(d_{i,g}) > \max(\operatorname{rank}(d_{i,b})) + j, \ \forall b \in B \right\} \tag{8}$$

where $\operatorname{rank}(d_{i,g})$ denotes the distance rank of grid point $g$ from observer $i$.

Adjacency Construction: For the remaining mapped grid points, $G_{\mathrm{mapped}}$, we construct an 8-connectivity adjacency matrix, $A$:

$$A_{g_{1}, g_{2}} = \begin{cases} 1 & \text{if } \| \operatorname{pos}(g_{1}) - \operatorname{pos}(g_{2}) \|_{\infty} = 1 \\ 0 & \text{otherwise} \end{cases} \tag{9}$$

where $\operatorname{pos}(g)$ returns the $(i, j)$ grid coordinates of grid point $g$.

Boundary Refinement: Grid boundary points are identified as those with fewer than 8 neighbors:

$$G_{\mathrm{boundary}} = \left\{ g \in G_{\mathrm{mapped}} : \sum_{g'} A_{g, g'} < 8 \right\} \tag{10}$$

A pruning step ensures graph connectivity by iteratively adding adjacent points to isolated boundary regions, maintaining a single connected component in the final grid graph $G = (G_{\mathrm{mapped}}, A)$.
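The adjacency rule of Equation (9) and the boundary criterion of Equation (10) can be illustrated with a short sketch. It assumes that retained grid points are identified by integer index pairs $(i, j)$ and stores neighbours in a dictionary rather than a dense matrix; the helper names are hypothetical.

```python
def build_adjacency(mapped):
    """Map each retained grid index (i, j) to its retained 8-connected neighbours."""
    mapped = set(mapped)
    # All offsets with Chebyshev (infinity-norm) distance exactly 1, as in Equation (9).
    offsets = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1) if (di, dj) != (0, 0)]
    return {
        (i, j): [(i + di, j + dj) for di, dj in offsets if (i + di, j + dj) in mapped]
        for (i, j) in mapped
    }


def boundary_points(adjacency):
    """Equation (10): grid points with fewer than 8 retained neighbours."""
    return {g for g, neighbours in adjacency.items() if len(neighbours) < 8}


# Toy example: a fully retained 4 x 4 lattice.
mapped = [(i, j) for i in range(4) for j in range(4)]
adj = build_adjacency(mapped)
boundary = boundary_points(adj)
print(len(adj[(1, 1)]), len(adj[(0, 0)]), len(boundary))  # 8 3 12
```

On the 4 x 4 toy lattice only the four interior points have all 8 neighbours, so the remaining 12 points form the boundary.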

2.3. Pseudotime Alignment via Graph-Based Distance Computation

To establish a coherent temporal ordering across the grid representation, we developed a pseudotime alignment algorithm that leverages graph-based shortest path computation to propagate temporal information from user-defined starting points throughout the grid network.

Starting Point Selection: The algorithm supports two initialization modes. For single-cell initialization with a specified early cell, $c_{0}$, we identify the corresponding grid point:

$$g_{\mathrm{start}} = \arg\min_{g \in G_{\mathrm{mapped}}} \| X_{c_{0}} - G_{g} \|_{2} \tag{11}$$

For cluster-based initialization with an early cluster, $C_{\mathrm{early}}$, we sample $n_{s}$ starting points from the intersection of cluster and boundary grids:

$$S = \operatorname{sample}_{n_{s}} \left( \{ g \in G_{\mathrm{mapped}} : \operatorname{cluster}(g) = C_{\mathrm{early}} \} \cap G_{\mathrm{boundary}} \right) \tag{12}$$

where $G_{\mathrm{boundary}}$ represents boundary grid points if the boundary constraint is enabled.

Graph Construction: We construct an undirected graph, $H = (V, E)$, from the grid adjacency matrix:

$$V = G_{\mathrm{mapped}}, \quad E = \{ (g_{i}, g_{j}) : A_{g_{i}, g_{j}} = 1 \} \tag{13}$$

where $A$ is the 8-connectivity adjacency matrix from the grid construction phase.

Single-Component Pseudotime Calculation: For a connected graph, we apply Dijkstra’s algorithm from each starting point, $s \in S$:

$$d_{s}(g) = \operatorname{Dijkstra}(H, s, g) \quad \forall g \in V \tag{14}$$

The mean pseudotime across all starting points is computed as:

$$\bar{d}(g) = \frac{1}{|S|} \sum_{s \in S} d_{s}(g) \tag{15}$$

Multi-Component Handling: When the graph contains multiple connected components $\{C_{1}, C_{2}, \dots, C_{k}\}$, we identify the main component containing the majority of starting points:

$$C_{\mathrm{main}} = \arg\max_{i} | S \cap C_{i} | \tag{16}$$

For each secondary component, $C_{j}$, where $j \neq \mathrm{main}$, we establish a connection through the closest grid pair:

$$(g_{\mathrm{main}}, g_{j}) = \arg\min_{g_{m} \in C_{\mathrm{main}},\, g_{s} \in C_{j}} \| G_{g_{m}} - G_{g_{s}} \|_{2} \tag{17}$$

The pseudotime offset for component $C_{j}$ is determined by:

$$\tau_{j} = \bar{d}(g_{\mathrm{main}}) \tag{18}$$

Pseudotime within secondary component $C_{j}$ is calculated as:

$$d_{j}(g) = \operatorname{Dijkstra}(H|_{C_{j}}, g_{j}, g) + \tau_{j} \quad \forall g \in C_{j} \tag{19}$$

Normalization: The final pseudotime values are normalized to the unit interval:

$$\psi(g) = \frac{d(g) - \min_{g' \in V} d(g')}{\max_{g' \in V} d(g') - \min_{g' \in V} d(g')} \tag{20}$$

where $d(g)$ represents the unified distance function across all components:

$$d(g) = \begin{cases} \bar{d}(g) & \text{if } g \in C_{\mathrm{main}} \\ d_{j}(g) & \text{if } g \in C_{j},\ j \neq \mathrm{main} \end{cases} \tag{21}$$
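For the single-component case, Equations (14), (15) and (20) can be sketched as follows. The sketch assumes unit edge weights (the text does not state the weighting) and a connected graph supplied as a neighbour dictionary; the function names are illustrative.

```python
import heapq


def dijkstra(adjacency, source):
    """Shortest-path distances from `source` (Equation (14)), unit edge weights assumed."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbour in adjacency[node]:
            candidate = d + 1.0
            if candidate < dist.get(neighbour, float("inf")):
                dist[neighbour] = candidate
                heapq.heappush(heap, (candidate, neighbour))
    return dist


def pseudotime(adjacency, starts):
    """Mean distance over all starts (Equation (15)), min-max normalised (Equation (20))."""
    runs = [dijkstra(adjacency, s) for s in starts]
    mean = {g: sum(run[g] for run in runs) / len(runs) for g in adjacency}
    low, high = min(mean.values()), max(mean.values())
    span = (high - low) or 1.0  # guard for a degenerate single-node graph
    return {g: (d - low) / span for g, d in mean.items()}


# Toy chain of five grid nodes rooted at node 0.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
psi = pseudotime(chain, starts=[0])
print(psi)  # {0: 0.0, 1: 0.25, 2: 0.5, 3: 0.75, 4: 1.0}
```

Averaging over several starting points, as in Equation (15), makes the ordering less sensitive to the choice of any single root.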

2.4. Bidirectional Projection Between Cells and Grid

To establish correspondence between the original cell embedding and the grid representation, we developed bidirectional projection methods that enable transfer of annotations and continuous variables between these two spaces using distance-based Gaussian kernel weighting.

Cluster Annotation Projection: Given cell cluster annotations $c = \{c_{1}, c_{2}, \dots, c_{m}\}$, where $c_{i}$ represents the cluster label for cell $i$, we project these discrete labels onto grid points using nearest neighbor assignment:

$$\operatorname{cluster}(g) = c_{j^{*}}, \quad \text{where } j^{*} = \arg\min_{j \in \{1, \dots, m\}} \| G_{g} - X_{j} \|_{2} \tag{22}$$

The cluster assignment for each mapped grid point, $g \in G_{\mathrm{mapped}}$, is determined by the cluster label of its closest cell in the embedding space.

Grid-to-Cell Projection: For projecting continuous values from grid space back to cells, we employ a Gaussian kernel-weighted interpolation scheme. For each cell $i$, we identify the $k$ nearest grid points:

$$N_{k}(i) = \{g_{1}, g_{2}, \dots, g_{k}\}, \quad \text{where } \| X_{i} - G_{g_{j}} \|_{2} \leq \| X_{i} - G_{g_{j+1}} \|_{2} \tag{23}$$

The local distance variance is computed as:

$$\sigma_{i}^{2} = \operatorname{Var}\left( \{ \| X_{i} - G_{g} \|_{2} : g \in N_{k}(i) \} \right) \tag{24}$$

Gaussian weights are calculated for each neighboring grid point:

$$w_{i,g} = \frac{\exp\left( - \frac{\| X_{i} - G_{g} \|_{2}^{2}}{2 \sigma_{i}^{2}} \right)}{\sum_{g' \in N_{k}(i)} \exp\left( - \frac{\| X_{i} - G_{g'} \|_{2}^{2}}{2 \sigma_{i}^{2}} \right)} \tag{25}$$

The projected value at cell $i$ is computed as:

$$v_{i} = \sum_{g \in N_{k}(i)} w_{i,g} \cdot V(g) \tag{26}$$

where $V(g)$ represents the grid value at position $g$. For optional annotation weighting with factor $w_{\mathrm{ann}}$:

$$v_{i}^{\mathrm{weighted}} = v_{i} \cdot \log(w_{\mathrm{ann}, i}) \tag{27}$$
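The interpolation in Equations (23) to (26) can be sketched for a single cell as below. Two details are assumptions rather than statements from the text: the variance is taken in its population form, and a small constant `eps` is added so that the kernel remains defined when all $k$ distances coincide. The function name is hypothetical.

```python
import math


def grid_to_cell(cell, grid_points, grid_values, k=3, eps=1e-12):
    """Interpolate a grid-level value at one cell, following Equations (23)-(26)."""
    # Equation (23): the k nearest grid points, ordered by Euclidean distance.
    nearest = sorted((math.dist(cell, g), idx) for idx, g in enumerate(grid_points))[:k]
    dists = [d for d, _ in nearest]
    # Equation (24): variance of the k distances (population form, plus eps).
    mean_d = sum(dists) / len(dists)
    var = sum((d - mean_d) ** 2 for d in dists) / len(dists) + eps
    # Equation (25): Gaussian weights. Subtracting the smallest squared distance
    # leaves the normalised weights unchanged but avoids a 0/0 underflow.
    d0 = dists[0] ** 2
    raw = [math.exp(-(d ** 2 - d0) / (2.0 * var)) for d in dists]
    total = sum(raw)
    # Equation (26): weighted sum of the neighbouring grid values.
    return sum(r / total * grid_values[idx] for r, (_, idx) in zip(raw, nearest))


# Toy grid: three nearby points and one distant outlier that k = 3 excludes.
grid_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
grid_values = [1.0, 2.0, 3.0, 100.0]
centre = grid_to_cell((0.5, 0.5), grid_points, grid_values)
corner = grid_to_cell((0.0, 0.0), grid_points, grid_values)
print(round(centre, 3))
print(round(corner, 3))
```

A cell equidistant from its three neighbours receives their plain average, whereas a cell sitting on a grid point is dominated by that point's value. The cell-to-grid direction follows the same pattern with the roles of cells and grid points exchanged.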

Cell-to-Grid Projection: For projecting cellular data onto grid points, we reverse the projection direction. For each grid point, $g$, we identify the $k$ nearest cells:

$$M_{k}(g) = \{i_{1}, i_{2}, \dots, i_{k}\}, \quad \text{where } \| G_{g} - X_{i_{j}} \|_{2} \leq \| G_{g} - X_{i_{j+1}} \|_{2} \tag{28}$$

The local variance for grid point $g$ is:

$$\sigma_{g}^{2} = \operatorname{Var}\left( \{ \| G_{g} - X_{i} \|_{2} : i \in M_{k}(g) \} \right) \tag{29}$$

Gaussian weights for neighboring cells are:

$$w_{g,i} = \frac{\exp\left( - \frac{\| G_{g} - X_{i} \|_{2}^{2}}{2 \sigma_{g}^{2}} \right)}{\sum_{i' \in M_{k}(g)} \exp\left( - \frac{\| G_{g} - X_{i'} \|_{2}^{2}}{2 \sigma_{g}^{2}} \right)} \tag{30}$$

The projected data value at grid point $g$ for feature $f$ is:

$$D_{g,f} = \sum_{i \in M_{k}(g)} w_{g,i} \cdot D_{i,f} \tag{31}$$

where $D_{i,f}$ represents the data value for feature $f$ in cell $i$.

Normalization: For non-negative projections, final values are normalized to the unit interval:

$$\tilde{v}_{i} = \frac{v_{i} - \min_{j} v_{j}}{\max_{j} v_{j} - \min_{j} v_{j}} \tag{32}$$

2.5. Reinforcement Learning Environment Configuration

To model cellular fate decisions as a sequential decision-making process, we designed a reinforcement learning environment that transforms the grid representation into an MDP (Markov Decision Process), in which agents learn optimal trajectories through reward signals derived from biological annotations.

State Space Construction: For each grid point $g \in G_{\mathrm{mapped}}$, we construct a state representation using the k-nearest neighbor cells in the embedding space. Given latent features $Z \in \mathbb{R}^{m \times d}$ from principal component analysis:

$$N_{k}(g) = \{i_{1}, i_{2}, \dots, i_{k}\}, \quad \text{where } \| G_{g} - X_{i_{j}} \|_{2} \leq \| G_{g} - X_{i_{j+1}} \|_{2} \tag{33}$$

The state vector for grid point $g$ is computed as:

$$s_{g} = \frac{1}{k} \sum_{i \in N_{k}(g)} Z_{i} \tag{34}$$

Action Space Definition: The action space consists of 8 discrete directional movements corresponding to the 8-connectivity neighborhood:

$$A = \{R, RT, T, LT, L, LB, B, RB\} \tag{35}$$

For grid position $(i, j)$, the action mapping is defined as:

$$\phi(a) = \begin{cases} (i, j + 1) & \text{if } a = R \\ (i + 1, j + 1) & \text{if } a = RT \\ (i + 1, j) & \text{if } a = T \\ \quad \vdots \\ (i - 1, j + 1) & \text{if } a = RB \end{cases} \tag{36}$$
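The action mapping of Equation (36) can be written as a lookup table of index offsets. Only the entries for R, RT, T and RB are given explicitly in the equation; the offsets for LT, L, LB and B below are filled in by symmetry and should be read as an assumption.

```python
# Offsets (di, dj) applied to a grid index (i, j).
ACTION_OFFSETS = {
    "R": (0, 1),     # stated in Equation (36)
    "RT": (1, 1),    # stated in Equation (36)
    "T": (1, 0),     # stated in Equation (36)
    "LT": (1, -1),   # assumed by symmetry
    "L": (0, -1),    # assumed by symmetry
    "LB": (-1, -1),  # assumed by symmetry
    "B": (-1, 0),    # assumed by symmetry
    "RB": (-1, 1),   # stated in Equation (36)
}


def step(position, action):
    """Apply one of the 8 directional actions to a grid index (i, j)."""
    di, dj = ACTION_OFFSETS[action]
    return (position[0] + di, position[1] + dj)


print(step((2, 3), "R"))   # (2, 4)
print(step((2, 3), "RB"))  # (1, 4)
```

Each offset has infinity-norm 1, so every action moves the agent to one of the 8 neighbours defined by the adjacency matrix.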

Discrete Reward Function: For lineage-specific trajectory learning, we define rewards based on cluster annotations and pseudotime progression. Given starting clusters $C_{\mathrm{start}}$ and target clusters $C_{\mathrm{end}}$:

$$R_{d}(s, a) = \begin{cases} -1 & \text{if transition leads to masked grid} \\ \exp(-\beta \cdot \hat{t}_{s'}) + \Delta t & \text{if } s' \in C_{\mathrm{end}} \text{ (Decision mode)} \\ 1 - \exp(-\beta \cdot \hat{t}_{s'}) + \Delta t & \text{if } s' \in C_{\mathrm{end}} \text{ (Contribution mode)} \\ 0 & \text{otherwise} \end{cases} \tag{37}$$

where $\hat{t}_{s'}$ is the normalized pseudotime at next state $s'$, $\beta$ is the decay coefficient and $\Delta t = \psi(s') - \psi(s)$ represents pseudotime progression.
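A minimal sketch of Equation (37) follows. It assumes that $\hat{t}_{s'}$ and $\psi(s')$ refer to the same normalized pseudotime, and the default `beta=1.0` is a placeholder because the text does not report the value used.

```python
import math


def discrete_reward(next_masked, next_in_target, t_next, t_curr, beta=1.0, mode="decision"):
    """Reward for one transition under Equation (37)."""
    if next_masked:
        return -1.0  # stepping onto a masked grid point is penalised
    if not next_in_target:
        return 0.0   # no reward outside the target clusters
    delta_t = t_next - t_curr             # pseudotime progression
    decay = math.exp(-beta * t_next)      # exponential term shared by both modes
    if mode == "decision":
        return decay + delta_t
    if mode == "contribution":
        return 1.0 - decay + delta_t
    raise ValueError(f"unknown mode: {mode!r}")


early = discrete_reward(False, True, t_next=0.1, t_curr=0.05, mode="decision")
late = discrete_reward(False, True, t_next=0.9, t_curr=0.85, mode="decision")
print(early > late)  # True: decision mode favours early states
```

With equal pseudotime progression, the decision-mode reward is larger for early target states, while the contribution-mode reward is larger for late ones.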

Continuous Reward Function: For gene expression-guided learning, rewards are computed using projected continuous values:

$$\bar{r}(g) = \frac{1}{|K_{\mathrm{reward}}|} \sum_{k \in K_{\mathrm{reward}}} D_{g,k} \tag{38}$$

where $K_{\mathrm{reward}}$ represents reward gene sets and $D_{g,k}$ is the projected expression of gene $k$ at grid $g$. The continuous reward function is:

$$R_{c}(s, a) = \begin{cases} -1 & \text{if transition leads to masked grid} \\ w_{\mathrm{reward}} \cdot \bar{r}(s') - w_{\mathrm{punish}} \cdot \bar{p}(s') & \text{if } s' \text{ is valid} \\ 0 & \text{otherwise} \end{cases} \tag{39}$$

where the weighting factors are defined as:

$$w_{\mathrm{reward}} = \begin{cases} \exp(-\beta \cdot \hat{t}_{s'}) & \text{Decision mode} \\ 1 - \exp(-\beta \cdot \hat{t}_{s'}) & \text{Contribution mode} \end{cases} \tag{40}$$

and $\bar{p}(s')$ represents the mean expression of punishment genes.

Episode Termination: Episodes terminate under the following conditions:

$$\mathrm{terminate} = \begin{cases} \mathrm{True} & \text{if } s' \in G_{\mathrm{boundary}} \setminus G_{\mathrm{start}} \\ \mathrm{True} & \text{if } s' \in \text{trajectory history} \\ \mathrm{True} & \text{if } R(s, a) = -1 \\ \mathrm{False} & \text{otherwise} \end{cases} \tag{41}$$

with trajectory truncation after $T_{\max}$ steps.

Experience Replay: Training experiences are stored in a circular buffer, $\mathcal{B}$, with capacity $N_{\mathrm{buffer}}$:

$$\mathcal{B} = \{ (s_{t}, a_{t}, r_{t}, s_{t+1}, d_{t}) \}_{t=1}^{|\mathcal{B}|} \tag{42}$$

where $d_{t} \in \{0, 1\}$ indicates episode termination. During training, mini-batches of size $B$ are uniformly sampled from $\mathcal{B}$ for gradient updates.

2.6. Reinforcement Learning Model Architecture and Training

To learn optimal cellular fate decision policies, we implemented three distinct reinforcement learning algorithms that leverage different approximation strategies and learning paradigms for trajectory optimization in the grid-based cellular environment.

Tabular Q-Learning: For discrete state spaces, we employ tabular Q-learning with $\epsilon$-greedy exploration. The Q-table is initialized as:

$$Q \in \mathbb{R}^{|G_{\mathrm{mapped}}| \times 8} \tag{43}$$

The exploration probability decays exponentially with training steps:

$$\epsilon(t) = \epsilon_{\min} + (\epsilon_{\max} - \epsilon_{\min}) \exp\left( - \frac{0.01 \cdot t}{1000} \right) \tag{44}$$

where $\epsilon_{\max} = 0.9$, $\epsilon_{\min} = 0.01$ and $t$ represents the current step. Action selection follows:

$$a_{t} = \begin{cases} \arg\max_{a} Q(s_{t}, a) & \text{with probability } 1 - \epsilon(t) \\ \text{uniform random} & \text{with probability } \epsilon(t) \end{cases} \tag{45}$$

Q-value updates use temporal difference learning:

$$Q(s_{t}, a_{t}) \leftarrow Q(s_{t}, a_{t}) + \alpha \left[ r_{t} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_{t}, a_{t}) \right] \tag{46}$$

where $\alpha$ is the learning rate and $\gamma$ is the discount factor.
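Equations (44) to (46) translate into a few lines of code. In this sketch the values of `alpha` and `gamma` are placeholders (the text does not report them), and the Q-table is a plain list of lists indexed by state and action.

```python
import math
import random

EPS_MAX, EPS_MIN = 0.9, 0.01  # values stated in the text


def epsilon(t):
    """Exploration probability at step t (Equation (44))."""
    return EPS_MIN + (EPS_MAX - EPS_MIN) * math.exp(-0.01 * t / 1000.0)


def select_action(q_row, t, rng):
    """Epsilon-greedy selection over one row of the Q-table (Equation (45))."""
    if rng.random() < epsilon(t):
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=q_row.__getitem__)


def q_update(q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Temporal-difference update of Q(s, a) (Equation (46))."""
    target = r + gamma * max(q[s_next])
    q[s][a] += alpha * (target - q[s][a])


# Two states, eight actions each, as in Equation (43).
q = [[0.0] * 8 for _ in range(2)]
q[1][3] = 1.0
q_update(q, s=0, a=2, r=0.5, s_next=1)
print(round(q[0][2], 4))  # 0.14
action = select_action(q[1], t=10**7, rng=random.Random(0))
```

Because the decay constant in Equation (44) is 0.01/1000 per step, exploration falls off slowly: after 100,000 steps the exploration probability is still roughly 0.34.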

Actor–Critic Architecture: For continuous state representations, we implement an actor–critic framework with separate policy and value networks. The policy network outputs action probabilities:

$$\pi_{\theta}(a \mid s) = \operatorname{Softmax}\left( 1 + 5 \cdot \sigma\left( W_{\pi}^{(2)} \operatorname{ReLU}\left( W_{\pi}^{(1)} s + b_{\pi}^{(1)} \right) + b_{\pi}^{(2)} \right) \right) \tag{47}$$

where $\sigma$ denotes the sigmoid activation and the scaling factor ensures exploration. The value network approximates state values:

$$V_{\phi}(s) = W_{v}^{(2)} \operatorname{ReLU}\left( W_{v}^{(1)} s + b_{v}^{(1)} \right) + b_{v}^{(2)} \tag{48}$$

The temporal difference error is computed as:

$$\delta_{t} = r_{t} + \gamma V_{\phi}(s_{t+1}) (1 - d_{t}) - V_{\phi}(s_{t}) \tag{49}$$

where $d_{t} \in \{0, 1\}$ indicates episode termination. Policy updates incorporate entropy regularization:

$$L_{\pi} = - \mathbb{E}\left[ \log \pi_{\theta}(a_{t} \mid s_{t}) \cdot \delta_{t} \right] + \lambda_{\mathrm{ent}} \, \mathbb{E}\left[ \sum_{a} \pi_{\theta}(a \mid s_{t}) \log \pi_{\theta}(a \mid s_{t}) \right] \tag{50}$$

where $\lambda_{\mathrm{ent}} = 0.01$ promotes exploration. The value network loss is:

$$L_{v} = \mathbb{E}\left[ \left( V_{\phi}(s_{t}) - \left( r_{t} + \gamma V_{\phi}(s_{t+1}) (1 - d_{t}) \right) \right)^{2} \right] \tag{51}$$
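The quantities in Equations (47) and (49) to (51) can be evaluated for a single transition without a deep-learning library. The sketch below takes the pre-sigmoid network output as given, uses a placeholder `gamma`, and only computes loss values; gradient updates are out of scope. All function names are illustrative.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def policy_from_logits(z):
    """Equation (47): softmax over 1 + 5 * sigmoid(z), i.e. logits confined to (1, 6)."""
    scaled = [1.0 + 5.0 * sigmoid(v) for v in z]
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def td_error(r, v_s, v_next, done, gamma=0.9):
    """Equation (49); `done` is 1.0 at episode termination, else 0.0."""
    return r + gamma * v_next * (1.0 - done) - v_s


def actor_critic_losses(probs, action, delta, lambda_ent=0.01):
    """Single-sample versions of Equations (50) and (51)."""
    neg_entropy = sum(p * math.log(p) for p in probs)
    policy_loss = -math.log(probs[action]) * delta + lambda_ent * neg_entropy
    value_loss = delta ** 2
    return policy_loss, value_loss


probs = policy_from_logits([-50.0, 0.0, 50.0, 1.0, -1.0, 2.0, -2.0, 0.5])
delta = td_error(r=1.0, v_s=0.5, v_next=2.0, done=0.0)
policy_loss, value_loss = actor_critic_losses(probs, action=2, delta=delta)
print(round(delta, 2))                      # 2.3
print(max(probs) / min(probs) <= math.exp(5) + 1e-6)  # True
```

Because the scaled logits lie between 1 and 6, no action can be more than $e^{5}$ times as likely as another, which is one way to read the statement that the scaling factor ensures exploration.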

DDQN (Double Deep Q-Network): To address overestimation bias in deep Q-learning, we employ a double DQN architecture with target network stabilization. The policy network estimates action values:

$$Q_{\theta}(s, a) = W_{q}^{(2)} \operatorname{ReLU}\left( W_{q}^{(1)} s + b_{q}^{(1)} \right) + b_{q}^{(2)} \tag{52}$$

Action selection uses decaying $\epsilon$-greedy with:

$$\epsilon(t) = \epsilon_{\mathrm{end}} + (\epsilon_{\mathrm{start}} - \epsilon_{\mathrm{end}}) \exp\left( - \frac{0.1 \cdot t}{\tau_{\mathrm{decay}}} \right) \tag{53}$$

The target value computation follows the double Q-learning update:

$$a^{*} = \arg\max_{a} Q_{\theta}(s_{t+1}, a) \tag{54}$$

$$y_{t} = r_{t} + \gamma Q_{\theta^{-}}(s_{t+1}, a^{*}) (1 - d_{t}) \tag{55}$$

where $Q_{\theta^{-}}$ denotes the target network. The loss function employs smooth L1 loss:

$$L_{\mathrm{DQN}} = \mathbb{E}\left[ \operatorname{SmoothL1}\left( Q_{\theta}(s_{t}, a_{t}) - y_{t} \right) \right] \tag{56}$$

Target network updates use either soft updates:

$$\theta^{-} \leftarrow (1 - \tau) \theta^{-} + \tau \theta \tag{57}$$

or hard updates every $C$ steps: $\theta^{-} \leftarrow \theta$.
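The double Q-learning target and the soft target-network update of Equations (54) to (57) can be sketched on plain lists. The values of `gamma` and `tau` are placeholders, and the smooth L1 threshold of 1 is an assumption (the common default), since the text does not specify it.

```python
def double_q_target(r, done, q_online_next, q_target_next, gamma=0.9):
    """Equations (54)-(55): the online network picks a*, the target network scores it."""
    a_star = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return r + gamma * q_target_next[a_star] * (1.0 - done)


def smooth_l1(x, threshold=1.0):
    """Smooth L1 (Huber-style) loss: quadratic near zero, linear in the tails."""
    ax = abs(x)
    if ax < threshold:
        return 0.5 * x * x / threshold
    return ax - 0.5 * threshold


def soft_update(target_params, online_params, tau=0.005):
    """Equation (57), applied element-wise to flat parameter lists."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]


y = double_q_target(r=1.0, done=0.0,
                    q_online_next=[0.1, 0.9, 0.3],
                    q_target_next=[5.0, 2.0, 7.0],
                    gamma=0.5)
print(y)               # 2.0
print(smooth_l1(0.5))  # 0.125
print(smooth_l1(2.0))  # 1.5
```

In the example the target network alone would prefer the third action (value 7.0), but the target uses the action chosen by the online network, which is the decoupling that limits overestimation.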

Training Dynamics: All algorithms track performance through exponential moving averages. For value-based methods, we monitor the maximum Q-value:

$$\bar{Q}_{\max}(t) = 0.005 \cdot \max_{a} Q(s_{t}, a) + 0.995 \cdot \bar{Q}_{\max}(t - 1) \tag{58}$$

For actor–critic, we track the state value:

$$\bar{V}(t) = 0.005 \cdot V_{\phi}(s_{t}) + 0.995 \cdot \bar{V}(t - 1) \tag{59}$$

Episode returns are accumulated and averaged over sliding windows of 100 episodes to assess learning progress and convergence.
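Both monitoring quantities are simple to reproduce. The sketch below assumes the moving average starts at zero, which the text does not state, and uses a bounded deque for the 100-episode window; the names are illustrative.

```python
from collections import deque


def ema_update(previous, observation, rate=0.005):
    """One step of the exponential moving average in Equations (58) and (59)."""
    return rate * observation + (1.0 - rate) * previous


class ReturnWindow:
    """Mean episode return over the most recent `size` episodes."""

    def __init__(self, size=100):
        self.returns = deque(maxlen=size)

    def add(self, episode_return):
        self.returns.append(episode_return)
        return sum(self.returns) / len(self.returns)


tracked = 0.0  # assumed initial value of the tracker
for value in [1.0, 1.0, 1.0]:
    tracked = ema_update(tracked, value)

window = ReturnWindow(size=100)
means = [window.add(float(r)) for r in range(150)]
print(means[-1])  # 99.5
```

With a rate of 0.005 the average has an effective memory of about 200 steps, so the tracked value responds slowly to individual transitions.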

Experience Replay: For off-policy DDQN training, experiences are stored in a circular buffer with capacity $N_{\mathrm{buffer}}$ and sampled uniformly for mini-batch updates once the buffer contains at least $0.1 \cdot N_{\mathrm{buffer}}$ transitions.
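A circular replay buffer with the 10% warm-up rule can be sketched with a bounded deque. The class and method names are hypothetical, and sampling without replacement within a mini-batch is an assumption.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity buffer that overwrites its oldest transitions."""

    def __init__(self, capacity, warmup_fraction=0.1):
        self.storage = deque(maxlen=capacity)
        # Sampling starts once the buffer holds 0.1 * capacity transitions.
        self.min_size = int(warmup_fraction * capacity)

    def push(self, s, a, r, s_next, done):
        self.storage.append((s, a, r, s_next, done))

    def ready(self):
        return len(self.storage) >= self.min_size

    def sample(self, batch_size, rng=random):
        # Uniform sampling of distinct transitions.
        return rng.sample(list(self.storage), batch_size)


buffer = ReplayBuffer(capacity=100)
for t in range(5):
    buffer.push(t, 0, 0.0, t + 1, 0)
print(buffer.ready())  # False: only 5 of the required 10 transitions

for t in range(5, 250):
    buffer.push(t, 0, 0.0, t + 1, 0)
batch = buffer.sample(4, rng=random.Random(0))
print(len(buffer.storage), buffer.storage[0][0])  # 100 150
```

After 250 pushes into a buffer of capacity 100, only transitions 150 to 249 remain, which is the circular overwrite behaviour described in the text.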

2.7. scRL Parameter Configuration and Data Structure Organization

The scRL framework employs a hierarchical parameter configuration system and structured data organization to ensure computational efficiency and biological interpretability across diverse single-cell analysis scenarios.

Core Data Container Architecture: scRL organizes computational results through a structured container class with five primary data categories:

$$D_{\mathrm{scRL}} = \{E, G, Q, S, T\} \tag{60}$$

where $E$ represents embedding data, $G$ contains grid information, $Q$ stores reinforcement learning components, $S$ holds simulation data and $T$ maintains trajectory records.

Grid Generation Parameters: The fundamental grid configuration employs two primary parameters. The grid resolution parameter $n$ determines spatial discretization:

$$|G_{\mathrm{total}}| = n^{2} \tag{61}$$

The observer parameter $j$ controls boundary detection sensitivity:

$$S_{\mathrm{observer}} = j \times |B_{\mathrm{spine}}| \tag{62}$$

where $B_{\mathrm{spine}}$ represents the boundary spine points. Default values are configured as $n_{\mathrm{default}} = 50$ and $j_{\mathrm{default}} = 3$.

Parallel Processing Configuration: Computational parallelization is controlled by the parameter:

n_{jobs} \in \{1, 2, \dots, N_{CPU}\}

(63)

where

N_{CPU}

represents available processor cores. The default setting is

n_{jobs} = 8

.

Cluster Projection Parameters: Cell type annotation projection employs cluster assignment vectors:

C = \{c_{1}, c_{2}, \dots, c_{m}\}, \quad c_{i} \in \mathbb{N}

(64)

Color mapping utilizes predefined palettes or automatic generation:

P_{color} = \{p_{1}, p_{2}, \dots, p_{| C |}\}, \quad p_{i} \in [0, 1]^{3}

(65)

Pseudotime Alignment Parameters: Temporal ordering configuration includes sampling parameters:

n_{sample} \in \{1, 2, \dots, | C_{early} |\}

(66)

Boundary restriction flag:

b_{boundary} \in \{0, 1\}

(67)

Key identifier for storage:

k_{add} \in \{\text{strings}\}

(68)

where the default configuration sets

n_{sample} = 10

,

b_{boundary} = 1

and

k_{add} = \text{‘pseudotime’}

.

Projection Parameters: Bidirectional projection between cells and grids employs neighborhood parameters:

k_{neighbors} \in \{1, 2, \dots, | G_{mapped} |\}

(69)

Weighting factors for optional annotation scaling:

w_{annotation} \in \mathbb{R}^{+}

(70)

Normalization flags:

f_{negative} \in \{0, 1\}

(71)

Default settings utilize

k_{neighbors} = 15

.
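The default values listed in this section can be collected in a single configuration object. The sketch below is only an illustration: the class name `ScRLConfig`, the field names and the clamping of the job count to the available cores are assumptions, not the package's actual interface.

```python
import os
from dataclasses import dataclass

@dataclass
class ScRLConfig:
    """Hypothetical container for the defaults stated in Section 2.7."""
    n: int = 50                  # grid resolution; total grids = n ** 2
    j: int = 3                   # observer parameter for boundary detection
    n_jobs: int = 8              # parallel workers
    n_sample: int = 10           # sampling parameter for pseudotime alignment
    boundary: bool = True        # boundary restriction flag (b_boundary = 1)
    add_key: str = "pseudotime"  # storage key
    k_neighbors: int = 15        # neighbors for cell-grid projection

    def __post_init__(self):
        if self.n < 1 or self.j < 1 or self.k_neighbors < 1:
            raise ValueError("n, j and k_neighbors must be positive integers")
        # Assumed behaviour: keep n_jobs within {1, ..., N_CPU}.
        n_cpu = os.cpu_count() or 1
        self.n_jobs = max(1, min(self.n_jobs, n_cpu))

    @property
    def total_grids(self):
        """Number of grid points implied by the resolution."""
        return self.n ** 2
```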

Data Structure Hierarchy: The framework maintains structured data organization:

E = \{X_{embedding}, C_{clusters}, P_{colors}, D_{data}\}

(72)

G = \{n, G_{grids}, M_{masked}, M_{mapped}, A_{adjacency}\}

(73)

Q = \{R_{rewards}, M_{matrix}, k_{reward}\}

(74)

2.8. Embedding Techniques and Evaluation Framework

To transform high-dimensional single-cell data into interpretable representations and assess the quality of our grid-based trajectory inference, we employ multiple dimensionality reduction techniques and comprehensive evaluation metrics that capture both biological relevance and computational performance.

Principal Component Analysis: Given an n × m gene expression matrix, X, with n cells and m genes, PCA identifies orthogonal axes that maximize variance:

Z = X W

(75)

where

W \in R^{m \times k}

contains the principal component weights and

Z \in R^{n \times k}

represents cell embeddings in the reduced space. The columns of W are eigenvectors of the matrix

X^{T} X

(proportional to the sample covariance matrix when X is column-centered) corresponding to the k largest eigenvalues

λ_{1} \geq λ_{2} \geq \dots \geq λ_{k}

. The proportion of variance explained by the i-th component is:

{var}_{i} = \frac{λ_{i}}{\sum_{j = 1}^{m} λ_{j}}

(76)
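The projection and explained-variance formulas can be checked with a short NumPy sketch. It assumes the expression matrix is column-centered before the eigendecomposition; the function name `pca` is illustrative.

```python
import numpy as np

def pca(X, k):
    """PCA by eigendecomposition of X^T X after centering each gene (column)."""
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc)   # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]              # sort eigenvalues descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    W = eigvecs[:, :k]                             # m x k component weights
    Z = Xc @ W                                     # n x k cell embeddings
    explained = eigvals[:k] / eigvals.sum()        # variance ratio per component
    return Z, W, explained

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))        # toy matrix: 20 cells, 5 genes
Z, W, explained = pca(X, k=2)
```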

Nonlinear Embedding Methods: t-SNE constructs probability distributions over cell pairs in high-dimensional and low-dimensional spaces:

p_{i j} = \frac{p_{j | i} + p_{i | j}}{2 n}, \quad p_{j | i} = \frac{\exp (- \| z_{i} - z_{j} \|^{2} / 2 σ_{i}^{2})}{\sum_{k \neq i} \exp (- \| z_{i} - z_{k} \|^{2} / 2 σ_{i}^{2})}

(77)

q_{i j} = \frac{(1 + \| y_{i} - y_{j} \|^{2})^{- 1}}{\sum_{k \neq l} (1 + \| y_{k} - y_{l} \|^{2})^{- 1}}

(78)

The embedding minimizes the Kullback–Leibler divergence:

C (Y) = \sum_{i \neq j} p_{i j} \log \frac{p_{i j}}{q_{i j}}

.

UMAP constructs a fuzzy topological representation through weighted graphs with connection weights:

w_{i j} = \exp \left( - \frac{d (x_{i}, x_{j}) - ρ_{i}}{σ_{i}} \right)

(79)

v_{i j} = (1 + a \cdot \| y_{i} - y_{j} \|^{2 b})^{- 1}

(80)

The optimization objective combines attraction and repulsion terms:

\sum_{i, j} \left[ w_{i j} \log \frac{w_{i j}}{v_{i j}} + (1 - w_{i j}) \log \frac{1 - w_{i j}}{1 - v_{i j}} \right]

(81)

Diffusion Maps: Starting with similarity matrix W, where

W_{i j} = \exp (- \| x_{i} - x_{j} \|^{2} / ϵ)

, we compute the symmetrically normalized transition matrix:

\tilde{P} = D^{- 1 / 2} W D^{- 1 / 2}

(82)

where D is diagonal with

D_{i i} = \sum_{j} W_{i j}

. Diffusion coordinates are defined by the eigenvectors

ϕ_{k}

and corresponding eigenvalues

λ_{k}

of

\tilde{P}

:

Φ_{t} (x_{i}) = (λ_{1}^{t} ϕ_{1} (i), λ_{2}^{t} ϕ_{2} (i), \dots, λ_{k}^{t} ϕ_{k} (i))

(83)
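A compact NumPy sketch of this construction follows. It is illustrative only: the function name is hypothetical, and discarding the leading trivial eigenpair (eigenvalue 1) is a common convention that the text does not state explicitly.

```python
import numpy as np

def diffusion_map(X, eps, k, t=1):
    """Diffusion coordinates from a Gaussian kernel with symmetric normalization."""
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dist / eps)                 # similarity matrix W_ij
    d = W.sum(axis=1)                          # diagonal of D
    P_tilde = W / np.sqrt(np.outer(d, d))      # D^{-1/2} W D^{-1/2}
    eigvals, eigvecs = np.linalg.eigh(P_tilde)
    order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Assumption: skip the trivial first eigenpair, then scale by lambda^t.
    coords = eigvecs[:, 1:k + 1] * eigvals[1:k + 1] ** t
    return coords, eigvals

rng = np.random.default_rng(0)
X = rng.normal(size=(15, 3))
coords, eigvals = diffusion_map(X, eps=2.0, k=2)
```

Larger values of t damp components with small eigenvalues, so the embedding emphasizes slow, large-scale structure.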

scVI (single-cell Variational Inference): scVI employs variational autoencoders to model single-cell gene expression through a hierarchical generative process. The latent representation

z_{n} \sim N (0, I)

captures cellular states, while library size factors

l_{n} \sim LogNormal (μ_{l}, σ_{l}^{2})

account for technical variation. Gene expression rates are modeled through neural networks:

ρ_{n g} = f_{h}^{(1)} {(z_{n}, s_{n})}_{g} and α_{n g} = f_{h}^{(2)} {(z_{n}, s_{n})}_{g}

(84)

The observation model follows a zero-inflated negative binomial distribution:

x_{n g} \sim ZINB (μ_{n g}, θ_{g}, π_{n g})

(85)

where

μ_{n g} = l_{n} ρ_{n g}

and zero-inflation probability

π_{n g} = f_{h}^{(3)} {(z_{n}, s_{n})}_{g}

. The variational approximation

q_{ϕ} (z_{n} | x_{n}, s_{n}) = N (μ_{n}^{ϕ}, σ_{n}^{ϕ})

optimizes the evidence lower bound:

L (ϕ, h) = E_{q_{ϕ}} [log p_{h} (x_{n} | z_{n}, s_{n})] - KL [q_{ϕ} (z_{n} | x_{n}, s_{n}) ∥ p (z_{n})]

(86)

Linear-scVI Extensions: Linear-scVI enhances interpretability by imposing linear structure on the latent space:

z_{n} = A c_{n} + ϵ_{n}

(87)

where

c_{n} \sim N (0, I_{k})

represents interpretable linear factors and

A \in R^{d \times k}

provides the transformation matrix. Cellular trajectories are parameterized through time-dependent dynamics:

c_{n} (t) = B t_{n} + ν_{n}

(88)

where B captures trajectory directions. The decoder employs linear combinations:

log ρ_{n g} = W_{g}^{T} z_{n} + b_{g} + log l_{n}

(89)

The modified ELBO incorporates linear constraints:

L_{linear} = E_{q_{ϕ}} [log p (x_{n} | z_{n})] - KL [q_{ϕ} (c_{n} | x_{n}) ∥ p (c_{n})] - KL [q_{ϕ} (z_{n} | c_{n}, x_{n}) ∥ p (z_{n} | c_{n})]

(90)

Latent Dirichlet Allocation: LDA identifies latent gene expression programs through topic modeling. The generative process defines gene expression programs

β_{k} \sim Dirichlet (η)

and cell-specific topic proportions

θ_{n} \sim Dirichlet (α)

. Gene assignments follow

z_{n g} \sim Categorical (θ_{n})

and expression levels

w_{n g} \sim Categorical (β_{z_{n g}})

. Variational inference employs coordinate ascent updates, where Ψ denotes the digamma function:

ϕ_{n g k} \propto \exp \{Ψ (γ_{n k}) + Ψ (λ_{k w_{n g}}) - Ψ (\sum_{v} λ_{k v})\}

(91)

γ_{n k} = α_{k} + \sum_{g = 1}^{G_{n}} ϕ_{n g k}

(92)

λ_{k v} = η_{v} + \sum_{n = 1}^{N} \sum_{g : w_{n g} = v} ϕ_{n g k}

(93)

External Evaluation Metrics: ARI corrects for random clustering effects:

ARI = \frac{\sum_{i j} \binom{n_{i j}}{2} - \left[ \sum_{i} \binom{a_{i}}{2} \sum_{j} \binom{b_{j}}{2} \right] / \binom{n}{2}}{\frac{1}{2} \left[ \sum_{i} \binom{a_{i}}{2} + \sum_{j} \binom{b_{j}}{2} \right] - \left[ \sum_{i} \binom{a_{i}}{2} \sum_{j} \binom{b_{j}}{2} \right] / \binom{n}{2}}

(94)

where

n_{i j}

represents the number of cells belonging to both true class i and predicted class j, and a_i and b_j are the corresponding row and column sums of the contingency table. NMI quantifies information-theoretic consistency:

NMI (U, V) = \frac{2 \times I (U, V)}{H (U) + H (V)}

(95)
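Both external metrics can be computed directly from the contingency table, as in the following NumPy sketch. The function names are illustrative; established libraries provide equivalent routines.

```python
from math import comb
import numpy as np

def _contingency(labels_true, labels_pred):
    """Count matrix n_ij of cells in true class i and predicted class j."""
    _, ti = np.unique(np.asarray(labels_true), return_inverse=True)
    _, pi = np.unique(np.asarray(labels_pred), return_inverse=True)
    table = np.zeros((ti.max() + 1, pi.max() + 1), dtype=np.int64)
    np.add.at(table, (ti, pi), 1)
    return table

def adjusted_rand_index(labels_true, labels_pred):
    table = _contingency(labels_true, labels_pred)
    n = int(table.sum())
    sum_ij = sum(comb(int(v), 2) for v in table.ravel())
    sum_a = sum(comb(int(v), 2) for v in table.sum(axis=1))   # row sums a_i
    sum_b = sum(comb(int(v), 2) for v in table.sum(axis=0))   # column sums b_j
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:          # degenerate case: treat as perfect agreement
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

def normalized_mutual_info(labels_true, labels_pred):
    p = _contingency(labels_true, labels_pred).astype(float)
    p /= p.sum()
    pu, pv = p.sum(axis=1), p.sum(axis=0)
    nz = p > 0                         # skip empty cells to avoid log(0)
    mi = np.sum(p[nz] * np.log(p[nz] / np.outer(pu, pv)[nz]))
    hu, hv = -np.sum(pu * np.log(pu)), -np.sum(pv * np.log(pv))
    if hu + hv == 0:                   # both partitions have a single cluster
        return 1.0
    return float(2.0 * mi / (hu + hv))
```

Both scores are invariant to relabeling of clusters, so swapping the label names of a perfect partition still returns 1.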

Internal Evaluation Metrics: ASW measures within-cluster cohesion versus between-cluster separation:

s (i) = \frac{b (i) - a (i)}{\max \{a (i), b (i)\}}, \quad ASW = \frac{1}{n} \sum_{i = 1}^{n} s (i)

(96)

CH evaluates the ratio of between-cluster to within-cluster dispersion:

CH = \frac{B (k) / (k - 1)}{W (k) / (n - k)}

(97)

DB assesses intra-cluster scatter relative to inter-cluster distances:

DB = \frac{1}{k} \sum_{i = 1}^{k} \max_{j \neq i} \left( \frac{S_{i} + S_{j}}{d_{i j}} \right)

(98)
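The CH and DB indices can be sketched in NumPy as follows. The sketch assumes Euclidean distances, squared distances to centroids for the dispersions B(k) and W(k), and the mean distance to the centroid for the scatter S_i; these are the usual definitions but are not spelled out in the text.

```python
import numpy as np

def _centroids(X, labels):
    ids = np.unique(labels)
    return ids, np.array([X[labels == c].mean(axis=0) for c in ids])

def calinski_harabasz(X, labels):
    """CH = [B(k) / (k - 1)] / [W(k) / (n - k)]."""
    n = X.shape[0]
    ids, cents = _centroids(X, labels)
    k = len(ids)
    overall = X.mean(axis=0)
    B = sum((labels == c).sum() * np.sum((m - overall) ** 2)
            for c, m in zip(ids, cents))                     # between-cluster
    W = sum(np.sum((X[labels == c] - m) ** 2)
            for c, m in zip(ids, cents))                     # within-cluster
    return float((B / (k - 1)) / (W / (n - k)))

def davies_bouldin(X, labels):
    """DB = mean over clusters of the worst (S_i + S_j) / d_ij ratio."""
    ids, cents = _centroids(X, labels)
    S = np.array([np.linalg.norm(X[labels == c] - m, axis=1).mean()
                  for c, m in zip(ids, cents)])
    k = len(ids)
    worst = [max((S[i] + S[j]) / np.linalg.norm(cents[i] - cents[j])
                 for j in range(k) if j != i)
             for i in range(k)]
    return float(np.mean(worst))

# Toy example: two tight, well-separated one-dimensional clusters.
X = np.array([[0.0], [2.0], [10.0], [12.0]])
labels = np.array([0, 0, 1, 1])
ch, db = calinski_harabasz(X, labels), davies_bouldin(X, labels)
```

Higher CH and lower DB both indicate compact, well-separated clusters, which is why DB is inverted during normalization.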

Comprehensive Scoring: To enable fair comparison across metrics with different scales and optimization directions, we apply min–max normalization:

X_{scaled} = \frac{X - X_{min}}{X_{max} - X_{min}}

(99)

For metrics requiring minimization (DB), we use

X_{scaled} = \frac{X_{max} - X}{X_{max} - X_{min}}

.
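A minimal sketch of this normalization, with the direction flipped for metrics where lower is better, is given below. Returning zeros when all values are equal is an assumption added to avoid division by zero.

```python
import numpy as np

def minmax_scale(values, higher_is_better=True):
    """Scale scores to [0, 1]; invert when lower raw values are better (e.g., DB)."""
    x = np.asarray(values, dtype=float)
    lo, hi = x.min(), x.max()
    if hi == lo:                      # assumed handling of a constant metric
        return np.zeros_like(x)
    if higher_is_better:
        return (x - lo) / (hi - lo)
    return (hi - x) / (hi - lo)

asw_scaled = minmax_scale([0.1, 0.3, 0.5])                        # higher is better
db_scaled = minmax_scale([1.0, 2.0, 3.0], higher_is_better=False)  # lower is better
```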

We define two key metrics to quantitatively assess the alignment of cell lineage intensity with learned dimensionality reduction embeddings.

MAS (Manifold Alignment Score): Given a lineage intensity vector, L, and the two-dimensional UMAP manifold represented by its orthogonal dimensions

U_{1}

and

U_{2}

, the MAS is defined as the average of the absolute Pearson correlation coefficients between the lineage intensity vector and each UMAP dimension:

MAS = \frac{1}{2} (| ρ (L, U_{1}) | + | ρ (L, U_{2}) |)

(100)

where

ρ (X, Y)

denotes the Pearson correlation coefficient between variables X and Y.

PAAC (Principal Axis Alignment Coefficient): Given a lineage intensity vector, L, and the first two principal components obtained from PCA, denoted as

P_{1}

and

P_{2}

, the PAAC is defined as the average of the absolute Pearson correlation coefficients between the lineage intensity vector and these two principal components:

PAAC = \frac{1}{2} (| ρ (L, P_{1}) | + | ρ (L, P_{2}) |)

(101)

Similarly,

ρ (X, Y)

denotes the Pearson correlation coefficient.
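Because MAS and PAAC share the same form, a single helper can compute both: passing UMAP coordinates yields MAS, and passing the first two principal components yields PAAC. The function name `alignment_score` and the toy data are illustrative assumptions.

```python
import numpy as np

def alignment_score(lineage, coords):
    """Mean absolute Pearson correlation between a lineage intensity vector
    and each of the two embedding dimensions in `coords` (shape n x 2)."""
    lineage = np.asarray(lineage, dtype=float)
    coords = np.asarray(coords, dtype=float)
    r1 = np.corrcoef(lineage, coords[:, 0])[0, 1]
    r2 = np.corrcoef(lineage, coords[:, 1])[0, 1]
    return 0.5 * (abs(r1) + abs(r2))

# Toy example: intensity perfectly aligned with axis 1, uncorrelated with axis 2.
coords = np.array([[0.0, 1.0], [1.0, -1.0], [2.0, -1.0], [3.0, 1.0]])
lineage = np.array([0.0, 1.0, 2.0, 3.0])
score = alignment_score(lineage, coords)
```

Taking absolute values makes the score independent of the arbitrary sign and orientation of each embedding axis.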

2.9. Comprehensive Validation Framework for scRL

To establish scRL’s robustness across diverse biological contexts, we designed a systematic validation strategy encompassing five complementary analytical dimensions, each addressing critical aspects of cellular fate decision modeling through carefully selected single-cell datasets.

Benchmark Performance Validation: We assessed scRL’s fundamental capabilities using human bone marrow CD34⁺ hematopoietic progenitor cells (5780 cells, S-SUBS8) and mouse endocrinogenesis datasets (2531 cells, GSE132188). These canonical differentiation systems enabled direct comparison with 8 existing methodologies across 5 unsupervised metrics, focusing on pseudotime ordering accuracy, lineage contribution strength quantification and fate decision intensity detection. For the hematopoietic dataset, we configured scRL with grid parameters (n = 50, j = 3), defining erythroid (Ery_1 [Erythrocyte 1], Ery_2 [Erythrocyte 2], Mega [Megakaryocyte]), myeloid (Mono_1 [Monocyte 1], Mono_2 [Monocyte 2], DCs [dendritic cells]), and lymphoid (CLP [Common Lymphoid Progenitor]) lineages as reward targets from HSC_1 (Hematopoietic Stem Cell 1) starting points. The endocrinogenesis analysis targeted alpha, beta, delta, and epsilon cell fates from Ngn3 low/high and Fev⁺ progenitors, enabling systematic performance benchmarking across distinct developmental contexts.

Temporal Precedence Analysis: Using human phenotypical Hematopoietic Stem Cell datasets (GSE117498) containing HSC (Hematopoietic Stem Cell), MPP (Multipotent Progenitor), MLP (Multipotent Lymphoid Progenitor), CMP (Common Myeloid Progenitor), MEP (Megakaryocyte–Erythroid Progenitor), GMP (Granulocyte–Monocyte Progenitor), and PreBNK (Pre-B/NK Progenitor) populations, we validated whether scRL-identified fate decision states represent early specification events preceding observable lineage contributions. This analysis specifically examined the temporal relationship between computational fate predictions and experimentally characterized progenitor hierarchies. By targeting MEP and GMP fates from HSC origins, we established that scRL captures pre-contribution cellular states that systematically anticipate downstream differentiation events, confirming the framework’s predictive temporal resolution.

Multi-scale Decision Integration: The AML (acute myeloid leukemia) dataset (94,311 cells, GSE185993) provided a pathological context for validating coherence between molecular-level and cellular-level decision processes. This large-scale analysis demonstrated that fate decision intensity and gene decision intensity characterize equivalent cellular states, confirming scRL’s capacity to integrate multi-scale biological information. The dataset’s cellular heterogeneity and malignant transformation context enabled assessment of scRL’s robustness under conditions where normal developmental hierarchies are disrupted.

Perturbation Response Validation: We employed conditional Dapp1 knockout Hematopoietic Stem Cell datasets (10,224 cells, GSE277292) to validate the functional significance of scRL-identified dynamical genes through genetic perturbation analysis. Quality control criteria included total counts (1000–70,000), mitochondrial genes (<5%), ribosomal genes (10–50%) and expressed genes (>1000), followed by Harmony integration and Leiden clustering (resolution 0.6). This perturbation study confirmed that genetic alterations systematically modify fate decision landscapes, establishing biological relevance of scRL-captured decision intensities and validating the framework’s sensitivity to functional genetic modifications.
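The quality control thresholds for the Dapp1 knockout dataset can be expressed as a boolean cell mask, as sketched below. The function and argument names are hypothetical, and treating the stated ranges as inclusive is an assumption, since the text does not say whether the bounds are strict.

```python
import numpy as np

def qc_mask(total_counts, pct_mito, pct_ribo, n_genes):
    """Return True for cells passing the stated per-cell QC criteria."""
    total_counts = np.asarray(total_counts, dtype=float)
    pct_mito = np.asarray(pct_mito, dtype=float)
    pct_ribo = np.asarray(pct_ribo, dtype=float)
    n_genes = np.asarray(n_genes, dtype=float)
    return (
        (total_counts >= 1000) & (total_counts <= 70000)   # total counts 1000-70,000
        & (pct_mito < 5.0)                                 # mitochondrial genes < 5%
        & (pct_ribo >= 10.0) & (pct_ribo <= 50.0)          # ribosomal genes 10-50%
        & (n_genes > 1000)                                 # expressed genes > 1000
    )

# Five toy cells; only the first satisfies every criterion.
keep = qc_mask(
    total_counts=[5000, 500, 5000, 5000, 5000],
    pct_mito=[2.0, 2.0, 6.0, 2.0, 2.0],
    pct_ribo=[20.0, 20.0, 20.0, 60.0, 20.0],
    n_genes=[2000, 2000, 2000, 2000, 800],
)
```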

Pathological Condition Analysis: Radiation-induced hematopoietic injury datasets (41,252 cells, GSE278673) enabled validation of scRL’s performance under stress conditions that alter normal differentiation dynamics. Stringent quality control (total counts 5000–70,000, mitochondrial genes <5%, ribosomal genes 10–50%, expressed genes 1500–7000) preceded Harmony integration and UMAP analysis (min_dist = 1.5). This pathological validation demonstrated scRL’s capacity to capture altered fate decision dynamics, differentiation biases and recovery processes, establishing the framework’s utility for analyzing cellular responses to environmental perturbations and pathological states.

3. Results

3.1. LDA Provides an Interpretable Latent Space for Single-Cell Lineage Analysis

We compared dimensionality reduction methods, including probabilistic generative models (scVI, Linear scVI, and LDA) and traditional methods (PCA, ICA, FA, NMF, and Diff [Diffusion Map]), to identify techniques that effectively capture lineage relationships for investigating cell fate decision mechanisms. Using hematopoietic and pancreatic single-cell datasets, we systematically evaluated these approaches across varying numbers of highly variable genes (HVGs; 1000–5000) and latent dimensions (5–25) using five discriminative metrics: ARI, NMI, ASW, CH, and DB. LDA achieved the highest overall performance scores across all evaluation metrics (Figure 2A,B), with substantial improvements over baseline methods including scVI (+0.785 across HVG sizes, +0.713 across latent dimensions) and ICA (+0.675 and +0.618, respectively) (Table 1 and Table 2). Individual latent component analysis showed that LDA exhibited strong subgroup specificity in intensity distributions within original cell type annotations (Figure 2C). Parallel analyses on the endocrinogenesis dataset confirmed these results, with LDA showing enhanced performance gains over scVI (+0.806 across HVG sizes, +0.839 across latent dimensions) and ICA (+0.789 and +0.765, respectively) (Figure A2, Table A1 and Table A2).

Figure 2. LDA exhibits superior interpretability in comparative analysis of dimensionality reduction methods. (S-SUBS8) (A,B) Comparison of interpretability metrics including ARI, NMI, ASW, CH and DB for dimensionality reduction methods (scVI, ICA, PCA, LscVI, FA, Diff, NMF and LDA) evaluated across varying numbers of highly variable genes (1000, 2000, 3000, 4000, 5000) and latent space components (5, 10, 15, 20, 25). (C) Intensity distributions of latent components obtained by LDA across different subgroups, demonstrating subgroup specificity.

3.2. Reinforcement Learning Captures Early Fate-Decision Signals

scRL constructs a grid embedding framework through edge detection-based algorithms, employing reinforcement learning to address the sequential decision-making problem inherent in cellular differentiation and quantify fate decision intensity at any developmental stage (Figure 3A). We utilized intensities from each LDA latent space component as lineage markers and projected them onto two-dimensional UMAP embeddings, revealing clear correspondence between these lineage intensities and distinct differentiation branches across both hematopoietic and pancreatic endocrinogenesis datasets (Figure 3B and Figure A3A). Critical to scRL’s innovation is its ability to identify pre-commitment cellular states: when projecting scRL-derived fate decision intensity onto developmental trajectories, peak decision regions consistently preceded observable lineage specification events, indicating that scRL captures regulatory checkpoints before overt lineage commitment becomes apparent. To quantitatively validate scRL’s superior interpretability, we compared its performance against percentile-based measures of LDA lineage intensities (95%, 90%, 85%, and 80% percentiles) using unsupervised metrics including ASW, CH, and DB (Figure 3C and Figure A3B). Quantitative analysis demonstrated scRL’s consistent superiority across both datasets, with substantial improvements of +0.986, +0.932, +0.735, and +0.375 over LDA at 80%, 85%, 90% and 95% percentiles, respectively, on the hematopoietic dataset (Table 3), and comparable improvements of +0.848, +0.549, +0.371 and +0.287 over corresponding LDA percentile thresholds on the endocrinogenesis dataset (Figure A3, Table A3).

Figure 3. scRL demonstrates superior interpretability in fate decision intensity analysis. (S-SUBS8) (A) Workflow illustration of scRL’s grid embedding construction and reinforcement learning framework for fate decision intensity inference. (B) Projection of lineage intensities from LDA latent space components onto UMAP embeddings showing correspondence with differentiation branches in hematopoietic and pancreatic datasets. (C) Comparison of scRL fate decision intensity with LDA percentile variants (95%, 90%, 85%, 80%) using ASW, CH and DB interpretability metrics on hematopoietic dataset. The overall score includes data from ARI and NMI.

3.3. scRL Outperforms Alternative Approaches in Fate-Decision Inference

We evaluated scRL’s fate decision intensity performance against various machine learning and deep learning dimensionality reduction methods including Diff, FA, ICA, LscVI, NMF, PCA and scVI across hematopoietic and endocrinogenesis datasets (Figure 4 and Figure A4A). Using scRL-derived grid embeddings and pseudotime information as the foundation, we extracted 95%, 90%, 85% and 80% percentile intensities from each method as alternative fate decision measures, conducting experiments across HVG sequences from 1000 to 5000 and evaluating performance using unsupervised metrics ARI, NMI, ASW, DB and CH. scRL consistently achieved the highest overall scores across all experimental conditions (Figure 4 and Figure A4B). On the hematopoietic dataset, scRL showed substantial improvements over all baseline methods, with pronounced advantages against ICA-based approaches (overall improvements ranging from +0.614 to +0.811) and scVI-based methods (overall improvements from +0.609 to +0.733) (Table 4). Validation on the endocrinogenesis dataset confirmed these findings, where scRL maintained superior performance with the most significant improvements observed against Diffusion Map approaches (overall improvements from +0.461 to +0.811) and ICA-based methods (overall improvements from +0.612 to +0.812) (Table A4). scRL’s performance advantages were consistently maintained across different percentile thresholds, with higher improvements generally observed at lower percentile values (80–85% percentiles) compared with higher percentiles (95% percentile), indicating that scRL’s reinforcement learning framework effectively captures early fate decision signals that precede observable lineage contribution events.

Figure 4. scRL exhibits superior interpretability in fate decision intensity compared with various dimensionality reduction methods. (S-SUBS8) Comprehensive comparison of interpretability metrics for scRL fate decision intensity against Diff, FA, ICA, LscVI, NMF, PCA and scVI at 95%, 90%, 85% and 80% percentile intensities on hematopoietic dataset.

3.4. Label-Free scRL Accurately Quantifies Lineage Contribution

When accurate lineage probabilities are unavailable or cannot be mapped to subgroup trajectory branches, scRL offers an alternative solution leveraging reinforcement learning principles with clustering labels as reward signals to evaluate lineage contribution intensity. We applied this approach to hematopoietic and endocrinogenesis datasets using KMeans clustering to obtain labels for distinct subgroups, with scRL computing lineage contribution intensity that exhibited clear subgroup specificity (Figure 5A). We evaluated scRL’s interpretability against established dimensionality reduction techniques (PCA, ICA, FA, NMF, Diff, scVI, LscVI) across varying KMeans subgroup numbers (4, 6, 8, 10, 12) and highly variable gene sequences (1000–5000) using ASW, DB and CH metrics. scRL achieved the best overall assessment scores across all experimental conditions (Figure 5B), with substantial improvements over baseline methods including pronounced advantages against scVI-based approaches (overall improvements from +0.856 to +0.924), ICA methods (+0.409 to +0.694) and PCA approaches (+0.329 to +0.515) across different cluster configurations (Table 5). Validation on the endocrinogenesis dataset confirmed these findings, where scRL maintained superior performance with clear subgroup specificity in both KMeans labels and contribution intensities, showing distinct intensity distribution patterns across lineage branches and consistently strong performance advantages over scVI (overall improvements from +0.856 to +0.924) and ICA (+0.409 to +0.694) across all cluster configurations (Figure A5A,B, Table A5).

Figure 5. scRL constructs superior lineage intensity using subgroup information compared with various methods. (S-SUBS8) (A) Distribution of 5 subgroups under KMeans clustering and corresponding subgroup lineage intensities obtained by scRL projected onto UMAP embedding, demonstrating subgroup specificity. (B) Comprehensive comparison of scRL lineage contribution intensity against PCA, ICA, FA, NMF, Diff, scVI and LscVI using ASW, DB and CH metrics. Evaluations conducted across KMeans subgroup numbers (4, 6, 8, 10, 12) and highly variable gene sequences (1000–5000) as experimental replicates.

3.5. scRL Rivals State-of-the-Art Pseudotime and Fate Inference Tools

We conducted comprehensive performance evaluations comparing scRL with established methods for pseudotime reconstruction (Wishbone [37], Palantir [38], DPT (Diffusion Pseudotime) [39], Monocle2 [40], Monocle3 [12]) and lineage probability inference (Palantir [38], FateID [41], CellRank [42]) using Manifold Alignment Score and Principal Axis Alignment Coefficient metrics. On the hematopoietic dataset, scRL demonstrated superior pseudotime reconstruction performance, achieving a Manifold Alignment Score of 0.773 (+0.091 improvement over the previous best method) and Principal Axis Alignment Coefficient of 0.848 (+0.127 improvement) (Figure 6A,B). For lineage probability assessment, scRL consistently outperformed all baseline methods with a Manifold Alignment Score of 0.323 (+0.085 improvement) and Principal Axis Alignment Coefficient of 0.373 (+0.048 improvement) (Figure 6C,D). Validation on the endocrinogenesis dataset confirmed these findings, where scRL maintained superior performance with pseudotime reconstruction achieving a Manifold Alignment Score of 0.865 (+0.061 improvement over the previous best method) and fate probability analysis demonstrating robust performance with a score of 0.674 (+0.312 improvement over the previous best method) (Figure A6A,B).

3.6. scRL Decision Intensities Reveal Pre-Contribution States in Hematopoiesis

We applied scRL to a human hematopoietic progenitor cell dataset containing diverse immune phenotypes [43], conducting pseudotime reconstruction and comprehensive evaluation of lineage contribution strength and fate decision intensity for MEP (Megakaryocyte–Erythroid Progenitor) and GMP (Granulocyte–Monocyte Progenitor) lineages (Figure 7A). Fate decision intensity and lineage contribution strength exhibited complementary temporal profiles along pseudotime trajectories. Pseudotime-weighted analysis showed that decision-weighted values consistently preceded contribution-weighted values, indicating that scRL captures fate specification events before observable lineage contribution (Figure 7B). To validate scRL’s predictive capacity for identifying pre-expression cellular states, we analyzed the key transcription factors GATA1 and CEBPA, which serve as master regulators of MEP and GMP differentiation, respectively: GATA1 is essential for erythroid and megakaryocytic development, and CEBPA governs myeloid lineage contribution. Both transcription factors exhibited peak expression at terminal differentiation stages in their respective lineages, while scRL fate decision analysis demonstrated that both displayed high decision intensity values in more primitive regions compared with their contribution values, suggesting early fate specification events preceding actual gene expression (Figure 7C). 
Detailed temporal analysis revealed that, for both GATA1 in MEP lineage and CEBPA in GMP lineage, decision state values peaked at intermediate differentiation stages systematically preceding expression peaks, while contribution state values reached maximum intensity at terminal stages and exhibited stronger correlation with pseudotime compared with original expression patterns, with pseudotime-weighted analysis consistently showing decision-weighted values occurred earlier than contribution-weighted values (Figure 7D).
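One plausible reading of the pseudotime-weighted comparison is a weighted mean of pseudotime, with decision or contribution intensities as weights. The sketch below follows that assumption; the function name and the toy intensity profiles are illustrative and are not taken from the scRL code.

```python
import numpy as np

def weighted_mean_pseudotime(pseudotime, weights):
    """Average pseudotime weighted by a non-negative per-cell intensity."""
    t = np.asarray(pseudotime, dtype=float)
    w = np.asarray(weights, dtype=float)
    return float(np.sum(w * t) / np.sum(w))

# Toy profiles: decision intensity peaks early, contribution strength peaks late.
pseudotime = np.array([0.0, 0.5, 1.0])
decision = np.array([3.0, 1.0, 0.0])
contribution = np.array([0.0, 1.0, 3.0])

t_decision = weighted_mean_pseudotime(pseudotime, decision)
t_contribution = weighted_mean_pseudotime(pseudotime, contribution)
```

Under this reading, a lower weighted value places the signal at a more primitive stage, so a decision value below the contribution value corresponds to decision preceding contribution.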

3.7. Integrated Gene- and Lineage-Level Decisions Resolve AML Branch Points

We applied scRL’s pseudotime analysis framework to an AML dataset to elucidate the relationship between gene-level and lineage-level decision-making processes during cellular differentiation [44]. Following cellular clustering and pseudotime reconstruction revealing distinct cellular populations and developmental trajectories (Figure 8A,B), we identified four distinct branches representing different lineage decision trajectories (Figure 8C). For each branch, we selected 10 marker genes based on differential expression analysis, requiring expression in at least 25% of cells within the target branch while being expressed in less than 25% of reference cells (Figure 8D), then conducted parallel analyses of gene decision values and lineage decision values across all four branches (Figure 8E,F). Correlation analysis revealed strong relationships between lineage and gene decision values within each branch, demonstrating coordinated decision-making processes at both molecular and cellular levels (Figure 8G), while temporal analysis showed that weighted average pseudotime of lineage decision values consistently preceded actual gene expression across all four branches (Figure 8H). Further characterization through binned pseudotime analysis revealed that lineage decision values across the four branches showed more distinct trajectory dynamics compared with mean Pearson correlation values with branch-specific cells (Figure 8I,J), while gene decision values along binned pseudotime provided clearer temporal dynamics compared with Pearson correlation values with branch-specific genes (Figure 8K,L).

Figure 6. Comparative performance analysis of scRL with pseudotime and fate inference methods. (S-SUBS8) (A) Pseudotime reconstruction performance comparison using Manifold Alignment Score across methods including Wishbone, Palantir, DPT, Monocle2, Monocle3 and scRL on hematopoietic dataset. (B) Principal Axis Alignment Coefficient comparison for pseudotime reconstruction methods. (C) Fate probability inference performance using Manifold Alignment Score comparing FateID, Palantir, CellRank and scRL. (D) Principal Axis Alignment Coefficient comparison for fate probability inference methods. Green bars indicate performance improvements over previous best methods.

Figure 7. scRL application to hematopoietic dataset reveals temporal precedence of decision over contribution states. (GSE117498) (A) Standard analysis workflow including pseudotime reconstruction and scRL-derived lineage contribution strength and fate decision intensity results for the two major lineages, MEP and GMP. (B) Temporal dynamics of fate decision intensity and lineage contribution strength along pseudotime, with bar plots showing pseudotime-weighted values for both intensities (lower values correspond to more primitive developmental stages). (C) Original expression patterns of key transcription factors GATA1 and CEBPA corresponding to MEP and GMP lineages, respectively, alongside scRL-derived contribution strength and decision strength when these genes serve as reward signals. (D) Temporal profiles along pseudotime showing original expression intensity, decision strength, contribution strength, and pseudotime-weighted decision and contribution intensities for GATA1 and CEBPA (lower values correspond to more primitive stages), with correlation comparison between contribution strength and original expression relative to pseudotime (higher values indicate superior differentiation trajectory matching).

Figure 8. Integrative analysis of lineage and gene decision states in acute myeloid leukemia progression. (GSE185993) (A) UMAP visualization of AML dataset colored by cellular clusters. Different colors represent distinct cell subpopulations. (B) Pseudotime trajectory inferred by scRL across the developmental landscape. (C) Four distinct branches identified within the UMAP embedding space. (D) Top 10 marker genes for each branch, selected based on log-fold changes with expression criteria of at least 25% cells in target branch and less than 25% in reference cells. (E) Gene set decision values computed for each of the four branches. (F) Lineage decision values determined for each branch using scRL framework. (G) Pearson correlation heatmap between gene set decision values (G1-4) and lineage decision values (L1-4) for each branch. (H) Comparison of average pseudotime weighted by lineage decision values versus gene expression for each branch. (I) Mean Pearson correlation with branch-specific cells along uniformly 50-binned pseudotime for the complete dataset (top) and trunk region (bottom). (J) Lineage decision values for each branch along uniformly 50-binned pseudotime for the complete dataset (top) and trunk region (bottom). (K) Mean expression of branch-specific gene sets along uniformly 50-binned pseudotime for the complete dataset (top) and trunk region (bottom). (L) Gene decision values of branch-specific gene sets along uniformly 50-binned pseudotime for the complete dataset (top) and trunk region (bottom).

3.8. scRL Pinpoints Dapp1 as an Early Regulator of HSC Fate

We developed a workflow for discovering critical dynamical genes in HSC differentiation. Starting from LSK (Lin⁻Sca1⁺ckit⁺) single-cell data embeddings, the workflow identifies differentiation starting points, applies velocity analysis to find the top dynamical genes, uses pseudotime analysis to find the earliest expressed genes, and intersects the most highly expressed and most dynamic genes before ranking them by time correlation scoring; this procedure identified Dapp1 as a critical early differentiation regulator (Figure 9A). Trajectory analysis using Monocle2 identified 280 genes at the intersection of dynamical genes and genes significant in the BEAM (Branching Expression Analysis Modeling) test, with enrichment in hematopoietic differentiation-related GOBP (Gene Ontology Biological Process) terms. Gene correlation analysis using LSK datasets revealed that Dapp1 correlated strongly with the myeloid lineage marker Dach1 and minimally with the lymphoid marker Dntt (Figure A7A). To validate scRL’s capacity to capture perturbation effects, we constructed a single-cell atlas from bone marrow LSK cells with conditional Dapp1 knockout, annotating cellular populations based on key transcription factors including Hlf and Tcf15 for primitive HSCs, Gata1 and Klf1 for erythroid bias, Id2 for dendritic cells, Irf8 for monocytes, Cebpe for neutrophils, Ebf1 for B cells, Hoxa9 for multipotency, Satb1 for lymphoid bias and Gfi1 for myeloid bias (Figure A7B). The Dapp1 knockout atlas comprised early clusters (1, 2), multipotent clusters (3, 5, 7), lymphoid clusters (4, 9), a myeloid cluster (6) and erythroid clusters (8, 10). Comparative analysis demonstrated significant disruption of lineage contribution homeostasis through altered cluster distributions and pseudotime patterns (Figure 9B,C). Pseudotime analysis and expression patterns of the maturation markers Cd34, Mpo and Ctsg indicated impeded cellular maturation following Dapp1 deletion (Figure 9D). 
Analysis along uniformly 50-binned pseudotime revealed that Dapp1 deficiency significantly perturbed lineage decision trajectories: decision values decreased for the myeloid and erythroid lineages and increased for the lymphoid lineage (Figure 9E,F). Gene decision values for Mpo (myeloid), Itga2b (erythroid) and Dntt (lymphoid) were also altered along pseudotime, with Dapp1 deletion decreasing the decision values for Mpo and Itga2b and increasing those for Dntt (Figure 9G,H). Within the earliest cluster (cluster 1), both the myeloid lineage and Mpo gene decision trajectories were significantly disrupted, with decreased decision values after knockout (Figure 9I,J), whereas within multipotent clusters, lymphoid lineage and Dntt gene decision values were enhanced following Dapp1 deletion (Figure 9K). Trajectory analysis comparing wild-type and knockout conditions showed cluster-specific responses to Dapp1 perturbation across myeloid, erythroid and lymphoid differentiation: lymphoid lineage decisions remained largely unchanged in early clusters but were enhanced in multipotent clusters, while erythroid and myeloid decision values were disrupted to varying degrees across developmental stages (Figure A7C and Figure A8A–E).
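
The binned comparison described above can be illustrated with a minimal sketch. It assumes that "uniformly 50-binned" means 50 equal-width bins over the observed pseudotime range and that decision values are averaged within each bin; the function name `bin_mean_by_pseudotime` and the synthetic data are hypothetical and are not taken from the scRL code base.

```python
import numpy as np

def bin_mean_by_pseudotime(pseudotime, values, n_bins=50):
    """Average `values` within equal-width pseudotime bins.

    Returns (bin_centers, bin_means); bins without cells yield NaN.
    """
    pseudotime = np.asarray(pseudotime, dtype=float)
    values = np.asarray(values, dtype=float)
    # Equal-width bin edges spanning the observed pseudotime range.
    edges = np.linspace(pseudotime.min(), pseudotime.max(), n_bins + 1)
    # Using interior edges only gives indices 0..n_bins-1, so the
    # maximum pseudotime falls into the last bin.
    idx = np.digitize(pseudotime, edges[1:-1])
    sums = np.bincount(idx, weights=values, minlength=n_bins)
    counts = np.bincount(idx, minlength=n_bins)
    means = np.full(n_bins, np.nan)
    occupied = counts > 0
    means[occupied] = sums[occupied] / counts[occupied]
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, means

# Synthetic example (not data from the study): a "knockout" curve
# shifted down by 0.2 relative to a "control" curve.
rng = np.random.default_rng(0)
pt = rng.uniform(0.0, 1.0, size=2000)
control = 1.0 - pt + rng.normal(0.0, 0.05, size=pt.size)
knockout = control - 0.2
centers, control_curve = bin_mean_by_pseudotime(pt, control)
_, knockout_curve = bin_mean_by_pseudotime(pt, knockout)
print(np.nanmean(control_curve - knockout_curve))
```

Binning on shared edges makes the two conditions comparable bin by bin; if equal-count bins were intended instead, the edges would be taken from pseudotime quantiles.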

Figure 9. Lineage contribution of HSCs at the earliest primitive stage perturbed by knockout of critical dynamical gene Dapp1. (GSE277292) (A) Systematic workflow for identifying critical dynamical gene Dapp1 in HSC differentiation, beginning with LSK single-cell data embeddings to identify differentiation starting points, employing velocity analysis for top dynamical genes, using pseudotime for earliest expressing genes, intersecting highest expressed and most dynamic genes, and ranking through time correlation scoring. (B) Dapp1 knockout LSK atlas with annotated clusters: 1, 2 as early clusters, 3, 5, 7 as multipotent clusters, 4, 9 as lymphoid clusters, 6 as myeloid cluster and 8, 10 as erythroid clusters. (C) Cluster distributions and pseudotime comparison between control and knockout groups. (D) Box plots comparing pseudotime and expression of maturation markers Cd34, Mpo and Ctsg between conditions. (E) Lineage decision values for myeloid, erythroid and lymphoid lineages along uniformly 50-binned pseudotime. (F) Box plots comparing lineage decision values across myeloid, erythroid and lymphoid lineages between conditions. (G) Gene decision values for Mpo, Itga2b and Dntt along uniformly 50-binned pseudotime. (H) Box plots comparing gene decision values for Mpo, Itga2b and Dntt between conditions. (I) Myeloid lineage and Mpo decision values within early cluster 1 along binned pseudotime. (J) Box plots comparing myeloid lineage and Mpo decision values within early cluster 1. (K) Box plots comparing lymphoid lineage and Dntt decision values within multipotent cluster 5.

3.9. scRL Charts Dynamic Fate Decisions During Post-Irradiation HSC Recovery

We applied scRL to analyze HSC fate decision dynamics under pathological conditions. LSK cells were collected from mice at multiple time points after total body ionizing radiation (days 2, 5, 8, 11, 14, 21 and 30) and profiled by single-cell RNA sequencing to construct a temporal atlas. Cell populations were annotated using characteristic marker expression: Kit and Ly6a for LSK identification, Cd34 and Flt3 for multipotent progenitor components, and lineage-specific markers such as Il7r for lymphoid bias, Cd19 for B cells, Cd3e for T cells, and Itgam and Fcer1g for myeloid bias (Figure A9A). Eleven clusters were identified and annotated as HSC, MPP2/3 (Multipotent Progenitors 2/3), MPP4 (Multipotent Progenitors 4), MK (Megakaryocyte), Ery (Erythrocyte), Ma (Mast cell), Neu (neutrophil), Mo (monocyte), DC (dendritic cell), B cell and T cell, and cellular composition changed dynamically across the recovery timeline (Figure 10A,B). HSPC (Hematopoietic Stem and Progenitor Cell) components, particularly HSCs, decreased dramatically after radiation, reached their nadir around day 5 and then recovered gradually (Figure 10D). An erythroid contribution bias emerged after radiation, peaking at day 8 and persisting until day 21 (Figure 10E), and myeloid cell populations showed characteristic temporal changes throughout recovery (Figure 10F). Using pseudotime as a metric of cellular primitivity, we observed that HSCs gradually regained their primitive state during recovery, with coordinated expression of primitive factors (Hlf, Meis1, Satb1, Hoxa9), while myeloid factors (Spi1, Gfi1, Cebpa, Cebpe) and erythroid factors (Klf1, Gata1) exhibited lineage-specific temporal dynamics (Figure A9B and Figure 10C). 
scRL lineage decision values for the erythroid and myeloid lineages, computed across all time points and projected onto UMAP space, showed characteristic spatial and temporal patterns (Figure 10G,H), and both contribution and decision values followed distinct temporal patterns (Figure A9C–E). Within HSPC subpopulations, analysis along uniformly 50-binned pseudotime showed that erythroid-biased stem cell fate decision potential decreased dramatically immediately after radiation but returned to near-normal levels by day 8, while myeloid-biased stem cell proportions increased following irradiation and subsequently decreased as erythroid-biased populations recovered (Figure 10I,J). Correlation analysis showed that the erythroid potential of HSPCs was most strongly associated with overall HSPC proportions, whereas myeloid potential was closely correlated with myeloid cell proportions during early recovery phases. Pearson correlation coefficients further confirmed that myeloid and erythroid decision intensities correlated positively with their corresponding lineage-specific transcription factors, while pseudotime correlated negatively with primitive transcription factor expression (Figure A9F and Figure 10K).
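
For readers unfamiliar with the statistic, the Pearson correlation used in this comparison can be sketched as follows. The helper `pearson_r` and the seven synthetic per-time-point values are illustrative assumptions only; they do not reproduce the measurements behind Figure 10K.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    denom = np.sqrt((xc @ xc) * (yc @ yc))
    if denom == 0.0:
        raise ValueError("correlation is undefined for a constant input")
    return float((xc @ yc) / denom)

# Synthetic placeholders for seven time points (not values from the study).
rng = np.random.default_rng(1)
hspc_proportion = rng.uniform(0.1, 0.6, size=7)
erythroid_biased = 0.8 * hspc_proportion + rng.normal(0.0, 0.02, size=7)
print(round(pearson_r(erythroid_biased, hspc_proportion), 3))
```

With only seven time points per series, a coefficient of this kind is sensitive to individual points, so it is best read as a descriptive summary.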

Figure 10. Erythroid bias of primitive HSCs correlated with their proportion in time series of irradiation recovery. (GSE278673) (A) LSK phenotype bone marrow cells with clusters annotated as HSC (Hematopoietic Stem Cell), MPP2/3 (Multipotent Progenitors 2/3), MPP4 (Multipotent Progenitors 4), MK (Megakaryocyte), Ery (Erythrocyte), Ma (Mast cell), Neu (neutrophil), Mo (monocyte), DC (dendritic cell), B (B cell) and T (T cell). Different colors represent distinct cell subpopulations throughout the analysis. (B) Single-cell atlas of irradiation-injured LSK at time points 2, 5, 8, 11, 14, 21 and 30 days post-radiation, colored with annotated cell types. (C) Box plot of pseudotime distribution at each time point post-irradiation. (D) HSPC (Hematopoietic Stem and Progenitor Cell: HSC, MPP2/3, MPP4) proportion at each time point. (E) Erythroid cells (MK, Ery, Ma) proportion at each time point. (F) Myeloid cells (Neu, Mo, DC) proportion at each time point. (G) Erythroid lineage decision value projected on UMAP space at different time points. (H) Myeloid lineage decision value projected on UMAP space at different time points. (I) Lineage decision atlas for each time point within HSPC populations, showing erythroid decision values (red dots) and myeloid decision values (blue dots) along uniformly 50-binned pseudotime. (J) Horizontal bar plot of erythroid-biased proportion (red) and myeloid-biased proportion (blue) of HSPC at different time points. (K) Horizontal bar plot of Pearson correlations between erythroid-biased and myeloid-biased HSPC proportions at each time point with respect to average pseudotime, HSPC proportion, erythroid proportion and myeloid proportion.

4. Discussion

The adaptation of reinforcement learning methodologies to biological systems represents a significant advancement in computational biology, addressing a fundamental gap in single-cell trajectory analysis: the identification of where and when fate decisions occur during cellular differentiation [45,46]. While traditional approaches primarily focus on ordering cells along developmental trajectories, scRL’s actor–critic framework recasts differentiation as a sequential decision process, enabling quantitative assessment of fate intensity and regulatory checkpoints that precede overt lineage commitment [32,33]. This paradigm shift from descriptive to mechanistic analysis represents a crucial step toward understanding the decision-making logic that governs cellular differentiation in both healthy and diseased contexts.

scRL’s superior performance across diverse biological systems—including human hematopoiesis, acute myeloid leukemia, mouse endocrinogenesis and perturbation studies—demonstrates its versatility in capturing early decision states that conventional methods fail to detect. By integrating LDA-derived interpretable latent spaces with biologically informed reward functions, scRL’s critic network learns state-value functions that quantify lineage potential, while the actor component traces optimal developmental routes across the manifold. This dual approach enables the framework to balance exploration of novel differentiation pathways with exploitation of established routes, consistently outperforming fifteen state-of-the-art methods across five independent evaluation dimensions. The identification of previously uncharacterized regulators such as Dapp1 further validates scRL’s capacity to uncover biologically meaningful insights that extend beyond trajectory reconstruction.
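
The idea that a learned state-value function can quantify lineage potential ahead of commitment can be illustrated with a deliberately simplified sketch. It swaps the neural actor–critic for tabular value iteration on a small grid, and it assumes deterministic four-neighbour moves, a single rewarded target cell standing in for a committed lineage, and a discount factor of 0.9. None of these choices, nor the function name `grid_state_values`, comes from the scRL implementation.

```python
import numpy as np

def grid_state_values(n, target, gamma=0.9, n_sweeps=200):
    """Tabular value iteration on an n x n grid with 4-neighbour moves.

    Entering `target` yields reward 1 and ends the episode;
    every other move yields reward 0.
    """
    values = np.zeros((n, n))
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    for _ in range(n_sweeps):
        new_values = np.zeros_like(values)
        for r in range(n):
            for c in range(n):
                if (r, c) == target:
                    continue  # terminal state keeps value 0
                best = 0.0
                for dr, dc in moves:
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < n and 0 <= cc < n:
                        reward = 1.0 if (rr, cc) == target else 0.0
                        best = max(best, reward + gamma * values[rr, cc])
                new_values[r, c] = best
        values = new_values
    return values

values = grid_state_values(n=5, target=(4, 4))
print(np.round(values, 3))
```

In this toy setting a cell at Manhattan distance d from the target converges to a value of gamma ** (d - 1), so cells upstream of the rewarded state already carry a graded, nonzero value. That gradient is the property the text relies on when it describes decision states that precede overt commitment.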

A critical consideration in single-cell trajectory inference is the potential confounding effect of doublets—cellular aggregates that arise during droplet-based capture and may create spurious intermediate states with hybrid transcriptional profiles [47,48]. However, rigorous quality control measures implemented in preprocessing pipelines substantially mitigate doublet-induced artifacts, as these aggregates typically exhibit abnormally elevated transcript counts enabling their identification and removal through standard filtering criteria. Any residual doublets manifest as sporadic noise points rather than coherent cellular populations, appearing as isolated outliers in two-dimensional manifold representations that minimally impact the structural integrity of inferred developmental pathways.
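
A count-based filter of the kind alluded to here might look like the sketch below. The rule (flag cells whose total counts exceed the median by more than three median absolute deviations) is a generic outlier heuristic chosen for illustration; the threshold and the function name `flag_high_count_cells` are assumptions and do not describe the preprocessing actually used in this study.

```python
import numpy as np

def flag_high_count_cells(counts, n_mads=3.0):
    """Flag cells whose total counts exceed median + n_mads * MAD.

    `counts` is a cells x genes array of raw counts.
    """
    totals = np.asarray(counts).sum(axis=1)
    median = np.median(totals)
    mad = np.median(np.abs(totals - median))
    return totals > median + n_mads * mad

# Synthetic example (not data from the study): the last cell has an
# unusually high total count and is flagged as a doublet candidate.
toy_counts = np.array([[4, 6], [5, 6], [4, 5], [5, 5], [14, 16]])
print(flag_high_count_cells(toy_counts))
```

Dedicated doublet detectors based on simulated doublets are generally more sensitive than a simple count threshold, which mainly catches aggregates of cells with similar profiles.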

The reliance on two-dimensional embeddings, while computationally tractable and visually interpretable, represents a pragmatic compromise between analytical precision and biological interpretability [49,50]. Although high-dimensional single-cell data exhibit complex nonlinear structure that may not be fully preserved in 2D projections, the interpretability advantages substantially outweigh potential inaccuracies introduced by dimensional compression. This strategic design choice enables meaningful biological discovery and experimental validation within a comprehensible framework that facilitates integration of computational predictions with experimental observations.

While scRL’s graph-based pseudotime alignment provides robust temporal ordering across grid representations, several methodological limitations merit consideration. The requirement for user-defined starting points introduces subjectivity that may bias trajectory inference, as suboptimal selection of early cells or clusters propagates errors throughout Dijkstra’s shortest-path calculations [51]. More fundamentally, the assumption that shortest-path distances correspond to developmental time may not accurately reflect cellular differentiation dynamics, particularly in systems with rapid state transitions or non-monotonic developmental patterns [52,53]. Grid discretization compounds these issues by introducing artifacts through rasterization of continuous UMAP embeddings, potentially creating artificial barriers or connections that misrepresent the underlying biological manifold [54]. The multi-component handling strategy, despite its mathematical elegance, introduces approximation errors when connecting disconnected graph components, potentially mischaracterizing relationships between separated cellular populations [55]. Finally, the static, undirected graph structure overlooks the inherently directional nature of developmental processes, while the absence of uncertainty quantification limits reliability assessment, particularly in sparsely sampled regions where multiple plausible trajectories may exist [56].
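
The dependence of shortest-path pseudotime on the chosen start, and the problem posed by disconnected components, can be seen in a minimal sketch of Dijkstra's algorithm on an occupancy grid. It assumes four-neighbour connectivity with unit edge weights between occupied grid cells; the function name `grid_pseudotime` and the toy mask are hypothetical, and scRL's own graph construction may differ.

```python
import heapq
import numpy as np

def grid_pseudotime(occupied, start):
    """Shortest-path distance from `start` over occupied grid cells.

    Uses 4-neighbour connectivity with unit edge weights.
    Cells that cannot be reached keep a distance of infinity.
    """
    occupied = np.asarray(occupied, dtype=bool)
    if not occupied[start]:
        raise ValueError("start must be an occupied grid cell")
    n_rows, n_cols = occupied.shape
    dist = np.full(occupied.shape, np.inf)
    dist[start] = 0.0
    heap = [(0.0, start)]
    while heap:
        d, (r, c) = heapq.heappop(heap)
        if d > dist[r, c]:
            continue  # stale queue entry
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < n_rows and 0 <= cc < n_cols and occupied[rr, cc]:
                nd = d + 1.0
                if nd < dist[rr, cc]:
                    dist[rr, cc] = nd
                    heapq.heappush(heap, (nd, (rr, cc)))
    return dist

mask = np.array([[1, 1, 0],
                 [0, 1, 0],
                 [0, 0, 1]], dtype=bool)
print(grid_pseudotime(mask, (0, 0)))
```

In the toy mask the bottom-right cell is not four-connected to the rest, so its distance stays infinite. Any strategy for bridging such components has to assign it a finite value by approximation, which is the source of error noted above.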

The selection of grid embedding hyperparameters n and j critically influences grid construction and computational efficiency. Based on systematic parameter evaluation across grid resolution n (25, 50, 75, 100, 125) and border observer parameter j (3, 5, 8, 11, 15), we recommend n = 50 and j = 3 as optimal choices that balance embedding accuracy with computational tractability (Figure A10).
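
To make the role of the grid resolution n concrete, the sketch below rasterizes a two-dimensional embedding onto an n x n occupancy grid and reports how the occupied fraction changes over the resolutions evaluated above. It covers only the resolution parameter; the border observer parameter j is not modelled, and the function name `rasterize_embedding` and the synthetic embedding are assumptions made for illustration.

```python
import numpy as np

def rasterize_embedding(coords, n):
    """Map 2D embedding coordinates onto an n x n boolean occupancy grid.

    Returns (grid, cells), where `cells` holds each point's grid indices.
    """
    coords = np.asarray(coords, dtype=float)
    lo = coords.min(axis=0)
    span = coords.max(axis=0) - lo
    span[span == 0] = 1.0  # guard against a degenerate axis
    # Scale to [0, n] and clip so the maximum falls in the last cell.
    cells = np.minimum((n * (coords - lo) / span).astype(int), n - 1)
    grid = np.zeros((n, n), dtype=bool)
    grid[cells[:, 0], cells[:, 1]] = True
    return grid, cells

# Synthetic embedding (not data from the study).
rng = np.random.default_rng(0)
embedding = rng.normal(size=(5000, 2))
for n in (25, 50, 75, 100, 125):
    grid, _ = rasterize_embedding(embedding, n)
    print(n, grid.mean())
```

Finer grids leave more cells empty for a fixed number of points, which is one way to see why resolution trades embedding fidelity against connectivity and compute cost.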

Beyond trajectory inference, scRL’s unified framework addresses broader challenges in developmental biology by providing competitive measures of lineage-contribution intensity without requiring ground-truth probabilities. This capability, combined with its ability to reconcile gene- and lineage-level information across healthy, diseased and perturbed systems, positions scRL as a versatile tool for regenerative medicine, oncology and developmental biology. The framework’s extensibility toward multi-omic and spatial modalities, coupled with ongoing efforts to streamline unsupervised identification of key subpopulations, promises to advance our mechanistic understanding of cell fate decisions and their therapeutic manipulation.

5. Conclusions

scRL brings reinforcement learning into single-cell biology, offering a principled, data-driven approach for charting differentiation landscapes and pinpointing regulatory checkpoints. Its ability to capture early decision states, reconcile gene- and lineage-level information, and generalize across healthy, diseased and perturbed systems positions scRL as a versatile tool for developmental biology, oncology and regenerative medicine. Ongoing work aims to integrate multi-omic and spatial modalities and to streamline unsupervised identification of key subpopulations, extending scRL’s reach toward a comprehensive, mechanistic atlas of cell fate decisions.

Author Contributions

Z.F.: Conceptualization, methodology, software, validation, formal analysis, investigation, data curation, resources, writing—original draft, writing—review and editing, visualization. C.C.: formal analysis, writing—original draft, writing—review and editing, visualization. S.W.: methodology, software, formal analysis, resources. J.W.: writing—review and editing, resources, supervision, project administration, funding acquisition. S.C.: writing—review and editing, resources, supervision, project administration, funding acquisition. All authors read and approved the final manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant numbers 82222060, 82430103, 82473572, 81930090, 81725019, 82073487 and 81602790. The APC was funded by the National Natural Science Foundation of China.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The single-cell RNA-seq data have been deposited in the GEO database under accession numbers GSE277292 and GSE278673 and are publicly available. All original code has been deposited on GitHub (v0.0.5) at https://github.com/PeterPonyu/scRL (accessed on 19 November 2024).

Acknowledgments

The authors have reviewed the experimental results and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

scRL	single-cell Reinforcement Learning
PCA	Principal Component Analysis
ICA	Independent Component Analysis
NMF	Non-negative Matrix Factorization
FA	Factor Analysis
LDA	Latent Dirichlet Allocation
UMAP	Uniform Manifold Approximation and Projection
tSNE	t-distributed Stochastic Neighbor Embedding
scVI	single-cell Variational Inference
LscVI	Linearly decoded single-cell Variational Inference
Diff	Diffusion Map
DPT	Diffusion Pseudotime
NMI	Normalized Mutual Information
ARI	Adjusted Rand Index
ASW	Average Silhouette Width
CH	Calinski–Harabasz index
DB	Davies–Bouldin index
MAS	Manifold Alignment Score
PAAC	Principal Axis Alignment Coefficient
LSK	${Lin}^{-} {Sca-1}^{+} {c-Kit}^{+}$ cells
HSPC	Hematopoietic Stem and Progenitor Cell
HSC	Hematopoietic Stem Cell
Ery	Erythrocyte
Mega	Megakaryocyte
Mono	Monocyte
DC	Dendritic Cell
CLP	Common Lymphoid Progenitor
CMP	Common Myeloid Progenitor
MPP	Multipotent Progenitor
MLP	Multilayer Perceptron
PreBNK	Pre-B and Natural Killer cells
GMP	Granulocyte–Monocyte Progenitor
AML	Acute Myeloid Leukemia
GOBP	Gene Ontology Biological Process

Appendix A

Figure A1. Comprehensive analytical framework for scRL implementation and validation. (S-SUBS8, GSE132188) (A) Establishment of scRL efficacy through integration of tabular Q-learning validation on human hematopoietic ${CD34}^{+}$ cells and mouse endocrinogenesis datasets, with evaluation of pre-expression state identification using lineage-specific marker genes: GATA1 (erythroid), IRF8 (myeloid), EBF1 (lymphoid) for hematopoiesis, and Ngn3 (early), Fev (intermediate) for endocrinogenesis. (B) Cellular subpopulation-level analysis with cluster information projected onto grid space, establishing specific rewards for erythroid, myeloid and lymphoid lineages in hematopoietic dataset and early/late differentiation stages in endocrinogenesis dataset, with HSC (hematopoietic stem cell) and EP (endocrinogenesis progenitor) clusters as respective starting points. (C) Projection of derived final state values onto original embedding space for biological interpretation. (D) Training efficacy confirmation through convergence analysis of cumulative rewards from policy network and maximum state values from critic network over time.

Figure A2. LDA exhibits superior interpretability on endocrinogenesis dataset. (GSE132188) (A,B) Comparison of interpretability metrics including ARI, NMI, ASW, CH and DB for dimensionality reduction methods (scVI, ICA, PCA, LscVI, FA, Diff, NMF and LDA) evaluated across varying numbers of highly variable genes (1000, 2000, 3000, 4000, 5000) and latent space components (5, 10, 15, 20, 25). (C) Intensity distributions of latent components obtained by LDA across different subgroups.

Figure A3. scRL validation on endocrinogenesis dataset. (GSE132188) (A) Projection of fate decision intensity obtained by scRL onto UMAP embedding in endocrinogenesis dataset. (B) Comparison of scRL fate decision intensity with LDA percentile variants (95%, 90%, 85%, 80%) using ASW, CH and DB interpretability metrics.

Figure A4. scRL validation on endocrinogenesis dataset. (GSE132188) (A) Workflow of edge detection, grid embedding and pseudotime construction applied to hematopoietic and endocrinogenesis datasets. (B) Comprehensive comparison of interpretability metrics for scRL fate decision intensity against Diff, FA, ICA, LscVI, NMF, PCA and scVI at 95%, 90%, 85% and 80% percentile intensities on endocrinogenesis dataset.

Figure A5. scRL lineage intensity inference validation on endocrinogenesis dataset. (GSE132188) (A) Workflow demonstration showing KMeans clustering labels (left), scRL-derived contribution intensities (middle) and intensity distributions of each lineage branch across different labels (right). (B) Comprehensive comparison of scRL lineage contribution intensity against baseline methods (PCA, ICA, FA, NMF, Diff, scVI, LscVI) across KMeans cluster numbers (4, 6, 8, 10, 12) using ASW, DB and CH metrics, and overall composite scores.

Figure A6. Validation of scRL performance using Manifold Alignment Score metrics. (GSE132188, GSE117498) (A) Pseudotime reconstruction comparison between scRL and baseline methods (Monocle2, Monocle3, Palantir, Wishbone, DPT) on endocrinogenesis dataset showing scRL’s superior performance with +0.061 improvement over previous best method. (B) Fate probability inference comparison between scRL and established methods (Palantir, FateID, CellRank) on phenotypical HSC dataset demonstrating scRL’s substantial +0.312 improvement over previous best method.

Figure A7. Comprehensive analysis of hematopoietic differentiation regulators and validation of Dapp1’s role in lineage contribution. (GSE277292) (A) Monocle2 differentiation trajectory analysis showing intersection of dynamical genes and BEAM (Branching Expression Analysis Modeling) test significant genes yielding 280 genes with Dapp1 as an early gene. Gene expression trends across trajectories with GOBP enrichment analysis. Gene correlation analysis of LSK single-cell data from GSE136341 and GSE145491 showing Dach1 and Dntt correlation patterns with associated gene sets and intersection analysis. (B) Transcription factor expression patterns critical for hematopoietic lineage identification on UMAP, including Hlf and Tcf15 for primitive HSCs, Gata1 and Klf1 for erythroid bias, Id2 for dendritic cells, Irf8 for monocytes, Cebpe for neutrophils, Ebf1 for B cells, Hoxa9 for multipotency, Satb1 for lymphoid bias and Gfi1 for myeloid bias. Expression patterns of lineage markers Mpo, Dntt and Itga2b are also displayed. (C) Comparative analysis of lineage decision intensities for myeloid, erythroid and lymphoid lineages, along with gene decision intensities for Mpo, Itga2b and Dntt between Dapp1 knockout and control groups.

Figure A8. Detailed trajectory analysis of lineage decision dynamics following Dapp1 knockout in hematopoietic differentiation. (GSE277292) (A) Myeloid, erythroid and lymphoid differentiation trajectories comparing wild-type (left) and knockout (right) conditions. (B) Early (left) and multipotent (right) cluster identification within UMAP embedding for wild-type and knockout conditions. (C) Dntt gene decision value and lymphoid lineage decision value progression along pseudotime within early cluster populations. (D) Left panel: Dntt gene decision value and lymphoid lineage decision value along pseudotime within multipotent cluster. Right panel: Itga2b gene decision value and erythroid lineage decision value along pseudotime within early cluster. (E) Left panel: Itga2b gene decision value and erythroid lineage decision value along pseudotime within multipotent cluster. Right panel: Mpo gene decision value and myeloid lineage decision value along pseudotime within multipotent cluster.

Figure A9. Comprehensive validation of LSK cell identification and lineage contribution dynamics following irradiation. (GSE278673) (A) Expression patterns of Kit (c-Kit) and Ly6a (Sca-1) genes validating LSK cell identification, Cd34 and Flt3 indicating multipotent progenitor components, key surface markers for hematopoietic lineage identification including Il7r (lymphoid bias), Cd19 (B cells), Cd3e (T cells), Itgam and Fcer1g (myeloid bias), Cd63 (granulocyte bias), Itga2b (megakaryocytic bias), Gypa (Erythrocytes), Csf1r (monocytes) and Xcr1 (dendritic cells), and critical transcription factors Hlf (stem cells), Satb1 (multipotent progenitors), Cebpe (neutrophil lineage), Gata1 (erythroid lineage) and Id2 (dendritic cell lineage). (B) Temporal expression patterns of primitive factors (Hlf, Meis1, Satb1, Hoxa9), myeloid factors (Spi1, Gfi1, Cebpa, Cebpe) and erythroid factors (Klf1, Gata1) across irradiation recovery time course. (C) Erythroid and myeloid contribution values at different time points projected on UMAP embedding with corresponding box plot quantification. (D,E) Comparative analysis of erythroid and myeloid lineage contribution and decision intensities between normal LSK and irradiated samples across all time points. (F) Correlation heatmap of lineage decision values at each time point against expression of primitive genes (Hlf, Meis1, Satb1), myeloid genes (Gfi1, Cebpa, Cebpe) and erythroid genes (Klf1, Gata1).

Figure A10. Impact of hyperparameters n and j on grid embedding. Performance was evaluated on the hematopoiesis (left) and endocrinogenesis (right) datasets using grid size n = 25, 50, 75, 100, 125 and neighborhood parameter j = 3, 5, 8, 11, 15.

Table A1. LDA performance improvements over baseline methods on endocrinogenesis dataset across HVG sizes.

Method	NMI	ARI	ASW	DB	CH	Overall
LDA vs. scVI	+0.258	+0.187	+0.379	+1.613	+1416.596	+0.806
LDA vs. ICA	+0.251	+0.157	+0.349	+1.869	+1411.359	+0.789
LDA vs. LscVI	+0.061	−0.006	+0.306	+1.254	+1061.209	+0.455
LDA vs. Diff	+0.012	+0.022	+0.165	+0.832	+1166.559	+0.347
LDA vs. PCA	+0.012	−0.060	+0.254	+0.971	+1022.558	+0.341
LDA vs. FA	−0.023	−0.100	+0.159	+0.145	+1085.451	+0.173
LDA vs. NMF	−0.074	−0.084	+0.117	+0.147	+744.455	+0.087

Table A2. LDA performance improvements over baseline methods on endocrinogenesis dataset across latent space sizes.

Method	NMI	ARI	ASW	DB	CH	Overall
LDA vs. scVI	+0.278	+0.186	+0.370	+1.579	+1506.128	+0.839
LDA vs. ICA	+0.189	+0.124	+0.337	+1.799	+1495.191	+0.765
LDA vs. LscVI	+0.099	+0.057	+0.329	+1.514	+1109.556	+0.605
LDA vs. Diff	+0.034	+0.028	+0.206	+0.906	+1323.790	+0.420
LDA vs. PCA	+0.027	−0.082	+0.266	+1.486	+1218.136	+0.410
LDA vs. FA	+0.023	−0.060	+0.200	+0.503	+1270.592	+0.313
LDA vs. NMF	+0.004	−0.054	+0.185	+0.428	+719.630	+0.236

Table A3. scRL performance improvements over LDA percentile variants on endocrinogenesis dataset.

Method	ARI	NMI	ASW	CH	DB	Overall
scRL vs. ${LDA}_{80}$	+0.219	+0.122	+0.277	+310.840	+0.205	+0.848
scRL vs. ${LDA}_{85}$	+0.176	+0.104	+0.196	+256.830	+0.095	+0.549
scRL vs. ${LDA}_{90}$	+0.099	+0.055	+0.083	+194.023	+0.076	+0.371
scRL vs. ${LDA}_{95}$	+0.025	−0.012	+0.049	+108.108	+0.112	+0.287

Table A4. scRL fate decision performance improvements over baseline methods on endocrinogenesis dataset.

Method	NMI	ARI	ASW	DB	CH	Overall
scRL vs. ${Diff}_{80}$	+0.337	+0.236	+0.364	+1.436	+615.913	+0.811
scRL vs. ${Diff}_{85}$	+0.310	+0.234	+0.389	+1.218	+610.215	+0.793
scRL vs. ${Diff}_{90}$	+0.240	+0.213	+0.359	+1.222	+596.458	+0.722
scRL vs. ${Diff}_{95}$	+0.102	+0.120	+0.224	+0.831	+534.217	+0.461
scRL vs. ${FA}_{80}$	+0.187	+0.115	+0.213	+2.754	+503.551	+0.605
scRL vs. ${FA}_{85}$	+0.152	+0.092	+0.190	+2.089	+461.722	+0.502
scRL vs. ${FA}_{90}$	+0.100	+0.033	+0.177	+2.471	+440.212	+0.439
scRL vs. ${FA}_{95}$	+0.037	−0.041	+0.134	+2.100	+418.401	+0.300
scRL vs. ${ICA}_{80}$	+0.345	+0.201	+0.255	+2.819	+599.145	+0.812
scRL vs. ${ICA}_{85}$	+0.331	+0.186	+0.244	+2.557	+592.465	+0.770
scRL vs. ${ICA}_{90}$	+0.303	+0.165	+0.229	+2.227	+584.174	+0.709
scRL vs. ${ICA}_{95}$	+0.248	+0.129	+0.210	+1.804	+563.078	+0.612
scRL vs. ${LscVI}_{80}$	+0.171	+0.090	+0.219	+1.772	+465.352	+0.506
scRL vs. ${LscVI}_{85}$	+0.134	+0.070	+0.210	+1.450	+435.556	+0.443
scRL vs. ${LscVI}_{90}$	+0.085	+0.032	+0.192	+1.251	+393.053	+0.351
scRL vs. ${LscVI}_{95}$	+0.016	−0.037	+0.164	+1.037	+297.412	+0.207
scRL vs. ${NMF}_{80}$	+0.335	+0.202	+0.178	+0.066	+325.897	+0.499
scRL vs. ${NMF}_{85}$	+0.287	+0.171	+0.167	+0.134	+332.902	+0.452
scRL vs. ${NMF}_{90}$	+0.193	+0.120	+0.151	+0.123	+353.477	+0.362
scRL vs. ${NMF}_{95}$	+0.113	+0.049	+0.134	+0.116	+291.898	+0.238
scRL vs. ${PCA}_{80}$	+0.200	+0.111	+0.164	+2.066	+376.044	+0.495
scRL vs. ${PCA}_{85}$	+0.160	+0.107	+0.142	+1.450	+304.960	+0.395
scRL vs. ${PCA}_{90}$	+0.117	+0.074	+0.110	+1.214	+253.100	+0.298
scRL vs. ${PCA}_{95}$	+0.032	−0.007	+0.095	+0.961	+252.420	+0.173
scRL vs. ${scVI}_{80}$	+0.330	+0.203	+0.221	+1.741	+577.491	+0.713
scRL vs. ${scVI}_{85}$	+0.312	+0.191	+0.217	+1.546	+568.437	+0.678
scRL vs. ${scVI}_{90}$	+0.283	+0.173	+0.213	+1.392	+559.935	+0.634
scRL vs. ${scVI}_{95}$	+0.237	+0.140	+0.206	+1.267	+548.500	+0.571

Table A5. scRL lineage contribution performance improvements over baseline methods across different cluster numbers on endocrinogenesis dataset.

Method	ASW	DB	CH	Overall
scRL vs. ${PCA}_{4}$	+0.247	+0.359	+5121.830	+0.353
scRL vs. ${PCA}_{6}$	+0.124	+0.235	+5058.890	+0.329
scRL vs. ${PCA}_{8}$	+0.157	+0.310	+3963.771	+0.426
scRL vs. ${PCA}_{10}$	+0.137	+0.237	+3330.795	+0.379
scRL vs. ${PCA}_{12}$	+0.194	+0.465	+2821.714	+0.515
scRL vs. ${ICA}_{4}$	+0.256	+0.447	+6269.940	+0.409
scRL vs. ${ICA}_{6}$	+0.150	+0.312	+6535.205	+0.425
scRL vs. ${ICA}_{8}$	+0.189	+0.353	+5198.874	+0.535
scRL vs. ${ICA}_{10}$	+0.169	+0.280	+3284.407	+0.430
scRL vs. ${ICA}_{12}$	+0.235	+0.578	+3874.281	+0.694
scRL vs. ${FA}_{4}$	+0.119	+0.100	+5613.054	+0.242
scRL vs. ${FA}_{6}$	+0.058	+0.023	+5792.529	+0.254
scRL vs. ${FA}_{8}$	+0.120	+0.080	+4634.559	+0.375
scRL vs. ${FA}_{10}$	+0.069	+0.112	+3547.165	+0.236
scRL vs. ${FA}_{12}$	+0.096	+0.106	+2951.559	+0.284
scRL vs. ${NMF}_{4}$	+0.114	+0.105	+816.813	+0.105
scRL vs. ${NMF}_{6}$	+0.042	+0.077	+2848.851	+0.131
scRL vs. ${NMF}_{8}$	+0.103	+0.196	+2651.048	+0.265
scRL vs. ${NMF}_{10}$	+0.061	+0.103	+2414.816	+0.210
scRL vs. ${NMF}_{12}$	+0.114	+0.240	+2419.564	+0.334
scRL vs. ${Diff}_{4}$	+0.041	+0.012	−2261.901	−0.048
scRL vs. ${Diff}_{6}$	−0.086	+0.212	+2454.264	−0.024
scRL vs. ${Diff}_{8}$	−0.047	+0.133	+3041.397	+0.091
scRL vs. ${Diff}_{10}$	−0.069	+0.161	+2894.393	+0.059
scRL vs. ${Diff}_{12}$	−0.064	+0.043	+2708.683	+0.120
scRL vs. ${scVI}_{4}$	+0.551	+1.761	+8639.039	+0.899
scRL vs. ${scVI}_{6}$	+0.368	+1.301	+7855.647	+0.859
scRL vs. ${scVI}_{8}$	+0.360	+1.106	+5999.717	+0.924
scRL vs. ${scVI}_{10}$	+0.308	+1.024	+4944.921	+0.856
scRL vs. ${scVI}_{12}$	+0.300	+1.019	+4055.950	+0.899
scRL vs. ${LscVI}_{4}$	+0.241	+0.350	+4669.743	+0.336
scRL vs. ${LscVI}_{6}$	+0.119	+0.255	+5366.970	+0.337
scRL vs. ${LscVI}_{8}$	+0.177	+0.320	+4190.457	+0.456
scRL vs. ${LscVI}_{10}$	+0.161	+0.321	+3663.236	+0.447
scRL vs. ${LscVI}_{12}$	+0.164	+0.311	+2998.675	+0.461

References

Auerbach, B.J.; Hu, J.; Reilly, M.P.; Li, M. Applications of single-cell genomics and computational strategies to study common disease and population-level variation. Genome Res. 2021, 31, 1728–1741.
Yang, X.H.; Chen, X.; Guo, X.; Lan, J.; Zhu, L.; Wang, X. Detecting critical transition signals from single-cell transcriptomes to infer lineage-determining transcription factors. Nucleic Acids Res. 2022, 50, e91.
Hou, W.; Ji, Z.; Chen, Z.; Gottardo, E.J.; Mohanty, V.; Yan, X.; Zhao, Y.; Xia, R.; Kleinstein, S.H.; Huang, H.; et al. A statistical framework for differential pseudotime analysis with multiple single-cell RNA-seq samples. Nat. Commun. 2023, 14, 7286.
Lähnemann, D.; Köster, J.; Szczurek, E.; McCarthy, D.J.; Hicks, S.C.; Robinson, M.D.; Vallejos, C.A.; Campbell, K.R.; Beerenwinkel, N.; Mahfouz, A.; et al. Eleven grand challenges in single-cell data science. Genome Biol. 2020, 21, 31.
Van der Maaten, L.; Postma, E.; Van den Herik, J. Dimensionality Reduction: A Comparative Review; Technical Report TiCC TR 2009-005; Tilburg University: Tilburg, The Netherlands, 2009.
Pandey, K.; Zafar, H. Inference of cell state transitions and cell fate plasticity from single-cell with MARGARET. Nucleic Acids Res. 2022, 50, e86.
Pillai, M.; Chen, Z.; Jolly, M.K.; Li, C. Quantitative landscapes reveal trajectories of cell-state transitions associated with drug resistance in melanoma. iScience 2022, 25, 105499.
Sun, S.; Zhu, J.; Ma, Y.; Zhou, X. Accuracy, robustness and scalability of dimensionality reduction methods for single-cell RNA-seq analysis. Genome Biol. 2019, 20, 269.
Wang, K.; Abrams, Z.R.; Wilk, A.J.; Lu, M.A.; Chen, B.; Nolan, G.P.; Li, B. Comparative analysis of dimension reduction methods for cytometry by time-of-flight data. Nat. Commun. 2023, 14, 1836.
McInnes, L.; Healy, J.; Melville, J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. J. Open Source Softw. 2018, 3, 861.
Wolf, F.A.; Hamey, F.K.; Plass, M.; Solana, J.; Dahlin, J.S.; Göttgens, B.; Rajewsky, N.; Simon, L.; Theis, F.J. PAGA: Graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cells. Genome Biol. 2019, 20, 59.
Cao, J.; Spielmann, M.; Qiu, X.; Huang, X.; Ibrahim, D.M.; Hill, A.J.; Zhang, F.; Mundlos, S.; Christiansen, L.; Steemers, F.J.; et al. The single-cell transcriptional landscape of mammalian organogenesis. Nature 2019, 566, 496–502.
Street, K.; Risso, D.; Fletcher, R.B.; Das, D.; Ngai, J.; Yosef, N.; Purdom, E.; Dudoit, S. Slingshot: Cell lineage and pseudotime inference for single-cell transcriptomics. BMC Genom. 2018, 19, 477.
Sugihara, R.; Kato, Y.; Mori, T.; Kawahara, Y. Alignment of single-cell trajectory trees with CAPITAL. Nat. Commun. 2022, 13, 5972.
Sagar; Granja, J.M.; Ananthakrishnan, B.; Zhang, C.A.; Camacho, J.; Santos, P.M.; Wu, M.O.; Dal Molin, A.; Dhanasekaran, P.; Fischbeck, N.P. Deciphering the regulatory landscape of fetal and adult human epicardial ATMs at single-cell resolution. bioRxiv 2020. bioRxiv:2020.04.27.999235.
Meilă, M. Manifold Learning: What, How, and Why. Annu. Rev. Stat. Its Appl. 2024, 11, 393–417.
Belkin, M.; Niyogi, P. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput. 2003, 15, 1373–1396.
Tenenbaum, J.B.; De Silva, V.; Langford, J.C. A global geometric framework for nonlinear dimensionality reduction. Science 2000, 290, 2319–2323.
Roweis, S.T.; Saul, L.K. Nonlinear dimensionality reduction by locally linear embedding. Science 2000, 290, 2323–2326.
Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605.
Moon, K.R.; van Dijk, D.; Wang, Z.; Gigante, S.; Burkhardt, D.B.; Chen, W.S.; Yim, K.; van den Elzen, A.; Hirn, M.J.; Coifman, R.R.; et al. Manifold learning-based methods for analyzing single-cell RNA-sequencing data. Curr. Opin. Syst. Biol. 2018, 7, 36–46.
Vasighizaker, A.; Danda, S.; Rueda, L. Discovering cell types using manifold learning and enhanced visualization of single-cell RNA-Seq data. Sci. Rep. 2022, 12, 120.
Gunawan, I.; Vafaee, F.; Meijering, E.; Lock, J.G. An introduction to representation learning for single-cell data analysis. Cell Rep. Methods 2023, 3, 100547.
Nguyen, N.D.; Wang, D. ManiNetCluster: A novel manifold learning approach to reveal the functional links between gene networks. BMC Genom. 2019, 20, 1003.
Huang, J.X.; Zeng, H.; Wang, L.; Hill, S.L.; Zhao, F. Manifold learning analysis suggests strategies to align gene expression and electrophysiology measurements of neuronal diversity in the mouse brain. Commun. Biol. 2021, 4, 1253.
Verma, A.; Engelhardt, B.E. A robust nonlinear low-dimensional manifold for single cell RNA-seq data. BMC Bioinform. 2020, 21, 324.
Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018.
Kempka, M.; Wydmuch, M.; Runc, G.; Toczek, J.; Jaśkowski, W. ViZDoom: A Doom-based AI research platform for visual reinforcement learning. In Proceedings of the 2016 IEEE Conference on Computational Intelligence and Games (CIG), Santorini, Greece, 20–23 September 2016; pp. 1–8.
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing Atari with Deep Reinforcement Learning. arXiv 2013, arXiv:1312.5602.
Brahmbhatt, S.; Hays, J. DeepNav: Learning to Navigate Large Cities. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 3087–3096.
Moris, N.; Pina, C.; Arias, A.M. Transition states and cell fate decisions in epigenetic landscapes. Nat. Rev. Genet. 2016, 17, 693–703.
Yu, Y.; Gong, H.; Qiu, L.; Yuan, X. Reinforcement learning in continuous action spaces: A case study in the game of Simulated Curling. J. Artif. Intell. Res. 2018, 62, 261–290.
Peng, X.; Harris, K.R.; Saha, D. Reinforcement learning approaches to decision-making in biological systems: An overview and future directions. J. Theor. Biol. 2018, 451, 30–52.
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533.
Schiebinger, G.; Shu, J.; Tabaka, M.; Cleary, B.; Subramanian, V.; Solomon, A.; Gould, J.; Liu, S.; Lin, S.; Berube, P.; et al. Optimal-transport analysis of single-cell gene expression identifies developmental trajectories in reprogramming. Cell 2019, 176, 928–943.
Anger, A.M.; McLornan, D.P.; Lindner, A.; Sonnabend, E.; Karadimitris, A.; Coluccio, V.; Kim, Y.; Melchinger, W.; Thompson, H.; Thrift, A.P.; et al. Exploring cell identity as a stochastic process. Nat. Commun. 2021, 12, 1677.
Setty, M.; Tadmor, M.D.; Reich-Zeliger, S.; Angel, O.; Salame, T.M.; Kathail, P.; Choi, K.; Bendall, S.; Friedman, N.; Pe’er, D. Wishbone identifies bifurcating developmental trajectories from single-cell data. Nat. Biotechnol. 2016, 34, 637–645.
Setty, M.; Kiseliovas, V.; Levine, J.; Gayoso, A.; Mazutis, L.; Pe’er, D. Characterization of cell fate probabilities in single-cell data with Palantir. Nat. Biotechnol. 2019, 37, 451–460.
Haghverdi, L.; Buettner, F.; Theis, F.J. Diffusion maps for high-dimensional single-cell analysis of differentiation data. Bioinformatics 2015, 31, 2989–2998.
Qiu, X.; Mao, Q.; Tang, Y.; Wang, L.; Chawla, R.; Pliner, H.A.; Trapnell, C. Reversed graph embedding resolves complex single-cell trajectories. Nat. Methods 2017, 14, 979–982.
Herman, J.S.; Sagar; Grün, D. FateID infers cell fate bias in multipotent progenitors from single-cell RNA-seq data. Nat. Methods 2018, 15, 379–386.
Lange, M.; Bergen, V.; Klein, M.; Setty, M.; Reuter, B.; Bakhti, M.; Lickert, H.; Ansari, M.; Schniering, J.; Schiller, H.B.; et al. CellRank for directed single-cell fate mapping. Nat. Methods 2022, 19, 159–170.
Pellin, D.; Loperfido, M.; Baricordi, C.; Wolock, S.L.; Montepeloso, A.; Weinberg, O.K.; Biffi, A.; Klein, A.M.; Biasco, L. A comprehensive single cell transcriptional landscape of human hematopoietic progenitors. Nat. Commun. 2019, 10, 2395.
Naldini, M.M.; Vendramin, R.; Elias, M.; Bovellan, M.; Lakshminarasimha, K.A.; Tani, E.; Meyers, J.; Kempf, A.; Green, S.C.; Buquicchio, F.A.; et al. Longitudinal single-cell profiling of chemotherapy response in acute myeloid leukemia. Nat. Commun. 2023, 14, 1285.
Neftci, E.O.; Averbeck, B.B. Reinforcement learning in artificial and biological systems. Nat. Mach. Intell. 2019, 1, 133–143.
Mahmud, M.; Kaiser, M.S.; Hussain, A.; Vassanelli, S. Applications of Deep Learning and Reinforcement Learning to Biological Data. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 2063–2079.
Xi, N.M.; Li, J.J. Benchmarking computational doublet-detection methods for single-cell RNA sequencing data. Cell Syst. 2021, 12, 176–194.
Su, M.; Pan, T.; Chen, Q.Z.; Wei, W.W.; Zhang, Y.; Gao, S.Q.; Liu, Z.X.; Lin, X.; Tang, G.; Tang, K.; et al. Data analysis guidelines for single-cell RNA-seq in biomedical studies and clinical applications. Mil. Med. Res. 2022, 9, 68.
Chari, T.; Pachter, L. The specious art of single-cell genomics. PLoS Comput. Biol. 2023, 19, e1011288.
Moon, K.R.; van Dijk, D.; Wang, Z.; Gigante, S.; Burkhardt, D.B.; Chen, W.S.; Yim, K.; van den Elzen, A.; Hirn, M.J.; Coifman, R.R.; et al. Visualizing structure and transitions in high-dimensional biological data. Nat. Biotechnol. 2019, 37, 1482–1492.
Saelens, W.; Cannoodt, R.; Todorov, H.; Saeys, Y. A comparison of single-cell trajectory inference methods. Nat. Biotechnol. 2019, 37, 547–554.
Weinreb, C.; Wolock, S.; Tusi, B.K.; Socolovsky, M.; Klein, A.M. Fundamental limits on dynamic inference from single-cell snapshots. Proc. Natl. Acad. Sci. USA 2018, 115, E2467–E2476.
Burkhardt, D.B.; Stanley, J.S., 3rd; Tong, A.; Perdigoto, A.L.; Gigante, S.A.; Herold, K.C.; Wolf, G.; Giraldez, A.J.; van Dijk, D.; Krishnaswamy, S. Quantifying the effect of experimental perturbations at single-cell resolution. Nat. Biotechnol. 2021, 39, 619–629.
Li, Q. scTour: A deep learning architecture for robust inference and accurate prediction of cellular dynamics. Genome Biol. 2023, 24, 149.
Stassen, S.V.; Yip, G.G.K.; Wong, K.K.Y.; Ho, J.W.K.; Tsia, K.K. Generalized and scalable trajectory inference in single-cell omics data with VIA. Nat. Commun. 2021, 12, 5528.
Fang, M.; Gorin, G.; Pachter, L. Trajectory inference from single-cell genomics data with a process time model. PLoS Comput. Biol. 2025, 21, e1012752.

Table 1. LDA performance improvements over baseline methods across HVG sizes.

Method	NMI	ARI	ASW	DB	CH	Overall
LDA vs. scVI	+0.267	+0.199	+0.334	+1.504	+2151.767	+0.785
LDA vs. ICA	+0.220	+0.149	+0.331	+1.938	+1857.164	+0.675
LDA vs. PCA	+0.064	+0.058	+0.303	+1.316	+1745.558	+0.511
LDA vs. LscVI	+0.081	+0.032	+0.246	+1.213	+1786.359	+0.459
LDA vs. FA	+0.012	−0.021	+0.190	+0.557	+1577.247	+0.261
LDA vs. Diff	−0.016	−0.045	+0.057	+0.585	+1568.978	+0.204
LDA vs. NMF	+0.007	+0.010	+0.159	+0.179	+375.229	+0.123
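
The improvement tables report signed gains of one method over a baseline for each clustering metric. As a minimal sketch of how such entries could be derived, assuming each entry is a plain difference between the two methods' scores with the sign flipped for DB (where lower is better) so that positive always means "better": the helper `improvement` and the sample scores below are hypothetical and are not taken from the paper, and the paper's "Overall" aggregation is deliberately not reproduced.

```python
# Hypothetical helper; the sign convention is an assumption for illustration.
# NMI, ARI, ASW and CH improve as they increase; DB improves as it decreases.
HIGHER_IS_BETTER = {"NMI": True, "ARI": True, "ASW": True, "DB": False, "CH": True}


def improvement(candidate, baseline):
    """Return the signed per-metric gain of `candidate` over `baseline`.

    A positive value means the candidate is better on that metric,
    regardless of whether the metric is maximised or minimised.
    """
    gains = {}
    for metric, higher_is_better in HIGHER_IS_BETTER.items():
        delta = candidate[metric] - baseline[metric]
        gains[metric] = delta if higher_is_better else -delta
    return gains


# Made-up scores purely for illustration (not values from the paper).
candidate_scores = {"NMI": 0.75, "ARI": 0.5, "ASW": 0.5, "DB": 1.0, "CH": 2500.0}
baseline_scores = {"NMI": 0.5, "ARI": 0.25, "ASW": 0.125, "DB": 2.25, "CH": 750.0}

gains = improvement(candidate_scores, baseline_scores)
for metric, value in gains.items():
    print(f"{metric}: {value:+.3f}")
```

Under this convention a positive DB entry means the candidate has the lower (better) Davies-Bouldin index, which is why all columns can be read in the same direction.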

Table 2. LDA performance improvements over baseline methods across latent space sizes.

Method	NMI	ARI	ASW	DB	CH	Overall
LDA vs. scVI	+0.213	+0.106	+0.338	+1.516	+2643.360	+0.713
LDA vs. ICA	+0.127	+0.029	+0.334	+2.147	+2337.782	+0.618
LDA vs. LscVI	+0.075	+0.028	+0.326	+1.451	+2354.817	+0.560
LDA vs. PCA	−0.003	−0.056	+0.293	+1.576	+2273.306	+0.409
LDA vs. FA	−0.005	−0.084	+0.260	+0.686	+2383.781	+0.278
LDA vs. Diff	−0.042	−0.099	+0.124	+0.796	+2157.947	+0.198
LDA vs. NMF	−0.066	−0.117	+0.153	+0.536	+886.494	+0.075

Table 3. scRL performance improvements over LDA percentile variants on hematopoietic dataset.

Method	ARI	NMI	ASW	CH	DB	Overall
scRL vs. ${LDA}_{80\%}$	+0.330	+0.209	+0.549	+1016.722	+0.495	+0.986
scRL vs. ${LDA}_{85\%}$	+0.308	+0.192	+0.533	+981.563	+0.465	+0.932
scRL vs. ${LDA}_{90\%}$	+0.253	+0.170	+0.434	+851.186	+0.350	+0.735
scRL vs. ${LDA}_{95\%}$	+0.153	+0.091	+0.251	+482.741	+0.172	+0.375

Table 4. scRL fate decision performance improvements over baseline methods on hematopoietic dataset.

Method	NMI	ARI	ASW	DB	CH	Overall
scRL vs. ${Diff}_{80\%}$	+0.121	+0.060	+0.224	+1.314	+966.124	+0.465
scRL vs. ${Diff}_{85\%}$	+0.115	+0.046	+0.213	+1.265	+940.620	+0.441
scRL vs. ${Diff}_{90\%}$	+0.100	+0.039	+0.256	+0.939	+927.518	+0.432
scRL vs. ${Diff}_{95\%}$	+0.037	−0.007	+0.103	+0.638	+720.043	+0.228
scRL vs. ${FA}_{80\%}$	+0.229	+0.142	+0.226	+2.317	+979.821	+0.647
scRL vs. ${FA}_{85\%}$	+0.195	+0.114	+0.202	+1.892	+926.712	+0.563
scRL vs. ${FA}_{90\%}$	+0.150	+0.076	+0.233	+1.090	+836.652	+0.466
scRL vs. ${FA}_{95\%}$	+0.080	+0.009	+0.186	+0.578	+648.564	+0.296
scRL vs. ${ICA}_{80\%}$	+0.346	+0.224	+0.243	+3.010	+995.896	+0.811
scRL vs. ${ICA}_{85\%}$	+0.344	+0.224	+0.241	+2.829	+968.163	+0.785
scRL vs. ${ICA}_{90\%}$	+0.325	+0.213	+0.231	+2.577	+926.928	+0.731
scRL vs. ${ICA}_{95\%}$	+0.258	+0.162	+0.206	+2.548	+837.487	+0.614
scRL vs. ${LscVI}_{80\%}$	+0.236	+0.170	+0.262	+1.989	+1014.079	+0.677
scRL vs. ${LscVI}_{85\%}$	+0.213	+0.155	+0.254	+1.674	+991.179	+0.627
scRL vs. ${LscVI}_{90\%}$	+0.169	+0.125	+0.239	+1.365	+952.584	+0.551
scRL vs. ${LscVI}_{95\%}$	+0.119	+0.078	+0.208	+1.228	+883.328	+0.454
scRL vs. ${NMF}_{80\%}$	+0.307	+0.262	+0.008	+0.095	+656.817	+0.455
scRL vs. ${NMF}_{85\%}$	+0.311	+0.258	+0.093	+0.143	+817.645	+0.540
scRL vs. ${NMF}_{90\%}$	+0.278	+0.237	+0.158	+0.169	+799.054	+0.544
scRL vs. ${NMF}_{95\%}$	+0.160	+0.171	+0.096	+0.023	+349.374	+0.297
scRL vs. ${PCA}_{80\%}$	+0.240	+0.166	+0.199	+2.434	+920.403	+0.651
scRL vs. ${PCA}_{85\%}$	+0.206	+0.150	+0.179	+2.095	+861.993	+0.579
scRL vs. ${PCA}_{90\%}$	+0.152	+0.110	+0.156	+1.697	+792.187	+0.474
scRL vs. ${PCA}_{95\%}$	+0.089	+0.059	+0.159	+1.205	+702.267	+0.362
scRL vs. ${scVI}_{80\%}$	+0.344	+0.226	+0.200	+1.731	+1061.220	+0.733
scRL vs. ${scVI}_{85\%}$	+0.325	+0.216	+0.195	+1.542	+1040.949	+0.698
scRL vs. ${scVI}_{90\%}$	+0.299	+0.202	+0.190	+1.390	+1020.921	+0.659
scRL vs. ${scVI}_{95\%}$	+0.259	+0.181	+0.185	+1.272	+998.612	+0.609

Table 5. scRL lineage contribution performance improvements over baseline methods across different cluster numbers on hematopoietic dataset.

Method	ASW	DB	CH	Overall
scRL vs. ${PCA}_{4}$	+0.247	+0.359	+5121.830	+0.353
scRL vs. ${ICA}_{4}$	+0.256	+0.447	+6269.940	+0.409
scRL vs. ${FA}_{4}$	+0.119	+0.100	+5613.054	+0.242
scRL vs. ${NMF}_{4}$	+0.114	+0.105	+816.813	+0.105
scRL vs. ${Diff}_{4}$	+0.041	+0.012	−2261.901	−0.048
scRL vs. ${scVI}_{4}$	+0.551	+1.761	+8639.039	+0.899
scRL vs. ${LscVI}_{4}$	+0.241	+0.350	+4669.743	+0.336
scRL vs. ${PCA}_{6}$	+0.124	+0.235	+5058.890	+0.329
scRL vs. ${ICA}_{6}$	+0.150	+0.312	+6535.205	+0.425
scRL vs. ${FA}_{6}$	+0.058	+0.023	+5792.529	+0.254
scRL vs. ${NMF}_{6}$	+0.042	+0.077	+2848.851	+0.131
scRL vs. ${Diff}_{6}$	−0.086	+0.212	+2454.264	−0.024
scRL vs. ${scVI}_{6}$	+0.368	+1.301	+7855.647	+0.859
scRL vs. ${LscVI}_{6}$	+0.119	+0.255	+5366.970	+0.337
scRL vs. ${PCA}_{8}$	+0.157	+0.310	+3963.771	+0.426
scRL vs. ${ICA}_{8}$	+0.189	+0.353	+5198.874	+0.535
scRL vs. ${FA}_{8}$	+0.120	+0.080	+4634.559	+0.375
scRL vs. ${NMF}_{8}$	+0.103	+0.196	+2651.048	+0.265
scRL vs. ${Diff}_{8}$	−0.047	+0.133	+3041.397	+0.091
scRL vs. ${scVI}_{8}$	+0.360	+1.106	+5999.717	+0.924
scRL vs. ${LscVI}_{8}$	+0.177	+0.320	+4190.457	+0.456
scRL vs. ${PCA}_{10}$	+0.137	+0.237	+3330.795	+0.379
scRL vs. ${ICA}_{10}$	+0.169	+0.280	+3284.407	+0.430
scRL vs. ${FA}_{10}$	+0.069	+0.112	+3547.165	+0.236
scRL vs. ${NMF}_{10}$	+0.061	+0.103	+2414.816	+0.210
scRL vs. ${Diff}_{10}$	−0.069	+0.161	+2894.393	+0.059
scRL vs. ${scVI}_{10}$	+0.308	+1.024	+4944.921	+0.856
scRL vs. ${LscVI}_{10}$	+0.161	+0.321	+3663.236	+0.447
scRL vs. ${PCA}_{12}$	+0.194	+0.465	+2821.714	+0.515
scRL vs. ${ICA}_{12}$	+0.235	+0.578	+3874.281	+0.694
scRL vs. ${FA}_{12}$	+0.096	+0.106	+2951.559	+0.284
scRL vs. ${NMF}_{12}$	+0.114	+0.240	+2419.564	+0.334
scRL vs. ${Diff}_{12}$	−0.064	+0.043	+2708.683	+0.120
scRL vs. ${scVI}_{12}$	+0.300	+1.019	+4055.950	+0.899
scRL vs. ${LscVI}_{12}$	+0.164	+0.311	+2998.675	+0.461

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
