3.1. Problem Formulation
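Given N documents $d_1, \dots, d_N$ with embeddings ($e_i \in \mathbb{R}^{1536}$ from text-embedding-3-small), we seek a coherence function ($\theta(d_i, d_j)$) such that the maximum-capacity path, i.e., $P^{*} = \arg\max_{P \in \mathcal{P}(s,t)} \min_{(d_i, d_j) \in P} \theta(d_i, d_j)$ over all paths $\mathcal{P}(s,t)$ between a source document $s$ and a target document $t$, produces storylines that align with human judgments of narrative coherence.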
The classical Narrative Trails approach computes the coherence $\theta$ as a geometric mean of angular similarity (in UMAP-projected space) and topic similarity (from HDBSCAN clustering). We propose replacing this with a quantum kernel: $\theta(d_i, d_j) = k(\pi(e_i), \pi(e_j))$, where $e_i$ is the embedding of document $d_i$, $\pi$ projects embeddings to lower dimensions, and k computes quantum-state overlap.
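The maximum-capacity (bottleneck) path in the formulation above can be found with a max-min variant of Dijkstra's algorithm. The sketch below is only an illustration under stated assumptions: the function name `max_capacity_path`, the dictionary-based graph, and the toy coherence values are hypothetical and are not taken from the Narrative Trails implementation.

```python
import heapq

def max_capacity_path(coherence, source, target):
    """Return (bottleneck, path) maximising the minimum edge coherence.

    coherence: dict mapping node -> {neighbour: coherence value > 0}.
    This is the widest-path (max-min) variant of Dijkstra's algorithm.
    """
    best = {source: float("inf")}       # best bottleneck found so far per node
    parent = {source: None}
    heap = [(-best[source], source)]    # max-heap via negated capacities
    while heap:
        neg_cap, u = heapq.heappop(heap)
        cap = -neg_cap
        if u == target:
            break
        if cap < best.get(u, 0.0):
            continue                    # stale heap entry
        for v, w in coherence[u].items():
            new_cap = min(cap, w)       # a path is as strong as its weakest edge
            if new_cap > best.get(v, 0.0):
                best[v] = new_cap
                parent[v] = u
                heapq.heappush(heap, (-new_cap, v))
    if target not in parent:
        return 0.0, []
    path, node = [], target
    while node is not None:
        path.append(node)
        node = parent[node]
    return best[target], path[::-1]

# Hypothetical toy coherence graph (symmetric, values invented for illustration).
toy = {
    "a": {"b": 0.9, "c": 0.4},
    "b": {"a": 0.9, "c": 0.8, "d": 0.3},
    "c": {"a": 0.4, "b": 0.8, "d": 0.7},
    "d": {"b": 0.3, "c": 0.7},
}
bottleneck, path = max_capacity_path(toy, "a", "d")
```

In this toy graph the direct routes through `b` or `c` alone are limited by a weak edge, so the extracted path takes the longer route whose weakest link is strongest. Because only comparisons between edge values are used, any strictly increasing rescaling of the coherence values leaves the extracted path unchanged.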
To understand which quantum kernel configurations are effective for narrative coherence, we employ a factorial experimental design covering 93 method evaluations: 54 quantum kernels across three encoder families ($R_Y$+CNOT-ring data re-uploading, IQP/ZZ-feature-map, and the projected quantum kernel of Huang et al. [29]) and 39 classical kernels (cosine, RBF, and the cluster-aware Narrative Trails baseline) on matched projected inputs and on the raw 1536-dimensional embeddings.
3.2. Datasets and Evaluation Metrics
We evaluate on two datasets. The Wikispeedia dataset [16] contains human navigation paths through a subset of Wikipedia: records of paths that people actually traversed when asked to navigate between articles. This provides a supervised ground truth for what constitutes a reasonable inter-article sequence. The dataset contains 3928 articles with 11,215 complete human navigation paths, which we use in full for evaluation. For each method, we report five metrics from the Narrative Trails reference implementation [2], all built on Dynamic Time Warping (DTW) [40] or set overlap. DTW computes the minimum-cost monotone alignment between two sequences ($a_1, \dots, a_m$ and $b_1, \dots, b_n$) by filling a cost table ($D$) with $D_{i,j} = c(a_i, b_j) + \min(D_{i-1,j},\, D_{i,j-1},\, D_{i-1,j-1})$, where c is the per-element cost; the bottom-right cell ($D_{m,n}$) is the total alignment cost, and the back-trace gives the alignment of length $K$. The cost $c(a_i, b_j)$ is computed from $e_{a_i}$ and $e_{b_j}$, where $e$ is the embedding in the original 1536-dim space. We report: (i) the length-normalised DTW distance (nDTW; lower is better; $D_{m,n}/K$, the per-aligned-step cost), (ii) the nDTW similarity (1 minus the mean per-step cost along the alignment), (iii) the per-step DTW similarity (length-independent mean of the per-step similarity over each transition in the extracted path), (iv) the Jaccard overlap between the extracted-path article set and the human-path article set, and (v) the harmonic-mean F1 of precision and recall on those same sets. As we show in Section 4, these metrics fall into two clusters that rank methods differently, so we report both.
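The DTW recurrence and the set-overlap metrics described above can be sketched in a few lines. This is a minimal illustration rather than the Narrative Trails reference code: the function names are hypothetical, and cosine distance is assumed as the per-element cost purely for the example.

```python
import math

def cosine_distance(u, v):
    """Assumed per-element cost: 1 - cosine similarity of two embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def dtw(seq_a, seq_b, cost=cosine_distance):
    """Return (total alignment cost D[m][n], alignment length K)."""
    m, n = len(seq_a), len(seq_b)
    inf = float("inf")
    D = [[inf] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            c = cost(seq_a[i - 1], seq_b[j - 1])
            D[i][j] = c + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    # Back-trace from the bottom-right cell, counting aligned steps.
    i, j, K = m, n, 0
    while i > 0 and j > 0:
        K += 1
        _, i, j = min((D[i - 1][j - 1], i - 1, j - 1),
                      (D[i - 1][j], i - 1, j),
                      (D[i][j - 1], i, j - 1))
    return D[m][n], K

def ndtw(seq_a, seq_b):
    """Length-normalised DTW distance: total cost per aligned step."""
    total, K = dtw(seq_a, seq_b)
    return total / K

def jaccard_f1(extracted, human):
    """Jaccard overlap and F1 between two article sets."""
    e, h = set(extracted), set(human)
    inter = len(e & h)
    jaccard = inter / len(e | h)
    precision, recall = inter / len(e), inter / len(h)
    f1 = 0.0 if inter == 0 else 2 * precision * recall / (precision + recall)
    return jaccard, f1

# Toy 2-d "embeddings": the second path repeats its first article once.
path_a = [(1.0, 0.0), (0.0, 1.0)]
path_b = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
total_cost, steps = dtw(path_a, path_b)
ndtw_similarity = 1.0 - ndtw(path_a, path_b)
jac, f1 = jaccard_f1(["A", "B", "C"], ["B", "C", "D"])
```

The toy paths differ only by a repeated article, which DTW absorbs at zero cost by aligning one element of the shorter path to two elements of the longer one.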
For unsupervised evaluation, we use a corpus of 418 news articles about Cuban political events from 2020–2021 [17]. Without ground-truth paths, we rely on an LLM-as-a-judge evaluation approach for coherence [41]. Each storyline is scored on four dimensions (logical flow, thematic consistency, temporal coherence, and narrative completeness), each on a 1–10 scale. Two judges, Claude Sonnet 4.5 and GPT-4o, each score every storyline independently at a fixed temperature via structured tool/function calling that emits exactly the four integer scores; no free-form justification is generated. Therefore, each cell (method, seed, and endpoint pair) yields one score per dimension per judge. We evaluate 30 spatially distant endpoint pairs, each sampled once and reused across all method–seed–judge combinations to enable paired comparisons. For each method we use 5 random projection seeds (42, 123, 456, 789, and 1024) so that conclusions are not driven by a single projection draw. For supervised evaluation, we report the nDTW, per-step similarity, Jaccard overlap, and F1 score; for unsupervised evaluation, we report the mean overall coherence and per-dimension scores with paired Wilcoxon tests against the classical baseline.
Note that coherence scores from different methods are not directly comparable in absolute terms—each defines its own similarity space. The path extraction algorithm uses only the ranking of document pairs, so what matters is whether a method’s rankings produce paths that align with human behaviour or external quality judgements.
Both judges receive the same prompt, shown in Figure 1; the only difference is the underlying API endpoint (Anthropic Messages API for Sonnet 4.5; OpenAI Chat Completions API for GPT-4o).
3.3. Quantum Coherence Kernels
We replace the coherence function with a quantum kernel. The classical approach computes coherence as a geometric mean of angular similarity (in UMAP space) and topic similarity (from clustering). Our quantum approach computes coherence directly from kernel similarity: given two documents, we project their embeddings to low dimensions, encode them via a quantum circuit, and measure the overlap of the resulting representations. Higher overlap indicates more similar documents, yielding higher coherence.
The bottleneck path formulation requires accurate similarity measurement for all document pairs, including subtle relationships that simple similarity measures might miss. Quantum kernels map inputs into an exponentially larger feature space through a parameterised nonlinear transformation. Whether this expressivity translates to better coherence measurement is an empirical question our experiments address.
We note that there are practical constraints that shape our approach. Document embeddings from OpenAI’s text-embedding-3-small have 1536 dimensions, but encoding high-dimensional data into quantum circuits is computationally intractable—each dimension adds circuit depth, and simulation of deep circuits scales exponentially. We therefore project embeddings to between 2 and 64 dimensions before computing kernel similarity, with the projection method becoming a key experimental variable.
We use a layered encoding circuit based on incremental data uploading [34], where input features are distributed across successive layers with entanglement between them, increasing the expressivity of the quantum feature map. For n qubits and L layers, the circuit encodes an $nL$-dimensional input vector ($z \in \mathbb{R}^{nL}$). Each layer ($\ell = 1, \dots, L$) consists of the following:
Rotation gates: Each qubit (i) receives an $R_Y$ rotation parameterised by input feature $z_{(\ell-1)n+i}$, where $R_Y(\theta)$ rotates the qubit state around the Y-axis of the Bloch sphere.
Entangling gates: CNOT gates in a ring topology connect neighbouring qubits, creating entanglement that allows the circuit to represent correlations between features.
For n qubits and a d-dimensional input, each layer encodes n distinct features from the input vector, requiring $L = d/n$ layers to encode all d features. The entanglement gates between layers allow the circuit to capture correlations across feature groups.
Before encoding, projected features are normalized globally (per feature dimension) to a fixed angular range across all documents. This bounded range is required by the quantum encoding: the projected coordinates are used directly as $R_Y$ rotation angles, and angles outside it alias onto identical states. The classical kernel baselines (cosine and RBF) do not need this normalization, but we apply the same projection-and-normalization pipeline to them so that the quantum-versus-classical comparison is matched on input format and isolates the effect of the kernel rather than the input scale. Global (as opposed to per-document) normalization preserves relative positions between documents in each feature dimension, ensuring the rotation angles reflect meaningful inter-document differences. The kernel is then read out via the compute–uncompute construction of Equations (2) and (3): applied to the layered encoder described above, this means a forward pass of L $R_Y$+CNOT-ring layers parameterised by $z$, followed by the adjoint of the same circuit parameterised by $z'$ (reversed layer order, reversed CNOT order, and negated angles), with the all-zeros bitstring probability serving as the kernel value and, therefore, as the coherence score.
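As a minimal illustration of the pipeline just described, the sketch below normalises projected features per dimension, builds the layered $R_Y$+CNOT-ring unitary as a dense matrix, and reads out the all-zeros probability of the compute–uncompute product. It is not the reference implementation: the function names, the target range $[0, \pi]$, the convention $R_Y(\theta) = e^{-i\theta Y/2}$, the qubit ordering, and the dense-matrix simulation (practical only for a few qubits) are all assumptions made for illustration.

```python
import numpy as np

def normalise(Z, lo=0.0, hi=np.pi):
    """Global per-feature min-max scaling of an (N, d) array to [lo, hi].

    The range [0, pi] is an assumed example value, not taken from the paper.
    """
    Z = np.asarray(Z, dtype=float)
    zmin, zmax = Z.min(axis=0), Z.max(axis=0)
    span = np.where(zmax > zmin, zmax - zmin, 1.0)   # guard constant columns
    return lo + (hi - lo) * (Z - zmin) / span

def ry(theta):
    """Single-qubit R_Y(theta) = exp(-i * theta * Y / 2)."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

def cnot(control, target, n):
    """Dense CNOT on n qubits; qubit 0 is the most significant bit."""
    dim = 2 ** n
    U = np.zeros((dim, dim))
    for b in range(dim):
        bits = [(b >> (n - 1 - q)) & 1 for q in range(n)]
        if bits[control]:
            bits[target] ^= 1
        b2 = sum(bit << (n - 1 - q) for q, bit in enumerate(bits))
        U[b2, b] = 1.0
    return U

def ring(n):
    """Sequential CNOT sweep q -> q+1, closed by the wrap-around (n-1) -> 0."""
    G = np.eye(2 ** n)
    for q in range(n):
        G = cnot(q, (q + 1) % n, n) @ G
    return G

def encoder(z, n):
    """Unitary of L = len(z) / n layers, each a rotation block then a ring."""
    U, G = np.eye(2 ** n), ring(n)
    for layer in np.asarray(z, dtype=float).reshape(-1, n):
        R = np.array([[1.0]])
        for theta in layer:            # n consecutive features per layer
            R = np.kron(R, ry(theta))
        U = G @ R @ U
    return U

def kernel(z1, z2, n):
    """Compute-uncompute readout: P(all zeros) after U(z1), then U(z2)^dagger."""
    zero = np.zeros(2 ** n)
    zero[0] = 1.0
    out = encoder(z2, n).conj().T @ encoder(z1, n) @ zero
    return float(abs(out[0]) ** 2)

rng = np.random.default_rng(42)
Z = normalise(rng.normal(size=(5, 4)))   # 5 toy documents, d = 4 features
k_depth1 = kernel(Z[0], Z[1], n=4)       # n = 4 qubits, L = 1 layer
k_depth2 = kernel(Z[0], Z[1], n=2)       # n = 2 qubits, L = 2 layers
```

Taking the conjugate transpose of the encoder unitary reproduces the reversed layer order, reversed CNOT order, and negated angles of the adjoint circuit in a single step.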
Figure 2 shows the encoding architecture at three illustrative sizes (the two-qubit case $n = 2$, a larger fixed size, and a generic n). The three panels illustrate how the CNOT-ring entangler closes for any n: in panel (b), the ring wraps from the last qubit back to the first, and the same wrap rule applies for an arbitrary n (panel c). The $n = 2$ case (panel a) is drawn schematically as a single CNOT but is degenerate: the sequential sweep and the wrap-around CNOT act on the same wire pair in opposite directions, so the implementation applies the pair of oppositely directed CNOTs per layer; the figure abstracts these as one CNOT for visual symmetry with the larger panels. For an input of dimension $d = nL$, each of the L layers consumes n consecutive features as $R_Y$ rotation angles, and the CNOT ring after each layer entangles them with the qubits in the next layer’s rotation block. The data re-uploading pattern (rotation block, entangler, rotation block, entangler, …) is the source of the kernel’s non-trivial dependence on the input at depth $L \ge 2$; at depth 1, in contrast, the entangler (G) appears once on each side of the compute–uncompute product, and the resulting $G^{\dagger} G = I$ collapses the kernel into a per-feature product (this is the source of the depth-1 cancellation we make precise in Section 4.1). The figure is therefore not just a schematic: the visual symmetry between the forward $R_Y$+CNOT-ring block and its adjoint is precisely the structural feature that the cancellation theorem exploits.
What does the kernel actually measure? The compute–uncompute readout is operationally simple (the probability of the all-zeros bitstring), but its information content is not opaque. For the depth-1 $R_Y$+CNOT-ring family, Theorem 1 (Section 4.1) gives the closed form, i.e., $k(z, z') = \prod_{i=1}^{n} \cos^{2}\big((z_i - z'_i)/2\big)$, which is a product of per-feature angle similarities. The kernel is therefore directly interpretable: it is a smooth, monotone function of the per-feature angular displacement, with the product structure encoding “all features must agree” in the same way a Gaussian RBF does. Higher-depth and non-product encodings (IQP and PQK) lift this product structure, but the depth-1 case anchors the family in classical kernel intuition rather than treating the quantum circuit as a black box.
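The depth-1 product form can be checked by hand. The short derivation below is a sketch that assumes the convention $R_Y(\theta) = e^{-i\theta Y/2}$ and a single layer made of one rotation block followed by one CNOT ring $G$; the formal statement and proof are those of Theorem 1 in Section 4.1.

```latex
\begin{aligned}
k(z, z') &= \left| \langle 0^{n} |\, U(z')^{\dagger} U(z) \,| 0^{n} \rangle \right|^{2},
  \qquad U(z) = G \bigotimes_{i=1}^{n} R_Y(z_i), \\
&= \left| \langle 0^{n} | \Big[ \bigotimes_{i=1}^{n} R_Y(z'_i)^{\dagger} \Big]
   G^{\dagger} G \Big[ \bigotimes_{i=1}^{n} R_Y(z_i) \Big] | 0^{n} \rangle \right|^{2} \\
&= \prod_{i=1}^{n} \left| \langle 0 |\, R_Y(z_i - z'_i) \,| 0 \rangle \right|^{2}
 = \prod_{i=1}^{n} \cos^{2}\!\left( \frac{z_i - z'_i}{2} \right).
\end{aligned}
```

The second line uses $G^{\dagger} G = I$, and the third uses $R_Y(a)^{\dagger} R_Y(b) = R_Y(b - a)$ together with $\langle 0 | R_Y(\theta) | 0 \rangle = \cos(\theta/2)$.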
3.4. Two Alternative Kernel Families
Beyond the data re-uploading $R_Y$+CNOT-ring kernel just described (henceforth referred to as the q_* family), we also evaluate two architecturally distinct quantum kernels chosen to break specific structural properties of that family.
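IQP/ZZ-feature-map kernel (iqpML_*): A single-layer instantaneous quantum polynomial (IQP) encoding [4] in the form of $U_{\mathrm{IQP}}(z) = \Big[\prod_{(i,j) \in E} R_{ZZ}^{(i,j)}(z_i z_j)\Big] \Big[\bigotimes_{i=1}^{n} R_Z(z_i)\Big] H^{\otimes n}$, where $R_{ZZ}^{(i,j)}$ couples qubits $(i, j)$ along a ring entangler pattern (E) and H is the Hadamard gate. Unlike $R_Y$+CNOT-ring, the $R_{ZZ}$ rotation angles depend on the product ($z_i z_j$), so the compute–uncompute product ($U_{\mathrm{IQP}}(z')^{\dagger} U_{\mathrm{IQP}}(z)$) contains phases in the form of $z_i z_j - z'_i z'_j$, which do not factor into a function of $z - z'$ alone. We use the multilayer variant (iqpML_random_8q_2L_16d), which stacks two such encodings to enable data re-uploading of 16 features into 8 qubits.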
Projected quantum kernel (pqk_*): Following Huang et al. [29], we reuse the same $R_Y$+CNOT-ring data re-uploading circuit described above and replace the compute–uncompute fidelity readout with a classical kernel on local Pauli expectations. For each input (z), we compute $\phi(z) = \big(\langle \psi(z) | P_i | \psi(z) \rangle\big)_{i = 1, \dots, n;\, P}$, with $P$ ranging over the measured single-qubit Pauli operators, i.e., the vector of single-qubit Pauli expectation values in the encoded state ($|\psi(z)\rangle$). The projected quantum kernel is then the classical RBF kernel on these expectations: $k_{\mathrm{PQK}}(z, z') = \exp\big(-\gamma \lVert \phi(z) - \phi(z') \rVert^{2}\big)$, with bandwidth $\gamma > 0$. The motivation in [29] is that PQK retains some quantum non-classicality (the encoded state $|\psi(z)\rangle$ is genuinely quantum) while avoiding the exponential concentration that plagues fidelity kernels at depth.
Figure 3 illustrates the IQP encoding at an illustrative qubit count with a single layer: a Hadamard column rotates the qubits onto the X eigenstates, single-qubit $R_Z$ rotations imprint each input feature, and a ring of $R_{ZZ}$ entanglers couples adjacent qubits using pairwise products ($z_i z_j$) of features. The product structure of the entangler angles is exactly the property that lets IQP escape the depth-1 cancellation theorem (Proposition 3).
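Because every gate after the Hadamard column is diagonal, a single-layer IQP fidelity kernel can be simulated by tracking one phase per basis state. The sketch below illustrates this under stated assumptions: the gate conventions $R_Z(t) = e^{-itZ/2}$ and $R_{ZZ}(t) = e^{-itZZ/2}$, the unscaled angles $z_i$ and $z_i z_j$, and the function names are all hypothetical and may differ from the circuits actually used.

```python
import numpy as np

def ring_pairs(n):
    """Distinct neighbouring pairs (i, i+1 mod n) of a ring on n qubits."""
    return sorted({tuple(sorted((i, (i + 1) % n))) for i in range(n)})

def iqp_state(z):
    """State [R_ZZ ring][R_Z column] H^n |0...0> for an n-feature input z."""
    z = np.asarray(z, dtype=float)
    n = len(z)
    # Z eigenvalue (+1 or -1) of every qubit on every computational basis state.
    bits = (np.arange(2 ** n)[:, None] >> np.arange(n)[::-1]) & 1
    s = 1 - 2 * bits
    phase = s @ z                                    # sum_i z_i * s_i
    for i, j in ring_pairs(n):                       # sum over ring edges
        phase = phase + z[i] * z[j] * s[:, i] * s[:, j]
    # H^n |0...0> is the uniform superposition; diagonal gates only add phases.
    return np.exp(-0.5j * phase) / np.sqrt(2 ** n)

def iqp_kernel(z1, z2):
    """Fidelity |<psi(z2)|psi(z1)>|^2, equal to the compute-uncompute readout."""
    return float(abs(np.vdot(iqp_state(z2), iqp_state(z1))) ** 2)

# Toy inputs (values invented for illustration).
z_a = np.array([0.3, 1.1, 2.0])
z_b = np.array([0.8, 0.2, 1.5])
shift = 0.7
k_original = iqp_kernel(z_a, z_b)
k_shifted = iqp_kernel(z_a + shift, z_b + shift)     # same difference z_a - z_b
```

Shifting both inputs by the same amount leaves $z - z'$ unchanged, yet the kernel value changes because the entangler phases $z_i z_j - z'_i z'_j$ do not depend on the difference alone. The depth-1 $R_Y$+CNOT-ring kernel would return the same value for both pairs.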
Figure 4 illustrates the projected quantum kernel: the encoding is the same $R_Y$+CNOT-ring data re-uploading circuit as in Figure 2, but the compute–uncompute readout is replaced with a per-qubit local Pauli measurement of the encoded state ($|\psi(z)\rangle$). The resulting vector of Pauli expectations is fed into a classical RBF kernel; this side-steps the depth-induced concentration of fidelity readouts (Remark 1) at the cost of computing a classical kernel on top of a quantum representation.
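The projected readout can be sketched independently of the encoder: given any n-qubit statevector, compute the single-qubit Pauli expectations and pass them to an RBF kernel. The example below is an illustration under assumptions: it measures X, Y, and Z on every qubit (a 3n-dimensional vector), uses a hypothetical bandwidth `gamma`, and demonstrates the readout on two hand-built two-qubit states rather than on encoded documents.

```python
import numpy as np

PAULIS = {
    "X": np.array([[0, 1], [1, 0]], dtype=complex),
    "Y": np.array([[0, -1j], [1j, 0]], dtype=complex),
    "Z": np.array([[1, 0], [0, -1]], dtype=complex),
}

def pauli_expectations(state, n):
    """Single-qubit <X>, <Y>, <Z> for each of n qubits (qubit 0 = leftmost)."""
    psi = np.asarray(state, dtype=complex).reshape([2] * n)
    feats = []
    for q in range(n):
        # Bring qubit q to the front and flatten the rest: shape (2, 2**(n-1)).
        m = np.moveaxis(psi, q, 0).reshape(2, -1)
        rho = m @ m.conj().T                  # reduced density matrix of qubit q
        feats += [np.trace(rho @ P).real for P in PAULIS.values()]
    return np.array(feats)

def pqk(state_a, state_b, n, gamma=1.0):
    """Classical RBF kernel on the projected (local Pauli) features."""
    diff = pauli_expectations(state_a, n) - pauli_expectations(state_b, n)
    return float(np.exp(-gamma * diff @ diff))

# Two hand-built 2-qubit states: |00> and the Bell state (|00> + |11>) / sqrt(2).
zero = np.array([1, 0, 0, 0], dtype=complex)
bell = np.array([1, 0, 0, 1], dtype=complex) / np.sqrt(2)
features_zero = pauli_expectations(zero, 2)
features_bell = pauli_expectations(bell, 2)
k_zero_bell = pqk(zero, bell, 2)
```

The Bell state shows what the projection discards: each qubit on its own is maximally mixed, so all of its local expectations vanish even though the global state is pure.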
3.5. Factorial Experimental Design
To systematically investigate how projection type, qubit count, and circuit depth affect quantum kernel performance, we employ a full factorial design on the supervised Wikispeedia benchmark. We test the following:
Projection types: UMAP and random Gaussian projection;
Qubit counts: 2, 4, and 8 qubits;
Layer counts: 1, 2, 4, 6, and 8 encoding layers.
The projection dimensionality is determined by qubits × layers, yielding feature dimensions ranging from 2 to 64. This gives 30 $R_Y$+CNOT-ring quantum configurations (2 projections × 3 qubit counts × 5 layer counts).
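The grid above can be enumerated directly, which makes the count of 30 configurations and the 2-to-64 dimension range easy to verify. The label pattern `q_{projection}_{n}q_{d}d` is inferred from method names that appear in the text (for example `q_umap_4q_4d`) and should be read as an assumption.

```python
from itertools import product

projections = ["umap", "random"]
qubit_counts = [2, 4, 8]
layer_counts = [1, 2, 4, 6, 8]

# One entry per (projection, qubits, layers) cell; dimension = qubits * layers.
configs = [
    {
        "projection": p,
        "qubits": n,
        "layers": L,
        "dim": n * L,
        "label": f"q_{p}_{n}q_{n * L}d",   # assumed naming pattern
    }
    for p, n, L in product(projections, qubit_counts, layer_counts)
]

dims = sorted({cfg["dim"] for cfg in configs})
labels = [cfg["label"] for cfg in configs]
```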
To isolate the contribution of the quantum kernel from input-format effects, we evaluate the same coherence-graph pipeline with classical kernels on identical projected inputs and with two architecturally distinct quantum kernel families. The classical kernel baselines are cosine and RBF applied to Gaussian random projections in nine target dimensions (2, 4, 8, 12, 16, 24, 32, 48, and 64) and to UMAP projections in the same nine dimensions, plus cosine and RBF computed directly on the raw 1536-dim text-embedding-3-small representation. We label the projected variants as cos_random_{d}d, rbf_random_{d}d, cos_umap_{d}d, and rbf_umap_{d}d, where d is the target dimension; all consume the same projected and range-normalised input as the quantum kernels, with no qubits in the kernel itself. The raw-embedding variants (cos_raw_1536d and rbf_raw_1536d) sit at the other end of the projection axis: because text-embedding-3-small is itself a pre-trained transformer encoder, cosine on its raw output is the natural transformer-based “neural similarity” baseline. The two alternative quantum kernel families (IQP/ZZ-feature-map and the projected quantum kernel of Huang et al. [29]) probe whether the empirical findings extend beyond $R_Y$+CNOT-ring to encoders that escape the depth-1 cancellation theorem proven in Section 4.1. They are evaluated on a subset of the qubit-and-depth grid used for $R_Y$+CNOT-ring (14 IQP and 10 PQK variants spanning 4 and 8 qubits at 1, 2, 4, and 8 layers; the omitted depths are listed in Section S1 of the Supplementary Material). Together with the cluster-aware classical baseline (classical_NT), the totals are: 1 classical_NT + 30 $R_Y$+CNOT + 14 IQP + 10 PQK + 36 cosine/RBF on projected inputs + 2 cosine/RBF on raw embeddings = 93 methods.
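The method tally can be reproduced with a short script, which also confirms the 54 quantum / 39 classical split quoted in Section 3.1. The projected-baseline labels are generated from an assumed pattern `{kernel}_{projection}_{d}d`; the family counts (30, 14, 10) are taken from the text rather than derived.

```python
from itertools import product

target_dims = [2, 4, 8, 12, 16, 24, 32, 48, 64]

# Cosine and RBF on both projections at nine target dimensions.
projected = [
    f"{kernel}_{proj}_{d}d"                     # assumed naming pattern
    for kernel, proj, d in product(["cos", "rbf"], ["random", "umap"], target_dims)
]
raw = ["cos_raw_1536d", "rbf_raw_1536d"]

counts = {
    "classical_NT": 1,
    "ry_cnot": 30,        # counts for the quantum families are quoted from the text
    "iqp": 14,
    "pqk": 10,
    "cos_rbf_projected": len(projected),
    "cos_rbf_raw": len(raw),
}

n_quantum = counts["ry_cnot"] + counts["iqp"] + counts["pqk"]
n_total = sum(counts.values())
n_classical = n_total - n_quantum
```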
For the unsupervised LLM evaluation, the same 30 endpoint pairs are scored across a 12-method subset chosen by aggregate rank on the supervised metrics: the classical baseline, the headline UMAP-projected quantum kernel (q_umap_4q_4d), three random-projection quantum kernels (q_random_4q_8d, q_random_8q_16d, and a multi-layer IQP variant iqpML_random_8q_2L_16d), one projected quantum kernel (pqk_random_8q_8L_64d) [29], and three target dimensions each of the cosine and RBF classical kernel baselines on random projection. This subset gives 12 methods × 5 seeds × 30 pairs = 1800 distinct storylines, each scored once by each of the 2 judges, for 3600 LLM calls in total.
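The size of the LLM evaluation follows from enumerating its cells. In the sketch below, methods and endpoint pairs are represented by placeholder indices, and the judge strings are descriptive labels rather than API model identifiers.

```python
from itertools import product

n_methods = 12
seeds = [42, 123, 456, 789, 1024]
n_pairs = 30
judges = ["Claude Sonnet 4.5", "GPT-4o"]   # descriptive labels only

# One storyline per (method, seed, endpoint pair) cell.
storylines = list(product(range(n_methods), seeds, range(n_pairs)))

# Every storyline is scored once by each judge.
llm_calls = list(product(storylines, judges))
```

Because the 30 endpoint pairs are shared across methods and seeds, each method contributes 150 storylines that can be paired one-to-one with the classical baseline's storylines for the Wilcoxon tests.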