Narrative extraction builds coherent, ordered sequences of documents that trace how concepts develop over time, and it is a growing area of information retrieval. In this work we focus on scientific literature, using a corpus of 3549 IEEE visualization research papers (1990–2022). A natural hypothesis is that augmenting embedding-based pathfinding with explicit domain knowledge should improve narrative quality. We present the Knowledge-Coherence Framework (KCF), which integrates structured metadata from OpenAlex into narrative extraction (building on the Narrative Trails algorithm), and we conduct a systematic empirical investigation along three axes: (1) the effect of embedding model choice (MiniLM vs. SPECTER), (2) the effect of knowledge augmentation (with and without, plus sensitivity to the knowledge weight α), and (3) the reliability of LLM-based evaluation (cross-agreement among 13 large language models). Throughout, mathematical coherence denotes the geometric mean of the angular and topic similarity between consecutive documents along a path, an automatic, model-computed quantity inherited from Narrative Maps and Narrative Trails, whereas narrative quality refers to the LLM-judged construct. Using up to 600 evaluation pairs, we find that embedding model choice has a large effect on mathematical coherence (SPECTER: 0.94 vs. MiniLM: 0.81) and that, contrary to expectations, knowledge augmentation does not improve LLM-judged narrative quality; it slightly decreases it for both embeddings. Notably, the two notions dissociate: SPECTER produces the most mathematically coherent paths, yet MiniLM paths receive the highest LLM narrative-quality scores (5.87 vs. 5.36 out of 10). A sensitivity analysis over five values of α (500 pairs) confirms that LLM scores remain essentially flat while mathematical coherence steadily declines as the knowledge weight increases. Cross-model evaluation with 13 LLM judges shows high inter-model agreement (median Pearson correlation), supporting evaluation reliability. The main practical takeaways are that (i) embedding model choice, not knowledge augmentation, is the more consequential design decision, and (ii) mathematical coherence and LLM-judged narrative quality are distinct optimization targets that practitioners should not conflate.