1. Introduction
The extraction and evaluation of coherent narratives from visual collections have emerged as a challenge in digital humanities, computational journalism, and cultural heritage preservation [1,2]. While progress has been made in evaluating text-based narratives through mathematical coherence metrics and LLM-as-a-judge approaches [3,4], the assessment of visual narrative sequences remains largely unexplored. This gap is pronounced in domains such as historical photograph analysis, where understanding the narrative connections between images supports knowledge discovery and archival organization.
Visual narratives, defined as sequences of images that form coherent stories, pose evaluation challenges compared to their textual counterparts. Traditional approaches rely on mathematical coherence metrics based on embedding similarities, which require computational resources and technical expertise to interpret [5]. When a visual narrative achieves a coherence score of 0.73, domain experts such as historians, archivists, and journalists often lack the mathematical background needed to interpret angular similarities in embedding spaces or Jensen–Shannon divergence calculations.
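To make the kind of score referenced above concrete, the sketch below shows one common way such an embedding-based coherence value can be computed: the average (and minimum) cosine similarity over consecutive image embeddings. This is a minimal illustration under our own assumptions, not the exact metric of the cited work; the embedding model and aggregation choices are placeholders.

```python
import numpy as np

def pairwise_coherence(embeddings: np.ndarray) -> np.ndarray:
    """Cosine similarity between consecutive image embeddings in a narrative."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return np.sum(normed[:-1] * normed[1:], axis=1)

def narrative_coherence(embeddings: np.ndarray) -> tuple[float, float]:
    """Average and minimum coherence over the narrative's consecutive transitions."""
    sims = pairwise_coherence(embeddings)
    return float(sims.mean()), float(sims.min())

# Toy example: a 5-image narrative with 512-dimensional placeholder embeddings.
rng = np.random.default_rng(0)
avg_coherence, min_coherence = narrative_coherence(rng.normal(size=(5, 512)))
```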
Recent advances in vision-language models (VLMs) have opened possibilities for multimodal evaluation tasks [6]. The success of LLM-as-a-judge approaches in various domains suggests that similar paradigms might be applicable to visual content [7]. However, a question remains: Should we evaluate visual narratives by first converting them to textual descriptions, or can direct vision assessment provide different results?
This work investigates two contrasting approaches to visual narrative evaluation. The first is a caption-based method that generates textual descriptions of images before applying LLM evaluation. The second is a direct VLM approach that assesses image sequences without intermediate text generation. Through experiments on historical photographs from the ROGER dataset [5], we address three research questions. First, we examine whether VLM-as-a-judge approaches can serve as proxies for mathematical coherence metrics in visual narrative evaluation. Second, we investigate how caption-based and direct vision approaches compare in terms of correlation with mathematical coherence, inter-rater reliability, and computational efficiency. Third, we explore the practical implications of choosing between these approaches for real-world applications.
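The two pipelines can be summarized schematically as follows. This is a structural sketch only; the captioner, judge backends, and prompt wording are hypothetical stand-ins rather than the exact components used in our experiments.

```python
from typing import Callable, Sequence

# Hypothetical interfaces; any captioning model, text LLM, or VLM could fill these roles.
CaptionFn = Callable[[str], str]                      # image path -> caption
TextJudge = Callable[[str], float]                    # text prompt -> coherence score
VisionJudge = Callable[[Sequence[str], str], float]   # image paths + prompt -> score

def caption_based_score(images: Sequence[str], caption_fn: CaptionFn, judge: TextJudge) -> float:
    """Pipeline 1: describe each image, then ask a text-only LLM judge."""
    captions = [caption_fn(path) for path in images]  # one-time, cacheable step
    prompt = ("Rate the narrative coherence (1-10) of this image sequence:\n"
              + "\n".join(f"{i + 1}. {c}" for i, c in enumerate(captions)))
    return judge(prompt)

def direct_vlm_score(images: Sequence[str], judge: VisionJudge) -> float:
    """Pipeline 2: pass the raw image sequence directly to a vision-language judge."""
    return judge(images, "Rate the narrative coherence (1-10) of this image sequence.")
```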
The main contributions of our work are as follows: (i) a systematic comparison of caption-based versus direct vision approaches, revealing trade-offs between speed, reliability, and correlation with mathematical metrics; (ii) evidence that both approaches achieve similar correlations with mathematical coherence (r ≈ 0.35) despite operating in different modalities, suggesting that narrative quality assessment transcends specific representation formats; (iii) evidence that VLM-based evaluation achieves higher inter-rater reliability (ICC = 0.718) than caption-based approaches (ICC = 0.339), albeit at increased computational cost; and (iv) a demonstration that both methods discriminate between human-curated, algorithmically extracted, and random narratives with statistical significance.
4. Results
We evaluated 126 image narratives from the ROGER dataset using caption-based and direct VLM approaches, with each narrative assessed by three independent judges to ensure reliability. Our evaluation encompassed three narrative sources: human-curated baselines (n = 6), algorithmically extracted sequences produced by the Narrative Maps algorithm (n = 60), and random chronological paths (n = 60), providing a clear quality gradient for the assessment.
Figure 6 presents an analysis of our multi-judge evaluation results. The mean scores by source (panel a) demonstrate clear quality stratification: human baselines achieve the highest scores under both caption-based and VLM evaluation, followed by Narrative Maps, with random sequences scoring lowest. Judge disagreement analysis (panel b) shows superior agreement for VLM evaluation, with consistently lower within-narrative standard deviations across all sources.
The score distributions (panel c) and the differences from the human baseline (panel d) confirm that both methods successfully discriminate between narrative types, although the VLM evaluation shows greater separation between quality levels. The caption-versus-VLM scatter plot (panel e) shows only a weak correlation between the methods, suggesting that they capture different aspects of narrative coherence. In particular, the inter-judge reliability metrics (panel f) show higher reliability for VLM-based evaluation (ICC = 0.718) than for caption-based evaluation (ICC = 0.339).
The effect size analysis (panel g) confirms statistical significance across all pairwise comparisons, while the consistency measurements of the judges (panel h) show that the VLM judges achieve better agreement with lower score ranges per narrative. These patterns establish that while both approaches identify narrative quality gradients, they operate through distinct evaluation mechanisms—caption-based focuses on semantic coherence and VLM emphasizes visual continuity.
4.1. Overall Performance Comparison
Table 1 presents evaluation scores for all narrative types, demonstrating that both the VLM-based and caption-based methods successfully identify narrative quality gradients.
The results validate our evaluation framework across all three assessment approaches. Human-curated narratives consistently achieve the highest scores, showing that expert-selected sequences represent the quality benchmark. Narrative Maps extractions score above random sampling yet below human curation, confirming that the algorithmic extraction method produces coherent narratives superior to chance while not reaching expert-level selection. Random chronological sampling produces the lowest scores on all metrics and serves as an effective lower bound. This consistent ranking pattern (Human > Narrative Maps > Random) holds across caption-based evaluation, VLM-based evaluation, and mathematical coherence metrics, providing convergent validity for our assessment methods. While the VLM approach uses a different scoring scale (systematically 0.61–1.86 points higher), both evaluation methods and the mathematical coherence metric agree on relative narrative quality, suggesting that our VLM-as-a-judge approaches successfully capture the same underlying narrative coherence that mathematical metrics measure.
4.2. Correlation with Mathematical Coherence
Despite operating in different modalities, both approaches achieve similar correlations with mathematical coherence metrics, as presented in Table 2.
The correlation magnitudes of approximately 0.35 warrant contextual interpretation. These moderate correlations are reasonable given that (1) a perfect correlation would indicate redundancy between mathematical and judge-based metrics, (2) the consistent, statistically significant discrimination between narrative types demonstrates practical utility regardless of absolute correlation values, and (3) the evaluation bridges fundamentally different assessment paradigms (embedding geometry versus semantic judgment). These correlations suffice for our primary goal of establishing whether VLM-based evaluation can serve as a proxy for mathematical coherence metrics.
Beyond mathematical coherence, we employ the DTW distance as a complementary metric. DTW measures how closely algorithmically extracted sequences align with human-curated ground truth narratives, quantifying agreement with expert curatorial judgment. The opposite signs of the DTW distance correlations reveal a fundamental divergence in evaluation philosophy. Caption-based scores increase with the DTW distance (a positive correlation), indicating disagreement with human baselines, while VLM scores decrease (a negative correlation), indicating agreement. This suggests that caption-based evaluation may prioritize linguistic coherence over visual narrative structure, while VLM evaluation aligns more closely with human curatorial judgment.
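For concreteness, a minimal DTW implementation over sequences of image embeddings is sketched below. We assume a cosine local cost between embeddings; the actual feature space and cost function used to compare extracted and human-curated sequences may differ.

```python
import numpy as np

def dtw_distance(seq_a: np.ndarray, seq_b: np.ndarray) -> float:
    """DTW distance between two image-embedding sequences of shapes (n, d) and (m, d)."""
    a = seq_a / np.linalg.norm(seq_a, axis=1, keepdims=True)
    b = seq_b / np.linalg.norm(seq_b, axis=1, keepdims=True)
    cost = 1.0 - a @ b.T                           # cosine distance as the local cost
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j],      # skip in seq_a
                                                 acc[i, j - 1],      # skip in seq_b
                                                 acc[i - 1, j - 1])  # match
    return float(acc[n, m])
```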
Alternative evaluation metrics from other domains are not directly applicable to our task. Edit Distance and Longest Common Subsequence assume discrete symbols rather than continuous image sequences [38]. Video understanding metrics—such as temporal IoU [39], frame-level precision/recall, or clip ordering accuracy [40]—require predetermined event boundaries unavailable in exploratory narrative extraction. Perceptual metrics such as FID [41] or LPIPS [42] evaluate visual quality rather than narrative coherence. Graph-based metrics such as betweenness centrality [43] measure structural properties but not semantic coherence.
We selected mathematical coherence as our primary baseline because it (1) directly measures what narrative extraction algorithms optimize, (2) requires no ground truth beyond end-point specification, and (3) provides interpretable values (0–1 range) validated in prior work. The moderate correlations with VLM evaluations (r ≈ 0.35), combined with the DTW divergence patterns, suggest that both approaches capture complementary aspects of narrative quality: mathematical coherence focuses on embedding geometry, while VLM judges assess semantic flow, with VLM better matching human curatorial decisions.
4.3. Inter-Rater Reliability
Table 3 presents inter-rater reliability metrics from our experiments with three independent judges evaluating 126 narratives.
The VLM-based approach demonstrates substantially higher inter-rater reliability than caption-based evaluation across all metrics. The single-judge VLM evaluation achieves moderate reliability (ICC(2,1) = 0.718), while the single-judge caption-based evaluation shows poor reliability (ICC(2,1) = 0.339). When averaging across three judges, VLM reaches good reliability (ICC(2,3) = 0.884), while caption-based evaluation achieves only moderate reliability. Pairwise correlations confirm this pattern, with the VLM judges showing strong agreement compared to weak agreement among the caption-based judges. The higher within-narrative variability for caption-based evaluation, particularly for human narratives (STD = 0.924 vs. 0.393), suggests that evaluating from textual descriptions rather than visual content directly introduces additional sources of disagreement, as judges interpret the same linguistic descriptions differently.
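As an illustration, the two-way random-effects ICC values reported here can be computed from a narratives-by-judges score matrix as sketched below. This follows the standard Shrout–Fleiss formulation of ICC(2,1) and ICC(2,k) and is not our exact analysis code; in practice a statistics package would typically be used.

```python
import numpy as np

def icc2(scores: np.ndarray) -> tuple[float, float]:
    """ICC(2,1) and ICC(2,k) for a (narratives x judges) score matrix."""
    n, k = scores.shape
    grand = scores.mean()
    row_means = scores.mean(axis=1)   # per-narrative means
    col_means = scores.mean(axis=0)   # per-judge means
    # Mean squares from the two-way ANOVA decomposition.
    ss_rows = k * np.sum((row_means - grand) ** 2)
    ss_cols = n * np.sum((col_means - grand) ** 2)
    ss_total = np.sum((scores - grand) ** 2)
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    icc_single = (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n)
    icc_average = (ms_rows - ms_error) / (ms_rows + (ms_cols - ms_error) / n)
    return float(icc_single), float(icc_average)

# Example: 126 narratives scored by 3 judges (placeholder random scores).
rng = np.random.default_rng(1)
icc_1, icc_k = icc2(rng.uniform(1, 10, size=(126, 3)))
```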
4.4. Statistical Discrimination Between Narrative Types
Both evaluation methods successfully discriminate between narrative types with varying degrees of statistical significance.
Table 4 presents pairwise comparison results including effect sizes.
The VLM-based approach demonstrates superior discrimination capabilities, with larger effect sizes for comparisons involving human baselines than caption-based evaluation. In particular, the VLM method achieves a very large effect size when distinguishing human-curated narratives from random sequences, indicating a strong sensitivity to narrative quality. Both methods maintain statistical significance for all pairwise comparisons, and most comparisons reach the strictest conventional significance levels. The smaller sample size for human narratives (n = 6) compared to algorithmic methods (n = 60 per method) reflects the limited availability of expert-curated ground truth sequences in the ROGER dataset.
The limited number of human-curated baselines (n = 6) reflects practical constraints common in digital humanities research, where expert annotation requires deep domain knowledge and careful analysis of primary sources. While the sample size is small, the effect sizes presented in Table 4 exceed Cohen’s benchmark for “large” effects (d = 0.8), with human comparisons yielding considerably larger values. Such effect sizes suggest adequate statistical power to detect meaningful differences between narrative types despite the imbalance in sample size.
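For reference, the effect sizes in Table 4 correspond to Cohen’s d with a pooled standard deviation, which remains well defined for unequal group sizes such as n = 6 versus n = 60. A minimal sketch follows; the example score distributions are illustrative placeholders, not our data.

```python
import numpy as np

def cohens_d(group_a: np.ndarray, group_b: np.ndarray) -> float:
    """Cohen's d using the pooled standard deviation (handles unequal group sizes)."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * group_a.var(ddof=1) +
                  (nb - 1) * group_b.var(ddof=1)) / (na + nb - 2)
    return float((group_a.mean() - group_b.mean()) / np.sqrt(pooled_var))

# Example: human-curated (n = 6) versus random (n = 60) narrative scores (placeholders).
rng = np.random.default_rng(2)
d = cohens_d(rng.normal(7.5, 0.5, 6), rng.normal(4.0, 1.0, 60))
```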
4.5. Computational Efficiency
The caption-based approach presents speed advantages over direct VLM evaluation, even after accounting for initial setup costs.
Table 5 presents the computational requirements measured during our experiments with 126 narratives.
The experimental results show an efficiency trade-off that depends on both dataset size and evaluation scope. The evaluation times in Table 5 are derived by multiplying the measured per-narrative times by the number of narratives (126) and judges (3), which yields approximately 325 s for caption-based evaluation and 3515 s for VLM-based evaluation. While caption-based evaluation processes narratives 10.8× faster than VLM-based evaluation after initial setup (0.86 s vs. 9.30 s per narrative), this advantage requires amortizing the one-time caption generation cost. With an estimated 2.5 s per unique image, processing the complete ROGER dataset of 501 images requires approximately 20.9 min of initial setup, establishing a break-even point at approximately 50 narratives (N ≈ 2.5 I / (3 × (9.30 − 0.86)), where I represents the number of unique images).
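The break-even arithmetic can be reproduced directly from the reported timings, as in the sketch below; it assumes, as in the estimate above, that the per-narrative advantage accrues once per judge and that caption generation is a one-time cost per unique image.

```python
def break_even_narratives(num_images: int, caption_setup_s: float = 2.5,
                          caption_eval_s: float = 0.86, vlm_eval_s: float = 9.30,
                          judges: int = 3) -> float:
    """Number of narratives after which one-time captioning pays for itself."""
    setup_cost = caption_setup_s * num_images                  # one-time captioning
    savings_per_narrative = judges * (vlm_eval_s - caption_eval_s)
    return setup_cost / savings_per_narrative

# ROGER dataset: 501 unique images -> break-even at roughly 50 narratives.
print(round(break_even_narratives(501)))  # ~49
```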
For our experiments with 126 narratives from 501 images, the caption-based approach completes faster even in a first-run scenario (26.3 min vs. 58.6 min for VLM-based evaluation). This advantage becomes more pronounced for subsequent evaluations, as cached captions eliminate the setup cost entirely, reducing the per-experiment time from 26.3 to 5.4 min. However, this relationship scales with the size of the dataset: small collections favor caption-based approaches for nearly all use cases, medium-sized collections require case-by-case analysis, and large archives favor VLM-based evaluation unless conducting extensive systematic studies involving hundreds of narratives. Applications requiring repeated evaluations, parameter tuning, or multiple experimental conditions on the same image collection benefit from caption-based caching, while one-time exploratory analyses of large archives favor the VLM approach despite longer per-narrative processing times.
However, the efficiency of the evaluation process must be contextualized within the broader narrative extraction pipeline. As documented in our survey [3], current narrative extraction algorithms present significant computational challenges: the Narrative Maps algorithm requires solving linear programs [10], while pathfinding approaches require computing pairwise coherence scores across entire collections [5]. Thus, narrative extraction itself involves substantial computational overhead that must be considered alongside evaluation costs; we omit it here to focus purely on the evaluation procedure.
4.6. Alternative Evaluation Baselines and Metrics
While our work compares caption-based and direct vision approaches against mathematical coherence metrics, we acknowledge that other evaluation paradigms exist. Human evaluation remains the gold standard but is prohibitively expensive at scale [7,14]. Automated video understanding metrics, such as temporal IoU [44] or event ordering accuracy [45], assume ground truth event sequences that are not available in general narrative extraction tasks. Visual storytelling metrics such as VIST [8] evaluate generated stories rather than extracted sequences. Our mathematical coherence baseline, while imperfect, provides a consistent and interpretable metric validated in prior narrative extraction work [5,10]. The moderate correlations that we observe (r ≈ 0.35) align with early-stage evaluation systems in other domains before optimization.
4.7. Qualitative Observations
We note that caption-based evaluations focus on semantic connections and narrative logic, noting journey narratives from ship departure to inland travel or identifying temporal inconsistencies when arrival at excavation sites is shown before overland travel. VLM-based evaluations emphasize visual continuity and compositional elements, observing consistency in photographic style and period clothing, or noting transitions from maritime scenes to mountain landscapes. These differences suggest that the approaches capture different aspects of narrative quality.
4.8. Reasoning-Based Model Exploration
We conducted additional experiments with Qwen2.5-VL:7b, a recent open-source vision-language model, on a subset of our dataset (6 human-curated narratives, 30 Narrative Maps sequences, and 30 random narratives), comparing the two evaluation methods (caption-based and VLM-based).
Table 6 presents the comparative results. Qwen showed discriminative capabilities in both caption-based and direct vision modes, successfully distinguishing between narrative types, and these results are consistent with the GPT-4o evaluation. However, the model exhibited unexpected behavior: caption-based evaluation produced the intuitive score pattern (Human > Narrative Maps > Random), whereas direct vision evaluation showed a reversed pattern, with Narrative Maps sequences scoring higher than human baselines, suggesting a potential sensitivity to visual characteristics not captured in textual descriptions.
The higher variance in Qwen’s evaluations compared to GPT-4o suggests that, while reasoning-based open-source models show promise, they may require additional calibration for narrative evaluation tasks. An in-depth comparison across multiple reasoning-based architectures would require extensive experimentation beyond the scope of this work. However, these preliminary results indicate that the choice of model significantly affects narrative evaluation.
4.9. Narrative Breaking Point Identification
To address whether evaluation models can provide actionable diagnostic feedback, we tested GPT-4o’s ability to score individual transitions and identify weak points within narratives. We evaluated 66 complete narratives (6 human, 30 Narrative Maps, 30 random) by having the model score each consecutive image pair transition on a 1–10 scale.
The results showed clear differentiation by narrative source: human narratives received the highest mean transition scores, followed closely by Narrative Maps sequences, with random sequences scoring lowest. While the difference between human and Narrative Maps narratives was minimal, both algorithmic and human-curated sequences demonstrated superior transition quality compared to random baselines.
The model consistently identified specific transition types as disruptive: abrupt location changes without contextual bridges (e.g., “ocean voyage to mountain excavation without overland travel”), temporal inversions (“arrival at excavation site before departure from port”), and thematic discontinuities (“equipment preparation followed by unrelated landscape”). This diagnostic capability suggests potential for incorporating transition-level feedback into narrative extraction algorithms, enabling iterative refinement of weak connections rather than holistic rejection of suboptimal sequences.
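A sketch of how transition-level scores could feed back into extraction is shown below; the judge interface and the cut-off for flagging a weak transition are hypothetical and would need tuning against expert annotations.

```python
from typing import Callable, Sequence

# Hypothetical judge: scores the transition between two images on a 1-10 scale.
TransitionJudge = Callable[[str, str], float]

def find_breaking_points(images: Sequence[str], judge: TransitionJudge,
                         threshold: float = 5.0) -> tuple[list[float], list[int]]:
    """Score each consecutive image pair and flag transitions below a threshold."""
    scores = [judge(a, b) for a, b in zip(images, images[1:])]
    weak = [i for i, s in enumerate(scores) if s < threshold]
    return scores, weak  # weak transitions are candidates for re-extraction
```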
However, further validation of breaking point identification would require ground truth annotation of transition quality, a labor-intensive process requiring expert judgment for each consecutive image pair. We leave such a detailed validation for future work while noting that the consistent patterns observed across narrative types support the model’s ability to identify structural weaknesses.
4.10. In-Context Learning Effects
To investigate whether providing example narratives improves evaluation consistency, we conducted experiments with and without in-context learning anchors. Using GPT-4o, we evaluated 55 narratives under two conditions: (1) with examples showing a high-quality human narrative (score: 8) and a low-quality random narrative (score: 3) and (2) without examples, using only the standard evaluation prompt. The reduced sample size reflects the exclusion of the example narratives themselves from the evaluation (five human baselines were evaluated, with one reserved as a positive example).
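The two prompt conditions can be illustrated as follows; the example texts and anchor scores shown here are paraphrased placeholders rather than the exact prompts used.

```python
BASE_PROMPT = "Rate the narrative coherence of the following image sequence on a 1-10 scale."

# Placeholder in-context anchors (the actual exemplar narratives are not reproduced here).
ICL_ANCHORS = (
    "Example A (score 8): a human-curated expedition narrative with smooth transitions.\n"
    "Example B (score 3): a random chronological sequence with abrupt topic changes.\n\n"
)

def build_prompt(narrative_description: str, with_examples: bool) -> str:
    """Assemble the evaluation prompt with or without in-context anchors."""
    prefix = ICL_ANCHORS if with_examples else ""
    return f"{prefix}{BASE_PROMPT}\n\n{narrative_description}"
```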
Table 7 presents the results. Contrary to our expectation that examples would reduce variance, in-context learning actually increased evaluation variability across all narrative types. The human narratives showed identical means under both conditions but a higher standard deviation with examples, and the Narrative Maps and random sequences displayed similar patterns.
This unexpected result suggests that, while in-context examples may help calibrate absolute scoring scales, they introduce additional variance through example-specific anchoring effects. Judges may interpret the provided examples differently, leading to divergent scoring strategies rather than convergent calibration. For production systems prioritizing consistency over absolute accuracy, our findings suggest that omitting in-context examples may yield more stable evaluations, although this conclusion requires validation across different example selection strategies and narrative domains.
5. Discussion
5.1. Complementary Strengths of Each Approach
Our results suggest a speed–reliability trade-off between caption-based and VLM approaches for visual narrative evaluation. The speed advantage of the caption-based method, which processes narratives ten times faster, could make it practical for large-scale applications that do not require high consistency. In contrast, the reliability of the VLM approach, with an ICC of 0.718 compared to 0.339 for captions, provides consistent assessments.
The correlations with mathematical coherence of approximately 0.35 suggest that both methods capture aspects of narrative quality through mechanisms different from embedding-based approaches. The caption-based approach leverages linguistic representations of visual content, benefiting from the training of the underlying LLMs on textual narratives. The VLM approach preserves visual information that may be difficult to express textually, such as compositional consistency or photographic style.
5.2. The Reliability–Speed Trade-Off
The difference in inter-rater reliability merits consideration. The lower ICC of the caption approach (0.339) indicates variability between judges when interpreting the same textual descriptions. While all judges receive identical pre-generated captions, the linguistic representation of visual content appears to introduce more ambiguity in evaluation compared to direct visual assessment. Textual descriptions may allow for more varied interpretations of narrative coherence, as judges must infer visual continuity from words rather than directly observing it.
The reliability of the VLM approach (0.718) suggests that direct vision evaluation provides stable assessment criteria. When judges see the same visual information simultaneously, they converge on similar quality judgments. This reliability advantage may justify the computational cost for applications that require consistent evaluation standards.
While caption-based evaluation offers computational efficiency advantages (26.3 vs. 58.6 min for 126 narratives), the practical significance of this 32-min difference diminishes when contextualized against research time scales and quality considerations. For our experimental scope, this translates to approximately 5 additional seconds per evaluation—a negligible cost relative to the ICC improvement from 0.339 to 0.718 and the closer alignment with human curatorial judgment indicated by the opposite-signed DTW correlations. The caption-based approach’s efficiency advantage becomes meaningful primarily in three scenarios: (1) real-time or interactive applications requiring immediate feedback, (2) truly massive-scale evaluations involving thousands of narratives where time costs compound significantly, or (3) resource-constrained environments with strict computational budgets. For typical research applications that involve systematic evaluation of image collections, the superior reliability and validity of the VLM approach justify the computational overhead.
5.3. Implications for Different Application Domains
Our findings suggest some potential selection criteria for choosing between approaches in the domain of historical photographic archives. For example, in large-scale screening applications such as archive processing, the caption-based approach offers advantages. In particular, the ability to cache descriptions enables efficient re-evaluation and provides human-readable intermediate outputs useful for debugging. In contrast, for curatorial decisions and academic research, the reliability of the VLM approach could justify the additional computational cost. Moreover, the higher inter-rater agreement reduces the need for multiple evaluations to achieve consensus. However, for interactive systems, a hybrid approach might be optimal, using a caption-based evaluation for initial screening followed by a detailed VLM evaluation for promising candidates.
The architectural differences between approaches—caption-based operating on linguistic representations while VLM processes visual information directly—suggest that the reliability–efficiency trade-off we observe may transcend specific visual domains. However, the absolute values of the correlation and ICC metrics would depend on the characteristics of the dataset, including the temporal period, photographic conventions, and cultural context. Further validation in other domains is necessary to check the generalizability of these findings.
5.4. Understanding the DTW Distance Divergence
The divergent DTW correlations show that caption-based and VLM approaches not only process narratives differently but also disagree on what constitutes alignment with human curatorial judgment. The caption-based evaluation shows a positive correlation with the DTW distance, meaning that higher scores occur when narratives differ more from human baselines; this method may prioritize linguistic connections over visual narrative structure. The VLM approach shows the opposite pattern, with a negative correlation: its scores decrease as the distance from human baselines increases. These results indicate that the VLM-based evaluation matches human curatorial choices more closely and may preserve visual narrative elements that are difficult to express in text. This finding has implications for applications requiring human-like narrative assessment: systems that seek to replicate human curatorial decisions should favor VLM evaluation despite computational costs, while those seeking alternative narrative perspectives could benefit from the different criteria of caption-based evaluation.
5.5. Limitations
Our work is not without limitations. First, our evaluation focused on a single dataset whose specific characteristics present unique evaluation challenges and may not represent broader visual narrative contexts (e.g., news photography or social media stories). Contemporary image collections employ different photographic techniques and narrative conventions, while news photography and social media stories follow domain-specific visual languages that we do not explore. Thus, the historical nature of our dataset, consisting of photographs from 1928, may favor evaluation approaches that perform differently on contemporary images.
In addition, the temporal gap between the photographs in the ROGER dataset and the predominantly modern training data of VLMs could introduce uncertainties that we cannot quantify. While we provide explicit historical context in our evaluation prompts, we cannot determine how differences in photographic techniques, visual conventions, and subject matter could affect the evaluation. Furthermore, the cross-cultural context of the expedition, which documents European explorers in Bolivia, adds cultural complexity that our evaluation design does not address, and the English-only evaluation leaves open questions about cross-linguistic narrative assessment. Moreover, our results reflect the capabilities of GPT-4o, which may not transfer to other VLMs or future architectures. However, we expect that VLMs will improve and become more reliable as the field continues to advance.
We further highlight that the sample size imbalance in our experimental design leads to some statistical limitations: with only 6 human-curated baselines compared to 60 algorithmically generated narratives per method, our comparisons may lack sufficient statistical power. In particular, the 10:1 sample size ratio could violate the underlying assumptions of some parametric tests, affecting our ability to detect true differences. While we report large effect sizes for comparisons against the human baseline (Cohen’s d well above the 0.8 benchmark for large effects), these may be overestimated due to the small reference group. The limited availability of expert-curated narratives—a common constraint in digital humanities research—means that our conclusions about human baseline performance should be interpreted cautiously. To mitigate these issues, we report effect sizes alongside p-values, as effect sizes are less sensitive to sample size imbalance (although they may still be overestimated); we also use non-parametric tests where appropriate and focus on consistent patterns across multiple metrics rather than individual significance tests. Future work should expand the collection of human baselines to provide adequate statistical comparisons. Additionally, the small human sample size limits our ability to detect heterogeneity in human curatorial strategies, as six narratives may not fully represent the range of valid narrative constructions possible from the collection.
Regarding our methodological choices, we deliberately kept consistent prompts throughout the experiments to isolate the effect of the evaluation modality. Our previous work has shown that simple prompts achieve 85–90% of the performance of complex prompts in narrative extraction evaluation [4]. While prompt engineering could improve absolute scores, our focus was on comparing caption-based versus direct vision approaches under equivalent conditions. Production deployments would benefit from prompt optimization, but our comparative findings about modality trade-offs should remain valid. Despite these limitations, our primary contribution—establishing a comparative framework between caption-based and direct VLM approaches—provides value for practitioners choosing between evaluation modalities. Finally, we note that both evaluation methods were able to discriminate between the three categories of narratives (human-curated, algorithmically extracted, and random narratives).
5.6. Future Directions
Our work opens some potential research avenues for investigation. For example, future work could explore mixed approaches that combine the strengths of caption-based and direct vision evaluations (e.g., through weighted combinations or meta-learning to optimize the trade-off between reliability and efficiency). Testing on more types of visual narratives (e.g., news sequences, instructional images, or social media stories) could help evaluate whether the results we obtained in this specific domain are generalizable. Furthermore, developing methods to visualize what drives VLM coherence judgments could improve interpretability and trust. Finally, examining how cultural factors influence visual narrative evaluation could inform international applications.
6. Conclusions
This work presents the first systematic comparison of caption-based versus direct vision approaches to evaluate visual narrative coherence. Through experiments on historical photographs, we demonstrate that both methods discriminate between human-curated, algorithmically extracted, and random narratives with distinct trade-offs.
Our findings establish that both approaches achieve similar correlations with average mathematical coherence (r ≈ 0.35), although caption-based evaluation shows a slightly weaker correlation with minimum coherence, suggesting that narrative quality assessment transcends specific modalities. VLM evaluation offers higher inter-rater reliability, with an ICC of 0.718 compared to 0.339 for captions, at the cost of roughly ten times longer processing; this trade-off is justified by the substantially higher reliability and alignment with human judgment for most research applications, with caption-based approaches reserved for scenarios requiring real-time evaluation or processing at massive scale. Both methods distinguish between narrative qualities, with human narratives scoring higher than Narrative Maps extractions, which in turn score higher than random sequences, all with p-values below 0.001, validating their use as evaluation proxies. The opposite DTW correlations with human baselines suggest that the approaches could capture different aspects of narrative quality.
Our work establishes that current VLMs can facilitate visual narrative assessment for practitioners without technical expertise. Furthermore, these results have practical implications for narrative extraction projects in digital humanities or computational journalism, since VLM-based evaluation could provide an efficient screening tool. Future work could explore hybrid approaches that combine the efficiency of caption-based methods with the reliability of VLM-based evaluation.