Electronics
  • This is an early access version, the complete PDF, HTML, and XML versions will be available soon.
  • Article
  • Open Access

27 October 2025

VLM-as-a-Judge Approaches for Evaluating Visual Narrative Coherence in Historical Photographical Records

1 Department of Computing and Systems Engineering, Universidad Católica del Norte, Antofagasta 1270709, Chile
2 School of Journalism, Universidad Católica del Norte, Antofagasta 1270398, Chile
* Author to whom correspondence should be addressed.
Electronics 2025, 14(21), 4199; https://doi.org/10.3390/electronics14214199
(registering DOI)
This article belongs to the Special Issue Artificial Intelligence-Driven Emerging Applications

Abstract

Evaluating the coherence of visual narrative sequences extracted from image collections remains a challenge in digital humanities and computational journalism. While mathematical coherence metrics based on visual embeddings provide objective measures, they require computational resources and technical expertise to interpret. We propose using vision-language models (VLMs) as judges to evaluate visual narrative coherence, comparing two approaches: caption-based evaluation, which converts images to text descriptions, and direct vision evaluation, which processes images without intermediate text generation. Through experiments on 126 narratives from historical photographs, we show that both approaches achieve weak-to-moderate correlations with mathematical coherence metrics (r = 0.28–0.36) while differing in reliability and efficiency. Direct VLM evaluation achieves higher inter-rater reliability (ICC = 0.718 vs. 0.339) but requires 10.8× more computation time after initial caption generation. Both methods successfully discriminate between human-curated, algorithmically extracted, and random narratives, with all pairwise comparisons achieving statistical significance (p < 0.05, with five of six comparisons at p < 0.001). Human sequences consistently score highest, followed by algorithmic extractions, then random sequences. Our findings indicate that the choice between approaches depends on application requirements: caption-based for efficient large-scale screening versus direct vision for consistent curatorial assessment.
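The two headline statistics in the abstract, the correlation r between judge scores and a mathematical coherence metric and the intraclass correlation (ICC) across repeated judgments, can be illustrated with a minimal sketch. The abstract does not state which correlation coefficient or which ICC form was used, so the Pearson coefficient and the two-way random-effects, single-rater form ICC(2,1) below are assumptions, and the function names `pearson_r` and `icc2_1` are hypothetical. The toy numbers are illustrative only and are not data from the study.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists.

    Assumption: the abstract's r is a Pearson coefficient.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def icc2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is a list of rows, one row per narrative, one column per
    rater (here: one column per independent judge run). Assumption: the
    study's ICC form is not given in the abstract; ICC(2,1) is one common
    choice, not necessarily the authors'.
    """
    n = len(ratings)      # number of narratives
    k = len(ratings[0])   # number of raters / judge runs
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_err = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


if __name__ == "__main__":
    # Illustrative toy values only (not results from the paper).
    judge_scores = [7.0, 5.5, 8.0, 3.0, 6.5]
    coherence_metric = [0.62, 0.55, 0.70, 0.41, 0.50]
    print("r =", round(pearson_r(judge_scores, coherence_metric), 3))

    repeated_runs = [[7, 8, 7], [5, 5, 6], [8, 8, 9], [3, 4, 3], [6, 7, 6]]
    print("ICC(2,1) =", round(icc2_1(repeated_runs), 3))
```

Under this reading, a higher ICC means that repeated evaluations of the same narrative agree more closely with one another, while r measures how well the judge's scores track the embedding-based metric across narratives.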
