Open Access | Review
From Context to Human: A Review of VLM Contextualization in the Recognition of Human States in Visual Data
1 Image Processing and Analysis Laboratory, National University of Science and Technology Politehnica Bucharest, Splaiul Independentei 313, 060042 Bucharest, Romania
2 AI4AGRI, Romanian Excellence Center on AI for Agriculture, Transilvania University of Brașov, 500024 Brașov, Romania
* Author to whom correspondence should be addressed.
Mathematics 2026, 14(1), 175; https://doi.org/10.3390/math14010175
Submission received: 21 November 2025 / Revised: 24 December 2025 / Accepted: 29 December 2025 / Published: 2 January 2026
Abstract
This paper presents a narrative review of the contextualization and contribution offered by vision–language models (VLMs) for human-centric understanding in images. By exploiting the correlation between humans and their context (background) and incorporating VLM-generated embeddings into recognition architectures, recent solutions have advanced the recognition of human actions, the detection and classification of violent behavior, and the inference of human emotions from body posture and facial expression. While powerful and general, VLMs may also introduce biases that are reflected in overall performance. Unlike prior reviews that focus on a single task or on generic image captioning, this review jointly examines multiple human-centric problems addressed by VLM-based approaches. The study begins by describing the key elements of VLMs (including architectural foundations, pre-training techniques, and cross-modal fusion strategies) and explains why they are suitable for contextualization. In addition to highlighting the improvements brought by VLMs, it critically discusses their limitations (including human-related biases) and presents a mathematical perspective on these issues together with strategies for mitigating them. This review aims to consolidate the technical landscape of VLM-based contextualization for human state recognition and detection, and to serve as a foundational reference for researchers seeking to harness the power of language-guided VLMs in recognizing human states correlated with contextual cues.
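To make the idea of incorporating VLM-generated embeddings into a recognition architecture concrete, the following is a minimal sketch, not drawn from the reviewed works: it assumes a frozen CLIP model loaded via the Hugging Face transformers library, a hypothetical set of human-state labels and prompt templates, and a simple concatenation-based fusion of the image embedding with image–text similarity scores feeding a small trainable head.

```python
# Illustrative sketch (assumptions: CLIP ViT-B/32, hypothetical state labels,
# concatenation fusion). Not the method of any specific surveyed paper.
import torch
import torch.nn as nn
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical human-state labels and context prompts (illustrative only).
states = ["calm", "angry", "fearful", "joyful"]
prompts = [f"a photo of a {s} person in a scene" for s in states]


class StateHead(nn.Module):
    """Small trainable head fusing the image embedding with image-text similarities."""

    def __init__(self, img_dim: int = 512, n_states: int = 4):
        super().__init__()
        self.fc = nn.Linear(img_dim + n_states, n_states)

    def forward(self, img_emb, text_sims):
        fused = torch.cat([img_emb, text_sims], dim=-1)  # simple concatenation fusion
        return self.fc(fused)


head = StateHead().to(device)


@torch.no_grad()
def clip_features(image: Image.Image):
    """Encode the image and the state prompts with the frozen VLM."""
    batch = processor(text=prompts, images=image, return_tensors="pt", padding=True).to(device)
    img_emb = clip_model.get_image_features(pixel_values=batch["pixel_values"])
    txt_emb = clip_model.get_text_features(input_ids=batch["input_ids"],
                                           attention_mask=batch["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    sims = img_emb @ txt_emb.T  # contextual image-text similarity scores
    return img_emb, sims


# Usage (the head is untrained here, so the prediction is purely illustrative):
# img_emb, sims = clip_features(Image.open("person.jpg").convert("RGB"))
# print(states[head(img_emb, sims).argmax(dim=-1).item()])
```

The concatenation fusion and prompt templates are placeholders; the surveyed approaches employ a variety of cross-modal fusion strategies, which are discussed in the review.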
Share and Cite
MDPI and ACS Style
Florea, C.; Popescu, C.-B.; Racovițeanu, A.; Nițu, A.; Florea, L.
From Context to Human: A Review of VLM Contextualization in the Recognition of Human States in Visual Data. Mathematics 2026, 14, 175.
https://doi.org/10.3390/math14010175