Review

From Context to Human: A Review of VLM Contextualization in the Recognition of Human States in Visual Data

by Corneliu Florea 1,2, Constantin-Bogdan Popescu 1, Andrei Racovițeanu 1,2, Andreea Nițu 1 and Laura Florea 1,*

1 Image Processing and Analysis Laboratory, National University of Science and Technology Politehnica Bucharest, Splaiul Independentei 313, 060042 Bucharest, Romania
2 AI4AGRI, Romanian Excellence Center on AI for Agriculture, Transilvania University of Brașov, 500024 Brașov, Romania
* Author to whom correspondence should be addressed.
Mathematics 2026, 14(1), 175; https://doi.org/10.3390/math14010175
Submission received: 21 November 2025 / Revised: 24 December 2025 / Accepted: 29 December 2025 / Published: 2 January 2026
(This article belongs to the Special Issue Advance in Neural Networks and Visual Learning)

Abstract

This paper presents a narrative review of the contextualization and contribution offered by vision–language models (VLMs) for human-centric understanding in images. Starting from the correlation between humans and their context (background) and by incorporating VLM-generated embeddings into recognition architectures, recent solutions have advanced the recognition of human actions, the detection and classification of violent behavior, and the inference of human emotions from body posture and facial expression. While powerful and general, VLMs may also introduce biases that can be reflected in the overall performance. Unlike prior reviews that focus on a single task or generic image captioning, this review jointly examines multiple human-centric problems in VLM-based approaches. The study begins by describing the key elements of VLMs (including architectural foundations, pre-training techniques, and cross-modal fusion strategies) and explains why they are suitable for contextualization. In addition to highlighting the improvements brought by VLMs, it critically discusses their limitations (including human-related biases) and presents a mathematical perspective and strategies for mitigating them. This review aims to consolidate the technical landscape of VLM-based contextualization for human state recognition and detection, and to serve as a foundational reference for researchers seeking to harness the power of language-guided VLMs in recognizing human states correlated with contextual cues.

1. Introduction

We, as humans, have a human-centric perspective of the world. Consequently, many human-related aspects have become focal points of research in computer vision. These include action recognition (sometimes encompassing anomaly/violence detection) and human–computer interaction (which covers emotion recognition).
The most common approach in recent computer vision research is to define a benchmark consisting of a training set and a test set, develop a learning model on the training set, and evaluate it on the test set. However, in this basic form, the learning model cannot by itself overcome the limitations of the training data. A notable limitation is the finite amount of available data, which has led to the trend of introducing larger and richer datasets [1,2]. Another clear limitation is revealed when a model trained on one benchmark performs significantly worse when tested on another [3,4]. As a direct consequence, the research community has actively pursued methods to improve performance. One direction is to design methods specifically for cross-domain evaluation and comparison [5]. Another direction is to use machine learning tools trained on much larger datasets, thereby mitigating the limitations of the original benchmarks. The latter models are named foundation models or large models [6].
One approach involves incorporating an image captioning or caption-inspired language branch into the solution architecture, providing textual descriptions of visual inputs. Recall that image captioning is the process in which a (foundation) model analyzes an image, identifies objects, actions, and relationships, and generates a short natural language sentence or description explaining what is happening in the image [7]. A brief illustration of the captioning principle applied to human state recognition problems is shown in Figure 1. The key assumption underlying the success of these approaches is that image captioning relies on VLMs, which are trained on datasets far larger than standard benchmarks. These models can provide richer context, allowing more nuanced predictions compared to a simple model that merely associates content with a label.
This narrative review surveys vision–language models (VLMs) for human-centric tasks. Key references are identified based on author expertise and citation chasing, and the literature is organized thematically, covering developments from the first concepts in 2019 to the emergence of the main research directions. From the beginning, VLMs were noted for their capacity to be used in downstream image tasks. First, Lu et al. [8] demonstrated that a single vision + language pre-trained model (though not yet large scale) could be adapted to a range of downstream vision–language tasks. Next, Radford et al. [9], introducing the CLIP paradigm, showed that a large-scale VLM can serve as a general zero-shot/few-shot solution for many computer vision problems (image classification, open-vocabulary recognition, etc.). More recently, Zhang et al. [10] highlighted that “Multi-Branch” architectures (vision encoder + text encoder) allow models to use natural language as a bridge for solving tasks without exhaustive per-task labeling, improving performance across diverse applications like autonomous driving [11], remote sensing [12], and robot vision [13]. Relating to the specific directions approached in this work, the use of VLMs was demonstrated by Wang et al. [14] (publicly available in 2021 as a pre-print), thereby initiating the use of VLM captioning/descriptions for human action recognition, and by Li et al. [15] (public in 2023 as a pre-print) for expression recognition.
As we will show later, there exist multiple cases related to human-centric aspects where the addition of an image captioning branch brought improvement. For instance, Li et al. [15] reported an increase in emotion recognition accuracy from 90.91 to 91.61, while Xenos et al. [16] reported an increase on EMOTIC from 29.19 to 34.66, achieved by the addition of captioning. In violence detection, Wang et al. [17] reported a 4% increase in accuracy when adding text descriptions. However, while their limitations are greatly diminished compared to smaller models, large models and their databases still exhibit limitations and biases [18,19,20]. We specifically review these while providing a mathematical formulation, thus further facilitating development in this direction.
Unlike prior surveys that focus on a single task or generic image captioning, this review jointly examines multiple human-centric problems within a unified contextualization framework. From an application perspective, we group the existing works into task categories. We separately discuss human action recognition and anomaly detection (including violence detection) and emotion recognition from facial cues, full-body posture, and context. Each category exhibits distinct challenges in terms of context reliance and bias sensitivity.
The three main contributions of this work are as follows: First, it provides a systematic review of VLM-based approaches for human-centric state recognition in visual data, covering human action recognition with outliers (e.g., violence detection) as well as emotion recognition from both full-body and facial expressions. While broader surveys of VLMs for visual recognition exist, this is, to the best of our knowledge, the first to focus specifically on these tasks, offering a consolidated categorization of existing work. Second, it proposes a unifying perspective of diverse research directions in human-centric state recognition. Third, in addition to highlighting the advances enabled by VLMs, it critically discusses their limitations, particularly human-related biases, and argues for the explicit integration of debiasing or bias mitigation mechanisms in future methods.
The remainder of the paper is organized as follows: Section 2 surveys existing reviews that each focus on one of the directions approached in this paper. In Section 3, the prototypical solution is presented. The core contributors are the vision–language models (VLMs), which are detailed in Section 4. A factor affecting future success is the problem of bias in datasets and VLMs, which is investigated in Section 5. The second part of this paper focuses on narrower themes such as action recognition (Section 6), anomaly detection (Section 7), emotion recognition in context (Section 8), and facial expression recognition (Section 8). We end the paper with a discussion in Section 9 and conclusions in Section 10.

2. Related Work

This paper investigates the complex solutions with multiple branches that have emerged following the rise of VLMs. It spans several areas, with the specific aim of summarizing the impact that VLMs have on tasks related to humans in images and videos. Consequently, its purpose and scope overlap with additional research areas, and, in this section, we provide targeted references to help readers better understand the broader picture.
A key component of the solutions we analyze is the use of large models. These models first appeared in natural language processing as large language models (LLMs), and we refer readers to the work of Chang et al. [21] for a more thorough presentation. The concept has since been extended to VLMs, and the directions in architectures and tasks are summarized in the work of Zhang et al. [10]. In the reviewed literature, the approach often relies on prompts to guide and extract informative captions. A review focused on this direction is provided by Gu et al. [6]. The task may be considered an extension of image captioning, which has also recently been reviewed by Sharma et al. [7]. Given that, in most of the solutions we review, the VLMs are employed in a zero-shot learning framework, we point out that more details can be found in [22].
The popularity of VLMs in such solutions is largely due to their higher capacity and reduced overfitting and bias. A summary of the biases in large-scale image collections is provided in the works of Navigli et al. [23] and Sun et al. [24]. However, while discussing the bias in large models, the research community has tended to focus more on social biases. Key findings and developments have been summarized by Gallegos et al. [18]. The specific problem of bias in machine learning has been reviewed in several works, some of the more recent of which are those of Hort et al. [25] and Pagano et al. [26].
In the second part of this paper, we explore specific themes such as human action recognition, for which a comprehensive review is available in the work of Kong and Fu [27]. A subarea of this theme is violence detection, with recent developments presented by Mumtaz et al. [28].
Another direction addressed in this work is the recognition and exploration of human emotions as captured in images and videos [29]. This is a broad theme that encompasses several subdirections, including facial expression recognition from portrait images [30] and emotion recognition from context and body language [31].
Several others have explored multimodal or conventional image-based human state recognition without using vision–language models, including Hu et al. [32] for violence detection and Wang et al. [33] for video emotion recognition. While demonstrating the importance of the tasks, these methods do not leverage VLMs or caption-based embeddings and are therefore complementary to the approaches reviewed in this paper.
The works mentioned in this section each focus on a specific theme, discussing in detail its challenges and proposed solutions across all types of architectures. Existing reviews are largely fragmented, focusing on single tasks such as action or emotion recognition, while captioning surveys emphasize generation quality rather than the use of captions as contextual representations for downstream recognition. Furthermore, prior works rarely analyze how VLMs propagate bias into human-centric classifiers from a mathematical or architectural perspective, and they lack a unified comparison of how the same contextualization mechanisms perform across different human state recognition tasks. In our work, we suggest that the rise of VLMs leads to an encompassing solution which may be particularized to various directions but where the key element is the captioning of the input data. Thus, this work proposes a novel integrating perspective concentrated on the gains and disadvantages of incorporating VLMs into architectures dealing with human states in visual data.

3. General Architecture

Individual solutions differ in their approach and may be adapted to the theme, yet the use of image captioning to describe the context in which human-centric actions take place remains a mainstay.
A general architecture of human-centric recognition that includes VLM captioning is presented in Figure 2. There, the input data depends on the problem; it can be either a single image or a video. In some cases, a preliminary pre-processing step is applied where the person or their face is detected, and only that region is used in subsequent processing.
With respect to the parallel branches, the following apply:
  • Image description is marked with blue. The image content can be described using either a convolutional neural network (CNN, suitable for low-resource settings) or a Visual Transformer (ViT).
  • Problem-specific descriptors. The central branch is optional and depends on the problem. For example, in action recognition or body emotion, it may include a system that identifies keypoints on human articulations (a skeletal representation). For facial expression recognition, it may involve facial keypoints or an action unit descriptor. For recognition in video, an optical flow descriptor may be used.
  • VLM captioning. The branch highlighted in red is the focal point of this paper.
Explicit implementations of problem-specific descriptors will be presented in the respective task sections of the paper. In practice, in the pursuit of accuracy and efficiency, various solutions resort to different strategies to combine the three branches, such as early, sequential, and late fusion, depending on task requirements. These strategies will also be detailed in Section 6, Section 7 and Section 8, and concrete models and results will be presented.

Formalization

Regarding VLMs, we note that their use is often prohibitive due to high resource demands. On the one hand, it is intuitive to construct the entire solution as a single large model (LM): a general model carefully fine-tuned to balance generalization capability, optimize performance on the task of interest, and avoid memorization and overfitting [34]. In this case, if an input sample $i$ is denoted by $x_i \in X$ with associated label $y_i$, the LM, represented by a function $f_{LM}$, is trained using a loss function $\mathcal{L}(y_i, \hat{y}_i)$. Here, $\hat{y}_i$ is the prediction, obtained as follows:
$$\hat{y}_i = f_{LM}(x_i).$$
The large model includes a prediction layer specifically designed to address the problem at hand.
Training an LM is resource-intensive, and only a limited number have the resources required [35]. Thus, most approaches prefer to employ an off-the-shelf VLM, use it to generate a textual description (caption) of the scene, and then convert the caption into a numerical vector (embedding). Formally, this can be described as follows:
$$e^{i}_{VLM} = f_{VLM}(x_i).$$
In this case, $f_{LM}$ is a cascade of a VLM and a Word2Vec model:
$$f_{LM} = f_{w2vec}(f_{VLM}(\cdot)).$$
Other parallel branches may also be incorporated. These are typically trained (rarely from scratch, often fine-tuned) or used off-the-shelf to produce additional embeddings:
$$e^{i}_{1} = f_1(x_i), \;\ldots,\; e^{i}_{p} = f_p(x_i).$$
The concatenated embedding then serves as the representation of the data:
$$x_i \mapsto \left[\, e^{i}_{1}, \ldots, e^{i}_{p}, e^{i}_{VLM} \,\right].$$
Fusion may be performed directly, but it is preferable to train a smaller model on the concatenated embedding space to obtain the final prediction y i ^ .
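The branch-concatenation scheme above can be sketched in a few lines of Python. The branch functions and toy dimensionalities below are hypothetical placeholders: in a real system they would be a CNN/ViT feature extractor, a skeleton descriptor, and an off-the-shelf VLM captioner followed by a Word2Vec-style text encoder.

```python
from typing import List

def fuse_embeddings(branch_outputs: List[List[float]],
                    vlm_embedding: List[float]) -> List[float]:
    """Concatenate the branch embeddings e_1..e_p with the VLM caption
    embedding e_VLM, as in the formalization above."""
    fused: List[float] = []
    for e in branch_outputs:
        fused.extend(e)
    fused.extend(vlm_embedding)
    return fused

# Hypothetical stand-ins for the real branches (image features, a
# skeleton descriptor, and a VLM caption passed through Word2Vec).
def visual_branch(x):   return [0.1, 0.4]       # e_1 = f_1(x)
def skeleton_branch(x): return [0.7]            # e_2 = f_2(x)
def vlm_branch(x):      return [0.2, 0.9, 0.3]  # e_VLM = f_w2vec(f_VLM(x))

x = "input image (placeholder)"
representation = fuse_embeddings(
    [visual_branch(x), skeleton_branch(x)], vlm_branch(x))
# A smaller classifier head would then be trained on `representation`
# to produce the final prediction.
```

The concatenated vector is what a late-fusion design feeds to the final, task-specific classifier.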

4. Vision–Language Models

In this section, we present a short mathematical formulation of VLMs. According to Gallegos et al. [18], a large model (LM) parametrized by $\theta$ is defined as a model with an autoregressive, autoencoding, or encoder–decoder architecture trained on a corpus of hundreds of millions to trillions of tokens. Depending on their function, these LMs are large language models (LLMs, historically the first), vision–language models (VLMs), or large multimodal models.
From a constructive point of view, LMs are built on several paradigms that include autoregressive models such as GPT [36,37]; autoencoding models like BERT [38], RoBERTa [39], and XLM-R [40]; and encoder–decoder models such as BART [41].
Large models are commonly adapted for a specific task. LLMs address tasks such as text generation, sequence classification, or question answering, typically via fine-tuning. VLMs address vision problems such as image captioning and other downstream tasks (e.g., classification, object detection, etc.) [7,10]. Multimodal (general) models can handle multiple modalities beyond just vision or text, like text, images, audio, video, code, sensor data, etc. VLMs are specifically within the scope of this work.
Since VLMs are derived from LLMs, they can be divided into the following categories based on how the two modalities are fused: (i) early fusion models, which jointly encode an image and text from the start, such as UNITER [42], (ii) late fusion models, which have separate encoders for vision and language, then combine them, such as CLIP [9], (iii) unified multimodal Transformers, such as Flamingo [43] and BLIP-2 [44], and (iv) models extending pre-trained LLMs with vision, such as LiT [45] or LLaVA [46].

4.1. Categories of VLMs

Perhaps a more relevant categorization criterion is the training paradigm, under which VLMs fall broadly into four families [47]:
  • Contrastive (dual-encoder) models;
  • Generative/masked modeling VLMs;
  • Pre-trained backbones with adapters;
  • Unified multimodal Transformers.

4.1.1. Contrastive (Dual-Encoder) Models

Contrastive models make use of two distinct encoders—a visual encoder and a textual encoder—that project their respective inputs into a shared embedding space. For instance, CLIP [9] is trained with a contrastive loss to align paired images and texts. Such models are highly effective for tasks that can be reduced to similarity comparisons in the joint embedding space (e.g., zero-shot classification).
Training based on contrastive learning [47] follows the Energy-Based Models (EBM) paradigm [48], where a model f θ is optimized to assign low energy to observed variables and high energy to unobserved ones. Data sampled from the target distribution should be given low energy, whereas out-of-distribution samples should receive higher energy. The corresponding Boltzmann distribution over input data x is expressed as follows:
$$p_\theta(x_i) = \frac{e^{f_\theta(x_i)}}{\sum_{x} e^{f_\theta(x)}}$$
The maximum-likelihood objective models the underlying distribution $P_X$ from which inputs are drawn:
$$f_\theta = \operatorname*{argmin}_{\theta} \; \mathbb{E}_{x \sim P_X}\left[-\log p_\theta(x)\right]$$
However, evaluating the normalizing sum $\sum_{x} e^{f_\theta(x)}$ is generally intractable. To bypass it, Noise Contrastive Estimation (NCE) [49] reformulates the task as binary classification; the model predicts $C = 1$ for real data samples and $C = 0$ for samples from a noise distribution. Thus, the model learns to distinguish authentic data points from noise. The resulting objective is a binary cross-entropy loss:
$$\mathcal{L}_{NCE}(f_\theta) = -\sum_i \log P(C_i = 1 \mid x_i; \theta) - \sum_j \log P(C_j = 0 \mid x_j; \theta)$$
with $x_i$ drawn from the data distribution and $x_j \sim p_n(x)$, $j \neq i$, drawn from the noise distribution.
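Assuming the model already outputs the posterior $P(C = 1 \mid x)$ for each sample, the NCE objective reduces to a binary cross-entropy, as in this minimal numeric sketch (toy probabilities, no actual model):

```python
import math

def nce_loss(real_probs, noise_probs):
    """Binary cross-entropy NCE objective: real samples should receive
    P(C=1|x) close to 1, noise samples P(C=1|x) close to 0."""
    loss = -sum(math.log(p) for p in real_probs)         # -sum_i log P(C_i=1|x_i)
    loss -= sum(math.log(1.0 - p) for p in noise_probs)  # -sum_j log P(C_j=0|x_j)
    return loss

# A confident, correct discriminator incurs a much lower loss than a
# non-committal one.
confident = nce_loss(real_probs=[0.99, 0.95], noise_probs=[0.02, 0.05])
uncertain = nce_loss(real_probs=[0.55, 0.60], noise_probs=[0.45, 0.40])
```

Minimizing this loss drives the model to separate authentic data points from noise samples, exactly as described above.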
In the original CLIP model [9], positive examples consist of an image paired with its ground-truth caption, while negative examples are constructed by pairing the same image with captions of other images from the mini-batch. A key innovation in CLIP is the training of vision and text encoders within a unified representation space. The encoders, initialized randomly, are optimized to produce similar embeddings for paired images and captions via a contrastive loss. Trained on 400 million image–caption pairs from the web, CLIP demonstrated striking zero-shot classification transfer performance.
CLIP-based VLMs exhibit stronger contextual and compositional capabilities than earlier or alternative frameworks due to their contrastive training on image–text pairs and the use of language as the sole supervision signal. By aligning entire images with entire sentences rather than individual regions or words, CLIP models learn concepts relationally (e.g., “a person riding a horse” versus “a person next to a horse”), making contextual information—such as attributes, relationships, and actions—central to their representations.
This is particularly advantageous for zero-shot tasks like violence detection, where labeled data for all categories is often unavailable and traditional supervised models struggle. CLIP-based models instead compare visual inputs directly with textual descriptions (e.g., “a person hitting another person” or “a crowd running violently”), and can similarly model emotions in context by considering the full image rather than relying solely on facial cues.
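This zero-shot recipe amounts to a nearest-neighbor search in the joint embedding space. The sketch below uses made-up toy vectors in place of real CLIP encoder outputs; only the cosine-similarity comparison step is faithful to the method:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def zero_shot_classify(image_emb, prompt_embs):
    """Pick the text prompt whose embedding is most similar to the
    image embedding -- the CLIP-style zero-shot recipe."""
    scores = {label: cosine(image_emb, e) for label, e in prompt_embs.items()}
    return max(scores, key=scores.get), scores

# Toy embeddings; in practice these come from CLIP's image and text
# encoders applied to the frame and the candidate descriptions.
image_emb = [0.9, 0.1, 0.2]
prompt_embs = {
    "a person hitting another person": [0.8, 0.2, 0.1],
    "people talking calmly":           [0.1, 0.9, 0.4],
}
label, scores = zero_shot_classify(image_emb, prompt_embs)
```

No category-specific training is needed: adding a new class only requires writing a new textual description.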

4.1.2. Generative/Masked Modeling VLMs

Generative approaches, exemplified by BLIP [50], are typically trained on caption generation or masked language modeling objectives. They can be understood as a variant of denoising autoencoders [51], where the noise follows a spatial pattern. This perspective also connects them to image inpainting methods, such as those proposed by Pathak et al. [52], which exploit missing regions to learn rich visual representations.
From a theoretical standpoint, Dubois et al. [53] demonstrate that any transformation $f(x)$ applied to data $x$ implicitly induces an equivalence relation, partitioning the inputs into disjoint equivalence classes. Conditional densities remain constant within a class, i.e., $f(x) \sim f(x') \Rightarrow p(z \mid f(x)) = p(z \mid f(x'))$, where $Z$ represents the learned representation of $X$. This perspective unifies masking, augmentation, and modality selection as transformations of data. The problem can then be related to a rate–distortion formulation [54] as follows:
$$\operatorname*{argmin}_{p(z \mid x)} \; I(f(x); Z) + \beta \cdot H(X \mid Z)$$
Recovering the masked VLM objective involves bounding the above expression, yielding the following:
$$\mathcal{L} = \sum_{x \in X} \mathbb{E}_{p(f)\, p(z \mid f(x))}\left[ -\log q(z) - \beta \cdot \log q(x \mid z) \right].$$
Here, $-\log q(z)$ serves as an entropy bottleneck, limiting the rate $I(f(x); Z)$ by filtering redundant information. In masking-based VLMs, this entropy bottleneck is often bounded by a constant proportional to the information removed during masking. For multimodal VLMs, $Z$ is constrained to retain the minimum essential information across modalities. Meanwhile, $-\log q(x \mid z)$ bounds the distortion $H(X \mid Z)$, ensuring that predictive information is preserved. In practice, this term corresponds to autoencoding.
Unlike CLIP-style models that score hypotheses, generative VLMs aim to explain the scene. Trained to predict missing words conditioned on visual input, they are more sensitive to verbs, verb–object relations, and prepositions, which is valuable when linguistic reasoning, temporal abstraction, or label scarcity is important. For example, completing “A person is [MASK] toward a crouched man” requires inferring body pose and motion direction—key cues for tasks such as violence detection or emotion recognition from body posture. They are therefore preferable when scene details matter most.

4.1.3. Pre-Trained Backbones with Adapters

A third family of approaches augments large frozen backbones with lightweight adapter modules. Models such as BLIP-2 [44], LLaVA [46], and Flamingo [43] adopt the following strategy: the vision encoder or large language model is kept frozen, while compact adapters, such as the Q-Former, are trained to connect the modalities, yielding a fraction of the full training cost. We recall that Q-Former is a lightweight Transformer that uses learnable query vectors to extract key visual features from a frozen image encoder. It selects the most relevant visual information to feed into a frozen LLM, enabling the model to generate the desired text efficiently.
This category of VLMs is most useful when reasoning, temporal aggregation, or task adaptation matters more than raw visual matching. In this setup, the model’s strength lies in the language prior, so visual features must be interpreted in context rather than simply embedded. From an accuracy perspective, these models are recommended when actions are ambiguous and require reasoning about intent or causality—for example, distinguishing “pushing” from “helping” or “intimidated by a crowd” from “angry toward the crowd”. They are particularly effective when fine-tuning data is scarce, or when recognition requires analyzing the sequence at the frame level.

4.1.4. Unified Multimodal Transformers

Unified multimodal Transformers such as Flamingo [43] or Kosmos-2 [55] process textual and visual inputs jointly within a single model. Typically, this is achieved via early fusion, where both images and text are tokenized and jointly attended to through co-attention or a shared embedding space.
Unified VLM models are task-agnostic, performing equally well—or equally poorly—across different tasks. In general, they excel at fluid generative tasks and complex cross-modal reasoning. For example, they can infer “What action is most likely occurring and why?”, since the same model parses the image, reasons about actions, and generates an answer or label. On the other hand, while these models can handle heterogeneous actions, they are computationally demanding and require large, carefully curated datasets.

4.2. Practical Aspects

4.2.1. Choosing a Model

The optimal choice of VLM depends on the downstream task [47]. Retrieval and matching tasks are best served by contrastive models, while generation-oriented tasks such as captioning or Visual Question Answering (VQA) benefit more from generative or unified architectures. Adapter-based methods strike a practical middle ground between efficiency and flexibility, making them attractive for resource-constrained environments or when reusing powerful frozen backbones.

4.2.2. Dataset Sizes and Curation

A defining aspect of VLMs is their reliance on very large-scale datasets, often spanning hundreds of millions to billions of image–text pairs. Dataset scale plays a critical role in enabling broad generalization across tasks and modalities. At the same time, dataset quality is equally vital. Raw web data is noisy by default, requiring filtering and pre-processing to yield reliable alignment. In many cases, the training data is not publicly released, though accompanying reports often describe the collection strategy, which typically includes the following:
  • Web-scale collections of image–text pairs;
  • Filtering and bootstrapping for noise reduction;
  • Curated domain-specific datasets.
For example, BLIP employs caption bootstrapping to improve noisy web data, whereas curated datasets are crucial for fine-tuning domain-specific applications. To systematically evaluate dataset quality, DataComp [56] introduces a benchmark in which the CLIP architecture and training hyperparameters are fixed, isolating the impact of the pre-training dataset. Results show that dataset diversity and domain alignment significantly affect downstream performance and robustness. More broadly, scaling laws observed in large neural networks [57] emphasize that both dataset size and careful curation are central to advancing multimodal learning.

4.2.3. Fine-Tuning Strategies

Because training VLMs from scratch is highly resource-intensive, fine-tuning strategies have been adapted and several popular methods exist [47]:
  • Full fine-tuning of all model parameters;
  • Adapter-based fine-tuning (e.g., LoRA);
  • Prompt tuning;
  • Instruction tuning;
  • Probing classifiers.
Full fine-tuning, where all parameters are updated, is generally the most effective but also the most computationally expensive. Adapter-based techniques such as Low-Rank Adaptation (LoRA) [58] insert trainable modules into frozen backbones, allowing efficient adaptation with minimal parameter updates: only small, low-rank matrices in selected layers are trained, while most of the original model remains frozen, greatly reducing computation and memory requirements. Prompt tuning [59] goes even further, learning task-specific soft prompts (textual or visual) without modifying backbone weights.
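A minimal numeric sketch may make the LoRA parameter savings concrete. The dimensions and values below are toy choices: a rank-1 update to a 4×4 frozen weight trains 8 numbers instead of 16.

```python
def lora_delta(B, A, alpha, r):
    """Compute the low-rank update ΔW = (alpha / r) · B A,
    with B of shape d×r and A of shape r×k, where r << min(d, k)."""
    d, k = len(B), len(A[0])
    scale = alpha / r
    return [[scale * sum(B[i][t] * A[t][j] for t in range(r))
             for j in range(k)] for i in range(d)]

def apply_lora(W, B, A, alpha, r):
    """Effective weight W' = W + ΔW; the frozen W is never modified."""
    delta = lora_delta(B, A, alpha, r)
    return [[W[i][j] + delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

# Toy frozen weight (4x4) adapted with a rank-1 update: the trainable
# parameters are B (4x1) and A (1x4), i.e., 8 numbers instead of 16.
W = [[0.0] * 4 for _ in range(4)]
B = [[1.0], [0.0], [0.0], [0.0]]
A = [[0.5, 0.0, 0.0, 0.0]]
W_adapted = apply_lora(W, B, A, alpha=1, r=1)
```

For realistic layer sizes (e.g., 4096×4096 with rank 8) the same arithmetic reduces the trainable parameters by several orders of magnitude.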
Instruction tuning [46] has proven especially powerful in models like LLaVA and InstructBLIP, where training on instruction-style input–output pairs enhances zero-shot reasoning and alignment with human intent. Probing classifiers [60] offer another lightweight approach: embeddings from frozen VLMs can be used to train compact classifiers for downstream tasks such as multimodal fact-checking.

5. Bias in VLMs and Implications

Bias in predictors (including VLMs) usually refers to the difference between the expected prediction of the model and the true value it is trying to estimate. In essence, bias quantifies a systematic error introduced by the model prediction [61]. Two root causes are often identified as the source of model bias: bias in the training dataset and bias derived from model assumptions. Dataset bias refers to systematic errors or distortions in the data that cause certain patterns, groups, or outcomes to be over- or under-represented, meaning the dataset does not fully or fairly reflect the real-world distribution.
Probably the first and one of the most influential arguments is the “Name That Dataset” experiment introduced by Torralba and Efros [62]. There, each dataset is treated as a class in a classification task, and models are trained to predict the dataset of origin for each image. As early as 2011, SVMs were able to classify datasets with high accuracy [62], which highlights the built-in bias of visual datasets. More recently, Liu and He [63] revisited this experiment and reported 84.7% accuracy on a three-way classification task involving the YFCC, CC, and DataComp datasets. Furthermore, increasing the number of training samples or using stronger data augmentation further improved accuracy on held-out validation data even though the classification task itself became more challenging.
A key observation is that these datasets should, ideally, be difficult to distinguish from one another; the fact that they are so easily separable indicates strong built-in dataset bias, and later studies sought to identify its sources. For example, Zeng et al. [19] analyzed object-level semantics and used language models to describe dataset-specific traits.
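The “Name That Dataset” protocol can be illustrated in miniature: if even a trivial classifier can tell datasets apart from their features, the datasets carry distinguishable statistical signatures, i.e., built-in bias. The toy feature vectors and nearest-centroid rule below are illustrative stand-ins for the SVM setup of [62]:

```python
def centroid(samples):
    """Mean feature vector of a collection of samples."""
    n = len(samples)
    return [sum(s[i] for s in samples) / n for i in range(len(samples[0]))]

def nearest_dataset(x, centroids):
    """Predict the dataset of origin by nearest centroid."""
    def d2(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(centroids, key=lambda name: d2(x, centroids[name]))

# Toy feature vectors: each "dataset" has its own statistical signature.
dataset_a = [[0.1, 0.9], [0.2, 0.8], [0.0, 1.0]]
dataset_b = [[0.9, 0.1], [0.8, 0.2], [1.0, 0.0]]
cents = {"A": centroid(dataset_a), "B": centroid(dataset_b)}

preds = ([nearest_dataset(x, cents) for x in dataset_a],
         [nearest_dataset(x, cents) for x in dataset_b])
```

Perfect classification here reveals that the two collections are trivially separable; an unbiased pair of datasets drawn from the same distribution would yield near-chance accuracy.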
Concerning model bias, You et al. [64] showed that classifiers can easily distinguish images generated by different diffusion model families, though not within the same family. This implies that model-specific assumptions shape outputs in ways imperceptible to humans, demonstrating that unobservable biases may still meaningfully exist.
With respect to bias analysis and identification, there are two main directions. The first is more tangible, focusing on identifying social bias in both large databases and large models; it relies on basic statistical measurements that reveal non-uniformity and under- or over-representation of specific categories based on easily identifiable human traits such as age, gender, or race. The second is more algorithmic and aims to detect and isolate biases through the analysis of databases and, correspondingly, of the models; often the outcome is the identification of a category, but one that does not necessarily have a clear real-world interpretation. In the next subsection, we review key works along both directions after formalizing the mathematical notation.

5.1. Mathematics of Bias

Inductive Bias

Let us now formalize the notion of bias. We begin by discussing the problem in inductive learning, where the goal is to use a dataset $(x_i, y_i),\ i = 1, \dots, N$ to compute a model function $f_\theta(x)$ such that $f_\theta(x_i)$ approximates $y_i$ accurately. Here, $x_i \in \mathcal{X}$ denotes a data instance, while $y_i \in \mathcal{Y}$ represents the corresponding output, target variable, or label. The search for $f_\theta$ is carried out within a specific set $\Theta$, and the preference for this set is referred to as bias by Mitchell [65].
The inductive learning process is formulated as the following minimization problem:
$$f_\theta = \arg\min_{\theta} \mathcal{L}(f), \tag{11}$$
where training corresponds to solving the minimization in Equation (11). The training set $(x_i, y_i),\ i = 1, \dots, N$ is used to identify the parametric function $f_\theta$ that minimizes the loss function $\mathcal{L}$.
All machine learning techniques for inductive learning require some form of inductive bias to function effectively, and the choice of $\Theta$ is often a critical design decision. If the inductive bias is too low (i.e., $\Theta$ is too large), the model may overfit, allowing noise in the data to unduly influence the choice of $f_\theta$. This results in poor generalization, meaning that $f_\theta$ fails to approximate unseen data. Conversely, if the inductive bias is too high (i.e., $\Theta$ is too small), the model may underfit, producing poor approximations. Thus, inductive bias is essential, but it must be applied in the right amount.
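To make the role of $\Theta$ concrete, the following sketch (illustrative only; the dataset, noise level, and degree choices are our own) restricts the hypothesis set to polynomials of a given maximum degree and measures generalization error on a noisy cubic target:

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of a cubic target; the hypothesis set Theta is the
# space of polynomials of a chosen maximum degree.
x_train = np.linspace(-1, 1, 20)
y_train = x_train**3 - x_train + rng.normal(0, 0.1, x_train.size)
x_test = np.linspace(-1, 1, 200)
y_test = x_test**3 - x_test

def generalization_error(degree):
    """Fit f_theta inside the degree-limited hypothesis set; return test MSE."""
    coeffs = np.polyfit(x_train, y_train, degree)
    return float(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))

err_underfit = generalization_error(1)   # Theta too small: inductive bias too high
err_matched = generalization_error(3)    # Theta matched to the target
err_overfit = generalization_error(15)   # Theta too large: inductive bias too low
```

With the hypothesis set matched to the target, the test error is lowest; both the overly small and overly large sets generalize worse, mirroring the under/overfitting trade-off described above.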

5.2. Dataset Bias

As mentioned, bias in datasets refers to systematic errors or distortions that cause certain groups to be over-represented or unnaturally correlated with specific patterns. Following the approach of Gallegos et al. [18], we define the groups as follows:
A group $G \subseteq \mathcal{X}$ is a subset of the data that shares a particular trait, which may be fixed, contextual, or socially constructed. For example, if $\mathcal{X}$ describes persons (e.g., portrait images), the shared trait may correspond to, e.g., age, skin color, disability, gender identity, national origin, or race.
In the more general case of images, such traits may be less intuitive to define. For instance, a group might consist of images containing an unusually high concentration of high-frequency components (e.g., dominated by very small objects) or images where textures predominantly align within a specific range of orientations.
For a model to be considered unbiased, it must perform fairly across all possible groups. To formalize group fairness, we consider a model $f_\theta$ producing outcomes $\hat{Y} = f_\theta(X)$. Given a set of groups $\mathcal{G}$, group fairness requires approximate parity (within a tolerance $\epsilon$) across all groups $G \in \mathcal{G}$ with respect to a statistical outcome measure $M_Y(G)$ conditioned on group membership:
$$\left| M_Y(G) - M_Y(G') \right| \leq \epsilon, \quad \forall\, G, G' \in \mathcal{G}. \tag{12}$$
The choice of $M$ specifies the fairness constraint, which is inherently subjective and context-dependent. Examples of $M$ include accuracy, true-positive rate, and false-positive rate.
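The group-parity condition above is straightforward to check in practice: compute $M$ per group and compare the largest pairwise gap against the tolerance. A minimal sketch, with toy data of our own construction:

```python
import numpy as np

def group_parity_gap(y_true, y_pred, groups, metric):
    """Largest pairwise gap |M_Y(G) - M_Y(G')| over all groups."""
    values = [metric(y_true[groups == g], y_pred[groups == g])
              for g in np.unique(groups)]
    return max(values) - min(values)

def accuracy(yt, yp):
    return float(np.mean(yt == yp))

# Toy predictions for two groups, A and B.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])
groups = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])

gap = group_parity_gap(y_true, y_pred, groups, accuracy)
epsilon = 0.1
is_group_fair = gap <= epsilon   # here both groups reach accuracy 0.75
```

Swapping `accuracy` for a true-positive-rate or false-positive-rate function changes which fairness constraint is being enforced, without touching the parity check itself.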
Group fairness may be affected by unwanted correlations [61,66]. To quantify these effects, a bias score $b$ for an outcome $\hat{y}$ with respect to a group trait $x_g$ is calculated as follows:
$$b(\hat{y}; x_g) = \frac{c(\hat{y}; x_g)}{\sum_{x \in G} c(\hat{y}; x)}, \tag{13}$$
where $c(\hat{y}; x_g)$ denotes the number of co-occurrences of the outcome $\hat{y}$ and the trait $x_g$ within the group $G$. If $b(\hat{y}; x_g) > \frac{1}{|G|}$, then $\hat{y}$ is positively correlated with $x_g$, indicating that the data is biased with respect to this group. Lately, this has been referred to as spurious correlation [67].
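The bias score above reduces to a normalized co-occurrence count. A small sketch, with hypothetical caption counts chosen purely for illustration:

```python
def bias_score(counts, outcome, trait, traits):
    """b(outcome; trait) = c(outcome; trait) / sum over all traits of
    c(outcome; x), i.e., a normalized co-occurrence count."""
    total = sum(counts.get((outcome, t), 0) for t in traits)
    return counts.get((outcome, trait), 0) / total

# Hypothetical counts: how often the caption word "cooking" co-occurs
# with each perceived-gender trait in some dataset.
traits = ["female", "male"]
counts = {("cooking", "female"): 70, ("cooking", "male"): 30}

b = bias_score(counts, "cooking", "female", traits)
is_spurious = b > 1.0 / len(traits)   # 0.7 > 0.5: positively correlated
```

Here the score exceeds $1/|G| = 0.5$, so in this toy dataset the outcome "cooking" would be flagged as spuriously correlated with the "female" trait.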
A stricter requirement is individual fairness, which demands fairness at the level of each individual data point. Following [18], let $x, x' \in \mathcal{X}$ be two individuals, and let $d: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ denote a distance metric. Let $\mathcal{O}$ be the set of possible outcomes, and let $f_\theta^M : \mathcal{X} \to \Delta(\mathcal{O})$ denote a mapping from an individual to a distribution over outcomes. Individual fairness requires that individuals who are similar with respect to the task receive similar treatment, i.e.,
$$\forall\, x, x' \in \mathcal{X}: \quad D\!\left( f_\theta^M(x),\, f_\theta^M(x') \right) \leq d(x, x'), \tag{14}$$
where $D$ is a divergence measure between distributions, such as statistical distance.
The trait defining a group may be either explicit and intuitive or subtle and harder to detect. Gallegos et al. [18] distinguish between these cases via the concept of fairness through unawareness. A model satisfies it if a sensitive group $G$ is not explicitly used, such that
$$f_\theta(G) = f_\theta(\neg G), \tag{15}$$
where $\neg G$ denotes the complement of $G$, i.e., $\neg G = \mathcal{X} \setminus G$.
A model is further said to satisfy invariance if, for any two groups $G_i, G_j$, the outcomes $f_\theta(G_i)$ and $f_\theta(G_j)$ are identical under some invariance metric $\psi$.
Systematic error or bias is easier to measure when the trait of interest corresponds directly to a specific label value (in classification) or range (in regression). For example, in binary classification, one may require that the overall misclassification rate (OMR) be independent of a protected attribute $A$ (where $A \in \{G_i, G_j\}$). The corresponding fairness condition for the classifier is as follows [68]:
$$P(\hat{Y} \neq y \mid A \in G_i, y = 0) = P(\hat{Y} \neq y \mid A \in G_i, y = 1), \tag{16}$$
where $\hat{Y}$ is the classifier output $f_\theta(x)$, and $y$ is the true label for input $x$. Both $\hat{Y}$ and $y$ take values in $\{0, 1\}$.
In Equation (16), the chosen measure is the OMR, but the property can equivalently be formulated using other quality metrics, such as the false-positive rate (FPR), false-negative rate (FNR), false omission rate (FOR), or false discovery rate (FDR) [61].
To enforce fairness, the model optimization process described in Equation (11) is constrained to ensure comparable error rates across all relevant groups $G_i \in \mathcal{G}$:
$$f_\theta = \arg\min_{\theta} \mathcal{L}(f), \quad \text{s.t. } P\big(f_\theta(x) \neq y \mid A \in G_i\big) = P\big(f_\theta(x) \neq y \mid A \in G_j\big). \tag{17}$$
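In practice, the hard constraint in Equation (17) is often relaxed into a differentiable penalty added to the loss. The sketch below is a toy logistic regression on synthetic data of our own design, with a soft per-group error gap standing in for the misclassification-rate constraint; it is an illustration of the penalty idea, not a production debiasing method:

```python
import numpy as np

# Deterministic 1-D data: group 0 has noisy labels (every 5th flipped),
# group 1 is clean, so an unconstrained classifier errs more on group 0.
x = np.tile(np.linspace(-2, 2, 40), 2)
A = np.repeat([0, 1], 40)
y = (x > 0).astype(float)
y[:40:5] = 1 - y[:40:5]

Xb = np.c_[x, np.ones_like(x)]   # feature + intercept

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_error_gap(lam, steps=8000, lr=0.05):
    """Minimize BCE + lam * (gap between the groups' soft errors)^2,
    a penalty relaxation of the equal-error-rate constraint."""
    w = np.zeros(2)
    for _ in range(steps):
        p = sigmoid(Xb @ w)
        grad = Xb.T @ (p - y) / len(y)                  # BCE gradient
        s = (1 - 2 * y) * p * (1 - p)                   # d|p - y| / d(logit)
        e = [np.mean(np.abs(p[A == g] - y[A == g])) for g in (0, 1)]
        de = [(Xb[A == g] * s[A == g, None]).mean(0) for g in (0, 1)]
        grad += lam * 2.0 * (e[0] - e[1]) * (de[0] - de[1])
        w -= lr * grad
    p = sigmoid(Xb @ w)
    return abs(np.mean(np.abs(p[A == 0] - y[A == 0]))
               - np.mean(np.abs(p[A == 1] - y[A == 1])))

gap_unconstrained = soft_error_gap(lam=0.0)
gap_penalized = soft_error_gap(lam=25.0)   # penalty shrinks the gap
```

Note that the penalty narrows the gap partly by degrading confidence on the clean group, which anticipates the fairness–accuracy trade-offs discussed next.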

5.3. Theoretical Limitations in Bias and Fairness

Kleinberg et al. [69] rigorously examined the formulation in Equation (17) in relation to the binary classification case ($y \in \{0, 1\}$) and introduced three fairness conditions: (A) calibration within groups (each bin in the database should proportionally reflect the true values across groups), (B) balance for positives, and (C) balance for negatives. Conditions (B) and (C) ensure that score assignments are not systematically more inaccurate for positive or negative instances in one group than in another. They established and proved the following theorem [69]:
Theorem 1.
Consider an instance of the problem in which there exists a risk assignment (or loss function) satisfying fairness conditions (A), (B), and (C). Then, the instance must either permit perfect prediction (with $f_\theta(x)$ equal to 0 or 1 for all $x$) or exhibit equal base rates.
This theorem can also be formulated and proven in terms of asymptotic tendencies. It primarily addresses formal model-level fairness by highlighting the trade-offs between dataset statistics (e.g., unequal base rates) and predictor capabilities. The intuition is straightforward: if a dataset has unequal base rates, then achieving fairness metrics requires predictor adjustments through prior assumptions that artificially distort the feature space. Such distortions, however, prevent perfect performance, especially when the test set approximates the true (but unavailable) population distribution.
These foundational results were later extended. Zhao and Gordon [70] derived a lower bound on the joint error of classifiers satisfying statistical parity under unequal base rates, demonstrating that fairness constraints necessarily induce certain error behaviors. Their work links the learning of fair representations with information-theoretic bounds, bridging predictor bias (how fairness constraints affect prediction error) and dataset bias (group-specific base rates and distributions). However, when the population contains more than two groups, the performance lower bound no longer has a closed-form analytical expression [70].
While these theoretical insights are valuable, their existence alone does not resolve the problem of bias. Practical interventions have been proposed, most notably by Hardt et al. [71], who modified classifiers to satisfy Equalized Odds and Equality of Opportunity. Given an unfair classifier $f_\theta$, they constructed a new classifier $f_{\theta_2}$ by solving an optimization problem with fairness loss terms. However, Woodworth et al. [72] later showed that such modifications can significantly reduce accuracy, for example when the loss function is not strictly convex. Moreover, these methods require prior knowledge of dataset biases, such as group-specific label distributions.
Further work by Lee and Shou [73] investigated bias and its compensation in continuous outcomes. They demonstrated both theoretically and empirically that ML models often exhibit systematic biases in regression settings—for example, under-predicting large values and over-predicting small ones, a phenomenon known as central-tendency bias.
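Central-tendency bias is easy to reproduce: any estimator that shrinks predictions toward the mean will under-predict large targets and over-predict small ones. A minimal sketch, using synthetic data and a deliberately over-regularized ridge fit (both our own choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear target with noise; x and y are (approximately) zero-centered.
x = rng.normal(0.0, 1.0, 500)
y = 2.0 * x + rng.normal(0.0, 0.5, 500)

lam = 500.0                      # deliberately strong shrinkage
w = (x @ y) / (x @ x + lam)      # ridge closed form (no intercept)
resid = y - w * x

# Mean residuals in the tails of the target distribution:
bias_top = resid[y > np.quantile(y, 0.9)].mean()     # > 0: under-prediction
bias_bottom = resid[y < np.quantile(y, 0.1)].mean()  # < 0: over-prediction
```

The regularized slope is pulled well below the true coefficient, so the top decile of targets is systematically under-predicted and the bottom decile over-predicted, which is exactly the central-tendency pattern described above.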
In the following subsections, we review recent and effective methods for mitigating bias. Since this work focuses on human-centric data, we also consider the relevance of social unfairness and examine studies that identify social biases in large models.

5.4. Bias Quantification and Mitigation

Bias remains a persistent problem for the machine learning community. Over time, stronger and newer models have become less prone to bias, but the problem persists. To illustrate this, Figure 3 presents a prompt and the images generated by an ML solution; the results show clear differences based on race between the two categories. Likewise, different VLMs can exhibit subtle biases when generating descriptions of people (see Figure 4).
While surveying more than 100 studies on fairness, Hort et al. [25] identified several categories of bias mitigation methods, classified first by the moment of intervention; each category is then further subdivided as follows.
  • Pre-Processing Methods:
    • Relabeling and Perturbation (RP): Modify training data either by changing ground-truth labels (relabeling) or altering feature values (perturbation).
    • Sampling (Samp): Adjust the training data distribution by adding or removing samples, or by reweighting their influence during training.
    • Latent Variables (LV): Augment training data with additional, preferably unbiased, features.
    • Representation Learning (Repr): Learn transformations of training data that reduce bias while retaining as much useful information as possible.
  • In-Processing Methods:
    • Regularization and Constraints (RC): Modify the learning algorithm’s loss function. Regularization introduces penalty terms for discrimination (increasing loss when discrimination occurs), while constraints impose bias limits that cannot be violated during training.
    • Adversarial Learning (AvL): Train models alongside adversaries. The model predicts ground-truth values, while the adversary attempts to exploit fairness violations.
    • Compositional Approaches (CA): Train multiple classifiers, each specialized for a specific group (e.g., privileged vs. unprivileged), and combine their predictions.
    • Adjusted Learning (AdL): Adapt existing algorithms or develop new ones with explicit bias mitigation mechanisms.
  • Post-Processing Methods:
    • Input Correction: Apply modifications to test data before prediction.
    • Classifier Correction: Adjust trained models directly to satisfy fairness criteria.
    • Output Correction: Modify predicted labels as a final step in enforcing fairness.
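As a concrete instance of classifier/output correction, the sketch below (toy scores of our own construction) picks a per-group decision threshold so that every group has the same positive-prediction rate; per-group thresholding of this kind is the mechanism behind toolkits such as OxonFair [79]:

```python
import numpy as np

def parity_thresholds(scores, groups, target_rate):
    """Post-processing correction: choose a per-group score threshold so
    that each group's positive-prediction rate equals target_rate."""
    return {g: np.quantile(scores[groups == g], 1.0 - target_rate)
            for g in np.unique(groups)}

# Toy model scores: group A systematically receives higher scores,
# so a single global threshold of 0.5 would accept all of A and none of B.
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.2])
groups = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])

th = parity_thresholds(scores, groups, target_rate=0.5)
y_hat = np.array([s >= th[g] for s, g in zip(scores, groups)])

rate_A = float(y_hat[groups == "A"].mean())
rate_B = float(y_hat[groups == "B"].mean())   # both rates equal 0.5
```

The trained scoring model is untouched; only the decision rule is corrected, which is what distinguishes post-processing from the pre- and in-processing families above.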
Bias can affect any of the ML modules that are used in the general architecture in Figure 2, and detailed discussions are provided in [25,26]. In this work, however, our primary interest lies in how biases within VLMs influence the final solution, and thus we focus on recent proposals that aimed to isolate and measure bias in vision–language models. A compact view of these solutions is given in Table 1.
Table 1. General bias estimation solutions for VLMs. Methods listed here aim to estimate and correct the bias of a VLM with respect to a specified group. The “Moment” refers to the point of intervention in the task processing, which may be before (“Pre-Proc”), during (“In-Proc”), or after, by correcting the prediction (“Post-Proc”).
| Solution | Moment | Type | VLM Targeted | Benchmarks Reported |
| --- | --- | --- | --- | --- |
| Revise [74] | Pre-Proc | RP, Samp | All | COCO; Places; Visual Genome, etc. |
| VLBiasBench [75] | Pre-Proc | Samp | All | Synthetic |
| Multifair [76] | Pre-Proc | Samp, Repr | All | CelebA |
| Zhu et al. [77] | Pre-Proc | RP | All | Debiased COCO; other |
| BiMa [78] | Pre-Proc | LV | All txt-vid retrieval | MSR-VTT; ActivityNet; etc. |
| OxonFair [79] | Pre-Proc | AdL | All | CelebA; others |
| DomInd [80] | In-Proc | RC | All | CIFAR-10S; CelebA |
| IDR [81] | In-Proc | RC | All | MiniImageNet; CalTech 101, etc. |
| BA-LoRA [82] | In-Proc | RC | All (LLM) | Waterbirds; CelebA |
| PRISM [67] | In-Proc | RC | CLIP | Waterbirds; CelebA |
| LogicCLIP [83] | In-Proc | RC | All | LogicBench |
| DIM [84] | In-Proc | CA, RC | All | CIFAR-100; Breeds |
| REAL [85] | Post-Proc | CC | All | ImageNet; Flowers; EuroSAT; etc. |
| TriProTesting [24] | Post-Proc | CC | All | CelebA; UTKFace; FairFace, etc. |
Among the recent works that focus on large databases, the study of Wang et al. [74] presents REVISE, a tool that detects and mitigates dataset biases across objects, demographics, and geography early in the ML pipeline. Wang et al. [75] introduce VLBiasBench, a large benchmark evaluating social and intersectional biases in VLMs, finding commercial models to be generally less biased than open-source ones. Delaney et al. [79] present OxonFair, a customizable fairness toolkit that uses per-group thresholding to balance accuracy and fairness across domains. Tian et al. [76] introduce MultiFair, a mixup-based method that enforces fairness across multiple sensitive attributes simultaneously. Zhu et al. [77] reveal flaws in membership inference benchmarks for VLM auditing and construct unbiased datasets to show realistic auditing challenges.
With respect to bias mitigation methods that adjust the learning in-process, Wang et al. [80] created CIFAR-10S to benchmark debiasing strategies and showed that domain-independent training outperforms adversarial approaches in reducing spurious correlations. Zhang et al. [84] introduce DIM, a framework that discovers multiple biased subgroups in image classifiers and mitigates them for greater robustness. Ma et al. [81] propose Intrinsic Dimension Regularization, showing that balancing class perceptual manifold complexity improves fairness. Chang et al. [82] propose BA-LoRA, a fine-tuning method with novel regularizers that efficiently mitigates pre-training bias in LLMs. Molahasani et al. [67] introduce PRISM, a data-free, task-agnostic method that removes spurious bias in vision–language models using LLM-generated correlations and projection learning. Zhou et al. [83] introduce LogicBench and LogicCLIP, showing VLMs’ logical blindspots and enhancing reasoning with logic-aware training. Le et al. [78] propose BiMa, a framework that disentangles visual and textual biases in text–video retrieval to improve generalization.
Recent post-processing methods include REAL, a retrieval-augmented approach proposed by Parashar et al. [85] that uses LLMs to improve VLM performance on rare concepts in long-tailed distributions. Sun et al. [24] proposed a two-stage method that first uses TriProTesting to identify points of imbalance and then mitigates them through adaptive logit adjustment; their work again emphasizes that VLMs preserve societal biases inherited from the large databases on which they were trained.
Overall, the methods summarized in this subsection are general, targeting numerical or statistical bias rather than a specific, narrow category. Several aspects should be emphasized: (i) these methods isolate bias, which is sometimes explainable and sometimes not, and demonstrate that it exists; (ii) Zhu et al. [77], who evaluated multiple models, found them rather ineffective and concluded that substantial improvement is needed, whereas methods that target particular, tangible biases, such as the so-called ethical and social biases, tend to be more effective, as directed algorithms show greater improvement; (iii) biases and uneven performance may appear across groups, but one must first define the groups of interest, which are problem-dependent, and examples exist beyond societal biases, for instance spurious correlations between certain background colors and corresponding emotions.

5.5. Social Bias and Fairness

Since the purpose of this paper is to examine human-centric solutions, it is logical to study the unevenness found in databases and foundational models that is likely transferable to specific solutions. The survey of Gallegos et al. [18] offers an integrative review, but it focuses on language models; while its findings are partially transferable to visual models, they are not identical. With respect to VLMs, several directions can be identified in recent works.

5.5.1. Bias in Training Data and Representation Spaces

One perspective is that the root cause lies in the large databases used for training. A particular problem is that, in many cases, such databases are not public, so their bias can only be inferred. One very large database, however, is public: LAION-400M. Birhane et al. [86] critically examined it, revealing that large-scale internet-scraped datasets contain pornography, slurs, and stereotypes, and they argue that “scale beats noise” is an inadequate justification for dataset quality.

5.5.2. Bias in Image Captioning Outputs

Works in this area isolate bias in the outputs of predictors. Zhao et al. [87] analyzed racial bias in image captioning, showing that darker-skinned individuals are described with less detail, lower accuracy, and differing sentiment compared to lighter-skinned individuals. Sabir et al. [88] investigated spurious gender correlations in captions, introducing a gender score to quantify how objects like “lipstick” are strongly and stereotypically linked to women. Abdelrahman et al. [89] introduced the ImageCaptioner metric to assess how captioning models amplify dataset bias for protected attributes, finding widespread and nontrivial bias amplification across architectures and datasets. Fraser et al. [90] constructed a parallel image dataset to demonstrate systematic differences in generated captions depending only on perceived gender or race, revealing stereotypical tendencies in LVLM outputs. Konavoor et al. [91] systematically mapped gender associations in contrastive vision–language encoders by comparing embeddings of male/female faces with labor-related statements, producing category-wise bias profiles with uncertainty estimates.

5.5.3. Mitigation-Oriented Approaches in Captioning

Here, works aim to both isolate bias and propose specific methods to mitigate it. Yang et al. [92] proposed adversarial masking to suppress gendered words in captions, replacing over 99% of gender markers with neutral terms while maintaining high captioning quality. Hirota et al. [93] developed LIBRA, a debiasing framework that addresses both contextual and stereotypical gender bias in captions, reducing misclassification while improving output fairness. Hamideh et al. [20] proposed the So-B-IT taxonomy of social biases and empirically showed that CLIP associates harmful stereotypes (e.g., linking “terrorist” disproportionately with Middle Eastern men), with debiasing efforts often redistributing bias rather than eliminating it.

5.5.4. Normative and Philosophical Considerations

Other works take a more detached perspective. Buijsman [94] argues that fairness metrics in AI cannot all be optimized simultaneously and proposes Rawls’ justice as fairness as a guiding principle, prioritizing the least advantaged when navigating technical fairness–accuracy trade-offs.
In conclusion, these works collectively reveal that VLMs and captioning systems encode, amplify, and sometimes exacerbate social stereotypes along dimensions such as race, gender, and occupation, with biases arising both from pre-training datasets and model architectures.
Many studies in image captioning focus on gender, especially gendered pronouns, misclassification, and stereotypical associations. Fewer, though solid, works address other axes such as race, skin tone, intersectional bias, and ableism. The datasets used, such as MSCOCO, often carry their own bias, and quantification depends strongly on annotation quality (e.g., perceived skin color, manual labeling). Some metrics treat any gendered term as undesirable, which may not always be appropriate depending on context (when gender is visible and relevant). Distinguishing unnecessary or stereotyped gender mentions from correct, visibly grounded ones therefore remains a delicate balance.
While debiasing strategies (e.g., masking, adversarial interventions, fairness-aware training) show partial success, bias removal remains an open challenge—feasible in controlled cases, but far from a solved problem, often trading one form of bias for another.

6. VLMs in Human Action Recognition

In this section, we explore recent progress in human action recognition (HAR), focusing on approaches that integrate multimodal reasoning and contextual understanding through VLMs. In the context of HAR, the video modality plays a central role, which implies the need for well-synchronized captions and, optionally, additional scene descriptors such as object detections. The domain is mature, and recent research has focused on the specific benchmarks summarized in Table 2.
The idea is as follows: Modern VLMs built on large pre-trained language models can effectively leverage richer and more structured annotations, including scene context (e.g., “inside a kitchen”, “outdoors on a baseball field”, as in Kinetics-600 [95] or Ego4D [96]), object relations (e.g., “placing the cup on the table next to the laptop”, as in Something-Something V2 [97]), and goal-driven narratives (e.g., “searching for keys before leaving the house”, as in Ego4D [96] or YouCook2 [98]). Here, the term “fine-grained” refers to the ability to distinguish between actions that share similar motion patterns but differ in object identity, spatial relations, or intent—for example, reaching for versus grasping an object, or placing versus adjusting it—rather than assigning broad categories such as cooking or playing sports. These detailed annotations help resolve ambiguity between visually similar motions by providing richer semantic cues that ground entities and their interactions within a specific scene, enabling more precise action understanding.
In the research community, there is broad recognition that both scene context and object interaction relations are valuable for action recognition, and that the most effective models integrate them both rather than prioritizing one exclusively.
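The simplest way such language cues enter a recognition pipeline is prompt-based zero-shot classification: each candidate action is rendered as a sentence, embedded by the text tower, and matched to the video embedding by cosine similarity. The sketch below is schematic; the toy bag-of-words encoder merely stands in for a real VLM text/video encoder pair:

```python
import numpy as np

def zero_shot_action(video_emb, class_prompts, text_encoder):
    """CLIP-style zero-shot recognition: embed every prompt, score the
    video embedding against each by cosine similarity, return the best."""
    names = list(class_prompts)
    T = np.stack([text_encoder(class_prompts[n]) for n in names])
    T = T / np.linalg.norm(T, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb)
    return names[int(np.argmax(T @ v))]

def toy_encoder(text, dim=64):
    """Deterministic bag-of-words stand-in for a VLM text tower."""
    v = np.zeros(dim)
    for word in text.lower().split():
        v[sum(ord(c) for c in word) % dim] += 1.0
    return v

prompts = {
    "grasping": "a photo of a person grasping an object",
    "placing": "a photo of a person placing an object on a table",
}
video_emb = toy_encoder("person grasping an object")    # mock video embedding
pred = zero_shot_action(video_emb, prompts, toy_encoder)   # -> "grasping"
```

Richer prompts carrying scene context and object relations (as in the datasets above) sharpen exactly this similarity comparison, which is why fine-grained annotations help disambiguate visually similar motions.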
As outlined in the introduction, the main architectural approaches in VLMs can be grouped into four primary categories. Table 3 condenses the reviewed methods and highlights their advantages over unimodal approaches.
Table 2. Popular benchmarks for action recognition. In the observation columns, we nominate which context contributes most to HAR performance: Low—L, Medium—M, High—H.
| Benchmark | Size/Duration | Data | Observations |
| --- | --- | --- | --- |
| MSR-VTT [99] | 10k YouTube videos; 200k descriptions (20 captions per clip); 40 h | Videos (10–30 s each) + manually labeled captions (avg 10 words/caption) | Standard benchmark for video → text retrieval and video question answering (Video QA). Scene context: H (diverse environments, activities) + interaction context: L (implicit, coarse) → tasks: video–text retrieval, Video QA, captioning. |
| YouCook2 [98] | 2k videos; 176 h (avg 5.26 min/video; each video is divided into 7–16 clips) | Cooking instructional videos + one caption per clip (10–20 words) | Used in instructional video understanding tasks (captioning, retrieval). Scene context: M (kitchen, cooking setup) + interaction context: M (human–object, procedural) → tasks: instructional captioning, retrieval, procedure understanding. |
| Kinetics-600 [95] | 474k YouTube videos (avg 10 s/video); 1317 h | Videos + 600 distinct action classes (e.g., sports, domestic tasks) | Mainly used for video-level action classification. Scene context: H (sports, indoor/outdoor cues) + interaction context: L (coarse labels) → tasks: action classification, representation, pre-training. |
| Ego4D [96] | 3670 h collected from 923 participants across 74 locations and 9 countries (avg 8 min/video) | Egocentric videos (first-person view) + narrations; each narration is a free-form sentence associated with a single timestamp | Scene context: H (egocentric, environment-aware) + interaction context: H (who–does–what–why, long-term) → tasks: long-term action anticipation, temporal grounding, QA, NL retrieval. |
| Something-Something-V2 (SSv2) [97] | 170k videos (220k clips, 4–5 s/clip); 276 h | Clips + labels; each clip is labeled with a template caption such as “Putting [something] onto [something else]” | Widely used for action recognition tasks—specifically, video-level classification of fine-grained human–object interactions (e.g., “Pushing [something] from left to right”). Scene context: L (background suppressed) + interaction context: H (fine-grained object relations) → tasks: fine-grained action recognition, interaction-centric classification. |
| COIN [100] | 11k videos; 476 h; 46,354 annotated segments | Instructional videos covering 180 distinct tasks across 12 domains (e.g., vehicles, gadgets, household items) + annotations | Supports tasks such as step localization (finding where each step occurs in the video) and action segmentation (parsing the video into segments corresponding to steps). Scene context: M (task-/domain-specific) + interaction context: M (step-wise human–object actions) → tasks: action segmentation, step localization, instruction understanding. |
The first category of solutions is based on early fusion, referred to as the “shared-type” architecture by Luo et al. [101], where video and language modalities are fused together at the initial stages of encoding. This approach was used in one of the earliest instances of VLMs applied to HAR, namely, VideoBERT [102], which learns unified video–language representations through self-supervised pre-training on large-scale videos from YouTube paired with their captions using a masked token completion strategy. A key limitation of VideoBERT is its tendency to lose fine-grained, frame-level visual and temporal details, such as subtle object movements or minor scene changes.
ActBERT [103] addresses this limitation by explicitly modeling global actions and local regional objects alongside linguistic descriptions. Its training objectives include masked language modeling, masked action classification, masked object classification, and cross-modal matching to better align video and text representations. Pre-trained on large instructional datasets like HowTo100M [104] (a data sample—caption + visual—can be seen in Figure 5), ActBERT was evaluated on challenging tasks such as action step localization (CrossTask [105]) and action segmentation (COIN [100]), demonstrating the efficacy of incorporating detailed local cues.
The second category follows a late-fusion paradigm, employing independent video and text encoders with no cross-attention during unimodal encoding. UniVL [101] exemplifies this approach with separate unimodal encoders, a cross-modal encoder, and a decoder trained using coarse-grained contrastive, fine-grained masked modeling, and generative objectives. It was evaluated on YouCook2 [98] for tasks such as video–text retrieval and caption generation.
VideoCLIP [106] extends this paradigm using a dual-encoder structure trained solely with an InfoNCE contrastive objective and derives global embeddings via average token pooling rather than a [CLS] token, preserving token-level features useful for temporal localization. Frozen in Time [107] maintains independent encoders but unifies image and video pre-training by treating images as single-frame videos and progressively increasing temporal context through a curriculum strategy. Despite these advances, coarse global alignment remains a limitation of dual-encoder methods, motivating the development of X-CLIP [108], which introduces both coarse (video–sentence) and fine-grained (frame–word) alignment, cross-granular contrasts, and an Attention over Similarity Matrix mechanism. More recently, the Perception Encoder family [109] has demonstrated that scaled vision–language contrastive pre-training combined with intermediate-layer language and spatial alignment can achieve strong performance with a relatively simple design.
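The InfoNCE objective used by VideoCLIP (and, in variants, by most of the dual-encoder models above) can be written in a few lines. The sketch below uses numpy and random vectors; real systems feed encoder outputs and typically learn the temperature:

```python
import numpy as np

def info_nce(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch: row i of video_emb should match
    row i of text_emb and mismatch every other row in the batch."""
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = (v @ t.T) / temperature           # (B, B) cosine similarities

    def xent_diag(m):                          # -mean log-softmax of diagonal
        m = m - m.max(axis=1, keepdims=True)   # numerical stability
        log_z = np.log(np.exp(m).sum(axis=1))
        return float(np.mean(log_z - np.diag(m)))

    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))

rng = np.random.default_rng(0)
t = rng.normal(size=(4, 8))
loss_aligned = info_nce(t + 0.01 * rng.normal(size=(4, 8)), t)
loss_shuffled = info_nce(t[::-1].copy(), t)   # mismatched pairs score worse
```

The loss is low when each video embedding is closest to its own caption and high for mismatched pairings, which is precisely the coarse global alignment whose limitations motivated the finer-grained contrasts of X-CLIP.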
The next paradigm, unified multimodal transfer, or fusion-in-the-backbone architectures, integrates visual and linguistic modalities directly within the model, allowing cross-modal interactions to occur during encoding rather than only at the output level. Here, Flamingo [43] takes a generative LLM-centric approach, connecting frozen visual and language backbones (e.g., Chinchilla 70B [110]) via gated cross-attention layers, where the GATE mechanism blocks the influence of the visual input on the LLM and gradually increases it during training, preventing the model from becoming destabilized. EgoVLPv2 [111], designed for egocentric video understanding, can flexibly switch between dual-encoder and fully fused modes via a gating scalar, enabling both efficient encoding and detailed action reasoning. Extending this approach to long-term action prediction, PALM [112] decomposes tasks into action recognition, captioning, and LLM-based action anticipation modules, leveraging in-context learning to predict future actions in Ego4D [96] and EPIC-KITCHENS [113] datasets. At larger scales, InternVL [114] aligns a six-billion-parameter Vision Transformer (InternViT) with a substantial LLM (QLLaMA, ≈8B) using cross-attention layers to produce fully LLM-compatible embeddings, achieving state-of-the-art performance on Kinetics-600 [95] and Something-Something V2 [97]. Tarsier2-7B builds upon InternVL, incorporating denser latent queries and more sophisticated temporal modeling to enhance subtle action dynamics, supporting both contrastive (with cross-modal fused representations) and generative downstream tasks.
The final category extends pre-trained LLMs with visual inputs, leveraging frozen linguistic priors while integrating vision through lightweight adapters or projection modules. mPLUG-2 [115] adopts a modular design that disentangles spatial and temporal features via a dual-vision encoder, using shared layers and cross-attention to align visual and textual representations in a common semantic space. BLIP-2 [50] follows a model-level bootstrapping strategy, connecting frozen vision encoders to frozen LLMs through a lightweight Q-Former that produces compact query embeddings interpretable by the LLM. This design enables competitive action recognition and multimodal generation while requiring up to ≈ 54 × fewer trainable parameters than Flamingo variants, highlighting the efficiency of leveraging pre-trained linguistic knowledge for structured reasoning over complex visual dynamics.
Table 3. Selected human action recognition solutions that benefited from VLM captioning. R@n = recall at n; FT = fine-tuned; ZS = zero shot; Acc = accuracy.
Solution | VLM (Visual Language Model/Architecture) | Benchmark and Performance (Metric and Value) | Improvement/Key Contribution
Luo et al., 2021 [101] | UniVL (Text Encoder + Video Encoder; Cross-Modal Encoder; Decoder) | Video → Text Retrieval on YouCook2 (FT): R@1: 28.9%; R@10: 70.0%. Video Captioning on YouCook2 (FT): BLEU-4: 17.35, CIDEr: 1.81, METEOR: 22.35. | Early cross-modal VLM; objectives: retrieval, captioning; fine-grained semantics limited by caption quality (e.g., noisy ASR in HowTo100M); no LLM → limited contextual reasoning.
Xu et al., 2021 [106] | VideoCLIP (CLIP + Video Encoder) | Text → Video Retrieval on YouCook2 (ZS): R@1: 22.7, R@5: 50.4, R@10: 63.1. Action Segmentation on COIN (ZS): Accuracy 58.9%. | Contrastive video–text model; objectives: retrieval, zero-shot transfer; fine-grained reasoning constrained by pre-training captions; limited temporal modeling.
Alayrac et al., 2022 [43] | Flamingo (Frozen Vision Encoder + LLM) | VQA on MSRVTT (ZS): Top-1: 19.2%; (32 shots): Top-1: 31.0%. | LLM-based VLM enabling in-context few-shot learning; objectives: retrieval, captioning, QA; distinguishes similar actions via compositional/contextual reasoning; prompt-sensitive, high compute, weaker for strict classification.
Pramanick et al., 2023 [111] | EgoVLPv2 (TimeSformer + RoBERTa-B + Cross-Attention Gated Fusion) | Video → Text Retrieval on EgoMCQ (ZS): Inter Acc. 91.0%, Intra Acc. 60.9%. Video QA on EgoTaskQA (head-tuned): Mean Acc.: 36.53%. | Egocentric VLM pre-trained on Ego4D; objectives: retrieval, QA, grounding; fine-grained semantics via hand/object and temporal context; relies on dense aligned narrations.
Xu et al., 2023 [115] | mPLUG-2 (Dual-Vision Encoder + Cross-Attention Fusion + Shared Decoder) | Text → Video Retrieval on MSRVTT (ZS): R@1: 47.1; (FT): R@1: 53.1. Video Captioning on MSRVTT (FT): CIDEr: 80.3; BLEU-4: 57.8. Video QA on MSRVTT-QA (FT): Acc. 48.0. | Video–language model exploiting cross-modal correlations; objectives: retrieval, captioning, QA; fine-grained semantics via object–actor–scene grounding; needs careful alignment in multi-task supervision.
Chen et al., 2024 [114] | InternVL (InternViT-6B + Language Middleware + LLM Decoder) | Text → Video Retrieval on MSRVTT (ZS): R@1: 46.3. Video Classification on Kinetics-600 (ZS): 78.8% (Top-1/5 average). | Large-scale multimodal VLM with LLM alignment; objectives: cross-modal matching, generative captioning; fine-grained semantics limited by lack of temporal/action supervision; weaker for long-horizon reasoning.
Bolya et al., 2025 [109] | Perception Encoder (PE) (Scaled Vision Encoder + CLIP-Like Pre-Training + Video Data Engine) | Video Classification on Kinetics-400 (ZS, 8 frames): Top-1: 76.9%. Kinetics-600 (ZS, 8 frames): Top-1: 76.1%. Text → Video Retrieval on MSR-VTT (ZS): R@1: 51.2. | Scaled vision encoder with CLIP-style pre-training; objectives: zero-shot classification, retrieval; limited fine-grained multimodal grounding; coarse linguistic and interaction supervision.
Yuan et al., 2025 [116] | Tarsier2-7B (VLM with 7B Parameters, Initialized from Qwen2-VL) | VQA on EgoTaskQA (FT): 77.5% Exact Match. | Large-scale video–text pre-training with temporal alignment; objectives: retrieval, QA, captioning; detailed grounding via object–actor–scene; susceptible to hallucination and subtle visual ambiguities.
A fundamental distinction exists between video-only and multimodal models in action recognition. While both can be evaluated on the same benchmarks, multimodal models are typically assessed in zero-shot settings. For example, on Kinetics-400, VideoMAE [117] achieves 87.4% Top-1 accuracy when fine-tuned, whereas the Perception Encoder (PE) reaches approximately 76.9% Top-1 accuracy in zero-shot evaluation. This underscores that direct accuracy comparisons with fully fine-tuned video-only models are not strictly valid; multimodal pre-training emphasizes generalization to unseen categories rather than peak performance on a single task. Consequently, unimodal–multimodal comparisons are meaningful only for action classification with identical labels and metrics; for tasks such as language-driven retrieval, video captioning, video question answering, or instructional step localization, unimodal baselines are inapplicable.
This distinction persists under fine-tuning. In video-only models, adaptation to a new dataset primarily addresses domain shift; pre-trained representations (e.g., VideoMAE) are fine-tuned on smaller datasets such as UCF101 [118] to match the target distribution while preserving the same task. In multimodal models, fine-tuning additionally entails a shift in learning objectives. For instance, UniVL transitions from reconstructing raw text during pre-training to generating human-annotated captions on YouCook2, while mPLUG-2 specializes its modular components for task-specific video understanding objectives. More broadly, multimodal fine-tuning goes beyond domain adaptation by progressively aligning semantically distinct modalities, constituting a deeper transformation of the model’s representational space.
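The zero-shot evaluation protocol discussed above can be made concrete with a small sketch: class names are turned into prompts, embedded by a text tower, and compared with the video embedding by cosine similarity. The encoder below is a toy stand-in, not a real CLIP text tower:

```python
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def zero_shot_classify(video_emb, class_names, text_encoder):
    """CLIP-style zero-shot action classification: embed each class name
    with a prompt template and pick the class whose text embedding is
    closest (cosine similarity) to the video embedding. No fine-tuning."""
    prompts = [f"a video of a person {c}" for c in class_names]
    text_embs = l2norm(np.stack([text_encoder(p) for p in prompts]))
    sims = l2norm(video_emb) @ text_embs.T
    return class_names[int(np.argmax(sims))], sims

# Toy stand-in encoder (hypothetical): a real system would use a
# pre-trained text tower; here we hash prompts to fixed random vectors.
def toy_encoder(text, d=32):
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=d)

classes = ["running", "cooking", "playing guitar"]
video = toy_encoder("a video of a person cooking")  # pretend the visual tower agrees
label, _ = zero_shot_classify(video, classes, toy_encoder)
assert label == "cooking"   # identical embedding → maximal cosine similarity
```

Fine-tuning a video-only model instead replaces this open-vocabulary matching with a fixed classification head, which is why the two accuracy figures are not directly comparable.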

7. VLMs in Violence Detection

In this section, we review recent advancements in video anomaly detection (VAD) and video violence detection (VVD), emphasizing methods that harness the multimodal reasoning and contextual understanding capabilities of large language models (LLMs) and vision–language models (VLMs). These approaches are categorized by their contributions to training efficiency, structural design, and cross-modal reasoning.
First, we note that specific benchmarks have been developed. In Table 4, a brief description of the violence/anomaly detection benchmarks is presented. Examples of images from those benchmarks can be seen in Figure 6.
A major innovation trend involves adapting frozen VLMs for VAD without costly parameter tuning, relying instead on prompt engineering or learned guiding questions [119,120,121]. ASK-HINT [120] refines this paradigm by introducing fine-grained, action-centric prompts that align linguistic reasoning with discriminative visual cues. Through Semantic Compression, it synthesizes a compact set of representative questions that enhance interpretability and reduce hallucination effects. VERA (Verbalized Learning) [121] further extends training-free adaptation by transforming the reasoning process into a learnable dialogue between two VLMs, where optimized guiding questions are iteratively refined. During inference, these questions guide segment-level scoring and are refined via temporal and scene context fusion, allowing explainable and fine-grained anomaly detection.
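The guiding-question scoring loop used by such training-free methods can be sketched in a few lines (a simplified abstraction: real systems obtain the yes/no answers from a frozen VLM and iteratively refine the questions themselves):

```python
import numpy as np

def anomaly_scores(answers, window=3):
    """Guiding-question scoring in the spirit of training-free VAD:
    for each video segment, a frozen VLM answers K yes/no guiding
    questions (here a 0/1 matrix); the raw score is the fraction of
    'yes' answers, then smoothed over a temporal window to fuse
    neighboring scene context."""
    raw = answers.mean(axis=1)                      # (T,) per-segment score
    kernel = np.ones(window) / window
    return np.convolve(raw, kernel, mode="same")    # temporal smoothing

# 6 segments x 4 guiding questions (1 = VLM answered "yes, suspicious")
ans = np.array([[0, 0, 0, 0],
                [0, 0, 0, 0],
                [1, 1, 0, 1],
                [1, 1, 1, 1],
                [1, 0, 1, 0],
                [0, 0, 0, 0]])
scores = anomaly_scores(ans)
assert scores.argmax() == 3          # peak at the fully anomalous segment
```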
Other works emphasize domain adaptation, bias reduction, and efficient explanation generation. Holmes-VAD [122] introduces the VAD-Instruct50k benchmark—an instruction-tuning dataset with semi-automated annotations—to train multimodal LLMs for explainable VAD. A Temporal Sampler accelerates inference from 32.82 s to 4.24 s per video, while Projector + LoRA fine-tuning ensures lightweight adaptability. VIVID [123] addresses interpretability and fairness in VVD by leveraging an external knowledge base of 48 violence definitions (from sources such as Cambridge and Oxford) and employing multimodal retrieval and optical flow-based key frame selection. It also reorders visual queries to mitigate the “lost-in-the-middle” effect common in long visual sequences. PiercingEye [124] advances geometric representation learning through Dual-Space Learning, combining Euclidean and Hyperbolic GCNs to model spatial and hierarchical event contexts. Its Ambiguous Event Text Generation module produces pseudo-supervisory textual descriptions of uncertain scenes, while Hyperbolic Vision–Language Guided Loss prioritizes challenging negatives using text similarity-based weighting.
Further contributions focus on multimodal integration and generalization. ViCap-AD [125] introduces a weakly supervised framework that fuses video captions generated by CLIP4Clip with BERT-based textual features under a Multiple-Instance Learning scheme. The final anomaly score combines Classification Score (CS) and Multiple-Instance Learning Ranking Loss for improved stability. The Multimodal Dual-Stream VAD [17] model fuses video, audio, and text modalities via Abnormal-Aware Context Prompts. These encode prior knowledge and anomalous cues into textual representations, enabling fine-grained discrimination, while a coarse stream employs contrastive optimization for global audio–visual alignment. The language-based VAD (LAVAD) [119] framework introduces a training-free approach to video anomaly detection by combining frozen LLMs and VLMs. It first generates frame-level captions using an off-the-shelf model, then refines them through Image–Text Caption Cleaning to reduce noise. Finally, an LLM-based Anomaly Scoring module aggregates the cleaned captions into temporal summaries and prompts the LLM to produce anomaly scores, enabling effective detection without model fine-tuning.
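The MIL ranking objective used by weakly supervised detectors such as ViCap-AD can be written compactly (a standard hinge formulation on bag maxima; simplified, without the smoothness and sparsity regularizers real systems often add):

```python
import numpy as np

def mil_ranking_loss(scores_anom, scores_norm, margin=1.0):
    """Multiple-Instance Learning ranking loss for weakly supervised VAD:
    with only video-level labels, the top-scoring segment of an anomalous
    video should outrank the top-scoring segment of a normal video by a
    margin (hinge loss on the bag maxima)."""
    return max(0.0, margin - scores_anom.max() + scores_norm.max())

anom = np.array([0.1, 0.2, 0.9, 0.3])   # segment scores, anomalous video
norm = np.array([0.2, 0.1, 0.15, 0.1])  # segment scores, normal video
loss = mil_ranking_loss(anom, norm)
assert abs(loss - 0.3) < 1e-9            # 1 - 0.9 + 0.2
```

Because only the bag maxima enter the loss, frame-level annotations are never needed, which is exactly what makes the setting weakly supervised.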
Table 4. Summary of common datasets used for video anomaly and violence detection.
Dataset | Clips | Anomaly/Violence Types | Source | Key Characteristics
UCF-Crime [126] | 1900 | 13 anomalies (abuse, fighting, shooting, etc.) | Surveillance (indoor/outdoor) | Long, untrimmed videos; video-level training, frame-level testing.
XD-Violence [127] | 4754 | 6 classes (abuse, accidents, riots, etc.) | Movies, YouTube, CCTV | Large-scale; includes audio; weakly labeled untrimmed videos.
RWF-2000 [128] | 1600 train, 400 test | Violent vs. non-violent | YouTube surveillance | Uniform 5 s clips at 30 fps.
Movies [129] | 200 | Fights vs. non-fight scenes | Action films, sports | Diverse scenes; manually labeled.
Surv. Fight [130] | 300 | Fights (kicks, fists, wrestling) vs. non-fight | YouTube (cafes, streets, buses) | Short clips (2 s) from real-world surveillance.
Hockey [129] | 1000 | Fights vs. non-fights | NHL games | Dynamic setting; violent and normal actions look similar.

Results Comments

Table 5 shows that the integration of textual information, whether through detailed captions, formalized rules, or targeted prompts, consistently demonstrates a strong positive impact on the performance and interpretability of video anomaly detection (VAD) models, often surpassing visual-only baselines or abstract language models.
ASK-HINT [120] leverages fine-grained, action-centric prompts to better align linguistic reasoning with visual cues, achieving state-of-the-art AUC results through structured prompting. Holmes-VAD [122] employs instruction tuning with paired visual–textual explanations, markedly improving interpretability and human-aligned reasoning metrics. VERA [121] improves reasoning by refining prompts into detailed, anomaly-focused questions, enabling fine-grained frame-level scoring with temporal and contextual awareness.
The comparison in Table 6 highlights the strong performance achieved by a diverse set of methods. The Holmes-VAD [122] model, which is instruction-tuned on a purpose-built dataset for explainable VAD, achieves state-of-the-art results on both benchmarks. In the case of multimodal VAD [17], the use of Abnormal-Aware Context Prompts embeds semantic and anomaly-related knowledge into the text, guiding detection with meaningful context. This multimodal synergy enhances sensitivity to subtle and complex anomalies, achieving superior fine-grained accuracy. At the same time, the strong performance of diverse methods such as the dual-space geometric PiercingEye [124] and the verbalized-learning VERA [121] indicates that, while large-scale instruction tuning is currently SOTA, specialized architectural innovations that require less data and computation offer highly competitive and potentially more practical alternatives.
Table 5. Impact of captioning and textual guidance on VLM/LLM-based VAD performance. Incorporating textual information—via detailed captions, formalized rules, or targeted prompts—consistently enhances both the performance and interpretability of video anomaly detection models, often outperforming visual-only approaches or standalone language models. AUC = Area under Curve; AP = Average Precision; JA = Judgement Accuracy; CP = Content Perception; AE = Anomaly Explanatory.
Method | Datasets | Metrics (Values) | Effect of Captioning/Text Guidance
Holmes-VAD [122] | UCF-Crime, XD-Violence | AUC: 89.51% (UCF); AP: 90.67% (XD); JA: 86.0%; CP: 61.2%; AE: 51.9% | Instruction tuning (VAD-Instruct50k) greatly boosted interpretability: JA (86.0 vs. 65.1), CP (61.2 vs. 11.6), AE (51.9 vs. 15.9).
ASK-HINT [120] | UCF-Crime, XD-Violence | AUC (UCF): 89.83%; AUC (XD): 90.31% | Fine-grained prompting (6 prompts) outperformed the full-prompt baseline (67.17%), improving AUC by 22.6%.
VERA [121] | UCF-Crime, XD-Violence | AUC (UCF): 86.55%; AUC (XD): 88.26% | Guiding question prompts raised AUC from 78.81% to 86.55%, highlighting textual reasoning benefits.
VIVID [123] | Movies, Surv. Fight, RWF-2000, Hockey, XD-Violence | Acc: Movies 0.985; Surv. Fight 0.736; RWF-2000 0.797; Hockey 0.954; XD 0.826 | Violence definition database reduced bias and improved accuracy (up to +0.32).
ViCap-AD [125] | UCF-Crime, XD-Violence | AUC (UCF): 87.20%; AUC (XD): 85.02% | CLIP4Clip captions + Multiple-Instance Learning losses increased AUC by 1.72% (87.20 vs. 85.48) on UCF-Crime.
PiercingEye [124] | XD-Violence, UCF-Crime | AP: 88.82% (XD); AUC: 86.64% (UCF) | Hyperbolic Vision–Language Guided Loss with Ambiguous Event Text Generation raised AP by 1.21% (87.61 → 88.82).
Multimodal VAD [17] | UCF-Crime, XD-Violence | AUC: 87.96% (UCF); AP: 86.32% (XD) | Adding text in the V+A+T setup improved AP by at least 4% over video-only models on XD-Violence.
LAVAD [119] | UCF-Crime, XD-Violence | AUC (UCF): 80.28%; AUC (XD): 85.36% | Removing LLM-based anomaly scoring dropped AUC by 7.58% (80.28 → 72.70) on UCF-Crime.
Table 6. Performance of advanced VAD methods on key datasets. AUC = Area under Curve; AP = Average Precision.
Method | Primary Approach | UCF-Crime (AUC %) | XD-Violence (AP %)
LAVAD [119] | Training-Free (VLM Captions + LLM) | 80.28 | 62.01
VERA [121] | Verbalized Learning (Frozen VLM) | 86.55 | 70.54
PiercingEye [124] | Dual-Space Geometric Learning | 86.64 | 88.82
Multimodal-AD [17] | Dual-Stream Network (Coarse: V+A, Fine: T) | 87.96 | 86.32
Holmes-VAD [122] | Instruction-Tuned (Explainable VAD) | 89.51 | 90.67

8. VLMs in Contextual Emotions Recognition

Understanding human states such as emotions, intentions, and actions is a central challenge in computer vision. These states are critical not only for applications in human–computer interaction, assistive technologies, and surveillance, but also for advancing fundamental research on how machines perceive and reason about people. Emotions play a vital role in communication and social dynamics, and recognizing them from visual data has therefore become an active research field [131,132,133].
Traditional approaches in visual recognition have primarily focused on localized cues, most often relying on cropped facial images or isolated body poses, a face-centered perspective that has only been reinforced by the rise of large models.
While effective in constrained settings, methods focused on a single cue (such as those using only facial expression) often fall short of capturing how humans naturally perceive others: by integrating multiple signals, including body posture, gestures, social interactions, and background context. Context-aware emotion recognition (CAER) [134] and datasets such as EMOTIC have highlighted that environmental and scene information strongly influences the interpretation of human states. For example, the same facial expression may be perceived differently when situated in a playground versus a conflict zone. This contextualization is therefore essential for achieving reliable recognition.
To visually illustrate various problems in emotion inference from visual data, in Figure 7, we present images from popular databases. The images are clustered in rows according to theme. Details on the datasets may be found in Table 7 and Table 8.
Compared to classical solutions trained within a single database, newer approaches have benefited from the development of large models. Recent advances in VLMs offer a powerful framework for addressing challenges in this domain as well, challenges that mostly stem from data domain limitations. VLMs such as CLIP [9], BLIP [50], and LLaVA [46] learn joint representations of images and text from massive collections of paired data. By aligning vision and language in a shared space, they enable models to capture not only local objects or body features, but also broader semantic cues present in the surrounding environment. Solutions for contextual emotion recognition that benefited from VLM captioning may be found in Table 9.
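The caption-then-classify pattern that several of these solutions follow can be sketched end to end (all functions are hypothetical stand-ins; in practice the captioner is a VLM such as LLaVA and the classifier an LLM):

```python
def two_stage_emotion(image, vlm_caption, llm_classify):
    """Two-stage contextual pipeline (in the spirit of caption-then-classify
    approaches): a VLM first verbalizes the person and the surrounding
    context, then a language model maps the caption to an emotion label."""
    caption = vlm_caption(image)
    return llm_classify(caption), caption

# Toy stand-ins: a keyword rule plays the role of the LLM classifier.
def toy_caption(image):
    return image["description"]          # pretend the VLM described the scene

KEYWORDS = {"celebrat": "joy", "argu": "anger", "funeral": "sadness"}

def toy_classify(caption):
    for stem, emotion in KEYWORDS.items():
        if stem in caption.lower():
            return emotion
    return "neutral"

img = {"description": "A man smiling while arguing heatedly in a parking lot"}
label, cap = two_stage_emotion(img, toy_caption, toy_classify)
assert label == "anger"   # context (the argument) overrides the smile cue
```

The toy example also shows why context matters: the classifier keys on the argument described in the caption, not on the smile.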
Regarding contextual emotion recognition (Table 9), a point of analysis is the trade-offs between interpretability, performance, and computational cost across different model categories. LLaVA-based approaches [16,145] and Collaborative VLM-LLM systems [146,147] demonstrate strong contextual understanding and reasoning capabilities. By employing two-stage processing, caption generation followed by classification, or iterative refinement between modalities, these models provide interpretable narrative captions and nuanced emotion predictions. However, this depth comes at the cost of increased latency and computational complexity, with the risk of caption generation errors propagating to the final classification. Similarly, large-scale LVLMs [148] leverage in-context learning and Chain-of-Thought prompting to achieve high-quality reasoning with few-shot learning, but they require significant computational resources (e.g., 8B parameters) and depend heavily on high-quality prompt generation.
In contrast, off-the-shelf evaluation approaches [149] offer rapid deployment and strong zero-shot capabilities without fine-tuning, though they often yield lower absolute performance compared to specialized models and raise privacy or cost concerns due to API dependencies. Instruction-tuned models (e.g., [150]) strike a balance by aligning visual features with emotional concepts through diverse training signals, yet their success is contingent on the computationally expensive generation of high-quality instruction datasets. Overall, while VLM integration consistently yields substantial performance gains (e.g., 5–15% F1 improvement) on typical benchmarks, researchers must weigh the benefits of explicit reasoning and interpretability against the higher resource demands of these complex architectures.

VLMs in Face Expression Recognition

Inferring emotion from facial expressions is the most well-established direction. A thorough review may be found in the work of Kalateh et al. [29]. Examples of illustrative images are found in the top row of Figure 7. In this case, the focus is on the portrait, as the background is often only loosely correlated with the subject’s emotion.
A listing of the methods is given in Table 10. In summary, recent advancements in facial expression recognition (FER) have demonstrated a clear shift toward leveraging VLMs and Large Multimodal Models, which have improved contextual understanding, reasoning, and generalization across complex affective scenarios.
The integration of VLMs and LLMs has redefined facial expression recognition by enabling models to move beyond isolated facial features toward context-aware, semantically grounded, and interpretable emotional reasoning. These approaches demonstrate key advantages. Methods like EmoGist [151] and SoVTP [152] demonstrate that emotional interpretation can successfully rely on global scene reasoning rather than facial features alone, emphasizing the role of context in emotion recognition. Techniques such as Exp-CLIP [153] and FEALLM [154] bridge the gap between visual features and language-driven emotional concepts by leveraging semantic alignment and knowledge transfer. ExpLLM [155] and GPT-based micro-expression systems [156] highlight how Chain-of-Thought reasoning enables interpretability and trust in FER systems. By using flexible, descriptive, and high-dimensional representations of emotions, EmoCapCLIP [157] shows the value of using natural language captions to capture the complexity of human affect beyond categorical labels. The concept has been extended beyond classical emotion recognition to depression detection [158], with contextualization and fine-grained emotional annotation being applied to depression video data.
In summary, for FER solutions (Table 10), the landscape is dominated by approaches that leverage pre-trained vision–language knowledge. CLIP-based temporal models [159] and fine-grained captioning models [160] excel at capturing dynamic facial movements and subtle expression changes, significantly outperforming baselines on video benchmarks. However, temporal modeling adds computational complexity, and generic CLIP features may sometimes miss micro-movements. Multi-prompt learning approaches [15] and task-aware alignment models [153] offer efficient alternatives by learning rich emotion representations or aligning features with emotion-specific semantic spaces (often using action units) without explicit caption generation, achieving very strong results on static datasets like RAF-DB.
Hybrid expert systems [161] and rule-enhanced LLM systems [156] push performance further by specializing in different emotion aspects or amplifying weak signals (like micro-expressions) through expert rules. While these systems achieve strong accuracy on specific benchmarks, they introduce significant complexity into the model architecture and hyperparameter tuning. Example-based cluster analysis [151] offers a computationally efficient inference alternative by pre-generating context descriptions, though it risks missing rare expressions that have not been seen by the VLM. Ultimately, the choice of model involves balancing the need for temporal generalization and interpretability (favored by captioning and AU-based models) against model complexity and computational constraints.
Table 8. Multimodal emotion databases.
Dataset | Size | Labels | Observations
MELD [162] | 13,000 utterances across 1433 dialogues from 1000+ speakers | 7 categorical emotions + sentiment polarity (pos.–neut.–neg.) | Multimodal dataset (audio, visual, text) derived from the TV series Friends, designed for multi-party conversational emotion recognition at the utterance level.
VCR [163] | 110,000 movie scenes forming ∼290,000 question–answer pairs | 4 multiple-choice answers + 4 rationales per question | Multimodal (image + text) dataset targeting visual common-sense reasoning, requiring understanding of context, intentions, and affect.
MER2023 [164] | ∼3373 labeled samples and >73,000 unlabeled samples for semi-supervised learning | 6 discrete emotions (neutral, anger, happiness, sadness, worry, surprise) | Multimodal dataset (video, audio, text) designed for robust emotion recognition under noisy and semi-supervised conditions. Includes multiple tracks: MER-MULTI, MER-NOISE, MER-SEMI.
MAFW [165] | 10,045 video–audio clips with textual affective captions | 11 basic and 32 compound emotions | Multimodal dataset combining video, audio, and text for emotion recognition in complex natural contexts. Supports hierarchical affective labeling.
Table 9. Solutions for contextual emotion recognition that benefited from VLM captioning. In certain cases, the results are the consequence of specific conditions which are detailed in the cited references. mAP = Mean Average Precision; Acc = accuracy; Rec = recall; Prec = precision.
Solution | VLM | Benchmark and Performance | Improvement
Xenos et al. (2024) [16] | LLaVA (two-stage; caption + classify) | mAP_EMOTIC = 38.52%; mAP_BoLD = 26.66%; Acc_CAER-S = 93.08% | Fuses visual features and the generated text via a Transformer architecture (vision encoder → learnable queries → Q-Former cross + self attention, then classifier)
Etesam et al. (2024) [145] | LLaVA, GPT-4V | Prec_EMOTIC = 54.27%; Rec_EMOTIC = 78.40%; F1_EMOTIC = 36.83% | Introduces narrative captioning (NarraCap): generates descriptive captions (who, action, social/physical signals, environment), then feeds them to an LLM for emotion inference
Lei et al. (2024) [148] | VILA-8B (LVLM few-shot + CoT) | Prec_EMOTIC = 57.8%; Rec_EMOTIC = 46.9%; F1_EMOTIC = 29.8%; Prec_HECO = 48.3%; Rec_HECO = 33.6%; F1_HECO = 35.8% | Uses in-context learning (ICL) and Chain-of-Thought (CoT) prompts generated by GPT-4V for contextual reasoning
Bhattacharyya & Wang (2025) [149] | GPT-4o, CLIP variants | F1_Emotion6 = 63.5%; F1_Abstract = 27.0%; F1_ArtPhoto = 45.0%; F1_Hard = 44.9% | Evaluates off-the-shelf VLMs using reasoning-based prompts for improved emotion prediction accuracy
Xie et al. (2024) [150] | EmoVIT (InstructBLIP + visual instruction tuning) | Acc_Emotion6 = 57.6%; Acc_Abstract = 32.3%; Acc_ArtPhoto = 44.9%; Acc_WebEmo = 21.1% | Uses GPT-4-generated visual instruction data including captions, categorical, and reasoning forms to fine-tune emotion understanding via instruction following
Zhou et al. (2024) [146] | ViCor (LLM + VLM collaboration) | Acc_VCR = 59.8%; Acc_A-OKVQA = 70.9% | Bridges vision understanding and common-sense reasoning: captions and visual clues are iteratively exchanged between the VLM and LLM to refine emotion and common-sense inference
Lian et al. (2024) [147] | AffectGPT (Multimodal LLM + MER-Caption dataset) | Acc_EMER-Fine = 64.56% | Introduces descriptive emotion captions (MER-Caption), enabling free-form emotion reasoning across modalities (vision, audio, text)
Table 10. Face expression recognition solutions that benefited from VLM captioning. In certain cases, the results are the consequence of specific conditions which are detailed in the cited references. U A R = Unweighted Average Recall; W A R = Weighted Average Recall.
Solution | VLM | Benchmark and Performance | Improvement
Zhao & Patras (2023) [159] | DFER-CLIP (CLIP + LLM text descriptions) | UAR/WAR_DFEW = 59.6/71.3%; UAR/WAR_FERV39k = 41.3/51.6%; UAR/WAR_MAFW = 38.9/52.6% | Uses ChatGPT-generated textual behavior descriptions instead of class labels → improved temporal FER generalization
Foteinopoulou & Patras (2024) [166] | EmoCLIP (CLIP + temporal transformer) | UAR/WAR_DFEW = 58.1/62.1%; UAR/WAR_FERV39k = 31.3/43.5%; UAR/WAR_MAFW = 34.3/44.2%; UAR/WAR_AFEW = 44.3/46.2% | Uses sample-level captions of expressions and LLM-generated class descriptions for zero-shot video FER
Li et al. (2024) [15] | CLIPER (CLIP + METD) | UAR/WAR_FERV39k = 41.2/51.4%; WAR_RAF-DB = 91.8% | Learns multiple text prompts per emotion for richer semantics; better alignment without explicit captions
Chen et al. (2024) [160] | FineCLIPER (CLIP + AdaptERs + Video-LLaVA) | UAR/WAR_DFEW = 66.0/76.2%; UAR/WAR_FERV39k = 45.2/54.0%; UAR/WAR_MAFW = 45.0/56.9% | Fine-grained dynamic captions from Video-LLaVA improve temporal FER; surpasses CLIP baseline across datasets
Huang et al. (2025) [161] | Emotion-Qwen (Hybrid Expert Mixture) | UAR/WAR_DFEW = 77.1/62.2% | Integrates the VER dataset for instruction fine-tuning with contextual and causal captions to enhance emotion reasoning
Seoh et al. [151] | Qwen2.5-VL; Aya Vision; InternVL2.5 8B-MPO | Prec_FI = 52.9; Rec_FI = 52.2; F1_FI = 48.5 | EmoGist analyzes clusters of example images to pre-generate multiple context-specific descriptions via the VLM for emotion labels
Zhao et al. [153] | CLIP-L | UAR/WAR_RAF-DB = 58.7/65.4; UAR/WAR_AffectNet = 38.4/38.4; UAR/WAR_DFEW = 40.2/47.1; UAR/WAR_AFEW = 38.7/40.4 | Exp-CLIP introduces a simple, learnable projection head that aligns the general vision–language feature space of a pre-trained model with a task-aware semantic space derived from an LLM encoder
Lan et al. [155] | ViT-L/14 (DINOv2; LLaVA 1.5) | Acc_AffectNet = 62.9; Acc_RAF-DB = 91.0; Acc_ExpW = 68.1 | ExpLLM leverages an AU model and GPT-4o to construct instruction–description data pairs that articulate the relationship between facial movements and emotion
Hu et al. [154] | LLaVA-LCS-558K (with LoRA) | Acc_AffectNet = 42.9; Acc_RAF-DB = 69.9; Acc_ExpW = 68.1 | FEALLM aligns descriptions of facial expressions (FEs) and action units (AUs), along with causal reasoning instructions explaining how AUs lead to FEs
Sălăgean et al. [156] | GPT-3.5 | Prec_CASME-II = 96.0; Rec_CASME-II = 93.0; F1_CASME-II = 94.0 | Integrates expert-defined rules and mechanisms for amplifying weak signals with the reasoning capabilities of a large language model

9. Discussion

9.1. Action Recognition

In human action recognition, leveraging captions translates into substantial performance gains; models can capture both high-level action semantics and fine-grained, frame-level nuances such as object interactions, hand movements, and subtle scene changes. Empirical results on benchmarks including MSR-VTT, YouCook2, Ego4D, and EPIC-KITCHENS demonstrate that VLMs using caption supervision consistently outperform unimodal approaches, particularly in zero-shot and few-shot scenarios where conventional video-only encoders struggle. These improvements are far from incremental—they significantly enhance both retrieval and generative capabilities, firmly establishing captions as a critical modality for advancing state-of-the-art HAR.

9.2. Violence Detection

Captioning and textual guidance enhance video anomaly and violence detection by providing a richer semantic context that complements visual information. By converting video content into structured textual representations, these approaches improve model interpretability, support fine-grained reasoning, and help capture subtle or complex anomalous behaviors. Overall, integrating textual cues consistently boosts detection performance and human-aligned understanding compared to relying on visual information alone. Despite these advances, current methods face key limitations: they can inherit biases from pre-trained LLMs/VLMs, rely on large datasets and heavy computation, and may propagate captioning errors or hallucinations. Many also struggle with long-term temporal dependencies and subtle behaviors, leading to missed or false detections, underscoring the need for more robust, efficient, and bias-aware textual–visual integration.

9.3. Emotion Recognition

Captioning and textual guidance enhance emotion recognition from both body language and facial expressions by adding semantic context that complements visual cues. Converting images into structured text improves efficiency and captures subtle behaviors, leading to more accurate emotion descriptions. Integrating textual cues consistently outperforms visual-only approaches in detection and human-aligned interpretation. However, current methods remain limited, often over-relying on assumed contextual correlations [167] and inheriting biases from the training data.
Beyond dataset bias and hallucination, a key limitation arises when the visual cues themselves are unreliable: forged or manipulated images can mislead VLMs into generating confident but incorrect emotional attributions, while mental disorders (e.g., schizophrenia, autism spectrum conditions) may decouple facial or contextual signals from internal affect, creating systematic mismatches between expression and emotion. Context itself can also be deceptive; a smile in a funeral setting, for instance, may signal grief coping rather than joy. These challenges highlight the fragility of context-based inference.
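The descriptor-based scoring that several of these methods share can be illustrated with a small sketch: each emotion is described by several textual prompts rather than a single class name, and the final score averages the similarities between the face embedding and that emotion's prompt embeddings. The embeddings below are random stand-ins for real encoder outputs, and the descriptor sets are illustrative assumptions.

```python
import numpy as np

def normalize(x):
    """L2-normalize along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
face_emb = normalize(rng.normal(size=(256,)))   # stand-in face embedding

descriptor_sets = {                             # multiple descriptors per emotion
    "joy":   ["a smiling face", "a person laughing with friends"],
    "grief": ["a tearful face", "a person mourning quietly"],
}
# One embedding per descriptor (hypothetical text-encoder outputs).
descriptor_embs = {
    emo: normalize(rng.normal(size=(len(descs), 256)))
    for emo, descs in descriptor_sets.items()
}

def ensemble_scores(face_emb, descriptor_embs):
    """Average cosine similarity over each emotion's descriptor set."""
    return {emo: float((embs @ face_emb).mean())
            for emo, embs in descriptor_embs.items()}

scores = ensemble_scores(face_emb, descriptor_embs)
predicted = max(scores, key=scores.get)
```

Averaging over several descriptors makes the prediction less sensitive to any single phrasing, though it does nothing by itself against the deceptive-context failure mode discussed above.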

9.4. Issues and Mitigation

However, despite their promise, VLMs are not without limitations. Their reliance on web-scale data makes them prone to biases, and their generative capacities often lead to hallucinations or over-interpretations of visual signals [43,44]. In human state recognition, such errors may translate into misleading or biased predictions with serious downstream implications. Moreover, the exploration of large VLMs in the domain of context-aware recognition remains relatively under-developed, leaving open questions about optimal model choice, fine-tuning strategies, and dataset requirements.
Alternative mitigation solutions can be found in the VLM itself: under specific prompts, VLMs are able to adjust their own biased predictions [167]. In other situations, specially designed solutions are needed, and such alternatives do exist.
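A two-pass prompting scheme of this kind can be sketched as follows. This is a hedged illustration of the self-correction idea, not the protocol of [167]: the prompt wording is invented for the example, and `fake_vlm` is a stub standing in for a real VLM API call.

```python
def build_prompts(task_label: str):
    """First-pass prediction prompt and second-pass critique prompt
    (hypothetical wording for illustration)."""
    first = (f"Look at the image and identify the person's {task_label}. "
             "Answer with a single label.")
    critique = (f"Reconsider your previous answer about the person's {task_label}. "
                "Ignore cues that come only from the background or from "
                "stereotypical associations; rely on the person's own "
                "appearance and behavior. Revise the label if needed.")
    return first, critique

def self_correct(query_vlm, image, task_label="emotional state"):
    """Query once, then ask the model to re-examine its own answer."""
    first, critique = build_prompts(task_label)
    initial = query_vlm(image, first)
    revised = query_vlm(image, critique + f" Previous answer: {initial}.")
    return initial, revised

# Toy stand-in for a VLM: returns a canned answer per prompt type so the
# control flow can be exercised without any model.
def fake_vlm(image, prompt):
    return "sad" if "Reconsider" in prompt else "happy"

initial, revised = self_correct(fake_vlm, image=None)
```

The value of the second pass depends entirely on the model's capacity to recognize its own spurious cues, which is why specially designed solutions remain necessary in harder cases.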
To address such issues, several recent works have explored temporal and reasoning augmentations. Temporal modeling approaches add sequence encoders (e.g., Transformer layers on top of CLIP [131]) or specialized adapters to capture the dynamics of expressions across video frames. Reasoning strategies such as Chain-of-Thought prompting [148,168] guide VLMs to articulate intermediate steps, improving interpretability and robustness. Others propose fine-grained textual descriptions (e.g., multiple expression descriptors [15], LLM-generated prompts [159]) or hybrid multimodal fusion (e.g., combining landmarks, segmentation, and language adapters [160], or hybrid MoE compressors [161]). These methods expand beyond static frame-level prediction to incorporate temporal coherence, common-sense reasoning, and multi-cue integration, which, together, represent promising directions for reliable and fair emotion recognition.
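The temporal-modeling direction above can be reduced to a minimal sketch: per-frame CLIP-style embeddings are pooled with a single learned attention query instead of a full Transformer stack. This is a simplified stand-in under stated assumptions (one attention head, no feed-forward layers); the shapes and random values are illustrative only.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(frame_embs, query):
    """Weight each frame by its similarity to a (learned) query vector and
    return the weighted average: one attention head, no feed-forward."""
    weights = softmax(frame_embs @ query)        # (T,) attention over frames
    return weights @ frame_embs, weights         # (D,) clip embedding, (T,)

rng = np.random.default_rng(1)
frame_embs = rng.normal(size=(16, 512))          # 16 frames, 512-dim features
query = rng.normal(size=(512,)) * 0.05           # small init keeps weights soft

clip_emb, weights = attention_pool(frame_embs, query)
```

Replacing the single query with stacked self-attention layers recovers the Transformer-on-CLIP designs cited above; the pooled clip-level embedding then feeds the downstream emotion or action classifier.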

10. Conclusions

Overall, the rise of vision–language models (VLMs) has had a positive impact in multiple areas related to the recognition and analysis of human states in images and video data. Their zero-shot capabilities and strong pre-training serve as the foundation for significant improvements across various domains. These solutions have become increasingly creative, inspiring alternative approaches such as innovative VLM design choices, integration with complementary branches, and refined pre- and post-processing methods.
However, we argue that solution development is still in its golden age: improvements over non-VLM solutions remain substantial, and the research community remains focused on enhancing key performance metrics. Nevertheless, we anticipate that this rapid progress will eventually plateau, shifting attention toward finer details, specifically mitigating biases and isolating spurious correlations. Ultimately, we hope this work provides key insights into the current state of the art and offers valuable suggestions for future developments in this area, where VLM-based captioning is becoming a central element driving further evolution.

Author Contributions

Conceptualization, C.-B.P., A.R., A.N., L.F. and C.F.; methodology, L.F. and C.F.; validation, A.R., C.F. and L.F.; formal analysis, C.F.; investigation, C.-B.P., A.R. and A.N.; resources, C.-B.P., A.R., A.N., L.F. and C.F.; writing—original draft preparation, C.-B.P., A.R., A.N., L.F. and C.F.; writing—review and editing, C.F. and L.F.; visualization, C.-B.P., A.R., A.N., L.F. and C.F.; supervision, C.F. and L.F. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially supported by the project “Platform for Fusion and Management of Multi-Source Data Collections Exploitable by Artificial Intelligence Models for Risk Estimation and Predictive Analysis—AI4RISK”, project code SOL3/2024 – PN-IV-P6-6.3-SOL-2024-2-0251, funded by the UEFISCDI (Executive Agency for Higher Education, Research, Development and Innovation Funding), Romania.

Data Availability Statement

No new data were created or analyzed in this study.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
CLIP: Contrastive Language–Image Pre-Training
FER: Face Expression Recognition
FS: Few Shot
FT: Fine-Tuned
HAR: Human Action Recognition
LM: Large Models
LLM: Large Language Model
ML: Machine Learning
UAR: Unweighted Average Recall
VAD: Video Anomaly Detection
VVD: Video Violence Detection
VLM: Vision–Language Model
VQA: Visual Question Answering
WAR: Weighted Average Recall
ZS: Zero Shot

References

  1. Pareek, P.; Thakkar, A. A survey on video-based human action recognition: Recent updates, datasets, challenges, and applications. Artif. Intell. Rev. 2021, 54, 2259–2322. [Google Scholar] [CrossRef]
  2. Ahmed, N.; Al Aghbari, Z.; Girija, S. A systematic survey on multimodal emotion recognition using learning algorithms. Intell. Syst. Appl. 2023, 17, 200171. [Google Scholar] [CrossRef]
  3. Liu, D.; Bao, Z.; Mi, J.; Gan, Y.; Ye, M.; Zhang, J. Cross-domain video action recognition via adaptive gradual learning. Neurocomputing 2023, 556, 126622. [Google Scholar] [CrossRef]
  4. Chen, T.; Pu, T.; Wu, H.; Xie, Y.; Liu, L.; Lin, L. Cross-domain facial expression recognition: A unified evaluation benchmark and adversarial graph learning. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 9887–9903. [Google Scholar] [CrossRef] [PubMed]
  5. Xu, H.; Zhi, S.; Sun, S.; Patel, V.; Liu, L. Deep learning for cross-domain few-shot visual recognition: A survey. ACM Comput. Surv. 2025, 57, 1–37. [Google Scholar] [CrossRef]
  6. Gu, J.; Han, Z.; Chen, S.; Beirami, A.; He, B.; Zhang, G.; Liao, R.; Qin, Y.; Tresp, V.; Torr, P. A systematic survey of prompt engineering on vision-language foundation models. arXiv 2023, arXiv:2307.12980. [Google Scholar] [CrossRef]
  7. Sharma, H.; Padha, D. A comprehensive survey on image captioning: From handcrafted to deep learning-based techniques, a taxonomy and open research issues. Artif. Intell. Rev. 2023, 56, 13619–13661. [Google Scholar] [CrossRef]
  8. Lu, J.; Batra, D.; Parikh, D.; Lee, S. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv. Neural Inf. Process. Syst. 2019, 32. [Google Scholar]
  9. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, PmLR, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  10. Zhang, J.; Huang, J.; Jin, S.; Lu, S. Vision-language models for vision tasks: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 5625–5644. [Google Scholar] [CrossRef]
  11. Zhou, X.; Liu, M.; Yurtsever, E.; Zagar, B.L.; Zimmer, W.; Cao, H.; Knoll, A.C. Vision Language Models in Autonomous Driving: A Survey and Outlook. arXiv 2024, arXiv:2310.14414. [Google Scholar] [CrossRef]
  12. Huang, Z.; Yan, H.; Zhan, Q.; Yang, S.; Zhang, M.; Zhang, C.; Lei, Y.; Liu, Z.; Liu, Q.; Wang, Y. A Survey on Remote Sensing Foundation Models: From Vision to Multimodality. arXiv 2025, arXiv:2503.22081. [Google Scholar] [CrossRef]
  13. Han, X.; Chen, S.; Fu, Z.; Feng, Z.; Fan, L.; An, D.; Wang, C.; Guo, L.; Meng, W.; Zhang, X.; et al. Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision. arXiv 2025, arXiv:2504.02477. [Google Scholar] [CrossRef]
  14. Wang, M.; Xing, J.; Mei, J.; Liu, Y.; Jiang, Y. ActionCLIP: Adapting Language-Image Pretrained Models for Video Action Recognition. IEEE Trans. Neural Netw. Learn. Syst. 2025, 36, 625–637. [Google Scholar] [CrossRef] [PubMed]
  15. Li, H.; Niu, H.; Zhu, Z.; Zhao, F. CLIPER: A Unified Vision-Language Framework for In-the-Wild Facial Expression Recognition. In Proceedings of the 2024 IEEE International Conference on Multimedia and Expo (ICME), Niagara Falls, ON, Canada, 15–19 July 2024; pp. 1–6. [Google Scholar]
  16. Xenos, A.; Foteinopoulou, N.M.; Ntinou, I.; Patras, I.; Tzimiropoulos, G. VLLMs Provide Better Context for Emotion Understanding Through Common Sense Reasoning. arXiv 2024, arXiv:2404.07078. [Google Scholar] [CrossRef]
  17. Wang, D.; Wang, Q.; Hu, Q.; Wu, K. Multimodal VAD: Visual Anomaly Detection in Intelligent Monitoring System via Audio-Vision-Language. IEEE Trans. Instrum. Meas. 2025, 74, 4012212. [Google Scholar] [CrossRef]
  18. Gallegos, I.O.; Rossi, R.A.; Barrow, J.; Tanjim, M.M.; Kim, S.; Dernoncourt, F.; Yu, T.; Zhang, R.; Ahmed, N.K. Bias and fairness in large language models: A survey. Comput. Linguist. 2024, 50, 1097–1179. [Google Scholar] [CrossRef]
  19. Zeng, B.; Yin, Y.; Liu, Z. Understanding bias in large-scale visual datasets. Adv. Neural Inf. Process. Syst. 2024, 37, 61839–61871. [Google Scholar]
  20. Hamidieh, K.; Zhang, H.; Gerych, W.; Hartvigsen, T.; Ghassemi, M. Identifying implicit social biases in vision-language models. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, San Jose, CA, USA, 21–23 October 2024; Volume 7, pp. 547–561. [Google Scholar]
  21. Chang, Y.; Wang, X.; Wang, J.; Wu, Y.; Yang, L.; Zhu, K.; Chen, H.; Yi, X.; Wang, C.; Wang, Y.; et al. A survey on evaluation of large language models. ACM Trans. Intell. Syst. Technol. 2024, 15, 1–45. [Google Scholar] [CrossRef]
  22. Pourpanah, F.; Abdar, M.; Luo, Y.; Zhou, X.; Wang, R.; Lim, C.P.; Wang, X.Z.; Wu, Q.J. A review of generalized zero-shot learning methods. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 4051–4070. [Google Scholar] [CrossRef]
  23. Navigli, R.; Conia, S.; Ross, B. Biases in large language models: Origins, inventory, and discussion. ACM J. Data Inf. Qual. 2023, 15, 1–21. [Google Scholar] [CrossRef]
  24. Sun, S.; Liu, L.; Liu, Y.; Liu, Z.; Zhang, S.; Heikkilä, J.; Li, X. Uncovering bias in foundation models: Impact, testing, harm, and mitigation. arXiv 2025, arXiv:2501.10453. [Google Scholar]
  25. Hort, M.; Chen, Z.; Zhang, J.M.; Harman, M.; Sarro, F. Bias mitigation for machine learning classifiers: A comprehensive survey. ACM J. Responsible Comput. 2024, 1, 1–52. [Google Scholar] [CrossRef]
  26. Pagano, T.P.; Loureiro, R.B.; Lisboa, F.V.; Peixoto, R.M.; Guimarães, G.A.; Cruz, G.O.; Araujo, M.M.; Santos, L.L.; Cruz, M.A.; Oliveira, E.L.; et al. Bias and unfairness in machine learning models: A systematic review on datasets, tools, fairness metrics, and identification and mitigation methods. Big Data Cogn. Comput. 2023, 7, 15. [Google Scholar] [CrossRef]
  27. Kong, Y.; Fu, Y. Human action recognition and prediction: A survey. Int. J. Comput. Vis. 2022, 130, 1366–1401. [Google Scholar] [CrossRef]
  28. Mumtaz, N.; Ejaz, N.; Habib, S.; Mohsin, S.M.; Tiwari, P.; Band, S.S.; Kumar, N. An overview of violence detection techniques: Current challenges and future directions. Artif. Intell. Rev. 2023, 56, 4641–4666. [Google Scholar] [CrossRef]
  29. Kalateh, S.; Estrada-Jimenez, L.A.; Nikghadam-Hojjati, S.; Barata, J. A systematic review on multimodal emotion recognition: Building blocks, current state, applications, and challenges. IEEE Access 2024, 12, 103976–104019. [Google Scholar] [CrossRef]
  30. Kopalidis, T.; Solachidis, V.; Vretos, N.; Daras, P. Advances in facial expression recognition: A survey of methods, benchmarks, models, and datasets. Information 2024, 15, 135. [Google Scholar] [CrossRef]
  31. Abbas, R.; Ni, B.; Ma, R.; Li, T.; Lu, Y.; Li, X. Context-based emotion recognition: A survey. Neurocomputing 2025, 618, 129073. [Google Scholar] [CrossRef]
  32. Hu, X.; Fan, Z.; Jiang, L.; Xu, J.; Li, G.; Chen, W.; Zeng, X.; Yang, G.; Zhang, D. TOP-ALCM: A novel video analysis method for violence detection in crowded scenes. Inf. Sci. 2022, 606, 313–327. [Google Scholar] [CrossRef]
  33. Wang, J.; Wang, C.; Guo, L.; Zhao, S.; Wang, D.; Zhang, S.; Zhao, X.; Yu, J.; Wang, Y.; Yang, Y.; et al. MDKAT: Multimodal Decoupling with Knowledge Aggregation and Transfer for Video Emotion Recognition. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 9809–9822. [Google Scholar] [CrossRef]
  34. Zheng, H.; Shen, L.; Tang, A.; Luo, Y.; Hu, H.; Du, B.; Wen, Y.; Tao, D. Learning from models beyond fine-tuning. Nat. Mach. Intell. 2025, 7, 6–17. [Google Scholar] [CrossRef]
  35. Zhou, J.; Chen, Y.; Hong, Z.; Chen, W.; Yu, Y.; Zhang, T.; Wang, H.; Zhang, C.; Zheng, Z. Training and serving system of foundation models: A comprehensive survey. IEEE Open J. Comput. Soc. 2024, 5, 107–119. [Google Scholar] [CrossRef]
  36. Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving language understanding by generative pre-training. arXiv 2018, arXiv:2012.11747. [Google Scholar]
  37. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
  38. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Volume 1 (Long and Short Papers), pp. 4171–4186. [Google Scholar]
  39. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
  40. Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzmán, F.; Grave, E.; Ott, M.; Zettlemoyer, L.; Stoyanov, V. Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; p. 8440. [Google Scholar]
  41. Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 7871–7880. [Google Scholar]
  42. Chen, Y.C.; Li, L.; Yu, L.; El Kholy, A.; Ahmed, F.; Gan, Z.; Cheng, Y.; Liu, J. Uniter: Universal image-text representation learning. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 104–120. [Google Scholar]
  43. Alayrac, J.B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. Flamingo: A visual language model for few-shot learning. In Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; pp. 23716–23736. [Google Scholar]
  44. Li, J.; Li, D.; Savarese, S.; Hoi, S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 19730–19742. [Google Scholar]
  45. Zhai, X.; Wang, X.; Mustafa, B.; Steiner, A.; Keysers, D.; Kolesnikov, A.; Beyer, L. Lit: Zero-shot transfer with locked-image text tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 18123–18133. [Google Scholar]
  46. Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual instruction tuning. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; pp. 34892–34916. [Google Scholar]
  47. Bordes, F.; Pang, R.Y.; Ajay, A.; Li, A.C.; Bardes, A.; Petryk, S.; Mañas, O.; Lin, Z.; Mahmoud, A.; Jayaraman, B.; et al. An introduction to vision-language modeling. arXiv 2024, arXiv:2405.17247. [Google Scholar] [CrossRef]
  48. LeCun, Y.; Chopra, S.; Hadsell, R.; Ranzato, M.; Huang, F. A tutorial on energy-based learning. Predict. Struct. Data 2006, 1. [Google Scholar]
  49. Gutmann, M.; Hyvärinen, A. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, JMLR Workshop and Conference Proceedings, Sardinia, Italy, 13–15 May 2010; pp. 297–304. [Google Scholar]
  50. Li, J.; Li, D.; Xiong, C.; Hoi, S. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the International Conference on Machine Learning, PMLR, Baltimore, MD, USA, 17–23 July 2022; pp. 12888–12900. [Google Scholar]
  51. Vincent, P.; Larochelle, H.; Bengio, Y.; Manzagol, P.A. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, ICML, Helsinki, Finland, 5–9 July 2008; pp. 1096–1103. [Google Scholar]
  52. Pathak, D.; Krahenbuhl, P.; Donahue, J.; Darrell, T.; Efros, A.A. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 2536–2544. [Google Scholar]
  53. Dubois, Y.; Bloem-Reddy, B.; Ullrich, K.; Maddison, C.J. Lossy compression for lossless prediction. Adv. Neural Inf. Process. Syst. 2021, 34, 14014–14028. [Google Scholar]
  54. Shwartz Ziv, R.; LeCun, Y. To compress or not to compress—Self-supervised learning and information theory: A review. Entropy 2024, 26, 252. [Google Scholar] [CrossRef]
  55. Peng, Z.; Wang, W.; Dong, L.; Hao, Y.; Huang, S.; Ma, S.; Ye, Q.; Wei, F. Kosmos-2: Grounding multimodal large language models to the world. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
  56. Gadre, S.Y.; Ilharco, G.; Fang, A.; Hayase, J.; Smyrnis, G.; Nguyen, T.; Marten, R.; Wortsman, M.; Ghosh, D.; Zhang, J.; et al. Datacomp: In search of the next generation of multimodal datasets. Adv. Neural Inf. Process. Syst. 2023, 36, 27092–27112. [Google Scholar]
  57. Henighan, T.; Kaplan, J.; Katz, M.; Chen, M.; Hesse, C.; Jackson, J.; Jun, H.; Brown, T.B.; Dhariwal, P.; Gray, S.; et al. Scaling laws for autoregressive generative modeling. arXiv 2020, arXiv:2010.14701. [Google Scholar] [CrossRef]
  58. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. Lora: Low-rank adaptation of large language models. ICLR 2022, 1, 3. [Google Scholar]
  59. Zhou, K.; Yang, J.; Loy, C.C.; Liu, Z. Learning to prompt for vision-language models. Int. J. Comput. Vis. 2022, 130, 2337–2348. [Google Scholar] [CrossRef]
  60. Thrush, T.; Jiang, R.; Bartolo, M.; Singh, A.; Williams, A.; Kiela, D.; Ross, C. Winoground: Probing vision and language models for visio-linguistic compositionality. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 5238–5248. [Google Scholar]
  61. Hellström, T.; Dignum, V.; Bensch, S. Bias in machine learning-what is it good for? In Proceedings of the International Workshop on New Foundations for Human-Centered AI (NeHuAI) Co-Located with 24th European Conference on Artificial Intelligence (ECAI 2020), Santiago de Compostela, Spain, 29 August–8 September 2020; pp. 3–10. [Google Scholar]
  62. Torralba, A.; Efros, A.A. Unbiased look at dataset bias. In Proceedings of the CVPR 2011, Colorado Springs, CO, USA, 21–23 June 2011; pp. 1521–1528. [Google Scholar]
  63. Liu, Z.; He, K. A Decade’s Battle on Dataset Bias: Are We There Yet? In Proceedings of the Thirteenth International Conference on Learning Representations, Singapore, 24–28 April 2025. [Google Scholar]
  64. You, Z.; Zhang, X.; Guo, H.; Wang, J.; Li, C. Are Images Indistinguishable to Humans Also Indistinguishable to Classifiers? In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 28790–28800. [Google Scholar]
  65. Mitchell, T.M. The Need for Biases in Learning Generalizations; Technical Report No. CBM-TR-117; Department of Computer Science, Rutgers University: New Brunswick, NJ, USA, 1980. [Google Scholar]
  66. Zhao, J.; Wang, T.; Yatskar, M.; Ordonez, V.; Chang, K.W. Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, 9–11 September 2017; pp. 2979–2989. [Google Scholar]
  67. Molahasani, M.; Motamedi, A.; Greenspan, M.; Kim, I.M.; Etemad, A. Prism: Reducing spurious implicit biases in vision-language models with llm-guided embedding projection. arXiv 2025, arXiv:2507.08979. [Google Scholar]
  68. Zafar, M.B.; Valera, I.; Gomez Rodriguez, M.; Gummadi, K.P. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In Proceedings of the 26th International Conference on World Wide Web, Perth, Australia, 3–7 April 2017; pp. 1171–1180. [Google Scholar]
  69. Kleinberg, J.; Mullainathan, S.; Raghavan, M. Inherent Trade-Offs in the Fair Determination of Risk Scores. In Proceedings of the 8th Innovations in Theoretical Computer Science Conference (ITCS 2017), Schloss Dagstuhl–Leibniz-Zentrum für Informatik, Berkeley, CA, USA, 9–11 January 2017. [Google Scholar]
  70. Zhao, H.; Gordon, G.J. Inherent tradeoffs in learning fair representations. J. Mach. Learn. Res. 2022, 23, 1–26. [Google Scholar]
  71. Hardt, M.; Price, E.; Srebro, N. Equality of opportunity in supervised learning. Adv. Neural Inf. Process. Syst. 2016, 29. [Google Scholar]
  72. Woodworth, B.; Gunasekar, S.; Ohannessian, M.I.; Srebro, N. Learning non-discriminatory predictors. In Proceedings of the Conference on Learning Theory, PMLR, Amsterdam, The Netherlands, 7–10 July 2017; pp. 1920–1953. [Google Scholar]
  73. Lee, H.; Chen, S. Systematic bias of machine learning regression models and correction. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 4974–4983. [Google Scholar] [CrossRef] [PubMed]
  74. Wang, A.; Liu, A.; Zhang, R.; Kleiman, A.; Kim, L.; Zhao, D.; Shirai, I.; Narayanan, A.; Russakovsky, O. Revise: A tool for measuring and mitigating bias in visual datasets. Int. J. Comput. Vis. 2022, 130, 1790–1810. [Google Scholar] [CrossRef]
  75. Wang, S.; Cao, X.; Zhang, J.; Yuan, Z.; Shan, S.; Chen, X.; Gao, W. Vlbiasbench: A comprehensive benchmark for evaluating bias in large vision-language model. arXiv 2024, arXiv:2406.14194. [Google Scholar]
  76. Tian, H.; Liu, B.; Zhu, T.; Zhou, W.; Yu, P.S. Multifair: Model fairness with multiple sensitive attributes. IEEE Trans. Neural Netw. Learn. Syst. 2024, 36, 5654–5667. [Google Scholar] [CrossRef]
  77. Zhu, H.; Liang, S.; Wang, W.; Li, B.; Yuan, T.; Li, F.; Wang, S.; Zhang, Z. Revisiting Data Auditing in Large Vision-Language Models. arXiv 2025, arXiv:2504.18349. [Google Scholar] [CrossRef]
  78. Le, H.; Chung, N.; Kieu, T.; Nguyen, A.; Le, N. BiMa: Towards Biases Mitigation for Text-Video Retrieval via Scene Element Guidance. arXiv 2025, arXiv:2506.03589. [Google Scholar]
  79. Delaney, E.; Fu, Z.; Wachter, S.; Mittelstadt, B.; Russell, C. Oxonfair: A flexible toolkit for algorithmic fairness. Adv. Neural Inf. Process. Syst. 2024, 37, 94209–94245. [Google Scholar]
  80. Wang, Z.; Qinami, K.; Karakozis, I.C.; Genova, K.; Nair, P.; Hata, K.; Russakovsky, O. Towards fairness in visual recognition: Effective strategies for bias mitigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 8919–8928. [Google Scholar]
  81. Ma, Y.; Jiao, L.; Liu, F.; Li, L.; Ma, W.; Yang, S.; Liu, X.; Chen, P. Unveiling and mitigating generalized biases of dnns through the intrinsic dimensions of perceptual manifolds. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 47, 2237–2244. [Google Scholar] [CrossRef] [PubMed]
  82. Chang, Y.; Chang, Y.; Wu, Y. BA-LoRA: Bias-Alleviating Low-Rank Adaptation to Mitigate Catastrophic Inheritance in Large Language Models. arXiv 2024, arXiv:2408.04556. [Google Scholar]
  83. Zhou, Y.; Tang, J.; Yang, S.; Xiao, X.; Dai, Y.; Yang, W.; Gou, C.; Xia, X.; Chua, T.S. Logic Unseen: Revealing the Logical Blindspots of Vision-Language Models. arXiv 2025, arXiv:2508.11317. [Google Scholar] [CrossRef]
  84. Zhang, Z.; Feng, M.; Li, Z.; Xu, C. Discover and mitigate multiple biased subgroups in image classifiers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 10906–10915. [Google Scholar]
  85. Parashar, S.; Lin, Z.; Liu, T.; Dong, X.; Li, Y.; Ramanan, D.; Caverlee, J.; Kong, S. The neglected tails in vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 12988–12997. [Google Scholar]
  86. Birhane, A.; Prabhu, V.U.; Kahembwe, E. Multimodal datasets: Misogyny, pornography, and malignant stereotypes. arXiv 2021, arXiv:2110.01963. [Google Scholar] [CrossRef]
  87. Zhao, D.; Wang, A.; Russakovsky, O. Understanding and evaluating racial biases in image captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 14830–14840. [Google Scholar]
  88. Sabir, A.; Padró, L. Women Wearing Lipstick: Measuring the Bias Between an Object and Its Related Gender. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023. [Google Scholar]
  89. Abdelrahman, E.; Sun, P.; Li, L.E.; Elhoseiny, M. Imagecaptioner2: Image captioner for image captioning bias amplification assessment. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 20902–20911. [Google Scholar]
  90. Fraser, K.C.; Kiritchenko, S. Examining Gender and Racial Bias in Large Vision–Language Models Using a Novel Dataset of Parallel Images. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), St. Julian’s, Malta, 17–22 March 2024; pp. 690–713. [Google Scholar]
  91. Konavoor, A.; Dandekar, R.A.; Dandekar, R.; Panat, S. Vision-Language Models display a strong gender bias. arXiv 2025, arXiv:2508.11262. [Google Scholar] [CrossRef]
  92. Yang, F.; Ghosh, S.; Barut, E.; Qin, K.; Wanigasekara, P.; Su, C.; Ruan, W.; Gupta, R. Masking latent gender knowledge for debiasing image captioning. In Proceedings of the 4th Workshop on Trustworthy Natural Language Processing (TrustNLP 2024), Mexico City, Mexico, 21 June 2024; pp. 227–238. [Google Scholar]
  93. Hirota, Y.; Nakashima, Y.; Garcia, N. Model-agnostic gender debiased image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 15191–15200. [Google Scholar]
  94. Buijsman, S. Navigating fairness measures and trade-offs. AI Ethics 2024, 4, 1323–1334. [Google Scholar] [CrossRef]
  95. Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.; Vijayanarasimhan, S.; Viola, F.; Green, T.; Back, T.; Natsev, P.; et al. The Kinetics Human Action Video Dataset. arXiv 2017, arXiv:1705.06950. [Google Scholar] [CrossRef]
  96. Grauman, K.; Westbury, A.; Byrne, E.; Chavis, Z.; Furnari, A.; Girdhar, R.; Hamburger, J.; Jiang, H.; Liu, M.; Liu, X.; et al. Ego4d: Around the world in 3000 h of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 18995–19012. [Google Scholar]
  97. Goyal, R.; Ebrahimi Kahou, S.; Michalski, V.; Materzynska, J.; Westphal, S.; Kim, H.; Haenel, V.; Fruend, I.; Yianilos, P.; Mueller-Freitag, M.; et al. The “something something” video database for learning and evaluating visual common sense. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5842–5850. [Google Scholar]
  98. Zhou, L.; Xu, C.; Corso, J. Towards automatic learning of procedures from web instructional videos. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
  99. Xu, J.; Mei, T.; Yao, T.; Rui, Y. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  100. Tang, Y.; Ding, D.; Rao, Y.; Zheng, Y.; Zhang, D.; Zhao, L.; Lu, J.; Zhou, J. Coin: A large-scale dataset for comprehensive instructional video analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1207–1216. [Google Scholar]
  101. Luo, H.; Ji, L.; Shi, B.; Huang, H.; Duan, N.; Li, T.; Li, J.; Bharti, T.; Zhou, M. UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation. arXiv 2020, arXiv:2002.06353. [Google Scholar]
  102. Sun, C.; Myers, A.; Vondrick, C.; Murphy, K.; Schmid, C. Videobert: A joint model for video and language representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7464–7473. [Google Scholar]
  103. Zhu, L.; Yang, Y. Actbert: Learning global-local video-text representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 8746–8755. [Google Scholar]
  104. Miech, A.; Zhukov, D.; Alayrac, J.B.; Tapaswi, M.; Laptev, I.; Sivic, J. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2630–2640. [Google Scholar]
  105. Zhukov, D.; Alayrac, J.B.; Cinbis, R.G.; Fouhey, D.; Laptev, I.; Sivic, J. Cross-task weakly supervised learning from instructional videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3537–3545. [Google Scholar]
  106. Xu, H.; Ghosh, G.; Huang, P.Y.; Okhonko, D.; Aghajanyan, A.; Metze, F.; Zettlemoyer, L.; Feichtenhofer, C. VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 6787–6800. [Google Scholar]
  107. Bain, M.; Nagrani, A.; Varol, G.; Zisserman, A. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 1728–1738. [Google Scholar]
  108. Ma, Y.; Xu, G.; Sun, X.; Yan, M.; Zhang, J.; Ji, R. X-clip: End-to-end multi-grained contrastive learning for video-text retrieval. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 638–647. [Google Scholar]
  109. Bolya, D.; Huang, P.Y.; Sun, P.; Cho, J.H.; Madotto, A.; Wei, C.; Ma, T.; Zhi, J.; Rajasegaran, J.; Rasheed, H.; et al. Perception encoder: The best visual embeddings are not at the output of the network. arXiv 2025, arXiv:2504.13181. [Google Scholar] [CrossRef]
  110. Hoffmann, J.; Borgeaud, S.; Mensch, A.; Buchatskaya, E.; Cai, T.; Rutherford, E.; de Las Casas, D.; Hendricks, L.A.; Welbl, J.; Clark, A.; et al. Training compute-optimal large language models. In Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; pp. 30016–30030. [Google Scholar]
  111. Pramanick, S.; Song, Y.; Nag, S.; Lin, K.Q.; Shah, H.; Shou, M.Z.; Chellappa, R.; Zhang, P. Egovlpv2: Egocentric video-language pre-training with fusion in the backbone. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 5285–5297. [Google Scholar]
  112. Chowdhery, A.; Narang, S.; Devlin, J.; Bosma, M.; Mishra, G.; Roberts, A.; Barham, P.; Chung, H.W.; Sutton, C.; Gehrmann, S.; et al. Palm: Scaling language modeling with pathways. J. Mach. Learn. Res. 2023, 24, 1–113. [Google Scholar]
  113. Damen, D.; Doughty, H.; Farinella, G.M.; Fidler, S.; Furnari, A.; Kazakos, E.; Moltisanti, D.; Munro, J.; Perrett, T.; Price, W.; et al. The epic-kitchens dataset: Collection, challenges and baselines. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 4125–4141. [Google Scholar] [CrossRef]
  114. Chen, Z.; Wu, J.; Wang, W.; Su, W.; Chen, G.; Xing, S.; Zhong, M.; Zhang, Q.; Zhu, X.; Lu, L.; et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 24185–24198. [Google Scholar]
  115. Xu, H.; Ye, Q.; Yan, M.; Shi, Y.; Ye, J.; Xu, Y.; Li, C.; Bi, B.; Qian, Q.; Wang, W.; et al. mplug-2: A modularized multi-modal foundation model across text, image and video. In Proceedings of the International Conference on Machine Learning, ICML, Honolulu, HI, USA, 23–29 July 2023; pp. 38728–38748. [Google Scholar]
  116. Yuan, L.; Wang, J.; Sun, H.; Zhang, Y.; Lin, Y. Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding. arXiv 2025, arXiv:2501.07888. [Google Scholar] [CrossRef]
  117. Tong, Z.; Song, Y.; Wang, J.; Wang, L. VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training. arXiv 2022, arXiv:2203.12602. [Google Scholar]
  118. Soomro, K.; Zamir, A.R.; Shah, M. UCF101: A Dataset of 101 Human Actions Classes From Videos in the Wild. arXiv 2012, arXiv:1212.0402. [Google Scholar] [CrossRef]
  119. Zanella, L.; Menapace, W.; Mancini, M.; Wang, Y.; Ricci, E. Harnessing large language models for training-free video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 18527–18536. [Google Scholar]
  120. Zou, S.; Tian, X.; Wesemann, L.; Waschkowski, F.; Yang, Z.; Zhang, J. Unlocking Vision-Language Models for Video Anomaly Detection via Fine-Grained Prompting. arXiv 2025, arXiv:2510.02155. [Google Scholar]
  121. Ye, M.; Liu, W.; He, P. Vera: Explainable video anomaly detection via verbalized learning of vision-language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, Vancouver, BC, Canada, 17–24 June 2025; pp. 8679–8688. [Google Scholar]
  122. Zhang, H.; Xu, X.; Wang, X.; Zuo, J.; Han, C.; Huang, X.; Gao, C.; Wang, Y.; Sang, N. Holmes-VAD: Towards unbiased and explainable video anomaly detection via multi-modal LLM. arXiv 2024, arXiv:2406.12235. [Google Scholar]
  123. Gonzalez, J.A.A.; Matsukawa, T.; Suzuki, E. Leveraging Vision Language Models for Understanding and Detecting Violence in Videos. In Proceedings of the International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, Porto, Portugal, 26–28 February 2025; Science and Technology Publications: Setúbal, Portugal, 2025; Volume 2, pp. 99–113. [Google Scholar]
  124. Leng, J.; Wu, Z.; Tan, M.; Mo, M.; Zheng, J.; Li, Q.; Gan, J.; Gao, X. PiercingEye: Dual-Space Video Violence Detection with Hyperbolic Vision-Language Guidance. arXiv 2025, arXiv:2504.18866. [Google Scholar] [CrossRef] [PubMed]
  125. Lim, J.; Lee, J.; Kim, H.; Park, E. ViCap-AD: Video caption-based weakly supervised video anomaly detection. Mach. Vis. Appl. 2025, 36, 61. [Google Scholar] [CrossRef]
126. Sultani, W.; Chen, C.; Shah, M. Real-world anomaly detection in surveillance videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6479–6488. [Google Scholar]
  127. Wu, P.; Liu, J.; Shi, Y.; Sun, Y.; Shao, F.; Wu, Z.; Yang, Z. Not only look, but also listen: Learning multimodal violence detection under weak supervision. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 322–339. [Google Scholar]
  128. Cheng, M.; Cai, K.; Li, M. RWF-2000: An open large scale video database for violence detection. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 4183–4190. [Google Scholar]
  129. Bermejo Nievas, E.; Deniz Suarez, O.; Bueno García, G.; Sukthankar, R. Violence detection in video using computer vision techniques. In Proceedings of the International Conference on Computer Analysis of Images and Patterns, Seville, Spain, 29–31 August 2011; pp. 332–339. [Google Scholar]
  130. Aktı, Ş.; Tataroğlu, G.A.; Ekenel, H.K. Vision-based fight detection from surveillance cameras. In Proceedings of the 2019 Ninth International Conference on Image Processing Theory, Tools and Applications (IPTA), Istanbul, Turkey, 6–9 November 2019; pp. 1–6. [Google Scholar]
  131. Jiang, X.; Zong, Y.; Zheng, W.; Tang, C.; Xia, W.; Lu, C.; Liu, J. Dfew: A large-scale database for recognizing dynamic facial expressions in the wild. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 2881–2889. [Google Scholar]
  132. Pepa, L.; Spalazzi, L.; Capecci, M.; Ceravolo, M.G. Automatic emotion recognition in clinical scenario: A systematic review of methods. IEEE Trans. Affect. Comput. 2021, 14, 1675–1695. [Google Scholar] [CrossRef]
  133. Yang, D.; Huang, S.; Xu, Z.; Li, Z.; Wang, S.; Li, M.; Wang, Y.; Liu, Y.; Yang, K.; Chen, Z.; et al. Aide: A vision-driven multi-view, multi-modal, multi-tasking dataset for assistive driving perception. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 20459–20470. [Google Scholar]
  134. Kosti, R.; Alvarez, J.M.; Recasens, A.; Lapedriza, A. Context based emotion recognition using emotic dataset. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 2755–2766. [Google Scholar] [CrossRef]
  135. Luo, Y.; Ye, J.; Adams, R.B., Jr.; Li, J.; Newman, M.G.; Wang, J.Z. ARBEE: Towards automated recognition of bodily expression of emotion in the wild. Int. J. Comput. Vis. 2020, 128, 1–25. [Google Scholar] [CrossRef] [PubMed]
  136. Yang, D.; Huang, S.; Wang, S.; Liu, Y.; Zhai, P.; Su, L.; Li, M.; Zhang, L. Emotion recognition for multiple context awareness. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 144–162. [Google Scholar]
137. Kosti, R.; Alvarez, J.M.; Recasens, A.; Lapedriza, A. Emotic: Emotions in context dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 61–69. [Google Scholar]
  138. Lee, J.; Kim, S.; Kim, S.; Park, J.; Sohn, K. Context-aware emotion recognition networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 10143–10152. [Google Scholar]
139. Yang, J.; Huang, Q.; Ding, T.; Lischinski, D.; Cohen-Or, D.; Huang, H. Emoset: A large-scale visual emotion dataset with rich attributes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 20383–20394. [Google Scholar]
  140. Wang, Y.; Sun, Y.; Huang, Y.; Liu, Z.; Gao, S.; Zhang, W.; Ge, W.; Zhang, W. Ferv39k: A large-scale multi-scene dataset for facial expression recognition in videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 20922–20931. [Google Scholar]
  141. Dhall, A.; Goecke, R.; Lucey, S.; Gedeon, T. Acted Facial Expressions in the Wild Database; Technical Report TR-CS-11; Australian National University: Canberra, Australia, 2011; Volume 2. [Google Scholar]
  142. Li, S.; Deng, W.; Du, J. Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2852–2861. [Google Scholar]
  143. Mollahosseini, A.; Hasani, B.; Mahoor, M.H. Affectnet: A database for facial expression, valence, and arousal computing in the wild. IEEE Trans. Affect. Comput. 2017, 10, 18–31. [Google Scholar] [CrossRef]
  144. Zhang, Z.; Luo, P.; Loy, C.C.; Tang, X. Learning social relation traits from face images. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 3631–3639. [Google Scholar]
  145. Etesam, Y.; Yalçın, Ö.N.; Zhang, C.; Lim, A. Contextual emotion recognition using large vision language models. In Proceedings of the 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Abu Dhabi, United Arab Emirates, 14–18 October 2024; pp. 4769–4776. [Google Scholar]
  146. Zhou, K.; Lee, K.; Misu, T.; Wang, X. ViCor: Bridging Visual Understanding and Commonsense Reasoning with Large Language Models. In Proceedings of the Findings of the Association for Computational Linguistics ACL 2024, Bangkok, Thailand, 11–16 August 2024; pp. 10783–10795. [Google Scholar]
  147. Lian, Z.; Sun, H.; Sun, L.; Yi, J.; Liu, B.; Tao, J. AffectGPT: Dataset and framework for explainable multimodal emotion recognition. arXiv 2024, arXiv:2407.07653. [Google Scholar] [CrossRef]
  148. Lei, Y.; Yang, D.; Chen, Z.; Chen, J.; Zhai, P.; Zhang, L. Large Vision-Language Models as Emotion Recognizers in Context Awareness. In Proceedings of the Asian Conference on Machine Learning, PMLR, Taipei, Taiwan, 9–12 December 2025; pp. 111–126. [Google Scholar]
149. Bhattacharyya, S.; Wang, J.Z. Evaluating Vision-Language Models for Emotion Recognition. In Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2025, Albuquerque, NM, USA, 29 April–4 May 2025; pp. 1798–1820. [Google Scholar]
  150. Xie, H.; Peng, C.J.; Tseng, Y.W.; Chen, H.J.; Hsu, C.F.; Shuai, H.H.; Cheng, W.H. Emovit: Revolutionizing emotion insights with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 26596–26605. [Google Scholar]
  151. Seoh, R.; Goldwasser, D. EmoGist: Efficient In-Context Learning for Visual Emotion Understanding. arXiv 2025, arXiv:2505.14660. [Google Scholar]
  152. Wang, Z.; Zhang, Q.; Zhang, P.; Niu, W.; Zhang, K.; Sankaranarayana, R.; Caldwell, S.; Gedeon, T. Visual and textual prompts in vllms for enhancing emotion recognition. IEEE Trans. Circuits Syst. Video Technol. 2025, 19, 14. [Google Scholar] [CrossRef]
  153. Zhao, Z.; Cao, Y.; Gong, S.; Patras, I. Enhancing zero-shot facial expression recognition by llm knowledge transfer. In Proceedings of the 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Tucson, AZ, USA, 26 February–6 March 2025; pp. 815–824. [Google Scholar]
  154. Hu, Z.; Yuan, K.; Liu, X.; Yu, Z.; Zong, Y.; Shi, J.; Yue, H.; Yang, J. Feallm: Advancing facial emotion analysis in multimodal large language models with emotional synergy and reasoning. arXiv 2025, arXiv:2505.13419. [Google Scholar] [CrossRef]
  155. Lan, X.; Xue, J.; Qi, J.; Jiang, D.; Lu, K.; Chua, T.S. Expllm: Towards chain of thought for facial expression recognition. IEEE Trans. Multimed. 2025, 27, 3069–3081. [Google Scholar] [CrossRef]
  156. Sălăgean, G.L.; Leba, M.; Ionica, A.C. Seeing the Unseen: Real-Time Micro-Expression Recognition with Action Units and GPT-Based Reasoning. Appl. Sci. 2025, 15, 6417. [Google Scholar] [CrossRef]
  157. Sun, L.; Jiang, X.; Chen, H.; Li, Y.; Lian, Z.; Liu, B.; Zong, Y.; Zheng, W.; Leppänen, J.M.; Zhao, G. Learning Transferable Facial Emotion Representations from Large-Scale Semantically Rich Captions. arXiv 2025, arXiv:2507.21015. [Google Scholar] [CrossRef]
  158. Lin, Z.; Wang, Y.; Zhou, Y.; Du, F.; Yang, Y. MLM-EOE: Automatic Depression Detection via Sentimental Annotation and Multi-Expert Ensemble. IEEE Trans. Affect. Comput. 2025, 16, 2842–2858. [Google Scholar] [CrossRef]
  159. Zhao, Z.; Patras, I. Prompting Visual-Language Models for Dynamic Facial Expression Recognition. In Proceedings of the BMVC, Aberdeen, UK, 20–24 November 2023. [Google Scholar]
  160. Chen, H.; Huang, H.; Dong, J.; Zheng, M.; Shao, D. Finecliper: Multi-modal fine-grained clip for dynamic facial expression recognition with adapters. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, VIC, Australia, 28 October–1 November 2024; pp. 2301–2310. [Google Scholar]
  161. Huang, D.; Li, Q.; Yan, C.; Cheng, Z.; Huang, Y.; Li, X.; Li, B.; Wang, X.; Lian, Z.; Peng, X. Emotion-Qwen: Training Hybrid Experts for Unified Emotion and General Vision-Language Understanding. arXiv 2025, arXiv:2505.06685. [Google Scholar]
  162. Poria, S.; Hazarika, D.; Majumder, N.; Naik, G.; Cambria, E.; Mihalcea, R. MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019. [Google Scholar]
  163. Zellers, R.; Bisk, Y.; Farhadi, A.; Choi, Y. From recognition to cognition: Visual commonsense reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 6720–6731. [Google Scholar]
  164. Lian, Z.; Sun, H.; Sun, L.; Chen, K.; Xu, M.; Wang, K.; Xu, K.; He, Y.; Li, Y.; Zhao, J.; et al. Mer 2023: Multi-label learning, modality robustness, and semi-supervised learning. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 9610–9614. [Google Scholar]
  165. Liu, Y.; Dai, W.; Feng, C.; Wang, W.; Yin, G.; Zeng, J.; Shan, S. Mafw: A large-scale, multi-modal, compound affective database for dynamic facial expression recognition in the wild. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 24–32. [Google Scholar]
  166. Foteinopoulou, N.M.; Patras, I. Emoclip: A vision-language method for zero-shot video facial expression recognition. In Proceedings of the 2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG), Istanbul, Turkey, 27–31 May 2024; pp. 1–10. [Google Scholar]
  167. Popescu, C.B.; Florea, L.; Florea, C. Mitigating Context Bias in Vision–Language Models via Multimodal Emotion Recognition. Electronics 2025, 14, 3311. [Google Scholar] [CrossRef]
  168. Lu, H.; Niu, X.; Wang, J.; Wang, Y.; Hu, Q.; Tang, J.; Zhang, Y.; Yuan, K.; Huang, B.; Yu, Z.; et al. Gpt as psychologist? Preliminary evaluations for gpt-4v on visual affective computing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 322–331. [Google Scholar]
Figure 1. Captioning examples for violence detection (a) and emotion recognition (b). A prompt (top) is used to interrogate a VLM (here, Google Gemini), which responds with a description (caption) of the query image. The description is then integrated into the overall solution or used directly.
Figure 2. A general diagram of an architecture that adds image captioning to improve performance.
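The captioning-based architecture of Figure 2 can be sketched as follows. This is a schematic sketch only: `query_vlm` is a hypothetical stand-in for an actual VLM API call, and the toy bag-of-words encoder stands in for the text encoder of a real system (e.g., CLIP's).

```python
import numpy as np

def query_vlm(image, prompt):
    """Hypothetical stand-in for a real VLM API call (e.g., Gemini).
    Returns a fixed caption here so the sketch is runnable."""
    return "Two people are arguing in a dimly lit street."

def embed_text(caption, dim=8):
    """Toy text encoder: hash words into a fixed-size bag-of-words vector,
    then L2-normalize. A real system would use a learned text encoder."""
    vec = np.zeros(dim)
    for word in caption.lower().split():
        vec[hash(word) % dim] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-8)

def fuse(image_features, caption_embedding):
    """Late fusion by concatenation, as in the diagram: the caption
    embedding is appended to the visual features before the classifier."""
    return np.concatenate([image_features, caption_embedding])

# Example: a dummy 16-d visual feature fused with an 8-d caption embedding.
image = None  # placeholder for pixel data
caption = query_vlm(image, "Describe the scene in one sentence.")
fused = fuse(np.random.randn(16), embed_text(caption))
print(fused.shape)  # (24,)
```

The fused vector would then feed a task head (action, violence, or emotion classifier); more elaborate schemes replace the concatenation with cross-attention.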
Figure 3. Example of bias in image generation. The tool (here, Canva) generated three images from a single prompt (listed at top left). Note the differences in dress.
Figure 4. Example of bias in image captioning. In all cases, the prompt was “What are the 5 keywords that describe the characteristics of people like the person in this image?” The top row of answers was provided by ChatGPT 5.2 and the second by Gemini 2. Note the slight differences in the adjectives chosen for the central images compared with those on either side: in some cases the focus is on elegance, in others on a plain description of physical attributes.
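Caption-level disagreements of the kind shown in Figure 4 can be quantified with a simple set-overlap measure. The sketch below uses hypothetical keyword lists in the style of the figure (not the actual model outputs) and computes their Jaccard overlap as a crude consistency probe across models.

```python
def keyword_overlap(a, b):
    """Jaccard overlap between two keyword lists: |A ∩ B| / |A ∪ B|.
    Low values indicate the two models describe the same person differently."""
    sa = set(w.lower() for w in a)
    sb = set(w.lower() for w in b)
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 1.0

# Hypothetical answers in the style of Figure 4 (illustrative only).
gpt_keywords = ["elegant", "confident", "stylish", "professional", "poised"]
gemini_keywords = ["confident", "professional", "approachable", "modern", "stylish"]
print(round(keyword_overlap(gpt_keywords, gemini_keywords), 2))  # 0.43
```

Averaging such overlaps over a demographically balanced image set gives a first, coarse indication of whether the captioner treats different groups consistently.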
Figure 5. Examples of video–caption pairs drawn from three representative datasets: (a) Ego4D [96], (b) MSR-VTT [99], and (c) HowTo100M [104].
Figure 6. Example of images from common datasets used for video anomaly and violence detection. The caption was provided by ChatGPT.
Figure 7. Examples from major emotion recognition datasets grouped by modality: (top row)—full-body emotion datasets, (middle row)—facial expression datasets, and (bottom row)—multimodal or contextual datasets integrating audio, video, and text cues.
Table 7. Databases used in contextual emotion recognition that benefited from VLM captioning. The top part lists datasets focused on body language and the face, while the bottom part lists datasets focused on facial expressions.
| Dataset | Size | Labels | Observations |
|---|---|---|---|
| BOLD [135] | 9876 video clips and 13,239 annotated human characters; 70/10/20 split. | 26 emotion classes. | In-the-wild body language dataset focusing on motion and keypoints. Designed for bodily expression recognition. |
| HECO [136] | 9385 images with 19,781 annotated agents (multiple people per image). | 8 discrete emotions + VAD dimensional scores. | Context-rich static image dataset focusing on multiple agents and their interactions; annotated by experts. Contains occlusions and ambiguous cases for realism. |
| EMOTIC [137] | 23,571 images and 34,320 annotated people; 70/10/20 split. | 26 emotion classes. | In-the-wild dataset emphasizing contextual cues beyond facial expressions. Highly imbalanced, reflecting real-world emotion distributions. |
| CAER/CAER-S [138] | 13,201 video clips (∼1,107,877 frames) and ∼70 k static images (CAER-S). | 7 emotion classes. | Collected from TV shows, providing rich contextual emotion cues from human interactions in diverse scenarios. |
| EmoSet [139] | 3.3 M weakly labeled images; 118 K human-annotated subset. | 8 emotion categories based on Mikel’s model. | Large-scale visual emotion dataset from social media and artistic sources, emphasizing contextual and affective content. |
| DFEW [131] | 16,372 facial expression clips from 1500 movies. | 7 emotion classes. | Dynamic facial expression recognition dataset in the wild; focuses exclusively on facial cues from diverse cinematic sources. |
| FERV39K [140] | 38,935 video clips (>1 M frames) from 4K raw videos across 22 scene types. | 7 emotion classes. | Large-scale in-the-wild dataset emphasizing diverse multi-scene contexts and dynamic facial cues. |
| AFEW [141] | 1426 video clips from movies featuring 330 subjects aged 1–70. | 7 emotion classes. | One of the earliest in-the-wild benchmarks for emotion recognition. Acted, but expressions include real-world variability in lighting, occlusion, and background. |
| RAF-DB [142] | 29,672 images. | 7 basic + 12 compound emotions. | In-the-wild Internet images; crowd-annotated with quality control. |
| AffectNet [143] | 450 k images (400 k manually annotated). | 8 categorical; valence/arousal (dimensional). | Large-scale, Internet-sourced; dual annotation scheme (categorical + dimensional). |
| ExpW [144] | 91,793 faces (manually annotated). | 7 categorical emotions. | Large-scale, Internet-sourced. |
Share and Cite

MDPI and ACS Style

Florea, C.; Popescu, C.-B.; Racovițeanu, A.; Nițu, A.; Florea, L. From Context to Human: A Review of VLM Contextualization in the Recognition of Human States in Visual Data. Mathematics 2026, 14, 175. https://doi.org/10.3390/math14010175
