From Context to Human: A Review of VLM Contextualization in the Recognition of Human States in Visual Data
Abstract
1. Introduction
2. Related Work
3. General Architecture
- Image description. The branch marked in blue describes the image content using either a convolutional neural network (CNN, suitable for low-resource settings) or a Vision Transformer (ViT).
- Problem-specific descriptors. The central branch is optional and depends on the problem. For example, in action recognition or body-based emotion recognition, it may include a system that identifies keypoints at body joints (a skeletal representation). For facial expression recognition, it may involve facial keypoints or an action-unit descriptor. For recognition in video, an optical-flow descriptor may be used.
- VLM captioning. The branch highlighted in red is the focal point of this paper.
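The three branches above can be sketched as a simple late-fusion pipeline. The extractor names below (`image_features`, `keypoint_features`, `caption_embedding`) are illustrative stand-ins, not components of any cited system; a real pipeline would plug in a CNN/ViT, a keypoint detector, and a VLM captioner:

```python
def image_features(image):
    # Stand-in for a CNN/ViT embedding (the blue branch).
    return [0.1, 0.4, 0.2]

def keypoint_features(image):
    # Stand-in for the optional problem-specific descriptor
    # (e.g., skeletal or facial keypoints).
    return [0.3, 0.7]

def caption_embedding(image):
    # Stand-in for a VLM-generated caption encoded as text features
    # (the red branch, the focus of this review).
    return [0.5, 0.2, 0.9]

def fuse_and_classify(image, weights):
    # Late fusion by concatenation, followed by a linear scorer.
    fused = image_features(image) + keypoint_features(image) + caption_embedding(image)
    return sum(w * f for w, f in zip(weights, fused))

score = fuse_and_classify(None, weights=[1.0] * 8)
```

In practice the fusion step is usually a learned module (e.g., a small Transformer) rather than a fixed linear scorer.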
Formalization
4. Vision–Language Models
4.1. Categories of VLMs
- Contrastive (dual-encoder) models;
- Generative/masked modeling VLMs;
- Pre-trained backbones with adapters;
- Unified multimodal Transformers.
4.1.1. Contrastive (Dual-Encoder) Models
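A contrastive dual-encoder scores one image against a set of candidate texts by cosine similarity followed by a temperature-scaled softmax. A minimal sketch with toy two-dimensional embeddings (the vectors and the temperature value are illustrative, not taken from CLIP):

```python
import math

def cosine(u, v):
    # Cosine similarity between two vectors.
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def zero_shot_probs(image_emb, text_embs, temperature=0.07):
    # CLIP-style scoring: softmax over temperature-scaled cosine
    # similarities between one image and all candidate class texts.
    logits = [cosine(image_emb, t) / temperature for t in text_embs]
    m = max(logits)
    exps = [math.exp(s - m) for s in logits]
    z = sum(exps)
    return [e / z for e in exps]

image = [0.9, 0.1]                # toy image embedding
texts = [[1.0, 0.0], [0.0, 1.0]]  # toy text embeddings for two classes
probs = zero_shot_probs(image, texts)
```

The same computation, batched over many image–text pairs, is what the contrastive (InfoNCE-style) training objective optimizes.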
4.1.2. Generative/Masked Modeling VLMs
4.1.3. Pre-Trained Backbones with Adapters
4.1.4. Unified Multimodal Transformers
4.2. Practical Aspects
4.2.1. Choosing a Model
4.2.2. Dataset Sizes and Curation
- Web-scale collections of image–text pairs;
- Filtering and bootstrapping for noise reduction;
- Curated domain-specific datasets.
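The filtering step above can be sketched as a threshold on an image–text alignment score. In practice the score would come from a pre-trained model such as CLIP; here it is replaced by a toy lookup (all names and values are illustrative):

```python
def filter_pairs(pairs, score_fn, threshold=0.3):
    # Keep only image-text pairs whose alignment score reaches the
    # threshold -- a common filtering step for noisy web-scale corpora.
    return [p for p in pairs if score_fn(*p) >= threshold]

# Toy alignment scores standing in for CLIP similarities.
scores = {("img1", "a dog"): 0.8,
          ("img2", "random spam"): 0.1,
          ("img3", "a beach"): 0.5}
kept = filter_pairs(list(scores), lambda img, txt: scores[(img, txt)])
```

Bootstrapping approaches (e.g., BLIP-style captioners) go one step further and rewrite, rather than merely discard, low-scoring captions.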
Fine-Tuning Strategies
- Full fine-tuning of all model parameters;
- Adapter-based fine-tuning (e.g., LoRA);
- Prompt tuning;
- Instruction tuning;
- Probing classifiers.
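Adapter-based fine-tuning such as LoRA keeps the pre-trained weight matrix frozen and learns only a low-rank update. A minimal forward-pass sketch (the dimensions and values below are illustrative):

```python
def matvec(M, v):
    # y = M @ v for a matrix stored as a list of rows.
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(x, W, A, B, alpha=1.0):
    # LoRA forward pass: the frozen base projection W @ x plus a trainable
    # low-rank update alpha * B @ (A @ x), where A is r x k, B is d x r,
    # and the rank r is much smaller than d and k.
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + alpha * d for b, d in zip(base, delta)]

x = [1.0, 2.0]
W = [[1.0, 0.0], [0.0, 1.0]]  # frozen 2x2 weight (identity, for illustration)
A = [[0.5, 0.5]]              # rank-1 down-projection (1 x 2), trainable
B = [[1.0], [0.0]]            # rank-1 up-projection (2 x 1), trainable
y = lora_forward(x, W, A, B)
```

Only A and B receive gradients during training, which is why adapter methods fit on modest hardware.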
5. Bias in VLMs and Implications
5.1. Mathematics of Bias
Inductive Bias
5.2. Dataset Bias
5.3. Theoretical Limitations in Bias and Fairness
5.4. Bias Quantification and Mitigation
- Pre-Processing Methods:
- Relabeling and Perturbation (RP): Modify training data either by changing ground-truth labels (relabeling) or altering feature values (perturbation).
- Sampling (Samp): Adjust the training data distribution by adding or removing samples, or by reweighting their influence during training.
- Latent Variables (LV): Augment training data with additional, preferably unbiased, features.
- Representation Learning (Repr): Learn transformations of training data that reduce bias while retaining as much useful information as possible.
- In-Processing Methods:
- Regularization and Constraints (RC): Modify the learning algorithm’s loss function. Regularization introduces penalty terms for discrimination (increasing loss when discrimination occurs), while constraints impose bias limits that cannot be violated during training.
- Adversarial Learning (AvL): Train models alongside adversaries. The model predicts ground-truth values, while the adversary attempts to exploit fairness violations.
- Compositional Approaches (CA): Train multiple classifiers, each specialized for a specific group (e.g., privileged vs. unprivileged), and combine their predictions.
- Adjusted Learning (AdL): Adapt existing algorithms or develop new ones with explicit bias mitigation mechanisms.
- Post-Processing Methods:
- Input Correction: Apply modifications to test data before prediction.
- Classifier Correction: Adjust trained models directly to satisfy fairness criteria.
- Output Correction: Modify predicted labels as a final step in enforcing fairness.
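Output correction can be illustrated with per-group score thresholds chosen so that every group receives positive predictions at the same rate (a demographic-parity-style criterion). The groups and scores below are toy values:

```python
def positive_rate(scores, threshold):
    # Fraction of samples predicted positive at this threshold.
    return sum(s >= threshold for s in scores) / len(scores)

def correct_outputs(scores_by_group, target_rate):
    # Output correction: choose a per-group decision threshold so that
    # each group's positive-prediction rate matches the target rate.
    thresholds = {}
    for group, scores in scores_by_group.items():
        k = round(len(scores) * target_rate)
        thresholds[group] = sorted(scores, reverse=True)[k - 1] if k else float("inf")
    return thresholds

groups = {"a": [0.9, 0.8, 0.2, 0.1], "b": [0.6, 0.4, 0.3, 0.2]}
th = correct_outputs(groups, target_rate=0.5)
```

Note that equalizing positive rates this way may trade off accuracy against fairness, which is one motivation for the in-processing methods above.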
| Solution | Moment | Type | VLM Targeted | Benchmarks Reported |
|---|---|---|---|---|
| Revise [74] | Pre-Proc | RP, Samp | All | COCO; Places; Visual Genome, etc. |
| VLBiasBench [75] | Pre-Proc | Samp | All | Synthetic |
| Multifair [76] | Pre-Proc | Samp, Repr | All | CelebA |
| Zhu et al. [77] | Pre-Proc | RP | All | Debiased COCO; other |
| BiMa [78] | Pre-Proc | LV | All txt-vid retrieval | MSR-VTT; ActivityNet; etc. |
| OxonFair [79] | Pre-Proc | AdL | All | CelebA; others |
| DomInd [80] | In-Proc | RC | All | CIFAR-10S; CelebA |
| IDR [81] | In-Proc | RC | All | MiniImageNet; Caltech-101, etc. |
| BA-LoRA [82] | In-Proc | RC | All (LLM) | Waterbirds; CelebA |
| PRISM [67] | In-Proc | RC | CLIP | Waterbirds; CelebA |
| LogicCLIP [83] | In-Proc | RC | All | LogicBench |
| DIM [84] | In-Proc | CA, RC | All | CIFAR-100; Breeds |
| REAL [85] | Post-Proc | CC | All | ImageNet; Flowers; EuroSAT; etc. |
| TriProTesting [24] | Post-Proc | CC | All | CelebA; UTKFace; FairFace, etc. |
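A quantification underlying several of the methods in the table is the statistical parity difference between two groups. A minimal sketch for a binary protected attribute (predictions and group labels are toy values):

```python
def parity_difference(preds, groups):
    # Statistical parity difference for a binary protected attribute:
    # |P(y_hat = 1 | g = 0) - P(y_hat = 1 | g = 1)|.
    rates = {}
    for g in set(groups):
        sel = [p for p, gg in zip(preds, groups) if gg == g]
        rates[g] = sum(sel) / len(sel)
    a, b = rates.values()
    return abs(a - b)

d = parity_difference(preds=[1, 1, 0, 0, 1, 0], groups=[0, 0, 0, 1, 1, 1])
```

A value of zero means both groups receive positive predictions at the same rate; the mitigation methods above aim to drive this (or a related metric) toward zero.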
5.5. Social Bias and Fairness
5.5.1. Bias in Training Data and Representation Spaces
5.5.2. Bias in Image Captioning Outputs
5.5.3. Mitigation-Oriented Approaches in Captioning
5.5.4. Normative and Philosophical Considerations
6. VLMs in Human Action Recognition
| Benchmark | Size/Duration | Data | Observations |
|---|---|---|---|
| MSR-VTT [99] | 10k YouTube videos; 200k descriptions (20 captions per clip); 40 h | Videos (10–30 s each) + manually labeled captions (avg 10 words/caption) | Standard benchmark for video → text retrieval and video question answering (Video QA). Scene context: H (diverse environments, activities) + interaction context: L (implicit, coarse) → tasks: video–text retrieval, Video QA, captioning. |
| YouCook2 [98] | 2k videos; 176 h (avg 5.26 min/video—each video is divided into 7–16 clips) | Cooking instructional videos + one caption per clip (10–20 words) | Used in instructional video understanding tasks (captioning, retrieval). Scene context: M (kitchen, cooking setup) + interaction context: M (human–object, procedural) → tasks: instructional captioning, retrieval, procedure understanding. |
| Kinetics-600 [95] | 474k YouTube videos (avg 10 s/video); 1317 h | Videos + 600 distinct actions classes (e.g., sports, domestic tasks) | Mainly used for video-level action classification. Scene context: H (sports, indoor/outdoor cues) + interaction context: L (coarse labels) → tasks: action classification, representation, pre-training. |
| Ego4D [96] | 3670 h collected from 923 participants across 74 locations and 9 countries (avg 8 min/video) | Egocentric videos (first-person view) + narrations—each narration is a free-form sentence associated with a single timestamp. | Scene context: H (egocentric, environment-aware) + interaction context: H (who–does–what–why, long-term) → tasks: long-term action anticipation, temporal grounding, QA, NL retrieval. |
| Something-Something-V2 (SSv2) [97] | 170k videos (220k clips with 4–5 s/clip); 276 h | Clips + labels—each clip is labeled with a template caption such as “Putting [something] onto [something else]”. | Widely used for action recognition tasks—specifically, video-level classification of fine-grained human–object interactions (e.g., “Pushing [something] from left to right”). Scene context: L (background suppressed) + interaction context: H (fine-grained object relations) → tasks: fine-grained action recognition, interaction-centric classification. |
| COIN [100] | 11k videos; 476 h; 46,354 annotated segments | Instructional videos, covering 180 distinct tasks across 12 different domains (e.g., vehicles, gadgets, household items) + annotations | It supports tasks such as step localization (finding where each step occurs in the video) and action segmentation (parsing the video into segments corresponding to steps). Scene context: M (task-/domain-specific) + interaction context: M (step-wise human–object actions) → tasks: action segmentation, step localization, instruction understanding. |
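R@K, the retrieval metric reported by most of the benchmarks above, is the fraction of queries whose correct item appears in the top K retrieved results. A minimal sketch (the rankings and ground truth below are toy values):

```python
def recall_at_k(ranked_lists, ground_truth, k):
    # ranked_lists[i] is the system's ranking for query i;
    # ground_truth[i] is the single correct item for that query.
    hits = sum(gt in ranked[:k] for ranked, gt in zip(ranked_lists, ground_truth))
    return hits / len(ground_truth)

rankings = [["v3", "v1", "v2"], ["v2", "v3", "v1"]]
truth = ["v1", "v1"]
r1 = recall_at_k(rankings, truth, 1)
r2 = recall_at_k(rankings, truth, 2)
```

Reported R@1/R@5/R@10 numbers (e.g., on MSR-VTT or YouCook2) are this statistic over the full test query set.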
| Solution | VLM (Visual Language Model/Architecture) | Benchmark and Performance (Metric and Value) | Improvement/Key Contribution |
|---|---|---|---|
| Luo et al., 2021 [101] | UniVL (Text Encoder + Video Encoder); Cross-Modal Encoder; Decoder | Video → Text Retrieval on YouCook2 (FT): R@1: 28.9%; R@10: 70.0%. Video Captioning on YouCook2 (FT): BLEU-4: 17.35, CIDEr: 1.81, METEOR: 22.35. | Early cross-modal VLM; objectives: retrieval, captioning; fine-grained semantics limited by caption quality (e.g., noisy ASR in HowTo100M); no LLM → limited contextual reasoning. |
| Xu et al., 2021 [106] | VideoCLIP (CLIP + Video Encoder) | Text → Video Retrieval on YouCook2 (ZS): R@1: 22.7, R@5: 50.4, R@10: 63.1. Action Segmentation on COIN (ZS): Accuracy 58.9%. | Contrastive video–text model; objectives: retrieval, zero-shot transfer; fine-grained reasoning constrained by pre-training captions; limited temporal modeling. |
| Alayrac et al., 2022 [43] | Flamingo (Frozen Vision Encoder + LLM) | VQA on MSRVTT (ZS): Top-1: 19.2%; (32 shots): Top-1: 31.0%. | LLM-based VLM enabling in-context few-shot learning; objectives: retrieval, captioning, QA; distinguishes similar actions via compositional/contextual reasoning; prompt-sensitive, high compute, weaker for strict classification. |
| Pramanick et al., 2023 [111] | EgoVLPv2 (TimeSformer + RoBERTa-B + Cross-Attention Gated Fusion) | Video → Text Retrieval on EgoMCQ (ZS): Inter Acc. 91.0%, Intra Acc. 60.9%. Video QA on EgoTaskQA head-tuned: Mean Acc.: 36.53%. | Egocentric VLM pre-trained on Ego4D; objectives: retrieval, QA, grounding; fine-grained semantics via hand/object and temporal context; relies on dense aligned narrations. |
| Xu et al., 2023 [115] | mPLUG-2 (Dual-Vision Encoder + Cross-Attention Fusion + Shared Decoder) | Text → Video Retrieval on MSRVTT (ZS): R@1: 47.1; (FT): R@1: 53.1. Video Captioning MSRVTT (FT): CIDEr: 80.3; BLEU-4: 57.8. Video QA on MSRVTT-QA (FT): Acc. 48.0. | Video–language model exploiting cross-modal correlations; objectives: retrieval, captioning, QA; fine-grained semantics via object–actor–scene grounding; needs careful alignment in multi-task supervision. |
| Chen et al., 2024 [114] | InternVL (InternViT-6B + Language Middleware + LLM Decoder) | Text → Video Retrieval on MSRVTT (ZS): R@1: 46.3. Video Classification on Kinetics-600 (ZS): 78.8% (Top-1/5 average). | Large-scale multimodal VLM with LLM alignment; objectives: cross-modal matching, generative captioning; fine-grained semantics limited by lack of temporal/action supervision; weaker for long-horizon reasoning. |
| Bolya et al., 2025 [109] | Perception Encoder (PE) (Scaled Vision Encoder + CLIP-Like Pre-Training + Video Data Engine) | Video Classification on Kinetics-400 (ZS, 8 frames): Top-1: 76.9%. Kinetics-600 (ZS, 8 frames): Top-1: 76.1%. Text→Video retrieval on MSR-VTT (ZS): 51.2 R@1. | Scaled vision encoder with CLIP-style pre-training; objectives: zero-shot classification, retrieval; limited fine-grained multimodal grounding; coarse linguistic and interaction supervision. |
| Yuan et al., 2025 [116] | Tarsier2-7B (VLM with 7B Parameters, Initialized from Qwen2-VL) | VQA EgoTaskQA (FT): 77.5% Exact Match. | Large-scale video–text pre-training with temporal alignment; objectives: retrieval, QA, captioning; detailed grounding via object–actor–scene; susceptible to hallucination and subtle visual ambiguities. |
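Several of the zero-shot systems above rely on prompt ensembling: each action label is inserted into multiple text templates, each filled template is embedded, and the embeddings are averaged into one class prototype. A sketch with a toy deterministic encoder standing in for the real text encoder (the templates and `embed` function are illustrative):

```python
TEMPLATES = ["a photo of a person {}.", "a video of someone {}."]

def embed(text):
    # Toy deterministic embedding standing in for a real text encoder.
    return [len(text) % 7, text.count(" ")]

def class_embedding(label):
    # Prompt ensembling: embed each filled template and average them,
    # the usual CLIP-style recipe for zero-shot action labels.
    vecs = [embed(t.format(label)) for t in TEMPLATES]
    return [sum(c) / len(vecs) for c in zip(*vecs)]

emb = class_embedding("playing guitar")
```

At inference, a clip embedding is compared (e.g., by cosine similarity) against one such prototype per action class.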
7. VLMs in Violence Detection
| Dataset | Clips | Anomaly/Violence Types | Source | Key Characteristics |
|---|---|---|---|---|
| UCF-Crime [126] | 1900 | 13 anomalies (abuse, fighting, shooting, etc.) | Surveillance (indoor/outdoor) | Long, untrimmed videos; video-level training, frame-level testing. |
| XD-Violence [127] | 4754 | 6 classes (abuse, accidents, riots, etc.) | Movies, YouTube, CCTV | Large-scale; includes audio; weakly labeled untrimmed videos. |
| RWF-2000 [128] | 1600 train 400 test | Violent vs. non-violent | YouTube surveillance | Uniform 5 s clips at 30 fps. |
| Movies [129] | 200 | Fights vs. non-fight scenes | Action films, sports | Diverse scenes; manually labeled. |
| Surv. Fight [130] | 300 | Fights (kicks, fists, wrestling) vs. non-fight | YouTube (cafes, streets, buses) | Short clips (2 s) from real-world surveillance. |
| Hockey [129] | 1000 | Fights vs. non-fights | NHL games | Dynamic setting; violent and normal actions look similar. |
Comments on Results
| Method | Datasets | Metrics (Values) | Effect of Captioning/Text Guidance |
|---|---|---|---|
| Holmes-VAD [122] | UCF-Crime, XD-Violence | AUC: 89.51% (UCF); AP: 90.67% (XD); JA: 86.0%; CP: 61.2%; AE: 51.9% | Instruction tuning (VAD-Instruct50k) greatly boosted interpretability: JA (86.0 vs. 65.1), CP (61.2 vs. 11.6), AE (51.9 vs. 15.9). |
| ASK-HINT [120] | UCF-Crime, XD-Violence | AUC (UCF): 89.83%; AUC (XD): 90.31% | Fine-grained prompting (6 prompts) outperformed the full-prompt baseline (67.17%), improving AUC by 22.66 percentage points. |
| VERA [121] | UCF-Crime, XD-Violence | AUC (UCF): 86.55%; AUC (XD): 88.26% | Guiding question prompts raised AUC from 78.81% to 86.55%, highlighting textual reasoning benefits. |
| VIVID [123] | Movies, Surv. Fight, RWF-2000, Hockey, XD-Violence | Acc: Movies: 0.985; Surv. Fight: 0.736; RWF-2000: 0.797; Hockey: 0.954; XD: 0.826 | Violence definition database reduced bias and improved accuracy (up to +0.32). |
| ViCap-AD [125] | UCF-Crime, XD-Violence | AUC (UCF): 87.20%; AUC (XD): 85.02% | CLIP4Clip captions + Multiple-Instance Learning losses increased AUC by 1.72 percentage points (87.20 vs. 85.48) on UCF-Crime. |
| PiercingEye [124] | XD-Violence, UCF-Crime | AP: 88.82% (XD); AUC: 86.64% (UCF) | Hyperbolic vision–language guided loss with ambiguous-event text generation raised AP by 1.21 percentage points (87.61 → 88.82). |
| Multimodal VAD [17] | UCF-Crime, XD-Violence | AUC: 87.96% (UCF); AP: 86.32% (XD) | Adding text in the V+A+T setup improved AP by at least 4 percentage points over video-only models on XD-Violence. |
| LAVAD [119] | UCF-Crime, XD-Violence | AUC: 80.28% (UCF); AUC: 85.36% (XD) | Removing LLM-based anomaly scoring dropped AUC by 7.58 percentage points (80.28 → 72.70) on UCF-Crime. |
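Several of the weakly supervised methods above (e.g., the Multiple-Instance Learning losses used by ViCap-AD) build on a multiple-instance ranking objective: only the top-scoring segment of each video-level bag is compared. A minimal sketch of such a hinge-style loss (the margin and scores below are illustrative):

```python
def mil_ranking_loss(anomaly_scores, normal_scores, margin=1.0):
    # MIL for weakly labeled videos: only the highest-scoring segment of
    # each bag participates, pushing the top segment of an anomalous
    # video above the top segment of a normal video by a margin.
    return max(0.0, margin - max(anomaly_scores) + max(normal_scores))

loss = mil_ranking_loss(anomaly_scores=[0.2, 0.9, 0.4],
                        normal_scores=[0.3, 0.1])
```

This formulation matches the video-level labels of UCF-Crime and XD-Violence, where frame-level annotation exists only for testing.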
| Method | Primary Approach | UCF-Crime (AUC %) | XD-Violence (AP %) |
|---|---|---|---|
| LAVAD [119] | Training-Free (VLM Captions + LLM) | 80.28 | 62.01 |
| VERA [121] | Verbalized Learning (Frozen VLM) | 86.55 | 70.54 |
| PiercingEye [124] | Dual-Space Geometric Learning | 86.64 | 88.82 |
| Multimodal-AD [17] | Dual-Stream Network (Coarse: V+A, Fine: T) | 87.96 | 86.32 |
| Holmes-VAD [122] | Instruction-Tuned (Explainable VAD) | 89.51 | 90.67 |
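The AUC figures in this table can be computed without an explicit ROC sweep, as the probability that a random anomalous frame outranks a random normal one (the Mann–Whitney formulation). A minimal sketch with toy scores:

```python
def auc(scores, labels):
    # ROC AUC as the probability that a randomly chosen positive scores
    # higher than a randomly chosen negative (ties count as 0.5).
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

a = auc(scores=[0.9, 0.8, 0.4, 0.2], labels=[1, 0, 1, 0])
```

AP, the other metric reported here, instead averages precision over recall levels and is more sensitive to the sparse-positive regime of XD-Violence.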
8. VLMs in Contextual Emotions Recognition
VLMs in Face Expression Recognition
| Model | Size | Data | Observations |
|---|---|---|---|
| MELD [162] | 13,000 utterances across 1433 dialogues from 1000+ speakers. | 7 categ. emotions + sentiment polarity (pos.–neut.–neg.). | Multimodal dataset (audio, visual, text) derived from the TV series Friends, designed for multi-party conversational emotion recognition at the utterance level. |
| VCR [163] | 110,000 movie scenes forming ∼290,000 question–answer pairs. | 4 multiple-choice answers + 4 rationales per question. | Multimodal (image + text) dataset targeting visual common-sense reasoning, requiring understanding of context, intentions, and affect. |
| MER2023 [164] | ∼3373 labeled samples and >73,000 unlabeled samples for semi-supervised learning. | 6 discrete emotions (neutral, anger, happiness, sadness, worry, surprise). | Multimodal dataset (video, audio, text) designed for robust emotion recognition under noisy and semi-supervised conditions. Includes multiple tracks: MER-MULTI, MER-NOISE, MER-SEMI. |
| MAFW [165] | 10,045 video–audio clips with textual affective captions. | 11 basic and 32 compound emotions. | Multimodal dataset combining video, audio, and text for emotion recognition in complex natural contexts. Supports hierarchical affective labeling. |
| Solution | VLM | Benchmark and Performance | Improvement |
|---|---|---|---|
| Xenos et al. (2024) [16] | LLaVA (two-stage; caption + classify) | = 38.52% = 26.66% = 93.08% | Fuse visual features + the generated text via a Transformer architecture (vision encoder → learnable queries → Q-Former cross + self attention, then classifier) |
| Etesam et al. (2024) [145] | LLaVA, GPT-4V | = 54.27% = 78.40% = 36.83% | Introduces narrative captioning (NarraCap): generates descriptive captions (who, action, social/physical signals, environment) then feeds to LLM for emotion inference |
| Lei et al. (2024) [148] | VILA-8B (LVLM few-shot + CoT) | = 57.8% = 46.9% = 29.8% = 48.3% = 33.6% = 35.8% | Uses in-context learning (ICL) and Chain-of-Thought (CoT) prompts generated by GPT-4V for contextual reasoning |
| Bhattacharyya & Wang (2025) [149] | GPT-4o, CLIP variants | = 63.5% = 27.0% = 45.0% = 44.9% | Evaluates off-the-shelf VLMs using reasoning-based prompts for improved emotion prediction accuracy |
| Xie et al. (2024) [150] | EmoVIT (InstructBLIP + visual instruction tuning) | = 57.6% = 32.3% = 44.9% = 21.1% | Uses GPT-4-generated visual instruction data including captions, categorical, and reasoning forms to fine-tune emotion understanding via instruction following |
| Zhou et al. (2024) [146] | ViCor (LLM + VLM collaboration) | = 59.8% = 70.9% | Bridges vision understanding and common-sense reasoning: captions + visual clues are iteratively exchanged between the VLM and LLM to refine emotion and common-sense inference |
| Lian et al. (2024) [147] | AffectGPT (Multimodal LLM + MER-Caption dataset) | = 64.56% | Introduces descriptive emotion captions (MER-Caption), enabling free-form emotion reasoning across modalities (vision, audio, text) |
| Solution | VLM | Benchmark and Performance | Improvement |
|---|---|---|---|
| Zhao & Patras (2023) [159] | DFER-CLIP (CLIP + LLM text descriptions) | = 59.6/71.3%; = 41.3/51.6%; = 38.9/52.6%; | Uses ChatGPT-generated textual behavior descriptions instead of class labels → improved temporal FER generalization |
| Foteinopoulou & Patras (2024) [166] | EmoCLIP (CLIP + temporal transformer) | = 58.1/62.1%; = 31.3/43.5%; = 34.3/44.2%; = 44.3/46.2%; | Uses sample-level captions of expressions and LLM-generated class descriptions for zero-shot video FER |
| Li et al. (2024) [15] | CLIPER (CLIP + METD) | = 41.2/51.4%; = 91.8% | Learns multiple text prompts per emotion for richer semantics; better alignment without explicit captions |
| Chen et al. (2024) [160] | FineCLIPER (CLIP + AdaptERs + Video-LLaVA) | = 66.0/76.2%; = 45.2/54.0%; = 45.0/56.9%; | Fine-grained dynamic captions from Video-LLaVA improve temporal FER; surpasses CLIP baseline across datasets |
| Huang et al. (2025) [161] | Emotion-Qwen (Hybrid Expert Mixture) | = 77.1/62.2%; | Integrates VER dataset for instruction fine-tuning with contextual and causal captions to enhance emotion reasoning |
| Seoh et al. [151] | Qwen2.5-VL; Aya Vision; InternVL2.5 8B-MPO | = 52.9; = 52.2; = 48.5 | EmoGist analyzes clusters of example images to pre-generate multiple context-specific descriptions via the VLM for emotion labels |
| Zhao et al. [153] | CLIP-L | = 58.7/65.4; = 38.4/38.4; = 40.2/47.1; = 38.7/40.4 | Exp-CLIP introduces a simple, learnable projection head that aligns the general vision–language feature space of a pre-trained model with a task-aware semantic space derived from an LLM encoder |
| Lan et al. [155] | ViT-L/14 (DINOv2; LLAVA 1.5) | = 62.9; = 91.0; = 68.1 | EXPLLM leverages an AU model and GPT-4o to construct instruction–description data pairs that articulate the relationship between facial movements and emotion |
| Hu et al. [154] | LLaVA-LCS-558K (with LoRA) | = 42.9; = 69.9; = 68.1 | FEALLM aligned descriptions of facial expressions (FE) and action units (AU), along with causal reasoning instructions to explain how AUs lead to FEs |
| Sălăgean et al. [156] | GPT-3.5 | = 96.0; = 93.0; = 94 | It integrates expert-defined rules and mechanisms for amplifying weak signals with the reasoning capabilities of a large language model |
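WAR and UAR, the two recall metrics listed in the Abbreviations section and reported throughout the FER literature, are plain accuracy and the macro average of per-class recall, respectively; UAR weighs rare emotion classes equally. A minimal sketch with toy predictions:

```python
def war_uar(preds, labels):
    # WAR (weighted average recall) is overall accuracy; UAR (unweighted
    # average recall) averages recall over classes, so a class with one
    # sample counts as much as a class with a thousand.
    war = sum(p == l for p, l in zip(preds, labels)) / len(labels)
    recalls = []
    for c in set(labels):
        idx = [i for i, l in enumerate(labels) if l == c]
        recalls.append(sum(preds[i] == c for i in idx) / len(idx))
    return war, sum(recalls) / len(recalls)

war, uar = war_uar(preds=["happy", "happy", "happy", "happy"],
                   labels=["happy", "happy", "happy", "sad"])
```

Here WAR is 0.75 but UAR is only 0.5, showing how a majority-class predictor can look strong on WAR while failing rare classes.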
9. Discussion
9.1. Action Recognition
9.2. Violence Detection
9.3. Emotion Recognition
9.4. Issues and Mitigation
10. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
| CLIP | Contrastive Language–Image Pre-Training |
| FER | Face Expression Recognition |
| FS | Few Shot |
| FT | Fine-Tuned |
| HAR | Human Action Recognition |
| LM | Large Models |
| LLM | Large Language Model |
| ML | Machine Learning |
| UAR | Unweighted Average Recall |
| VAD | Video Anomaly Detection |
| VVD | Video Violence Detection |
| VLM | Vision–Language Model |
| VQA | Visual Question Answering |
| WAR | Weighted Average Recall |
| ZS | Zero Shot |
References
- Pareek, P.; Thakkar, A. A survey on video-based human action recognition: Recent updates, datasets, challenges, and applications. Artif. Intell. Rev. 2021, 54, 2259–2322. [Google Scholar] [CrossRef]
- Ahmed, N.; Al Aghbari, Z.; Girija, S. A systematic survey on multimodal emotion recognition using learning algorithms. Intell. Syst. Appl. 2023, 17, 200171. [Google Scholar] [CrossRef]
- Liu, D.; Bao, Z.; Mi, J.; Gan, Y.; Ye, M.; Zhang, J. Cross-domain video action recognition via adaptive gradual learning. Neurocomputing 2023, 556, 126622. [Google Scholar] [CrossRef]
- Chen, T.; Pu, T.; Wu, H.; Xie, Y.; Liu, L.; Lin, L. Cross-domain facial expression recognition: A unified evaluation benchmark and adversarial graph learning. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 9887–9903. [Google Scholar] [CrossRef] [PubMed]
- Xu, H.; Zhi, S.; Sun, S.; Patel, V.; Liu, L. Deep learning for cross-domain few-shot visual recognition: A survey. ACM Comput. Surv. 2025, 57, 1–37. [Google Scholar] [CrossRef]
- Gu, J.; Han, Z.; Chen, S.; Beirami, A.; He, B.; Zhang, G.; Liao, R.; Qin, Y.; Tresp, V.; Torr, P. A systematic survey of prompt engineering on vision-language foundation models. arXiv 2023, arXiv:2307.12980. [Google Scholar] [CrossRef]
- Sharma, H.; Padha, D. A comprehensive survey on image captioning: From handcrafted to deep learning-based techniques, a taxonomy and open research issues. Artif. Intell. Rev. 2023, 56, 13619–13661. [Google Scholar] [CrossRef]
- Lu, J.; Batra, D.; Parikh, D.; Lee, S. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv. Neural Inf. Process. Syst. 2019, 32. [Google Scholar]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
- Zhang, J.; Huang, J.; Jin, S.; Lu, S. Vision-language models for vision tasks: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 5625–5644. [Google Scholar] [CrossRef]
- Zhou, X.; Liu, M.; Yurtsever, E.; Zagar, B.L.; Zimmer, W.; Cao, H.; Knoll, A.C. Vision Language Models in Autonomous Driving: A Survey and Outlook. arXiv 2024, arXiv:2310.14414. [Google Scholar] [CrossRef]
- Huang, Z.; Yan, H.; Zhan, Q.; Yang, S.; Zhang, M.; Zhang, C.; Lei, Y.; Liu, Z.; Liu, Q.; Wang, Y. A Survey on Remote Sensing Foundation Models: From Vision to Multimodality. arXiv 2025, arXiv:2503.22081. [Google Scholar] [CrossRef]
- Han, X.; Chen, S.; Fu, Z.; Feng, Z.; Fan, L.; An, D.; Wang, C.; Guo, L.; Meng, W.; Zhang, X.; et al. Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision. arXiv 2025, arXiv:2504.02477. [Google Scholar] [CrossRef]
- Wang, M.; Xing, J.; Mei, J.; Liu, Y.; Jiang, Y. ActionCLIP: Adapting Language-Image Pretrained Models for Video Action Recognition. IEEE Trans. Neural Netw. Learn. Syst. 2025, 36, 625–637. [Google Scholar] [CrossRef] [PubMed]
- Li, H.; Niu, H.; Zhu, Z.; Zhao, F. CLIPER: A Unified Vision-Language Framework for In-the-Wild Facial Expression Recognition. In Proceedings of the 2024 IEEE International Conference on Multimedia and Expo (ICME), Niagara Falls, ON, Canada, 15–19 July 2024; pp. 1–6. [Google Scholar]
- Xenos, A.; Foteinopoulou, N.M.; Ntinou, I.; Patras, I.; Tzimiropoulos, G. VLLMs Provide Better Context for Emotion Understanding Through Common Sense Reasoning. arXiv 2024, arXiv:2404.07078. [Google Scholar] [CrossRef]
- Wang, D.; Wang, Q.; Hu, Q.; Wu, K. Multimodal VAD: Visual Anomaly Detection in Intelligent Monitoring System via Audio-Vision-Language. IEEE Trans. Instrum. Meas. 2025, 74, 4012212. [Google Scholar] [CrossRef]
- Gallegos, I.O.; Rossi, R.A.; Barrow, J.; Tanjim, M.M.; Kim, S.; Dernoncourt, F.; Yu, T.; Zhang, R.; Ahmed, N.K. Bias and fairness in large language models: A survey. Comput. Linguist. 2024, 50, 1097–1179. [Google Scholar] [CrossRef]
- Zeng, B.; Yin, Y.; Liu, Z. Understanding bias in large-scale visual datasets. Adv. Neural Inf. Process. Syst. 2024, 37, 61839–61871. [Google Scholar]
- Hamidieh, K.; Zhang, H.; Gerych, W.; Hartvigsen, T.; Ghassemi, M. Identifying implicit social biases in vision-language models. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, San Jose, CA, USA, 21–23 October 2024; Volume 7, pp. 547–561. [Google Scholar]
- Chang, Y.; Wang, X.; Wang, J.; Wu, Y.; Yang, L.; Zhu, K.; Chen, H.; Yi, X.; Wang, C.; Wang, Y.; et al. A survey on evaluation of large language models. ACM Trans. Intell. Syst. Technol. 2024, 15, 1–45. [Google Scholar] [CrossRef]
- Pourpanah, F.; Abdar, M.; Luo, Y.; Zhou, X.; Wang, R.; Lim, C.P.; Wang, X.Z.; Wu, Q.J. A review of generalized zero-shot learning methods. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 4051–4070. [Google Scholar] [CrossRef]
- Navigli, R.; Conia, S.; Ross, B. Biases in large language models: Origins, inventory, and discussion. ACM J. Data Inf. Qual. 2023, 15, 1–21. [Google Scholar] [CrossRef]
- Sun, S.; Liu, L.; Liu, Y.; Liu, Z.; Zhang, S.; Heikkilä, J.; Li, X. Uncovering bias in foundation models: Impact, testing, harm, and mitigation. arXiv 2025, arXiv:2501.10453. [Google Scholar]
- Hort, M.; Chen, Z.; Zhang, J.M.; Harman, M.; Sarro, F. Bias mitigation for machine learning classifiers: A comprehensive survey. ACM J. Responsible Comput. 2024, 1, 1–52. [Google Scholar] [CrossRef]
- Pagano, T.P.; Loureiro, R.B.; Lisboa, F.V.; Peixoto, R.M.; Guimarães, G.A.; Cruz, G.O.; Araujo, M.M.; Santos, L.L.; Cruz, M.A.; Oliveira, E.L.; et al. Bias and unfairness in machine learning models: A systematic review on datasets, tools, fairness metrics, and identification and mitigation methods. Big Data Cogn. Comput. 2023, 7, 15. [Google Scholar] [CrossRef]
- Kong, Y.; Fu, Y. Human action recognition and prediction: A survey. Int. J. Comput. Vis. 2022, 130, 1366–1401. [Google Scholar] [CrossRef]
- Mumtaz, N.; Ejaz, N.; Habib, S.; Mohsin, S.M.; Tiwari, P.; Band, S.S.; Kumar, N. An overview of violence detection techniques: Current challenges and future directions. Artif. Intell. Rev. 2023, 56, 4641–4666. [Google Scholar] [CrossRef]
- Kalateh, S.; Estrada-Jimenez, L.A.; Nikghadam-Hojjati, S.; Barata, J. A systematic review on multimodal emotion recognition: Building blocks, current state, applications, and challenges. IEEE Access 2024, 12, 103976–104019. [Google Scholar] [CrossRef]
- Kopalidis, T.; Solachidis, V.; Vretos, N.; Daras, P. Advances in facial expression recognition: A survey of methods, benchmarks, models, and datasets. Information 2024, 15, 135. [Google Scholar] [CrossRef]
- Abbas, R.; Ni, B.; Ma, R.; Li, T.; Lu, Y.; Li, X. Context-based emotion recognition: A survey. Neurocomputing 2025, 618, 129073. [Google Scholar] [CrossRef]
- Hu, X.; Fan, Z.; Jiang, L.; Xu, J.; Li, G.; Chen, W.; Zeng, X.; Yang, G.; Zhang, D. TOP-ALCM: A novel video analysis method for violence detection in crowded scenes. Inf. Sci. 2022, 606, 313–327. [Google Scholar] [CrossRef]
- Wang, J.; Wang, C.; Guo, L.; Zhao, S.; Wang, D.; Zhang, S.; Zhao, X.; Yu, J.; Wang, Y.; Yang, Y.; et al. MDKAT: Multimodal Decoupling with Knowledge Aggregation and Transfer for Video Emotion Recognition. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 9809–9822. [Google Scholar] [CrossRef]
- Zheng, H.; Shen, L.; Tang, A.; Luo, Y.; Hu, H.; Du, B.; Wen, Y.; Tao, D. Learning from models beyond fine-tuning. Nat. Mach. Intell. 2025, 7, 6–17. [Google Scholar] [CrossRef]
- Zhou, J.; Chen, Y.; Hong, Z.; Chen, W.; Yu, Y.; Zhang, T.; Wang, H.; Zhang, C.; Zheng, Z. Training and serving system of foundation models: A comprehensive survey. IEEE Open J. Comput. Soc. 2024, 5, 107–119. [Google Scholar] [CrossRef]
- Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving language understanding by generative pre-training. OpenAI Technical Report, 2018. [Google Scholar]
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Volume 1 (Long and Short Papers), pp. 4171–4186. [Google Scholar]
- Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
- Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzmán, F.; Grave, E.; Ott, M.; Zettlemoyer, L.; Stoyanov, V. Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; p. 8440. [Google Scholar]
- Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 7871–7880. [Google Scholar]
- Chen, Y.C.; Li, L.; Yu, L.; El Kholy, A.; Ahmed, F.; Gan, Z.; Cheng, Y.; Liu, J. Uniter: Universal image-text representation learning. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 104–120. [Google Scholar]
- Alayrac, J.B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. Flamingo: A visual language model for few-shot learning. In Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; pp. 23716–23736. [Google Scholar]
- Li, J.; Li, D.; Savarese, S.; Hoi, S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 19730–19742. [Google Scholar]
- Zhai, X.; Wang, X.; Mustafa, B.; Steiner, A.; Keysers, D.; Kolesnikov, A.; Beyer, L. Lit: Zero-shot transfer with locked-image text tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 18123–18133. [Google Scholar]
- Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual instruction tuning. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; pp. 34892–34916. [Google Scholar]
- Bordes, F.; Pang, R.Y.; Ajay, A.; Li, A.C.; Bardes, A.; Petryk, S.; Mañas, O.; Lin, Z.; Mahmoud, A.; Jayaraman, B.; et al. An introduction to vision-language modeling. arXiv 2024, arXiv:2405.17247. [Google Scholar] [CrossRef]
- LeCun, Y.; Chopra, S.; Hadsell, R.; Ranzato, M.; Huang, F. A tutorial on energy-based learning. Predict. Struct. Data 2006, 1. [Google Scholar]
- Gutmann, M.; Hyvärinen, A. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, JMLR Workshop and Conference Proceedings, Sardinia, Italy, 13–15 May 2010; pp. 297–304. [Google Scholar]
- Li, J.; Li, D.; Xiong, C.; Hoi, S. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the International Conference on Machine Learning, PMLR, Baltimore, MD, USA, 17–23 July 2022; pp. 12888–12900. [Google Scholar]
- Vincent, P.; Larochelle, H.; Bengio, Y.; Manzagol, P.A. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, ICML, Helsinki, Finland, 5–9 July 2008; pp. 1096–1103. [Google Scholar]
- Pathak, D.; Krahenbuhl, P.; Donahue, J.; Darrell, T.; Efros, A.A. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 2536–2544. [Google Scholar]
- Dubois, Y.; Bloem-Reddy, B.; Ullrich, K.; Maddison, C.J. Lossy compression for lossless prediction. Adv. Neural Inf. Process. Syst. 2021, 34, 14014–14028. [Google Scholar]
- Shwartz Ziv, R.; LeCun, Y. To compress or not to compress—Self-supervised learning and information theory: A review. Entropy 2024, 26, 252. [Google Scholar] [CrossRef]
- Peng, Z.; Wang, W.; Dong, L.; Hao, Y.; Huang, S.; Ma, S.; Ye, Q.; Wei, F. Kosmos-2: Grounding multimodal large language models to the world. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
- Gadre, S.Y.; Ilharco, G.; Fang, A.; Hayase, J.; Smyrnis, G.; Nguyen, T.; Marten, R.; Wortsman, M.; Ghosh, D.; Zhang, J.; et al. Datacomp: In search of the next generation of multimodal datasets. Adv. Neural Inf. Process. Syst. 2023, 36, 27092–27112. [Google Scholar]
- Henighan, T.; Kaplan, J.; Katz, M.; Chen, M.; Hesse, C.; Jackson, J.; Jun, H.; Brown, T.B.; Dhariwal, P.; Gray, S.; et al. Scaling laws for autoregressive generative modeling. arXiv 2020, arXiv:2010.14701. [Google Scholar] [CrossRef]
- Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. Lora: Low-rank adaptation of large language models. In Proceedings of the Tenth International Conference on Learning Representations (ICLR), Virtual, 25–29 April 2022. [Google Scholar]
- Zhou, K.; Yang, J.; Loy, C.C.; Liu, Z. Learning to prompt for vision-language models. Int. J. Comput. Vis. 2022, 130, 2337–2348. [Google Scholar] [CrossRef]
- Thrush, T.; Jiang, R.; Bartolo, M.; Singh, A.; Williams, A.; Kiela, D.; Ross, C. Winoground: Probing vision and language models for visio-linguistic compositionality. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5238–5248. [Google Scholar]
- Hellström, T.; Dignum, V.; Bensch, S. Bias in machine learning - what is it good for? In Proceedings of the International Workshop on New Foundations for Human-Centered AI (NeHuAI) Co-Located with the 24th European Conference on Artificial Intelligence (ECAI 2020), Santiago de Compostela, Spain, 29 August–8 September 2020; pp. 3–10. [Google Scholar]
- Torralba, A.; Efros, A.A. Unbiased look at dataset bias. In Proceedings of the CVPR 2011, Colorado Springs, CO, USA, 21–23 June 2011; pp. 1521–1528. [Google Scholar]
- Liu, Z.; He, K. A Decade’s Battle on Dataset Bias: Are We There Yet? In Proceedings of the Thirteenth International Conference on Learning Representations, Singapore, 24–28 April 2025. [Google Scholar]
- You, Z.; Zhang, X.; Guo, H.; Wang, J.; Li, C. Are Images Indistinguishable to Humans Also Indistinguishable to Classifiers? In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 28790–28800. [Google Scholar]
- Mitchell, T.M. The Need for Biases in Learning Generalizations; Technical Report No. CBM-TR-117; Department of Computer Science, Rutgers University: New Brunswick, NJ, USA, 1980. [Google Scholar]
- Zhao, J.; Wang, T.; Yatskar, M.; Ordonez, V.; Chang, K.W. Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, 9–11 September 2017; pp. 2979–2989. [Google Scholar]
- Molahasani, M.; Motamedi, A.; Greenspan, M.; Kim, I.M.; Etemad, A. Prism: Reducing spurious implicit biases in vision-language models with llm-guided embedding projection. arXiv 2025, arXiv:2507.08979. [Google Scholar]
- Zafar, M.B.; Valera, I.; Gomez Rodriguez, M.; Gummadi, K.P. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In Proceedings of the 26th International Conference on World Wide Web, Perth, Australia, 3–7 April 2017; pp. 1171–1180. [Google Scholar]
- Kleinberg, J.; Mullainathan, S.; Raghavan, M. Inherent Trade-Offs in the Fair Determination of Risk Scores. In Proceedings of the 8th Innovations in Theoretical Computer Science Conference (ITCS 2017), Berkeley, CA, USA, 9–11 January 2017; Schloss Dagstuhl–Leibniz-Zentrum für Informatik. [Google Scholar]
- Zhao, H.; Gordon, G.J. Inherent tradeoffs in learning fair representations. J. Mach. Learn. Res. 2022, 23, 1–26. [Google Scholar]
- Hardt, M.; Price, E.; Srebro, N. Equality of opportunity in supervised learning. Adv. Neural Inf. Process. Syst. 2016, 29. [Google Scholar]
- Woodworth, B.; Gunasekar, S.; Ohannessian, M.I.; Srebro, N. Learning non-discriminatory predictors. In Proceedings of the Conference on Learning Theory, PMLR, Amsterdam, The Netherlands, 7–10 July 2017; pp. 1920–1953. [Google Scholar]
- Lee, H.; Chen, S. Systematic bias of machine learning regression models and correction. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 4974–4983. [Google Scholar] [CrossRef] [PubMed]
- Wang, A.; Liu, A.; Zhang, R.; Kleiman, A.; Kim, L.; Zhao, D.; Shirai, I.; Narayanan, A.; Russakovsky, O. Revise: A tool for measuring and mitigating bias in visual datasets. Int. J. Comput. Vis. 2022, 130, 1790–1810. [Google Scholar] [CrossRef]
- Wang, S.; Cao, X.; Zhang, J.; Yuan, Z.; Shan, S.; Chen, X.; Gao, W. Vlbiasbench: A comprehensive benchmark for evaluating bias in large vision-language model. arXiv 2024, arXiv:2406.14194. [Google Scholar]
- Tian, H.; Liu, B.; Zhu, T.; Zhou, W.; Yu, P.S. Multifair: Model fairness with multiple sensitive attributes. IEEE Trans. Neural Netw. Learn. Syst. 2024, 36, 5654–5667. [Google Scholar] [CrossRef]
- Zhu, H.; Liang, S.; Wang, W.; Li, B.; Yuan, T.; Li, F.; Wang, S.; Zhang, Z. Revisiting Data Auditing in Large Vision-Language Models. arXiv 2025, arXiv:2504.18349. [Google Scholar] [CrossRef]
- Le, H.; Chung, N.; Kieu, T.; Nguyen, A.; Le, N. BiMa: Towards Biases Mitigation for Text-Video Retrieval via Scene Element Guidance. arXiv 2025, arXiv:2506.03589. [Google Scholar]
- Delaney, E.; Fu, Z.; Wachter, S.; Mittelstadt, B.; Russell, C. Oxonfair: A flexible toolkit for algorithmic fairness. Adv. Neural Inf. Process. Syst. 2024, 37, 94209–94245. [Google Scholar]
- Wang, Z.; Qinami, K.; Karakozis, I.C.; Genova, K.; Nair, P.; Hata, K.; Russakovsky, O. Towards fairness in visual recognition: Effective strategies for bias mitigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 8919–8928. [Google Scholar]
- Ma, Y.; Jiao, L.; Liu, F.; Li, L.; Ma, W.; Yang, S.; Liu, X.; Chen, P. Unveiling and mitigating generalized biases of dnns through the intrinsic dimensions of perceptual manifolds. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 47, 2237–2244. [Google Scholar] [CrossRef] [PubMed]
- Chang, Y.; Chang, Y.; Wu, Y. BA-LoRA: Bias-Alleviating Low-Rank Adaptation to Mitigate Catastrophic Inheritance in Large Language Models. arXiv 2024, arXiv:2408.04556. [Google Scholar]
- Zhou, Y.; Tang, J.; Yang, S.; Xiao, X.; Dai, Y.; Yang, W.; Gou, C.; Xia, X.; Chua, T.S. Logic Unseen: Revealing the Logical Blindspots of Vision-Language Models. arXiv 2025, arXiv:2508.11317. [Google Scholar] [CrossRef]
- Zhang, Z.; Feng, M.; Li, Z.; Xu, C. Discover and mitigate multiple biased subgroups in image classifiers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 10906–10915. [Google Scholar]
- Parashar, S.; Lin, Z.; Liu, T.; Dong, X.; Li, Y.; Ramanan, D.; Caverlee, J.; Kong, S. The neglected tails in vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 12988–12997. [Google Scholar]
- Birhane, A.; Prabhu, V.U.; Kahembwe, E. Multimodal datasets: Misogyny, pornography, and malignant stereotypes. arXiv 2021, arXiv:2110.01963. [Google Scholar] [CrossRef]
- Zhao, D.; Wang, A.; Russakovsky, O. Understanding and evaluating racial biases in image captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 14830–14840. [Google Scholar]
- Sabir, A.; Padró, L. Women Wearing Lipstick: Measuring the Bias Between an Object and Its Related Gender. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023. [Google Scholar]
- Abdelrahman, E.; Sun, P.; Li, L.E.; Elhoseiny, M. Imagecaptioner2: Image captioner for image captioning bias amplification assessment. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 20902–20911. [Google Scholar]
- Fraser, K.C.; Kiritchenko, S. Examining Gender and Racial Bias in Large Vision–Language Models Using a Novel Dataset of Parallel Images. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), St. Julian’s, Malta, 17–22 March 2024; pp. 690–713. [Google Scholar]
- Konavoor, A.; Dandekar, R.A.; Dandekar, R.; Panat, S. Vision-Language Models display a strong gender bias. arXiv 2025, arXiv:2508.11262. [Google Scholar] [CrossRef]
- Yang, F.; Ghosh, S.; Barut, E.; Qin, K.; Wanigasekara, P.; Su, C.; Ruan, W.; Gupta, R. Masking latent gender knowledge for debiasing image captioning. In Proceedings of the 4th Workshop on Trustworthy Natural Language Processing (TrustNLP 2024), Mexico City, Mexico, 21 June 2024; pp. 227–238. [Google Scholar]
- Hirota, Y.; Nakashima, Y.; Garcia, N. Model-agnostic gender debiased image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 15191–15200. [Google Scholar]
- Buijsman, S. Navigating fairness measures and trade-offs. AI Ethics 2024, 4, 1323–1334. [Google Scholar] [CrossRef]
- Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.; Vijayanarasimhan, S.; Viola, F.; Green, T.; Back, T.; Natsev, P.; et al. The Kinetics Human Action Video Dataset. arXiv 2017, arXiv:1705.06950. [Google Scholar] [CrossRef]
- Grauman, K.; Westbury, A.; Byrne, E.; Chavis, Z.; Furnari, A.; Girdhar, R.; Hamburger, J.; Jiang, H.; Liu, M.; Liu, X.; et al. Ego4d: Around the world in 3000 h of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 18995–19012. [Google Scholar]
- Goyal, R.; Ebrahimi Kahou, S.; Michalski, V.; Materzynska, J.; Westphal, S.; Kim, H.; Haenel, V.; Fruend, I.; Yianilos, P.; Mueller-Freitag, M.; et al. The “something something” video database for learning and evaluating visual common sense. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5842–5850. [Google Scholar]
- Zhou, L.; Xu, C.; Corso, J. Towards automatic learning of procedures from web instructional videos. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
- Xu, J.; Mei, T.; Yao, T.; Rui, Y. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
- Tang, Y.; Ding, D.; Rao, Y.; Zheng, Y.; Zhang, D.; Zhao, L.; Lu, J.; Zhou, J. Coin: A large-scale dataset for comprehensive instructional video analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1207–1216. [Google Scholar]
- Luo, H.; Ji, L.; Shi, B.; Huang, H.; Duan, N.; Li, T.; Li, J.; Bharti, T.; Zhou, M. UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation. arXiv 2020, arXiv:2002.06353. [Google Scholar]
- Sun, C.; Myers, A.; Vondrick, C.; Murphy, K.; Schmid, C. Videobert: A joint model for video and language representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7464–7473. [Google Scholar]
- Zhu, L.; Yang, Y. Actbert: Learning global-local video-text representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 8746–8755. [Google Scholar]
- Miech, A.; Zhukov, D.; Alayrac, J.B.; Tapaswi, M.; Laptev, I.; Sivic, J. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2630–2640. [Google Scholar]
- Zhukov, D.; Alayrac, J.B.; Cinbis, R.G.; Fouhey, D.; Laptev, I.; Sivic, J. Cross-task weakly supervised learning from instructional videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3537–3545. [Google Scholar]
- Xu, H.; Ghosh, G.; Huang, P.Y.; Okhonko, D.; Aghajanyan, A.; Metze, F.; Zettlemoyer, L.; Feichtenhofer, C. VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 6787–6800. [Google Scholar]
- Bain, M.; Nagrani, A.; Varol, G.; Zisserman, A. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 1728–1738. [Google Scholar]
- Ma, Y.; Xu, G.; Sun, X.; Yan, M.; Zhang, J.; Ji, R. X-clip: End-to-end multi-grained contrastive learning for video-text retrieval. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 638–647. [Google Scholar]
- Bolya, D.; Huang, P.Y.; Sun, P.; Cho, J.H.; Madotto, A.; Wei, C.; Ma, T.; Zhi, J.; Rajasegaran, J.; Rasheed, H.; et al. Perception encoder: The best visual embeddings are not at the output of the network. arXiv 2025, arXiv:2504.13181. [Google Scholar] [CrossRef]
- Hoffmann, J.; Borgeaud, S.; Mensch, A.; Buchatskaya, E.; Cai, T.; Rutherford, E.; de Las Casas, D.; Hendricks, L.A.; Welbl, J.; Clark, A.; et al. Training compute-optimal large language models. In Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; pp. 30016–30030. [Google Scholar]
- Pramanick, S.; Song, Y.; Nag, S.; Lin, K.Q.; Shah, H.; Shou, M.Z.; Chellappa, R.; Zhang, P. Egovlpv2: Egocentric video-language pre-training with fusion in the backbone. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 5285–5297. [Google Scholar]
- Chowdhery, A.; Narang, S.; Devlin, J.; Bosma, M.; Mishra, G.; Roberts, A.; Barham, P.; Chung, H.W.; Sutton, C.; Gehrmann, S.; et al. Palm: Scaling language modeling with pathways. J. Mach. Learn. Res. 2023, 24, 1–113. [Google Scholar]
- Damen, D.; Doughty, H.; Farinella, G.M.; Fidler, S.; Furnari, A.; Kazakos, E.; Moltisanti, D.; Munro, J.; Perrett, T.; Price, W.; et al. The epic-kitchens dataset: Collection, challenges and baselines. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 4125–4141. [Google Scholar] [CrossRef]
- Chen, Z.; Wu, J.; Wang, W.; Su, W.; Chen, G.; Xing, S.; Zhong, M.; Zhang, Q.; Zhu, X.; Lu, L.; et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 24185–24198. [Google Scholar]
- Xu, H.; Ye, Q.; Yan, M.; Shi, Y.; Ye, J.; Xu, Y.; Li, C.; Bi, B.; Qian, Q.; Wang, W.; et al. mplug-2: A modularized multi-modal foundation model across text, image and video. In Proceedings of the International Conference on Machine Learning, ICML, Honolulu, HI, USA, 23–29 July 2023; pp. 38728–38748. [Google Scholar]
- Yuan, L.; Wang, J.; Sun, H.; Zhang, Y.; Lin, Y. Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding. arXiv 2025, arXiv:2501.07888. [Google Scholar] [CrossRef]
- Tong, Z.; Song, Y.; Wang, J.; Wang, L. VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training. arXiv 2022, arXiv:2203.12602. [Google Scholar]
- Soomro, K.; Zamir, A.R.; Shah, M. UCF101: A Dataset of 101 Human Actions Classes From Videos in the Wild. arXiv 2012, arXiv:1212.0402. [Google Scholar] [CrossRef]
- Zanella, L.; Menapace, W.; Mancini, M.; Wang, Y.; Ricci, E. Harnessing large language models for training-free video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 18527–18536. [Google Scholar]
- Zou, S.; Tian, X.; Wesemann, L.; Waschkowski, F.; Yang, Z.; Zhang, J. Unlocking Vision-Language Models for Video Anomaly Detection via Fine-Grained Prompting. arXiv 2025, arXiv:2510.02155. [Google Scholar]
- Ye, M.; Liu, W.; He, P. Vera: Explainable video anomaly detection via verbalized learning of vision-language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 8679–8688. [Google Scholar]
- Zhang, H.; Xu, X.; Wang, X.; Zuo, J.; Han, C.; Huang, X.; Gao, C.; Wang, Y.; Sang, N. Holmes-VAD: Towards unbiased and explainable video anomaly detection via multi-modal LLM. arXiv 2024, arXiv:2406.12235. [Google Scholar]
- Gonzalez, J.A.A.; Matsukawa, T.; Suzuki, E. Leveraging Vision Language Models for Understanding and Detecting Violence in Videos. In Proceedings of the International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, Porto, Portugal, 26–28 February 2025; Science and Technology Publications: Setúbal, Portugal, 2025; Volume 2, pp. 99–113. [Google Scholar]
- Leng, J.; Wu, Z.; Tan, M.; Mo, M.; Zheng, J.; Li, Q.; Gan, J.; Gao, X. PiercingEye: Dual-Space Video Violence Detection with Hyperbolic Vision-Language Guidance. arXiv 2025, arXiv:2504.18866. [Google Scholar] [CrossRef] [PubMed]
- Lim, J.; Lee, J.; Kim, H.; Park, E. ViCap-AD: Video caption-based weakly supervised video anomaly detection. Mach. Vis. Appl. 2025, 36, 61. [Google Scholar] [CrossRef]
- Sultani, W.; Chen, C.; Shah, M. Real-world anomaly detection in surveillance videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6479–6488. [Google Scholar]
- Wu, P.; Liu, J.; Shi, Y.; Sun, Y.; Shao, F.; Wu, Z.; Yang, Z. Not only look, but also listen: Learning multimodal violence detection under weak supervision. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 322–339. [Google Scholar]
- Cheng, M.; Cai, K.; Li, M. RWF-2000: An open large scale video database for violence detection. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 4183–4190. [Google Scholar]
- Bermejo Nievas, E.; Deniz Suarez, O.; Bueno García, G.; Sukthankar, R. Violence detection in video using computer vision techniques. In Proceedings of the International Conference on Computer Analysis of Images and Patterns, Seville, Spain, 29–31 August 2011; pp. 332–339. [Google Scholar]
- Aktı, Ş.; Tataroğlu, G.A.; Ekenel, H.K. Vision-based fight detection from surveillance cameras. In Proceedings of the 2019 Ninth International Conference on Image Processing Theory, Tools and Applications (IPTA), Istanbul, Turkey, 6–9 November 2019; pp. 1–6. [Google Scholar]
- Jiang, X.; Zong, Y.; Zheng, W.; Tang, C.; Xia, W.; Lu, C.; Liu, J. Dfew: A large-scale database for recognizing dynamic facial expressions in the wild. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 2881–2889. [Google Scholar]
- Pepa, L.; Spalazzi, L.; Capecci, M.; Ceravolo, M.G. Automatic emotion recognition in clinical scenario: A systematic review of methods. IEEE Trans. Affect. Comput. 2021, 14, 1675–1695. [Google Scholar] [CrossRef]
- Yang, D.; Huang, S.; Xu, Z.; Li, Z.; Wang, S.; Li, M.; Wang, Y.; Liu, Y.; Yang, K.; Chen, Z.; et al. Aide: A vision-driven multi-view, multi-modal, multi-tasking dataset for assistive driving perception. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 20459–20470. [Google Scholar]
- Kosti, R.; Alvarez, J.M.; Recasens, A.; Lapedriza, A. Context based emotion recognition using emotic dataset. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 2755–2766. [Google Scholar] [CrossRef]
- Luo, Y.; Ye, J.; Adams, R.B., Jr.; Li, J.; Newman, M.G.; Wang, J.Z. ARBEE: Towards automated recognition of bodily expression of emotion in the wild. Int. J. Comput. Vis. 2020, 128, 1–25. [Google Scholar] [CrossRef] [PubMed]
- Yang, D.; Huang, S.; Wang, S.; Liu, Y.; Zhai, P.; Su, L.; Li, M.; Zhang, L. Emotion recognition for multiple context awareness. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 144–162. [Google Scholar]
- Kosti, R.; Alvarez, J.M.; Recasens, A.; Lapedriza, A. Emotic: Emotions in context dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 61–69. [Google Scholar]
- Lee, J.; Kim, S.; Kim, S.; Park, J.; Sohn, K. Context-aware emotion recognition networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 10143–10152. [Google Scholar]
- Yang, J.; Huang, Q.; Ding, T.; Lischinski, D.; Cohen-Or, D.; Huang, H. Emoset: A large-scale visual emotion dataset with rich attributes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 20383–20394. [Google Scholar]
- Wang, Y.; Sun, Y.; Huang, Y.; Liu, Z.; Gao, S.; Zhang, W.; Ge, W.; Zhang, W. Ferv39k: A large-scale multi-scene dataset for facial expression recognition in videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 20922–20931. [Google Scholar]
- Dhall, A.; Goecke, R.; Lucey, S.; Gedeon, T. Acted Facial Expressions in the Wild Database; Technical Report TR-CS-11; Australian National University: Canberra, Australia, 2011; Volume 2. [Google Scholar]
- Li, S.; Deng, W.; Du, J. Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2852–2861. [Google Scholar]
- Mollahosseini, A.; Hasani, B.; Mahoor, M.H. Affectnet: A database for facial expression, valence, and arousal computing in the wild. IEEE Trans. Affect. Comput. 2017, 10, 18–31. [Google Scholar] [CrossRef]
- Zhang, Z.; Luo, P.; Loy, C.C.; Tang, X. Learning social relation traits from face images. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 3631–3639. [Google Scholar]
- Etesam, Y.; Yalçın, Ö.N.; Zhang, C.; Lim, A. Contextual emotion recognition using large vision language models. In Proceedings of the 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Abu Dhabi, United Arab Emirates, 14–18 October 2024; pp. 4769–4776. [Google Scholar]
- Zhou, K.; Lee, K.; Misu, T.; Wang, X. ViCor: Bridging Visual Understanding and Commonsense Reasoning with Large Language Models. In Proceedings of the Findings of the Association for Computational Linguistics ACL 2024, Bangkok, Thailand, 11–16 August 2024; pp. 10783–10795. [Google Scholar]
- Lian, Z.; Sun, H.; Sun, L.; Yi, J.; Liu, B.; Tao, J. AffectGPT: Dataset and framework for explainable multimodal emotion recognition. arXiv 2024, arXiv:2407.07653. [Google Scholar] [CrossRef]
- Lei, Y.; Yang, D.; Chen, Z.; Chen, J.; Zhai, P.; Zhang, L. Large Vision-Language Models as Emotion Recognizers in Context Awareness. In Proceedings of the Asian Conference on Machine Learning, PMLR, Taipei, Taiwan, 9–12 December 2025; pp. 111–126. [Google Scholar]
- Bhattacharyya, S.; Wang, J.Z. Evaluating Vision-Language Models for Emotion Recognition. In Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2025, Vienna, Austria, 27 July–1 August 2025; pp. 1798–1820. [Google Scholar]
- Xie, H.; Peng, C.J.; Tseng, Y.W.; Chen, H.J.; Hsu, C.F.; Shuai, H.H.; Cheng, W.H. Emovit: Revolutionizing emotion insights with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 26596–26605. [Google Scholar]
- Seoh, R.; Goldwasser, D. EmoGist: Efficient In-Context Learning for Visual Emotion Understanding. arXiv 2025, arXiv:2505.14660. [Google Scholar]
- Wang, Z.; Zhang, Q.; Zhang, P.; Niu, W.; Zhang, K.; Sankaranarayana, R.; Caldwell, S.; Gedeon, T. Visual and textual prompts in vllms for enhancing emotion recognition. IEEE Trans. Circuits Syst. Video Technol. 2025, 19, 14. [Google Scholar] [CrossRef]
- Zhao, Z.; Cao, Y.; Gong, S.; Patras, I. Enhancing zero-shot facial expression recognition by llm knowledge transfer. In Proceedings of the 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Tucson, AZ, USA, 26 February–6 March 2025; pp. 815–824. [Google Scholar]
- Hu, Z.; Yuan, K.; Liu, X.; Yu, Z.; Zong, Y.; Shi, J.; Yue, H.; Yang, J. Feallm: Advancing facial emotion analysis in multimodal large language models with emotional synergy and reasoning. arXiv 2025, arXiv:2505.13419. [Google Scholar] [CrossRef]
- Lan, X.; Xue, J.; Qi, J.; Jiang, D.; Lu, K.; Chua, T.S. Expllm: Towards chain of thought for facial expression recognition. IEEE Trans. Multimed. 2025, 27, 3069–3081. [Google Scholar] [CrossRef]
- Sălăgean, G.L.; Leba, M.; Ionica, A.C. Seeing the Unseen: Real-Time Micro-Expression Recognition with Action Units and GPT-Based Reasoning. Appl. Sci. 2025, 15, 6417. [Google Scholar] [CrossRef]
- Sun, L.; Jiang, X.; Chen, H.; Li, Y.; Lian, Z.; Liu, B.; Zong, Y.; Zheng, W.; Leppänen, J.M.; Zhao, G. Learning Transferable Facial Emotion Representations from Large-Scale Semantically Rich Captions. arXiv 2025, arXiv:2507.21015. [Google Scholar] [CrossRef]
- Lin, Z.; Wang, Y.; Zhou, Y.; Du, F.; Yang, Y. MLM-EOE: Automatic Depression Detection via Sentimental Annotation and Multi-Expert Ensemble. IEEE Trans. Affect. Comput. 2025, 16, 2842–2858. [Google Scholar] [CrossRef]
- Zhao, Z.; Patras, I. Prompting Visual-Language Models for Dynamic Facial Expression Recognition. In Proceedings of the BMVC, Aberdeen, UK, 20–24 November 2023. [Google Scholar]
- Chen, H.; Huang, H.; Dong, J.; Zheng, M.; Shao, D. Finecliper: Multi-modal fine-grained clip for dynamic facial expression recognition with adapters. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, VIC, Australia, 28 October–1 November 2024; pp. 2301–2310. [Google Scholar]
- Huang, D.; Li, Q.; Yan, C.; Cheng, Z.; Huang, Y.; Li, X.; Li, B.; Wang, X.; Lian, Z.; Peng, X. Emotion-Qwen: Training Hybrid Experts for Unified Emotion and General Vision-Language Understanding. arXiv 2025, arXiv:2505.06685. [Google Scholar]
- Poria, S.; Hazarika, D.; Majumder, N.; Naik, G.; Cambria, E.; Mihalcea, R. MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019. [Google Scholar]
- Zellers, R.; Bisk, Y.; Farhadi, A.; Choi, Y. From recognition to cognition: Visual commonsense reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 6720–6731. [Google Scholar]
- Lian, Z.; Sun, H.; Sun, L.; Chen, K.; Xu, M.; Wang, K.; Xu, K.; He, Y.; Li, Y.; Zhao, J.; et al. Mer 2023: Multi-label learning, modality robustness, and semi-supervised learning. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 9610–9614. [Google Scholar]
- Liu, Y.; Dai, W.; Feng, C.; Wang, W.; Yin, G.; Zeng, J.; Shan, S. Mafw: A large-scale, multi-modal, compound affective database for dynamic facial expression recognition in the wild. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 24–32. [Google Scholar]
- Foteinopoulou, N.M.; Patras, I. Emoclip: A vision-language method for zero-shot video facial expression recognition. In Proceedings of the 2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG), Istanbul, Turkey, 27–31 May 2024; pp. 1–10. [Google Scholar]
- Popescu, C.B.; Florea, L.; Florea, C. Mitigating Context Bias in Vision–Language Models via Multimodal Emotion Recognition. Electronics 2025, 14, 3311. [Google Scholar] [CrossRef]
- Lu, H.; Niu, X.; Wang, J.; Wang, Y.; Hu, Q.; Tang, J.; Zhang, Y.; Yuan, K.; Huang, B.; Yu, Z.; et al. Gpt as psychologist? Preliminary evaluations for gpt-4v on visual affective computing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 322–331. [Google Scholar]
| Dataset | Size | Annotations | Observations |
|---|---|---|---|
| BOLD [135] | 9876 video clips and 13,239 annotated human characters; 70/10/20 split. | 26 emotion classes. | In-the-wild body language dataset focusing on motion and keypoints. Designed for bodily expression recognition. |
| HECO [136] | 9385 images with 19,781 annotated agents (multiple people per image). | 8 discrete emotions + VAD dimensional scores. | Context-rich static image dataset focusing on multiple agents and their interactions; annotated by experts. Contains occlusions and ambiguous cases for realism. |
| EMOTIC [137] | 23,571 images and 34,320 annotated people; 70/10/20 split. | 26 emotion classes. | In-the-wild dataset emphasizing contextual cues beyond facial expressions. Highly imbalanced, reflecting real-world emotion distributions. |
| CAER/CAER-S [138] | 13,201 video clips (∼1,107,877 frames) and ∼70K static images (CAER-S). | 7 emotion classes. | Collected from TV shows, providing rich contextual emotion cues from human interactions in diverse scenarios. |
| EMOSET [139] | 3.3M weakly labeled images; 118K human-annotated subset. | 8 emotion categories based on Mikel’s Model. | Large-scale visual emotion dataset from social media and artistic sources, emphasizing contextual and affective content. |
| DFEW [131] | 16,372 facial expression clips from 1500 movies. | 7 emotion classes. | Dynamic facial expression recognition dataset in the wild; focuses exclusively on facial cues from diverse cinematic sources. |
| FERV39K [140] | 38,935 video clips (>1 M frames) from 4K raw videos across 22 scene types. | 7 emotion classes. | Large-scale in-the-wild dataset emphasizing diverse multi-scene contexts and dynamic facial cues. |
| AFEW [141] | 1426 video clips from movies featuring 330 subjects aged 1–70. | 7 emotion classes. | One of the earliest in-the-wild benchmarks for emotion recognition. Although acted, the clips exhibit real-world variability in lighting, occlusion, and background. |
| RAF-DB [142] | 29,672 images. | 7 basic + 12 compound emotions. | In-the-wild Internet images; crowd-annotated with quality control. |
| AffectNet [143] | 450 k images (400 k manually annotated). | 8 categorical; valence/arousal (dimensional). | Large-scale, Internet-sourced; dual annotation scheme (categorical + dimensional). |
| ExpW [144] | 91,793 faces (manually annotated). | 7 categorical emotions. | Large-scale, Internet-sourced. |
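Several of the datasets above (e.g., BOLD and EMOTIC) are distributed with a fixed 70/10/20 train/validation/test split. A minimal sketch of how such a deterministic split can be reproduced is shown below; the function name and the use of sample indices are illustrative assumptions, not part of any dataset's official tooling.

```python
import random

def split_70_10_20(items, seed=0):
    """Deterministically shuffle and split items into 70/10/20
    train/validation/test partitions (illustrative sketch)."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    items = list(items)
    rng.shuffle(items)
    n = len(items)
    n_train = int(0.7 * n)
    n_val = int(0.1 * n)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # remainder (~20%)
    return train, val, test

# Example with 9876 sample indices (the clip count reported for BOLD):
train, val, test = split_70_10_20(range(9876))
```

In practice one would use the split files shipped with each dataset rather than re-splitting, since published results assume the official partitions.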
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Florea, C.; Popescu, C.-B.; Racovițeanu, A.; Nițu, A.; Florea, L. From Context to Human: A Review of VLM Contextualization in the Recognition of Human States in Visual Data. Mathematics 2026, 14, 175. https://doi.org/10.3390/math14010175

