1. Introduction
The increasing integration of artificial intelligence into high-stakes decision-making processes has changed operational practices in many industries, with human resource management and higher-education admissions among the primary areas of application. The student admission process, particularly for international candidates, can be viewed as a specialized case of personnel recruitment, sharing fundamental challenges of scalability, objectivity, and the valid assessment of applicant potential. Consequently, technological solutions developed in one domain are often applicable to the other. Motivated by these challenges, we introduce FAIR-VID, a multimodal preprocessing pipeline designed for the transparent and fair evaluation of applicant documents, video interviews, and forms. The system addresses the growing demand for AI-enabled workflows that can holistically analyze diverse applicant information while maintaining the robustness, transparency, and ethical accountability required in high-stakes contexts.
In many current systems, the initial phase of the application process—where candidates submit curricula vitae (CVs), application forms, and educational certificates—is managed by rule-based document management systems. These platforms are typically limited to verifying the presence of required documents and extracting basic structured data, such as identification numbers, course names, and grades. A significant and persistent challenge arises when evaluating documents from international applicants, as these systems lack the contextual understanding to consistently and fairly compare qualifications across disparate educational frameworks. This paper introduces a novel solution to this problem, leveraging Large Language Models (LLMs) augmented with a Retrieval-Augmented Generation (RAG) architecture that draws upon a curated knowledge base, such as Wikipedia’s detailed entries on country-specific education systems. This approach moves beyond simple validation to provide a context-aware and equitable preprocessing of international credentials, forming a foundational layer of fairness from the very start of the pipeline.
Following the initial document screening, the interview stage represents another critical juncture for AI-driven innovation. Automated Video Interviews (AVIs) have emerged as a scalable and standardized method for initial applicant assessment. Building upon prior research that has established the psychometric viability of AVIs for evaluating constructs like cognitive ability [
1], our work focuses on building a pipeline that fully capitalizes on this rich data source. Previous studies have demonstrated that while AI-scored interviews can be more resistant to score inflation than self-reports, the need for objective, multi-faceted evaluation remains paramount to mitigate impression management and potential fraud. The FAIR-VID pipeline addresses this by not only analyzing interview content but also integrating it with data from submitted documents, creating a unified and comprehensive applicant profile.
To contextualize our contribution,
Figure 1 illustrates the end-to-end admission workflow envisioned by the FAIR-VID project. The diagram explicitly demarcates the boundary between automated processing (left panel) and human authority (right panel) to visualize where the AI’s role ends and human judgment begins. The left panel outlines the pipeline that is the primary focus of this paper. It begins with the applicant’s submission of multimodal data, which is then processed through an iterative cycle of automated interviews and data enrichment. An AI agent determines whether follow-up interviews are necessary, creating subsequent questions and dialogue flows to probe specific areas of the applicant’s profile. The automated phase culminates in the execution of evaluation models—including skills tests, risk and fraud detection algorithms, and holistic scoring—to generate a preliminary result accompanied by a detailed, explainable rationale. The right panel of
Figure 1 depicts the subsequent stage, where this AI-generated output serves as a decision-support tool for human experts, who retain authority in the final selection, thus ensuring a model of human-AI collaboration.
This paper details the design and implementation of the multimodal pre-processing pipeline that underpins this entire system. Its central novelty lies in its ability to fuse heterogeneous data modalities into a structured, semantically rich dataset suitable for training sophisticated deep learning models. Specifically, we move beyond conventional feature extraction by employing generative AI to convert unstructured visual data from video interviews—such as key image frames depicting applicant behavior and environment—into structured textual descriptions. This transformation enriches the dataset with a new layer of semantic information, facilitating more nuanced analyses of non-verbal cues and contextual factors. In line with the growing demand for trustworthy AI, our approach emphasizes transparency and the use of open-source modules, ensuring that each step of the interaction between applicants, AI agents, and institutional decision-makers is auditable and explainable.
The remainder of this paper is organized as follows.
Section 2 reviews the related work in automated video interviews, document intelligence, and audiovisual forgery detection.
Section 3 details the methodology for our advanced document analysis, tracing the evolution from layout-aware transformers to the zero-shot capabilities of large multimodal models and presenting our novel RAG-based framework for credential adjudication.
Section 4 describes the implementation of the multimodal data processing pipeline for video, audio, and visual data, detailing the two-phase process of data deconstruction and generative AI-based enrichment.
Section 5 presents the multimodal data fusion pipeline that combines document, audio, and video representations, together with retrieval-augmented reasoning, into a unified, human-reviewed decision-support architecture. Following this,
Section 6 presents the results of our proof-of-concept and discusses their implications for fairness and fraud detection, outlining the experimental setup for evaluating the pipeline, including the datasets used and the fusion of the processed modalities. Finally,
Section 7 concludes the paper, summarizes our contributions, and outlines directions for future work.
2. Related Work
The FAIR-VID pipeline builds upon and integrates advances across several related research domains, including multimodal artificial intelligence, document intelligence, speech and video forensics, and knowledge-based reasoning. This section places our work within the broader scientific field, reviewing key developments and remaining challenges in document understanding, vision-language modeling, speech processing, video forensics, multimodal fusion, retrieval-augmented generation, AI fairness, and predictive analytics in high-stakes decision-making. While extensive research exists on document intelligence, automated video interviews, and multimodal fusion, the literature reveals a notable absence of end-to-end multimodal preprocessing pipelines specifically designed for admissions or HR decision workflows. Existing commercial systems typically handle only isolated components—such as résumé parsing, video interview scoring, or fraud detection—without unifying heterogeneous modalities into a single auditable representation. To the best of our knowledge, no prior work has proposed a transparent, open-source, and regulation-aligned preprocessing pipeline that integrates documents, audio, transcripts, and visual data into a standardized applicant profile. This paper therefore fills a methodological gap by presenting FAIR-VID as a holistic, multimodal preprocessing architecture.
Document Image Understanding. The field of document intelligence has moved from traditional optical character recognition (OCR) toward end-to-end deep learning models that process visual, text, and layout information together. The LayoutLM family of models [
2] introduced pretraining of transformers on document images that combines layout (positional) information with text, greatly improving key-value extraction from structured forms. Later work, such as LayoutLMv3 [
3] and LiLT [
4], improved pretraining efficiency and cross-lingual capability, showing that joint text-image representations outperform separate OCR-then-Natural Language Processing (NLP) pipelines.
More recently, OCR-free models such as Donut [
5] and Pix2Struct [
6] have emerged that treat document understanding as a direct image-to-text generation task. This removes the need for an unreliable intermediate OCR step and is especially valuable for the FAIR-VID system, which must handle non-standard credentials and poor-quality scans from a wide range of global sources.
In the admissions field, AI services now read transcripts and credential images to extract grades and verify authenticity, sometimes using LLMs or Contrastive Language-Image Pre-training (CLIP) style vision-language models to interpret scanned certificates. FAIR-VID builds on this work by using generative image-to-text models to convert applicant-uploaded credential images into machine-readable text, reducing OCR errors. Credential review can be further improved through retrieval: by consulting official sources (university catalogs, accreditation lists), the system checks whether courses correspond and whether degrees are valid. Although few studies focus on credential verification in admissions, FAIR-VID’s pipeline combines advanced document transformers with knowledge retrieval to address this gap.
Vision-Language Models and Generative Enrichment. The recent growth of Vision-Language Models (VLMs) and Large Multimodal Models (LMMs)—such as CLIP, Bootstrapping Language-Image Pre-training (BLIP) [
7], Gemini, and GPT-4o—has created new possibilities for interview preprocessing by enabling zero-shot visual understanding. These systems learn aligned image-text representations, allowing open-vocabulary recognition and descriptive captioning without task-specific supervision. For example, models such as BLIP perform well at image captioning and transfer to video tasks even in zero-shot settings.
Building on these foundations, vision-capable LLMs can infer characteristics from facial images; a recent study shows that fine-tuned VLMs can predict age, gender, and emotion from ordinary photographs with accuracy comparable to specialized classifiers [
8]. Such models can generate textual summaries of a candidate’s expression or posture. In credential processing, generative vision-language systems can also improve OCR by inferring missing text or correcting errors from context.
FAIR-VID uses this approach to add textual descriptions to video frames and document images, a process we call “generative enrichment.” This method turns raw pixels into meaningful, language-aligned representations that can be combined with transcript and document data, going beyond most current pipelines, which rarely employ advanced vision-language generation. However, issues of hallucination, spatial grounding, and factual consistency remain unsolved. To mitigate them, FAIR-VID combines LMM outputs with layout-extracted metadata and temporal consistency checks, ensuring that visual descriptions can be traced back to specific video frames and document references.
Speech Processing and Speaker Verification. Automatic speech recognition (ASR) and speaker verification have been transformed by self-supervised learning. Models like wav2vec [
9] and HuBERT [
10] learn audio representations from unlabeled data, capturing phonetic and prosodic features. OpenAI’s Whisper extended this to a multilingual ASR model with near-human accuracy across a wide range of languages, accents, and acoustic conditions. For speaker verification, architectures such as x-vector and ECAPA-Time Delay Neural Network (TDNN) [
11] have become standard, achieving low error rates on established benchmarks. These technologies, along with toolkits for paralinguistic analysis like openSMILE [
12], form the foundation of FAIR-VID’s speech processing module for transcription, identity verification, and non-verbal feature extraction.
Video Forensics and Deepfake Detection. The proliferation of generative video manipulation has spurred significant research in deepfake detection. Foundational work in audiovisual synchronization, such as SyncNet [
13], introduced deep learning methods to verify temporal consistency by analyzing lip movements. The “LipForensics” approach [
14] specifically leverages lip movement patterns as a modality-specific signal to achieve strong cross-dataset generalization in forgery detection. Datasets like FaceForensics++ [
15] fueled the development of detectors trained to identify subtle artifacts in facial dynamics, lighting, and compression. Subsequent research has shown that multimodal approaches, which combine visual cues with audio-visual synchronization analysis [
16], offer higher reliability. While recent methods use self-supervised learning to improve generalization to unseen manipulation techniques, robust cross-dataset performance remains a challenge. FAIR-VID incorporates these lessons by integrating speaker verification with lip-sync analysis and visual forgery detection. Crucially, instead of providing a binary fraud classification, our system generates evidential metadata (e.g., confidence scores, timestamps) to support transparent human review, aligning with calls for greater interpretability in digital forensics.
Multimodal Data Fusion. Integrating different data modalities is a central challenge in AI. Methods have moved from early fusion (concatenating embeddings) and late fusion (combining predictions) to more advanced techniques based on cross-modal attention [
17]. Contrastive learning frameworks like CLIP [
18] have been particularly influential, learning a shared latent space where different modalities can be directly compared and combined. Such approaches have been shown to improve performance on complex reasoning tasks. FAIR-VID uses a staged fusion architecture: document entities are extracted first, followed by the fusion of video and audio streams, with the resulting representation feeding into a final reasoning layer. This modular design improves traceability and aligns with best practices for building explainable multimodal systems.
Retrieval-Augmented Generation for Grounded Reasoning. Retrieval-augmented generation (RAG) [
19] improves large language models by grounding their outputs in externally retrieved evidence. This approach reduces hallucination and enables responses that are factual, context-aware, and attributable to specific sources. Later research has extended RAG to handle long-context reasoning, citation tracking, and structured knowledge grounding [
20]. FAIR-VID applies RAG in its Cross-Border Adjudication Matrix, which queries knowledge bases like International Standard Classification of Education (ISCED) and European Network of Information Centres (ENIC-NARIC) to assess credential equivalence. This approach mirrors the use of RAG in other high-stakes domains, such as legal and policy analysis, where provenance and auditability are critical.
Fairness, Governance, and Human-in-the-Loop Systems. The deployment of AI in education and employment requires careful consideration of fairness, transparency, and accountability. Foundational work in algorithmic fairness has identified numerous sources of bias in data, models, and deployment contexts [
21,
22,
23]. In multimodal systems, these risks are increased by potential biases related to accents, facial attributes, and document formats [
24,
25]. In response, research has focused on developing explainability techniques for multimodal models [
26] and designing human-in-the-loop (HITL) governance frameworks that preserve human oversight [
27]. FAIR-VID implements these principles by designing its pipeline around interpretable, auditable outputs (e.g., structured data, timestamped transcripts, provenance logs), enabling human review and aligning with emerging regulatory standards like the EU AI Act.
Predictive Modeling in Admissions and HR. There is a growing body of literature on the use of automated systems for applicant screening and interview analysis. Studies on asynchronous video interviews and personality prediction have shown that multimodal cues can predict performance and engagement [
28,
29]. However, such systems have been criticized for their opacity and potential for bias [
30]. Research specifically focused on automated video interviews confirms their increasing use for efficiency but also notes that their validity and fairness remain active areas of investigation [
31,
32]. Notably, the specific problem of applicant fraud within these interviews, such as the use of deepfakes or pre-scripted answers, has received limited academic attention [
33]. FAIR-VID addresses these concerns by explicitly separating data preprocessing and enrichment from downstream predictive modeling. By producing standardized, provenance-aware representations, our system ensures that any subsequent predictive models are built on a transparent and auditable foundation, supporting reproducibility and bias auditing.
3. End-to-End Visual Document Analysis
This section describes the visual-document analysis component of FAIR-VID and explains why robust document understanding is a prerequisite for fair multimodal admission decisions. Below we outline the practical constraints that motivate our design choices and summarize the operational trade-offs that guided model selection and system architecture. While FAIR-VID centers on multimodal fusion, reliable applicant assessment depends equally on accurate, context-aware document understanding. Traditional document processing systems—typically reliant on rigid templates and basic optical character recognition—struggle to cope with the diversity of international diplomas, transcripts, and identity documents. For admissions workflows, document understanding is therefore not a simple automation problem but one of accuracy, explainability, and verification.
3.1. Key Challenges in International Credential Processing
We begin by listing the practical obstacles that make credential processing particularly challenging in cross-border admissions settings. In international student admissions, document evaluation remains one of the most labor-intensive stages of the process. Unlike video interview scoring, which can be standardized and automated relatively easily, document verification requires human-level reasoning across heterogeneous layouts, languages, seals, and grading conventions. These documents carry significant decision weight; even small extraction errors can distort an applicant’s profile and lead to biased or inconsistent outcomes.
Accuracy in information extraction is critical because even small OCR or parsing errors can mischaracterize an applicant’s record and propagate through downstream models. OCR results therefore cannot be treated as ground truth, since errors in text segmentation, recognition, or language encoding otherwise carry forward unchecked.
In FAIR-VID, OCR output is instead regarded as a preliminary hypothesis that must be verified by complementary models or human oversight. To limit this propagation of uncertainty, the pipeline adopts a dual-model strategy that combines OCR-based extraction with direct vision-language reasoning.
Another major constraint arises from data annotation requirements. Fine-tuning document models such as LayoutLM or Donut usually requires large, manually annotated datasets with bounding boxes, entity labels, and field relationships. For global admissions, where credentials vary across thousands of institutions and hundreds of document formats, such annotation is prohibitively expensive and impractical. Consequently, FAIR-VID prioritizes models and methods that generalize to unseen document structures without the need for exhaustive labeled data.
These real-world constraints—verification demands, OCR uncertainty, and the infeasibility of large-scale fine-tuning—directly shaped the FAIR-VID approach to building a reliable and auditable document understanding system.
3.2. Zero-Shot Document Understanding Strategy
Given these constraints (verification needs, OCR fragility, and annotation costs), we prioritized approaches that reduce dependence on large labeled corpora and support rapid, explainable inference, which motivated the zero-shot, prompt-based strategy described next. Development therefore began with a locally deployable large multimodal model, Google’s Gemma 3 (27B), as the central engine for the first version of the FAIR-VID document pipeline.
In our zero-shot setup, each document image is supplied together with a compact, structured natural-language prompt that specifies the target fields and a machine-readable output format (for example, JSON), enabling direct image-to-structured-text inference. This prompt-based workflow requires no task-specific fine-tuning, supports rapid experimentation, and allows the system to handle highly diverse document types. Gemma 3 was selected over Donut- and LayoutLM-style architectures because it does not require any task-specific training dataset, making it better suited to the highly heterogeneous and globally diverse credentials encountered in international admissions. Additionally, in contrast to proprietary multimodal models such as GPT-4o, Gemma 3 can be deployed on institutional hardware with a single GPU, ensuring full data locality and compliance with privacy regulations while maintaining strong zero-shot interpretability.
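To make this step concrete, the sketch below shows a minimal zero-shot extraction call, assuming a locally hosted Gemma 3 instance exposed through an OpenAI-compatible HTTP endpoint; the endpoint URL, model name, and field list are illustrative assumptions rather than the deployed FAIR-VID configuration.

```python
import base64, json, requests

# Illustrative local endpoint and model name (assumptions, not the deployed FAIR-VID setup).
ENDPOINT = "http://localhost:8000/v1/chat/completions"
MODEL = "gemma-3-27b-it"

# Example target fields for an IELTS-style certificate; not the full FAIR-VID schema.
PROMPT = (
    "Extract the following fields from the attached certificate image and "
    "return ONLY a valid JSON object with these keys: "
    '"candidate_name", "overall_score", "test_date", "issuing_center". '
    "Use null for any field that is not visible."
)

def extract_fields(image_path: str) -> dict:
    """Zero-shot image-to-structured-text inference via a single chat prompt."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")

    payload = {
        "model": MODEL,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
        "temperature": 0.0,  # deterministic extraction
    }
    reply = requests.post(ENDPOINT, json=payload, timeout=120).json()
    text = reply["choices"][0]["message"]["content"]
    # The model is instructed to emit JSON only; parsing failures are surfaced
    # to the verification branch rather than silently ignored.
    return json.loads(text)

if __name__ == "__main__":
    print(extract_fields("sample_certificate.png"))
```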
A practical limitation is that generative LMM outputs can sometimes hallucinate; we therefore pair the zero-shot path with a verifiable OCR branch to provide spatial grounding and error checks. The verified results from this stage also serve as a foundation for benchmarking and constructing training data for secondary models.
Although Gemma 3 shows strong zero-shot performance and development efficiency relative to older end-to-end models (e.g., Donut), we maintain a complementary LayoutLM + OCR branch to provide explicit spatial grounding and visual evidence for extracted fields. This secondary path is essential for explainability: it enables bounding-box visualization of extracted fields, allowing the system to justify its decisions and support transparent human-AI interaction interfaces. In practice, Gemma 3 provides semantic interpretation, while the LayoutLM + OCR pipeline ensures spatial grounding and verification consistency.
3.3. Model Justification and Practical Trade-Offs
This subsection summarizes why we selected the chosen components and highlights the main trade-offs considered: accuracy vs. annotation cost, zero-shot flexibility vs. verification needs, and runtime/energy vs. local privacy controls. Specifically:
(1) a zero-shot LMM (Gemma 3) minimizes annotation overhead and scales across diverse layouts;
(2) a LayoutLM + OCR path provides inspectable bounding boxes required for audits;
(3) local deployment of heavier verification components balances privacy with compute cost.
Empirical evaluation confirms that the Gemma 3 approach provides superior flexibility and development efficiency, making it suitable for large-scale document ingestion and zero-shot inference. That said, once a sufficiently large, verified corpus exists, end-to-end models such as Donut may become preferable because they require less task orchestration and typically offer smaller model footprints, lower energy use, and faster per-document throughput.
In the current operational configuration of FAIR-VID, the combined LayoutLM + OCR branch remains essential because it localizes each extracted field and attaches visual evidence to it, capabilities that are critical for explainability, auditability, and human–AI collaboration, particularly in cases involving ambiguous or high-risk credentials. Maintaining both pathways ensures that extracted information can be inspected, validated, and traced back to the underlying document structure, supporting the reliability requirements of international admissions workflows. We acknowledge that the current zero-shot strategy may underperform on rare credential formats, and the absence of fine-tuned domain-specific weights remains a limitation at this stage of deployment.
3.4. Integration into the Multimodal Pipeline
This subsection explains how document outputs are packaged and handed off to downstream actors so that they can be fused with video and behavioral signals described in
Section 4. The overall interaction among the admission system, the visual document understanding subsystem (VDU), the local and cloud AI agents, and the human administrator is summarized in
Figure 2. This sequence diagram is structured to clarify the flow of information, distinguishing the iterative verification loop from the final, linear hand-off to the admission system. The process begins in the admission system, where applicants register and upload scanned documents alongside other application data. These document images are transmitted to the VDU system, which performs multimodal visual analysis and entity extraction using both the Gemma 3-based semantic reasoning module and the LayoutLM + OCR verification branch. The extracted information is represented in structured JSON form and accompanied by labeled document images that highlight identified fields.
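For illustration, a minimal sketch of such a structured record is given below; the field names, coordinates, and confidence values are hypothetical placeholders rather than the production VDU schema.

```python
# Hypothetical example of a VDU output record (field names and values are illustrative only).
vdu_output = {
    "document_id": "doc-0001",
    "document_type": "IELTS_certificate",
    # Semantic interpretation from the Gemma 3 branch.
    "fields": {
        "candidate_name": "A. Example",
        "overall_score": "7.5",
        "test_date": "2024-05-18",
        "issuing_center": "Example Test Centre",
    },
    # Spatial grounding from the LayoutLM + OCR branch: pixel bounding boxes per field.
    "grounding": {
        "candidate_name": {"bbox": [112, 340, 480, 372], "ocr_confidence": 0.97},
        "overall_score":  {"bbox": [610, 512, 668, 540], "ocr_confidence": 0.99},
    },
    # Flags raised when the two branches disagree, for administrator review.
    "verification": {"branch_agreement": True, "flags": []},
}
```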
The VDU forwards structured JSON outputs and visualized, labeled images to the admission administrator for human inspection. Simultaneously, the local AI agent ingests the same outputs for contextual interpretation and downstream fusion. The visualized, labeled images are particularly valuable for human review, since formatting, texture, and visual authenticity cues provide additional evidence that purely textual transcripts cannot convey. Upon administrator approval, these labeled samples may be retained for future fine-tuning and validation of the VDU models.
A central operational feature is an interactive review loop: the administrator requests clarifications, issues new prompts, or provides feedback on extracted entities, and the local AI agent responds with updated interpretations, grounding evidence, or relevance assessments. Through this iterative exchange, the system ensures that all extracted data remain contextually valid and institutionally compliant.
Once verified, the local AI agent depersonalizes the data—removing identifiable information and transforming sensitive fields into synthetic or abstracted forms—before forwarding reasoning tasks to the cloud AI agent. The cloud AI agent, operating exclusively on anonymized profiles, executes advanced reasoning and knowledge-based validation steps. This strict separation ensures that computationally heavier cloud reasoning never accesses personal identifiers, in line with EU AI Act requirements. Its prompts are published in open repositories to ensure transparency, while its responses may optionally be archived for reproducibility. The resulting insights are returned to the local AI agent, which re-integrates contextual and personal details before presenting the complete, interpretable output to the admission administrator.
We note that this depersonalization and cloud/local split reduces privacy risk but does introduce an additional verification step that must be audited in practice. Depending on institutional policy, the local AI agent may annotate or insert AI-generated notes directly into the admission system. Finally, the admission administrator consolidates all AI-generated outputs, human observations, and validation results within the admission system, completing the end-to-end cycle from document submission to verified, explainable, and auditable decision support.
5. A Multimodal Data Fusion Pipeline
This section explains how FAIR-VID merges the processed document, audio, and video streams into a unified reasoning pipeline. We describe the rationale for each fusion stage and clarify how these components jointly support transparent, auditable decision-making. Having established how documents and video interviews are independently analyzed, we now detail how these heterogeneous representations are combined to form a coherent applicant profile.
The final stage of the FAIR-VID framework brings together all previously processed modalities—documents, audio, and video—into a unified decision-support architecture. The objective of the fusion architecture is to synthesize multimodal signals into a fair and interpretable representation that reflects both quantitative attributes and contextual behavioral evidence. The fusion criteria were defined to prioritize explainability and traceability: document-based evidence provides structural grounding, video-derived cues supply contextual and behavioral signals, and retrieval-augmented responses ensure alignment with prior institutional practice. This phase, shown in
Figure 5, implements a
multimodal data fusion pipeline that integrates document reasoning, video analysis, and historical decision data. The diagram illustrates the downward progression of data, transforming the raw multimodal inputs of Stages 2 and 3 into a concise, human-readable evidence package in Stage 4. Each stage builds incrementally on the previous one, allowing evaluators to trace how raw inputs influence intermediate summaries and final recommendations. The resulting synthesis is always presented to a human evaluator for the final admission decision.
5.1. Transparency and Open Access of Prompts
A fundamental design principle of FAIR-VID is radical transparency in AI reasoning. Because LLM behavior is highly sensitive to prompt formulation and prompt structure directly shapes model outputs, publishing these prompts enables external auditing, facilitates reproducibility, and ensures that evaluators understand the assumptions embedded in each reasoning step.
All prompts used across the FAIR-VID system—ranging from document interpretation and video evaluation to multimodal fusion—are version-controlled in a public repository (fair-vid/admission_prompts). This repository allows external experts, auditors, and applicants to inspect exactly what the AI was asked to do. The prompt content, structure, and evaluation criteria can thus be publicly scrutinized and improved, aligning with the transparency and human-oversight principles emphasized in the EU AI Act.
By treating prompts as institutional policy artifacts rather than internal parameters, FAIR-VID makes the AI evaluation process open, contestable, and auditable—core expectations for trustworthy AI deployment in high-risk domains such as admissions and HR management.
5.2. Local vs. Cloud Agents
Fusion requires careful handling of sensitive versus depersonalized data, which motivates a split between local and cloud computation. The FAIR-VID design explicitly distinguishes between two categories of AI agents:
the Local AI Agent, operating within institutional infrastructure and authorized to access personal or sensitive data, and
the Cloud AI Agent, which performs advanced reasoning only on fully depersonalized or synthetic data.
This architectural division establishes a privacy boundary that ensures personal identifiers remain under institutional control while still enabling advanced cloud-based reasoning. The local AI agent performs multimodal fusion using actual applicant data (documents, audio, and video) and then depersonalizes the intermediate representations—removing names, identifiers, and contextual details—before sending general analytical requests to the cloud AI agent. The cloud AI agent thus receives only abstract or anonymized descriptions, ensuring that no personally identifiable data ever leave the institution.
This architecture satisfies several key obligations outlined in the EU AI Act. In particular, it supports proportionality, documentation, and explicit human oversight—requirements that are central for systems operating in high-risk admission contexts:
Data governance and protection: Personal data remain local, under institutional control.
Transparency and documentation: Prompts and system behavior are documented for external review.
Human oversight: A human administrator remains responsible for all final decisions.
Proportionality and risk mitigation: High-risk processing (e.g., profiling or ranking) is restricted to verified local environments.
5.3. Stage 1—Holistic Document Analysis
The fusion process begins with text-based materials, as these typically contain the most complete evidence about an applicant’s academic background. The first stage therefore conducts a comprehensive analysis of all textual materials submitted by the applicant: academic transcripts, CVs, certificates, and application forms. This ordering reflects an empirical trade-off: document-level reasoning offers the most reliable evidence base, allowing later stages to refine rather than replace the initial assessment. All pre-processed document content is merged into a unified text block, enabling the model to reason holistically across credentials, summaries, and supporting statements.
In this step, the Local AI Agent has access to the complete, context-rich data, including personal identifiers, document metadata, and prior institutional records; document-level reasoning can therefore incorporate details that cannot be shared with the cloud agent. The Local AI Agent uses this information to generate an internally consistent assessment of the applicant’s academic and professional potential. The model is guided by a transparent prompt template, publicly available in the fair-vid/admission_prompts repository, to ensure accountability and reproducibility. A representative example of such a prompt is shown below:
You are an expert university admissions evaluator. Below is a student’s full application description and supporting documents.
{combined_text}
Based on this information, estimate the student’s probability (from 0 to 100%) of being accepted for a scholarship at universities of three different global ranking levels:
Top 10 global university; Around rank 100; Around rank 500.
For each category, provide:
- A numerical probability (integer 0–100)
- A short list of reasons (2–3 concise bullet points) explaining your evaluation.
Respond only with a valid JSON object.
The output from this local evaluation forms a structured, machine-readable JSON document that quantifies the applicant’s likelihood of acceptance at different institutional tiers.
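As an illustration of how this output can be consumed downstream, the following sketch validates the returned JSON before it enters the fusion pipeline; the key names mirror the three ranking tiers named in the prompt but are otherwise an assumed response format, not a documented FAIR-VID schema.

```python
import json

EXPECTED_TIERS = ("top_10", "rank_100", "rank_500")  # assumed keys for the three tiers

def parse_tier_probabilities(raw_response: str) -> dict:
    """Validate the local agent's JSON output before it enters the fusion pipeline."""
    data = json.loads(raw_response)
    validated = {}
    for tier in EXPECTED_TIERS:
        entry = data[tier]
        probability = int(entry["probability"])
        if not 0 <= probability <= 100:
            raise ValueError(f"{tier}: probability {probability} outside 0-100")
        reasons = list(entry["reasons"])
        if not 2 <= len(reasons) <= 3:
            raise ValueError(f"{tier}: expected 2-3 reasons, got {len(reasons)}")
        validated[tier] = {"probability": probability, "reasons": reasons}
    return validated

# Example response in the assumed format.
example = '''{
  "top_10":   {"probability": 12, "reasons": ["Strong grades", "Limited research record"]},
  "rank_100": {"probability": 55, "reasons": ["Solid transcripts", "Good English score"]},
  "rank_500": {"probability": 90, "reasons": ["Meets baseline requirements", "Complete documentation"]}
}'''
print(parse_tier_probabilities(example))
```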
Before sharing outputs externally, the Local AI Agent removes all identifying attributes and rewrites the profile into an abstract description that preserves structure but not identity. The depersonalization process removes names, identifiers, and any contextual markers, transforming the data into a hypothetical applicant profile—for example, “a student with high mathematics grades, average English proficiency, and two international certificates.” The Cloud AI Agent then reasons only about this generalized, anonymized profile. It produces broad interpretive insights, such as typical scholarship probabilities for similar academic patterns, without ever accessing personal or institutional data.
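A minimal sketch of this depersonalization step, under simplified assumptions about the profile fields and abstraction rules, is shown below; a production implementation would also need to handle indirect identifiers and free-text fields.

```python
# Illustrative-only depersonalization: strip direct identifiers and abstract concrete values
# into coarse categories before anything is sent to the cloud AI agent.
IDENTIFYING_FIELDS = {"name", "passport_number", "email", "date_of_birth", "home_institution"}

def band(score: float) -> str:
    """Map a numeric grade (0-100 scale assumed) onto a coarse qualitative band."""
    return "high" if score >= 85 else "average" if score >= 60 else "low"

def depersonalize(profile: dict) -> dict:
    """Turn a concrete applicant record into an abstract, identity-free description."""
    abstract = {k: v for k, v in profile.items() if k not in IDENTIFYING_FIELDS}
    abstract["math_level"] = band(abstract.pop("math_grade"))
    abstract["english_level"] = band(abstract.pop("english_grade"))
    abstract["certificate_count"] = len(abstract.pop("certificates"))
    return abstract

applicant = {
    "name": "A. Example", "passport_number": "X1234567", "email": "a@example.org",
    "date_of_birth": "2004-02-11", "home_institution": "Example High School",
    "math_grade": 92.0, "english_grade": 68.0,
    "certificates": ["IELTS", "IB Diploma"],
}
# Yields the abstract profile used in the text above:
# {'math_level': 'high', 'english_level': 'average', 'certificate_count': 2}
print(depersonalize(applicant))
```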
This division of responsibilities establishes a privacy-preserving hierarchy: the Local AI Agent performs fine-grained reasoning with access to all data, while the Cloud AI Agent contributes generalized analytical reasoning based solely on synthetic, context-neutral representations. This design both upholds the transparency and human-oversight principles of the EU AI Act and ensures that FAIR-VID’s decision-support process remains interpretable, reproducible, and ethically compliant.
5.4. Stage 2—Fused Re-Evaluation with Video Analysis
The second fusion stage incorporates audiovisual information to refine or contextualize the document-based assessment. The results of the video interview analysis (from
Section 4) are integrated with the document-based evaluation: video-derived cues, such as prosodic features, communication style, and authenticity indicators, enrich the intermediate representation. The combined dataset is then submitted to the local AI agent through a fused holistic prompt, which asks the model to re-evaluate the applicant in light of the multimodal evidence.
This phase enables context-aware adjustment: strong written credentials may be reconsidered if the video reveals low engagement or inconsistencies, whereas a modest application may gain strength through confident and authentic communication. This step is particularly important for distinguishing strong applicants with weak documentation from those whose written credentials overstate performance.
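The sketch below illustrates how such a fused holistic prompt could be assembled; the cue names and template wording are assumptions for illustration, not the exact production prompt published in the fair-vid/admission_prompts repository.

```python
import json

# Illustrative inputs: Stage 1 document assessment and Stage 2 video-derived cues.
document_assessment = {
    "rank_100": {"probability": 55, "reasons": ["Solid transcripts", "Good English score"]},
}
video_cues = {
    "speech_fluency": "high",
    "engagement": "moderate",
    "authenticity_flags": [],  # e.g., lip-sync or speaker-verification warnings
}

FUSED_TEMPLATE = """You previously evaluated this applicant using documents only:
{doc_json}

The automated video interview produced these observations:
{video_json}

Re-evaluate the applicant in light of both sources. Adjust the probabilities where the
video evidence supports or contradicts the documents, and explain each adjustment.
Respond only with a valid JSON object."""

fused_prompt = FUSED_TEMPLATE.format(
    doc_json=json.dumps(document_assessment, indent=2),
    video_json=json.dumps(video_cues, indent=2),
)
print(fused_prompt)  # sent to the local AI agent for the Stage 2 re-evaluation
```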
5.5. Stage 3—Retrieval-Augmented Generation (RAG) Synthesis
To ensure that the fused profile is interpreted consistently with prior institutional practices, FAIR-VID supplements model reasoning with retrieval-based evidence. Running in parallel, the RAG subsystem grounds the AI’s reasoning in institutional precedent. The fused profile is embedded into a vector space and matched against historically similar cases, retrieving the most relevant prior decisions. The top-N most similar cases and their historical outcomes are retrieved to provide an empirical benchmark for comparison.
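The retrieval step can be sketched as follows under simple assumptions: depersonalized profiles are embedded with an off-the-shelf sentence-embedding model and compared by cosine similarity; the model name, example cases, and top-N value are placeholders rather than the deployed configuration.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed embedding backend for this sketch

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model choice

# Historical cases: depersonalized profile text plus the human-made admission outcome.
history = [
    {"profile": "high mathematics grades, average English, two certificates", "outcome": "offer"},
    {"profile": "average grades, strong English, no certificates",            "outcome": "denied"},
    {"profile": "high grades overall, one certificate, strong interview",     "outcome": "offer"},
]
history_vecs = model.encode([h["profile"] for h in history], normalize_embeddings=True)

def retrieve_similar(fused_profile: str, top_n: int = 2):
    """Return the top-N most similar prior cases and their human decisions."""
    query = model.encode([fused_profile], normalize_embeddings=True)[0]
    scores = history_vecs @ query          # cosine similarity (embeddings are normalized)
    ranked = np.argsort(scores)[::-1][:top_n]
    return [{**history[i], "similarity": float(scores[i])} for i in ranked]

print(retrieve_similar("high mathematics grades, average English proficiency, two certificates"))
```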
In the current implementation, all historical outcomes used for similarity-based retrieval are derived exclusively from human-made admission decisions recorded in previous admission cycles. These outcomes reflect the institution’s standard evaluation procedure and include only the binary admission result (offer granted or denied), without incorporating later academic performance. Relying on human-generated decisions prevents the retrieval mechanism from recycling FAIR-VID’s own predictions, thereby avoiding recursive feedback loops. Nevertheless, if institutions later adopt FAIR-VID-generated recommendations in operational practice, accumulated decisions may gradually incorporate model influence. We acknowledge this as a potential source of recursive bias and recommend periodic auditing and recalibration of the historical database. The RAG component is therefore intended solely as a reference tool for contextualizing cases, supporting both the reinterpretation of past decisions and the identification of decisions that should not be used in downstream model training.
This mechanism enforces consistency and fairness over time, aligning with the EU AI Act’s emphasis on traceability and risk monitoring. However, the quality of these comparisons depends on the representativeness of the historical dataset, which is an acknowledged limitation in early-stage deployments. The RAG output includes descriptive summaries of comparable applicants and how their cases were resolved, giving evaluators a transparent view of how the current assessment fits within established patterns.
5.6. Stage 4—Final Output Generation and Human Review
The final stage consolidates all intermediate results into a form that supports human review and regulatory compliance. This stage (
Figure 5, Stage 4) fuses the outputs of the re-evaluation and RAG modules into a single evidence package. The resulting evidence package contains:
Updated scholarship probabilities and rationale;
A list of comparable past applicants and decisions;
Supporting extracts from documents, transcripts, and video metrics.
This package is presented to the admission administrator, who performs the ultimate evaluation and records the decision in the admission system. The human-in-the-loop review ensures compliance with the EU AI Act’s human oversight and accountability principles. The AI thus acts as an assistant that aggregates and contextualizes evidence rather than a decision-maker, ensuring that fairness, transparency, and privacy remain central to every stage of the multimodal assessment. The fused evidence produced by this stage forms the basis for the experiments and evaluation presented in
Section 6.
6. Results and Discussion
6.1. Experimental Setup
To evaluate the reliability and generalization capacity of the FAIR-VID visual document understanding pipeline, we conducted a controlled experiment comparing two model configurations:
Gemma 3 (27B)—a large multimodal model operating in zero-shot mode;
LayoutLMv3 Base—a layout-aware transformer utilizing OCR input for spatial reasoning.
Both models were assessed on an identical test set of standardized International English Language Testing System (IELTS) certificates, chosen due to the global uniformity of their format and field structure (candidate name, overall score, date, and issuing center). This standardization allowed for precise benchmarking of extraction accuracy and error analysis under real document conditions.
6.2. Model Performance on Standardized IELTS Certificates
Each certificate image was processed independently, and outputs were evaluated against manually verified ground-truth annotations. Correct extraction required exact textual and field-level accuracy (e.g., numerical scores, names, and dates).
The Gemma 3 (27B) model achieved near-perfect accuracy, with only one document extracted incorrectly out of the entire test set. In contrast, LayoutLMv3 produced three erroneous extractions, primarily related to OCR boundary misalignment and missing diacritical marks in candidate names (see
Table 1).
The results clearly demonstrate that document regularity strongly influences extraction accuracy. On the highly standardized IELTS certificates, Gemma 3 delivered nearly flawless zero-shot performance, confirming the model’s capability to generalize spatial and textual understanding without fine-tuning. The small number of errors observed in LayoutLMv3 highlights the limitations of OCR-dependent architectures when field boundaries or typography vary slightly, even in uniform documents.
When tested on a random set of international academic documents of the same size, Gemma 3’s error count increased to four, corresponding to 80% overall accuracy. This decline reflects the model’s sensitivity to unfamiliar layouts and unstructured text, emphasizing that robust generalization still benefits from limited domain adaptation. LayoutLMv3 was not evaluated on the random set due to the significant manual effort required to prepare OCR-aligned annotations—a known scalability limitation of layout-aware models.
Collectively, these findings validate the decision to employ Gemma 3 as the primary inference model within the FAIR-VID visual analysis pipeline. Its superior accuracy on standardized formats, minimal dependency on explicit OCR, and reduced data-annotation requirements make it the most practical solution for international admission workflows. At the same time, maintaining an OCR-based verification branch remains valuable for explainability, allowing bounding-box visualization of key fields in high-risk or ambiguous cases.
For the IELTS Certificate evaluation, accuracy was computed at the document level, defined as the proportion of certificates for which all required fields were extracted correctly. Each certificate was treated as a single evaluation unit because IELTS documents follow a standardized layout with a small, fixed number of fields, making field-level weighting unnecessary for this dataset. Accordingly, the metric does not distinguish between cases where two errors occur in a single certificate versus two errors distributed across two certificates; what matters for downstream use is whether a complete document can be reliably parsed. We acknowledge that the sample size of 20 certificates is limited and therefore interpret the results as exploratory rather than conclusive. The intent of this experiment is to illustrate the behavior of different model configurations under real-world constraints rather than to establish statistical superiority. This limitation reflects an important practical challenge in international admissions: institutions often receive very small numbers of examples for many certificate types across different countries, making it infeasible to develop large, high-quality training sets for layout-based models such as LayoutLMv3. In such settings, zero-shot multimodal LLMs provide a more practical and scalable alternative, while layout-aware verification remains valuable only where regulatory or transparency requirements necessitate explicit spatial grounding.
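For clarity, the document-level accuracy metric described above can be computed as in the short sketch below; the field names and worked example are illustrative.

```python
def document_level_accuracy(predictions: list[dict], ground_truth: list[dict]) -> float:
    """A certificate counts as correct only if every required field matches exactly."""
    correct = sum(pred == truth for pred, truth in zip(predictions, ground_truth))
    return correct / len(ground_truth)

# Worked example on a 20-certificate set: one fully incorrect document yields 19/20 = 0.95,
# matching the single-error case reported for Gemma 3 on the IELTS set.
truth = [{"overall_score": "7.5", "test_date": "2024-05-18"}] * 20
preds = list(truth)
preds[3] = {"overall_score": "1.5", "test_date": "2024-05-18"}  # one document-level error
print(document_level_accuracy(preds, truth))  # 0.95
```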
6.3. Predictive Modeling Experiments on Multimodal Admission Data
To quantify the contribution of multimodal analysis to predictive modeling in university admissions, we evaluated the proposed FAIR-VID framework on 5400 anonymized foreign student applications obtained from the admissions system. Each record included structured form data, uploaded supporting documents, and, for a subset of cases, video interview recordings processed through the FAIR-VID video pipeline.
For each application, a binary success variable was defined for three consecutive stages of the admission workflow:
Experiment 1—Fee Payment Prediction: Identifies applicants who complete the initial registration fee payment after document submission.
Experiment 2—Offer Prediction: Estimates the probability of receiving an admission offer, tested in two configurations:
without video data, using only structured and document features;
with integrated video–audio data from the FAIR-VID video pipeline.
Experiment 3—Enrollment Prediction: Predicts final enrollment after offer acceptance, incorporating behavioral interaction data (e.g., login frequency, payment timing).
Each task was modeled as a binary classification problem (success = 1, failure = 0). For Experiment 2 (Offer Prediction), the binary outcome corresponds to the actual admission offers issued by human admissions officers through the standard institutional evaluation process. The FAIR-VID pipeline was not involved in generating these decisions; it was used exclusively to produce predictive features for analytical purposes. Thus, the experiment assesses the extent to which the pipeline can approximate historical human judgments rather than predict outcomes of its own automated process. Data were partitioned using an 80/20 train–test split, and results were averaged across five randomized folds to ensure stability. Precision and recall were used as the primary evaluation metrics to capture the balance between predictive specificity and completeness, with the full performance summary presented in
Table 2.
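For reproducibility, the evaluation protocol can be sketched as follows; the classifier, feature matrix, and synthetic labels are placeholders, since the downstream predictive models themselves lie outside the scope of the preprocessing pipeline described here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression      # placeholder classifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

def evaluate(X: np.ndarray, y: np.ndarray, n_folds: int = 5):
    """80/20 split repeated over five random seeds; report mean precision and recall."""
    precisions, recalls = [], []
    for seed in range(n_folds):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=seed, stratify=y)
        clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        y_pred = clf.predict(X_te)
        precisions.append(precision_score(y_te, y_pred))
        recalls.append(recall_score(y_te, y_pred))
    return float(np.mean(precisions)), float(np.mean(recalls))

# Synthetic stand-in data (5400 applications, 32 fused features); real features come from the pipeline.
rng = np.random.default_rng(0)
X = rng.normal(size=(5400, 32))
y = (X[:, 0] + 0.5 * rng.normal(size=5400) > 0).astype(int)
print(evaluate(X, y))
```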
Experiment 1—Fee Payment Prediction. Early-stage application completeness and metadata consistency emerged as strong behavioral indicators. Applicants who provided all required credentials and adhered to the upload instructions were significantly more likely to complete the payment process. The resulting model achieved precision = 0.83 and recall = 0.79, effectively distinguishing compliant from uncertain or potentially fraudulent profiles.
Experiment 2—Offer Prediction. The introduction of video and audio modalities demonstrated clear performance gains. Without video features (Exp. 2a), the model achieved precision = 0.85 and recall = 0.82. When multimodal embeddings (Whisper-transcribed speech and Gemma 3-derived semantic representations) were added (Exp. 2b), both metrics improved by approximately 6 percentage points. This confirms that paralinguistic and visual cues—such as fluency, emotional tone, and authenticity—enhance evaluative robustness.
Experiment 3—Enrollment Prediction. The final stage integrated all available modalities, including behavioral activity logs. The model achieved precision = 0.87 and recall = 0.84, indicating stable predictive performance. However, these results also reflect the limits of algorithmic predictability, as external variables such as visa approval or financial circumstances introduce significant variance beyond the model’s scope.
Overall, the experiments confirm that multimodal fusion significantly improves the reliability of predictive admission modeling, particularly when incorporating video interview data. The results validate FAIR-VID’s design principles: (a) the complementary use of structured and unstructured data streams, (b) interpretable AI through modular pipelines, and (c) the preservation of human oversight. These findings provide empirical support for the project’s emphasis on transparent, explainable, and regulation-aligned AI in international admissions.
7. Conclusions and Future Work
This paper introduced FAIR-VID, a transparent and regulation-aligned framework that integrates document understanding, video analysis, and multimodal fusion into a coherent human-in-the-loop system. By combining two-phase video preprocessing with AI-driven document interpretation, FAIR-VID establishes a pipeline capable of converting heterogeneous raw data into interpretable, structured, and ethically auditable representations.
A defining feature of the project is its commitment to transparency and reproducibility. All major pipeline components, including document parsing, video deconstruction, visual enrichment, and multimodal fusion, are implemented as a set of Google Colab notebooks. These notebooks are openly accessible and fully executable, enabling researchers, educators, and policymakers to inspect, replicate, and extend every stage of the process. This open-source design ensures that FAIR-VID is not merely a technical framework but also an important research infrastructure that promotes accountability and community-driven innovation in AI-assisted profiling.
From an ethical and legal standpoint, the FAIR-VID architecture explicitly separates Local AI Agents, which process sensitive or identifiable data, from Cloud AI Agents, which operate only on synthetic or depersonalized inputs. This design operationalizes compliance with the EU AI Act and General Data Protection Regulation (GDPR), demonstrating how advanced AI reasoning can coexist with strict privacy protection and auditability requirements.
Future work will build on this foundation by extending multimodal learning to incorporate all available data streams—audio, transcript text, visual frames, and AI-generated descriptions—for semi-supervised fraud detection and holistic applicant assessment. In parallel, upcoming research will focus on benchmarking fairness and interpretability metrics across demographic and linguistic groups and developing shared evaluation datasets for reproducible research in ethical admissions AI. Through open collaboration and verifiable experimentation, FAIR-VID aims to foster a more transparent, fair, and community-governed future for AI-supported decision systems in higher education.
In addition to these broader research directions, further development of the document analysis component will investigate a transition from the current hybrid architecture toward more compact and energy-efficient end-to-end models, such as Donut-style OCR-free transformers. As the system accumulates a wider range of verified credentials, such models may become increasingly advantageous due to reduced annotation requirements and shorter inference times. At the same time, a layout-aware verification branch will be retained for cases that require explicit spatial grounding or compliance-oriented auditability. This evolution aims to balance scalability, transparency, and computational efficiency as the pipeline is adapted for large-scale and heterogeneous international admissions contexts.