Review Reports
- Algirdas Laukaitis*,
- Diana Kalibatienė and
- Dovilė Jodenytė
Reviewer 1: Tatiana Kustitskaya
Reviewer 2: Anonymous
Reviewer 3: Roee Peretz
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This article proposes a preprocessing pipeline for student application analysis that handles multimodal data—text, video, and audio—to build a comprehensive applicant profile for a simpler and fairer application offer decision-making process. The authors have conducted substantial work on the models supporting each step of the admission process, aiming to maximize the extraction of information about an applicant not only from submitted documents but also from interviews, while also verifying this information for manipulation techniques and document forgery.
However, the following remarks should be addressed:
1) To provide readers with better context for the research, the article should include a comprehensive overview of existing AI-based data preprocessing systems for similar tasks described in the literature or explicitly state the absence of such systems.
2) The caption for Figure 1 mentions the FAIR-VID project, which has not been introduced in the article at that point. When encountering Figure 1, the reader is not yet familiar with the FAIR-VID project. Moreover, upon its first mention in the text (Introduction, page 3), it is not explicitly stated that the FAIR-VID pipeline is the authors' own development, proposed in this article.
3) The conclusion states that "All major pipeline components, including document parsing, video deconstruction, visual enrichment, and multimodal fusion, are implemented as a set of Google Colab notebooks. These notebooks are openly accessible." However, a link to these Google Colab notebooks should be provided.
5) The information from Section 3.1.3 regarding future directions is recommended to be moved to the "Conclusion and Future Work" section.
6) Section 5.5 states: "the applicant's synthesized profile is vectorized and used to query a database of prior decisions. The top-N most similar cases and their historical outcomes are retrieved to provide an empirical benchmark for comparison." This passage requires the following clarifications:
- It should be explained how the admission decisions (accept/reject) were originally made for these prior, "similar cases." Specifically, were those past decisions also made using profiles synthesized by the FAIR-VID pipeline, or through a different, perhaps traditional, process?
- If the historical decisions were indeed made using the FAIR-VID pipeline, this recursive application raises the potential for bias that should be discussed and perhaps noted as a limitation of the presented methodology.
- The term "historical outcomes" requires clarification. Does it refer solely to the admission decision (accepted/rejected), or does it also encompass the applicant's subsequent academic performance and ultimate success (e.g., graduation)?
7) Section 6.2 should clarify how the accuracy metric for "model performance on IELTS Certificates" is calculated. Based on Table 1, it appears that the number of extraction errors is divided by the total number of documents.
It is unclear whether this metric differentiates between scenarios where two errors occur in two separate documents versus two errors occurring in a single document. While such differentiation may not be necessary, the rationale for the chosen approach should be explicitly provided.
Furthermore, it must be noted that a sample size of 20 objects is insufficient for a robust comparison of model quality. It is recommended either expanding the dataset or refraining from using these results to draw conclusions about the superiority of one model configuration over another.
8) Section 6.2 describes an experiment predicting a binary success variable for three stages of the admission workflow. It is unclear from where the true response value for Experiment 2 is derived.
Specifically, does this value represent whether an admission offer was received (or not) through the standard procedure of human-made decisions? Or, is the admission offer granted or denied by the FAIR-VID pipeline itself?
If the latter is the case, it is necessary to clarify the purpose and meaning of such a prediction.
Author Response
Thank you for the constructive and detailed feedback. We carefully revised the manuscript, and below we address each comment point-by-point.
-------
To provide readers with better context for the research, the article should include a comprehensive overview of existing AI-based data preprocessing systems for similar tasks described in the literature or explicitly state the absence of such systems.
We expanded Section 2 with a detailed discussion of existing AI-based preprocessing solutions and clarified that no prior work provides an end-to-end multimodal preprocessing pipeline for admissions or HR contexts.
-----------
The caption for Figure 1 mentions the FAIR-VID project, which has not been introduced in the article at that point. When encountering Figure 1, the reader is not yet familiar with the FAIR-VID project. Moreover, upon its first mention in the text (Introduction, page 3), it is not explicitly stated that the FAIR-VID pipeline is the authors' own development, proposed in this article.
We thank the reviewer for noting that the FAIR-VID project was mentioned before being formally introduced. To address this, we revised the opening paragraph of the Introduction to introduce FAIR-VID at the very beginning of the paper and explicitly state that it is the authors’ own multimodal preprocessing pipeline developed and proposed in this article. This provides readers with clear context from the outset and ensures that subsequent references, including the figure caption, are immediately understandable. The revised paragraph now appears on page 1.
-----------
The conclusion states that "All major pipeline components, including document parsing, video deconstruction, visual enrichment, and multimodal fusion, are implemented as a set of Google Colab notebooks. These notebooks are openly accessible." However, a link to these Google Colab notebooks should be provided.
We thank the reviewer for this valuable suggestion. We have addressed this concern by implementing the following changes:
- GitHub Repository Establishment
We have created a dedicated GitHub repository at https://github.com/fair-vid/video-multimodal-pipeline containing initial demonstration notebooks for the video information processing pipeline. The repository includes:
- Demo_video_multimodal_extraction.ipynb – extracting audio, text transcriptions, and frames from video interviews
- Demo_image_description_generation.ipynb – generating AI-powered semantic descriptions of visual content
These notebooks are directly accessible through Google Colab via clickable badges in the repository's /notebooks/ directory, allowing immediate experimentation without local installation requirements.
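For illustration, a minimal Python sketch of the kind of steps covered by Demo_video_multimodal_extraction.ipynb is shown below; this is not the notebook's exact code, and file names, paths, and the Whisper model size are placeholders:

```python
# Illustrative sketch only; file names and the Whisper model size are
# placeholders, not the notebook's exact code.
import subprocess
import cv2        # opencv-python
import whisper    # openai-whisper

VIDEO = "interview.mp4"

# 1) Extract a mono 16 kHz audio track (assumes ffmpeg is installed).
subprocess.run(["ffmpeg", "-y", "-i", VIDEO, "-vn", "-ac", "1", "-ar", "16000",
                "interview.wav"], check=True)

# 2) Transcribe the interview audio with Whisper (robust to non-native accents).
transcript = whisper.load_model("base").transcribe("interview.wav")["text"]

# 3) Capture a representative frame at the temporal midpoint of the video.
cap = cv2.VideoCapture(VIDEO)
cap.set(cv2.CAP_PROP_POS_FRAMES, int(cap.get(cv2.CAP_PROP_FRAME_COUNT) // 2))
ok, frame = cap.read()
if ok:
    cv2.imwrite("midpoint_frame.jpg", frame)
cap.release()

print(transcript[:200])
```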
- Data Availability Statement
We have added an explicit Data Availability Statement to the manuscript with a direct link to the repository.
- Ongoing Development Timeline
It is important to note that the FAIR-VID project is under active development with a planned two-year implementation timeline. The current repository represents the initial proof-of-concept implementation of the video processing pipeline. Additional components will be progressively released, including:
- Document parsing and analysis modules
- Advanced multimodal fusion frameworks
- Semi-supervised learning implementations
- Fraud detection algorithms
- Integration APIs for institutional systems
We will maintain version control and comprehensive documentation throughout the development cycle, ensuring reproducibility and transparency at each stage. All major releases will be accompanied by detailed technical documentation and usage examples.
We believe these additions substantially improve the transparency and accessibility of our work, aligning with best practices for open science and reproducible research.
-----------
The information from Section 3.1.3 regarding future directions is recommended to be moved to the "Conclusion and Future Work" section.
We thank the reviewer for the helpful suggestion. In the revised manuscript, the text in Section 3.1.3 that discussed prospective development of the document analysis component has been relocated to the “Conclusion and Future Work” section. This material now appears at the end of the conclusion and is integrated into the broader discussion of future research directions. The original text in Section 3.1.3 has been rewritten to retain only methodological content, ensuring a clear separation between current methods and future work.
-----------
Section 5.5 states: "the applicant's synthesized profile is vectorized and used to query a database of prior decisions. The top-N most similar cases and their historical outcomes are retrieved to provide an empirical benchmark for comparison." This passage requires the following clarifications:
- It should be explained how the admission decisions (accept/reject) were originally made for these prior, "similar cases." Specifically, were those past decisions also made using profiles synthesized by the FAIR-VID pipeline, or through a different, perhaps traditional, process?
- If the historical decisions were indeed made using the FAIR-VID pipeline, this recursive application raises the potential for bias that should be discussed and perhaps noted as a limitation of the presented methodology.
- The term "historical outcomes" requires clarification. Does it refer solely to the admission decision (accepted/rejected), or does it also encompass the applicant's subsequent academic performance and ultimate success (e.g., graduation)?
We thank the reviewer for this valuable observation. The manuscript has been updated to clarify that all historical decisions used for similarity-based retrieval were produced exclusively by human admissions officers through the institution’s standard evaluation process.
Historical outcomes refer solely to the binary result of these prior evaluations (offer granted or denied). We further explain that the system avoids recursive self-reinforcement because no FAIR-VID-generated predictions are used as input data for retrieval.
At the same time, we acknowledge that, if institutions later adopt FAIR-VID operationally, accumulated decisions may gradually incorporate model influence. This is now discussed as a potential future limitation, and we recommend periodic auditing and recalibration of the historical database.
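For clarity, the retrieval step can be summarized by the following minimal sketch; the embedding function is a placeholder, and the actual embedding model and database used in the pipeline are not specified here:

```python
# Minimal sketch of top-N retrieval over prior human decisions; embed() is a
# placeholder for whatever profile-embedding model is used in practice.
import numpy as np

def embed(profile_text: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(profile_text)) % (2**32))
    return rng.standard_normal(384)          # stand-in embedding

def top_n_similar(applicant_profile: str, history: list, n: int = 5):
    """history: (profile_text, human_decision) pairs; returns the N closest cases."""
    q = embed(applicant_profile)
    scored = []
    for text, decision in history:
        v = embed(text)
        cos = float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-9))
        scored.append((cos, decision, text))
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:n]                        # empirical benchmark for the officer
```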
-----------
Section 6.2 should clarify how the accuracy metric for "model performance on IELTS Certificates" is calculated. Based on Table 1, it appears that the number of extraction errors is divided by the total number of documents.
It is unclear whether this metric differentiates between scenarios where two errors occur in two separate documents versus two errors occurring in a single document. While such differentiation may not be necessary, the rationale for the chosen approach should be explicitly provided.
Furthermore, it must be noted that a sample size of 20 objects is insufficient for a robust comparison of model quality. It is recommended either expanding the dataset or refraining from using these results to draw conclusions about the superiority of one model configuration over another.
We appreciate the reviewer’s careful analysis and have substantially revised Section 6.2 to clarify the evaluation procedure. We now explicitly state that accuracy was computed at the document level, defined as the proportion of certificates for which all required fields were extracted correctly.
Because IELTS certificates share a uniform layout and contain a small, fixed number of fields, each certificate is treated as a single evaluation unit, and differentiating between multiple errors within one document versus across documents is not necessary for this use case. The revised section also acknowledges that the sample size of 20 certificates is limited; the results are therefore presented as exploratory rather than conclusive and are not used to claim statistical superiority of any model.
We also explain that expanding the dataset was not feasible under the current research constraints.
For many countries, institutions receive only a small number of certificate examples per year, often fewer than 20, which mirrors the small-sample condition used in our evaluation. In addition, layout-based models such as LayoutLMv3 require annotated training sets, which are labor-intensive to produce for each document type. These practical constraints motivate our emphasis on zero-shot multimodal LLMs and document-level evaluation. A revised and extended explanation has been added to Section 6.2.
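For readers who want the metric spelled out, a minimal sketch of this document-level scoring is given below; the field names are illustrative:

```python
# Document-level accuracy: a certificate counts as correct only if every
# required field was extracted exactly. Field names are illustrative.
REQUIRED_FIELDS = ("candidate_name", "overall_band", "test_date")

def document_level_accuracy(predictions, ground_truth):
    """predictions, ground_truth: equal-length lists of per-certificate dicts."""
    correct = sum(
        all(p.get(f) == g.get(f) for f in REQUIRED_FIELDS)
        for p, g in zip(predictions, ground_truth)
    )
    return correct / len(ground_truth)
```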
-----------
Section 6.3 describes an experiment predicting a binary success variable for three stages of the admission workflow. It is unclear from where the true response value for Experiment 2 is derived.
Specifically, does this value represent whether an admission offer was received (or not) through the standard procedure of human-made decisions? Or, is the admission offer granted or denied by the FAIR-VID pipeline itself?
If the latter is the case, it is necessary to clarify the purpose and meaning of such a prediction.
We thank the reviewer for raising this important clarification. In Experiment 2 (Offer Prediction), the true response variable reflects the actual admission decisions issued by human admissions officers through the institution’s standard evaluation process.
The FAIR-VID pipeline did not generate these decisions; it served only to extract and represent features for research analysis.
Accordingly, the experiment evaluates the pipeline’s ability to approximate historical human decision-making rather than to predict outcomes of its own automated recommendations.
This clarification has been added to Section 6.3, which now explicitly states that the ground-truth labels originate from human-made decisions.
Reviewer 2 Report
Comments and Suggestions for Authors
The paper is well written and well structured. The English language does not require major improvements; however, on page 12, paragraph 4, please unify the terminology (“AI agent” vs. “agent”) for consistency.
The topic is relevant and timely.
However, please focus more on the Methods section. It currently lacks key technical details, such as model parameters, prompt formulations, preprocessing settings, and configuration information, which are necessary for reproducibility and scientific completeness.
Author Response
1.
The paper is well written and well structured. The English language does not require major improvements; however, on page 12, paragraph 4, please unify the terminology (“AI agent” vs. “agent”) for consistency. The topic is relevant and timely.
We appreciate the reviewer’s encouraging words regarding the structure and writing quality of our manuscript. We have addressed the inconsistency noted on page 12 by standardizing the terminology throughout the paper. We now consistently use the full terms "Local AI Agent" and "Cloud AI Agent" (or "AI agent" generally) rather than the abbreviated "agent" to prevent any ambiguity.
-----------
2.
However, please focus more on the Methods section. It currently lacks key technical details, such as model parameters, prompt formulations, preprocessing settings, and configuration information, which are necessary for reproducibility and scientific completeness.
We thank the reviewer for highlighting the need to strengthen the Methods section with technical details essential for reproducibility.
In the revised manuscript, we have substantially expanded Sections 3–5 to provide explicit descriptions of model configurations, prompt formulations, preprocessing settings, and system architecture. Below we summarize the key additions.
1) Model Parameters and Architectural Choices. We clarified that the primary inference engine is Gemma 3 (27B) operating in zero-shot mode, selected for its ability to generalize across heterogeneous credential formats without fine-tuning. We now explicitly state that the model is deployed on a single institutional GPU for privacy compliance and that a complementary LayoutLMv3 + OCR branch is retained for spatial grounding and auditability.
For the video processing section, we added an explanation that the pipeline uses Whisper for multilingual speech transcription, due to its robustness to non-native accents, and a multimodal generative model for visual enrichment (image-to-text conversion). For the fusion strategy section, we detailed the staged fusion architecture—document reasoning first, followed by video/audio integration, and finally retrieval-augmented synthesis using historical human decisions. These choices are now justified in terms of accuracy vs. annotation cost, privacy vs. compute trade-offs, and forward compatibility (Sections 3.3 and 5.4).
2) Prompt Formulations. Because prompt design critically influences LLM behavior, we added a link to the open prompt repository (https://github.com/fair-vid/admission_prompts), enabling external auditing and reproducibility.
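As an indication of the style of these prompts, a hypothetical zero-shot extraction prompt for IELTS certificates might look as follows; the exact wording and field set used in FAIR-VID are those published in the repository:

```python
# Hypothetical example only; the authoritative prompts are in the
# admission_prompts repository linked above.
IELTS_EXTRACTION_PROMPT = """You are an admissions document parser.
From the attached IELTS Test Report Form, extract the following fields and
return them as a single JSON object with exactly these keys:
  candidate_name, date_of_birth, test_date,
  listening, reading, writing, speaking, overall_band, trf_number
If a field is unreadable, use null. Return JSON only, with no commentary."""
```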
3) Preprocessing Settings. We now detail the video deconstruction process, i.e., extraction of the audio track, Whisper-based transcripts, and representative video frames (temporal midpoint) for authenticity and behavioral analysis.
4) Configuration and Reproducibility. We detailed and extended our open-source implementations on GitHub. All major pipeline components (document parsing, video deconstruction, visual enrichment, multimodal fusion) are implemented as Google Colab notebooks and published in our GitHub repository. The FAIR-VID project now has four open-source repositories on GitHub, which we plan to extend in the future. Transparency measures such as prompts, configuration files, and version-controlled scripts are accessible for external verification. For data governance, we clarified the local/cloud split for privacy compliance and the depersonalization performed before cloud reasoning (Section 5.2).
5) Evaluation Details. We expanded Section 6.2 to explain that accuracy is computed at the document level, the rationale for treating each IELTS certificate as a single evaluation unit, and the limitations of the small sample size, with the results presented as exploratory rather than conclusive.
Reviewer 3 Report
Comments and Suggestions for Authors
Dear Authors,
The manuscript presents an ambitious and technically sophisticated multimodal AI pipeline for applicant evaluation. The topic is timely and relevant, and the integration of document understanding, video analysis, and fairness-oriented architectural design is valuable. However, several sections would benefit from refinement to improve clarity, methodological transparency, and alignment with established evaluation standards. I therefore recommend minor-to-moderate revisions before the manuscript can be accepted.
- Several sections are too heavy with technical descriptions, making it difficult for the reader to extract the main conceptual contributions. The authors should consider restructuring long paragraphs, adding guiding subheadings, and clarifying transitions.
- While the pipeline is detailed, important methodological choices—such as threshold settings, model parameterization, and criteria for fusion—are not sufficiently justified. The paper would benefit from clearer explanations of:
  - why these models were selected over alternatives,
  - how performance trade-offs were evaluated,
  - what limitations the authors acknowledge in their approach.
- Figures are informative but visually dense. The reader would benefit from:
  - more explicit explanations of workflow diagrams,
  - clearer separation between stages,
  - more descriptive figure captions that interpret, and not just describe, the diagrams.
- The manuscript is generally readable, but several sentences are overly long and contain unnecessary technical redundancy. Careful language polishing could significantly improve flow and precision.
- The experimental results are promising but somewhat narrow:
  - The dataset of 20 standardized certificates is small for meaningful benchmarking.
  - Predictive modeling is reported at a high level, without deeper analysis (e.g., ablation, calibration, error inspection).
  Further justification, or an explicit acknowledgment of these limitations, is needed.
- While the architecture aligns with regulatory expectations, the discussion could benefit from addressing:
  - risks of applicant profiling,
  - fairness risks in multimodal embeddings,
  - reproducibility concerns in generative AI–based enrichment.
  A brief reflections subsection would strengthen the contribution.
- Some definitions (e.g., “generative enrichment”) could be introduced earlier or with clearer examples.
- Consider reporting confidence intervals or variance for predictive metrics.
- Ensure all acronyms are defined on first use.
Author Response
--------
Several sections are too heavy with technical descriptions, making it difficult for the reader to extract the main conceptual contributions. The authors should consider restructuring long paragraphs, adding guiding subheadings, and clarifying transitions.
We thank the reviewer for noting that several sections were overly dense and could obscure the main conceptual contributions. In response, we have substantially revised Sections 3–5 to improve readability and structural clarity.
Specifically, we:
- restructured long paragraphs into shorter, concept-focused units,
- added new guiding subheadings to make the narrative progression explicit, and
- inserted lead-in and transition sentences at the beginning of major sections and subsections to explain how each component connects to the overall architecture.
These revisions clarify the conceptual flow of the FAIR-VID pipeline and make the technical descriptions easier to follow.
--------
While the pipeline is detailed, important methodological choices—such as threshold settings, model parameterization, and criteria for fusion—are not sufficiently justified. The paper would benefit from clearer explanations of:
why these models were selected over alternatives,
how performance trade-offs were evaluated,
what limitations the authors acknowledge in their approach.
We thank the reviewer for highlighting the need for clearer justification of methodological choices.
We have implemented several targeted revisions to address these points.
First, we expanded the discussion of model selection in Section 3 by explaining why Gemma 3 was chosen over Donut-, LayoutLM-, and GPT-4o–style models. We now explicitly state that Gemma 3 offers strong zero-shot performance without requiring large annotated datasets, which is essential given the global diversity of credentials in admissions workflows. We also clarify that its open-source licensing and ability to run on local hardware were decisive factors for regulatory compliance and institutional deployment, whereas proprietary models would not satisfy locality constraints. These additions make the architectural motivations more transparent.
Second, we added a description of the performance trade-offs and the design logic behind the dual-path document-processing strategy.
We further clarify that Gemma-generated structured outputs may, in future, serve as training data for layout-aware models—emphasizing the forward-compatible nature of the design.
Third, we expanded the justification of Whisper for audio transcription, noting its observed robustness to non-native English accents typical of international student interviews.
We also explicitly acknowledge a limitation: these observations are qualitative and not quantitatively benchmarked, but the paper intentionally avoids expanding the scope of the evaluation because audio transcription is a minor component of the overall pipeline.
Fourth, we added the criteria for multimodal fusion and the rationale behind the staged ordering. The revised text now emphasizes that document-based reasoning provides the most reliable evidence foundation, with video-derived cues functioning as contextual refinements, and retrieval-augmented reasoning providing institutional consistency.
We also acknowledge that fusion stability depends on the size and diversity of the historical dataset and explicitly identify this as a limitation.
--------------
Figures are informative but visually dense. The reader would benefit from:
more explicit explanations of workflow diagrams,
clearer separation between stages,
more descriptive figure captions that interpret, and not just describe the diagrams.
We appreciate the feedback regarding the visual density of the figures. To address this, we have revised the corresponding sections in the body text to explicitly explain the workflow logic and the specific separation between processing stages (e.g., local vs. cloud, automated vs. human-in-the-loop). Additionally, we have rewritten the figure captions to be interpretive rather than purely descriptive, ensuring the reader understands the functional purpose of each diagram component.
------------------
The manuscript is generally readable, but several sentences are overly long and contain unnecessary technical redundancy. Careful language polishing could significantly improve flow and precision.
We appreciate the feedback on the manuscript's flow.
We have conducted a careful language polishing throughout the paper, targeting and shortening several long sentences that contained technical redundancies.
This effort aims to improve the readability of the text.
------
The experimental results are promising but somewhat narrow: The dataset of 20 standardized certificates is small for meaningful benchmarking. Predictive modeling is reported at a high level, without deeper analysis (e.g., ablation, calibration, error inspection). Further justification, or an explicit acknowledgment of these limitations, is needed.
We thank the reviewer for this feedback.
We agree that the sample size for the document analysis component is limited and that the predictive modeling section benefits from clearer contextualization regarding its scope.
We have revised the manuscript to explicitly acknowledge these limitations and to clarify how the current experiments serve as a proof-of-concept study.
Regarding the dataset size for certificate analysis (Section 6.2):
We have updated Section 6.2 to explicitly state that the sample size of 20 certificates is limited and that the results should be interpreted as "exploratory rather than conclusive".
We further clarified the practical constraints that necessitated this approach:
Data Scarcity: In the context of international admissions, institutions often receive very small numbers of specific certificate types per year, mirroring the small-sample condition used in our evaluation.
Annotation Costs: We explain that large-scale quantitative benchmarking for layout-aware models (like LayoutLMv3) is often infeasible due to the prohibitive cost of annotating bounding boxes for diverse international document formats.
We have added a dedicated explanation in Section 6.2 acknowledging these constraints and clarifying that the experiment aims to illustrate model behavior under real-world restrictions rather than to claim statistical superiority.
Regarding predictive modeling and ablation (Section 6.3): We appreciate the suggestion to include ablation analysis. We wish to highlight that Experiment 2 (Offer Prediction) was designed specifically as an ablation study to isolate the contribution of the proposed video processing pipeline.
Table 2 compares Experiment 2a (using only form and document data) against Experiment 2b (adding video/audio embeddings). This comparison functions as an ablation of the video modality, demonstrating that the inclusion of the FAIR-VID video analysis improved precision and recall by approximately 6 percentage points.
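To make the ablation logic concrete, the following sketch uses synthetic data and illustrative feature dimensions (it is not the actual experiment code) to show how the document-only and document-plus-video configurations are compared:

```python
# Illustrative ablation sketch: the same classifier is trained on document-only
# features (Experiment 2a) and on document + video/audio features (2b).
# Data and dimensions are synthetic, for demonstration only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)
n = 400
X_doc = rng.standard_normal((n, 20))     # form + document features
X_vid = rng.standard_normal((n, 32))     # video/audio embeddings
y = (X_doc[:, 0] + 0.5 * X_vid[:, 0] + rng.standard_normal(n) > 0).astype(int)

for name, X in [("2a: documents only", X_doc),
                ("2b: documents + video/audio", np.hstack([X_doc, X_vid]))]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    print(name, "precision=%.2f recall=%.2f"
          % (precision_score(y_te, pred), recall_score(y_te, pred)))
```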
We acknowledge that deeper error inspection and calibration curves are necessary for a production-ready system. We have updated the Conclusion and Future Work section to specify that upcoming research will focus on "benchmarking fairness and interpretability metrics" and developing shared evaluation datasets.
-------
While the architecture aligns with regulatory expectations, the discussion could benefit from addressing: risks of applicant profiling, fairness risks in multimodal embeddings, reproducibility concerns in generative AI–based enrichment.
A brief reflections subsection would strengthen the contribution.
We thank the reviewer for this valuable suggestion. We fully agree that the ethical and reproducibility challenges of multimodal AI are central to the validity of our contribution. In the revised manuscript, rather than isolating these points in a separate subsection, we have integrated a deeper "reflections" discussion directly into the Methodology and Discussion sections. This ensures that the ethical analysis is tightly coupled with the technical architecture.
We have strengthened the discussion in the following three areas:
Risks of Applicant Profiling and Recursive Bias:
Data Depersonalization: In Section 5.2 (Local vs. Cloud Agents), we explicitly discuss how the system mitigates profiling risks by ensuring that high-level reasoning occurs only on "generalized, anonymized profiles" (see the sketch after this list). We emphasize that the Cloud AI Agent never accesses personal identifiers, satisfying the "proportionality and risk mitigation" requirements of the EU AI Act.
Recursive Bias: In Section 5.5, we added a specific acknowledgment of the risks of "recursive self-reinforcement" if the system’s outputs are eventually used to train future iterations. We explicitly list this as a limitation, noting that "accumulated decisions may gradually incorporate model influence," and recommend periodic auditing.
Fairness Risks in Multimodal Embeddings: In Section 7 (Conclusion and Future Work), we have committed to a specific research roadmap for "benchmarking fairness and interpretability metrics across demographic and linguistic groups", acknowledging that this remains an active challenge for the field.
Reproducibility in Generative AI:
Open Prompts Policy: To address the variability of Generative AI, we added Section 5.1 (Transparency and Open Access of Prompts). We argue that because prompt structure dictates model behavior, "exposing these prompts enables external auditing". We now link to a repository containing the exact prompts used for document and video interpretation.
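A minimal sketch of the depersonalization step referenced above is given below; the field names and masking rules are illustrative and not the project's exact implementation:

```python
# Illustrative depersonalization before any cloud-side reasoning: direct
# identifiers are dropped or masked so the Cloud AI Agent only sees a
# generalized profile. Field names and rules are hypothetical.
import re

DIRECT_IDENTIFIERS = {"full_name", "passport_number", "email", "phone", "date_of_birth"}

def depersonalize(profile: dict) -> dict:
    cleaned = {k: v for k, v in profile.items() if k not in DIRECT_IDENTIFIERS}
    # Also scrub email- and phone-like strings that leaked into free-text fields.
    for k, v in cleaned.items():
        if isinstance(v, str):
            v = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[email]", v)
            v = re.sub(r"\+?\d[\d\s-]{7,}\d", "[phone]", v)
            cleaned[k] = v
    return cleaned
```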
-------
Some definitions (e.g., “generative enrichment”) could be introduced earlier or with clearer examples.
We have reviewed the manuscript to ensure that each key term is defined explicitly at its first technical introduction.
Demonstrable Examples via Open Source: We recognized that static text is often insufficient to fully illustrate complex multimodal processing. Therefore, to provide the "clearer examples" requested, we have established a public GitHub repository at https://github.com/fair-vid/video-multimodal-pipeline.
We specifically included a notebook titled Demo_image_description_generation.ipynb. This notebook serves as a practical, executable definition of "generative enrichment," allowing readers to see exactly how the model takes a video frame and generates a structured semantic description.
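To give a flavor of what the notebook demonstrates, the following stand-in sketch captions a single extracted frame with an off-the-shelf captioning model; BLIP is used here purely as an example, and the notebook may rely on a different multimodal model and richer structured prompts:

```python
# Stand-in illustration of "generative enrichment": converting a video frame
# into a textual description. BLIP is only an example captioner here.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

frame = Image.open("midpoint_frame.jpg").convert("RGB")   # illustrative file name
inputs = processor(images=frame, return_tensors="pt")
caption_ids = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(caption_ids[0], skip_special_tokens=True))
```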
Ongoing Development: As this paper presents the initial phase of the FAIR-VID project, which is under active development with a two-year implementation timeline, we plan to continuously expand this repository. We will add further examples and documentation to illustrate and clarify emerging technical concepts as the project evolves, ensuring that the definitions remain grounded in accessible, reproducible code.
------
Ensure all acronyms are defined on first use.
We thank the reviewer for this careful observation. We have thoroughly reviewed the manuscript and expanded all acronyms at their first occurrence, ensuring clarity for a broad readership.
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
The authors have fully addressed my comments. I believe the article can be published in its current version.