1. Introduction
Candidate hiring decisions have long-term organizational consequences, yet interview-based evaluations remain heavily dependent on subjective judgment. Human interviewers have inherent limitations, including cognitive bias, inconsistency across raters, and difficulties processing multidimensional behavioral data simultaneously [
1,
2]. These limitations are most pronounced in candidate assessments, where leadership potential emerges through verbal content, nonverbal behavior, and dispositional traits.
Video interviews represent a valuable data source for computational analysis, capturing speech, gaze, facial dynamics, and body language in a single recording. By quantifying these signals at scale, AI technologies could augment human judgment [
3,
4,
5]. However, existing systems usually take one of two approaches: single-modality analysis (for example, speech or text) or end-to-end black-box prediction. Both limit practical application in personnel selection: single-modality systems neglect complementary behavioral channels, while black-box models lack the transparency needed for high-stakes judgments.
Prior research has investigated the importance of multiple features and factors in candidate interview assessment. Nonverbal behaviors such as eye contact, facial expressivity, and head movements signal confidence and interpersonal effectiveness [
6]. Large language models now enable scalable analysis of verbal content, capturing strategic reasoning and communication quality [
7]. Personality traits, particularly the Big Five, predict leadership emergence and professional readiness. Despite evidence supporting each modality, prior work rarely integrates all three into a single framework. In the rare systems where integration occurs, the emphasis typically falls on predictive performance over interpretability.
In this work, we present an interpretable multimodal method for candidate interview assessment that incorporates computer vision and large language models to extract features from the video interview. The method examines three parallel streams: (i) nonverbal signals collected from video (gaze engagement, head motion dynamics, and facial expressivity), (ii) spoken content assessed using LLM-based behavioral coding across leadership-relevant dimensions, and (iii) Big Five personality traits estimated using deep learning models. Rather than developing an end-to-end predictor, we aggregate low-level features into three theory-based constructs (professional-cognitive competence, observed leadership behavior, and leadership disposition) that combine to produce an interpretable Top Potential Score.
The contribution of this paper is summarized as follows:
A multi-modal, AI-based method integrating video-based nonverbal analysis, personality trait estimation, and LLM-driven verbal assessment;
An interpretable, construct-level feature design that bridges LLM outputs with established leadership and assessment theories;
Empirical validation demonstrating the method’s effectiveness in identifying high-potential candidates.
The rest of the paper is divided as follows:
Section 2 explores the related studies.
Section 3 presents the proposed framework.
Section 4 delves into the experiments and results.
Section 5 analyzes the limitations and future steps.
Section 6 concludes the research.
2. Related Work
Automated interview analysis is becoming increasingly popular as businesses seek scalable and objective alternatives to traditional human-centered hiring processes. Computer vision methods have been applied to quantify eye contact, facial expressivity, head motion, and posture from interview videos. These nonverbal cues correlate with perceived confidence, social presence, and hiring outcomes in empirical studies [
8,
9,
10]. However, many of these systems rely on black-box classifiers that have been trained end-to-end, restricting interpretability and raising issues about fairness and deployment in high-stakes environments [
11].
Natural language processing approaches use interview transcripts to predict employability or communication quality. Recent research uses large language models (LLMs) to assess behavioral aspects like clarity, strategic thinking, and confidence [
12,
13,
14]. While LLMs allow for scalable semantic analysis, they are often used as direct predictors rather than structured coding instruments. This raises concerns about demographic and cultural biases in model outputs [
15]. Few studies have anchored LLM outcomes in known evaluation frameworks or combined them with nonverbal or dispositional indicators.
Personality assessment, notably using the Big Five model, is a well-established aspect of personnel selection research. Openness, extraversion, emotional stability, and conscientiousness have all been linked to leadership development and managerial performance. Prior research has investigated personality prediction from text or video [
16,
17,
18,
19,
20]; however, personality measurements are frequently omitted from multimodal pipelines or combined in an ad hoc manner without clear construct-level interpretation.
In contrast to previous approaches, this study provides an interpretable multi-modal AI framework that incorporates nonverbal indicators, LLM-based verbal assessments, and personality factors into theory-driven leadership constructs. Rather than pursuing predictive accuracy through supervised learning, the method focuses on statistical robustness, effect magnitude, and rank-based screening. This establishes the suggested method as a complementary alternative to black-box hiring methods, especially for candidate interview assessment where transparency and data efficiency are crucial.
The suggested framework fits into the intrinsically interpretable (ante-hoc) paradigm of Explainable AI (XAI) research [
21]. In contrast to post-hoc explanation techniques that seek to explain black-box predictions, our technique incorporates interpretability into the architecture by design. In line with the taxonomy of Explainable AI, our technique offers: (1) global interpretability, providing insights into the relationships between high-level constructs (verbal, non-verbal, personality) and the final Top Potential Score, and (2) decomposability, enabling the individual construct contributions to be audited. This places the technique within emerging research that privileges transparency over marginal accuracy gains in high-stakes decision contexts.
3. Proposed Method
This section introduces an interpretable multi-modal AI-based method for evaluating candidate interviews (see
Figure 1). The method uses three complementary modalities to convert unstructured interview recordings into meaningful construct-level indicators of leadership potential and seniority readiness. It integrates computer vision for nonverbal behavioral analysis, large language models (LLMs) for verbal content evaluation, and deep learning models for dispositional assessment (personality trait estimation).
The method uses three parallel analysis streams to process candidate data from a recorded video interview:
Nonverbal Behavioral Analysis: Extraction of eye-gaze, head motion, and facial expressivity cues from video frames.
Verbal Content Analysis: Structured evaluation of transcribed responses using LLM-based behavioral coding.
Personality Trait Estimation: Estimation of Big Five personality traits using deep learning models [
19].
Each modality produces standardized feature vectors that aggregate into three interpretable constructs: Professional–Cognitive Competence (verbal reasoning and strategic thinking), Observed Leadership Behavior (nonverbal expressivity and engagement), and Leadership Disposition (personality-based readiness). These constructs combine into a single Top Potential Score for candidate screening. The method is construct-driven: every computational signal maps directly to established concepts in industrial–organizational criteria, ensuring both interpretability and theoretical grounding.
3.1. Nonverbal Behavioral Feature Extraction
We used modern computer vision algorithms to extract nonverbal features from interview videos. The largest face in each frame was identified as the candidate’s face and retained for further analysis.
3.1.1. Gaze Engagement
The L2CS-Net [
22] model with a ResNet-50 backbone was used to estimate gaze direction. Eye-contact engagement was measured as the percentage of frames in which the candidate’s visual axis crossed the camera region. In high-stakes interpersonal situations, maintaining eye contact with the interviewer (represented by the camera) is an established behavioral indicator of focus, self-assurance, and social presence.
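The frame-counting step described above can be sketched as follows. This is a minimal illustration, assuming per-frame gaze angles in degrees with (0, 0) meaning "looking straight at the camera"; the 10-degree angular window is an illustrative assumption, not the study's actual threshold.

```python
import math

def gaze_engagement(gaze_angles, threshold_deg=10.0):
    """Fraction of frames in which the gaze falls within an angular
    window around the camera axis.

    gaze_angles: list of (yaw, pitch) pairs per frame, in degrees.
    The threshold is a hypothetical value chosen for illustration.
    """
    if not gaze_angles:
        return 0.0
    hits = sum(
        1 for yaw, pitch in gaze_angles
        if math.hypot(yaw, pitch) <= threshold_deg
    )
    return hits / len(gaze_angles)
```

In practice the (yaw, pitch) inputs would come from a gaze estimator such as L2CS-Net applied to each detected face crop.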
3.1.2. Head Movement
MediaPipe [
23] with the Face Mesh solution (which provides 468 facial landmarks and enables estimation of head pose dynamics) was used to estimate three-dimensional head pose trajectories and extract rotation angles (roll, pitch, and yaw). To measure behavioral expressivity and engagement, three motion-derived features were calculated from these trajectories:
Pitch variance (nodding intensity), linked to active listening and agreement;
Yaw variance (shaking intensity), possibly indicating doubt or disagreement;
Head motion energy captures total expressiveness and engagement.
These characteristics jointly define behavioral dynamism and nonverbal expressivity, which are especially important in leadership perception.
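The three motion features can be sketched from the per-frame rotation angles as follows. The exact definitions here are illustrative assumptions: motion energy is taken as the mean frame-to-frame angular change summed over the three axes.

```python
import statistics

def head_motion_features(poses):
    """Motion features from per-frame (roll, pitch, yaw) angles in degrees.

    Pitch variance proxies nodding intensity, yaw variance proxies
    shaking intensity, and motion energy proxies overall expressiveness.
    These formulas are a sketch, not the paper's exact definitions.
    """
    pitches = [p for _, p, _ in poses]
    yaws = [y for _, _, y in poses]
    # Mean absolute frame-to-frame change, summed over roll/pitch/yaw.
    deltas = [
        sum(abs(a - b) for a, b in zip(poses[i], poses[i - 1]))
        for i in range(1, len(poses))
    ]
    return {
        "pitch_variance": statistics.pvariance(pitches),
        "yaw_variance": statistics.pvariance(yaws),
        "motion_energy": sum(deltas) / len(deltas) if deltas else 0.0,
    }
```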
3.1.3. Facial Expressivity
DeepFace [
24] was used for facial expression analysis. The system extracts emotion probabilities for multiple affective categories; in this study, we focus on the happiness probability as an indicator of positive affect during interview responses. Face detection was performed using the RetinaFace backend.
These happiness-based metrics capture warmth, approachability, and positive affect, all of which have been linked to improved leadership assessments and interpersonal effectiveness. All nonverbal features were aggregated at the interview level and standardized to enable cross-modal integration.
3.2. Verbal Content Analysis via Large Language Models
Audio streams were transcribed using the Whisper large-v3 automatic speech recognition model. Transcripts were then processed with a locally deployed LLM to conduct theory-informed behavioral coding of executive-relevant behavioral features. Specifically, we used the Llama 3.2 (Meta) llama3.2:3b model. Finally, an HR expert manually reviewed these results and confirmed the reliability of the LLM scores.
3.2.1. Dimensional Behavioral Coding
For each of five interview questions, the LLM was prompted to rate responses on five leadership-relevant behavioral dimensions using a fixed 1–5 scale: confidence, clarity, strategic thinking, emotional stability and leadership potential. Prompt templates were iteratively refined through pilot experiments to ensure consistent and interpretable scoring. The final prompt templates used in the study are provided in
Figure 2.
These dimensions were chosen based on considerable empirical research relating them to interview performance and leadership effectiveness [
25,
26,
27,
28]. Importantly, the LLM was employed as a behavioral coding tool rather than a free-form evaluator, assuring consistency and construct alignment among candidates.
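The behavioral-coding step can be sketched as a constrained prompt plus a strict parser that accepts only the five fixed dimensions and the 1–5 scale. The prompt wording and output format below are hypothetical illustrations; only the dimension names and scale come from the study.

```python
import re

DIMENSIONS = [
    "confidence", "clarity", "strategic thinking",
    "emotional stability", "leadership potential",
]

# Hypothetical template; the study's actual prompts differ.
PROMPT_TEMPLATE = (
    "Rate the following interview answer on each dimension from 1 to 5. "
    "Reply one per line as 'dimension: score'.\n"
    "Dimensions: {dims}\nAnswer: {answer}"
)

def parse_ratings(llm_reply):
    """Extract 'dimension: score' lines, keeping only known dimensions
    with valid scores in the 1-5 range; anything else is discarded."""
    ratings = {}
    for line in llm_reply.splitlines():
        m = re.match(r"\s*([a-z ]+?)\s*:\s*([1-5])\b", line.lower())
        if m and m.group(1).strip() in DIMENSIONS:
            ratings[m.group(1).strip()] = int(m.group(2))
    return ratings
```

Constraining the output format and rejecting malformed lines is one simple way to use the LLM as a coding instrument rather than a free-form evaluator.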
3.2.2. Holistic Candidate Assessment
In a second stage, the LLM synthesized all responses from a candidate, providing ratings on three higher-level features (the prompt used is shown in
Figure 3).
Answer quality: reflects depth, coherence, and relevancy;
Top-manager Potential: captures executive-level thinking and vision;
Professionalism: demonstrates corporate maturity, tone, and precision.
This two-stage procedure mirrors how professional HR assessors evaluate candidates: first rating specific behaviors, then forming holistic judgments.
3.3. Personality Trait Estimation
The personality traits described by the OCEAN model were estimated using the deep learning model described in our previous work [
19]. The model uses a convolutional neural network (CNN) to capture spatial features, followed by a long short-term memory (LSTM) network that captures temporal dynamics across the video to estimate personality traits.
To reconcile personality traits with leadership theory, neuroticism was reversed to represent emotional stability. Selected attributes were then combined into a Leadership Disposition construct, which represents consistent individual differences related to leadership emergence and seniority preparation. This construct supports behavioral observations by capturing personality tendencies.
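The reversal-and-aggregation step can be sketched as follows. Treating all five traits as equally weighted on a 0–1 scale is an illustrative assumption; the paper combines only selected attributes, which are not enumerated here.

```python
def leadership_disposition(ocean):
    """Combine Big Five scores (assumed on a 0-1 scale) into a single
    disposition score.

    Neuroticism is reversed into emotional stability; the equal-weight
    average over all five traits is a simplifying assumption.
    """
    stability = 1.0 - ocean["neuroticism"]
    traits = [
        ocean["openness"], ocean["conscientiousness"],
        ocean["extraversion"], ocean["agreeableness"], stability,
    ]
    return sum(traits) / len(traits)
```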
3.4. Construct-Level Aggregation and Top Potential Scoring
The suggested method is built on the principles of multi-modal complementarity, psychometric grounding, and interpretability in order to synthesize non-redundant data with established leadership concepts. In addition, because the dataset is too small for reliable supervised learning, the method operates at the construct level rather than on raw features. Modality-specific outputs are aggregated into three interpretable feature-constructs:
Professional Cognitive Competence: derived from verbal dimensions (confidence, clarity, strategic thinking, answer quality);
Observed Leadership Behavior: synthesized from nonverbal dynamics (gaze engagement, head motion energy, facial expressivity);
Leadership Disposition: computed from personality traits aligned with leadership emergence theory.
All constructs were standardized and combined through equal-weight averaging to produce a Top Potential Score. This design choice favors interpretability and resilience over model complexity, allowing for clear inspection of individual contributions. By ensuring that scores are decomposable and auditable, the architecture serves as a transparent decision-support system that reinforces human judgment in hiring rather than operating autonomously.
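The aggregation described above can be sketched in a few lines, assuming z-score standardization across candidates and equal weights for the three constructs:

```python
import statistics

def zscore(values):
    """Standardize a list of values across candidates (population SD)."""
    mu, sd = statistics.mean(values), statistics.pstdev(values)
    return [(v - mu) / sd if sd else 0.0 for v in values]

def top_potential_scores(cognitive, behavior, disposition):
    """Equal-weight average of the three standardized construct scores,
    one Top Potential Score per candidate (a sketch of the aggregation)."""
    cols = [zscore(cognitive), zscore(behavior), zscore(disposition)]
    return [sum(vals) / 3 for vals in zip(*cols)]
```

Because the final score is a plain average of standardized constructs, each construct's contribution to any candidate's score can be read off directly, which is the decomposability property the text describes.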
4. Experiments and Results
In this section, we conduct a set of experiments (parametric and non-parametric testing, a permutation test, leave-one-out validation, and ranking and screening utility) to assess the proposed multi-modal method’s ability to distinguish top executive candidates from non-top candidates using interpretable, construct-level analysis under extreme label scarcity. Rather than targeting prediction accuracy, the experiments prioritize construct validity, robustness, and screening utility, which are more appropriate for candidate assessment.
The used dataset consists of 59 video interviews obtained using a custom website. The dataset includes 31.8% male and 68.2% female candidates, with ages ranging from 20 to 55 years (M = 35.0, SD = 7.99). All interviews were conducted in Russian under consistent recording conditions (self-recorded via custom website with standardized instructions). Participants provided asynchronous replies to five standardized questions about executive abilities (
Table 1 shows the questions translated into English). Following the interview, participants filled out the IPIP-50 personality test. All candidates were employees of two partner companies undergoing internal talent assessments. Ground truth labels were derived from organizational seniority levels: level 0 (non-managerial positions), level 1 (top managers, executive-level), level 2 (middle management), and level 3 (personnel reserve).
We defined level 1 applicants as top (N = 5) and all others as non-top (N = 54), reflecting a realistic but highly imbalanced executive screening scenario. For each candidate, we retained synchronized video, audio, LLM-based verbal evaluations, and personality scores for multi-modal analysis.
To determine if each construct distinguishes top candidates from non-top candidates, we performed nonparametric statistical comparisons with the Mann-Whitney U-test (appropriate for small and imbalanced groups) and several effect size metrics (results shown in
Table 2). Across all constructs, top candidates had significantly higher mean scores than non-top candidates. The Top Potential Score, which combines verbal, nonverbal, and personality cues, showed the strongest separation. In addition to significance tests, effect sizes were calculated using Cohen’s d and Cliff’s delta, both of which indicated strong effects. Spearman rank correlations between construct scores and the top level confirmed monotonic relationships, especially for the total score. These findings show that each modality offers a valuable signal, and that their combination produces the largest separation between executive and non-executive candidates.
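The two effect-size statistics used throughout this section can be computed directly from the two groups of scores; a minimal sketch:

```python
import statistics

def cliffs_delta(top, rest):
    """Nonparametric effect size: P(top > rest) - P(top < rest),
    ranging from -1 to 1, with ties contributing zero."""
    gt = sum(1 for t in top for r in rest if t > r)
    lt = sum(1 for t in top for r in rest if t < r)
    return (gt - lt) / (len(top) * len(rest))

def cohens_d(top, rest):
    """Standardized mean difference using the pooled sample SD."""
    n1, n2 = len(top), len(rest)
    v1, v2 = statistics.variance(top), statistics.variance(rest)
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    return (statistics.mean(top) - statistics.mean(rest)) / pooled ** 0.5
```

Cliff’s delta is preferred under the small, imbalanced groups here because it makes no distributional assumptions; Cohen’s d is reported alongside it for comparability with conventional benchmarks.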
To investigate the effect of construct weighting, we ran a grid search over the aggregation weights with a step size of 0.1, yielding 66 distinct weight combinations. Each configuration was assessed using Cliff’s delta and ranking-based recall criteria. The optimal configuration assigns weights of 0.4, 0.1, and 0.5 to the Professional Cognitive, Leadership Disposition, and Observed Leadership constructs, respectively, resulting in a Cliff’s delta of 0.904 and a Cohen’s d of 2.19 (as shown in
Table 3). We can notice that the top six configurations differ by less than 0.01 in Cliff’s delta, demonstrating that the model is relatively insensitive to moderate variations in the construct weights (
Figure 4 illustrates the performance of each configuration as a heatmap). This suggests that the equal-weight aggregation provides a reasonable and stable baseline, while task-specific tuning may further optimize performance when sufficient training data are available.
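Enumerating the weight grid is straightforward: non-negative triples on a 0.1 step that sum to 1 form 66 combinations, matching the count reported above. A minimal sketch:

```python
def weight_grid(step=0.1):
    """All non-negative weight triples summing to 1 at the given step.

    For step = 0.1 this yields the 66 combinations searched in the
    experiments (compositions of 10 into three non-negative parts).
    """
    n = round(1 / step)
    return [
        (i / n, j / n, (n - i - j) / n)
        for i in range(n + 1)
        for j in range(n + 1 - i)
    ]
```

Each triple would then be scored (e.g., by Cliff’s delta of the resulting weighted composite) and the best-separating configuration retained.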
A core advantage of the proposed method is its intrinsic interpretability, enabling full traceability of assessment decisions. Unlike black-box models where predictions are opaque, our construct-level architecture allows stakeholders to audit exactly which factors differentiate high-potential candidates from others. In the previous example for the construct weights, values of 0.4 Professional Cognitive, 0.1 Leadership Disposition, and 0.5 Observed Leadership yield strong separation. In such a configuration, if a candidate scores highly, the system reveals that 50% of that score derives from Observed Leadership behaviors (e.g., gaze stability, head motion), providing actionable feedback rather than a binary reject/hire decision. This transparency supports responsible AI deployment by allowing bias detection, construct validation, and human-in-the-loop verification, aligning with best practices for high-stakes personnel selection.
We estimated the method’s robustness under label scarcity with only five top candidates. So, for five candidates, traditional cross-validation is insufficient. We instead used Leave-One-Top-Out (LOTO) analysis; in each iteration, one top candidate was removed from the top group and evaluated separately against the distribution of non-top candidates using the top potential score. The results presented in
Table 4 reveal that all five top candidates consistently score above the non-top mean. No single candidate dominates the observed effect, demonstrating that the separation does not rest on one sample. This analysis provides strong evidence of robustness despite extreme label scarcity.
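The LOTO procedure can be sketched as follows. This simplified version scores each held-out top candidate against the non-top distribution; expressing the margin in non-top standard deviations is an illustrative choice, not necessarily the paper's reporting convention.

```python
import statistics

def leave_one_top_out(top_scores, non_top_scores):
    """For each held-out top candidate, report how far its Top Potential
    Score sits above the non-top mean, in non-top standard deviations."""
    mu = statistics.mean(non_top_scores)
    sd = statistics.pstdev(non_top_scores)
    return [(s - mu) / sd for s in top_scores]
```

A positive margin for every held-out candidate indicates that the group-level separation is not driven by any single individual.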
To confirm statistical significance without relying on distributional assumptions, we ran a permutation test with 10,000 iterations. Candidate scores were randomly assigned to top and non-top groups while maintaining group sizes. The observed mean difference in Top Potential Score was compared with the null distribution. The observed difference was at the extreme tail of the permutation distribution, with a permutation p-value of around 0.0002. This demonstrates that the observed separation is extremely unlikely to arise by chance and is not a result of the limited sample size.
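The permutation procedure described above can be sketched as follows: candidate scores are repeatedly shuffled between groups of fixed sizes, and the observed mean difference is compared against the resulting null distribution. The add-one smoothing of the p-value is a common convention, assumed here rather than taken from the paper.

```python
import random
import statistics

def permutation_p(top, rest, n_iter=10_000, seed=0):
    """One-sided permutation test on the difference of group means,
    preserving group sizes across shuffles."""
    rng = random.Random(seed)
    pooled = list(top) + list(rest)
    observed = statistics.mean(top) - statistics.mean(rest)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        diff = (statistics.mean(pooled[:len(top)])
                - statistics.mean(pooled[len(top):]))
        if diff >= observed:
            hits += 1
    # Add-one smoothing avoids reporting an exact zero p-value.
    return (hits + 1) / (n_iter + 1)
```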
Given the scarcity of labels (N = 5 top candidates), we favor effect-size statistics over typical classification measurements (such as accuracy and F1-score). To avoid distributional assumptions, we quantify group separation with Cohen’s d and nonparametric Cliff’s delta. Cohen’s d values are interpreted using conventional thresholds of 0.2 (small), 0.5 (medium), and 0.8 (large). The permutation test (p = 0.0002) provides distribution-free significance testing, indicating that the separation is unlikely due to chance. Together, these metrics provide more relevant evidence of discriminative capacity than accuracy-based metrics would under extreme label scarcity, where classifier training is infeasible.
To evaluate the method’s efficacy as a screening tool, applicants were prioritized based on their Top Potential Score, with recall tested at various levels. As shown in
Table 5, the top 20% of the samples based on the potential score includes the whole executive-level group. This ability to significantly reduce the candidate pool while maintaining high-potential identification supports the utility of the framework as a decision-support screening mechanism rather than a final decision system.
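The recall-at-top-fraction screening metric can be sketched as follows; how the cutoff rounds (here, ceiling) is an assumption, since the paper reports 11 individuals for the top 20% of 59 candidates.

```python
import math

def recall_at_top(scores, labels, fraction=0.2):
    """Recall of positive labels (1 = top candidate) within the top
    `fraction` of candidates ranked by score, higher score first."""
    k = math.ceil(len(scores) * fraction)
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    found = sum(label for _, label in ranked[:k])
    return found / sum(labels)
```

A recall of 1.0 at the 20% cutoff corresponds to the result reported in Table 5: the entire executive-level group is retained while roughly 80% of the pool is screened out.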
We further analyzed the results based on the ranking. As we can see from
Figure 5, among the top 20% of candidates ranked by the method (11 individuals), five belonged to Level 1 (top managers), four to Level 2 (middle management), two to Level 3 (personnel reserve), and one to Level 0 (non-managerial positions). The percentage of Level 2 candidates in this subset aligns with organizational hierarchy: middle managers represent the immediate pipeline to executive roles and often exhibit overlapping leadership competencies, providing a logical validation of the method’s discriminative ability.
To determine whether different modalities provide redundant or complementary information, the correlation between the three construct scores was examined.
Figure 6 reveals low to moderate correlations, showing that verbal cognitive assessments, nonverbal behavioral signals, and personality all provide unique information about leadership potential. This validates the framework’s multimodal nature and justifies combining diverse AI-derived inputs rather than depending solely on one modality.
Finally, we ran an unsupervised clustering analysis in the three-dimensional construct space just for the sake of visualization (as shown in
Figure 7). Clustering was not used for classification, but rather to investigate structural trends in the data. One cluster had significantly higher construct means and contained a majority of top candidates, whereas the other clusters corresponded to lower leadership profiles. However, due to the small sample size, the clustering results are interpreted as exploratory and illustrative rather than predictive.
5. Discussion
The results achieved within the scope of this research allow us to conclude that AI-based video analysis is a promising instrument for candidate interview assessment. Our experiments show that the proposed method can be used as a screening tool to assess candidates’ abilities and rank them. The method ranked all five executive-level candidates within the top 20% of applicants (11 individuals). This subset also included four middle managers, two talent reserve candidates, and one non-managerial candidate. The concentration of executives and middle managers in the highest ranks aligns with organizational hierarchy: middle managers represent the immediate pipeline to executive roles and often exhibit overlapping leadership competencies. The single non-managerial candidate in this group does not undermine the overall discriminative pattern. These findings support the framework’s utility as a screening aid that prioritizes high-potential candidates while preserving human oversight for final decisions.
This study acknowledges several limitations. First, the dataset exhibits label scarcity, comprising 59 interviews with only five executive-level candidates (N = 5). However, this reflects the inherent rarity of senior leadership data in real-world hiring scenarios; as a result, standard supervised learning and prediction accuracy measurements are not applicable. To mitigate this, we prioritized effect sizes, nonparametric testing, and permutation-based inference to ensure statistical robustness. In addition, the dataset is monolingual (Russian); however, the proposed method is largely language-independent: the computer vision module relies on nonverbal behavioral cues, while the LLM-based verbal analysis supports multilingual processing. Future work will focus on validating the method using larger and more diverse datasets covering multiple languages and cultural contexts.
Second, organizational seniority serves as an imperfect proxy for leadership effectiveness. While Level 1 positions indicate formal executive roles verified by HR, they do not capture individual performance metrics or validated leadership outcomes. Future research should incorporate multi-source ratings, objective performance indicators, and longitudinal career progression data to strengthen criterion validity.
Third, although LLM-based behavioral coding was validated by HR experts who manually reviewed and confirmed score reliability, the specific model configuration (Llama 3.2, 3B parameters) may still exhibit prompt sensitivity or domain-specific biases. Future work will include cross-model validation to assess consistency across different LLM architectures and further establish measurement generalizability.
6. Conclusions
This study presents an interpretable multi-modal AI method for video-based candidate interview screening that incorporates nonverbal behavioral features, LLM-derived verbal assessments, and personality traits into grounded leadership constructs. Instead of depending on predictive models, the proposed method prioritizes construct validity, transparency, and statistical resilience in the face of limited data. The proposed method achieved strong discrimination between executive-potential and other candidates: the composite Top Potential Score separated groups with a large effect size (Cliff’s delta = 0.89, permutation p = 0.0002), and 100% of executive-level candidates were retained within the top 20% of ranked applicants. Importantly, these results were obtained without training a classifier, suggesting that meaningful screening signals can be extracted even when labeled executive data is scarce.
The findings show that multi-modal AI can improve early-stage executive screening by giving objective, scalable, and interpretable markers of leadership readiness. This approach provides a solid foundation for future AI-assisted employment systems and encourages additional testing on larger datasets and in real-world deployment scenarios.
From a cognitive psychology perspective, this study demonstrates an integration of static personality traits with dynamic behavioral markers to assess leadership potential. By quantifying verbal reasoning and nonverbal cues alongside dispositional traits, the framework provides a more comprehensive assessment of leadership competence that captures the interplay between cognitive processes and social context. This supports the development of dynamic personality models and offers objective tools for assessing cognitive-behavioral links in high-stakes scenarios.
Future research will center on verifying the proposed multi-modal AI framework on bigger and more diversified interview datasets, particularly those with a higher number of senior and executive candidates. With larger sample sets, supervised and semi-supervised learning algorithms can be tested to determine predictive accuracy, generalization, and deployment practicality. Additionally, validation against downstream outcomes would allow for a more comprehensive evaluation of executive readiness prediction.