1. Introduction
Candidate hiring decisions have long-term organizational consequences, yet interview-based evaluations remain heavily dependent on subjective judgment. Human interviewers have inherent limitations, including cognitive bias, inconsistency across raters, and difficulties processing multidimensional behavioral data simultaneously [
1,
2]. These limitations are most pronounced in candidate assessments, where leadership potential emerges through verbal content, nonverbal behavior, and dispositional traits.
Video interviews represent a valuable data source for computational analysis, capturing speech, gaze, facial dynamics, and body language in a single recording. By quantifying these signals at scale, AI technologies could augment human judgment [
3,
4,
5]. However, existing systems usually take one of two approaches: single-modality analysis (for example, speech or text) or end-to-end black-box prediction. Both limit practical application in personnel selection: single-modality systems neglect complementary behavioral channels, while black-box models lack the transparency needed for high-stakes judgments.
Prior research has investigated the importance of multiple features and factors in candidate interview assessment. Nonverbal behaviors such as eye contact, facial expressivity, and head movements signal confidence and interpersonal effectiveness [
6]. Large language models now enable scalable analysis of verbal content, capturing strategic reasoning and communication quality [
7]. Personality traits, particularly the Big Five, predict leadership emergence and professional readiness. Despite evidence supporting each modality, prior work rarely integrates all three into a single framework. In the rare systems where integration occurs, the emphasis typically falls on predictive performance over interpretability.
In this work, we present an interpretable multimodal method for candidate interview assessment that incorporates computer vision and large language models to extract features from the video interview. The method examines three parallel streams: (i) nonverbal signals collected from video (gaze engagement, head motion dynamics, and facial expressivity), (ii) spoken content assessed using LLM-based behavioral coding across leadership-relevant dimensions, and (iii) Big Five personality traits estimated using deep learning models. Rather than developing an end-to-end predictor, we aggregate low-level features into three theory-based constructs (professional-cognitive competence, observed leadership behavior, and leadership disposition) that combine to produce an interpretable Top Potential Score.
The contribution of this paper is summarized as follows:
A multi-modal, AI-based method integrating video-based nonverbal analysis, personality trait estimation, and LLM-driven verbal assessment;
An interpretable, construct-level feature design that bridges LLM outputs with established leadership and assessment theories;
Empirical validation demonstrating the method’s effectiveness in identifying high-potential candidates.
The rest of the paper is divided as follows:
Section 2 explores the related studies.
Section 3 presents the proposed framework.
Section 4 delves into the experiments and results.
Section 5 analyzes the limitations and future steps.
Section 6 concludes the research.
2. Related Work
Automated interview analysis is becoming increasingly popular as businesses seek scalable and objective alternatives to traditional human-centered hiring processes. Computer vision methods have been applied to quantify eye contact, facial expressivity, head motion, and posture from interview videos. These nonverbal cues correlate with perceived confidence, social presence, and hiring outcomes in empirical studies [
8,
9,
10]. However, many of these systems rely on black-box classifiers that have been trained end-to-end, restricting interpretability and raising issues about fairness and deployment in high-stakes environments [
11].
Natural language processing approaches use interview transcripts to predict employability or communication quality. Recent research uses large language models (LLMs) to assess behavioral aspects like clarity, strategic thinking, and confidence [
12,
13,
14]. While LLMs allow for scalable semantic analysis, they are often used as direct predictors rather than structured coding instruments. This raises concerns about demographic and cultural biases in model outputs [
15]. Few studies have anchored LLM outcomes in known evaluation frameworks or combined them with nonverbal or dispositional indicators.
Personality assessment, notably using the Big Five model, is a well-established aspect of personnel selection research. Openness, extraversion, emotional stability, and conscientiousness have all been linked to leadership development and managerial performance. Prior research has investigated personality prediction from text or video [
16,
17,
18,
19,
20]; however, personality measurements are frequently omitted from multimodal pipelines or combined in an ad hoc manner without clear construct-level interpretation.
In contrast to previous approaches, this study provides an interpretable multi-modal AI framework that incorporates nonverbal indicators, LLM-based verbal assessments, and personality factors into theory-driven leadership constructs. Rather than pursuing predictive accuracy through supervised learning, the method focuses on statistical robustness, effect magnitude, and rank-based screening. This establishes the suggested method as a complementary alternative to black-box hiring methods, especially for candidate interview assessment where transparency and data efficiency are crucial.
The suggested framework fits into the intrinsically interpretable (ante-hoc) paradigm of Explainable AI (XAI) research [
21]. In contrast to post-hoc explanation techniques that seek to explain black-box predictions, our technique incorporates interpretability into the architecture by design. In line with the taxonomy of Explainable AI, our technique offers: (1) global interpretability, providing insights into the relationships between high-level constructs (verbal, non-verbal, personality) and the final Top Potential Score, and (2) decomposability, enabling the individual construct contributions to be audited. This places the technique within emerging research that privileges transparency over marginal accuracy gains in high-stakes decision contexts.
3. Proposed Method
This section introduces an interpretable multi-modal AI-based method for evaluating candidate interviews (see
Figure 1). The method uses three complementary modalities to convert unstructured interview recordings into meaningful construct-level indicators of leadership potential and seniority readiness. It integrates computer vision for nonverbal behavioral analysis, large language models (LLMs) for verbal content evaluation, and deep learning models for dispositional assessment (personality trait estimation).
The method uses three parallel analysis streams to process candidate data from a recorded video interview:
Nonverbal Behavioral Analysis: Extraction of eye-gaze, head motion, and facial expressivity cues from video frames.
Verbal Content Analysis: Structured evaluation of transcribed responses using LLM-based behavioral coding.
Personality Trait Estimation: Estimation of Big Five personality traits using deep learning models [
19].
Each modality produces standardized feature vectors that aggregate into three interpretable constructs: Professional–Cognitive Competence (verbal reasoning and strategic thinking), Observed Leadership Behavior (nonverbal expressivity and engagement), and Leadership Disposition (personality-based readiness). These constructs combine into a single Top Potential Score for candidate screening. The method is construct-driven: every computational signal maps directly to established concepts in industrial–organizational criteria, ensuring both interpretability and theoretical grounding.
3.1. Nonverbal Behavioral Feature Extraction
We used modern computer vision algorithms to extract nonverbal features from interview videos. The largest face in each frame was identified as the candidate’s face and retained for further analysis.
3.1.1. Gaze Engagement
The L2CS-Net [
22] model with a ResNet-50 backbone was used to estimate gaze direction. Eye-contact engagement was measured as the percentage of frames in which the candidate’s visual axis crossed the camera region. In high-stakes interpersonal situations, maintaining eye contact with the interviewer (represented by the camera) is an established behavioral indicator of focus, self-assurance, and social presence.
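The frame-counting step described above can be sketched as follows. This is a minimal illustration, assuming per-frame gaze angles in degrees with (0, 0) meaning "looking straight at the camera"; the 10-degree angular window is an illustrative assumption, not the study's actual threshold.

```python
import math

def gaze_engagement(gaze_angles, threshold_deg=10.0):
    """Fraction of frames in which the gaze falls within an angular
    window around the camera axis.

    gaze_angles: list of (yaw, pitch) pairs per frame, in degrees.
    The threshold is a hypothetical value chosen for illustration.
    """
    if not gaze_angles:
        return 0.0
    hits = sum(
        1 for yaw, pitch in gaze_angles
        if math.hypot(yaw, pitch) <= threshold_deg
    )
    return hits / len(gaze_angles)
```

In practice the (yaw, pitch) inputs would come from a gaze estimator such as L2CS-Net applied to each detected face crop.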
3.1.2. Head Movement
MediaPipe [
23] with the Face Mesh solution (which provides 468 facial landmarks and enables estimation of head pose dynamics) was used to estimate three-dimensional head pose trajectories and extract rotation angles (roll, pitch, and yaw). To measure behavioral expressivity and engagement, three motion-derived features were calculated from these trajectories:
Pitch variance (nodding intensity), linked to active listening and agreement;
Yaw variance (shaking intensity), possibly indicating doubt or disagreement;
Head motion energy captures total expressiveness and engagement.
These characteristics jointly define behavioral dynamism and nonverbal expressivity, which are especially important in leadership perception.
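The three motion features can be sketched from the per-frame rotation angles as follows. The exact definitions here are illustrative assumptions: motion energy is taken as the mean frame-to-frame angular change summed over the three axes.

```python
import statistics

def head_motion_features(poses):
    """Motion features from per-frame (roll, pitch, yaw) angles in degrees.

    Pitch variance proxies nodding intensity, yaw variance proxies
    shaking intensity, and motion energy proxies overall expressiveness.
    These formulas are a sketch, not the paper's exact definitions.
    """
    pitches = [p for _, p, _ in poses]
    yaws = [y for _, _, y in poses]
    # Mean absolute frame-to-frame change, summed over roll/pitch/yaw.
    deltas = [
        sum(abs(a - b) for a, b in zip(poses[i], poses[i - 1]))
        for i in range(1, len(poses))
    ]
    return {
        "pitch_variance": statistics.pvariance(pitches),
        "yaw_variance": statistics.pvariance(yaws),
        "motion_energy": sum(deltas) / len(deltas) if deltas else 0.0,
    }
```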
3.1.3. Facial Expressivity
DeepFace [
24] was used for facial expression analysis. The system extracts emotion probabilities for multiple affective categories; in this study, we focus on the happiness probability as an indicator of positive affect during interview responses. Face detection was performed using the RetinaFace backend.
These happiness-based metrics capture warmth, approachability, and positive affect, all of which have been linked to improved leadership assessments and interpersonal effectiveness. All nonverbal features were aggregated at the interview level and standardized to enable cross-modal integration.
3.2. Verbal Content Analysis via Large Language Models
Audio streams were transcribed using the Whisper large-v3 automatic speech recognition model. Transcripts were then processed with a locally deployed LLM to conduct theory-informed behavioral coding of executive-relevant behavioral features. Specifically, we used the Llama 3.2 (Meta) llama3.2:3b model. Finally, an HR expert manually reviewed these results and confirmed the reliability of the LLM scores.
3.2.1. Dimensional Behavioral Coding
For each of five interview questions, the LLM was prompted to rate responses on five leadership-relevant behavioral dimensions using a fixed 1–5 scale: confidence, clarity, strategic thinking, emotional stability and leadership potential. Prompt templates were iteratively refined through pilot experiments to ensure consistent and interpretable scoring. The final prompt templates used in the study are provided in
Figure 2.
These dimensions were chosen based on considerable empirical research relating them to interview performance and leadership effectiveness [
25,
26,
27,
28]. Importantly, the LLM was employed as a behavioral coding tool rather than a free-form evaluator, assuring consistency and construct alignment among candidates.
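The behavioral-coding step can be sketched as a constrained prompt plus a strict parser that accepts only the five fixed dimensions and the 1–5 scale. The prompt wording and output format below are hypothetical illustrations; only the dimension names and scale come from the study.

```python
import re

DIMENSIONS = [
    "confidence", "clarity", "strategic thinking",
    "emotional stability", "leadership potential",
]

# Hypothetical template; the study's actual prompts differ.
PROMPT_TEMPLATE = (
    "Rate the following interview answer on each dimension from 1 to 5. "
    "Reply one per line as 'dimension: score'.\n"
    "Dimensions: {dims}\nAnswer: {answer}"
)

def parse_ratings(llm_reply):
    """Extract 'dimension: score' lines, keeping only known dimensions
    with valid scores in the 1-5 range; anything else is discarded."""
    ratings = {}
    for line in llm_reply.splitlines():
        m = re.match(r"\s*([a-z ]+?)\s*:\s*([1-5])\b", line.lower())
        if m and m.group(1).strip() in DIMENSIONS:
            ratings[m.group(1).strip()] = int(m.group(2))
    return ratings
```

Constraining the output format and rejecting malformed lines is one simple way to use the LLM as a coding instrument rather than a free-form evaluator.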
3.2.2. Holistic Candidate Assessment
In a second stage, the LLM synthesized all responses from a candidate, providing ratings on three higher-level features (the prompt used is shown in
Figure 3).
Answer quality: reflects depth, coherence, and relevancy;
Top-manager Potential: captures executive-level thinking and vision;
Professionalism: demonstrates corporate maturity, tone, and precision.
This two-stage procedure mirrors how professional HR assessors evaluate candidates: first rating specific behaviors, then forming holistic judgments.
3.3. Personality Trait Estimation
The personality traits described by the OCEAN model were estimated using the deep learning model described in our previous work [
19]. The model uses a convolutional neural network (CNN) to capture spatial features, followed by a long short-term memory (LSTM) network that captures temporal dynamics across the video to estimate personality traits.
To reconcile personality traits with leadership theory, neuroticism was reversed to represent emotional stability. Selected attributes were then combined into a Leadership Disposition construct, which represents consistent individual differences related to leadership emergence and seniority preparation. This construct supports behavioral observations by capturing personality tendencies.
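The reversal-and-aggregation step can be sketched as follows. Treating all five traits as equally weighted on a 0–1 scale is an illustrative assumption; the paper combines only selected attributes, which are not enumerated here.

```python
def leadership_disposition(ocean):
    """Combine Big Five scores (assumed on a 0-1 scale) into a single
    disposition score.

    Neuroticism is reversed into emotional stability; the equal-weight
    average over all five traits is a simplifying assumption.
    """
    stability = 1.0 - ocean["neuroticism"]
    traits = [
        ocean["openness"], ocean["conscientiousness"],
        ocean["extraversion"], ocean["agreeableness"], stability,
    ]
    return sum(traits) / len(traits)
```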
3.4. Construct-Level Aggregation and Top Potential Scoring
The suggested method is built on the principles of multi-modal complementarity, psychometric grounding, and interpretability in order to synthesize non-redundant data with established leadership concepts. In addition, because the dataset is too small for reliable supervised learning, the method operates at the construct level rather than on raw features. Modality-specific outputs are aggregated into three interpretable feature-constructs:
Professional Cognitive Competence: derived from verbal dimensions (confidence, clarity, strategic thinking, answer quality);
Observed Leadership Behavior: synthesized from nonverbal dynamics (gaze engagement, head motion energy, facial expressivity);
Leadership Disposition: computed from personality traits aligned with leadership emergence theory.
All constructs were standardized and combined through equal-weight averaging to produce a Top Potential Score. This design choice favors interpretability and resilience over model complexity, allowing for clear inspection of individual contributions. By ensuring that scores are decomposable and auditable, the architecture serves as a transparent decision-support system that reinforces human judgment in hiring rather than operating autonomously.
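The aggregation described above can be sketched in a few lines, assuming z-score standardization across candidates and equal weights for the three constructs:

```python
import statistics

def zscore(values):
    """Standardize a list of values across candidates (population SD)."""
    mu, sd = statistics.mean(values), statistics.pstdev(values)
    return [(v - mu) / sd if sd else 0.0 for v in values]

def top_potential_scores(cognitive, behavior, disposition):
    """Equal-weight average of the three standardized construct scores,
    one Top Potential Score per candidate (a sketch of the aggregation)."""
    cols = [zscore(cognitive), zscore(behavior), zscore(disposition)]
    return [sum(vals) / 3 for vals in zip(*cols)]
```

Because the final score is a plain average of standardized constructs, each construct's contribution to any candidate's score can be read off directly, which is the decomposability property the text describes.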
4. Experiments and Results
In this section, we conduct a set of experiments (parametric and non-parametric testing, a permutation test, leave-one-out validation, and ranking and screening utility) to assess the proposed multi-modal method’s ability to distinguish top executive candidates from non-top candidates using interpretable, construct-level analysis under extreme label scarcity. Rather than targeting prediction accuracy, the experiments prioritize construct validity, robustness, and screening utility, which are more appropriate for candidate assessment.
The used dataset consists of 59 video interviews obtained using a custom website. The dataset includes 31.8% male and 68.2% female candidates, with ages ranging from 20 to 55 years (M = 35.0, SD = 7.99). All interviews were conducted in Russian under consistent recording conditions (self-recorded via custom website with standardized instructions). Participants provided asynchronous replies to five standardized questions about executive abilities (
Table 1 shows the questions translated into English). Following the interview, participants filled out the IPIP-50 personality test. All candidates were employees of two partner companies undergoing internal talent assessments. Ground truth labels were derived from organizational seniority levels: level 0 (non-managerial positions), level 1 (top managers, executive-level), level 2 (middle management), and level 3 (personnel reserve).
We defined level 1 applicants as top (N = 5) and all others as non-top (N = 54), reflecting a realistic but highly imbalanced executive screening scenario. For each candidate, we retained synchronized video, audio, LLM-based verbal evaluations, and personality scores for multi-modal analysis.
To determine if each construct distinguishes top candidates from non-top candidates, we performed nonparametric statistical comparisons with the Mann-Whitney U-test (appropriate for small and imbalanced groups) and several effect size metrics (results shown in
Table 2). Across all constructs, top candidates had significantly higher mean scores than non-top candidates. The Top Potential Score, which combines verbal, nonverbal, and personality cues, showed the strongest separation. In addition to significance tests, effect sizes were calculated using Cohen’s d and Cliff’s delta, both of which indicated strong effects. Spearman rank correlations between construct scores and the top level confirmed monotonic relationships, especially for the total score. These findings show that each modality offers a valuable signal, and that their combination produces the largest separation between executive and non-executive candidates.
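The two effect-size statistics used throughout this section can be computed directly from the two groups of scores; a minimal sketch:

```python
import statistics

def cliffs_delta(top, rest):
    """Nonparametric effect size: P(top > rest) - P(top < rest),
    ranging from -1 to 1, with ties contributing zero."""
    gt = sum(1 for t in top for r in rest if t > r)
    lt = sum(1 for t in top for r in rest if t < r)
    return (gt - lt) / (len(top) * len(rest))

def cohens_d(top, rest):
    """Standardized mean difference using the pooled sample SD."""
    n1, n2 = len(top), len(rest)
    v1, v2 = statistics.variance(top), statistics.variance(rest)
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    return (statistics.mean(top) - statistics.mean(rest)) / pooled ** 0.5
```

Cliff’s delta is preferred under the small, imbalanced groups here because it makes no distributional assumptions; Cohen’s d is reported alongside it for comparability with conventional benchmarks.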
To investigate the effect of construct weighting, we ran a grid search over the aggregation weights with a step size of 0.1, yielding 66 distinct weight combinations. Each configuration was assessed using Cliff’s delta and ranking-based recall criteria. The optimal configuration assigns weights of 0.4, 0.1, and 0.5 to the Professional Cognitive, Leadership Disposition, and Observed Leadership constructs, respectively, resulting in a Cliff’s delta of 0.904 and a Cohen’s d of 2.19 (as shown in
Table 3). We can notice that the top six configurations differ by less than 0.01 in Cliff’s delta, demonstrating that the model is relatively insensitive to moderate variations in the construct weights (
Figure 4 illustrates the performance of each configuration as a heatmap). This suggests that the equal-weight aggregation provides a reasonable and stable baseline, while task-specific tuning may further optimize performance when sufficient training data are available.
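Enumerating the weight grid is straightforward: non-negative triples on a 0.1 step that sum to 1 form 66 combinations, matching the count reported above. A minimal sketch:

```python
def weight_grid(step=0.1):
    """All non-negative weight triples summing to 1 at the given step.

    For step = 0.1 this yields the 66 combinations searched in the
    experiments (compositions of 10 into three non-negative parts).
    """
    n = round(1 / step)
    return [
        (i / n, j / n, (n - i - j) / n)
        for i in range(n + 1)
        for j in range(n + 1 - i)
    ]
```

Each triple would then be scored (e.g., by Cliff’s delta of the resulting weighted composite) and the best-separating configuration retained.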
A core advantage of the proposed method is its intrinsic interpretability, enabling full traceability of assessment decisions. Unlike black-box models where predictions are opaque, our construct-level architecture allows stakeholders to audit exactly which factors differentiate high-potential candidates from others. In the previous example for the construct weights, values of 0.4 Professional Cognitive, 0.1 Leadership Disposition, and 0.5 Observed Leadership yield strong separation. In such a configuration, if a candidate scores highly, the system reveals that 50% of that score derives from Observed Leadership behaviors (e.g., gaze stability, head motion), providing actionable feedback rather than a binary reject/hire decision. This transparency supports responsible AI deployment by allowing bias detection, construct validation, and human-in-the-loop verification, aligning with best practices for high-stakes personnel selection.
We estimated the method’s robustness under label scarcity with only five top candidates. So, for five candidates, traditional cross-validation is insufficient. We instead used Leave-One-Top-Out (LOTO) analysis; in each iteration, one top candidate was removed from the top group and evaluated separately against the distribution of non-top candidates using the top potential score. The results presented in
Table 4 reveal that all five top candidates consistently score above the non-top mean. No single candidate dominates the observed effect, demonstrating that the separation does not rest on one sample. This analysis provides strong evidence of robustness despite extreme label scarcity.
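The LOTO procedure can be sketched as follows. This simplified version scores each held-out top candidate against the non-top distribution; expressing the margin in non-top standard deviations is an illustrative choice, not necessarily the paper's reporting convention.

```python
import statistics

def leave_one_top_out(top_scores, non_top_scores):
    """For each held-out top candidate, report how far its Top Potential
    Score sits above the non-top mean, in non-top standard deviations."""
    mu = statistics.mean(non_top_scores)
    sd = statistics.pstdev(non_top_scores)
    return [(s - mu) / sd for s in top_scores]
```

A positive margin for every held-out candidate indicates that the group-level separation is not driven by any single individual.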
To confirm statistical significance without relying on distributional assumptions, we ran a permutation test with 10,000 iterations. Candidate scores were randomly assigned to top and non-top groups while maintaining group sizes. The observed mean difference in Top Potential Score was compared with the null distribution. The observed difference was at the extreme tail of the permutation distribution, with a permutation p-value of around 0.0002. This demonstrates that the observed separation is extremely unlikely to arise by chance and is not a result of the limited sample size.
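The permutation procedure described above can be sketched as follows: candidate scores are repeatedly shuffled between groups of fixed sizes, and the observed mean difference is compared against the resulting null distribution. The add-one smoothing of the p-value is a common convention, assumed here rather than taken from the paper.

```python
import random
import statistics

def permutation_p(top, rest, n_iter=10_000, seed=0):
    """One-sided permutation test on the difference of group means,
    preserving group sizes across shuffles."""
    rng = random.Random(seed)
    pooled = list(top) + list(rest)
    observed = statistics.mean(top) - statistics.mean(rest)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        diff = (statistics.mean(pooled[:len(top)])
                - statistics.mean(pooled[len(top):]))
        if diff >= observed:
            hits += 1
    # Add-one smoothing avoids reporting an exact zero p-value.
    return (hits + 1) / (n_iter + 1)
```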
Given the scarcity of labels (N = 5 top candidates), we favor effect-size statistics over typical classification measurements (such as accuracy and F1-score). To avoid distributional assumptions, we quantify group separation with Cohen’s d and nonparametric Cliff’s delta. Cohen’s d values are interpreted using conventional thresholds of 0.2 (small), 0.5 (medium), and 0.8 (large). The permutation test (p = 0.0002) provides distribution-free significance testing, indicating that the separation is unlikely due to chance. Together, these metrics provide more relevant evidence of discriminative capacity than accuracy-based metrics would under extreme label scarcity, where classifier training is infeasible.
To evaluate the method’s efficacy as a screening tool, applicants were prioritized based on their Top Potential Score, with recall tested at various levels. As shown in
Table 5, the top 20% of the samples based on the potential score includes the whole executive-level group. This ability to significantly reduce the candidate pool while maintaining high-potential identification supports the utility of the framework as a decision-support screening mechanism rather than a final decision system.
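The recall-at-top-fraction screening metric can be sketched as follows; how the cutoff rounds (here, ceiling) is an assumption, since the paper reports 11 individuals for the top 20% of 59 candidates.

```python
import math

def recall_at_top(scores, labels, fraction=0.2):
    """Recall of positive labels (1 = top candidate) within the top
    `fraction` of candidates ranked by score, higher score first."""
    k = math.ceil(len(scores) * fraction)
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    found = sum(label for _, label in ranked[:k])
    return found / sum(labels)
```

A recall of 1.0 at the 20% cutoff corresponds to the result reported in Table 5: the entire executive-level group is retained while roughly 80% of the pool is screened out.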
We further analyzed the results based on the ranking. As we can see from
Figure 5, among the top 20% of candidates ranked by the method (11 individuals), five belonged to Level 1 (top managers), four to Level 2 (middle management), two to Level 3 (personnel reserve), and one to Level 0 (non-managerial positions). The percentage of Level 2 candidates in this subset aligns with organizational hierarchy: middle managers represent the immediate pipeline to executive roles and often exhibit overlapping leadership competencies, providing a logical validation of the method’s discriminative ability.
To determine whether different modalities provide redundant or complementary information, the correlation between the three construct scores was examined.
Figure 6 reveals low to moderate correlations, showing that verbal cognitive assessments, nonverbal behavioral signals, and personality all provide unique information about leadership potential. This validates the framework’s multimodal nature and justifies combining diverse AI-derived inputs rather than depending solely on one modality.
Finally, we ran an unsupervised clustering analysis in the three-dimensional construct space just for the sake of visualization (as shown in
Figure 7). Clustering was not used for classification, but rather to investigate structural trends in the data. One cluster had significantly higher construct means and contained a majority of top candidates, whereas the other clusters corresponded to lower leadership profiles. However, due to the small sample size, the clustering results are interpreted as exploratory and illustrative rather than predictive.
5. Discussion
The results achieved within the scope of this research allow us to conclude that AI-based video analysis is a promising instrument for candidate interview assessment. Our experiments show that the proposed method can be used as a screening tool to assess candidates’ abilities and rank them. The method ranked all five executive-level candidates within the top 20% of applicants (11 individuals). This subset also included four middle managers, two talent reserve candidates, and one non-managerial candidate. The concentration of executives and middle managers in the highest ranks aligns with organizational hierarchy: middle managers represent the immediate pipeline to executive roles and often exhibit overlapping leadership competencies. The single non-managerial candidate in this group does not undermine the overall discriminative pattern. These findings support the framework’s utility as a screening aid that prioritizes high-potential candidates while preserving human oversight for final decisions.
This study acknowledges several limitations. First, the dataset exhibits label scarcity, comprising 59 interviews with only five executive-level candidates (N = 5). However, this reflects the inherent rarity of senior leadership data in real-world hiring scenarios; as a result, standard supervised learning and prediction accuracy measurements are not applicable. To mitigate this, we prioritized effect sizes, nonparametric testing, and permutation-based inference to ensure statistical robustness. In addition, the dataset is monolingual (Russian); however, the proposed method is largely language-independent: the computer vision module relies on nonverbal behavioral cues, while the LLM-based verbal analysis supports multilingual processing. Future work will focus on validating the method using larger and more diverse datasets covering multiple languages and cultural contexts.
Second, organizational seniority serves as an imperfect proxy for leadership effectiveness. While Level 1 positions indicate formal executive roles verified by HR, they do not capture individual performance metrics or validated leadership outcomes. Future research should incorporate multi-source ratings, objective performance indicators, and longitudinal career progression data to strengthen criterion validity.
Third, although LLM-based behavioral coding was validated by HR experts who manually reviewed and confirmed score reliability, the specific model configuration (Llama 3.2, 3B parameters) may still exhibit prompt sensitivity or domain-specific biases. Future work will include cross-model validation to assess consistency across different LLM architectures and further establish measurement generalizability.
6. Conclusions
This study presents an interpretable multi-modal AI method for video-based candidate interview screening that incorporates nonverbal behavioral features, LLM-derived verbal assessments, and personality traits into grounded leadership constructs. Instead of depending on predictive models, the proposed method prioritizes construct validity, transparency, and statistical resilience in the face of limited data. The proposed method achieved strong discrimination between executive-potential and other candidates: the composite Top Potential Score separated groups with a large effect size (Cliff’s delta = 0.89, permutation p = 0.0002), and 100% of executive-level candidates were retained within the top 20% of ranked applicants. Importantly, these results were obtained without training a classifier, suggesting that meaningful screening signals can be extracted even when labeled executive data is scarce.
The findings show that multi-modal AI can improve early-stage executive screening by giving objective, scalable, and interpretable markers of leadership readiness. This approach provides a solid foundation for future AI-assisted employment systems and encourages additional testing on larger datasets and in real-world deployment scenarios.
From a cognitive psychology perspective, this study demonstrates an integration of static personality traits with dynamic behavioral markers to assess leadership potential. By quantifying verbal reasoning and nonverbal cues alongside dispositional traits, the framework provides a more comprehensive assessment of leadership competence that captures the interplay between cognitive processes and social context. This supports the development of dynamic personality models and offers objective tools for assessing cognitive-behavioral links in high-stakes scenarios.
Future research will center on verifying the proposed multi-modal AI framework on bigger and more diversified interview datasets, particularly those with a higher number of senior and executive candidates. With larger sample sets, supervised and semi-supervised learning algorithms can be tested to determine predictive accuracy, generalization, and deployment practicality. Additionally, validation against downstream outcomes would allow for a more comprehensive evaluation of executive readiness prediction.