Article

Explainable Use of Foundation Models for Job Hiring

by Vishnu S. Pendyala 1,*, Neha Bais Thakur 1 and Radhika Agarwal 2

1 Department of Applied Data Science, San Jose State University, San Jose, CA 95192, USA
2 Independent Researcher, San Jose, CA 95134, USA
* Author to whom correspondence should be addressed.
Electronics 2025, 14(14), 2787; https://doi.org/10.3390/electronics14142787
Submission received: 22 May 2025 / Revised: 4 July 2025 / Accepted: 8 July 2025 / Published: 11 July 2025

Abstract

Automating candidate shortlisting is a non-trivial task that stands to benefit substantially from advances in artificial intelligence. We evaluate a suite of nine foundation models (Llama 2, Llama 3, Mixtral, Gemma-2b, Gemma-7b, Phi-3 Small, Phi-3 Mini, Zephyr, and Mistral-7b) for their ability to predict hiring outcomes in both zero-shot and few-shot settings. Using only features extracted from applicants’ submissions, these models, on average, achieved an AUC above 0.5 in zero-shot settings. Providing a few examples similar to the job applicants based on a nearest neighbor search improved the prediction rate marginally, indicating that the models perform competently even without task-specific fine-tuning. For Phi-3 Small and Mixtral, all reported performance metrics fell within the 95% confidence interval across evaluation strategies. Model outputs were interpreted quantitatively via post hoc explainability techniques and qualitatively through prompt engineering, revealing that decisions are largely attributable to knowledge acquired during pre-training. A task-specific MLP classifier trained solely on the provided dataset outperformed the strongest foundation model (Zephyr in the 5-shot setting) by only about 3 percentage points in accuracy, whereas all the foundation models outperformed the baseline by more than 15 percentage points in F1 and recall, underscoring the competitive strength of general-purpose language models in the hiring domain.

1. Introduction

Human capital is a decisive determinant of organizational performance, and sub-optimal hiring decisions can incur considerable financial and strategic costs. The growth of online recruitment platforms has transformed talent acquisition into a large-scale information retrieval task: a single vacancy in high-demand industries may attract thousands of résumés, rendering manual triage both time-consuming and costly. Consequently, automated résumé screening has become indispensable for reducing labour expenses and shortening the time-to-hire. The automatic pipeline typically comprises information extraction [1] followed by candidate evaluation [2], with natural-language processing (NLP) methods constituting the core of algorithmic hiring systems [3].
Within NLP, text classification is a primary mechanism for converting unstructured résumé content into structured representations suitable for downstream decision making [4]. Recent advances in foundation models [5]—particularly large language models (LLMs), a powerful subclass of such models [6]—offer the capacity to perform complex linguistic reasoning with minimal task-specific supervision. Their ability to synthesize heterogeneous applicant information, generalize across domains, and generate human-readable rationales makes them attractive candidates for automating candidate shortlisting.
This study, therefore, investigates the efficacy of LLMs in résumé triage, evaluating their performance under zero-shot and few-shot paradigms and analyzing the transparency of their decisions. Throughout the remainder of the paper, we use the terms “LLM” and “foundation model” interchangeably to denote these large-scale pretrained language models.

2. Related Work

The literature on data-driven hiring has expanded rapidly. A Google Scholar query with various permutations of terms such as “AI recruitment,” “automated résumé screening,” and “large language models” now returns several thousand publications, underscoring both scientific and commercial interest in the field. Selected representative studies are summarized below.

2.1. Artificial Intelligence for Job Hiring

“AI-based recruitment” denotes the use of artificial-intelligence technologies to identify, screen, and appraise candidates for open positions [7]. Early research framed résumé triage as a supervised-learning problem, applying classifiers such as linear regression, decision trees, AdaBoost, and XGBoost to predict candidate suitability [8]. To augment limited résumé information with behavioural signals, authors incorporated social-media profiles, which are presumed to capture more spontaneous self-presentation than curated curricula vitae [9].
Beyond screening, AI has been deployed to expand the top of the recruiting funnel. Web-crawling APIs automatically harvest relevant vacancies across industry portals [10] while hybrid recommender systems that blend collaborative and content-based filtering suggest jobs to applicants. Large-scale case studies at multinational firms such as Amazon, L’Oréal, and Samsung highlight both the efficiencies and governance challenges of AI recruitment pipelines [11]. Other work proposes joint profiling of social-media activity, résumé content, and psychometric proxies such as “employability” and “emotional-intelligence” scores to support bidirectional matching between employers and candidates [12].
Automated interviews extend algorithmic hiring to the assessment stage: natural-language processing (NLP) and deep-learning (DL) models can conduct preliminary conversations, transcribe responses, and score them against competency rubrics [3]. Comprehensive surveys of the state of practice document the opportunities and risks of these systems, including fairness, transparency, and regulatory compliance [13]. End-to-end platforms have emerged that track applicants from résumé submission through interview evaluation, generate analytic reports, and even store candidate data on blockchains to facilitate secure talent sharing among recruiters [14].

2.2. Foundation Models in Human-Resources Departments

Research focus is now shifting from bespoke machine-learning pipelines toward foundation models—large, pretrained neural architectures that can be adapted to myriad downstream tasks with minimal additional supervision. For example, Du et al. [15] employ an LLM-conditioned generative adversarial network (GAN) to complete incomplete résumés and feed the enriched documents into a job-recommendation engine, thereby mitigating hallucination in résumé enhancement. The open-sourcing of datasets such as HR-MultiWOZ, which contains 550 multi-turn conversations spanning ten HR scenarios [16], provides a standardized benchmark for dialogue-based HR assistants.
LLMs have also been scrutinized for ethical and legal implications. A recent audit of state-of-the-art models uncovered differential recommendations based on gender, race, parental status, and political affiliation [17]. Nevertheless, longitudinal analyses argue that, when properly governed, AI can attenuate rather than amplify human bias in recruitment decisions [18].
Taken together, these studies indicate a clear trend: as foundation models mature, they are poised to become central components of HR technology stacks, supporting tasks from résumé parsing and job matching to conversational candidate engagement and bias-aware decision support.

2.3. Contribution of the Paper

Existing studies offer only preliminary evidence for applying large language models (LLMs) and other foundation models to résumé enhancement and hiring, leaving the domain largely unexplored. This work is, to our knowledge, the first to evaluate nine LLMs for automated candidate shortlisting under both zero-shot and few-shot regimes, providing a broad comparative perspective. We further present a detailed examination of model explainability, integrating post hoc quantitative analyses with qualitative insights derived from prompt engineering. Supplying the k most similar prior applications as exemplars yields only modest accuracy gains, indicating that foundation models remain effective even in the absence of task-specific examples; their performance is therefore largely owed to pre-training. Another notable finding is that these general-purpose models perform comparably to a model trained specifically for the task.

3. Methodology

This section outlines the empirical strategy used to address the three research questions (RQs) posed below and introduced in Section 1. We first describe the foundation models under evaluation and the dataset employed, followed by the preprocessing workflow, the in-context learning protocol, and the quantization procedure adopted to enable local experimentation.
  • RQ1: Can recent general-purpose foundation models accurately shortlist job applicants using zero- and few-shot learning?
  • RQ2: Is there sufficient quantitative evidence that explains the decisions produced by those models?
  • RQ3: Does prompt engineering contribute qualitative insight into model decisions?

3.1. Models Under Study

Nine open-weight or commercially accessible large language models (LLMs) were benchmarked, as described in Table 1. Model selection balanced parameter scale, licensing constraints, and availability on consumer hardware. All models were evaluated in their released instruction-tuned form, without any additional task-specific fine-tuning.

3.2. Dataset

We employed the Piramal Finance Candidate Screening dataset introduced by Gautam [26]. The dataset was sourced from a reputed Indian company, ‘Piramal Capital & Housing Finance.’ The corpus contains 932 anonymized candidate profiles together with binary hire/no-hire outcomes recorded after on-site performance reviews. Features from the profiles are extracted into .csv files with 21 columns. The original authors partitioned the data into mutually exclusive training (80%) and test (20%) splits; however, ground-truth labels for the test set are withheld. Consequently, all analyses reported here use the publicly available training split (745 instances). It must be noted that the dataset is used only for testing and few-shot learning, not for training any foundation model from scratch. Cross-validation strategies such as Leave-One-Out and K-Fold were employed to obtain generalized estimates of the metrics despite the limited size of the dataset, which is therefore substantial enough for the experiments.

3.3. Pre-Processing

To preserve privacy, it was ensured that raw identifiers, free-text cover letters, and firm-specific codes do not appear in the dataset. Inconsistent feature names were standardized (e.g., 1-Feb. → 1–2); categorical features were one-hot encoded (e.g., 0/Fresher → 0). All transformations were implemented in Python 3.11 using pandas v2.2 and are fully reproducible. Some columns represented after-the-fact information, such as the date of joining, and were therefore dropped. Post-cleaning, no rows were removed, but four columns were, leaving 745 rows and 17 columns.
To develop the task-specific baseline model, all the features in the dataset needed to be in numerical format. All features with ordinal data were encoded using an ordinal encoding scheme to preserve the inherent order within the categories. As the foundation models are pre-trained, the modest dataset size is not an issue: LLMs are known to be zero-shot reasoners [27]. Any adaptation in this work takes place only in the form of few-shot learning, where the shots are the examples provided to the LLM in the prompt.
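As an illustration of the cleaning and encoding steps above, the following sketch uses hypothetical column names and values (the real dataset has its own 21-column schema); only the operations mirror the actual pipeline.

```python
import pandas as pd

# Hypothetical column names for illustration; the real dataset has its own
# 21-column schema.
df = pd.DataFrame({
    "experience_band": ["1-Feb", "Fresher", "3-5", "1-2"],   # ordinal, one corrupted
    "department": ["housing", "personal", "housing", "sales"],  # nominal
    "hired": [1, 0, 1, 0],
})

# Standardize inconsistent labels (e.g., the spreadsheet artifact "1-Feb" -> "1-2").
df["experience_band"] = df["experience_band"].replace({"1-Feb": "1-2"})

# Ordinal encoding preserves the inherent order within the categories.
order = {"Fresher": 0, "1-2": 1, "3-5": 2}
df["experience_band"] = df["experience_band"].map(order)

# One-hot encode the nominal feature for the numeric baseline model.
df = pd.get_dummies(df, columns=["department"], dtype=int)
```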

3.4. In-Context Learning Protocol

We adopted an n-shot in-context learning (ICL) paradigm. It relies on the generalization capabilities of the LLM and utilizes the large context windows of LLMs to provide sufficient examples and instructions within the prompt itself. An “n-shot” is defined as the number of examples provided to the LLM before the actual question is asked. An example typically consists of a question and the corresponding answer that is expected from the LLM. This question-answer pair helps the LLM understand the way it should respond and which part of the question it should pay attention to for the answer. The number of examples to be provided should be balanced with the context window for the LLM. Providing too many examples might result in the prompt exceeding the entire context window, causing the LLM to truncate the input prompt, leading to loss of information and context, and resulting in the LLM giving an incoherent response. For each inference call, the prompt comprises:
1. n illustrative exemplars drawn from the training set using a nearest neighbor search;
2. A task instruction requiring the model to predict the hiring decision; and
3. The candidate profile under evaluation.
Unless explicitly specified, default values were chosen for the hyperparameters such as temperature.
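The three-part prompt described above can be sketched as a simple assembly function; the instruction text and profile strings below are illustrative placeholders, not the exact prompts used in the study.

```python
def build_prompt(instruction, exemplars, candidate_profile):
    """Assemble an n-shot in-context-learning prompt.

    exemplars: list of (profile, decision) pairs retrieved by nearest
    neighbor search; an empty list yields the zero-shot prompt.
    """
    parts = [instruction]
    for profile, decision in exemplars:
        parts.append(f"Candidate: {profile}\nDecision: {decision}")
    parts.append(f"Candidate: {candidate_profile}\nDecision:")
    return "\n\n".join(parts)

prompt = build_prompt(
    "Answer yes or no: will the candidate perform well if hired?",
    [("5 years sales experience, housing loans", "yes")],
    "Fresher, personal loans department",
)
```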

3.5. Quantization

LLM weights and parameters are often 32-bit floating-point numbers, and LLMs usually contain more than a billion parameters. Owing to their size, such models cannot fit in the memory of most consumer-grade desktop and laptop devices. To run these models on consumer-grade devices for on-device inference, with or without hardware accelerators such as GPUs, quantization is used to represent the model weights with lower-precision floating-point numbers or integers. Depending on the number of bits used to represent the individual weights, the memory footprint of the LLM can be reduced significantly, allowing the model to be loaded into the memory of consumer devices and speeding up its execution. Quantization has been used for the experiments detailed in the subsequent sections. Specifically, memory-efficient 4-bit weight quantization was applied via the bitsandbytes library to enable local execution. Empirical validation in the literature confirms that quantized checkpoints do not always exhibit statistically significant degradation [28] compared with full-precision baselines.
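bitsandbytes implements more sophisticated blockwise and NF4 schemes, but the core idea of trading precision for memory can be illustrated with a minimal symmetric absolute-max 4-bit quantizer:

```python
import numpy as np

def quantize_absmax_int4(w):
    """Simplified absolute-max 4-bit quantization of a weight vector.
    (Libraries such as bitsandbytes use refined blockwise/NF4 schemes;
    this only illustrates the precision/memory trade-off.)"""
    scale = np.max(np.abs(w)) / 7.0          # symmetric int4 range: -7..7
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float32 weight vector from int4 codes."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=1024).astype(np.float32)
q, scale = quantize_absmax_int4(w)
w_hat = dequantize(q, scale)
max_err = float(np.max(np.abs(w - w_hat)))   # bounded by scale / 2
```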

3.6. Evaluation Metrics

Model performance was assessed with various metrics such as accuracy, precision, recall, and F1-score calculated under macro averaging. Statistical uncertainty was estimated with confidence intervals.
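For concreteness, a self-contained sketch of the macro-averaged metrics and a percentile-bootstrap confidence interval follows; the study's exact CI procedure may differ.

```python
import random

def binary_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1 for 0/1 labels."""
    def prf(pos):
        tp = sum(t == pos and p == pos for t, p in zip(y_true, y_pred))
        fp = sum(t != pos and p == pos for t, p in zip(y_true, y_pred))
        fn = sum(t == pos and p != pos for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        return prec, rec, f1
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    per_class = [prf(c) for c in (0, 1)]
    macro = [sum(m[i] for m in per_class) / 2 for i in range(3)]
    return {"accuracy": acc, "precision": macro[0],
            "recall": macro[1], "f1": macro[2]}

def bootstrap_ci(y_true, y_pred, metric="accuracy", n=1000, alpha=0.05, seed=42):
    """Percentile-bootstrap confidence interval for a classification metric."""
    rng = random.Random(seed)
    indices = list(range(len(y_true)))
    stats = []
    for _ in range(n):
        sample = [rng.choice(indices) for _ in indices]
        stats.append(binary_metrics([y_true[i] for i in sample],
                                    [y_pred[i] for i in sample])[metric])
    stats.sort()
    return stats[int(n * alpha / 2)], stats[int(n * (1 - alpha / 2)) - 1]
```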

3.7. Conversation- and Instruction-Tuned LLMs

Many open-source LLMs have variants that are more tuned towards following instructions provided in the prompt to generate consistent answers, while some variants are tuned to utilize the entire conversation history with the LLM to provide an answer for the input prompt. These models are more consistent with their responses and can fully utilize the instructions and context provided in the prompt and conversation history.
LLMs tuned for chat and instruction-based prompts expect the prompt to be formatted into different sections using LLM-specific special tokens. These special tokens help the LLM get context about the text that follows or precedes it. The LLM generally expects the prompt to be structured as a conversation between two actors: User and Assistant/Model. These prompt sections are often referred to as User Prompt and Assistant/Model Prompt. Some LLMs also support a System Prompt, which can be used to set the context and expected behavior from the LLM. A complete prompt to the LLM is thus structured to start with a System Prompt, followed by alternating User Prompts and Assistant Prompts. Specifically for this research, the few-shot learning examples provided to the LLM model were modeled as a conversation with the LLM. An example of a chat template for an LLM is given in the next section. Samples demonstrating this structured format of LLM prompts are shared in Section 4.6.
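The alternating-turn structure can be modeled as a plain list of role-tagged messages. The helper below is a sketch (the names are our own) that folds the system text into a user turn when the model lacks a system role, as the Phi-3 Mini samples in Section 4.6 do.

```python
def to_chat_messages(system_prompt, shots, question, supports_system=True):
    """Model few-shot examples as an alternating user/assistant conversation.

    When the LLM lacks a system role, the system text is folded into the
    final user turn instead of occupying its own message.
    """
    messages = []
    if supports_system:
        messages.append({"role": "system", "content": system_prompt})
    for shot_question, shot_answer in shots:
        messages.append({"role": "user", "content": shot_question})
        messages.append({"role": "assistant", "content": shot_answer})
    final = question if supports_system else system_prompt + "\n" + question
    messages.append({"role": "user", "content": final})
    return messages
```

A library such as Hugging Face transformers would then render this message list through the model's own chat template.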

4. Experiments

This section reports the empirical results that answer the three research questions introduced above. We first benchmark the predictive performance of each foundation model. We present an ablation study that isolates the contribution of prompt-engineering choices and exemplar selection to overall performance (Section 5.2). Next, we examine the faithfulness of model explanations by correlating the qualitative explanations from the models with ground-truth reviewer rationales (Section 5.3). We also evaluate the performance of the task-specific traditional machine-learning baseline for comparison. The purpose of the comparison with the baseline model is to demonstrate that currently, even the most basic model specialized for the task exhibits moderately superior performance compared to the generalized foundation models. All reported figures include 95% bootstrap confidence intervals, and significance is assessed at the p < 0.05 level. To ensure reproducibility, all the LLMs were evaluated using a seed of 42 on the Huggingface Endpoints service, and the baseline model was cross-validated using a seed of 42.

4.1. Baseline Model

For comparison purposes, a single-layer MLP (Multi-Layer Perceptron) model for the classification task was set up to act as the baseline model for all the subsequent experiments. It consists of a single-layer MLP, with 128 units and ReLU activation in the hidden layer and Sigmoid activation on the output layer. The baseline model has been trained for 118 epochs using binary cross-entropy loss, Adam optimization with default hyperparameters, and an early stopping criterion that monitors the loss to prevent overfitting. The model, therefore, is exclusively trained for the job hiring task using the given dataset, unlike LLMs, which are trained and fine-tuned on much more general data. Therefore, the baseline model can be expected to perform better than LLMs on traditional classification metrics like accuracy. To get unbiased estimates of the model performance, all the evaluation metrics were calculated using 5-fold cross-validation.
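A minimal reconstruction of this baseline using scikit-learn's MLPClassifier is sketched below on synthetic stand-in data; the paper's exact implementation and epoch budget may differ, and scikit-learn applies the logistic (sigmoid) output for binary targets automatically.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in features; the real model is trained on the 17-column
# encoded Piramal dataset described earlier.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Single hidden layer of 128 ReLU units, Adam with default hyperparameters,
# log-loss (binary cross-entropy) objective, and early stopping on the loss.
mlp = MLPClassifier(hidden_layer_sizes=(128,), activation="relu",
                    solver="adam", early_stopping=True,
                    max_iter=300, random_state=42)

# Unbiased performance estimate via 5-fold cross-validation.
scores = cross_val_score(mlp, X, y, cv=5)
```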

4.2. Tokenizer and Chat Template

Since each LLM has a different syntax for communication, the tokenizer and chat template specified by the LLM creators were used for querying. For example, the chat template in the Jinja template format [29] for Phi-3 Mini looks as follows. Here, bos_token represents the beginning-of-sequence token.

```jinja
{{ bos_token }}{% for message in messages %}
  {% if message['role'] == 'user' %}
    {{ '<|user|>' + '\n' + message['content'] + '<|end|>' + '\n' + '<|assistant|>' + '\n' }}
  {% elif message['role'] == 'assistant' %}
    {{ message['content'] + '<|end|>' + '\n' }}
  {% endif %}
{% endfor %}
```
Using the Jinja template format provides flexibility, as prompts can be customized for different LLMs with ease. The templates ensure uniform prompt structures, providing consistency. Adding more templates as requirements evolve is easy and therefore the approach is scalable. The syntax is intuitive to understand. For instance, in the above prompt, {{ bos_token }} is a placeholder for dynamic content and {% for message in messages %} is a for-loop.

4.3. Zero-Shot Learning and Few-Shot Learning

Earlier research has shown that LLMs may function as few-shot learners [30]. In this paper, each LLM was tested using zero-shot, one-shot, three-shot, and five-shot in-context learning. The nearest neighbor algorithm was used to find the examples most similar to the test data being classified; here, the test data comprises the details of the specific job candidate whose performance the LLM will predict. Cosine similarity is used to identify the nearest 10 neighbors for each test item, and these nearest neighbors serve as the examples for few-shot learning.
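The retrieval step can be sketched as follows; the two-dimensional vectors are toy stand-ins for the encoded candidate features.

```python
import numpy as np

def nearest_neighbors(query, corpus, k=10):
    """Indices of the k corpus rows most similar to `query` by cosine similarity."""
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    sims = c @ q                     # cosine similarity of each row to the query
    return np.argsort(-sims)[:k]     # most similar first

corpus = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
idx = nearest_neighbors(np.array([1.0, 0.05]), corpus, k=2)
```

In the leave-one-out evaluation described later, the query candidate itself would be filtered out of `corpus` before retrieval.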

4.4. System Prompt

Some LLMs support system prompts as part of their chat template; when an LLM does not, the system text is supplied as part of the user prompt. The job description was unavailable from the résumé dataset, so it was extrapolated from a job description posted online by the same company, ‘Piramal Capital & Housing Finance,’ and generalized for multiple positions in the same organization. This is how Piramal describes itself:
“Piramal Capital & Housing Finance Ltd. (Piramal Finance), at its core, believes we are a company that is of the people of India and for the people of India. Our story has been one of steady change. We entered the retail finance area with housing finances and now offer business and personal loans. We use customer feedback and new market opportunities to create long-term, value-driven financial services. At Piramal Finance, we emphasize digitization and online lending, while still giving our valued customers a human touch and expanding branches throughout Bharat.”
The system prompt used in the work is as follows:
“You are a Hiring Manager at the company Piramal Capital & Housing Finance Ltd. You are looking to hire multiple people with multiple roles and varying experiences for the Direct Sales Team (DST). You are currently hiring for housing, business, personal loans, sales, and other related departments. Any relevant experience in any of these and adjacent fields is valuable. Job responsibilities can change depending on the department, so a little flexibility is allowed when hiring. You do not have access to all the open positions and their job descriptions, but there are some common requirements for all the open positions that you should know which are: degree of travel, key stakeholders, qualifications, skills and experience, and key roles/responsibilities.”
“You will be asked questions about whether a candidate will perform well in a role at Piramal or not if they are hired. You only answer with a yes or no, where no means the candidate cannot perform, and yes means the candidate will perform well.”

4.5. User Prompt Template

Each question to the LLM for a candidate was formulated using the following prompt template, where each placeholder ‘{}’ was replaced by the individual candidate’s answer:
“Here is a candidate who applied for a DST role in the ‘{}’ department of Piramal Finance. Table 2 shows the responses the candidate gave to the questions mentioned on the application. Do you think that the candidate’s performance will be up to the mark if they are hired?”

4.6. Few-Shot Prompt Chat Template

For few-shot prompts, the input to the LLM will be provided in the following prompt formats. Special tokens that provide extra context and structure for the prompts to Phi-3 Mini and Zephyr are described below.
  • <|system|> - Special token for the LLM that denotes the start of system prompt section.
  • <|user|> - Special token for the LLM that denotes the start of the user prompt section.
  • <|assistant|> - Special token for the LLM that denotes the start of the LLM response section.
  • <|end|> - Special token for the LLM that denotes the end of a section.
  • </s> - Special token for the LLM that denotes the end of a user prompt/system prompt/assistant response.
  • <|endoftext|> - Special token for the LLM that denotes the end of the complete prompt to LLM.
Without System Prompt (Phi-3 Mini)
For K = 0: <|user|> - You are a Hiring Manager at the company Piramal Ca … m, and yes means the candidate will perform well. Here is a candidate who applied for a DST role in … performance will be up to the mark if they are hired? - <|end|><|endoftext|>
For K = 1: <|user|> - Here is a candidate who applied for a DST role in … ormance will be up to the mark if they are hired? - <|end|><|assistant|>no<|end|>
<|user|> - You are a Hiring Manager at the company Piramal Ca … m, and yes means the candidate will perform well. Here is a candidate who applied for a DST role in … ormance will be up to the mark if they are hired? - <|end|><|endoftext|>
For K = 3 & 5: …<Prompts for each 3 & 5 shots are similar to the one for K = 1>…
With System Prompt (Zephyr)
For K = 0: <|system|> You are a Hiring Manager at the company Piramal Ca … m, and yes means the candidate will perform well. - </s> <|user|> - Here is a candidate who applied for a DST role in … ormance will be up to the mark if they are hired? - </s>
For K = 1: <|system|> You are a Hiring Manager at the company Piramal Ca … m, and yes means the candidate will perform well. - </s> <|user|> - Here is a candidate who applied for a DST role in … ormance will be up to the mark if they are hired? - </s>
<|assistant|>no</s> <|user|> - Here is a candidate who applied for a DST role in … ormance will be up to the mark if they are hired? - </s>
For K = 3 & 5: …<Prompts for each 3 & 5 shots are similar to the one for K = 1>…

4.7. Dataset Evaluation and Interpretability

For evaluation, the LLMs were asked in succession for their hiring decision on all 745 candidates in the dataset. The evaluation strategy resembles Leave-One-Out Cross-Validation: the sample being evaluated is removed from the nearest-neighbor set used for its few-shot examples, eliminating any data-leakage concerns. On top of this, only the first N neighbors are used in the prompt for the few-shot examples. The system prompt, the generated user prompts, and the few-shot examples were provided to the model in its specified chat template. The answer from the LLM was then parsed to extract its prediction, and the actual value and the model prediction were recorded in a .csv file for offline analysis. The models used during this phase of the experiment were all quantized using a 4-bit representation, as described in Section 3.5.
The output of a large language model is a probability distribution over its vocabulary, and the token with the highest probability is chosen as the next output token. A key aspect of understanding the interpretability of large language models is to quantify the impact of key tokens in the input prompt on the output tokens produced by the LLM. This helps in understanding which key tokens in the input prompt improve the confidence in the output tokens and which tokens reduce the confidence in the output tokens.
A perturbation-based technique described by Captum [31] calculates the impact of key tokens in the input prompt by randomly replacing the key tokens in the input prompt with baseline values, and then observing the change in the probability of the token originally produced by the LLM as output. A baseline value can be simply an empty string, indicating complete omission of the input token, or another set of tokens that contextually make sense in place of the input tokens. The change is calculated using the difference in the log probability of the output token with the original input token and the log probability of the output token with the baseline value.
Analyzing this change yields insights into the importance/impact of key tokens in the input prompt on the LLM output. If the change is positive, it means that the probability of an output token because of the original tokens in the input prompt is higher than the probability of an output token because of baseline values in the input prompt. A positive value indicates that those tokens in the input prompt improve the confidence in the token predicted by the LLM. On the other hand, if the change is negative, it means that those tokens in the input prompt worsen the confidence in the token predicted by the LLM compared to the baseline values in the input prompt. The magnitude of change quantifies how much impact those tokens have on the confidence with which the LLM produces the output token.
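The log-probability difference described above can be sketched with a toy scoring function standing in for the LLM; Captum's actual perturbation-based API operates on real model forward passes, so every name below is illustrative.

```python
import math

def output_logprob(tokens):
    """Toy stand-in for an LLM's log-probability of a fixed output token
    ("yes") given the input tokens; a real setup queries the model itself."""
    score = 2.0 * tokens.count("experienced") - 1.5 * tokens.count("fresher")
    return -math.log1p(math.exp(-score))     # log of a sigmoid over the score

def token_attribution(tokens, position, baseline=""):
    """Log-prob change when one input token is replaced by a baseline value.

    Positive -> the token raises confidence in the model's output token;
    negative -> the token lowers it relative to the baseline.
    """
    perturbed = list(tokens)
    perturbed[position] = baseline
    return output_logprob(tokens) - output_logprob(perturbed)

attr_pos = token_attribution(["experienced", "sales", "candidate"], 0)
attr_neg = token_attribution(["fresher", "sales", "candidate"], 0)
```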

5. Results and Discussion

The performance results of all the models for zero-shot, one-shot, three-shot, and five-shot are given in Table 3, Table 4, Table 5 and Table 6, respectively. The models are sorted in decreasing order of the Area Under the Curve (AUC) score. The curve referred to here is the Receiver Operating Characteristic (ROC) curve often used in evaluating classification algorithms. Because the original class distribution was unbalanced, and for a comprehensive evaluation, multiple evaluation metrics were computed. A model that only learns the class distribution of the training data will have high accuracy but poor results for metrics such as the F1 and AUC scores.
To further check the validity of the performance metrics, confidence intervals (CIs) were computed. A CI is a range that, at the chosen confidence level, is expected to contain the true value of the metric. The CIs were calculated at a 95% confidence level, excluding the baseline model, for the various scenarios.
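A normal-approximation 95% CI over a set of per-model metric values can be computed as follows; the AUC numbers here are hypothetical, for illustration only.

```python
import statistics

def ci95(values):
    """Normal-approximation 95% confidence interval for the mean of a
    set of per-model metric values."""
    mean = statistics.mean(values)
    se = statistics.stdev(values) / len(values) ** 0.5   # standard error
    return mean - 1.96 * se, mean + 1.96 * se

# Hypothetical per-model AUC values, one per LLM, for illustration only.
aucs = [0.50, 0.51, 0.52, 0.50, 0.51, 0.50, 0.52, 0.51, 0.50]
lo, hi = ci95(aucs)
```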
For the values corresponding to Table 3, the confidence intervals for the various performance metrics of the LLMs are as follows: Accuracy: [0.3994, 0.5131], Precision: [0.3883, 0.4332], Recall: [0.4923, 0.9993], F1: [0.3616, 0.6015], AUC: [0.5000, 0.5122], AUPRC: [0.5623, 0.6949]. Comparing with the values in Table 3, the performance metrics for the models Llama 2, Llama 3, Mixtral, Mistral-7b, Gemma-7b, and Phi-3 Small lie within the 95% confidence intervals.
For the values corresponding to Table 4, the confidence intervals are as follows: Accuracy: [0.4296, 0.4852], Precision: [0.3998, 0.4159], Recall: [0.7212, 0.8939], F1: [0.5238, 0.5549], AUC: [0.5070, 0.5284], AUPRC: [0.6222, 0.6694]. The performance metrics for the models Llama 2, Mixtral, Phi-3 Mini, and Phi-3 Small lie within the 95% confidence intervals when compared with the values in Table 4.
For the values corresponding to Table 5, the confidence intervals are as follows: Accuracy: [0.4476, 0.5224], Precision: [0.4032, 0.4319], Recall: [0.6231, 0.8428], F1: [0.5039, 0.5495], AUC: [0.5109, 0.5446], AUPRC: [0.5987, 0.6576]. The performance metrics for the models Mixtral, Mistral-7b, and Phi-3 Small lie within the 95% confidence intervals, considering the values in Table 5.
For the values corresponding to Table 6, the confidence intervals are as follows: Accuracy: [0.4473, 0.5242], Precision: [0.4018, 0.4273], Recall: [0.5787, 0.8216], F1: [0.4864, 0.5423], AUC: [0.5066, 0.5388], AUPRC: [0.5831, 0.6503]. Comparing with the values in Table 6, the performance metrics for the models Llama 2, Llama 3, Mixtral, and Phi-3 Small lie within the 95% confidence intervals.
From the above tables and calculations, it can be concluded that the values for models Mixtral and Phi-3 Small lie in the CI for all the performance metrics and all the shots considered in this paper. Along with the CI, the changing pattern of AUC for all the models can be seen in Figure 1. The differences in the model performances may be attributable to the diversity in the training process and the data used for the various models.
This work presents only a proof of concept; however, no assumptions have been made that limit the generalizability of the findings. The experiments demonstrate that, for use in the job hiring space, the foundation models do not entirely match the human intuition evident in the dataset.
  • For Gemma-2b, the AUC remains constant at 0.5 across all shot counts.
  • The AUC for Llama 3 decreases until three shots and then increases.
  • For Phi-3 Mini, the AUC keeps decreasing as the number of shots increases.
  • For Gemma-7b, Llama 2, and Zephyr, the AUC initially increases steeply and then decreases.
  • For Mixtral, Mistral-7b, and Phi-3 Small, the AUC increases until three shots and then tends to plateau.
  • Even though the baseline model has higher accuracy than all the LLMs for all shot counts, the gap between the LLMs and the baseline model kept shrinking as the number of shots increased.
  • The LLMs consistently outperform the baseline model on the recall and F1 metrics.
To overcome the limited dataset size and the lack of labels in the test dataset, and to ensure reproducibility of the results, all experiments were conducted in well-defined steps, thoroughly described in the previous section. The evaluation metrics for the baseline model on the training dataset were calculated using five-fold cross-validation, while the LLMs were evaluated using leave-one-out cross-validation across all shot counts, which reduces bias in the results presented in this section.
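As a reference for the evaluation protocol, the two splitting schemes can be sketched in a few lines of plain Python; leave-one-out is simply k-fold with k equal to the number of samples. Libraries such as scikit-learn provide equivalent KFold and LeaveOneOut utilities.

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists for k-fold cross-validation."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

# Five-fold CV (used for the baseline model) on a toy set of 10 samples
folds_5 = list(k_fold_indices(10, 5))
# Leave-one-out (used for the LLMs) is the k = n special case
folds_loo = list(k_fold_indices(10, 10))
```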

5.1. Analysis of Classification Metrics

In hiring contexts, false positives (incorrectly recommending a poor candidate) generally carry substantial costs, including training expenses, potential severance costs, team disruption, and opportunity costs from the role remaining unfilled by a qualified candidate. Thus, a precision-focused model helps ensure that hiring recommendations maintain high-quality standards. Recall becomes more critical in specific scenarios where missing qualified candidates proves exceptionally costly. This occurs most commonly in highly specialized positions where qualified candidates are scarce, during periods of intense competition for talent, or when filling roles that are absolutely critical to business operations.
The baseline model demonstrates consistent performance across all metrics (AUC: 0.5758, Accuracy: 0.5852, F1: 0.4390). The LLMs show trade-offs between precision and recall: several models achieve higher recall at the cost of accuracy and precision, and performance varies significantly across the different few-shot learning scenarios.
Llama 3 showed steady improvement from zero-shot to five-shot learning, maintaining high recall across all scenarios while improving precision. Its five-shot AUC (0.5427) also approached baseline performance. Phi-3 Mini showed inconsistent performance with degradation in higher shot scenarios, but still achieved a balanced precision–recall trade-off compared to other models. Mistral-7b showed strong performance across all scenarios, with a peak in five-shot learning and demonstrated exceptional recall (0.9593) but low precision (0.4008), proving its effectiveness when recall is prioritized over precision. Zephyr showed strong zero-shot accuracy (0.6027) but inconsistent few-shot performance. Mixtral showed stable performance, with consistently high recall across all few-shot settings (>0.78), but poor precision.
Phi-3 Small showed declining F1 with increased shot counts, maintaining high recall but struggling with precision. Gemma-7b's F1 likewise dropped as the number of shots increased, while Gemma-2b demonstrated completely static performance across all scenarios with an AUC of 0.5, the same as a random model. Gemma-2b has a small context window, so the long prompt may not have fit inside it, leading to the completely random performance. Llama 2 had the best three-shot AUC among all models, with strong AUPRC scores (>0.6) up to three shots. Overall, for balanced performance approaching baseline levels, Llama 3 in five-shot scenarios and Mistral-7b represent the strongest alternatives, with superior recall. Hence, as stated earlier, LLMs are well suited to hiring decisions in scenarios where missing qualified candidates proves exceptionally costly, while using far fewer training samples than traditional ML approaches (1–5 examples compared to 600 samples for the baseline model).
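The precision–recall trade-offs discussed above can be made concrete with a minimal metric computation. The labels and binarized "Yes"/"No" predictions below are illustrative, not the paper's data; in practice, library implementations such as scikit-learn's metrics module would be used.

```python
def precision_recall_f1(y_true, y_pred):
    """Binary precision, recall, and F1 from hard predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Illustrative hire (1) / no-hire (0) labels and model responses
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 1, 0, 1, 1, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
# A recall-heavy model: recall = 1.0 while precision = 4/6
```

This mirrors the pattern several LLMs show in the tables: every qualified candidate is recovered (high recall) at the cost of extra false positives (lower precision).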

5.2. Visualizing Interpretability

This subsection visualizes interpretability using the first resume sample from the dataset. All LLMs were evaluated for interpretability on this same candidate, and their responses to each question in the user prompt section were treated as the key tokens in the input prompt. In the following heatmaps, these responses, representing features such as income and employment, are shown on the y-axis, and the number of shots is shown on the x-axis; the x-axis markers also show the model output for each K-shot setting. The cell values are the contribution scores of each token to the predicted output token: positive values in green support the model predicting “Yes”, and negative values in red oppose it.
The expected response for all the models for this candidate is “no”. The name of the company for which the candidate worked before applying for a position has been left anonymous or unanswered; this “not answered” value is used as the baseline for the specific “Company X” when generating the heatmaps in Figure 2. Each cell in the heatmap represents the feature importance score of that set of tokens in the input prompt with respect to the output token for each K-shot setting. For example, a cell value of 0.5 means the probability of the output token is higher when the input prompt contains the response associated with that cell than when it contains the baseline value of “not answered”. A value of −0.2 means that probability is lower than with the baseline. A larger magnitude implies that those tokens in the input prompt have a more significant impact on the probability of the output token.
The heatmap for Gemma-2b in Figure 2a can be interpreted as follows. The remaining heatmaps can be similarly interpreted. The education feature, “Full Time,” indicating full-time study, has a strongly negative score, implying full-time study worked against this candidate in the job hiring. Often, for some types of jobs, work experience is preferred over full-time study during the same period. Similarly, graduate studies can also work against candidates. The “Mass Affluent Housing” department has a positive score, implying that applying to this department of the company increases the chances of securing a job. This aligns with the human intuition because affluent housing is likely to have more jobs than any other department. The heatmap also shows that selling “MSME/SME Loans” hurts the selection for the job. MSME loans are financial products in India designed to support the growth of small businesses. It is possible that selling such loans is relatively easy, and doing easy jobs in the past may not align well with the current job requirements. However, it must be noted that the scores are only indicative. The explainability technique used for this study is a local interpretation method. That means it explains the model’s prediction for a specific input instance, rather than giving a broad overview of model behavior across the entire dataset.

5.2.1. Gemma-2b

As seen in Figure 2a, in the Gemma-2b model, there is no consistent trend that can be observed for any of the tokens in the input prompt across all the K-shot scenarios. For all the shots, this model gave the response “Yes”, even though the expected response for this candidate was “No”. The response “Full Time” has a strong negative impact across one-shot (−2.4), three-shot (−0.66), and five-shot (−1.9) scenarios, suggesting the model predicts “Yes” more confidently when the response is “not answered” compared to “Full Time”. “Company X” shows a notable positive impact in the three-shot scenario (1.4), indicating the model predicts “Yes” less confidently when the response is “not answered” compared to this specific phrase. “MSME/SME Loans, Others” has a consistently negative impact, particularly in zero-shot (−0.77), one-shot (−1.5), and five-shot learning (−0.8). Some responses, such as “1–2 members” and “NBFC”, show relatively neutral impacts across different shot scenarios.

5.2.2. Gemma-7b

In Gemma-7b, one-shot learning showed stronger negative impacts across various candidate responses, as seen in Figure 2b. Three-shot and five-shot learning generally showed a more positive impact on many candidate responses than zero-shot and one-shot learning. For all the shots, this model gave the response “Yes”, even though the expected response for this candidate was “No”. “Mass Affluent Housing” had a positive impact across all shot types, with the highest value, 1.3, observed in three-shot learning. Both “Full Time” and “Graduate” had a strong negative impact in one-shot learning but a large positive impact on the output token in three-shot and five-shot learning. Barring one-shot learning, the “MSME/SME Loans, Others” response had a consistently negative impact on the output token.

5.2.3. Llama 2

Llama 2 correctly predicted “No” for the candidate only in the case of the five-shot learning, which can be seen from Figure 2c. For zero-shot learning, the majority of the responses have a positive impact on the predicted token “Yes”. For one-shot learning, this model was much more confident in predicting “Yes,” as all the tokens in the input prompt had positive values. For three-shot learning, the impact of tokens in the input prompt was mixed, with “Mass Affluent Housing” and “Company X” having the most significant positive impact on the probability of the output token “Yes”. For five-shot learning, even though the model predicted “No”, most of the tokens in the input prompt had a large negative impact on the probability of “No” compared to the baseline value of “not answered”.

5.2.4. Llama 3

Llama 3 predicted “Yes” for this candidate for all the K-shot scenarios. From Figure 2d, there is a clear trend showing the increasingly negative impact of all the tokens in the input prompt as the number of shots kept increasing, which indicates the model is becoming less confident in predicting “Yes” with an increasing number of shots. “Mass Affluent Housing” had a consistently positive impact across different shots on the output token. “Company X” and “INR 5L - INR 15L” showed strong negative impacts across various shot configurations. “Full Time” and “Above 10K” also showed significant negative impacts, particularly in the three and five-shot configurations on the output token.

5.2.5. Phi-3 Mini

From Figure 2e, the Phi-3 Mini model correctly predicted “No” for this candidate in the zero-shot scenario but incorrectly predicted “Yes” for the other shots. Except for zero-shot learning, the impact of many tokens in the input prompt decreases as the number of shots increases from 1 to 5. “MSME/SME Loans, Others” in one-shot learning had the highest positive impact (+8.79), and “INR 5L-INR 15L” in one-shot learning also showed a strong positive impact (+8.37). “Company X” in zero-shot learning had the highest negative impact (−8.46), and “MSME/SME Loans, Others” in zero-shot learning also showed a strong negative impact (−8.66). Some entries, such as “Mass Affluent Housing” across different shots, showed values close to zero, indicating minimal impact on the log probability of the predicted token.

5.2.6. Zephyr

In the case of Zephyr, seen in Figure 2f, it correctly predicted “No” for this candidate in zero-shot learning and one-shot learning scenarios, but incorrectly predicted “Yes” for other shots. From the heatmap, there does not seem to be any general trend in the interpretability scores with an increasing number of shots. “Company X” consistently showed positive impact values across all shots and predicted tokens, indicating a strong influence in increasing the log probability for both “No” and “Yes” predictions. “NBFC” had a significant negative impact, especially in the one-shot learning and three-shot learning conditions for the “No” predicted token. “MSME/SME Loans, Others” showed a strong negative impact, particularly for the three-shot and five-shot conditions for the “Yes” predicted token. Several responses like “Graduate”, “Full Time”, and “Others” tend to have minor positive or negative impacts, indicating relatively less influence on the predicted token’s log probability.

5.2.7. Mistral-7b

Mistral-7b incorrectly predicted “Yes” for all the K-shot learning scenarios. There is a clear trend that shows the increasing positive impact of tokens in the input prompt on the output token predicted by the model, as seen in Figure 2g. The words “Above 10K”, “Company X”, “INR 5L - INR 15L” and “MSME/SME Loans, Others” had a strong negative impact (deep red) under the zero-shot learning scenario. As the number of shots increases (1, 3, 5), some words such as “Full Time”, “Graduate” and “Company X” show a transition from negative to positive impact (from red to green) on the output token. This indicates that with more context, these words contribute more positively to the prediction, even though the token predicted by the model was incorrect. Some words like “Mass Affluent Housing” show relatively stable and moderate impacts (around yellow) across all shot conditions, indicating a consistent influence on the predicted token.

5.3. Qualitative Reasoning

To understand how LLMs reason, additional experiments were performed using prompting. Each LLM, after giving its response to the prompts as part of previous prompts, was asked this follow-up question—“Explain the reasoning behind the recommendation you gave for the above candidate very briefly”. The LLMs used diverse reasoning to explain their decisions. The reasons would range from evaluating specific industry experience, such as NBFC, to general qualifications and educational background. Some reasons would explicitly mention years of experience and specific roles or projects the candidates had worked on to justify the decision made by the LLM. Despite different models and values of K for K-shot learning, the reasoning appeared consistent in terms of structure and focused on candidate experience and qualifications.
To analyze the reasoning trends, the reasons were categorized based on common themes, and the frequency of each category was evaluated. The common themes were experience, educational background, specific skills or qualifications, balance of qualifications, and fit for the role. Each reason given by the LLMs was assigned one or more of the identified themes, as shown in Figure 3. Experience with 32 occurrences was the most frequently mentioned theme, indicating that the LLMs heavily rely on the candidates’ experience when making decisions, which aligns with the human intuition. This includes specific industry experience, years of experience, and previous work experience. Specific skills or qualifications, with 18 occurrences, were the second most common theme, focusing on the specific skills and qualifications of the candidates. This showed that the LLMs did consider detailed aspects of the candidates’ expertise. Educational background was only considered four times, indicating that while educational background was considered, it was not as prominent as experience and specific skills. Fit for the role was rarely mentioned directly, suggesting that explicit statements about the candidate’s fit for the role were less common in the LLMs’ reasoning. There were no mentions of balancing experience and education, indicating that the LLMs may have been focusing more on individual aspects rather than a combined evaluation. As can be seen, most of the qualitative explanations from the foundation models are consistent with human reasoning.
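The frequency analysis described above amounts to counting theme labels over the collected explanations. A minimal sketch, using hypothetical theme assignments rather than the study's actual annotations:

```python
from collections import Counter

# Hypothetical theme assignments, one list per LLM explanation
themes_per_reason = [
    ["experience", "specific skills"],
    ["experience"],
    ["educational background", "experience"],
    ["specific skills"],
]

# Flatten and tally theme occurrences across all explanations
theme_counts = Counter(t for themes in themes_per_reason for t in themes)
# e.g. theme_counts["experience"] == 3
```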

5.4. Ethical Considerations

Large language models (LLMs) inherit statistical associations, both desirable and undesirable, from the web-scale corpora on which they are pre-trained. When such models are applied to hiring, spurious correlations can translate into disparate treatment of individuals based on legally protected characteristics such as age, gender, or ethnicity. Algorithmic fairness, therefore, constitutes a first-order performance criterion rather than a post hoc addendum. Recent surveys emphasize the dual importance of technical safeguards (such as bias-controlled data curation and model explainability) and organizational governance (e.g., independent audit and external oversight) in mitigating discrimination risks. To address algorithmic bias in hiring, it is recommended to adopt technical measures such as unbiased dataset frameworks and improved algorithmic transparency, alongside management strategies like corporate ethical governance and external oversight [32].

5.4.1. Bias Audit

The dataset employed in this study was manually inspected to verify the absence of explicit sensitive attributes such as gender or race indicators. Although such screening reduces direct leakage, proxy variables, such as years since graduation, may still encode demographic information. We therefore report class-conditional performance disaggregated by the available non-protected features.

5.4.2. Human Oversight

A study based on interviews with experienced recruiters emphasizes the irreplaceable role of human judgment in final hiring decisions. While AI can assist in initial screening, concerns remain about its ability to fairly evaluate candidates without reinforcing biases [33]. Consistent with guidance from the U.S. Equal Employment Opportunity Commission and the EU AI Act, we envisage LLM-based screening only as an assistive filter that narrows the applicant pool; final hiring decisions must remain with qualified human reviewers. Empirical evidence indicates that hybrid human-AI workflows can improve both efficiency and equity when appropriate guardrails are in place.

5.4.3. Explainability and Accountability

Because transformer models operate as high-dimensional black boxes, transparent rationales are essential for contestability. We therefore couple each prediction with token-level saliency maps derived from Captum’s Feature Ablation [34] implementation, selected for its O(n) complexity relative to the number of input tokens, among other reasons. Prior work has demonstrated the complementary value of perturbation-based methods such as LIME [35], SHAP [36], and a path-integrated method called Integrated Gradients [37] in adjacent domains such as misinformation detection [38]. It has also been demonstrated that relying on evaluation metrics without considering explainability can be misleading [39]. Future work will incorporate an ensemble of explanation techniques to triangulate causal attributions.
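The idea behind feature ablation is to replace one input feature at a time with a baseline value (here, “not answered”) and record the change in the model's output, requiring one extra forward pass per feature, hence the O(n) cost. The sketch below illustrates the principle on a toy linear stand-in for the LLM's probability of answering “Yes”; it is not Captum's actual implementation, and the weights are purely illustrative.

```python
def feature_ablation(predict, x, baseline):
    """Attribution per feature: change in model output when that feature
    is replaced by its baseline value (one forward pass per feature)."""
    ref = predict(x)
    scores = []
    for i in range(len(x)):
        ablated = list(x)
        ablated[i] = baseline[i]      # ablate one feature to its baseline
        scores.append(ref - predict(ablated))
    return scores

# Toy linear "model"; weights are illustrative, not learned
weights = [0.5, -0.3, 0.2]
predict = lambda x: sum(w * v for w, v in zip(weights, x))

scores = feature_ablation(predict, [1.0, 1.0, 1.0], baseline=[0.0, 0.0, 0.0])
# For a linear model, feature i's attribution recovers weights[i] * x[i]
```

Positive scores correspond to the green cells in Figure 2 (the feature raises the probability of “Yes” relative to the baseline) and negative scores to the red cells.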

5.4.4. Privacy

Candidate documents may contain personally identifiable information (PII). All experiments were executed on artifacts that were anonymized in accordance with GDPR Recital 26. No raw application data was used or will be released.
In conclusion, while our results showcase the promise of LLMs for scalable résumé screening, responsible deployment demands rigorous bias audits, comprehensive explanation tooling, and sustained human oversight.

6. Conclusions and Future Directions

This study presents a systematic assessment of state-of-the-art instruction-tuned large language models (LLMs) as zero- and few-shot resume screeners. Across nine contemporary models, the median macro-F1 hovered around 0.52 in the few-shot scenario, several percentage points above the task-specific machine learning model that served as the classical baseline. The other evaluation metrics follow a similar pattern. Performance increased only slightly with the number of in-context exemplars, suggesting that most of the accuracy is attributable to pre-training.
Beyond aggregate accuracy, we evaluated the faithfulness of model rationales. Feature-ablation sensitivity aligned with self-reported token-level explanations for most models, indicating that verbal justifications are not merely decorative but are aligned with human intuition. Only Phi-3 Small and Mixtral achieved performance metrics within the 95% confidence intervals across all shot counts, underscoring the need for larger evaluation sets before deployment. The three research questions the paper started with have accordingly been answered.
LLMs confer two practical advantages over static screening pipelines: (i) rapid transfer to novel job verticals without retraining, and (ii) generation of human-interpretable explanations that facilitate compliance with emerging AI-transparency regulations. This work offers a systematic analysis of using LLMs for hiring. As a limitation, however, the dataset for this study contains limited samples from a single company, restricting industry and job function diversity and, consequently, the generalizability of the findings.
Further limitations remain. Foundation models are only meant to complement human efforts in recruiting; even if the models help pick 50% of the candidates correctly, they can save substantial human effort. The publicly available dataset is modest in size and scope, omitting multimodal evidence such as occupational portfolios or professional-network profiles. Future work should examine hybrid retrieval-augmented architectures that ground predictions in richer candidate graphs and enforce fairness constraints through counterfactual evaluation.
Overall, the present results establish a transparent benchmark and demonstrate that commodity LLMs already deliver competitive shortlist quality, laying the groundwork for responsible, auditable AI-assisted hiring workflows.

Author Contributions

Conceptualization, V.S.P.; methodology, V.S.P.; software, N.B.T.; validation, N.B.T., R.A., and V.S.P.; formal analysis, N.B.T., R.A., and V.S.P.; investigation, N.B.T., R.A., and V.S.P.; resources, N.B.T. and R.A.; data curation, R.A.; writing—original draft preparation, N.B.T., R.A., and V.S.P.; writing—review and editing, N.B.T., R.A., and V.S.P.; visualization, N.B.T., R.A., and V.S.P.; supervision, V.S.P.; project administration, V.S.P.; funding acquisition, V.S.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in the study are openly available in Kaggle at https://www.kaggle.com/datasets/abhishekgautam12/piramal-ml-hackathon-resume-dataset (accessed on 18 July 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Singhal, A. Modern Information Retrieval: A Brief Overview. IEEE Data Eng. Bull. 2001, 24, 35–43. [Google Scholar]
  2. Singh, A.; Catherine, R.; Ramanan, K.V.; Chenthamarakshan, V.; Kambhatla, N. PROSPECT: A system for screening candidates for recruitment. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, Toronto, ON, Canada, 26–30 October 2010. [Google Scholar]
  3. Ghadekar, P.; Kabra, A.; Gangwal, K.; Kinage, A.; Agarwal, K.; Chaudhari, K. A Semantic Approach for Automated Hiring using Artificial Intelligence & Computer Vision. In Proceedings of the 2023 IEEE 8th International Conference for Convergence in Technology (I2CT), Lonavla, India, 7–9 April 2023; pp. 1–7. [Google Scholar] [CrossRef]
  4. Bayer, M.; Kaufhold, M.A.; Reuter, C. A Survey on Data Augmentation for Text Classification. ACM Comput. Surv. 2022, 55, 1–39. [Google Scholar] [CrossRef]
  5. Bommasani, R.; Hudson, D.A.; Adeli, E.; Altman, R.; Arora, S.; von Arx, S.; Bernstein, M.S.; Bohg, J.; Bosselut, A.; Brunskill, E.; et al. On the Opportunities and Risks of Foundation Models. arXiv 2021, arXiv:2108.07258. [Google Scholar]
  6. Fan, L.; Li, L.; Ma, Z.; Lee, S.; Yu, H.; Hemphill, L. A Bibliometric Review of Large Language Models Research from 2017 to 2023. ACM Trans. Intell. Syst. Technol. 2024, 15, 1–25. [Google Scholar] [CrossRef]
  7. Savariapitchai, M. Decision Strategies for Intelligent Recruitment. In Decision Strategies and Artificial Intelligence: Navigating the Business Landscape; San International Scientific Publications: Kanyakumari District, Tamil Nadu, India, 2023; ISBN 978-81-963849-1-3. [Google Scholar] [CrossRef]
  8. Sridevi, G.; Suganthi, S.K. AI based suitability measurement and prediction between job description and job seeker profiles. Int. J. Inf. Manag. Data Insights 2022, 2, 100109. [Google Scholar] [CrossRef]
  9. Pendyala, V.; Atrey, N.; Aggarwal, T.; Goyal, S. Artificial Intelligence Enabled, Social Media Leveraging Job Matching System for Employers and Applicants. In Proceedings of the 2022 International Conference on Recent Trends in Microelectronics, Automation, Computing and Communications Systems (ICMACC), Hyderabad, India, 28–30 December 2022; pp. 422–429. [Google Scholar]
  10. Kumar, N.; Gupta, M.; Sharma, D.; Ofori, I. Technical Job Recommendation System Using APIs and Web Crawling. Comput. Intell. Neurosci. 2022, 2022, 7797548. [Google Scholar] [CrossRef]
  11. Son, S.C.; Oh, J.Y. A Study on the Performances of AI Recruitment System: A Case Study on Domestic and Abroad Companies. Korean Career Entrep. Bus. Assoc. 2023, 7, 137–155. [Google Scholar] [CrossRef]
  12. Pendyala, V.S.; Atrey, N.; Aggarwal, T.; Goyal, S. Enhanced Algorithmic Job Matching based on a Comprehensive Candidate Profile using NLP and Machine Learning. In Proceedings of the 2022 IEEE Eighth International Conference on Big Data Computing Service and Applications (BigDataService), Newark, CA, USA, 15–18 August 2022; pp. 183–184. [Google Scholar] [CrossRef]
  13. Albassam, W. The Power of Artificial Intelligence in Recruitment: An Analytical Review of Current AI-Based Recruitment Strategies. Int. J. Prof. Bus. Rev. 2023, 8, e02089. [Google Scholar] [CrossRef]
  14. Aishwarya, G.A.G.; Su, H.K.; Kuo, W.K. Efficient Hiring Analysis and Management Using Artificial Intelligence and Blockchain. In Proceedings of the 2022 IEEE 4th Eurasia Conference on IOT, Communication and Engineering (ECICE), Yunlin, Taiwan, 28–30 October 2022; pp. 283–285. [Google Scholar] [CrossRef]
  15. Du, Y.; Luo, D.; Yan, R.; Wang, X.; Liu, H.; Zhu, H.; Song, Y.; Zhang, J. Enhancing job recommendation through llm-based generative adversarial networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 8363–8371. [Google Scholar]
  16. Xu, W.; Huang, Z.; Hu, W.; Fang, X.; Cherukuri, R.; Nayyar, N.; Malandri, L.; Sengamedu, S. HR-MultiWOZ: A Task Oriented Dialogue (TOD) Dataset for HR LLM Agent. In Proceedings of the First Workshop on Natural Language Processing for Human Resources (NLP4HR 2024), St. Julian’s, Malta, 22 March 2024; pp. 59–72. [Google Scholar]
  17. Veldanda, A.K.; Grob, F.; Thakur, S.; Pearce, H.; Tan, B.; Karri, R.; Garg, S. Investigating Hiring Bias in Large Language Models. In Proceedings of the R0-FoMo:Robustness of Few-Shot and Zero-Shot Learning in Large Foundation Models, New Orleans, LA, USA, 12 December 2023. [Google Scholar]
  18. Raveendra, P.; Satish, Y.; Singh, P. Changing Landscape of Recruitment Industry: A Study on the Impact of Artificial Intelligence on Eliminating Hiring Bias from Recruitment and Selection Process. J. Comput. Theor. Nanosci. 2020, 17, 4404–4407. [Google Scholar] [CrossRef]
  19. Team, G.; Mesnard, T.; Hardin, C.; Dadashi, R.; Bhupatiraju, S.; Pathak, S.; Sifre, L.; Rivière, M.; Kale, M.S.; Love, J.; et al. Gemma: Open Models Based on Gemini Research and Technology. arXiv 2024, arXiv:2403.08295. [Google Scholar]
  20. Abdin, M.; Jacobs, S.A.; Awan, A.A.; Aneja, J.; Awadallah, A.; Awadalla, H.; Bach, N.; Bahree, A.; Bakhtiari, A.; Bao, J.; et al. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. arXiv 2024, arXiv:2404.14219. [Google Scholar]
  21. Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv 2023, arXiv:2307.09288. [Google Scholar]
  22. Jiang, A.Q.; Sablayrolles, A.; Mensch, A.; Bamford, C.; Chaplot, D.S.; de las Casas, D.; Bressand, F.; Lengyel, G.; Lample, G.; Saulnier, L.; et al. Mistral 7B. arXiv 2023, arXiv:2310.06825. [Google Scholar]
  23. Tunstall, L.; Beeching, E.; Lambert, N.; Rajani, N.; Rasul, K.; Belkada, Y.; Huang, S.; von Werra, L.; Fourrier, C.; Habib, N.; et al. Zephyr: Direct Distillation of LM Alignment. arXiv 2023, arXiv:2310.16944. [Google Scholar]
  24. Meta. Meta-Llama-3-8B. 2024. Available online: https://ai.meta.com/blog/meta-llama-3/ (accessed on 18 July 2024).
  25. Jiang, A.Q.; Sablayrolles, A.; Roux, A.; Mensch, A.; Savary, B.; Bamford, C.; Chaplot, D.S.; de las Casas, D.; Hanna, E.B.; Bressand, F.; et al. Mixtral of Experts. arXiv 2024, arXiv:2401.04088. [Google Scholar]
  26. Gautam, A. Piramal ML Hackathon Resume Dataset. Kaggle. 2023. Available online: https://www.kaggle.com/datasets/abhishekgautam12/piramal-ml-hackathon-resume-dataset (accessed on 16 September 2024).
  27. Kojima, T.; Gu, S.S.; Reid, M.; Matsuo, Y.; Iwasawa, Y. Large language models are zero-shot reasoners. Adv. Neural Inf. Process. Syst. 2022, 35, 22199–22213. [Google Scholar]
  28. Gong, Z.; Liu, J.; Wang, J.; Cai, X.; Zhao, D.; Yan, R. What makes quantization for large language model hard? An empirical study from the lens of perturbation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 26–27 February 2024; Volume 38, pp. 18082–18089. [Google Scholar]
  29. Pallets. Jinja2: A Modern and Designer-Friendly Templating Engine for Python. 2007. Available online: https://jinja.palletsprojects.com/ (accessed on 27 July 2024).
  30. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. arXiv 2020, arXiv:2005.14165. [Google Scholar]
  31. Kokhlikyan, N.; Miglani, V.; Martin, M.; Wang, E.; Alsallakh, B.; Reynolds, J.; Melnikov, A.; Kliushkina, N.; Araya, C.; Yan, S.; et al. Captum: A unified and generic model interpretability library for PyTorch. arXiv 2020, arXiv:2009.07896. [Google Scholar]
  32. Chen, Z. Ethics and discrimination in artificial intelligence-enabled recruitment practices. Humanit. Soc. Sci. Commun. 2023, 10, 1–12. [Google Scholar] [CrossRef]
  33. Sýkorová, Z.; Hague, D.S.; Dvouletý, O.; Procházka, D.A. Incorporating artificial intelligence (AI) into recruitment processes: Ethical considerations. Vilakshan 2024, 21, 293–307. [Google Scholar] [CrossRef]
  34. Zeiler, M.D.; Fergus, R. Visualizing and understanding convolutional networks. In Proceedings of the Computer Vision—ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part I 13. Springer: Berlin, Germany, 2014; pp. 818–833. [Google Scholar]
  35. Ribeiro, M.T.; Singh, S.; Guestrin, C. “Why Should I Trust You?”: Explaining the Predictions of Any Classifier. arXiv 2016, arXiv:1602.04938. [Google Scholar]
  36. Lundberg, S.; Lee, S.I. A Unified Approach to Interpreting Model Predictions. arXiv 2017, arXiv:1705.07874. [Google Scholar]
  37. Sundararajan, M.; Taly, A.; Yan, Q. Axiomatic Attribution for Deep Networks. arXiv 2017, arXiv:1703.01365. [Google Scholar]
  38. Pendyala, V.S.; Hall, C.E. Explaining Misinformation Detection Using Large Language Models. Electronics 2024, 13, 1673. [Google Scholar] [CrossRef]
  39. Pendyala, V.; Kim, H. Assessing the Reliability of Machine Learning Models Applied to the Mental Health Domain Using Explainable AI. Electronics 2024, 13, 1025. [Google Scholar] [CrossRef]
Figure 1. The variations in the AUC for different models and shot scenarios.
Figure 2. Interpretability results for the models.
Figure 3. Qualitative reasoning trends.
Table 1. Overview of different AI models.
| Model Name | Publisher | Total Parameters | Active Parameters | Platform |
|---|---|---|---|---|
| Gemma-2b [19] | Google DeepMind | 2.5 B | 2.5 B | Hugging Face Serverless |
| Phi-3 Mini [20] | Microsoft | 3.8 B | 3.8 B | Ollama |
| Llama 2 [21] | Meta | 6.7 B | 6.7 B | Ollama |
| Mistral-7b [22] | Mistral AI | 7.2 B | 7.2 B | Hugging Face Serverless |
| Zephyr [23] | Hugging Face | 7.2 B | 7.2 B | Ollama |
| Phi-3 Small [20] | Microsoft | 7.4 B | 7.4 B | Google Colaboratory |
| Llama 3 [24] | Meta | 8.0 B | 8.0 B | Hugging Face Serverless |
| Gemma-7b [19] | Google DeepMind | 8.5 B | 8.5 B | Hugging Face Serverless |
| Mixtral [25] | Mistral AI | 47.0 B | 13.0 B | Hugging Face Serverless |
Table 2. Responses to questions mentioned on the application.
List of questions for user prompt template
• Average incentive [per month] earned in your previous company? {}
• Have you completed your graduation? {}
• Highest educational qualification? {}
• How did you come to know about the role at Piramal Finance? {}
• How many organizations did you work in before joining Piramal Finance? {}
• How many are earning family members? [Other than yourself]? {}
• How many members are dependent on you? {}
• How many members are there in your family? {}
• Name of your previous organization/company? {}
• Previous industry worked with [before joining Piramal]? {}
• Total no of years of experience [before joining Piramal]? {}
• What was the average ticket size you handled in your previous role? {}
• Which products were you selling in your previous role? {}
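The questions in Table 2 serve as a user prompt template, with each applicant's recorded answer substituted into the corresponding `{}` placeholder. A minimal sketch of that assembly (the question subset and helper names here are illustrative, not the authors' exact template code):

```python
# Hypothetical subset of the Table 2 questions; "{}" marks where each
# applicant's recorded answer is substituted.
QUESTIONS = [
    "Highest educational qualification? {}",
    "Total no of years of experience [before joining Piramal]? {}",
    "Which products were you selling in your previous role? {}",
]

def build_user_prompt(answers):
    """Fill the placeholders with one applicant's answers, in order."""
    lines = [q.format(a) for q, a in zip(QUESTIONS, answers)]
    return "\n".join("- " + line for line in lines)

# Example applicant record (values are invented for illustration).
prompt = build_user_prompt(["Graduate", "4", "Personal loans"])
```

The same filled-in block can then be passed to each foundation model as the user message, optionally preceded by labeled examples in the few-shot settings.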
Table 3. Zero-shot learning results sorted on AUC.
| Model | Accuracy | Precision | Recall | F1 | AUC | AUPRC |
|---|---|---|---|---|---|---|
| Baseline | 0.5852 | 0.4760 | 0.4101 | 0.4390 | 0.5758 | 0.4915 |
| Llama 3 | 0.4698 | 0.4077 | 0.7492 | 0.5281 | 0.5179 | 0.6281 |
| Phi-3 Mini | 0.5557 | 0.4231 | 0.3356 | 0.3743 | 0.5178 | 0.5109 |
| Mistral-7b | 0.4161 | 0.4008 | 0.9593 | 0.5654 | 0.5097 | 0.6881 |
| Zephyr | 0.6027 | 0.4848 | 0.0542 | 0.0976 | 0.5082 | 0.4568 |
| Mixtral | 0.4336 | 0.3977 | 0.8373 | 0.5393 | 0.5031 | 0.6497 |
| Phi-3 Small | 0.4188 | 0.3973 | 0.9051 | 0.5522 | 0.5025 | 0.6700 |
| Gemma-7b | 0.3973 | 0.3962 | 0.9966 | 0.5667 | 0.5005 | 0.6971 |
| Gemma-2b | 0.3960 | 0.3960 | 1.0000 | 0.5673 | 0.5000 | 0.6980 |
| Llama 2 | 0.4161 | 0.3933 | 0.8746 | 0.5426 | 0.4951 | 0.6588 |
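The threshold-based metrics in Tables 3–6 follow directly from the binary confusion counts, with "hire" as the positive class. A pure-Python sketch of how accuracy, precision, recall, and F1 relate (AUC and AUPRC additionally require ranking scores and are omitted here):

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 from binary labels.

    Positive class = 1 ("hire"). Precision/recall default to 0.0 when
    their denominator is empty.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

This makes some patterns in the tables easy to read: a model that predicts "hire" for every applicant (as Gemma-2b effectively does) scores recall 1.0 but precision equal to the positive base rate (about 0.396), while a model that almost never predicts "hire" (Zephyr in the zero-shot setting) can post a higher accuracy despite near-zero recall.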
Table 4. One-shot learning results sorted on AUC.
| Model | Accuracy | Precision | Recall | F1 | AUC | AUPRC |
|---|---|---|---|---|---|---|
| Baseline | 0.5852 | 0.4760 | 0.4101 | 0.4390 | 0.5758 | 0.4915 |
| Gemma-7b | 0.5128 | 0.4295 | 0.7017 | 0.5328 | 0.5453 | 0.6246 |
| Llama 2 | 0.4738 | 0.4139 | 0.7898 | 0.5431 | 0.5282 | 0.6435 |
| Zephyr | 0.5007 | 0.4150 | 0.6373 | 0.5027 | 0.5242 | 0.5980 |
| Mixtral | 0.4617 | 0.4080 | 0.7966 | 0.5396 | 0.5194 | 0.6426 |
| Mistral-7b | 0.4295 | 0.4038 | 0.9254 | 0.5623 | 0.5149 | 0.6794 |
| Phi-3 Mini | 0.4617 | 0.4047 | 0.7627 | 0.5288 | 0.5136 | 0.6307 |
| Phi-3 Small | 0.4349 | 0.4028 | 0.8847 | 0.5536 | 0.5124 | 0.6666 |
| Llama 3 | 0.4456 | 0.3969 | 0.7695 | 0.5236 | 0.5014 | 0.6288 |
| Gemma-2b | 0.3960 | 0.3960 | 1.0000 | 0.5673 | 0.5000 | 0.6980 |
Table 5. Three-shot results sorted on AUC.
| Model | Accuracy | Precision | Recall | F1 | AUC | AUPRC |
|---|---|---|---|---|---|---|
| Baseline | 0.5852 | 0.4760 | 0.4101 | 0.4390 | 0.5758 | 0.4915 |
| Llama 2 | 0.5315 | 0.4392 | 0.6610 | 0.5277 | 0.5538 | 0.6172 |
| Zephyr | 0.5490 | 0.4441 | 0.5525 | 0.4924 | 0.5496 | 0.5869 |
| Mistral-7b | 0.5060 | 0.4294 | 0.7525 | 0.5468 | 0.5485 | 0.6400 |
| Gemma-7b | 0.5235 | 0.4261 | 0.5864 | 0.4936 | 0.5343 | 0.5882 |
| Mixtral | 0.4765 | 0.4150 | 0.7864 | 0.5433 | 0.5299 | 0.6430 |
| Phi-3 Small | 0.4832 | 0.4144 | 0.7390 | 0.5311 | 0.5273 | 0.6284 |
| Phi-3 Mini | 0.4376 | 0.4034 | 0.8780 | 0.5528 | 0.5134 | 0.6649 |
| Gemma-2b | 0.3960 | 0.3960 | 1.0000 | 0.5673 | 0.5000 | 0.6980 |
| Llama 3 | 0.4617 | 0.3905 | 0.6407 | 0.4852 | 0.4926 | 0.5867 |
Table 6. Five-shot results sorted on AUC.
| Model | Accuracy | Precision | Recall | F1 | AUC | AUPRC |
|---|---|---|---|---|---|---|
| Baseline | 0.5852 | 0.4760 | 0.4101 | 0.4390 | 0.5758 | 0.4915 |
| Mistral-7b | 0.5262 | 0.4378 | 0.6915 | 0.5361 | 0.5547 | 0.6257 |
| Llama 3 | 0.5181 | 0.4295 | 0.6610 | 0.5207 | 0.5427 | 0.6124 |
| Zephyr | 0.5409 | 0.4334 | 0.5186 | 0.4722 | 0.5371 | 0.5713 |
| Mixtral | 0.4725 | 0.4131 | 0.7898 | 0.5425 | 0.5271 | 0.6431 |
| Phi-3 Small | 0.4953 | 0.4147 | 0.6678 | 0.5117 | 0.5250 | 0.6070 |
| Llama 2 | 0.5047 | 0.4131 | 0.5966 | 0.4882 | 0.5205 | 0.5847 |
| Gemma-7b | 0.5047 | 0.4031 | 0.5220 | 0.4549 | 0.5077 | 0.5572 |
| Gemma-2b | 0.3960 | 0.3960 | 1.0000 | 0.5673 | 0.5000 | 0.6980 |
| Phi-3 Mini | 0.4134 | 0.3901 | 0.8542 | 0.5356 | 0.4893 | 0.6510 |
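For the few-shot settings reported in Tables 4–6, the examples prepended to the prompt are chosen by a nearest-neighbor search over applicants similar to the one being scored. A minimal sketch of that retrieval step (the feature encoding, distance metric, and function names here are illustrative assumptions, not the authors' pipeline):

```python
import math

def k_nearest_examples(query, candidates, k=3):
    """Return the k labeled candidates closest to the query applicant.

    `query` is a numeric feature vector derived from the application
    answers; `candidates` is a list of (features, hired_label) pairs.
    Euclidean distance is assumed for illustration.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    ranked = sorted(candidates, key=lambda c: dist(query, c[0]))
    return ranked[:k]
```

The k retrieved (features, label) pairs would then be formatted as worked examples ahead of the query applicant's answers, giving the 1-, 3-, and 5-shot prompts whose results the tables compare.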
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Pendyala, V.S.; Thakur, N.B.; Agarwal, R. Explainable Use of Foundation Models for Job Hiring. Electronics 2025, 14, 2787. https://doi.org/10.3390/electronics14142787
