Article

Comparative Evaluation of Sequential Neural Network (GRU, LSTM, Transformer) Within Siamese Networks for Enhanced Job–Candidate Matching in Applied Recruitment Systems

1 Department of Artificial Intelligence, Institute of Information Technology, Warsaw University of Life Sciences, ul. Nowoursynowska 159, 02-776 Warsaw, Poland
2 Avenga IT Professionals sp. z o.o., ul. Gwiaździsta 66, 53-413 Wroclaw, Poland
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(11), 5988; https://doi.org/10.3390/app15115988
Submission received: 10 April 2025 / Revised: 20 May 2025 / Accepted: 23 May 2025 / Published: 26 May 2025
(This article belongs to the Special Issue Innovations in Artificial Neural Network Applications)

Abstract

Job–candidate matching is pivotal in recruitment, yet traditional manual or keyword-based methods can be laborious and prone to missing qualified candidates. In this study, we introduce the first Siamese framework that systematically contrasts GRU, LSTM, and Transformer sequential heads on top of a multilingual Sentence Transformer backbone, which is trained end-to-end with triplet loss on real-world recruitment data. This combination captures both long-range dependencies across document segments and global semantics, representing a substantial advance over approaches that rely solely on static embeddings. We compare the three heads using ranking metrics such as Top-K accuracy and Mean Reciprocal Rank (MRR). The Transformer-based model yields the best overall performance, with an MRR of 0.979 and a Top-100 accuracy of 87.20% on the test set. Visualization of learned embeddings (t-SNE) shows that self-attention more effectively clusters matching texts and separates them from irrelevant ones. These findings underscore the potential of combining multilingual base embeddings with specialized sequential layers to reduce manual screening efforts and improve recruitment efficiency.

1. Introduction

The rapid transformation of the job market, fueled by technological evolution and globalization, has posed new challenges to recruiters and job seekers alike [1]. Traditional recruitment processes, often reliant on manual keyword searches or rule-based matching systems, struggle to cope with the increasing volume and complexity of job applications [2,3]. In response, artificial intelligence (AI)-driven systems are being integrated into recruitment pipelines to improve the precision and scalability of candidate matching [4,5,6].
A promising approach in this domain involves the use of deep-learning models for semantic matching between job descriptions and candidate résumés. Among such architectures, Siamese Neural Networks (SNNs) have demonstrated significant potential by learning a shared embedding space in which semantically similar job–candidate pairs are placed closer together [7,8,9,10,11,12]. Although previous studies—including our earlier work on Zero-Shot Recommendation models [13]—have shown that pretrained sentence embeddings (e.g., Sentence-BERT or MiniLM) can yield impressive results in job–candidate alignment, such models often treat each document as a static unit and may not fully capture sequential or hierarchical structures commonly found in résumés and job postings.
To address this limitation, this study explores the integration of sequential neural architectures—namely, Gated Recurrent Units (GRU), Long Short-Term Memory (LSTM) networks, and Transformers—within a Siamese network framework. These sequential “head” models are applied on top of pretrained sentence embeddings to capture the temporal or structural dependencies present in résumés and job descriptions. By doing so, we aim to leverage both contextual richness from pretrained models and the sequence modeling capabilities of GRU, LSTM, and Transformer architectures.
The central goal of this paper is to evaluate and compare the effectiveness of these three sequential architectures as head modules within a Siamese network trained with a triplet loss function. We focus on real-world IT recruitment data, consisting of multilingual résumés and job descriptions, and assess model performance using ranking metrics such as Top-K Accuracy and Mean Reciprocal Rank (MRR). Additionally, we provide qualitative analysis through t-SNE visualizations of the learned embedding spaces.
Our contributions can be summarized as follows:
  • We introduce a modular Siamese framework that combines multilingual pretrained sentence embeddings with GRU, LSTM, and Transformer-based sequential heads for enhanced job–candidate matching.
  • We conduct a systematic empirical comparison of these architectures using a real-world dataset derived from a commercial IT recruitment pipeline.
  • We evaluate the models using both quantitative (ranking metrics) and qualitative (embedding space visualization) methods, demonstrating that the Transformer head yields the best overall performance.
  • We analyze the practical implications of architectural choices in terms of matching precision, training dynamics, and computational tradeoffs.
This study builds on our previous zero-shot recommendation model [13], expanding it through the addition of supervised training and advanced neural modeling strategies. The findings contribute to the development of more intelligent and scalable AI systems for modern recruitment workflows.

1.1. Background and Motivation

Recruiters today face an unprecedented flood of applications. The Future of Jobs 2023 report estimates that the largest job platforms process ~100 million submissions every month, a ten-fold increase since 2016 [1]. Traditional recruitment processes, often reliant on manual keyword searches or rule-based matching systems, struggle to cope with the increasing volume and complexity of job applications [2,14]. Empirical audits further reveal that purely keyword-based systems discard up to 15% of qualified résumés and may amplify demographic bias [15]. In response, AI-driven systems are being integrated into recruitment pipelines to enhance the precision and scalability of candidate matching.
Siamese Neural Networks (SNNs) have gained attention for their ability to learn embeddings that capture similarity between pairs of inputs by minimizing the distance between semantically related pairs while maximizing the distance between unrelated ones. This makes them particularly suitable for tasks like job–candidate matching, where the goal is to project both job descriptions and résumés into a shared embedding space where proximity reflects semantic relevance [16,17].
While pretrained sentence-embedding models and multilingual Transformers produce strong sentence-level embeddings [18], they may overlook contextual dependencies across document sections because they treat the input as independent segments. For example, the chronology of work experience, the interplay between skills and roles, and the presence of redundant qualifications are often distributed across multiple segments of a résumé. Similarly, job descriptions may introduce role responsibilities and requirements across paragraphs or bullet points, implying temporal or logical structures that static embeddings may not fully capture.
To address this limitation, we use sequential neural architectures as “head” modules added on top of static sentence embeddings. The goal of this approach is to model dependencies across text segments that current state-of-the-art solutions do not capture, resulting in more accurate candidate matching.
The core motivation of this work is two-fold. First, we aim to augment existing sentence embedding techniques with sequential modeling to better reflect the document-level structure inherent in résumés and job descriptions. Second, we intend to conduct a systematic empirical comparison between GRU, LSTM, and Transformer head architectures to understand which model offers the best tradeoff between performance and complexity in a real-world recruitment setting. This analysis is critical for practitioners seeking to deploy intelligent candidate-matching systems that are both accurate and efficient.

1.2. Related Work

Siamese Neural Networks (SNNs) have gained attention for their ability to learn embeddings that capture similarity between pairs of inputs by minimizing the distance between semantically related pairs while maximizing the distance between unrelated ones [19]. While pretrained models such as Sentence-BERT or multilingual Transformers like DistilUSE provide powerful sentence-level embeddings [18], they treat input as independent segments and may overlook contextual dependencies across sections of a document. Recent dense retrieval frameworks illustrate the trend. Maheshwary applied a bi-GRU Siamese network [20]; Rezaeipourfarsangi [11,21,22] adopted a Transformer encoder; and the industrial ConFit v2 model reported a 14% recall lift over BM25 by combining self-attention with hard negative mining [23]. To the best of the authors’ knowledge, no study contrasts multiple sequential heads under identical conditions on the same multilingual dataset.
To address this limitation, we explore the addition of sequential neural architectures—namely, Gated Recurrent Units (GRUs) [24], Long Short-Term Memory (LSTM) networks [25] and Transformers [26]—as head modules applied on top of static sentence embeddings. These sequential heads are expected to model dependencies across text segments, enriching the initial embedding with contextual or temporal patterns essential for accurate matching.
Recent advancements in deep learning for job–candidate matching have demonstrated the value of integrating sequential architectures within SNN frameworks. Such configurations enable the effective comparison of job descriptions and résumés by learning to project semantically similar pairs closer in a shared embedding space. While prior works have utilized CNNs and RNNs for this purpose [11,20,27,28,29], the inclusion of memory-based architectures like GRU and LSTM enhances the ability to capture temporal and hierarchical patterns across document segments.
GRU and LSTM, as recurrent neural networks, offer mechanisms to retain contextual information over sequences [24,25]. GRUs, with fewer parameters, provide computational efficiency without significantly compromising accuracy [30,31], while LSTMs, with more elaborate gating structures, allow for more refined long-term dependency modeling. These properties are particularly relevant in recruitment scenarios, where the sequence of experiences or educational history plays a pivotal role. Attention-based extensions, as shown in [32,33], can further enhance sequential models by highlighting the most relevant portions of the input text.
Goel et al. [34] explored LSTM architectures in the context of brain signal processing and neuromorphic chips, demonstrating their capability to model complex temporal dynamics. In the recruitment domain, such dynamics include career trajectories and project involvements. Zhao et al. [35] combined RNNs with a Siamese Region Proposal Network to improve robustness in visual tracking tasks, paralleling the need in recruitment systems to handle varying candidate profiles and job formats.
Transformers, introduced by Vaswani et al. [26], have revolutionized NLP with their self-attention mechanisms, enabling models to consider global dependencies in parallel. Their suitability for recruitment was indirectly affirmed by works like Yang et al. [36], where Transformer-based low-dimensional embeddings facilitated encrypted video stream analysis, which is analogous to extracting structured patterns from résumés. Moreover, the use of self-attention in mechanical prognostics, such as in the self-adaptive graph convolutional network by Wei et al. [37], highlights how attention layers can dynamically prioritize informative segments—a desirable trait in comparing qualifications and job requirements.
The time warping capabilities of recurrent architectures were analyzed by Tallec and Ollivier [38], supporting their adaptability to various time scales—a valuable property for modeling diverse career timelines. Additionally, Tang et al. [39] introduced visualization techniques for understanding memory retention in GRUs and LSTMs, which can aid in interpretability—an emerging concern in explainable AI for HR.
Finally, recent works have explored hybrid models, including GRU-based Siamese networks for behavioral authentication [40], reinforcing the relevance of recurrent Siamese configurations in pairwise comparison tasks. These architectures have also been tested in low-resource NLP applications, such as Amazigh part-of-speech tagging using GRUs [30].
In summary, integrating GRU, LSTM, and Transformer heads within Siamese architectures offers distinct advantages for job–candidate matching. GRUs offer simplicity and speed, LSTMs provide nuanced long-term modeling, and Transformers enable deep, parallel contextual understanding. Our work systematically compares these approaches, drawing on prior research and its applications across diverse domains, in order to assess their effectiveness in recruitment-related tasks.

1.3. Problem Statement and Objectives

The primary research problem addressed in this paper is the following: How do different sequential neural network architectures (GRU, LSTM, and Transformer), when inserted as head modules within the same Siamese retrieval framework, compare in their ability to match multilingual job descriptions with candidate résumés drawn from a production-scale IT pipeline?
The specific objectives are the following:
  • To fill the gap left by past studies that evaluated only a single sequential architecture, by comparing GRU, LSTM, and Transformer heads under identical conditions.
  • To train these models using a triplet loss function on a real-world dataset of job descriptions and candidate résumés from the IT sector.
  • To evaluate and compare the performance of the GRU, LSTM, and Transformer head models using ranking metrics (Top-K Accuracy and MRR) and qualitative analysis (t-SNE visualization).
  • To analyze the results highlighting the relative strengths and weaknesses of each sequential model for this specific task.

1.4. Contributions

The main contributions of this paper are the following:
  • The first head-to-head evaluation of GRU, LSTM, and Transformer sequential heads inside an end-to-end Siamese network trained with triplet loss on 14.8 k real hiring events in English and Polish.
  • A modular framework that combines multilingual pretrained sentence embeddings with sequential heads for job–candidate matching.
  • An evaluation performed on a dataset reflecting real-world IT recruitment scenarios.
  • Quantitative results obtained using ranking metrics (Top-K Accuracy and MRR) and analysis of training dynamics (loss and accuracy curves).
  • A qualitative analysis of the learned embedding spaces using t-SNE visualizations for each model.

1.5. Paper Structure

The remainder of this paper is organized as follows. In Section 2, we describe the dataset used in our experiments, outline the data preprocessing pipeline, and present the proposed model architecture, including the base embedding layer and the sequential head configurations (GRU, LSTM, and Transformer). We also detail the training setup, evaluation metrics, and hyperparameter choices.
Section 3 presents the results of our empirical evaluation, including both quantitative metrics and qualitative assessments. This section also includes a detailed discussion of training dynamics and model behavior across the GRU, LSTM, and Transformer variants.
In Section 3.6, we analyze the observed performance trends, compare our findings to the existing literature, and discuss the strengths and limitations of each sequential architecture within the proposed Siamese framework. We also reflect on the practical implications for deployment in real-world recruitment systems.
Section 4 concludes the paper by summarizing the main findings and outlining potential avenues for future research, such as incorporating structured metadata, improving model explainability, and extending the approach to other domains or languages.

2. Materials and Methods

2.1. Dataset Description

2.1.1. Data Source and Collection

Our study uses a real-world dataset curated from an internal IT recruitment pipeline. This pipeline manages the screening and evaluation of multiple candidates applying for various job positions, primarily in the software development and related IT domains. The data collection process involved extracting anonymized job–candidate pair records from the recruitment system. Each record corresponds to a unique pairing of the following:
(a) A specific job posting (identified by job_id), along with its associated metadata such as job title and a detailed job description.
(b) A candidate application (identified by candidate_id), which includes textual résumé data and recruitment progression details.
The purpose of gathering these data was to facilitate the development of an automated matching system based on text similarity and advanced embedding models. The dataset was divided into two main parts: a combined training and validation set (denoted as train + valid) and a separate test set reserved for the final evaluation of the model. Table 1 summarizes the common data structure (column layout) shared by both subsets.
The columns provide both descriptive textual information (job descriptions, résumés) and structured fields that capture the stages and outcomes of the recruitment. Notably, the label field indicates how far or successfully a candidate progressed in the hiring process, serving as our main ground truth signal for downstream modeling.

2.1.2. Data Statistics

Table 2 and Table 3 present an overview of the size, distribution of labels, and language breakdown of the dataset. These statistics apply to both the combined train + valid set and the test set.
As shown in Table 2, the train + valid partition comprises 14,827 job–candidate pairs, whereas the test set has 1000 pairs. In both subsets, the soft_positive category is the largest (about 44–45%), with soft_negative being the second largest and positive representing roughly 18% of the entries. This distribution reflects the real-world recruitment funnel: many candidates show partial alignment but do not get hired, a notable subset is quickly rejected, and a smaller group eventually receives an offer.
Table 3 highlights that most job descriptions are in English (over 85%), with Polish being the second most common language. The bilingual nature of the dataset influenced our choice of a multilingual embedding model.

2.1.3. Data Annotation

Each row in our dataset was assigned a label reflecting how well the candidate matched the position and how far they progressed:
  • soft_positive: The candidate advanced through some recruitment stages but was not ultimately hired.
  • positive: The candidate received and accepted a job offer.
  • soft_negative: The candidate was rejected early in the process or chose to withdraw without advancing significantly.
The positive vs. negative (soft_negative) outcome is straightforward; however, soft_positive captures intermediate cases where some alignment was evident but did not lead to a hire. These annotations were determined by recruiters during actual hiring processes and served as ground truth signals for training and evaluating our matching models.

2.2. Data Preprocessing

The raw data obtained from the recruitment pipeline underwent several preprocessing steps to prepare them for model training and evaluation. These steps ensured data quality, consistency, and appropriate formatting for the embedding and sequential models.

2.2.1. Text Cleaning

Initial text cleaning focused on removing irrelevant artifacts and standardizing the text format. Both job descriptions and candidate résumés, originally containing HTML markup from the source system, were processed to extract plain text content. This was achieved using the BeautifulSoup library to parse the HTML and retrieve the text nodes [13].
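For illustration, a minimal sketch of this cleaning step is shown below; it assumes the raw fields hold HTML strings, and the function name is ours rather than taken from the original pipeline.

```python
from bs4 import BeautifulSoup


def strip_html(raw_html: str) -> str:
    """Parse an HTML fragment and return only its visible text content."""
    if not isinstance(raw_html, str):
        return ""
    soup = BeautifulSoup(raw_html, "html.parser")
    # A separator keeps boundaries between sections readable after stripping tags.
    return soup.get_text(separator=" ", strip=True)


# Example usage on hypothetical DataFrame columns:
# df["job_description_no_html"] = df["job_description"].apply(strip_html)
# df["candidate_resume_no_html"] = df["candidate_resume"].apply(strip_html)
```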

2.2.2. Anonymization

To protect candidate privacy, an anonymization step was applied. This step aimed to remove personally identifiable information, specifically the candidate’s name and surname, from the résumé text. The process involved the following steps (a minimal code sketch follows the list):
  • Retrieving the candidate’s full name (e.g., from the candidate_name column).
  • Splitting the full name into components. It is typically assumed that the first part is the first name and the last part is the surname.
  • Using regular expressions to find and remove occurrences of the identified first name and surname within the corresponding candidate’s résumé text (stored in candidate_resume_no_html) [16]. This removal is case-insensitive.
  • If the name cannot be reliably split into at least two parts, or if the anonymization function encounters an error, the original résumé text is retained.
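The sketch below illustrates this anonymization logic under the assumptions listed above; the function name and exact behavior are illustrative, not the authors’ code.

```python
import re


def anonymize_resume(full_name: str, resume_text: str) -> str:
    """Remove the candidate's first name and surname from the resume text.

    Falls back to the original text if the name cannot be split into at
    least two parts or if removal fails for any reason.
    """
    try:
        parts = full_name.split()
        if len(parts) < 2:
            return resume_text
        first_name, surname = parts[0], parts[-1]
        for token in (first_name, surname):
            # Case-insensitive, whole-word removal of each name component.
            resume_text = re.sub(
                r"\b" + re.escape(token) + r"\b", "", resume_text, flags=re.IGNORECASE
            )
        return resume_text
    except Exception:
        return resume_text
```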

2.2.3. Data Splitting

The preprocessed dataset was divided into training and validation sets. The split between training and validation data was performed using the train_test_split function from the scikit-learn library. The process is described as follows:
  • The proportion allocated to the validation set is controlled by the VALID_SIZE parameter (0.1 for 10%).
  • Stratification is optionally applied on specified columns to maintain the distribution of the label column in both the training and validation sets.
  • An additional filtering step can be applied before splitting: if a minimum-count threshold is set, job titles with fewer records than this threshold are temporarily removed. The split is performed on the filtered data, and the removed records are then added back to the training set to avoid data loss while ensuring robust stratification.
The test set underwent similar initial processing (like HTML cleaning and, potentially, outlier removal) but was kept separate for final model evaluation.
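The splitting procedure can be sketched as follows; VALID_SIZE matches the value quoted above, while the rare-title threshold, column names, and helper function are illustrative assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

VALID_SIZE = 0.1        # proportion of data reserved for validation
MIN_TITLE_COUNT = 2     # illustrative threshold for rare job titles


def split_train_valid(df: pd.DataFrame):
    # Temporarily set aside job titles with too few records for reliable stratification.
    counts = df["job_title"].value_counts()
    rare_mask = df["job_title"].map(counts) < MIN_TITLE_COUNT
    rare_rows, frequent_rows = df[rare_mask], df[~rare_mask]

    train_df, valid_df = train_test_split(
        frequent_rows,
        test_size=VALID_SIZE,
        stratify=frequent_rows["label"],
        random_state=42,
    )
    # Add the rare-title records back to the training set so no data are lost.
    train_df = pd.concat([train_df, rare_rows])
    return train_df, valid_df
```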

2.3. Proposed Model Architecture

To address the job–candidate matching task, we proposed a Siamese Neural Network architecture. This architecture is designed to learn discriminative embeddings for both job descriptions and candidate résumés such that semantically similar job–résumé pairs are mapped to nearby points in the embedding space, while dissimilar pairs are mapped far apart.

2.3.1. Overall Siamese Framework

The core of the proposed model is a two-tower Siamese network. Both towers share the same weights and consist of two main components: a base sentence embedding model and a sequential head model.
  • Input Processing: Job descriptions and candidate resumes are independently fed into their respective towers.
  • Base Embedding: A pretrained Sentence Transformer model generates initial contextual embeddings for segments of the input text.
  • Sequential Head: A sequential model (GRU, LSTM, or Transformer) processes the sequence of segment embeddings generated by the base model to capture higher-level sequential or contextual information within the document.
  • Final Embedding: Each tower outputs a final fixed-size vector representation (embedding) for the corresponding job description or résumé.
  • Similarity Learning: The network is trained using a triplet loss function, which compares the embeddings of an anchor (job description), a positive sample (matching candidate résumé), and a negative sample (non-matching candidate résumé).
This framework allows the model to learn representations sensitive to the nuanced similarities and differences relevant for job–candidate matching [13,16,17].
Figure 1 presents the end-to-end workflow of the proposed recruitment–matching pipeline. Raw recruitment data are (i) cleaned of HTML artefacts, (ii) anonymized, and (iii) split into training, validation, and test subsets. Each document is encoded with a multilingual Sentence Transformer, after which segment embeddings are processed by one of three sequential heads (GRU, LSTM, or Transformer, all with 256 hidden units). Job description and résumé vectors pass through weight-sharing Siamese towers optimized with a triplet loss objective (margin 0.4). Finally, ranking metrics (Top-K Accuracy, Mean Reciprocal Rank) and t-SNE visualizations are produced; the dotted feedback arrow denotes optional iterative retraining.
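The weight-sharing arrangement of the two towers can be sketched as below; the head argument stands for any of the sequential modules described in Section 2.3.3, and all identifiers are illustrative rather than taken from the authors’ implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SiameseMatcher(nn.Module):
    """Both towers reuse the same head module, so jobs and resumes share all weights."""

    def __init__(self, head: nn.Module):
        super().__init__()
        self.head = head  # GRU, LSTM, or Transformer head (Section 2.3.3)

    def encode(self, segment_embeddings: torch.Tensor) -> torch.Tensor:
        # segment_embeddings: (batch, MAX_SEGMENTS_LENGTH, 512)
        return F.normalize(self.head(segment_embeddings), dim=-1)

    def forward(self, anchor, positive, negative):
        # Triplets of pre-computed segment embeddings pass through one shared head.
        return self.encode(anchor), self.encode(positive), self.encode(negative)
```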

2.3.2. Base Sentence Embedding

The foundation of each tower is a pretrained multilingual Sentence Transformer model. Specifically, we utilized the distiluse-base-multilingual-cased-v2 model [18]. It was chosen for its ability to handle multilingual text (predominantly English and Polish in our dataset) and generate meaningful sentence-level embeddings. The process involves the following steps (a code sketch is provided after the list):
  • Segmenting the input text (job description or résumé) into chunks of a specified length (e.g., 1024 characters).
  • Encoding each segment using the Sentence Transformer to obtain a sequence of initial embeddings, each with a dimension of 512.
  • Optionally concatenating the embeddings generated from different chunk lengths if multiple lengths are specified.
  • Padding or truncating the sequence of segment embeddings to a fixed maximum length (MAX_SEGMENTS_LENGTH, e.g., 15) to create a consistent input shape for the subsequent sequential head model. Padding uses a dedicated embedding for the [PAD] token.
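A minimal sketch of this encoding pipeline using the sentence-transformers library is given below; the chunk length, maximum segment count, and the use of an encoded "[PAD]" string as the padding embedding follow the description above, but the exact implementation details are assumptions.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

base_model = SentenceTransformer("distiluse-base-multilingual-cased-v2")
CHUNK_LEN = 1024          # characters per text segment
MAX_SEGMENTS_LENGTH = 15  # fixed sequence length fed to the head model


def encode_document(text: str) -> np.ndarray:
    """Return a (MAX_SEGMENTS_LENGTH, 512) matrix of segment embeddings."""
    segments = [text[i:i + CHUNK_LEN] for i in range(0, len(text), CHUNK_LEN)]
    segments = segments[:MAX_SEGMENTS_LENGTH] or [""]
    embeddings = base_model.encode(segments)          # shape: (num_segments, 512)
    # Pad with the embedding of a placeholder token up to the fixed length.
    pad = base_model.encode(["[PAD]"])[0]
    while embeddings.shape[0] < MAX_SEGMENTS_LENGTH:
        embeddings = np.vstack([embeddings, pad])
    return embeddings
```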

2.3.3. Sequential Head Models

To process the sequence of segment embeddings generated by the base Sentence Transformer, a sequential head model is applied. It aims to capture dependencies and aggregate information across the segments of a document. We experimented with three different architectures for this head model: GRU, LSTM, and Transformer. All head models took the sequence of 512-dimensional embeddings as input and projected them into a final output embedding space of dimension 256. They utilized an attention mechanism (specifically Bahdanau attention) over the sequence of hidden states to produce the final aggregated representation before the last linear projection.

GRU Head

The GRU head applies a Gated Recurrent Unit (GRU) layer (torch.nn.GRU) with a single layer (NUM_LAYERS = 1) and a hidden layer size of 256 (HEAD_HIDDEN_SIZE). The GRU processes the input sequence of embeddings. The sequence of output hidden states from the GRU layer is then passed through a Bahdanau attention mechanism to compute a context vector. This context vector is finally passed through a linear layer to produce the 256-dimensional output embedding.
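A minimal sketch of such a GRU head with additive (Bahdanau-style) attention pooling, using the hyperparameters quoted above; this is an illustrative reconstruction rather than the authors’ exact module.

```python
import torch
import torch.nn as nn


class BahdanauAttention(nn.Module):
    """Additive attention that pools a sequence of hidden states into one context vector."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(hidden_size, hidden_size), nn.Tanh(), nn.Linear(hidden_size, 1)
        )

    def forward(self, states: torch.Tensor) -> torch.Tensor:
        # states: (batch, seq_len, hidden_size)
        weights = torch.softmax(self.score(states), dim=1)   # (batch, seq_len, 1)
        return (weights * states).sum(dim=1)                  # (batch, hidden_size)


class GRUHead(nn.Module):
    def __init__(self, input_size=512, hidden_size=256, output_size=256):
        super().__init__()
        self.gru = nn.GRU(input_size, hidden_size, num_layers=1, batch_first=True)
        self.attention = BahdanauAttention(hidden_size)
        self.out = nn.Linear(hidden_size, output_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, MAX_SEGMENTS_LENGTH, 512) segment embeddings
        hidden_states, _ = self.gru(x)
        context = self.attention(hidden_states)
        return self.out(context)
```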

LSTM Head

This head uses a Long Short-Term Memory (LSTM) layer (torch.nn.LSTM) configured similarly to the GRU head: a single layer with a hidden size of 256. It processes the sequence of embeddings, and its output hidden states are aggregated using the Bahdanau attention mechanism. A final linear layer maps the resulting context vector to the 256-dimensional output embedding.

Transformer Head

The Transformer head utilizes a Transformer encoder architecture. It first projects the 512-dimensional input embeddings to the model’s hidden dimension (256) using a linear layer. Then, a single-layer Transformer encoder (torch.nn.TransformerEncoder with a torch.nn.TransformerEncoderLayer) processes this sequence. The encoder layer uses multihead self-attention (e.g., 4 heads) and a feedforward network. The output sequence from the Transformer encoder is aggregated using the Bahdanau attention mechanism, followed by a final linear layer projecting to the 256-dimensional output embedding.
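A corresponding sketch of the Transformer head is shown below; it reuses the BahdanauAttention module from the GRU sketch above, and the four-head configuration follows the example in the text. Again, this is an illustrative reconstruction.

```python
import torch
import torch.nn as nn


class TransformerHead(nn.Module):
    def __init__(self, input_size=512, hidden_size=256, output_size=256, num_heads=4):
        super().__init__()
        self.input_proj = nn.Linear(input_size, hidden_size)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=1)
        self.attention = BahdanauAttention(hidden_size)  # as defined in the GRU head sketch
        self.out = nn.Linear(hidden_size, output_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, MAX_SEGMENTS_LENGTH, 512) segment embeddings
        hidden_states = self.encoder(self.input_proj(x))
        return self.out(self.attention(hidden_states))
```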

2.3.4. Similarity Calculation and Loss Function

The model learns to differentiate between matching and non-matching job–candidate pairs through a triplet loss function. Specifically, the GradualTripletLoss is applied. The core idea of triplet loss is to minimize the distance between an anchor embedding (A, e.g., job description) and a positive embedding (P, matching résumé) while maximizing the distance between the anchor and a negative embedding (N, non-matching résumé).
The loss for a triplet $(A, P, N)$ is defined as

$L(A, P, N) = \max\left( d(A, P) - d(A, N) + \alpha,\; 0 \right)$

where $d(x, y)$ represents the distance between embeddings $x$ and $y$, while $\alpha$ is the margin hyperparameter. In our implementation, the embeddings $(A, P, N)$ were L2-normalized before being passed to the loss function. The distance $d$ is the Euclidean distance, which, for normalized vectors, relates directly to cosine similarity. The objective pushes the anchor–negative distance $d(A, N)$ to be larger than the anchor–positive distance $d(A, P)$ by at least the margin $\alpha$.
The GradualTripletLoss variant modifies the margin based on the label strength of the positive sample (encoded numerically, e.g., 1 for ‘positive’, 2 for ‘soft_positive’), potentially applying a smaller effective margin for weaker positive matches. This is controlled by the margin ($\alpha = 0.4$) and a scaler parameter ($\text{scaler} = 4.0$):

$L_{\text{gradual}}(A, P, N, \text{label}_P) = \max\left( d(A, P) - d(A, N) + \alpha \times \frac{5.0 - \text{label}_P}{\text{scaler}},\; 0 \right)$
The model was trained by minimizing the mean of this loss across the batches. The final similarity score between a job and a candidate for ranking purposes is typically calculated using the cosine similarity between their respective output embeddings after training.
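A sketch of a loss module consistent with Equation (2) is given below. The class name GradualTripletLoss appears in the text, but the body, in particular the way the margin is scaled by the numeric label, is our reconstruction of the formula and should be read as an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GradualTripletLoss(nn.Module):
    """Triplet loss whose effective margin shrinks for weaker (soft) positive labels.

    Illustrative reconstruction of Equation (2): label 1 = 'positive',
    label 2 = 'soft_positive'.
    """

    def __init__(self, margin: float = 0.4, scaler: float = 4.0):
        super().__init__()
        self.margin = margin
        self.scaler = scaler

    def forward(self, anchor, positive, negative, positive_label):
        # Embeddings are L2-normalized before computing Euclidean distances.
        anchor, positive, negative = (
            F.normalize(t, dim=-1) for t in (anchor, positive, negative)
        )
        d_pos = F.pairwise_distance(anchor, positive)
        d_neg = F.pairwise_distance(anchor, negative)
        # Full margin for 'positive' samples, reduced margin for 'soft_positive'.
        effective_margin = self.margin * (5.0 - positive_label) / self.scaler
        return torch.clamp(d_pos - d_neg + effective_margin, min=0.0).mean()
```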

2.4. Training Details

2.4.1. Hyperparameters

The training process used specific hyperparameters crucial for model performance and convergence. The key settings, consistent across the GRU, LSTM, and Transformer head model experiments, are detailed in Table 4 (adapted from the configuration files and training scripts). The optimizer used was Adam. The Gradual Triplet Loss function applied a margin ( α ) of 0.4 and a scaler parameter of 4.0, designed to differentiate between positive, soft positive, and negative examples effectively.

2.4.2. Hardware and Software Setup

The experiments were conducted on a high-performance computing setup to handle the demands of training deep neural networks. All models (GRU, LSTM, Transformer) were trained using identical hardware and software configurations, as summarized in Table 5. Each run utilized two NVIDIA (Santa Clara, CA, USA) GH200 GPUs with 144 GB HBM3e memory each, leveraging CUDA version 12.6 for GPU acceleration. The data loading process was supported by 144 CPU workers. Key libraries included PyTorch for model implementation and training, Sentence Transformers for base embeddings, and scikit-learn for utilities like t-SNE.

2.5. Evaluation Metrics

Model performance was evaluated using a combination of standard ranking metrics and an analysis of the learned embedding space.

2.5.1. Ranking Metrics

To quantify the model’s ability to rank relevant candidates higher than irrelevant ones for a given job description, we applied Top-K Accuracy and Mean Reciprocal Rank (MRR), which are described as follows (a computation sketch follows the list):
  • Top-K Accuracy: Measures the percentage of times a correct (positive or soft positive) candidate résumé appears within the top-K-ranked résumés for its corresponding job description. We calculated this for K values of 1, 3, 5, 10, 20, 30, 50, 75, and 100. The comparative results are presented in Table 6.
  • Mean Reciprocal Rank (MRR): Calculates the average of the reciprocal ranks of the first correct candidate resume for each job description. A higher MRR indicates that the correct candidate is typically found closer to the top of the ranked list. The MRR values are included in Table 7.
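The sketch below shows one way to compute both metrics from a job-by-candidate similarity matrix; the function and variable names are illustrative.

```python
import numpy as np


def ranking_metrics(similarity: np.ndarray, relevant: np.ndarray, ks=(1, 3, 5, 10, 100)):
    """Compute Top-K accuracy and MRR.

    similarity: (num_jobs, num_candidates) cosine-similarity matrix.
    relevant:   boolean matrix of the same shape marking correct matches.
    """
    ranked = np.argsort(-similarity, axis=1)      # candidate indices, best first
    top_k_hits = {k: 0 for k in ks}
    reciprocal_ranks = []
    for job_idx, order in enumerate(ranked):
        hits = relevant[job_idx, order]           # relevance flags in ranked order
        if hits.any():
            reciprocal_ranks.append(1.0 / (np.argmax(hits) + 1))
        else:
            reciprocal_ranks.append(0.0)
        for k in ks:
            top_k_hits[k] += int(hits[:k].any())
    num_jobs = similarity.shape[0]
    top_k_accuracy = {k: count / num_jobs for k, count in top_k_hits.items()}
    return top_k_accuracy, float(np.mean(reciprocal_ranks))
```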

2.5.2. Embedding Space Analysis

To qualitatively assess the learned representations, we visualized the embedding space using t-Distributed Stochastic Neighbor Embedding (t-SNE). This technique reduces the high-dimensional embedding vectors (256 dimensions in our case) into a 2D space, allowing for visual inspection of the separation between different types of embeddings.
We generated t-SNE plots for 1000 sampled triplets (Job Description, Positive Resume, Negative Resume) from the test set for each trained model head (GRU, LSTM, Transformer). These visualizations (Figure 2, Figure 3 and Figure 4) help understand how well the models cluster similar items (jobs and positive résumés) while separating dissimilar items (jobs and negative résumés). Blue points represent job descriptions, green points represent corresponding positive résumés, and red points represent corresponding negative résumés. Ideally, green points should cluster near their blue counterparts, while red points should be further away.
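A minimal sketch of how such a plot can be produced with scikit-learn’s TSNE and matplotlib, following the sampling and color scheme described above (all names are illustrative):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE


def plot_triplet_embeddings(jobs, positives, negatives, title="t-SNE of triplets"):
    """jobs, positives, negatives: (N, 256) arrays of final embeddings."""
    stacked = np.vstack([jobs, positives, negatives])
    projected = TSNE(n_components=2, init="pca", random_state=42).fit_transform(stacked)
    n = len(jobs)
    groups = [
        (projected[:n], "blue", "Job Description"),
        (projected[n:2 * n], "green", "Positive Resume"),
        (projected[2 * n:], "red", "Negative Resume"),
    ]
    for points, color, label in groups:
        plt.scatter(points[:, 0], points[:, 1], s=8, c=color, label=label, alpha=0.6)
    plt.legend()
    plt.title(title)
    plt.show()
```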

3. Results and Discussion

3.1. Computational Setup

Table 5 presents detailed information about the computational resources used in this study. The table includes the total and average training time for each variant of the model (GRU, LSTM, and Transformer), the device configuration (including GPU model, CUDA version, and the number of GPUs), and the number of CPU workers for data loading. Such details are crucial for reproducibility and for understanding the computational environment in which the experiments were conducted.
Table 5 contains the main configuration and timing statistics for training each model type. The Total Run Duration (seconds) and Avg. Epoch Duration (seconds) columns provide insight into how long it took to train the models under identical experimental conditions. The Device Used and GPU Name rows specify which hardware was applied, while the Num GPU Used row indicates how many GPUs were running in parallel. The GPU CUDA Version row shows the CUDA environment version, and the Num Dataloader Workers (CPU) row reports how many CPU processes or threads were used to load the data efficiently during training.
Figure 2. GRU final t-SNE visualization (1000 triplets). Blue points correspond to Job Description, green points to Positive Resume, and red points to Negative Resume.
Figure 3. LSTM final t-SNE visualization (1000 triplets). Blue points correspond to Job Description, green points to Positive Resume, and red points to Negative Resume.
Figure 4. Transformer final t-SNE visualization (1000 triplets). Blue points correspond to Job Description, green points to Positive Resume, and red points to Negative Resume.

3.2. Model Architecture and Embedding Configuration

In Table 8, we summarize the key parameters of the Sentence Transformer base model and the sequential head (GRU, LSTM, or Transformer). These parameters define how text inputs (such as job descriptions and résumés) are encoded into embeddings, how many hidden units are in the sequential layer, how many layers are stacked, and how the sequence of embeddings is aggregated (e.g., using an attention mechanism).
Table 8 outlines the primary settings for both the base embedding model and the sequential head. Base Model shows the pretrained model name (distiluse-base-multilingual-cased-v2), which is responsible for initial text embedding. The Input Embedding Size indicates the dimensionality of those embeddings. The rows under Head Model specify the hidden size (Hidden Size), the depth of the sequential network (Number of Layers), the dimensionality of the final output vector (Output Embedding Size), and the aggregation strategy (Output Aggregation) used to combine the sequential outputs into a single embedding vector.

3.3. Training Hyperparameters, Evaluation Settings, and Data Preprocessing

Table 4 presents the main training hyperparameters, such as batch size, learning rate, and number of epochs, as well as the configuration for the triplet loss function. Additionally, it specifies how the model was evaluated using metrics like Top-K accuracy and how data were processed or split (e.g., validation set ratio, chunk length, and maximum segments per document).
The Training Hyperparameters section defines key elements such as the batch size (Batch Size), number of epochs (Number of Epochs), and the learning rate (Learning Rate). It also includes the triplet loss margin (Triplet Loss Margin) and a scaling factor for a custom variant of this loss (Gradual Triplet Scaler). Under Evaluation and Logging, we list the K values for computing Top-K accuracy, the sample size for t-SNE visualization, and file naming conventions for saving both the main and the best model states. Finally, the Data Handling and Preprocessing rows indicate how the validation split was configured (Validation Set Ratio) and specify text chunk parameters (Embedding Chunk Lengths and Max Segments per Document) for embedding generation. All these parameters collectively ensure that the training, evaluation, and preprocessing steps are clearly documented and reproducible.

3.4. Quantitative Results

3.4.1. Performance Metrics Comparison

In Table 6, we present the Top-K Accuracy results for all three models (GRU, LSTM, and Transformer) evaluated on the training, validation, and test sets. These accuracies are reported for various values of K, allowing us to examine the ranking quality at different cutoff thresholds. Specifically, Top-1 and Top-3 provide insight into how often the very best matches were correctly identified, whereas larger K values (such as Top-50 or Top-100) offer a broader perspective on how well the models performed when returning an extended list of potential candidates.
Table 7 complements the Top-K Accuracy results by showing additional performance metrics, such as the Mean Reciprocal Rank (MRR), the average similarities between job and positive (matching) résumés, the average similarities for job and negative (non-matching) résumés, and the final loss values. MRR places particular emphasis on the position of the first correct candidate, thereby reflecting how promptly a model can retrieve a valid match. The average positive vs. negative similarities indicate how effectively each model managed to separate relevant resumes from irrelevant ones in the embedding space. Finally, the reported loss values show how consistently each model converged during training and how well it generalized on unseen data.

GRU Performance

Based on Table 6, the GRU model achieved robust performance across small K ranges (Top-1 and Top-3). This suggests that it excels at identifying the single most relevant candidate. This effect is also reflected in its high MRR (Table 7), indicating that correct candidates often appear near the top of its ranked lists. The GRU’s recurrent gating mechanism helps capture job requirements and candidate experience in a sequential manner, which is beneficial when dealing with moderately sized text segments. Moreover, the noticeable gap between average positive and negative similarities (shown in Table 7) suggests that the GRU effectively learns to cluster relevant and irrelevant résumés. However, its loss curve tended to stabilize at a slightly higher value than the Transformer, implying that while GRU is strong in identifying the best match early, it may have a slightly harder time achieving consistently low overall error across a broad range of candidates.

LSTM Performance

The LSTM model often displayed comparable or improved Top-K Accuracy scores (Table 6) relative to the GRU for K values in the mid-range (e.g., Top-10 and Top-20). The LSTM’s long-term memory cells capture more extended patterns in the text, making it particularly beneficial for lengthy résumés with nuanced skills or for job descriptions containing extensive requirements. Table 7 shows that the LSTM also achieved a high MRR and maintained a discernible gap between its average positive and negative similarities, demonstrating good discriminative power. However, the average negative similarity was sometimes slightly higher than with GRU, suggesting that while LSTM excels in correctly ranking positive matches, it may occasionally place some negative résumés closer to the positive ones in the embedding space. Despite this, the overall loss for LSTM remained at a level that indicates strong convergence and effective learning of job–candidate relationships.

Transformer Performance

The Transformer model exhibited the most balanced performance across all tested values of K, as demonstrated in Table 6. Its Top-K Accuracy was consistently strong, and its MRR tended to be the highest among the three architectures (Table 7). The self-attention mechanism within the Transformer proves especially valuable for capturing complex semantic relationships in job descriptions and candidate résumés, which can span multiple sentences or paragraphs. As a result, the embedding space separation between positive and negative examples was more pronounced, with average positive similarities often being substantially higher than average negative similarities. The relatively low final loss value further confirms that the Transformer model generalized effectively, ranking correct candidates near the top of the list in most cases. This advantage was particularly evident in scenarios where job requirements were very detailed or when résumés described multiple, interconnected skillsets.
Overall, the Transformer emerged as the most robust architecture, with the most consistent ranking performance across a variety of metrics and K values, while LSTM and GRU still maintained a competitive edge at certain ranking thresholds and exhibited strong general performance. Each model’s strengths might be more or less relevant depending on the specific priorities of a recruitment scenario (e.g., precision for top suggestions, coverage in broader candidate lists, or handling particularly lengthy text documents).

3.4.2. Training Dynamics

As seen in Figure 5, the GRU architecture showed a rapid decrease in average training loss within the first few epochs, suggesting that the gated recurrent mechanism efficiently captured relevant features from the textual data early in training. As the epochs progressed, the GRU curve progressively flattened and approached a near-zero region, reflecting the model’s increasing confidence in distinguishing job–candidate pairs that are highly similar from those that are not. Although there is some minor fluctuation visible in the tail end of the curve, the overall trend indicates strong convergence characteristics.
As seen in Figure 6, the LSTM model exhibited a broadly similar trajectory of loss reduction but with a slightly more pronounced early-stage decrease. This is attributable to the LSTM’s capacity to retain longer-term dependencies in textual descriptions, which can be especially beneficial for résumés and job posts containing extensive lists of qualifications. Over time, the loss stabilized to a near-constant level, illustrating that the model successfully learned to align relevant job–candidate embeddings.
Finally, Figure 7 demonstrates the Transformer architecture’s steep initial drop in loss, which was followed by a smoother convergence phase relative to the recurrent approaches. The self-attention layers in the Transformer captured global semantic relationships throughout the text, allowing faster learning in earlier epochs and relatively stable optimization behavior. Despite slight oscillations, the Transformer-based head generally achieved the lowest plateau, highlighting its robustness for complex multisentence inputs.
Collectively, these training curves underscore that all three architectures can effectively reduce the margin in the learned embedding space, but each model exploits different internal mechanisms for capturing long-range dependencies, with the Transformer showing especially rapid early gains and strong final convergence.
The GRU head started at a loss of ~0.19 and pushed it below 0.05 in the first 20 epochs, confirming the very fast convergence typical for compact gated units [20]. By epoch 100, the curve dropped under 0.02; after epoch 350, it flattened close to 0.008 and settled at ~0.003 near epoch 900, indicating that further training yielded only marginal refinement of the embedding space. Because the model reached 95% of its eventual gain in under a minute (Table 5), it is attractive for frequent retraining whenever new job families appear.
The LSTM began slightly higher (~0.21) and needed ~25 epochs to descend below 0.05, reflecting its larger parameter count. Nevertheless, the curve kept improving and broke the 0.01 barrier around epoch 500, reaching a final plateau at ~0.004 by the end of training. The longer tail confirms observations by Yu et al. [32] that LSTMs continue to extract small gains from extended training; whether this extra accuracy justifies the increased training time depends on hardware availability.
Self-attention accelerates optimization: Starting at 0.11, the loss collapsed under 0.03 within only 20 epochs—twice as fast as either recurrent alternative. The curve passed 0.01 before epoch 250 and leveled off around 0.004 at epoch 900—the lowest steady state value among all heads. These numbers confirm that the Transformer offers the best speed-to-performance ratio when documents exceed one thousand characters.
In Figure 8, Figure 9 and Figure 10, we present the validation loss curves over consecutive training epochs for three distinct model architectures. Each plot reveals a rapid decrease in loss during the initial epochs, indicating that the networks quickly learned to differentiate between similar and dissimilar job–candidate pairs. Subsequently, the curves flattened out and hovered around low values, signifying that the models converged to stable solutions as training progressed.
For the GRU network (Figure 8), the loss plunged from a relatively high initial level to roughly 0.03–0.04 over the first few dozen epochs, followed by only minor oscillations. A similar pattern emerged in the LSTM model (Figure 9), where there was also a marked early decline and subsequent leveling off, reflecting how the LSTM’s gating mechanism adeptly captured temporal or sequential relationships in the text. In the case of the Transformer (Figure 10), we can observe a higher initial loss peak that quickly dropped, followed by a stabilization phase at a comparably low level. The self-attention mechanism underlying the Transformer model enabled it to efficiently represent long-range dependencies between elements of the job descriptions and candidate résumés.
Across all three plots, the quick descent in validation loss indicates effective triplet loss minimization, whereas the final plateaus suggest that each architecture learned a robust representation for job–candidate matching. Subtle differences among the curves reflect variations in how recurrent networks (GRU and LSTM) and the self-attention approach (Transformer) capture the relevant semantic and structural features within the recruitment data.
The GRU head trimmed the validation loss from ~0.14 to ~0.04 inside the first ten epochs, mirroring its fast training set convergence. Between epochs 50 and 250, the curve bottomed out near 0.030, after which a slow up-turn appeared, and the loss drifted toward 0.045 at epoch 1000. That rebound suggests that the model started to overfit once the margin had been maximized, a behavior also observed by Tang et al. [39]. Early stopping around epoch 250 would therefore preserve the best generalization for GRU.
The LSTM followed a comparable trajectory: An initial plunge from ~0.15 to ~0.04, then a gradual decline that plateaued between 0.030 and 0.035 until roughly epoch 500. Afterwards, the curve climbed modestly, finishing close to 0.042. The shallower U-shape compared to GRU indicates that the LSTM’s larger memory cell delayed overfitting, yet the upward tail confirms the need for a patience-based stopping criterion, as recommended by Yu et al. [32].
The Transformer head was the fastest to reach its minimum: the loss collapsed from 0.065 to below 0.030 within 25 epochs and oscillated narrowly (0.028–0.033) for the remainder of training. Although a slight positive drift is still visible after epoch 600, the slope is milder than for the recurrent heads, signaling stronger generalization margins. Consequently, the Transformer could train longer without severe overfitting, but epochs beyond 300 yielded only marginal improvement.
Figure 11 presents the validation accuracy of the GRU-based Siamese model as training progressed, showing a rapid initial increase followed by a smoother improvement phase, which indicates that the GRU mechanism quickly captured core similarities between job descriptions and candidate résumés. Figure 12 depicts the corresponding validation accuracy curve for the LSTM model, where the network also demonstrated strong early gains but with a more gradual convergence, reflecting its capacity to handle longer sequential dependencies in the text. Finally, Figure 13 illustrates the validation accuracy of the Transformer-based approach, revealing a relatively stable trajectory and consistently high accuracy later in training, suggesting that its attention mechanisms are particularly effective at extracting nuanced features from both job and résumé texts to distinguish suitable candidates from less relevant ones.
The GRU surpassed 80% Top-100 accuracy within the first ~50 epochs, demonstrating that the network learned to surface the most relevant candidates very quickly. The Top-1 accuracy, however, increased much more slowly and stabilized only around ~10%, indicating that broad recall was mastered long before the model consistently ranked the single best candidate first. The small oscillations that appeared after epoch 600 echo the rise in validation loss seen in Figure 8 and are a sign of incipient overfitting; early stopping or a shorter patience window would therefore be beneficial.
The LSTM lifted Top-1 to ~12% and pushed Top-20 to ~48%. The smoother shape of all five trajectories—especially the stable plateau between epochs 300 and 600—reflects the cell state’s ability to dampen gradient noise and retain long-range context. This yielded a better balance between recall and precision, although it took longer to reach peak performance.
The self-attention model leapt to ~82% Top-100 accuracy by about epoch 20 and maintained the highest values for every K throughout training. The gain was most pronounced at Top-20 (~50%), outperforming both recurrent variants by 2–3 pp and confirming that attention mechanisms separate the most relevant candidates more effectively. Because all curves flattened after epoch 300, additional training yielded diminishing returns, so the run could safely be terminated earlier.

3.5. Qualitative Analysis

Embedding Space Visualization

As illustrated in Figure 2, the GRU-based model’s t-SNE projection displays distinct clusters for Job Description (blue), Positive Resume (green), and Negative Resume embeddings (red). Although some overlap is observable—especially where résumés partially match certain job descriptions—green points generally lie closer to their corresponding blue anchors than red points do, indicating that the GRU managed to separate the most well-aligned résumés from clearly incompatible ones.
In Figure 3, the LSTM-based model yielded a broadly similar distribution pattern, although its clusters appear slightly more spread out. The LSTM’s ability to capture longer-range dependencies in textual data (e.g., detailed résumés with multiparagraph experience and intricate job requirements) often yields tighter local groupings of green (positive) points around the blue (job) points, while red (negative) points drift farther away. However, some mixed regions suggest that additional context, such as domain-specific skills, might still be challenging to isolate purely via textual embeddings.
Finally, Figure 4 showcases the Transformer-based model’s t-SNE embedding. Here, clusters of blue and green points are more clearly discernible, and negative resumes (red) are distributed more distinctly. The self-attention mechanism appears to help encode nuanced relationships, causing them to cluster more cohesively. In particular, candidates with highly specialized or precisely matched skill profiles show very close proximity to the relevant job descriptions.
Overall, this comparative view highlights that while all three sequential heads (GRU, LSTM, and Transformer) achieved effective separations in embedding space, the Transformer-based model demonstrated a consistently sharper distinction between positive and negative résumés.

3.6. Discussion

The experimental results indicate that all three sequential heads achieved competitive performance within the Siamese framework for job–candidate matching. However, several notable observations can be drawn from the quantitative and qualitative analyses performed.
  • Interpretation of Results.
Across multiple evaluation metrics, including Top-K accuracy, Mean Reciprocal Rank (MRR), and embedding space separability, the Transformer-based head consistently demonstrated the strongest performance. In particular, it achieved the highest Top-1 accuracies and placed correct candidates near the top of the ranking more reliably (higher MRR) compared to GRU and LSTM. This finding suggests that the self-attention mechanism in Transformers captures long-range semantic relationships between job descriptions and candidate resumes more effectively. Both GRU and LSTM showed fast initial convergence and performed well on smaller K values (e.g., Top-1 and Top-3), highlighting that recurrent gating mechanisms can quickly learn relevant features in textual data. Still, their overall final performance lagged slightly behind the Transformer head when considering broader ranking scenarios (e.g., Top-20 and Top-50).
  • Comparison to Related Work.
Our results align with the existing literature, where Transformers often outperform recurrent models on various NLP tasks due to their superior capacity for learning global dependencies. Many prior studies on text similarity, information retrieval, and recommendation have reported similar advantages of attention-based architectures, especially in scenarios involving complex or lengthy documents. Although GRU- and LSTM-based Siamese networks have also shown efficacy in semantic matching tasks (e.g., question–answer retrieval), the Transformer remains a strong contender due to its parallelization and ability to focus on different parts of the input text simultaneously.
  • Strengths and Weaknesses of the Proposed Approach.
A principal strength of our framework lies in its modular design: a pretrained multilingual Sentence Transformer base is combined with a specialized sequential head, allowing for flexible adaptation to different languages and domain-specific text structures. Additionally, the triplet loss function provides a robust way to enforce margin separation between positive and negative pairs. This helps the model to learn a more discriminative embedding space, which is crucial for ranking candidates accurately.
One potential weakness is the reliance on large-scale pretrained embeddings, which can introduce biases or inaccuracies if the domain-specific vocabulary is underrepresented in the base model. Furthermore, while our attention-based modules capture nuanced textual relationships, they can sometimes be sensitive to hyperparameter tuning (e.g., number of attention heads, hidden size, etc.). This complexity may increase the overhead for practitioners looking for quick deployment without extensive model optimization.
  • Impact of Different Sequential Heads.
Our comparative study of GRU, LSTM, and Transformer heads underscores the importance of architectural choices in the final stage of representation learning. The GRU’s simpler gating mechanism proves effective and computationally efficient for moderate text lengths. LSTM, with its explicit memory cell structure, manages longer text dependencies well but can be slower to train. The Transformer’s self-attention not only captures intricate global patterns but also scales effectively for varying document lengths [41]. Depending on real-world constraints such as hardware resources, maximum input size, and the criticality of high recall at large K, recruiters or system integrators might select the most suitable head architecture for their specific use case.

3.7. Limitations

While our study demonstrates promising results for automated job–candidate matching in the IT recruitment domain, several limitations must be acknowledged:
  • Dataset Size and Representation Bias: Although our dataset is drawn from real-world IT recruitment processes, it may not cover all role types or industry segments comprehensively. Additionally, the proportion of résumés in English vs. Polish could bias the model’s multilingual performance. Models trained under these conditions might not generalize well to other languages or drastically different domains without further fine-tuning.
  • Domain Specificity: We focused on IT-related positions, where job descriptions often include specific technical terms and résumés list programming languages or frameworks. Hence, models trained on this domain may need additional data and adaptation for non-technical roles (e.g., marketing or finance) or for specialized subfields in IT (e.g., data science vs. cybersecurity).
  • Model Complexity: Transformer-based heads, in particular, can have high memory and computational demands during training and inference. Organizations with limited GPU or CPU resources might find GRU or LSTM architectures more practical, albeit with slightly reduced performance. Hyperparameter tuning can further inflate the total cost of model deployment.
  • No Explicit Handling of Structured Fields: Our current approach encodes job descriptions and résumés as free-text strings, largely ignoring tabular or structured information such as discrete skill lists, candidate location, or salary ranges. While the neural representations implicitly capture some structure, explicit incorporation of tabular or metadata features could further improve matching accuracy.
  • Evaluation Based on Historical Outcomes: Our ground truth labels reflect the final result of each recruitment process, which may include factors beyond direct skill or qualification matches (e.g., salary negotiation, candidate relocation constraints, etc.). Relying on these labels assumes that historical decisions perfectly correlate with “best match” outcomes, which might not always hold. Future studies could refine labeling schemes to isolate skill-based matching from external hiring factors.
Despite these constraints, our findings highlight the effectiveness of combining multilingual embeddings with specialized sequential heads for matching job descriptions with candidate résumés. Future directions include expanding the dataset to encompass additional industries, incorporating structured features, and exploring more advanced attention-based architectures to refine the semantic matching process.

4. Conclusions

In this paper, we presented a Siamese network framework enhanced with sequential heads (GRU, LSTM, and Transformer) for matching job descriptions and candidate résumés in an IT recruitment context. Leveraging a multilingual pretrained Sentence Transformer base, our architecture encoded rich semantic representations of job and résumé texts, while the sequential heads captured longer-range dependencies and contextual nuances. In a detailed evaluation on real-world recruitment data, all three head models achieved competitive ranking and retrieval performance, with the Transformer-based approach providing slightly better Top-K accuracy at larger K and a higher Mean Reciprocal Rank (MRR) on the test set.
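For readers who wish to reproduce the ranking evaluation, the snippet below illustrates one standard way to compute Top-K accuracy and MRR from a job-to-candidate cosine-similarity matrix. It is a minimal sketch that assumes each job's correct candidate occupies the matching column index; the candidate-pool construction used in our experiments may differ, and the random embeddings are for illustration only.

```python
import numpy as np

def ranking_metrics(sim, k_values=(1, 3, 5, 10, 20, 30, 50, 75, 100)):
    """sim[i, j] = cosine similarity between job i and candidate j.
    Assumes the correct candidate for job i is stored at column i."""
    order = np.argsort(-sim, axis=1)                               # best candidate first
    ranks = np.argmax(order == np.arange(sim.shape[0])[:, None], axis=1) + 1
    mrr = float(np.mean(1.0 / ranks))                              # Mean Reciprocal Rank
    top_k = {k: 100.0 * float(np.mean(ranks <= k)) for k in k_values}
    return mrr, top_k

# Example with random, L2-normalized embeddings (illustration only).
rng = np.random.default_rng(0)
jobs = rng.normal(size=(1000, 256))
cands = rng.normal(size=(1000, 256))
jobs /= np.linalg.norm(jobs, axis=1, keepdims=True)
cands /= np.linalg.norm(cands, axis=1, keepdims=True)
mrr, top_k = ranking_metrics(jobs @ cands.T)
print(f"MRR={mrr:.3f}, Top-100={top_k[100]:.2f}%")
```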
Our findings align with the broader NLP literature, confirming that self-attention mechanisms are particularly well suited for handling complex textual relationships and longer documents. Nonetheless, the GRU and LSTM heads also provided strong outcomes, suggesting that the choice of architecture may depend on practical constraints such as available computational resources or the specific ranking goals of a given recruitment system. In particular, we found that the GRU was highly efficient and converged rapidly, while the LSTM proved useful for tasks requiring more explicit long-term memory structures.
The proposed models demonstrate considerable promise in reducing time-to-hire and enhancing overall recruitment accuracy by automatically surfacing candidates most aligned with a role’s requirements. The clear separation in the learned embedding space between positive and negative examples indicates robust discriminative capabilities, thus paving the way for improved job–candidate matching in production-scale settings.

Future Work

Several avenues for future research and practical enhancements of the presented approach can be identified:
  • Integration of Structured Data: While our approach focused on unstructured text (job and résumé content), incorporating tabular metadata (e.g., candidate location, years of experience, and skill taxonomies) might further refine matching outcomes.
  • Domain Expansion: Although our dataset was derived from IT-specific recruitment pipelines, extending experiments to other industry sectors or roles with distinct terminologies (e.g., finance, marketing, healthcare, etc.) would test model generalization capabilities and highlight potential domain adaptation strategies.
  • Explainability and Interpretability: Providing interpretability (e.g., attention heatmaps or saliency scores) would offer transparency to recruiters, helping them understand why certain candidates rank higher than others and thus increasing the trust in the system’s recommendations.
  • Real-Time Inference Optimization: Transformer heads can be computationally demanding in real-time ranking scenarios. Investigating model distillation, quantization, or efficient Transformer variants could make the approach more practical for large-scale deployments (a small quantization sketch is given at the end of this section).
  • Active Learning and Online Updates: In real recruitment environments, data distribution shifts occur frequently (e.g., new technologies, changing skill demands, etc.). Implementing active learning, continuous fine-tuning, or online learning methods can help the models to remain current and effective.
By pursuing these directions, we aim to deliver more comprehensive, transparent, and scalable solutions for job–candidate matching, ensuring that advanced NLP technologies continue to benefit both hiring organizations and job seekers alike.
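As a concrete starting point for the inference-optimization direction listed above, post-training dynamic quantization of the head's linear layers is often a low-effort first step. The snippet below is a sketch using PyTorch's built-in dynamic quantization on a stand-in module (in practice, the trained head saved as head_model.pt per Table 4 would be used instead of the toy network); actual latency and accuracy effects would need to be measured on the target hardware.

```python
import torch
import torch.nn as nn

# Stand-in for a trained head module; replace with the real trained head
# (saved as head_model.pt in Table 4) before measuring any speedup.
head = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 256))
head.eval()

# Post-training dynamic quantization: Linear weights are stored as int8 and
# dequantized on the fly, reducing model size and speeding up CPU inference.
quantized_head = torch.quantization.quantize_dynamic(
    head, {nn.Linear}, dtype=torch.qint8)

with torch.no_grad():
    pooled_segments = torch.randn(1, 512)       # a pooled 512-d document vector
    embedding = quantized_head(pooled_segments)
print(embedding.shape)                           # torch.Size([1, 256])
```

Because dynamic quantization targets CPU inference, GPU deployments would more likely rely on reduced precision or distillation, as noted above.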

Author Contributions

Conceptualization, M.Ł., J.K. and T.L.; methodology, J.K., M.Ł. and T.L.; software, M.Ł., J.K., B.A. and T.L.; validation, M.B., B.Ś., J.K. and M.Ł.; formal analysis, M.Ł.; investigation, B.Ś., T.L. and B.A.; resources, Ł.D. and R.Z.; data curation, G.B. and B.N.; writing—original draft preparation, J.K. and I.A.; writing—review and editing, J.K. and I.A.; visualization, M.B. and B.A.; supervision, Ł.D. and J.K.; project administration, Ł.D. and R.Z.; funding acquisition, Ł.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author. The data are not publicly available due to privacy and confidentiality.

Conflicts of Interest

Authors Mateusz Łępicki, Michał Bukowski, Grzegorz Baranik, Bogusz Nowak, and Łukasz Dobrakowski were employed by the company Avenga IT Professionals sp. z o.o.

References

  1. World Economic Forum. The Future of Jobs Report 2023. White Paper. 2023. Available online: https://www.weforum.org/publications/the-future-of-jobs-report-2023/ (accessed on 17 May 2025).
  2. Upadhyay, A.K.; Khandelwal, K. Applying artificial intelligence: Implications for recruitment. Strateg. HR Rev. 2018, 17, 255–258.
  3. LinkedIn Talent Insights Team. Global Talent Trends 2023. Industry Report. 2023. Available online: https://www.accaglobal.com/gb/en/professional-insights/pro-accountants-the-future/global-talent-trends-2023.html (accessed on 17 May 2025).
  4. Saxena, P.; Agrawal, V.; Pradhan, I.P. AI-Powered Talent Acquisition and Recruitment. In Human Resource Management and Artificial Intelligence; Routledge: London, UK, 2025; pp. 25–43.
  5. Gethe, R.K. Extrapolation of talent acquisition in AI aided professional environment. Int. J. Bus. Innov. Res. 2022, 27, 462–479.
  6. Gupta, S.; Shukla, B.; Sharma, U.; Hinneh, P.J.; Eswaran, B.; Sinha, B.C.; Kumari, A.; Sharma, N.; Laddhu, S. Impact of artificial intelligence and machine learning on recruitment process in MNCs of India. In Sustainable Management Practices for Employee Retention and Recruitment; IGI Global: Hershey, PA, USA, 2025; pp. 177–198.
  7. Bouhoun, Z.; Guerrois, T.; Li, X.; Baker, M.; Elhadji Ille Gado, N.; Roumili, E.; Vitillo, F.; Benmiloud Bechet, L.; Plana, R. Information Retrieval Using Domain Adapted Language Models: Application to Resume Documents for HR Recruitment Assistance. In Computational Science and Its Applications—ICCSA 2023 Workshops; Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer: Berlin/Heidelberg, Germany, 2023; pp. 440–457.
  8. Jiang, J.; Ye, S.; Wang, W.; Xu, J.; Luo, X. Learning Effective Representations for Person-Job Fit by Feature Fusion. In Proceedings of the CIKM ’20: 29th ACM International Conference on Information & Knowledge Management, Online, 19–23 October 2020; pp. 2549–2556.
  9. Menacer, M.A.; Hamda, F.B.; Mighri, G.; Hamidene, S.B.; Cariou, M. An Interpretable Person-Job Fitting Approach Based on Classification and Ranking. 2021, pp. 130–138. Available online: https://aclanthology.org/2021.icnlsp-1.15.pdf (accessed on 17 May 2025).
  10. Yazici, M.B.; Sabaz, D.; Elmasry, W. AI-based Multimodal Resume Ranking Web Application for Large Scale Job Recruitment. In Proceedings of the 2024 8th International Artificial Intelligence and Data Processing Symposium (IDAP), Malatya, Türkiye, 21–22 September 2024.
  11. Rezaeipourfarsangi, S.; Milios, E.E. AI-powered Resume-Job matching: A document ranking approach using deep neural networks. In Proceedings of the DocEng ’23: ACM Symposium on Document Engineering, Limerick, Ireland, 22–25 August 2023.
  12. Gu, F.; Lu, J.; Cai, C. RPformer: A Robust Parallel Transformer for Visual Tracking in Complex Scenes. IEEE Trans. Instrum. Meas. 2022, 71, 5011214.
  13. Kurek, J.; Latkowski, T.; Bukowski, M.; Świderski, B.; Łępicki, M.; Baranik, G.; Nowak, B.; Zakowicz, R.; Dabrowski, L. Zero-Shot Recommendation AI Models for Efficient Job–Candidate Matching in Recruitment Process. Appl. Sci. 2024, 14, 2601.
  14. Nikolaou, I. What Is Artificial Intelligence in Recruitment and Selection? A Research Agenda for Future Adoption. Eur. J. Work Organ. Psychol. 2021, 30, 505–513.
  15. Bevara, R.V.K.; Mannuru, N.R.; Karedla, S.P.; Lund, B.; Xiao, T.; Pasem, H.; Dronavalli, S.C.; Rupeshkumar, S. Resume2Vec: Transforming Applicant Tracking Systems with Intelligent Resume Embeddings for Precise Candidate Matching. Electronics 2025, 14, 794.
  16. Bromley, J.; Bentz, J.W.; Bottou, L.; Guyon, I.; Lecun, Y.; Moore, C.; Säckinger, E.; Shah, R. Signature verification using a “siamese” time delay neural network. Int. J. Pattern Recognit. Artif. Intell. 1993, 7, 669–688.
  17. Koch, G.; Zemel, R.; Salakhutdinov, R. Siamese neural networks for one-shot image recognition. In Proceedings of the 32nd International Conference on Machine Learning (ICML) Deep Learning Workshop, Lille, France, 6–11 July 2015. Available online: https://www.cs.cmu.edu/~rsalakhu/papers/oneshot1.pdf (accessed on 17 May 2025).
  18. Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Hong Kong, China, 3–7 November 2019; pp. 3982–3992.
  19. Lu, X.; Sheng, W.; Li, X. TSN-GReID: Transformer-based Siamese Network for Group Re-Identification. In Proceedings of the 2022 17th International Conference on Control, Automation, Robotics and Vision (ICARCV), Singapore, 11–13 December 2022; pp. 422–427.
  20. Maheshwary, S.; Misra, H. Matching Resumes to Jobs via Deep Siamese Network. In Proceedings of the WWW ’18: Companion of the Web Conference 2018, Lyon, France, 23–27 April 2018; pp. 87–88.
  21. Wu, R.; Wen, X.; Yuan, L.; Xu, H.; Liu, Y. Visual Tracking based on deformable Transformer and spatiotemporal information. Eng. Appl. Artif. Intell. 2024, 127, 107269.
  22. Che, C.; Fu, Y.; Shi, W.; Zhu, Z.; Wang, D. Dual Feature Fusion Tracking with Combined Cross-Correlation and Transformer. IEEE Access 2023, 11, 144966–144977.
  23. Yu, X.; Xu, R.; Xue, C.; Zhang, J.; Yu, Z. ConFit v2: Improving Resume–Job Matching Using Hypothetical Resume Embedding and Runner-Up Hard-Negative Mining. arXiv 2025, arXiv:2502.12361.
  24. Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. arXiv 2014, arXiv:1406.1078.
  25. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
  26. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30.
  27. Mhatre, S.; Dakhare, B.; Ankolekar, V.; Chogale, N.; Navghane, R.; Gotarne, P. Resume Screening and Ranking using Convolutional Neural Network. In Proceedings of the 2023 International Conference on Sustainable Computing and Smart Systems (ICSCSS), Coimbatore, India, 14–16 June 2023; pp. 412–419.
  28. Omara, E.; Mosa, M.; Ismail, N. Applying Recurrent Networks For Arabic Sentiment Analysis. Menoufia J. Electron. Eng. Res. 2022, 31, 21–28.
  29. Yang, X.; Song, Y.; Zhao, Y.; Zhang, Z.; Zhao, C. Unveil the potential of siamese framework for visual tracking. Neurocomputing 2022, 513, 204–214.
  30. Otman, M.; Rachid, E.A.; Mohamed, B. Amazigh Part of Speech Tagging using Gated recurrent units (GRU). In Proceedings of the 2021 7th International Conference on Optimization and Applications (ICOA), Wolfenbüttel, Germany, 19–20 May 2021.
  31. Chen, H.; Meng, L.; Xi, Y.; Xin, M.; Yu, S.; Chen, G.; Chen, Y. GRU Based Time Series Forecast of Oil Temperature in Power Transformer. Distrib. Gener. Altern. Energy J. 2023, 38, 393–412.
  32. Yu, S.; Liu, D.; Zhu, W.; Zhang, Y.; Zhao, S. Attention-based LSTM, GRU and CNN for short text classification. J. Intell. Fuzzy Syst. 2020, 39, 333–340.
  33. Wang, S.; Ge, H.; Li, W.; Liu, L.; Zhou, T.; Yang, S. Bidirectional Joint Attention Mechanism for Target Tracking Algorithm. In Proceedings of the 2022 4th International Conference on Natural Language Processing (ICNLP), Xi’an, China, 25–27 March 2022; pp. 256–265.
  34. Goel, A.; Katiyar, A.; Goel, A.K.; Kumar, A. LSTM Neural Networks for Brain Signals and Neuromorphic Chip. In Proceedings of the 2024 2nd International Conference on Advances in Computation, Communication and Information Technology (ICAICCIT), Faridabad, India, 28–29 November 2024; pp. 1033–1039.
  35. Zhao, X.; Liu, Y.; Han, G. Cooperative Use of Recurrent Neural Network and Siamese Region Proposal Network for Robust Visual Tracking. IEEE Access 2021, 9, 57704–57715.
  36. Yang, L.; Wang, Y.; Fu, S.; Liu, L.; Luo, Y. EVS2vec: A Low-dimensional Embedding Method for Encrypted Video Stream Analysis. In Proceedings of the 2023 20th Annual IEEE International Conference on Sensing, Communication, and Networking (SECON), Madrid, Spain, 11–14 September 2023; pp. 537–545.
  37. Wei, Y.; Wu, D.; Terpenny, J. Bearing remaining useful life prediction using self-adaptive graph convolutional networks with self-attention mechanism. Mech. Syst. Signal Process. 2023, 188, 110010.
  38. Tallec, C.; Ollivier, Y. Can recurrent neural networks warp time? arXiv 2018, arXiv:1804.11188.
  39. Tang, Z.; Shi, Y.; Wang, D.; Feng, Y.; Zhang, S. Memory visualization for gated recurrent neural networks in speech recognition. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 2736–2740.
  40. Fide, M.; Anarım, E. User Authentication with GRU based Siamese Networks using Keyboard Usage Behaviour. In Proceedings of the 2024 32nd Signal Processing and Communications Applications Conference (SIU), Mersin, Türkiye, 15–18 May 2024.
  41. Wang, H.; Guo, F. Online object tracking based interactive attention. Comput. Vis. Image Underst. 2023, 236, 103809.
Figure 1. Workflow of the proposed job–candidate matching system.
Figure 5. GRU—average training loss vs. epoch.
Figure 6. LSTM—average training loss vs. epoch.
Figure 7. Transformer—average training loss vs. epoch.
Figure 8. GRU—validation loss vs. epoch.
Figure 9. LSTM—validation loss vs. epoch.
Figure 10. Transformer—validation loss vs. epoch.
Figure 11. GRU—validation accuracy vs. epoch.
Figure 12. LSTM—validation accuracy vs. epoch.
Figure 13. Transformer—validation accuracy vs. epoch.
Table 1. Common data structure for both train + valid and test datasets. Each row represents a single job–candidate pair, with labels indicating recruitment outcomes.
Column Name | Data Type | Column Description
index | int64 | Sequential number identifying the record in the dataset.
job_submission_id | int64 | Unique identifier of a candidate’s submission for a given job posting.
job_id | int64 | Unique identifier of the job posting.
job_title | string | The descriptive title or name of the job position (e.g., “Senior Java Developer”).
job_description | string | Full description of the job, covering responsibilities, requirements, and other details.
candidate_id | int64 | Unique identifier of the candidate (applicant).
candidate_resume | string | Textual content of the candidate’s résumé/CV (potentially multilingual).
recruitment_path | string | A text-based representation of the candidate’s progress through different recruiting stages.
recruitment_path_len | int64 | Number of completed stages in the candidate’s recruitment path.
recruitment_path_status | string | The candidate’s final or current stage status (e.g., passed technical interview, awaiting HR interview, etc.).
label | string | The final recruitment outcome category for the candidate (e.g., soft_positive, soft_negative, or positive).
Table 2. Consolidated statistics for the train + valid and test datasets. Alongside the total number of rows and columns, the distribution of the three label categories is shown.
Property | Train + Valid (Count) | Train + Valid (%) | Test (Count) | Test (%)
Number of Rows | 14,827 | – | 1000 | –
Number of Columns | 23 | – | 23 | –
Label: soft_positive | 6630 | 44.72% | 432 | 43.20%
Label: soft_negative | 5556 | 37.47% | 381 | 38.10%
Label: positive | 2641 | 17.81% | 187 | 18.70%
Total | 14,827 | 100.00% | 1000 | 100.00%
Table 3. Language statistics for the train + valid and test sets, indicating the proportion of English and Polish texts in candidate_resume or job_description.
Language | Train + Valid (Record Count) | Train + Valid (Percent) | Test (Record Count) | Test (Percent)
en | 13,019 | 87.81% | 870 | 87.00%
pl | 1808 | 12.19% | 130 | 13.00%
Total | 14,827 | 100.00% | 1000 | 100.00%
Table 4. Training hyperparameters, evaluation settings, data handling, and preprocessing details.
Parameter Category | Value
Training Hyperparameters
   Batch Size | 1024
   Number of Epochs | 1000
   Learning Rate | 0.001
   Triplet Loss Margin (α) | 0.4
   Gradual Triplet Scaler | 4.0
Evaluation and Logging
   Top-K Accuracy Metrics (K values) | 1, 3, 5, 10, 20, 30, 50, 75, 100
   t-SNE Visualization Sample Size | 1000
   Log Frequency (steps) | Dynamic (e.g., every 1/10 epoch)
   K Values for Plotting Accuracy | 1, 20, 50, 75, 100
   Main Model Save File Name | head_model.pt
   Best Model Save File Name | best_model.pt
Data Handling and Preprocessing
   Validation Set Ratio | 0.1
   Embedding Chunk Lengths | 1024
   Max Segments per Document | 15
Table 5. Computational setup and main hyperparameters for model training.
Parameter | GRU | LSTM | Transformer
Total Run Duration (seconds) | 3048.03 | 2903.50 | 3017.77
Avg. Epoch Duration (seconds) | 1.90 | 1.73 | 1.83
Device Used | cuda | cuda | cuda
Num GPU Used | 2 | 2 | 2
GPU Name | NVIDIA GH200 144G HBM3e | NVIDIA GH200 144G HBM3e | NVIDIA GH200 144G HBM3e
GPU CUDA Version | 12.6 | 12.6 | 12.6
Num Dataloader Workers (CPU) | 144 | 144 | 144
Table 6. Top-K Accuracy comparison for GRU, LSTM, and Transformer models across train, validation, and test sets.
Metric | GRU | LSTM | Transformer
Train Top-1 | 1.45 | 0.94 | 1.15
Train Top-3 | 3.98 | 2.86 | 3.31
Train Top-5 | 6.09 | 4.35 | 5.19
Train Top-10 | 9.98 | 7.49 | 8.52
Train Top-20 | 15.93 | 12.19 | 13.67
Train Top-30 | 20.72 | 15.83 | 17.85
Train Top-50 | 28.04 | 22.51 | 24.47
Train Top-75 | 35.03 | 28.78 | 31.07
Train Top-100 | 40.52 | 34.43 | 36.11
Valid Top-1 | 5.46 | 4.72 | 4.79
Valid Top-3 | 12.34 | 11.13 | 11.53
Valid Top-5 | 17.94 | 15.85 | 17.19
Valid Top-10 | 29.53 | 24.68 | 26.10
Valid Top-20 | 42.35 | 37.90 | 38.17
Valid Top-30 | 49.36 | 48.01 | 47.47
Valid Top-50 | 61.83 | 59.81 | 60.28
Valid Top-75 | 72.69 | 69.93 | 70.67
Valid Top-100 | 79.30 | 78.08 | 77.68
Test Top-1 | 5.00 | 5.40 | 4.50
Test Top-3 | 12.80 | 13.20 | 11.60
Test Top-5 | 18.10 | 19.30 | 18.00
Test Top-10 | 30.90 | 29.50 | 29.90
Test Top-20 | 44.70 | 43.50 | 43.40
Test Top-30 | 55.20 | 52.50 | 54.60
Test Top-50 | 69.20 | 66.40 | 69.20
Test Top-75 | 80.00 | 78.00 | 80.50
Test Top-100 | 86.70 | 84.80 | 87.20
Table 7. Performance metrics (loss, MRR, average similarities) for GRU, LSTM, and Transformer models across train, validation, and test sets.
Metric | GRU | LSTM | Transformer
Train Loss | 0.01433 | 0.01610 | 0.01423
Train MRR | 0.9918 | 0.9905 | 0.99115
Train Avg. Pos. Sim. | 0.68324 | 0.69887 | 0.75792
Train Avg. Neg. Sim. | 0.02040 | 0.03455 | 0.01873
Validation Loss | 0.03157 | 0.03679 | 0.03253
Validation MRR | 0.98314 | 0.97741 | 0.97876
Valid Avg. Pos. Sim. | 0.61753 | 0.65069 | 0.69815
Valid Avg. Neg. Sim. | 0.00836 | 0.03975 | 0.02107
Test Loss | 0.03586 | 0.03703 | 0.03140
Test MRR | 0.97800 | 0.97750 | 0.97900
Test Avg. Pos. Sim. | 0.60579 | 0.63980 | 0.69249
Test Avg. Neg. Sim. | 0.00540 | 0.04778 | 0.02460
Table 8. Model architecture and embedding configuration parameters.
Parameter Category | Value
Sentence Transformer
   Base Model | distiluse-base-multilingual-cased-v2
   Input Embedding Size | 512
Head Model (GRU/LSTM/Transformer)
   Hidden Size | 256
   Number of Layers | 1
   Output Embedding Size | 256
   Output Aggregation | attention