On the Task of Job Posting Deduplication Using Embedding-Based Filtering and LLM Validation

Thivaios, Giannis; Zervas, Panagiotis; Giotopoulos, Konstantinos; Tzimas, Giannis

doi:10.3390/info17030233

¹ Data and Media Laboratory, Department of Electrical and Computer Engineering, University of Peloponnese, 26334 Patras, Greece

² Department of Management Science and Technology, University of Patras, 26334 Patras, Greece

* Author to whom correspondence should be addressed.

Information 2026, 17(3), 233; https://doi.org/10.3390/info17030233

Submission received: 12 December 2025 / Revised: 22 February 2026 / Accepted: 23 February 2026 / Published: 1 March 2026

(This article belongs to the Section Artificial Intelligence)

Abstract

This paper addresses the challenge of deduplicating job postings in large, heterogeneous datasets by introducing an efficient, multi-stage methodology that combines embedding-based filtering with Large Language Model (LLM) validation. The proposed system begins with data preprocessing and semantic vectorization of key textual fields using a text embedding model. To reduce the computational cost of exhaustive pairwise comparisons, a clustering-based grouping mechanism is employed to restrict comparisons to semantically coherent clusters. Candidate duplicates are then validated using LLMs, which assess semantic equivalence across highlighted differences in job titles, descriptions, companies, and locations. The proposed system is evaluated on an augmented dataset of 50,000 job postings, producing 6669 candidate pairs for validation. Among the evaluated models, GPT-4o achieved the highest F1-score (95.10%), while the lightweight Phi-4 model demonstrated competitive performance (92.58%) with significantly lower computational cost. These findings demonstrate that the proposed hybrid framework achieves high semantic precision while remaining scalable for continuous large-scale deployment.

Keywords: job posting deduplication; near-duplicate detection; large language models; text embeddings; semantic similarity; hybrid NLP pipeline; synthetic data generation; labor market intelligence

1. Introduction

The digitization of recruitment has created new opportunities for automated analysis of job-market data. Artificial intelligence (AI) is increasingly used in recruitment to improve efficiency, personalization, and data-driven decision-making [1]. Despite the availability of advanced tools and methods for data collection, processing this information remains a significant challenge. Especially when merging data from multiple sources, cleaning and archiving the vast amount of captured records is crucial [2]. A key issue is the prevalence of duplicate job postings, which arises as recruiters often publish vacancies across multiple platforms, and platform providers scrape job postings from one another to expand their market coverage [3]. Although many aspects of the recruitment process can already be automated effectively, identifying duplicates within unstructured text remains a challenging problem [4]. This difficulty arises in part from platform and company-specific constraints, which often lead to posts that are similar but not identical, whether they represent different advertisements for the same project or different projects from the same company [5].

Duplicates significantly affect data integrity and labor-market analysis. These duplicates introduce biases into the data analysis, resulting in misleading conclusions about employment trends and the demand for specific skills. In addition, they impose unnecessary computational and storage burdens, increasing the effort and resources required to process and manage recruitment data effectively [6].

In order to tackle these challenges, it is essential to implement efficient deduplication methods. Deduplication reduces noise in datasets, enabling more accurate labor-market insights and better decision-making. Furthermore, deduplication is crucial to optimize computational and storage resources [7]. By removing duplicates, job portals can significantly lower the costs associated with data processing and storage, improving operational efficiency and scalability.

This study introduces a novel two-stage methodology for detecting duplicate job postings. First, we employ a word embedding-based similarity measurement technique to identify near-duplicates using predefined criteria. Second, we implement an LLM-powered validation step to verify detected duplicates and reduce false positives. Additionally, we perform a comparative analysis of open-source and commercial LLMs to evaluate potential performance disparities in deduplication tasks. The main contributions of this study can be summarized as follows.

1. A scalable hybrid deduplication framework that integrates lightweight embedding-based filtering with LLM-based semantic validation to balance computational efficiency and semantic precision.
2. A clustering-based grouping mechanism that significantly reduces computational complexity by restricting comparisons to temporally and semantically coherent candidate sets, thereby improving scalability in continuous ingestion environments.
3. The construction of a large augmented evaluation dataset consisting of 50,000 job postings, combining real-world and synthetically generated records to simulate realistic duplication scenarios and linguistic variability.
4. A comprehensive comparative evaluation of open-source and commercial LLMs, demonstrating that lightweight locally deployable models can achieve performance close to state-of-the-art commercial systems while substantially reducing operational cost.
5. An operationally feasible continuous deduplication pipeline, validated under realistic daily ingestion settings, confirming its suitability for large-scale deployment.

The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 describes the proposed methodology. Section 4 presents the experimental results. Section 5 discusses the findings and practical implications and Section 6 outlines the limitations of the study.

2. Related Work

Over the past decade, several studies have analyzed job postings using classical NLP techniques. Burk et al. [8] address duplicate detection in online recruitment using n-grams to identify common words and Jaccard similarity to quantify the overlap between them.

This study served as a baseline for many later approaches. Among them, Zhao et al. [3] proposed a comprehensive framework for duplicate detection in online job postings, testing 24 different approaches built from various tokenization, vectorization, and similarity measurement techniques, and demonstrating that overlap-based methods combined with TF-IDF outperformed baseline approaches.

These approaches are limited because many job ads contain similar text blocks that may differ in meaning. Other research has dealt with the semantic analysis of texts using text embeddings. Gao et al. [9] use Word2Vec embeddings [10] to detect duplicates of short text. This approach has potential, as short texts in job advertisements, such as company name, job title, or location, are well-suited for such methods. However, since the job description serves as the primary source of information, a different and more comprehensive approach is required.

Similarly, Engelbach et al. [6] combined text embeddings, domain knowledge, and keyword matching to improve duplicate detection accuracy, emphasizing that no single method alone suffices, but rather a hybrid approach enhances deduplication effectiveness.

Recent work has also explored deep learning architectures to capture semantic similarities beyond traditional string-matching methods. Notably, Shi et al. [11] proposed PDDM-AL, a pre-trained Transformer-based deduplication model enhanced with active learning. Their approach treats deduplication as a classification task, utilizing Transformer embeddings for semantic understanding and iteratively selecting the most uncertain samples for expert labeling. In this way, they reduce the manual labeling effort and achieve strong performance across multiple structured datasets. Their model emphasizes structured data and domain-aware feature tagging.

Our approach extends deduplication to the unstructured textual content of job postings, measuring semantic similarity between texts with word embeddings derived from LLMs. This enables us to evaluate whether the semantic similarity between two job ads is sufficient to classify them as duplicates.

3. Methodology

3.1. Overview

This study proposes a hybrid framework for detecting and validating duplicate job postings across heterogeneous online sources. The process begins with the construction of two complementary datasets. First, a synthetic dataset of job postings is generated using prompt-engineered queries with the OpenAI API [12], producing controlled variations in titles, descriptions, company names, and locations. In parallel, a real-world dataset is collected through web scraping from multiple job portals, capturing authentic labor-market variability across platforms and time periods.

Synthetic and real datasets undergo a unified data-cleaning and normalization pipeline. During this stage, textual fields are standardized, irrelevant content (HTML tags, URLs) is removed, and posts are filtered based on scraping metadata [13]. The cleaned datasets are then merged into an augmented dataset that represents a wide spectrum of possible duplication scenarios, ranging from exact matches to nuanced paraphrased variants.

Following data preparation, key textual fields—such as job titles, descriptions, locations, and company identifiers—are encoded into semantic vector representations using embedding models. Similarity scoring is applied across fields using predefined, field-specific thresholds to identify candidate near-duplicate pairs. Subsequently, these candidate pairs are analyzed using an automated difference-visualization step, where contrasting text segments are highlighted in HTML format to support interpretability and facilitate rapid inspection.

To ensure robust validation beyond surface-level similarity, the framework incorporates a final assessment stage powered by LLMs. Open-source and commercial LLMs evaluate each candidate pair holistically, judging whether the postings refer to the same underlying job opportunity despite stylistic or structural differences. This combined approach enables a rigorous and scalable methodology for detecting semantically equivalent job postings across diverse platforms.

The complete workflow of this hybrid framework is summarized in Figure 1, which outlines the stages from dataset construction to LLM-based validation.

3.2. Synthetic Job Postings Generation Using GPT-5

Our aim was to expand the experimental dataset and introduce controlled variation by generating synthetic job postings that emulate real-world heterogeneity. The corpus was created using advanced augmentation methods and template-driven prompts executed via GPT-5 [14], which was chosen for its ability to generate coherent and occupation-specific content. These templates were tailored to cover a wide range of occupational categories relevant to contemporary labor-market analysis. To ensure broad content diversity, we leveraged structured occupational resources such as ESCO [15] and O*NET [16], using them to construct dictionaries that contain alternative job titles, representative skills, typical responsibilities, and related terminology for each occupation. This enabled the creation of prompts that produce diverse outputs from GPT-5, capturing the essential characteristics of each job category while allowing nuanced variation.

To enrich lexical variation even further, we integrated resources such as WordNet [17], enabling controlled substitution of terms with appropriate synonyms while preserving the semantic core of each posting. During generation, GPT-5 occasionally declined to produce postings that contradicted occupational norms—for example, when prompted to create a senior-level role with unrealistically low qualifications—thus reinforcing the internal consistency of the dataset.

However, generating data that are semantically correlated with the real ones and exhibit sufficient intra-dataset diversity remains challenging [18]. Accordingly, each synthetic record consists of two to three short paragraphs forming a realistic job description, accompanied by structured metadata fields (portal, scraped_date, company, location and title). The prompt explicitly targets a controlled mixture of (a) unique postings, (b) exact duplicates (same content with different identifiers/dates), and (c) near-duplicates that reflect common real-world variation patterns such as seniority differences, employment type changes, minor title rewordings, reposting with minor updates (e.g., benefits or requirements), and multi-location postings by the same company (see Appendix A.2). An illustrative example of a synthetically generated near-duplicate pair is presented in Table 1.

To ensure structural consistency and data integrity, generation was constrained to produce a strictly formatted JSON array with predefined metadata fields. Each LLM response was programmatically parsed and validated for strict schema compliance. Validation checks included: (i) verification of the presence of all mandatory fields, (ii) enforcement of identifier format constraints, (iii) validation and normalization of date formats, and (iv) restriction of portal values to a predefined closed set. Records failing these checks were either automatically corrected where feasible or replaced with fallback instances to maintain dataset integrity.
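The four validation checks can be sketched as follows. This is a minimal illustration rather than the authors' code: the field name `id`, the identifier pattern, the accepted date formats, and the portal names are all assumptions made for the example, since the text does not specify them.

```python
import re
from datetime import datetime

# Assumed values for illustration only; the paper does not list them.
REQUIRED_FIELDS = ("id", "portal", "scraped_date", "company",
                   "location", "title", "description")
ALLOWED_PORTALS = {"portal_a", "portal_b", "portal_c"}   # hypothetical closed set
ID_PATTERN = re.compile(r"^job-\d+$")                    # hypothetical ID format
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")                  # hypothetical input formats


def normalize_date(value: str):
    """Return the date as ISO 8601 (YYYY-MM-DD), or None if it cannot be parsed."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def validate_record(record: dict) -> list[str]:
    """Return a list of schema problems; an empty list means the record passes."""
    problems = []
    # (i) every mandatory field must be present and non-empty
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or not str(value).strip():
            problems.append(f"missing:{field}")
    # (ii) identifier format constraint
    if not ID_PATTERN.match(str(record.get("id", ""))):
        problems.append("bad_id")
    # (iii) the date must be parseable so it can be normalized
    if normalize_date(str(record.get("scraped_date", ""))) is None:
        problems.append("bad_date")
    # (iv) portal restricted to the closed set
    if record.get("portal") not in ALLOWED_PORTALS:
        problems.append("bad_portal")
    return problems
```

A record with a non-empty problem list would then be routed to the correction or fallback step described above.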

Following generation, we computed descriptive summary statistics, including the number of unique companies, distinct job titles, and the observed date range, in order to verify dataset diversity. We additionally performed explicit detection of exact duplicate groups using full-field matching across company, title, location, and description, confirming that duplicates were present in the proportions intended by the prompt design. Finally, we manually inspected a stratified random sample of postings from each category (unique, exact duplicate, and near-duplicate) to ensure occupational plausibility and verify that the induced variations aligned with the intended real-world duplication scenarios [19].
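The exact-duplicate check based on full-field matching can be illustrated with a short sketch. The lowercasing and whitespace collapsing applied before comparison are assumptions of this example, as is the `id` key used to report group members.

```python
from collections import defaultdict

MATCH_FIELDS = ("company", "title", "location", "description")


def exact_duplicate_groups(postings: list[dict]) -> list[list[str]]:
    """Group posting ids whose four matching fields are identical.

    Fields are lowercased and whitespace-collapsed before comparison
    (an assumption of this sketch, not a detail given in the text).
    """
    groups = defaultdict(list)
    for posting in postings:
        key = tuple(" ".join(posting[f].lower().split()) for f in MATCH_FIELDS)
        groups[key].append(posting["id"])
    # Only keys shared by two or more postings form a duplicate group.
    return [ids for ids in groups.values() if len(ids) > 1]
```

Comparing the number and size of the returned groups with the proportions requested in the prompt gives a quick sanity check on the generated corpus.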

3.3. Detect Near Duplicate Job Postings

A core component of the proposed framework is the detection of near-duplicate job postings prior to LLM-based validation. This stage aims to efficiently identify pairs or groups of postings that are likely to describe the same underlying job vacancy, despite differences in wording, structure, or formatting. Duplicate detection is essential for improving dataset quality, reducing redundancy, and ensuring that downstream analyses—such as labor-market statistics, job-matching models, or trend monitoring—are not distorted by repeated entries. To accomplish this, we implement a multi-step process consisting of:

(1) Defining a date-based sliding window;
(2) Computing semantic similarity using embedding models;
(3) Grouping related postings through an efficient clustering mechanism that minimizes unnecessary comparisons.

Together, these steps produce a refined set of candidate duplicates that are subsequently validated using LLM reasoning.

The overall workflow of this near-duplicate detection process is illustrated in Figure 2, which visualizes the sequential stages from candidate selection to LLM-based validation.

3.3.1. Time Window

The first criterion used to constrain candidate comparisons is the temporal proximity between job postings. Prior work frequently adopts broad windows—such as 60 days—to account for legitimate reposting scenarios, particularly when vacancies remain unfilled over long periods [3]. However, such wide windows introduce excessive noise, as many postings are re-issued at high frequency by automated systems, often daily or weekly, to maintain visibility on job portals.

Through exploratory analysis of job posting datasets and discussions with domain experts, we observed that a large portion of these near-identical ads, often generated by bots or automated systems, appear on a daily or weekly basis. These are likely meant to keep the listing visible at the top of the job boards. To reduce the impact of such spam-like activity, we adopt a more conservative time window of two weeks (14 days) when determining whether two job advertisements should be considered duplicates.
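The temporal criterion amounts to a simple date comparison. The sketch below assumes the window is inclusive of the 14th day and that each posting carries a scraped date; both are assumptions, since the text does not state how the boundary is handled.

```python
from datetime import date

WINDOW_DAYS = 14  # two-week window described in the text


def within_window(first: date, second: date, window_days: int = WINDOW_DAYS) -> bool:
    """True if the two scraping dates are at most `window_days` apart (inclusive)."""
    return abs((first - second).days) <= window_days
```

Pairs that fail this check are never passed to the similarity stage, which is what keeps the number of embedding comparisons bounded.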

3.3.2. Detecting Duplicates Using Embeddings

Once date filtering is applied, candidate postings are evaluated using semantic similarity across multiple textual fields [20,21]. The fields considered include:

Job title;
Company name;
Location;
Job description.

To accommodate the differing semantic characteristics of each field, we employ distinct similarity thresholds based on their matching requirements. For high-precision fields such as location and company, we adopt a strict threshold of 0.9 to ensure accurate named-entity matching with minimal tolerance for variation, consistent with prior work emphasizing precision in entity resolution tasks [22].

Job titles receive a moderately lower threshold of 0.8 to account for minor phrasing differences while maintaining semantic equivalence (e.g., “Software Developer” vs. “Backend Engineer”), in line with findings that acknowledge variability in occupational title expressions [23]. For job descriptions, we apply a more lenient threshold (0.7), as descriptions frequently differ in structure, verbosity, and phrasing despite referring to the same role. Similar tolerance levels have been supported in text similarity and duplicate detection research involving unstructured data [6].

More broadly, similarity thresholds in embedding-based semantic alignment tasks are known to be sensitive to data distribution and task context. Prior research in paraphrase detection and semantic similarity modeling has shown that performance can vary substantially depending on threshold calibration, underscoring the need for task-specific tuning rather than universal cutoffs [24].

Guided by this understanding, we conducted a systematic parameter sweep across similarity values ranging from 0.60 to 0.95 (step 0.05) using an expert-annotated validation subset. Thresholds were selected based on maximizing recall while controlling candidate growth and maintaining acceptable precision within each field. The final tiered configuration (0.9 for company/location, 0.8 for titles, 0.7 for descriptions) therefore reflects empirical calibration aligned with the semantic distinctiveness and variability of each field, rather than arbitrary selection.
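A threshold sweep of this kind can be sketched as below. The input format (a similarity score paired with an expert label) and the reported columns are assumptions for illustration; the selection rule itself, which weighs recall against candidate growth and precision, is left to the analyst as in the text.

```python
def sweep_thresholds(scored_pairs, start=0.60, stop=0.95, step=0.05):
    """Evaluate each threshold on (similarity, is_duplicate) pairs.

    Returns rows of (threshold, n_candidates, precision, recall).
    """
    total_positive = sum(1 for _, label in scored_pairs if label)
    n_steps = round((stop - start) / step)
    rows = []
    for i in range(n_steps + 1):
        threshold = round(start + i * step, 2)  # avoid floating-point drift
        flagged = [label for score, label in scored_pairs if score >= threshold]
        true_positive = sum(1 for label in flagged if label)
        precision = true_positive / len(flagged) if flagged else 0.0
        recall = true_positive / total_positive if total_positive else 0.0
        rows.append((threshold, len(flagged), precision, recall))
    return rows
```

Running the sweep once per field on the annotated subset yields one table per field, from which a tiered configuration such as the one reported here can be read off.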

In our experiment, we use WordLlama (0.3.9), a fast and lightweight NLP toolkit designed for tasks such as fuzzy deduplication, similarity calculation, classification, and ranking. We use it specifically for semantic similarity measurement. WordLlama delivers strong performance on CPU hardware with minimal inference-time dependencies. Moreover, it outperforms word models such as GloVe 300d on MTEB benchmarks while maintaining a significantly smaller size of 16 MB for its default 256-dimensional model. The tool recycles components from large language models to create compact and efficient word representations: it extracts token embedding codebooks from several models, including Llama 3 70B and Phi-3 Medium, and trains a small context-less model within a general-purpose embedding framework [25].

The core deduplication component generates dense vector embeddings for key textual fields such as job titles, locations, company names and descriptions in order to capture their semantic meaning and contextual nuances. These embeddings enable precise comparison between job postings through cosine similarity scoring, which quantifies alignment in vector space.

As shown in Figure 2, if the criteria are met, the output is a list of candidate duplicate pairs, which are then passed to an LLM for further validation, ensuring a robust and accurate deduplication process.
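The field-level comparison can be sketched with plain cosine similarity over precomputed vectors. The sketch assumes that a pair becomes a candidate only when every field meets its threshold, which is one reading of "if the criteria are met"; the embeddings are passed in as ordinary lists, so the embedding model itself is outside the example.

```python
import math

# Tiered thresholds reported in the text.
THRESHOLDS = {"company": 0.9, "location": 0.9, "title": 0.8, "description": 0.7}


def cosine(u, v) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def is_candidate_pair(emb_a: dict, emb_b: dict, thresholds=THRESHOLDS) -> bool:
    """True if every field's similarity meets its threshold.

    `emb_a` and `emb_b` map field names to embedding vectors.
    Requiring all fields to pass is an assumption of this sketch.
    """
    return all(
        cosine(emb_a[field], emb_b[field]) >= minimum
        for field, minimum in thresholds.items()
    )
```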

3.3.3. Clustering for Efficient Grouping

To further reduce the computational burden of duplicate detection, we employ a greedy, date-ordered clustering strategy that groups postings into compact sets of potential duplicates. All job postings are first sorted chronologically, and the algorithm iterates through them sequentially. For each posting that has not yet been assigned to a cluster, a new cluster is initialized and the posting becomes its representative. The system then compares this representative only with later postings, adding them to the cluster if they fall within the 14-day sliding window and satisfy the field-specific similarity thresholds. Once a posting is assigned to a cluster, it is excluded from all subsequent comparisons, ensuring that clusters remain non-overlapping and processing is not repeated unnecessarily.
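A minimal sketch of this greedy, date-ordered clustering is given below. The posting structure (`id`, `date`) and the `is_similar` callback, which stands in for the field-specific threshold test, are hypothetical names introduced for the example.

```python
def greedy_clusters(postings, is_similar, window_days=14):
    """Group postings into non-overlapping clusters of potential duplicates.

    postings   : list of dicts with at least 'id' and 'date' (datetime.date)
    is_similar : callable(representative, other) -> bool, a stand-in for
                 the field-specific similarity test
    """
    ordered = sorted(postings, key=lambda p: p["date"])
    assigned = set()
    clusters = []
    for i, representative in enumerate(ordered):
        if representative["id"] in assigned:
            continue  # already belongs to an earlier cluster
        cluster = [representative["id"]]
        assigned.add(representative["id"])
        # Compare the representative only with later postings.
        for other in ordered[i + 1:]:
            if (other["date"] - representative["date"]).days > window_days:
                break  # sorted by date, so everything after is out of the window too
            if other["id"] not in assigned and is_similar(representative, other):
                cluster.append(other["id"])
                assigned.add(other["id"])
        clusters.append(cluster)
    return clusters
```

Because members are compared only with the cluster representative, the result depends on processing order; the sketch follows the chronological order described above.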

This design avoids the exhaustive pairwise comparisons typical of naive approaches. By restricting comparisons to a small set of temporally adjacent postings, the algorithm reduces the complexity of duplicate detection from $O(n^{2})$ to approximately $O(nk)$, where $k$ is the average number of postings within the date window. Since $k \ll n$ in real-world datasets, the clustering step substantially improves computational efficiency while maintaining high recall for near-duplicate detection.
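As a rough illustration, the figures reported in the experimental setup (about 500 new postings per day, each compared against a window of about 7500 postings, in a corpus of 50,000) give the following orders of magnitude. The calculation assumes every new posting is compared against the whole window, which is an upper bound because postings already assigned to a cluster are skipped:

$500 \times 7500 = 3.75 \times 10^{6}$ comparisons per daily batch

$\binom{50{,}000}{2} = \frac{50{,}000 \times 49{,}999}{2} \approx 1.25 \times 10^{9}$ comparisons for exhaustive pairwise matching over the full corpus

Because the window length fixes $k$ while $n$ keeps growing under continuous ingestion, the gap between the two costs widens as the corpus accumulates.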

3.4. Highlight Differences Using HTML

For each duplicate pair, we retrieve the corresponding job postings and highlight differences in key fields such as job title, location, company, and description. To accomplish this, we use a function that compares two text inputs and marks the differences between them using HTML formatting. It splits the input texts into individual words and then uses difflib.SequenceMatcher from the Python 3.11.9 standard library to identify differences at the word level. The function iterates through the differences, wrapping non-matching words in HTML <span> tags styled as bold black text so that discrepancies between the two texts are easy to identify.

This practice is widely used in the literature on document similarity. Visualization tools and libraries that support the highlighting of text differences [26] are important for managing redundant content, improving information retrieval, and supporting version control. Document similarity in HTML can be effectively measured using sentence-based [27], feature-based [28], and semantic approaches [29].

In some cases the identification of duplicates is straightforward, while others are more ambiguous, making it difficult to determine whether two postings are truly duplicates. By combining highlighted text differences in an HTML display with semantic analysis conducted by an LLM using carefully crafted prompts, we can effectively resolve such cases. The following examples illustrate these scenarios.

Figure 3a–d illustrate key scenarios in duplicate detection. In Figure 3a, two job postings share the same company and location. Despite minor textual differences in the title and description of the job (e.g., ‘crew’ versus ‘staff’), the semantic equivalence of these terms suggests that the postings are duplicates. Figure 3b highlights a different case: while the job title, description, and company are nearly identical, a difference in location, explicitly mentioned in the job description, indicates separate opportunities. Figure 3c shows a similar case in which the postings differ in employment type, although the rest of the text is almost identical. Finally, Figure 3d showcases a more nuanced distinction: the postings share the same company and location but differ in role specificity (e.g., ‘Computer Technician’ versus ‘Mobile Technician’). This divergence in both the title and the description signals distinct positions, underscoring the importance of contextual analysis in deduplication.

LLM-based validation helps determine whether small differences represent distinct job opportunities or minor variations.
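The word-level highlighting described in this section can be sketched with `difflib.SequenceMatcher`. The function name, the inline CSS, the HTML escaping, and the `autojunk=False` setting are choices made for this example and are not taken from the authors' implementation.

```python
from difflib import SequenceMatcher
from html import escape

SPAN = '<span style="font-weight:bold;color:black">{}</span>'


def highlight_differences(text_a: str, text_b: str) -> tuple[str, str]:
    """Return both texts as HTML with non-matching words wrapped in <span> tags."""
    words_a, words_b = text_a.split(), text_b.split()
    # autojunk=False disables the popularity heuristic for long sequences,
    # so frequent words in long descriptions are still aligned.
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
    out_a, out_b = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        seg_a = [escape(w) for w in words_a[i1:i2]]
        seg_b = [escape(w) for w in words_b[j1:j2]]
        if tag == "equal":
            out_a.extend(seg_a)
            out_b.extend(seg_b)
        else:
            # 'replace', 'delete' and 'insert' all mark differing words
            out_a.extend(SPAN.format(w) for w in seg_a)
            out_b.extend(SPAN.format(w) for w in seg_b)
    return " ".join(out_a), " ".join(out_b)
```

For the pair "Kitchen crew member wanted" and "Kitchen staff member wanted", only "crew" and "staff" are wrapped, which mirrors the kind of contrast shown in Figure 3a.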

3.5. Evaluate Duplicate Pairs Using LLM

Evaluating duplicate pairs using LLMs requires balancing accuracy, efficiency, and practical deployment constraints. Both open-source and commercial LLMs offer powerful capabilities for text understanding and comparison, yet they differ fundamentally in terms of accessibility, transparency, cost, and operational control. In duplicate detection, the goal is semantic similarity rather than complex reasoning. Smaller models can therefore provide efficient and cost-effective solutions, especially at scale. These lightweight models are faster, easier to host locally, and require far fewer computational resources, making them well-suited for real-world validation pipelines. In this section, we examine the characteristics of open-source and commercial LLMs and discuss how their architectural and licensing differences influence their suitability for duplicate-pair evaluation.

3.5.1. Open-Source vs. Commercial LLMs

In this study, we evaluate both open-source and commercial LLMs to support the task of duplicate-pair validation. Open-source models, such as Llama 3.1–8B [30], Mistral–7b [31], and the lightweight Phi-4 [32] model, provide full access to their architecture, training data, and fine-tuning procedures. This transparency enables researchers to inspect model behavior, adapt systems using domain-specific data, and deploy models on local infrastructure, which enhances data privacy and eliminates the recurring costs associated with commercial APIs. Small open-source models, in particular, are advantageous for operational workloads such as duplicate detection because they offer competitive performance at significantly lower computational and financial cost. However, these benefits often come with limitations: open-source models may lag behind commercial LLMs in reasoning capabilities, generalization, and up-to-date knowledge, and fine-tuning them can require substantial specialized resources.

In contrast, commercial LLMs such as GPT-4 [33] provide state-of-the-art performance in language understanding, multilingual processing, and complex reasoning tasks. Their API-based nature allows seamless integration without the need for local deployment or hardware management, while continuous updates ensure consistently high-quality outputs. These advantages make commercial models suitable for scenarios requiring advanced capabilities, robustness, or broad generalization. However, the proprietary nature of these systems restricts access to model internals, limits opportunities for customization, and introduces scalability concerns due to increasing API costs. Such constraints can pose challenges for applications that require reproducibility, transparency, or budget-aware long-term deployment.

Finally, closed-source LLMs are typically justified by providers as a means of protecting proprietary knowledge, ensuring system security, and maintaining compliance with regulatory and contractual constraints [34]. In contrast, open-source LLMs promote algorithmic transparency and reproducibility by exposing model architectures and training methodologies, thereby fostering a collaborative research ecosystem that supports independent verification, systematic benchmarking, and rapid methodological innovation [35].

3.5.2. Evaluation Metrics

To evaluate the performance of our duplicate detection models, we employ standard classification metrics to measure the alignment between model predictions and the human-annotated ground truth [36,37]. These metrics provide a comprehensive assessment of classification quality in terms of both correctness and error characteristics.

Precision: The proportion of correctly identified duplicates among all pairs flagged by the model. High precision indicates minimal false positives, which is critical for ensuring that distinct job postings are not erroneously merged. Precision is defined as:

$\text{Precision} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}}$ (1)

Recall: The proportion of actual duplicates correctly detected by the model. High recall ensures comprehensive deduplication and reduces residual noise in the dataset. Recall is calculated as:

$\text{Recall} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Negatives (FN)}}$ (2)

Accuracy: The overall correctness of the model, reflecting the ratio of all true predictions (both duplicates and non-duplicates) to the total number of evaluated pairs:

$\text{Accuracy} = \frac{\text{TP} + \text{True Negatives (TN)}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}$ (3)

F1-score: The harmonic mean of precision and recall, balancing both metrics to provide a single measure of overall classification effectiveness, especially in settings where both false positives and false negatives are equally important:

$\text{F1-Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$ (4)
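Equations (1) to (4) translate directly into code. The function below is a small sketch that takes confusion-matrix counts; returning 0.0 when a denominator is zero is a convention chosen for the example.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute precision, recall, accuracy and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0          # Eq. (1)
    recall = tp / (tp + fn) if (tp + fn) else 0.0             # Eq. (2)
    accuracy = (tp + tn) / total if total else 0.0            # Eq. (3)
    f1 = (2 * precision * recall / (precision + recall)       # Eq. (4)
          if (precision + recall) else 0.0)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}
```

With hypothetical counts of 90 true positives, 10 false positives, 5 false negatives, and 95 true negatives, precision is 0.90, recall is about 0.947, accuracy is 0.925, and the F1-score is about 0.923.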

3.5.3. Experiment Setup and Data Collection

To evaluate the effectiveness of the proposed hybrid deduplication framework, we constructed an augmented dataset consisting of 50,000 job postings. Of these, 40,000 were real job postings collected from major Greek job portals during the period 1 August 2025 to 31 October 2025. Our web-scraping pipeline collects approximately 500 new postings per day. Each daily batch is systematically compared against a sliding historical window of approximately 7500 postings corresponding to the previous 15 days, in order to identify potential near-duplicate pairs.

In addition, 10,000 synthetic job postings (20% of the corpus) were generated using GPT-5 (as described in Section 3.2). The purpose of incorporating synthetic data was not to replace real data but to introduce controlled semantic variation and rare edge cases that may be underrepresented in naturally occurring datasets. The synthetic postings were generated using a predefined duplication distribution consisting of:

60% unique postings;
10% exact duplicates;
30% near-duplicates.

This controlled generation process produced approximately 2000 additional duplicate pairs, complementing the ~4500 duplicate pairs identified in the real-world data. Consequently, nearly 70% of the evaluated duplicate pairs originate from authentic job postings, ensuring that the evaluation remains strongly grounded in real-world data.

Following data preprocessing, the WordLlama-based near-duplicate retrieval module was applied to the full dataset. This step produced a total of 6669 candidate pairs, which were subsequently evaluated using the lightweight LLM-based validation stage described earlier. To assess ground-truth correctness and benchmark LLM performance, all retrieved pairs underwent structured human annotation.

A panel of three domain experts (recruitment specialists with over five years of experience in Greek job-market analytics) manually annotated the candidate pairs, labeling each as True Duplicate or Near Duplicate. Prior to annotation, the experts were trained on 50 prelabeled example pairs and achieved strong consensus (Fleiss’ κ = 0.82). Fleiss’ kappa is a statistical measure of inter-rater agreement for categorical judgments, correcting for chance agreement [38]. Each pair was independently reviewed by two annotators, with conflicts (10.3%) resolved by a third expert. Borderline cases were logged to refine annotation guidelines. A random re-evaluation of 10% of the pairs demonstrated 98% label consistency, confirming the stability of expert judgments.
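For readers unfamiliar with the statistic, the following sketch computes Fleiss' kappa from raw labels using the standard definition (mean observed agreement corrected by the agreement expected from the category proportions). The function name and the four toy items are made up for illustration and are unrelated to the study's annotation data.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of labels from the same number of raters."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for item in ratings for label in item})
    counts = [Counter(item) for item in ratings]

    # Observed agreement per item, then averaged over items.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters) / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall proportion of each category.
    proportions = [sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories]
    p_expected = sum(p ** 2 for p in proportions)

    if p_expected == 1.0:  # every rating falls in one category; kappa is undefined, report 1.0
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


# Toy example: three raters, four candidate pairs (hypothetical labels).
toy = [
    ["dup", "dup", "dup"],
    ["dup", "dup", "near"],
    ["near", "near", "near"],
    ["near", "near", "dup"],
]
kappa = fleiss_kappa(toy)
print(round(kappa, 3))
```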

All experiments were executed on a workstation equipped with an AMD Ryzen 5 5600G processor (Advanced Micro Devices, Santa Clara, CA, USA), 32 GB RAM, and an NVIDIA GeForce RTX 3060 GPU (NVIDIA Corporation, Santa Clara, CA, USA) using NVIDIA driver version 580.126.09 and CUDA 12.2 (NVIDIA Corporation, Santa Clara, CA, USA), running Ubuntu 24.04 LTS (Canonical Ltd., London, UK). Lightweight LLMs (Phi-4, Mistral 7B, Llama 3.1–8B) were executed locally using deterministic inference settings, while WordLlama embeddings were computed on the same GPU. These hardware specifications ensure reproducibility and provide a realistic benchmark for deployment in resource-constrained environments. The GPT-4o model was accessed via the Microsoft Azure OpenAI Service (Microsoft Corporation, Redmond, WA, USA), using cloud-based inference under the same deterministic decoding configuration for fair comparison.

3.5.4. Model Configuration for Deterministic Outputs

To ensure consistent and reliable outputs from the LLM during the validation phase, we used Pydantic (2.10.5) models in conjunction with the Instructor (1.7.2) library. Pydantic defines a strict structured schema for the LLM responses, ensuring that every output contains two predefined fields: a boolean isDuplicate flag and a string-based justification [39]. The Instructor library integrates Pydantic with the LLM, enabling automatic parsing and validation of the model responses. This approach ensured that the LLM outputs were structurally consistent and machine-parseable, reducing the need for manual error handling [40]. The overall validation process for a pair of job postings using the lightweight LLM is illustrated in Figure 4.
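The two-field contract can be illustrated with a small sketch. It deliberately uses only the Python standard library as a stand-in: it is not the Pydantic model or the Instructor integration used in the study, and the names `DedupVerdict` and `parse_verdict` are hypothetical. It only shows the kind of check a strict schema enforces on a raw model response.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class DedupVerdict:
    """Stand-in for the response schema: exactly two fields, as described in the text."""
    isDuplicate: bool
    justification: str


def parse_verdict(raw: str) -> DedupVerdict:
    """Parse a raw LLM response and reject anything that does not match the schema."""
    data = json.loads(raw)  # malformed JSON raises a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("response must be a JSON object")
    if set(data) != {"isDuplicate", "justification"}:
        raise ValueError(f"unexpected fields: {sorted(data)}")
    if not isinstance(data["isDuplicate"], bool):
        raise ValueError("isDuplicate must be a boolean")
    if not isinstance(data["justification"], str):
        raise ValueError("justification must be a string")
    return DedupVerdict(**data)


# A well-formed (made-up) response is accepted.
ok = parse_verdict('{"isDuplicate": true, "justification": "Same role, company and city."}')

# A free-text answer is rejected instead of being silently misread.
try:
    parse_verdict('"yes"')
    rejected = False
except ValueError:
    rejected = True
```

In a setup like the one described, a rejected response would typically trigger a retry rather than manual inspection, which is the error-handling burden the structured schema removes.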

To guide the LLM in analyzing job postings for deduplication, we designed a structured prompt that explicitly instructs the model to focus on the key fields (title, location, company, and description) and on the highlighted parts of the HTML content. The prompt emphasizes semantic equivalence over minor linguistic, grammatical, or formatting differences, ensuring that the model prioritizes meaningful distinctions. Specifically, the prompt directs the model to consider whether differences in the highlighted text affect the overall meaning or intent of the job postings. For example, synonyms, minor phrasing variations, or slight differences in company name formatting (e.g., abbreviations or additional words) are not considered meaningful if they refer to the same entity or concept. The justification field in the output schema is designed to capture the LLM’s reasoning process based on this prompt, providing a transparent, natural language explanation for each final duplicate or non-duplicate classification (see Appendix A.1).

We set the LLM’s temperature to 0.2 to minimize randomness, favoring high-probability tokens and near-deterministic responses. General-purpose models typically benefit from a lower temperature to remain focused on relevant content [41].
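The effect of a low temperature can be seen in a toy softmax computation. The logits below are invented for illustration and do not come from any of the evaluated models. The sketch only shows how dividing logits by a temperature of 0.2 concentrates probability on the top token compared with a temperature of 1.0.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after scaling by 1/temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
p_default = softmax_with_temperature(logits, 1.0)
p_low = softmax_with_temperature(logits, 0.2)

print([round(p, 3) for p in p_default])
print([round(p, 3) for p in p_low])
```

Sampling at 0.2 is therefore close to greedy decoding but not strictly deterministic, since the lower-ranked tokens keep a small nonzero probability.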

4. Results

After human expert annotation of the 6669 candidate pairs retrieved by the WordLlama-based near-duplicate module, we observed that 40.4% of the pairs were true duplicates. This finding highlights the intrinsic difficulty of near-duplicate detection and confirms that retrieval-based similarity alone is insufficient for reliable deduplication. Consequently, a semantic validation stage using LLMs is essential to accurately distinguish true duplicates from merely similar postings.

4.1. Model Duplication Rate Analysis

Table 2 reports the aggregate duplication rates produced by each model. The results reveal substantial variation in model behavior. The Mistral–7B model exhibits a very high duplication rate (75.8%), significantly overestimating the number of true duplicates compared to the human-annotated ground truth. This indicates a strong tendency toward over-deduplication, which may lead to excessive removal of legitimately distinct job postings.

The Llama 3.1–8B model produces a more moderate duplication rate (52.2%), yet still exceeds the true duplicate proportion, suggesting a persistent bias toward positive duplicate classification. In contrast, the Phi-4 model yields a duplication rate of 42.0%, which closely aligns with the human-validated ground-truth distribution of approximately 40%. The GPT-4o model demonstrates the closest alignment with human annotations, at 39.7%, effectively matching the true duplicate prevalence.

These results indicate that Phi-4 provides the best trade-off among lightweight models in terms of duplication rate calibration, while GPT-4o serves as the upper-bound performance reference.
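The calibration comparison amounts to subtracting the human-annotated duplicate share (40.4%) from each model's duplication rate. The sketch below performs that subtraction on the values reported in Table 2. The variable names are illustrative, and the signed gap in percentage points is a convenience measure introduced here, not a metric defined by the authors.

```python
GROUND_TRUTH_RATE = 40.4  # % of candidate pairs labeled true duplicates by the experts

# Duplication rates (%) as reported in Table 2.
duplication_rate = {
    "GPT-4o": 39.7,
    "Phi-4": 42.0,
    "Llama 3.1–8B": 52.2,
    "Mistral–7B": 75.8,
}

# Signed gap in percentage points: positive means the model flags too many duplicates.
gap = {model: round(rate - GROUND_TRUTH_RATE, 1) for model, rate in duplication_rate.items()}

# Models ordered from best to worst calibrated.
ranked = sorted(gap, key=lambda model: abs(gap[model]))

for model in ranked:
    print(f"{model}: {gap[model]:+.1f} pp")
```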

4.2. Classification Performance Against Human Annotations

Table 3 summarizes the aggregate precision, recall, accuracy, and F1-score of each model against expert labels. The GPT-4o model achieves the strongest overall performance, with an F1-score of 95.10%, reflecting both very high precision (92.26%) and recall (98.14%). This indicates strong semantic discrimination performance in near-duplicate validation.

Among the lightweight models, Phi-4 delivers the best overall balance, achieving an F1-score of 92.58%, precision of 89.97%, and recall of 95.35%. This performance is remarkably close to that of GPT-4o, despite Phi-4 being executed entirely on local hardware with significantly lower computational cost.

The Llama 3.1–8B model attains a moderate F1-score of 78.77%, with relatively high recall (91.95%) but substantially lower precision (68.89%), indicating a tendency to over-predict duplicates. Finally, Mistral–7B shows perfect recall (100%) but very low precision (51.80%), resulting in excessive false positives and an overall F1-score of only 68.25%. This behavior is consistent with its inflated duplication rate observed earlier.
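As a consistency check, the F1-score is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below recomputes it from the precision and recall values reported in Table 3 and compares the result with the reported F1. The 0.02-point tolerance is an assumption that allows for rounding of the published inputs.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in the same unit, here percent)."""
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported F1 %) from Table 3.
reported = {
    "GPT-4o": (92.26, 98.14, 95.10),
    "Phi-4": (89.97, 95.35, 92.58),
    "Llama 3.1–8B": (68.89, 91.95, 78.77),
    "Mistral–7B": (51.80, 100.00, 68.25),
}

recomputed = {model: f1_score(p, r) for model, (p, r, _) in reported.items()}
max_deviation = max(abs(recomputed[m] - reported[m][2]) for m in reported)

for model, value in recomputed.items():
    print(f"{model}: recomputed F1 = {value:.2f}%, reported = {reported[model][2]:.2f}%")
```

The Mistral–7B row shows why perfect recall is not enough: with precision near 52%, the harmonic mean is pulled down to roughly 68%.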

4.3. Computational Efficiency

Beyond accuracy, we also evaluated the computational cost of the deduplication pipeline under realistic operational conditions. The WordLlama embedding similarity stage, responsible for retrieving near-duplicate candidates from the rolling 15-day comparison window, required on average approximately 10 min per daily batch. Table 4 summarizes the average daily validation time required by each LLM model to process the retrieved near-duplicate candidate pairs.
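A rough end-to-end figure follows from adding the retrieval time to each model's validation time. The sketch assumes the two stages run sequentially and that the approximate 10 min retrieval time applies regardless of the validating model. Both are simplifying assumptions, and the validation times are those listed in Table 4.

```python
RETRIEVAL_MIN = 10.0  # approximate embedding-retrieval time per daily batch, from the text

# Average validation time per day (minutes), from Table 4.
validation_min = {
    "GPT-4o": 1.9,
    "Phi-4": 2.2,
    "Llama 3.1–8B": 2.5,
    "Mistral–7B": 3.1,
}

# Sequential pipeline: retrieval followed by LLM validation.
total_min = {model: RETRIEVAL_MIN + t for model, t in validation_min.items()}

for model, minutes in total_min.items():
    print(f"{model}: about {minutes:.1f} min per day")
```

Under these assumptions the retrieval stage, not the LLM validation, accounts for most of the daily runtime for every model.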

4.4. Key Findings

The experimental results lead to the following central conclusions. First, the true duplicate proportion of approximately 40% confirms that embedding-based near-duplicate retrieval alone is insufficient and must be complemented by a semantic LLM-based validation stage in order to reliably distinguish true duplicates from merely similar job postings. Second, GPT-4o achieves the highest overall performance and serves as the upper-bound reference model for semantic validation. Phi-4 comes close on both duplication rate (42.0% versus 39.7%) and F1-score (92.58% versus 95.10%). It can therefore be used as a locally deployable alternative that incurs no API fees, at a loss of about 2.5 F1 points, while also avoiding the privacy and infrastructure constraints associated with cloud-based inference.

In contrast, Llama 3.1–8B and Mistral–7B exhibit strong recall but substantially lower precision, leading to systematic over-deduplication and a high volume of false positives, which limits their suitability for production-grade deduplication. Finally, the complete daily deduplication pipeline—combining embedding-based retrieval on a 15-day rolling window with LLM-based semantic validation—operates within a practical runtime budget, demonstrating the feasibility of the proposed framework for continuous large-scale deployment.

5. Limitations

Despite the promising results, several limitations should be acknowledged. First, the evaluation focuses primarily on Greek job postings, which may limit generalizability to other languages or labor markets with different linguistic structures and posting conventions. Second, the similarity thresholds used in the embedding-based filtering stage were empirically derived and may require recalibration when applied to different domains or datasets. Third, although the framework is designed to scale efficiently through a sliding-window and clustering strategy, our experimental setup reflects a realistic incremental ingestion scenario, where approximately 8000 postings are processed within each rolling comparison window. While this mirrors real-world operational settings, it does not represent extremely large static corpora containing millions of records, and further validation under such conditions would strengthen scalability claims.

Additionally, while lightweight LLMs demonstrate competitive performance, their behavior may vary under different prompting strategies, model updates, or deployment environments. Finally, the cost analysis was limited to execution time and did not include detailed measurements of energy consumption or long-term infrastructure considerations.

6. Discussion

The findings of this study highlight the effectiveness of combining embedding-based filtering with LLM validation for job posting deduplication, particularly through the strategic use of synthetic data to enrich real-world datasets with more nuanced and challenging duplicate scenarios. This augmentation ensures robustness against varied linguistic expressions and structural differences common across platforms.

Looking ahead, further improvements could focus on enhancing scalability and adaptability by employing more efficient grouping methods, such as hierarchical or density-based clustering, to reduce pairwise comparisons. To further reduce manual oversight, active learning frameworks could be employed to iteratively select and label uncertain cases, continuously improving LLM validation with minimal human annotation.

Agent-based systems could further improve the framework. They could dynamically adapt prompts, similarity thresholds, model selection, and validation depth based on context.

The choice between open-source and commercial LLMs involves important trade-offs. Models like GPT-4o offer superior accuracy, whereas open-source alternatives such as Phi-4 provide a cost-effective, locally deployable solution that enhances data privacy and avoids recurring API costs, a critical consideration in regulated or resource-sensitive environments. Finally, it is essential to address the ethical and privacy implications of deploying such systems, particularly when processing sensitive employment data, by ensuring compliance with data protection regulations and exploring privacy-preserving techniques such as federated learning or on-premises deployment.

Author Contributions

Conceptualization, G.T. (Giannis Tzimas) and P.Z.; Methodology, G.T. (Giannis Thivaios), P.Z., K.G. and G.T. (Giannis Tzimas); Software, G.T. (Giannis Thivaios) and P.Z.; Validation, K.G.; Formal analysis, G.T. (Giannis Tzimas) and K.G.; Resources, G.T. (Giannis Tzimas) and P.Z.; Data curation, G.T. (Giannis Thivaios), G.T. (Giannis Tzimas) and P.Z.; Writing—original draft, G.T. (Giannis Thivaios) and P.Z.; Supervision, G.T. (Giannis Tzimas) and K.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Appendix A.1. Full Prompt Used for LLM Deduplication Evaluation

Please analyze the following job postings content strictly based on the highlighted parts in the HTML.
Your task is to analyze the content with a specific focus on the formatted (highlighted) parts within the HTML. The fields that we are interested in are title, location, company, and en_description.
The highlighted sections contain key textual (or contextual) differences that are critical for determining the nature of the job postings. When analyzing the content, focus on semantic equivalence rather than minor linguistic, grammatical, or formatting differences.
However, role differences (e.g., job titles, required qualifications, role levels) and location differences must always be treated as meaningful and should result in the job postings being classified as distinct opportunities, even if all other fields are identical.
Based on your analysis of these formatted parts, decide if the job postings represent duplicate posts of the same job or distinct opportunities.
Respond only with "yes" if the highlighted textual content indicates the job postings are essentially the same, or "no" if the highlighted content suggests they are different.
Respond only in JSON format according to the predefined schema.

Appendix A.2. Prompt Used for Synthetic Job Postings Generation

Generate exactly batch_size job postings as a JSON array. Each posting must have these exact fields:
- id: string (format like "P1-J001", "P2-J045")
- scraped_date: string (YYYY-MM-DD between 2023-10-01 and 2024-01-15)
- portal: string ("portal1", "portal2", or "portal3")
- company: string
- location: string
- title: string
- description: string (2–4 sentences, realistic job description)
DISTRIBUTION REQUIREMENTS:
- 60% unique jobs (completely different roles, companies, locations)
- 10% exact duplicates (same content, different IDs and dates)
- 30% near-duplicates with realistic variations.
NEAR-DUPLICATE TYPES (distribute the 30% among these common real-world scenarios):
1. HIERARCHICAL LEVEL VARIATIONS (same role, different seniority):
- "Software Engineer" vs. "Senior Software Engineer" vs. "Lead Software Engineer"
- "Waiter" vs. "Head Waiter" vs. "Restaurant Supervisor"
- "Sales Associate" vs. "Senior Sales Associate" vs. "Sales Team Lead"
- "Nurse" vs. "Senior Nurse" vs. "Nurse Supervisor"
2. EMPLOYMENT TYPE DIFFERENCES:
- Same role but different contract types:
* "Full-time Software Engineer" vs. "Contract Software Engineer" vs. "Part-time Software Engineer"
* "Permanent Marketing Manager" vs. "Temporary Marketing Manager" vs. "Freelance Marketing Manager"
3. MINOR TITLE WORDING VARIATIONS:
- "Data Analyst" vs. "Business Data Analyst" vs. "Marketing Data Analyst"
- "Customer Service Representative" vs. "Customer Support Agent" vs. "Client Service Specialist"
- "Hotel Receptionist" vs. "Front Desk Agent" vs. "Guest Services Representative"
4. SAME COMPANY, MULTIPLE LOCATIONS:
- Large companies posting the same role in different cities.
5. SIMILAR ROLES IN SAME INDUSTRY:
- "Junior Accountant" vs. "Accounting Assistant" vs. "Bookkeeper"
- "Web Developer" vs. "Frontend Developer" vs. "UI Developer"
6. SAME AD WITH MINOR UPDATES:
- Slightly updated requirements, salary range, benefits, or deadline.
Return only a valid JSON array. No other text or explanations.

References

1. Zhang, P. Application of Artificial Intelligence (AI) in Recruitment and Selection: The Case of Company A and Company B. J. Bus. Manag. Stud. 2024, 6, 224–225.
2. Draisbach, U. Efficient Duplicate Detection and the Impact of Transitivity. Ph.D. Thesis, Universität Potsdam, Potsdam, Germany, 2022.
3. Zhao, Y.; Chen, H.; Mason, C.M. A framework for duplicate detection from online job postings. In Proceedings of the WI-IAT’21: 20th IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, Melbourne, Australia, 14–17 December 2021; Association for Computing Machinery: New York, NY, USA, 2022; pp. 249–256.
4. Ramya, R.S.; Venugopal, K.R. Feature extraction and duplicate detection for text mining: A survey. Glob. J. Comput. Sci. Technol. 2017, 16, 1–20.
5. Tzimas, G.; Zotos, N.; Mourelatos, E.; Giotopoulos, K.C.; Zervas, P. From Data to Insight: Transforming Online Job Postings into Labor-Market Intelligence. Information 2024, 15, 496.
6. Engelbach, M.; Klau, D.; Kintz, M.; Ulrich, A. Combining Embeddings and Domain Knowledge for Job Posting Duplicate Detection. arXiv 2024, arXiv:2406.06257.
7. Adhab, A.H.; Husieen, A.N. Techniques of Data Deduplication for Cloud Storage: A Review. Int. J. Eng. Res. Adv. Technol. 2024, 8, 7–18.
8. Burk, H.; Javed, F.; Balaji, J. Apollo: Near-duplicate detection for job ads in the online recruitment domain. In Proceedings of the 2017 IEEE International Conference on Data Mining Workshops (ICDMW), New Orleans, LA, USA, 18–21 November 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 177–182.
9. Gao, J.; He, Y.; Zhang, X.; Xia, Y. Duplicate short text detection based on Word2vec. In Proceedings of the 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), Beijing, China, 24–26 November 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 33–37.
10. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space. In Proceedings of the 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, AZ, USA, 2–4 May 2013.
11. Shi, H.; Liu, X.; Lv, F.; Xue, H.; Hu, J.; Du, S.; Li, T. A Pre-trained Data Deduplication Model based on Active Learning. arXiv 2025, arXiv:2308.00721.
12. OpenAI. API Reference—OpenAI Platform. Available online: https://platform.openai.com/docs/api-reference (accessed on 1 May 2024).
13. Ram, S.; Nachappa, M.N. Fake Job Posting Detection. Int. J. Adv. Res. Sci. Commun. Technol. 2024, 4, 283–287.
14. OpenAI. Introducing GPT-5. Available online: https://openai.com/index/introducing-gpt-5/ (accessed on 7 August 2025).
15. ESCO (European Skills, Competences, Qualifications and Occupations). Available online: https://esco.ec.europa.eu/en/classification/occupation_main (accessed on 15 May 2025).
16. O*NET Web Services. Welcome to the O*Net Web Services Site! Available online: https://services.onetcenter.org/ (accessed on 29 September 2023).
17. Miller, G.A. WordNet: A Lexical Database for English. Commun. ACM 1995, 38, 39–41.
18. Colombo, S.; D’Amico, S.; Malandri, L.; Mercorio, F.; Seveso, A. JobSet: Synthetic Job Advertisements Dataset for Labour Market Intelligence. In Proceedings of the SAC’25: 40th ACM/SIGAPP Symposium on Applied Computing, Catania, Italy, 31 March–4 April 2025; Association for Computing Machinery: New York, NY, USA, 2025; pp. 928–935.
19. Skondras, P.; Zervas, P.; Tzimas, G. Generating Synthetic Resume Data with Large Language Models for Enhanced Job Description Classification. Information 2023, 15, 363.
20. Skondras, P.; Zotos, N.; Lagios, D.; Zervas, P.; Tzimas, G. Deep Learning Approaches for Big Data-Driven Metadata Extraction in Online Job Postings. Future Internet 2023, 14, 585.
21. Itnal, V. Fake/Real Job Posting Detection Using Machine Learning. Int. J. Res. Appl. Sci. Eng. Technol. 2025, 13, 1508–1515.
22. Christen, P. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection; Springer Science and Business Media: Berlin, Germany, 2012.
23. Lavi, D.; Medentsiy, V.; Graus, D. conSultantBERT: Fine-Tuned Siamese Sentence-BERT for Matching Jobs and Job Seekers. arXiv 2021, arXiv:2109.06501.
24. Ortiz Martes, D.; Gunderson, E.; Neuman, C.; Kachouie, N.N. Transformer Models for Paraphrase Detection: A Comprehensive Semantic Similarity Study. Computers 2025, 14, 385.
25. Miller, D.L. WordLlama: Recycled Token Embeddings from Large Language Models. 2024. Available online: https://github.com/dleemiller/wordllama (accessed on 24 October 2024).
26. Bos, A. Visualizing Differences Between HTML Documents. Bachelor’s Thesis, Radboud University, Nijmegen, The Netherlands, 2018.
27. Rajiv, Y. Detecting Similar HTML Documents Using a Sentence-Based Copy Detection Approach. Master’s Thesis, Department of Computer Science, Brigham Young University, Provo, UT, USA, 2005.
28. Lin, Y.S.; Jiang, J.Y.; Lee, S.J. A Similarity Measure for Text Classification and Clustering. IEEE Trans. Knowl. Data Eng. 2014, 26, 1575–1590.
29. Gunawan, D.; Sembiring, C.A.; Budiman, M.A. The Implementation of Cosine Similarity to Calculate Text Relevance between Two Documents. J. Phys. Conf. Ser. 2018, 978, 012120.
30. Touvron, H.; Lavril, T. LLaMA: Open and Efficient Foundation Language Models. arXiv 2023, arXiv:2302.13971.
31. Jiang, A.; Sablayrolles, A.; Mensch, A.; Bamford, C.; Chaplot, D.; de Las Casas, D.; Bressand, F.; Lengyel, G.; Lample, G.; Saulnier, L.; et al. Mistral 7B. arXiv 2023, arXiv:2310.06825.
32. Abdin, M.; Aneja, J.; Behl, H.; Bubeck, S.; Eldan, R.; Gunasekar, S.; Harrison, M.; Hewett, R.J.; Javaheripo, M.; Kauffmann, P.; et al. Phi-4 Technical Report. arXiv 2024, arXiv:2412.08905.
33. OpenAI; Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; et al. GPT-4 Technical Report. arXiv 2024, arXiv:2303.08774.
34. Dong, Y.; Mu, R.; Zhang, Y.; Sun, S.; Zhang, T.; Wu, C.; Jin, G.; Qi, Y.; Hu, J.; Meng, J.; et al. Safeguarding Large Language Models: A Survey. arXiv 2024, arXiv:2406.02622.
35. Kibriya, H.; Khan, W.Z.; Siddiqa, A.; Khan, M.K. Privacy issues in Large Language Models. Comput. Electr. Eng. 2024, 120, 109698.
36. Powers, D.M.W. Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation. arXiv 2020, arXiv:2010.16061.
37. Sokolova, M.; Lapalme, G. A systematic analysis of performance measures for classification tasks. Inf. Process. Manag. 2009, 45, 427–437.
38. Gwet, K. Handbook of Inter-Rater Reliability: The Definitive Guide to Measuring the Extent of Agreement Among Raters, 4th ed.; Advanced Analytics LLC: Gaithersburg, MD, USA, 2014.
39. Ntinopoulos, V.; Rodriguez Cetina Biefer, H.; Tudorache, I.; Papadopoulos, N.; Odavic, D.; Risteski, P.; Haeussler, A.; Dzemali, O. Large language models for data extraction from unstructured and semi-structured electronic health records: A multiple model performance evaluation. BMJ Health Care Inform. 2025, 32, e101139.
40. Bhayana, K.; Wang, D.; Jiang, X.; Fraser, S. Abstract 134: Use of Large Language Model to Allow Reliable Data Acquisition for International Pediatric Stroke Study. Stroke 2025, 56, A134.
41. Du, W.; Yang, Y.; Welleck, S. Optimizing Temperature for Language Models with Multi-Sample Inference. arXiv 2024, arXiv:2502.05234.

Figure 1. Architecture of the proposed hybrid deduplication framework.

Figure 2. Near-duplicate detection and LLM validation workflow.

Figure 3. Illustrative scenarios in job posting deduplication: (a) semantically equivalent duplicates with minor wording differences, (b) distinct postings differing by location, (c) near duplicates differing by employment type, and (d) distinct roles with different job titles despite similar company and location.

Figure 4. LLM-based duplicate classification pipeline.

Table 1. Example of a synthetically generated near-duplicate job posting pair illustrating controlled variation in role specificity.

Field	Posting A	Posting B
Job Title	Waiter/Waitress	Assistant Waiter
Posting Date	02/09/25	05/09/25
Company	OceanView Restaurant	OceanView Restaurant
Location	Thessaloniki	Thessaloniki
Description	Hiring experienced waiters/waitresses for busy beachfront restaurant. Full-time or part-time roles available.	Assistant waiter needed for support roles. Experience a plus. Flexible working hours.

Table 2. Duplication Rate across models.

Model	Duplication Rate
GPT-4o	39.7%
Phi-4	42.0%
Llama 3.1–8B	52.2%
Mistral–7B	75.8%

Table 3. Performance Comparison of Language Models on Job Posting Deduplication.

Model	Precision (%)	Recall (%)	Accuracy (%)	F1-Score (%)
GPT-4o	92.26	98.14	95.99	95.10
Phi-4	89.97	95.35	93.91	92.58
Llama 3.1–8B	68.89	91.95	80.30	78.77
Mistral–7B	51.80	100.00	62.85	68.25

Table 4. Average daily execution time of the LLM-based semantic validation.

Model	Avg. Validation Time (min/day)
GPT-4o	1.9
Phi-4	2.2
Llama 3.1–8B	2.5
Mistral–7B	3.1
