On the Task of Job Posting Deduplication Using Embedding-Based Filtering and LLM Validation
Abstract
1. Introduction
- 1. A scalable hybrid deduplication framework that integrates lightweight embedding-based filtering with LLM-based semantic validation to balance computational efficiency and semantic precision.
- 2. A clustering-based grouping mechanism that significantly reduces computational complexity by restricting comparisons to temporally and semantically coherent candidate sets, thereby improving scalability in continuous ingestion environments.
- 3. The construction of a large augmented evaluation dataset consisting of 50,000 job postings, combining real-world and synthetically generated records to simulate realistic duplication scenarios and linguistic variability.
- 4. A comprehensive comparative evaluation of open-source and commercial LLMs, demonstrating that lightweight locally deployable models can achieve performance close to state-of-the-art commercial systems while substantially reducing operational cost.
- 5. An operationally feasible continuous deduplication pipeline, validated under realistic daily ingestion settings, confirming its suitability for large-scale deployment.
2. Related Work
3. Methodology
3.1. Overview
3.2. Synthetic Job Postings Generation Using GPT-5
3.3. Detect Near Duplicate Job Postings
- (1) Defining a date-based sliding window;
- (2) Computing semantic similarity using embedding models;
- (3) Grouping related postings through an efficient clustering mechanism that minimizes unnecessary comparisons.
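The three steps above can be sketched as a single greedy grouping pass. This is a minimal illustration, not the authors' implementation: the function name `group_candidates`, the 7-day window, the 0.8 threshold, and the toy `title_jaccard` similarity (standing in for an embedding model) are all assumptions made for the example.

```python
from datetime import date, timedelta


def title_jaccard(a, b):
    """Toy similarity (assumption): Jaccard overlap of title tokens."""
    ta = set(a["title"].lower().split())
    tb = set(b["title"].lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def group_candidates(postings, similarity, window_days=7, threshold=0.8):
    """Greedy grouping: a posting joins the first cluster whose representative
    (its earliest member) lies within the time window and is similar enough;
    otherwise it starts a new cluster. Each posting is compared only against
    cluster representatives, not against every other posting."""
    clusters = []
    window = timedelta(days=window_days)
    for posting in sorted(postings, key=lambda p: p["scraped_date"]):
        for cluster in clusters:
            rep = cluster[0]
            in_window = posting["scraped_date"] - rep["scraped_date"] <= window
            if in_window and similarity(posting, rep) >= threshold:
                cluster.append(posting)
                break
        else:
            # No existing cluster matched, so this posting opens a new one.
            clusters.append([posting])
    return clusters


postings = [
    {"id": "P1-J001", "title": "Data Analyst", "scraped_date": date(2023, 10, 1)},
    {"id": "P2-J045", "title": "Data Analyst", "scraped_date": date(2023, 10, 3)},
    {"id": "P3-J010", "title": "Data Analyst", "scraped_date": date(2023, 12, 1)},
    {"id": "P1-J002", "title": "Hotel Receptionist", "scraped_date": date(2023, 10, 2)},
]
clusters = group_candidates(postings, title_jaccard)
cluster_ids = [[p["id"] for p in c] for c in clusters]
```

In this toy run the two "Data Analyst" postings scraped two days apart share a cluster, while the one scraped two months later falls outside the window and forms its own group, so it would never be sent for pairwise validation against the first two.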
3.3.1. Time Window
3.3.2. Detecting Duplicates Using Embeddings
- Job title;
- Company name;
- Location;
- Job description.
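As a rough illustration of comparing postings over the four fields listed above, the sketch below concatenates them and scores a pair by cosine similarity. The `toy_embed` function is a stand-in (an assumption): it uses plain token counts so the example runs without any model, whereas a real pipeline would substitute vectors from an embedding model. The field keys and sample postings are likewise hypothetical.

```python
import math
from collections import Counter

# Field keys are assumptions mirroring the four fields listed above.
FIELDS = ("title", "company", "location", "description")


def toy_embed(posting):
    """Stand-in for an embedding model: token counts over the concatenated fields."""
    text = " ".join(str(posting.get(field, "")) for field in FIELDS).lower()
    return Counter(text.split())


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(u[token] * v[token] for token in u.keys() & v.keys())
    norm_u = math.sqrt(sum(count * count for count in u.values()))
    norm_v = math.sqrt(sum(count * count for count in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


a = {"title": "Hotel Receptionist", "company": "Acme Hotels",
     "location": "Athens", "description": "Greet guests and manage bookings."}
b = {"title": "Front Desk Agent", "company": "Acme Hotels",
     "location": "Athens", "description": "Greet guests and manage bookings."}
c = {"title": "Junior Accountant", "company": "LedgerWorks",
     "location": "Patras", "description": "Prepare monthly financial statements."}

score_near = cosine(toy_embed(a), toy_embed(b))
score_far = cosine(toy_embed(a), toy_embed(c))
```

Pairs whose score exceeds a chosen threshold would be kept as duplicate candidates; the threshold itself is a tuning decision that this sketch does not fix.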
3.3.3. Clustering for Efficient Grouping
3.4. Highlight Differences Using HTML
3.5. Evaluate Duplicate Pairs Using LLM
3.5.1. Open-Source vs. Commercial LLMs
3.5.2. Evaluation Metrics
- Precision: The proportion of correctly identified duplicates among all pairs flagged by the model. High precision indicates few false positives, which is critical for ensuring that distinct job postings are not erroneously merged. With TP, FP, FN, and TN denoting true positives, false positives, false negatives, and true negatives, precision is defined as: $\mathrm{Precision} = \frac{TP}{TP + FP}$
- Recall: The proportion of actual duplicates correctly detected by the model. High recall ensures comprehensive deduplication and reduces residual noise in the dataset. Recall is calculated as: $\mathrm{Recall} = \frac{TP}{TP + FN}$
- Accuracy: The overall correctness of the model, reflecting the ratio of all correct predictions (both duplicates and non-duplicates) to the total number of evaluated pairs: $\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
- F1-score: The harmonic mean of precision and recall, providing a single measure of classification effectiveness in settings where false positives and false negatives matter equally: $F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$
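The four metrics above can be computed directly from confusion-matrix counts, as in the following minimal sketch. The function name and the example counts are hypothetical and are not taken from the paper's experiments.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, accuracy, and F1 from confusion-matrix counts.

    tp: duplicate pairs correctly flagged as duplicates
    fp: distinct pairs wrongly flagged as duplicates
    fn: duplicate pairs the model missed
    tn: distinct pairs correctly left unflagged
    """
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}


# Hypothetical counts for 200 evaluated pairs.
metrics = classification_metrics(tp=90, fp=10, fn=5, tn=95)
```

The zero-denominator guards matter in practice: a model that flags no pairs at all has an undefined precision, and returning 0.0 is one common convention rather than the only possible choice.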
3.5.3. Experiment Setup and Data Collection
- 60% unique postings;
- 10% exact duplicates;
- 30% near-duplicates.
3.5.4. Model Configuration for Deterministic Outputs
4. Results
4.1. Model Duplication Rate Analysis
4.2. Classification Performance Against Human Annotations
4.3. Computational Efficiency
4.4. Key Findings
5. Limitations
6. Discussion
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A
Appendix A.1. Full Prompt Used for LLM Deduplication Evaluation
- Please analyze the following job postings content strictly based on the highlighted parts in the HTML.
- Your task is to analyze the content with a specific focus on the formatted (highlighted) parts within the HTML. The fields that we are interested in are title, location, company, and en_description.
- The highlighted sections contain key textual (or contextual) differences that are critical for determining the nature of the job postings. When analyzing the content, focus on semantic equivalence rather than minor linguistic, grammatical, or formatting differences.
- However, role differences (e.g., job titles, required qualifications, role levels) and location differences must always be treated as meaningful and should result in the job postings being classified as distinct opportunities, even if all other fields are identical.
- Based on your analysis of these formatted parts, decide if the job postings represent duplicate posts of the same job or distinct opportunities.
- Respond only with "yes" if the highlighted textual content indicates the job postings are essentially the same, or "no" if the highlighted content suggests they are different.
- Respond only in JSON format according to the predefined schema.
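To make the prompt's inputs and outputs concrete, the sketch below shows one possible way to produce highlighted HTML for a pair of field values and to read back the yes/no verdict. It is an illustration under stated assumptions: the use of `<mark>` tags, word-level diffing with Python's `difflib`, and the JSON key `is_duplicate` are all hypothetical, since the source does not specify the highlighting markup or the predefined schema.

```python
import difflib
import html
import json


def highlight_diff(text_a, text_b):
    """Return (html_a, html_b) with differing word runs wrapped in <mark>."""
    words_a, words_b = text_a.split(), text_b.split()
    out_a, out_b = [], []
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        seg_a = html.escape(" ".join(words_a[i1:i2]))
        seg_b = html.escape(" ".join(words_b[j1:j2]))
        if tag == "equal":
            out_a.append(seg_a)
            out_b.append(seg_b)
        else:
            # insert/delete/replace: mark whichever side has text.
            if seg_a:
                out_a.append(f"<mark>{seg_a}</mark>")
            if seg_b:
                out_b.append(f"<mark>{seg_b}</mark>")
    return " ".join(out_a), " ".join(out_b)


def parse_verdict(raw_response, key="is_duplicate"):
    """Parse a JSON reply and map "yes"/"no" to a boolean.

    The key name is an assumption; the actual schema is not given in the source.
    """
    answer = str(json.loads(raw_response)[key]).strip().lower()
    if answer not in ("yes", "no"):
        raise ValueError(f"unexpected verdict: {answer!r}")
    return answer == "yes"


html_a, html_b = highlight_diff("Senior Software Engineer", "Software Engineer")
verdict = parse_verdict('{"is_duplicate": "no"}')
```

Rejecting anything other than "yes" or "no" keeps a malformed model reply from being silently counted as a non-duplicate.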
Appendix A.2. Prompt Used for Synthetic Job Postings Generation
- Generate exactly batch_size job postings as a JSON array. Each posting must have these exact fields:
- - id: string (format like "P1-J001", "P2-J045")
- - scraped_date: string (YYYY-MM-DD between 2023-10-01 and 2024-01-15)
- - portal: string ("portal1", "portal2", or "portal3")
- - company: string
- - location: string
- - title: string
- - description: string (2-4 sentences, realistic job description)
- DISTRIBUTION REQUIREMENTS:
- - 60% unique jobs (completely different roles, companies, locations)
- - 10% exact duplicates (same content, different IDs and dates)
- - 30% near-duplicates with realistic variations.
- NEAR-DUPLICATE TYPES (distribute the 30% among these common real-world scenarios):
- 1. HIERARCHICAL LEVEL VARIATIONS (same role, different seniority):
- - "Software Engineer" vs. "Senior Software Engineer" vs. "Lead Software Engineer"
- - "Waiter" vs. "Head Waiter" vs. "Restaurant Supervisor"
- - "Sales Associate" vs. "Senior Sales Associate" vs. "Sales Team Lead"
- - "Nurse" vs. "Senior Nurse" vs. "Nurse Supervisor"
- 2. EMPLOYMENT TYPE DIFFERENCES:
- - Same role but different contract types:
- * "Full-time Software Engineer" vs. "Contract Software Engineer" vs. "Part-time Software Engineer"
- * "Permanent Marketing Manager" vs. "Temporary Marketing Manager" vs. "Freelance Marketing Manager"
- 3. MINOR TITLE WORDING VARIATIONS:
- - "Data Analyst" vs. "Business Data Analyst" vs. "Marketing Data Analyst"
- - "Customer Service Representative" vs. "Customer Support Agent" vs. "Client Service Specialist"
- - "Hotel Receptionist" vs. "Front Desk Agent" vs. "Guest Services Representative"
- 4. SAME COMPANY, MULTIPLE LOCATIONS:
- - Large companies posting the same role in different cities.
- 5. SIMILAR ROLES IN SAME INDUSTRY:
- - "Junior Accountant" vs. "Accounting Assistant" vs. "Bookkeeper"
- - "Web Developer" vs. "Frontend Developer" vs. "UI Developer"
- 6. SAME AD WITH MINOR UPDATES:
- - Slightly updated requirements, salary range, benefits, or deadline.
- Return only a valid JSON array. No other text or explanations.
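Because the prompt above fixes an exact output schema, a generated batch can be checked mechanically before it enters the dataset. The following is a minimal validation sketch, not part of the published pipeline: the function name `validate_batch` is hypothetical, and the regular expression for the id field is an interpretation of the two example ids given in the prompt.

```python
import json
import re
from datetime import date

REQUIRED_FIELDS = {"id", "scraped_date", "portal", "company",
                   "location", "title", "description"}
ALLOWED_PORTALS = {"portal1", "portal2", "portal3"}
# Assumed pattern, inferred from the examples "P1-J001" and "P2-J045".
ID_PATTERN = re.compile(r"^P\d+-J\d+$")
START, END = date(2023, 10, 1), date(2024, 1, 15)


def validate_batch(raw, batch_size):
    """Return a list of human-readable problems; an empty list means the batch passed."""
    postings = json.loads(raw)
    if not isinstance(postings, list):
        return ["top-level value is not a JSON array"]
    errors = []
    if len(postings) != batch_size:
        errors.append(f"expected {batch_size} postings, got {len(postings)}")
    for index, posting in enumerate(postings):
        if not isinstance(posting, dict) or set(posting) != REQUIRED_FIELDS:
            errors.append(f"posting {index}: unexpected field set")
            continue
        if not ID_PATTERN.match(str(posting["id"])):
            errors.append(f"posting {index}: malformed id")
        try:
            scraped = date.fromisoformat(posting["scraped_date"])
        except (TypeError, ValueError):
            errors.append(f"posting {index}: scraped_date is not YYYY-MM-DD")
        else:
            if not START <= scraped <= END:
                errors.append(f"posting {index}: scraped_date out of range")
        if str(posting["portal"]) not in ALLOWED_PORTALS:
            errors.append(f"posting {index}: unknown portal")
    return errors


good = json.dumps([{
    "id": "P1-J001", "scraped_date": "2023-11-05", "portal": "portal1",
    "company": "Acme Hotels", "location": "Athens",
    "title": "Hotel Receptionist",
    "description": "Greet guests and manage bookings. Evening shifts available.",
}])
bad = json.dumps([{
    "id": "J001", "scraped_date": "2024-03-01", "portal": "portal9",
    "company": "Acme Hotels", "location": "Athens",
    "title": "Hotel Receptionist", "description": "Greet guests.",
}])
good_errors = validate_batch(good, batch_size=1)
bad_errors = validate_batch(bad, batch_size=1)
```

This check covers only the structural requirements; verifying the 60/10/30 split between unique, exact-duplicate, and near-duplicate postings would need labels that the schema itself does not carry.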




| Field | Posting A | Posting B |
|---|---|---|
| Job Title | Waiter/Waitress | Assistant Waiter |
| Posting Date | 02/09/25 | 05/09/25 |
| Company | OceanView Restaurant | OceanView Restaurant |
| Location | Thessaloniki | Thessaloniki |
| Description | Hiring experienced waiters/waitresses for busy beachfront restaurant. Full-time or part-time roles available. | Assistant waiter needed for support roles. Experience a plus. Flexible working hours. |

| Model | Duplication Rate |
|---|---|
| GPT-4o | 39.7% |
| Phi-4 | 42.0% |
| Llama 3.1–8B | 52.2% |
| Mistral–7B | 75.8% |

| Model | Precision (%) | Recall (%) | Accuracy (%) | F1-Score (%) |
|---|---|---|---|---|
| GPT-4 | 92.26 | 98.14 | 95.99 | 95.10 |
| Phi-4 | 89.97 | 95.35 | 93.91 | 92.58 |
| Llama 3.1–8B | 68.89 | 91.95 | 80.30 | 78.77 |
| Mistral–7B | 51.80 | 100.00 | 62.85 | 68.25 |

| Model | Avg. Validation Time (min/day) |
|---|---|
| GPT-4o | 1.9 |
| Phi-4 | 2.2 |
| Llama 3.1–8B | 2.5 |
| Mistral–7B | 3.1 |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Thivaios, G.; Zervas, P.; Giotopoulos, K.; Tzimas, G. On the Task of Job Posting Deduplication Using Embedding-Based Filtering and LLM Validation. Information 2026, 17, 233. https://doi.org/10.3390/info17030233

