1. Introduction
Modern labor markets are undergoing rapid and uneven change, driven by digitization, automation, and shifting sectoral demand. Policymakers, public employment services, and large intermediaries increasingly need timely, fine-grained information on where skills are emerging, which occupations are at risk, and how wages evolve across regions and sectors. Traditional survey-based labor market intelligence provides high-quality but slow and coarse indicators, often with publication lags of several months and limited occupational detail. In contrast, large-scale online vacancy and CV data offer a rich and near-real-time view of skills, jobs, and wages, but raise substantial privacy, governance, and modeling challenges. This work addresses these challenges by combining large language models (LLMs), federated learning (FL), and differential privacy (DP) to enable real-time labor market analytics that are informative for policy and institutional decision making while respecting strict privacy constraints.
This transformation reflects not only digitization and automation but also demographic shifts and post-pandemic economic realignment. In this evolving landscape, understanding labor demand and supply dynamics at fine-grained spatial, temporal, and occupational resolutions has become increasingly critical. Policymakers depend on such insights to design reskilling and employment programs; corporations leverage them for strategic hiring and workforce planning; and researchers analyze them to model structural change and forecast the emergence of new occupations. Despite this growing need, existing systems often fall short in timeliness, granularity, adaptability, and, critically, data governance.
Existing labor analytics platforms typically rely on keyword heuristics or supervised learning over proprietary datasets. Such approaches are brittle: they fail to generalize across domains, lack explainability, and are not easily adaptable to evolving taxonomies such as O*NET, SOC, or ESCO. More importantly, most rely on centralized data collection, which necessitates transferring sensitive employment or organizational data to a central server. This raises significant privacy, compliance, and ethical concerns under regulations such as GDPR, CCPA, and HIPAA. Institutions are therefore reluctant to share or aggregate data, limiting the scale and representativeness of analytical models. The need for a distributed, privacy-preserving, and generalizable labor analytics framework has never been more urgent.
Recent advancements in large language models (LLMs)—such as GPT-3 [1] and LLaMA [2]—have transformed natural language understanding and generation. LLMs demonstrate strong capabilities in zero-shot classification, information extraction, and summarization, often outperforming task-specific architectures. Applied to labor market intelligence, they can infer skill requirements from job postings, normalize ambiguous job titles into standardized occupational codes, estimate salary ranges from textual cues, and summarize hiring trends across regions. Yet, off-the-shelf LLMs lack domain alignment for labor economics, are computationally intensive to scale, and introduce new challenges around privacy, data provenance, and interpretability.
While LLMs have seen adoption in domains such as biomedical, legal, and financial text analysis, their integration into labor market analytics remains limited. Few studies have explored how to adapt LLMs to labor-specific terminology, align their outputs with structured taxonomies, or deploy them securely in decentralized settings. Even fewer have investigated full-stack systems that incorporate privacy-preserving computation, continuous learning, and governance for cross-institutional collaboration. These gaps motivate the need for a comprehensive solution that unifies state-of-the-art NLP with rigorous privacy protection and scalable analytics.
In this work, we introduce a secure, scalable, and modular framework for labor market analysis powered by large language models. Our framework combines domain-adapted LLMs with federated learning (FL) and differential privacy (DP) to enable collaborative model training across multiple institutions without exposing raw data. The system is designed to operate seamlessly across organizational boundaries—such as ministries of labor, multinational enterprises, universities, and think tanks—while preserving the confidentiality, ownership, and legal integrity of local data.
The architecture consists of three core components:
Data ingestion layer: A layer responsible for securely collecting, filtering, and preprocessing unstructured labor data from sources such as job boards, resumes, and social media feeds.
LLM-powered inference engine: A domain-adapted LLM fine-tuned for multi-task learning across occupation classification, skill extraction, and salary prediction, ensuring structured outputs compatible with taxonomies like SOC and ESCO.
Federated analysis module: A decentralized training layer implementing secure aggregation and differential privacy mechanisms to enable compliance with privacy regulations while maintaining high model utility.
The system outputs include structured occupation–skill mappings, regional dashboards, temporal trend summaries, and predictive analytics supporting policy design, workforce development, and academic labor research.
Timeliness is particularly critical; official indicators used by central banks and labor ministries are often available only with delays of several weeks or more, whereas online postings and CVs update on a daily or even hourly basis. By turning these high-frequency digital traces into structured signals, our system can provide early warnings about emerging occupations, shifting skill bundles, or sharp demand shocks that would otherwise appear in official statistics only with substantial delay. We evaluate our framework through extensive experiments on real-world and semi-synthetic datasets, including job postings from LinkedIn and Indeed, annotated resumes, and standardized taxonomies such as O*NET and SOC. Quantitative metrics (F1, precision–recall, regression error) and qualitative assessments by labor economists confirm both analytical accuracy and interpretability. We further simulate multi-client federated deployments to assess training efficiency, DP-induced utility trade-offs, and system scalability.
Our main contributions are summarized as follows:
We propose the first end-to-end labor market analysis framework that integrates LLMs with structured taxonomy alignment and multi-task inference.
We design a federated learning protocol with differential privacy guarantees, enabling secure and compliant training on decentralized labor datasets.
We develop a domain-adaptation pipeline for LLMs using occupation-annotated and skill-extraction corpora, enhancing relevance and explainability.
We provide comprehensive empirical and expert evaluations demonstrating robustness, scalability, and privacy–utility balance across diverse domains.
To the best of our knowledge, this is the first system that holistically combines large language models, privacy-preserving computation, and labor economics into a unified, deployable platform for next-generation labor market intelligence. While large language models and federated learning offer powerful tools for processing unstructured labor market data at scale, they also come with important trade-offs: training and serving LLMs is computationally expensive; models are subject to domain drift as job content evolves; and federated optimization must contend with non-IID client data, heterogeneous hardware, and intermittent connectivity. We explicitly design our framework and experiments to surface these limitations and to quantify the extent to which differential privacy and federated learning remain compatible with economically useful signals.
2. Related Work
This work draws upon and advances research in four interconnected areas: labor market analytics, natural language processing for job and skill understanding, the application of large language models in economic and occupational domains, and privacy-preserving machine learning. We review the state of the art in each of these domains to highlight the gaps that motivate our proposed framework.
2.1. Labor Market Analytics and Workforce Intelligence
Labor market analysis is a longstanding area of interest in economics and public policy. Early foundational studies focused on analyzing trends in employment, skill demand, and wage evolution using structured datasets collected via national surveys. For instance, Autor et al. [3] introduced the concept of skill-biased technological change, showing how technology disproportionately favors workers with higher skills. Acemoglu and Restrepo [4] studied the impact of automation on employment and found a significant displacement effect on middle-skill occupations.
These classical approaches rely on periodic labor force surveys (e.g., the U.S. CPS or EU-LFS), which provide structured data but are limited by their sampling frequency and granularity. As the labor market becomes increasingly dynamic and digitized, these limitations have led researchers to explore alternative data sources. Hershbein and Kahn [5] leveraged millions of job postings to measure real-time shifts in demand for education credentials, while Marinescu and Wolthoff [6] used job board data to analyze firm preferences and applicant behavior.
Traditional labor market intelligence still relies heavily on official survey-based and administrative systems such as labor force surveys, establishment surveys, and social insurance registers. These sources remain the backbone for policy, but they typically operate at monthly or quarterly frequency, incur publication lags of several weeks or months, and often provide only coarse breakdowns by region, occupation, or demographic group. Our framework is designed to complement, rather than replace, these systems by providing high-frequency, fine-grained signals that can feed into the same policy processes.
However, these approaches often rely on shallow keyword matching or rule-based classification, which may fail to capture nuanced changes in labor demand, such as the emergence of hybrid occupations (e.g., “data-literate marketers”) or soft-skill emphasis. Furthermore, centralized scraping of proprietary job data may raise issues of access, scalability, and data governance.
2.2. NLP for Occupation Classification and Skill Extraction
Recent advances in NLP have enabled the automated classification of unstructured labor-related text. Early systems applied TF-IDF and word embedding techniques to classify job descriptions into standardized taxonomies such as the U.S. SOC or international ISCO codes [7]. While effective for coarse-grained labeling, these approaches struggled with ambiguous, context-sensitive phrases and synonyms common in job titles.
Transformer-based models, notably BERT [8], have shown significant improvements in semantic understanding. Camacho-Collados et al. [9] proposed JobBERT, a model pretrained on job advertisements, achieving state-of-the-art accuracy in job title normalization. Similarly, Liu et al. [10] fine-tuned RoBERTa for occupation classification tasks using LinkedIn job data.
Skill extraction is typically formulated as a sequence labeling problem. Conventional approaches employed conditional random fields (CRFs) or LSTMs [11], while recent work incorporates contextualized embeddings from LLMs. For example, the authors in [12] introduced a skill-aware BERT variant that improves entity-level recall in job documents. Despite these advances, most models are trained on limited datasets, do not generalize across domains (e.g., resumes vs. postings), and assume centralized access to text corpora.
Furthermore, little work has addressed joint modeling of skills, salaries, and occupations as interdependent outputs [13,14]. Our framework contributes a unified, multi-task approach that simultaneously performs job classification, skill detection, and salary estimation with privacy guarantees.
2.3. LLMs in Economic and Labor Domain Applications
Large language models (LLMs) such as GPT-3 [1], T5 [15], and LLaMA [2] have demonstrated general-purpose capabilities across summarization, classification, and generation tasks. Trained on massive web-scale corpora, these foundation models can be adapted to specific domains via instruction tuning, continued pretraining, or few-shot prompting.
Beyond generic NLP benchmarks, an emerging literature tailors LLMs to economic and financial applications. In the financial domain, BloombergGPT [16] and FinGPT [17] illustrate how domain-specific pretraining on proprietary or curated financial text can improve downstream tasks such as sentiment analysis, risk assessment, and earnings-call understanding. Surveys of generative AI in finance further document applications to portfolio management, risk modeling, and compliance [18]. At the macro level, Carriero et al. [19] compare time-series LLMs with traditional forecasting models for standard macroeconomic indicators, showing that LLM-based approaches can be competitive with, or complementary to, classical econometric tools.
Economists have also begun to reflect on how LLMs change empirical practice. Kwon et al. [20] provide a practical primer on LLMs for economists and central banks, while Korinek [21] surveys use cases of generative AI in economic research, including text-based measurement, counterfactual reasoning, and agent-based simulations. These contributions emphasize that LLMs are not only black-box predictors but can serve as flexible interfaces for large unstructured corpora and as components inside larger decision-making pipelines.
In the labor domain specifically, most existing work has used LLMs either to measure the potential impact of AI on jobs or to process labor market text. Eloundou et al. [22] use GPT-4-based annotations to quantify occupational exposure to LLM capabilities, while Chen et al. [23] exploit LLMs to recover rich information from categorical variables and construct new measures of labor market match quality using job-platform and survey data. At the task level, Serino [24] adapts LLMs to extract skills from job advertisements, demonstrating that modern transformer models can greatly improve over traditional dictionary-based pipelines for skill tagging.
Compared with this literature, our focus is on operationalizing LLMs as a production system for labor market intelligence under realistic constraints on data governance. Rather than assuming a centrally collected corpus, we explicitly integrate LLM-based representations with cross-silo federated learning and client-level differential privacy, aiming to deliver high-frequency labor indicators while respecting institutional and regulatory privacy requirements.
2.4. Privacy-Preserving and Federated Learning
Given the sensitivity of labor data—often containing personal identifiers, salaries, and employment histories—privacy preservation is essential. Federated learning (FL), introduced by McMahan et al. [25], enables decentralized training by allowing client devices to compute local updates without sharing raw data [26].
Applications of FL in healthcare [27] and finance [28] have shown its utility in domains with strict privacy requirements. In labor market analysis, however, FL adoption is limited due to challenges in text modeling, data heterogeneity, and client variability. Differential privacy (DP) [29] offers formal privacy guarantees by injecting noise into the data or model updates. Abadi et al. [30] introduced DP-SGD for deep networks, which has since been extended to NLP settings. Our system integrates FL and DP with LLMs for the first time in the labor market domain. We design a modular pipeline that supports institution-level federated nodes (e.g., ministries, companies, job boards) and applies privacy-preserving aggregation of model updates, enabling collaboration without compromising sensitive employment data.
2.5. Positioning and Novelty
To the best of our knowledge, our framework is the first to combine large language models, occupation–skill–salary modeling, federated privacy mechanisms, and labor market analytics into a unified platform. Compared with prior work, we achieve the following:
We scale NLP-based job understanding to millions of real-time job and resume records with high taxonomy fidelity.
We ensure privacy through federated training and rigorous DP noise control, suitable for deployment across institutions and jurisdictions.
We support longitudinal trend analysis using structured outputs from LLMs, enabling both short-term skill monitoring and long-term workforce planning.
Our contribution is thus not only technical but also architectural and societal, offering a scalable and secure way to monitor the evolving world of work using the latest advances in AI.
3. Framework Overview
Our proposed framework is designed to enable secure, scalable, and automated labor market analysis by integrating large language models with privacy-preserving data processing pipelines. The system comprises three major layers: data ingestion, intelligent processing via LLMs, and federated learning-based analytics. This section provides a detailed overview of each component, their roles, and their interactions.
3.1. System Architecture
The overall architecture of the system is shown in Figure 1. The components are organized to support modular deployment and decentralized collaboration across organizations such as government agencies, academic institutions, and labor market platforms.
Data Ingestion Layer: This layer interfaces with external data sources, including job posting websites, resume repositories, company HR systems, and labor-related social media feeds (e.g., LinkedIn and Reddit). Data is collected using secure APIs or streaming protocols. It undergoes initial preprocessing, including de-duplication, anonymization, and content validation.
LLM Processing Engine: The core of the system is a domain-adapted large language model that performs various natural language understanding (NLU) tasks, such as job title normalization, occupation–skill mapping, salary range inference, and labor demand summarization. We implement a modular LLM pipeline with pre-tokenization, contextual embedding generation, task-specific prompt engineering, and structured output parsing. The engine supports both batch processing and real-time streaming.
Federated Analysis Module: To enable secure, multi-institutional collaboration, this module coordinates a federated learning protocol where local nodes (e.g., universities or job platforms) compute intermediate models or statistics on-site. Only model updates, optionally masked by differential privacy mechanisms, are shared with a central aggregator. This design ensures raw data never leaves the source organization, meeting regulatory compliance requirements (e.g., GDPR).
3.2. Data Processing Workflow
The framework supports a continuous data pipeline consisting of the following stages:
Text Collection and Annotation: Raw text data is ingested and pre-labeled using weak supervision and existing ontologies (e.g., O*NET). Named entities such as job titles, skills, and locations are recognized and extracted.
Text Embedding and Understanding: Tokenized texts are fed into a fine-tuned transformer model, generating contextual embeddings for classification, clustering, and information extraction tasks. Attention-based mechanisms enable the model to focus on economically significant signals such as skill demand shifts or regional hiring trends.
Semantic Enrichment and Structuring: Extracted information is mapped to standardized taxonomies (e.g., ISCO-08, SOC) to ensure interoperability. Ambiguities in job descriptions or inconsistent terminology across regions are resolved using cross-lingual and paraphrase-aware modeling.
Secure Aggregation and Forecasting: In a distributed fashion, local participants compute aggregated statistics (e.g., occupation frequency histograms, skill co-occurrence graphs) and upload encrypted summaries to a secure aggregator. The central server then trains global forecasting models to predict emerging trends in occupations, skills, or sectoral labor shortages.
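For concreteness, the pairwise-masking idea behind secure aggregation of such statistics can be sketched as follows. This is a toy illustration (the occupation codes, mask range, and client data are invented); production deployments use cryptographic key agreement and handle client dropout:

```python
import random
from collections import Counter

def local_histogram(postings):
    # Each participant computes an occupation-frequency histogram locally;
    # raw postings never leave the institution.
    return Counter(p["occupation"] for p in postings)

def masked_shares(histograms, codes, rng):
    # Pairwise additive masking: for every client pair (i, j), client i adds
    # a random offset r and client j subtracts the same r, so all masks
    # cancel in the sum and the server can recover only the aggregate.
    masked = [{c: h.get(c, 0) for c in codes} for h in histograms]
    n = len(masked)
    for i in range(n):
        for j in range(i + 1, n):
            for c in codes:
                r = rng.randint(-10**6, 10**6)
                masked[i][c] += r
                masked[j][c] -= r
    return masked

def server_aggregate(masked, codes):
    # The server sums the masked shares; the masks cancel exactly.
    return {c: sum(m[c] for m in masked) for c in codes}

clients = [
    [{"occupation": "2111"}, {"occupation": "2111"}],
    [{"occupation": "2111"}, {"occupation": "3512"}],
]
codes = ["2111", "3512"]
shares = masked_shares([local_histogram(c) for c in clients], codes, random.Random(7))
totals = server_aggregate(shares, codes)  # aggregate counts; no client visible
```

Any single masked share looks like random noise, yet the server recovers the exact aggregate histogram, which it can then feed into the global forecasting models.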
3.3. LLM Fine-Tuning and Task Adaptation
We fine-tune a pretrained transformer model (e.g., T5, GPT-J, or LLaMA-2) on a labor-domain corpus composed of historical job advertisements, professional networking site content, and occupation–skill mappings. Several task heads are integrated into the model architecture, each tailored for the following:
Occupation Classification: Predicting hierarchical occupation codes from free-text job descriptions using multi-label classification.
Skill Extraction: Identifying explicit and implied skills and competencies using sequence labeling with CRF or span-based decoding.
Salary Range Estimation: Inferring plausible salary intervals using a hybrid classification–regression head.
Labor Trend Summarization: Generating abstract summaries of labor shifts across sectors and geographies using encoder–decoder-style generation.
Prompt-based zero-shot or few-shot learning is used for domains with limited labeled data. In addition, we implement retrieval-augmented generation (RAG) to improve performance in sparse or noisy information settings by grounding LLM outputs on curated economic knowledge bases.
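A minimal sketch of the retrieval step in such a RAG setup is shown below. The token-overlap ranking is a deliberately simple stand-in for a dense retriever, and the knowledge-base snippets are invented examples:

```python
def retrieve(query, knowledge_base, k=2):
    # Rank knowledge-base snippets by token overlap with the query; a toy
    # stand-in for the dense retriever used in a real RAG pipeline.
    q_tokens = set(query.lower().split())
    ranked = sorted(knowledge_base,
                    key=lambda s: -len(q_tokens & set(s.lower().split())))
    return ranked[:k]

def build_prompt(query, knowledge_base, k=2):
    # Ground the LLM by prepending retrieved context to the task prompt.
    context = "\n".join(retrieve(query, knowledge_base, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

kb = [
    "Registered nurses require a state nursing license.",
    "Welders join metal parts using heat.",
    "Data scientists typically use statistics, Python, and SQL.",
]
prompt = build_prompt("which skills do data scientists need", kb, k=1)
```

Grounding the generation on retrieved snippets in this way reduces hallucination in sparse or noisy settings, at the cost of an extra retrieval pass per query.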
4. Methodology
Our methodology integrates data collection, large language model (LLM) adaptation, and privacy-preserving distributed learning to enable secure and scalable labor market analysis. We present the technical workflow across data preparation, model architecture, fine-tuning, and privacy protection mechanisms, along with the analytical procedures used for labor market intelligence.
Figure 1 gives a general overview of our proposed pipeline.
4.1. Labor Market Data Processing
A robust data processing pipeline is essential for extracting meaningful and privacy-compliant insights from heterogeneous labor text sources. In this study, we design a unified preprocessing and normalization framework that accommodates multi-source, multilingual, and semi-structured inputs such as online job advertisements, resumes, freelancing platform profiles, and labor-related social media content.
We represent the full dataset as $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$, where each $x_i$ denotes an individual text sample and $y_i$ is a structured label corresponding to one or more analytical targets. Specifically, $y_i$ may encode an occupation code, a set of skill tags, or a salary range. The overall label space is defined as the union $\mathcal{Y} = \mathcal{Y}_{\mathrm{occ}} \cup \mathcal{Y}_{\mathrm{skill}} \cup \mathcal{Y}_{\mathrm{sal}}$, capturing the multi-task nature of the downstream modeling process. Data sources often differ in format and granularity; for instance, resume segments provide fine-grained skill mentions, whereas job advertisements emphasize employer requirements and wage expectations. To harmonize these heterogeneous sources, all text records are transformed into a canonical JSON structure with standardized metadata fields (e.g., posting date, region, sector, and source platform).
Each raw document undergoes a multi-stage normalization procedure. After character-level normalization and encoding unification (UTF-8), we apply lowercasing, punctuation handling, and stopword filtering using a domain-augmented stopword list that preserves informative tokens such as job titles and certification names. Tokenization is performed with a subword tokenizer—either SentencePiece or Byte-Pair Encoding (BPE)—to ensure vocabulary consistency across institutions and to reduce the out-of-vocabulary rate for emerging terminology. Part-of-speech tagging and dependency parsing are optionally applied to improve contextual alignment during named entity extraction.
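The normalization stages above can be sketched as follows; the stopword list and regular expression are simplified placeholders rather than our production configuration:

```python
import re
import unicodedata

# Simplified domain-augmented stopword list: common function words are
# filtered, while labor-relevant tokens (job titles, certification names)
# are deliberately kept out of the list so they survive normalization.
STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "for", "with"}

def normalize(text: str) -> list:
    # Character-level normalization and encoding unification, then
    # lowercasing, punctuation handling, and stopword filtering.
    text = unicodedata.normalize("NFKC", text)
    text = text.lower()
    # Replace punctuation with spaces, keeping '+' and '#' so that skill
    # tokens such as "c++" and "c#" remain intact.
    text = re.sub(r"[^\w+#]+", " ", text)
    return [t for t in text.split() if t not in STOPWORDS]
```

Subword tokenization (SentencePiece or BPE) is then applied to the cleaned token stream, which keeps vocabularies consistent across institutions.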
Entity Extraction and Ontology Alignment
To structure unlabeled text, we employ named entity recognition (NER) and span classification models pretrained on general corpora and further fine-tuned on domain-specific annotations. Extracted entities include occupations, skills, organizations, locations, and salary indicators. For each recognized entity $e$, we compute a semantic embedding using contextual encoders such as SBERT or domain-adapted BERT variants. These embeddings are then compared to entries in standardized taxonomies, such as ISCO-08, O*NET, or ESCO, using cosine similarity and fuzzy string matching. The final mapping to a taxonomy concept $c$ is determined via a hybrid similarity score:

$$s(e, c) = \lambda \cos(\mathbf{v}_e, \mathbf{v}_c) + (1 - \lambda)\,\mathrm{lex}(e, c),$$

where $\mathbf{v}_e$ and $\mathbf{v}_c$ denote the embedding vectors of the entity and concept, respectively, $\mathrm{lex}(e, c)$ is the fuzzy string-matching score, and $\lambda \in [0, 1]$ balances semantic and lexical similarity. Mapped entities are stored alongside confidence scores to enable downstream filtering and uncertainty-aware analysis. We select the trade-off parameter $\lambda$ on a held-out validation set of manually aligned occupation and skill labels. Specifically, we perform a small grid search over candidate values of $\lambda$ and choose the value that maximizes taxonomy-alignment accuracy. Performance is stable over a moderate range around the selected $\lambda$, indicating that downstream results are not overly sensitive to this hyperparameter.
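A minimal sketch of this hybrid matching, assuming toy three-dimensional embeddings and difflib's ratio as the fuzzy lexical score (the value λ = 0.7 is illustrative, not the tuned value):

```python
import math
from difflib import SequenceMatcher

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def hybrid_score(entity, concept, v_e, v_c, lam=0.7):
    # s(e, c) = lam * cos(v_e, v_c) + (1 - lam) * lex(e, c); difflib's
    # ratio stands in for the fuzzy string-matching component.
    lexical = SequenceMatcher(None, entity.lower(), concept.lower()).ratio()
    return lam * cosine(v_e, v_c) + (1 - lam) * lexical

# Toy 3-d embeddings; real ones come from SBERT or domain-adapted BERT.
entity, v_e = "ml engineer", [0.9, 0.1, 0.3]
taxonomy = {
    "machine learning engineer": [0.8, 0.2, 0.4],
    "chef de cuisine": [0.1, 0.9, 0.0],
}
best = max(taxonomy, key=lambda c: hybrid_score(entity, c, v_e, taxonomy[c]))
```

The winning score is retained as the mapping confidence, so low-similarity matches can be filtered out downstream.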
We treat names, email addresses, phone numbers, postal addresses, national identifiers and other obvious personal identifiers as PII. Before any local training, each client replaces detected PII spans with generic placeholders such as [NAME], [ORG], [LOC], or [CONTACT] using a standard NER pipeline supplemented by pattern-based rules. For the labor market tasks considered here (occupation, skills, salary) these tokens carry limited signal beyond coarse context, so the expected impact on predictive performance is small. This pretraining anonymization provides a first layer of protection and is complemented by client-level differential privacy on model updates, yielding a two-layer defense that aligns with GDPR-style requirements to minimize exposure of identifiable information while still enabling useful aggregate analytics.
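The pattern-based part of this masking can be sketched as follows; only email and phone rules are shown, and the NER component that catches names, organizations, and postal addresses is omitted:

```python
import re

# Pattern-based rules for obvious identifiers; in the full pipeline these
# complement an NER model that emits [NAME], [ORG], and [LOC] placeholders.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[CONTACT]"),   # emails
    (re.compile(r"\+?\d[\d\s().-]{7,}\d"), "[CONTACT]"),     # phone numbers
]

def mask_pii(text: str) -> str:
    # Replace detected PII spans with generic placeholders before any
    # local training, as the first layer of the two-layer defense.
    for pattern, placeholder in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```

Because the downstream tasks (occupation, skills, salary) rarely depend on the masked spans, this substitution costs little predictive signal.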
4.2. Domain-Specific Fine-Tuning
The adaptation of large language models (LLMs) to labor-economic text corpora is framed as a hierarchical optimization problem coupling unsupervised domain pretraining with multi-task supervised learning under constrained communication and privacy budgets. The process seeks to minimize empirical and distributional divergence between the pretrained general-domain model and the labor-specific data manifold, while preserving global stability across federated updates.
Let $f_\theta$ denote an LLM encoder parameterized by $\theta$ that maps tokenized sequences to contextual embeddings. Given a domain distribution $\mathcal{D}_{\mathrm{lab}}$ over text–label pairs $(x, y)$, our objective is to minimize a regularized empirical risk:

$$\min_{\theta}\; \mathbb{E}_{(x, y) \sim \mathcal{D}_{\mathrm{lab}}}\big[\ell(f_\theta(x), y)\big] + \gamma\, D_{\mathrm{KL}}\big(p_\theta \,\|\, p_{\theta_0}\big), \quad (1)$$

where $p_\theta$ denotes the model-induced token distribution, $p_{\theta_0}$ is the pretrained general-domain prior, and the KL term regularizes the divergence between parameter-induced posteriors to prevent catastrophic forgetting. Intuitively, the objective in (1) adapts the generic LLM to the specific distribution of labor market text, so that its representations encode domain-relevant regularities in occupations, skills, and wage mentions before any task-specific supervision is introduced.
Stage I: Domain-Adaptive Pretraining (DAPT). In the unsupervised adaptation stage, we minimize the expected token reconstruction error on an unlabeled corpus $\mathcal{U}$ drawn from $\mathcal{D}_{\mathrm{lab}}$. The training objective is a masked language modeling (MLM) criterion augmented with an entropy regularizer to encourage calibration of output distributions:

$$\mathcal{L}_{\mathrm{DAPT}}(\theta) = \mathbb{E}_{x \sim \mathcal{U}}\big[\mathcal{L}_{\mathrm{MLM}}(x; \theta)\big] - \beta\, \mathbb{E}_{x \sim \mathcal{U}}\big[H\big(p_\theta(\cdot \mid x)\big)\big], \quad (2)$$

where $H$ denotes Shannon entropy and $\beta$ controls entropy smoothing to prevent overconfident token predictions. To account for heterogeneous data sources and varying document lengths, we further incorporate a length-weighted sampling function $w(x) \propto |x|^{\alpha}$, where $\alpha$ is a tunable exponent that biases sampling toward longer, semantically richer job descriptions. The effective optimization thus minimizes $\mathbb{E}_{x \sim w}\big[\mathcal{L}_{\mathrm{DAPT}}\big]$. In practice we choose a moderate exponent $\alpha$ (fixed across experiments), which mildly upweights longer job descriptions without allowing a small number of extremely long documents to dominate the training signal. We verified that the induced sampling distribution preserves the overall length histogram and sector mix of the unlabeled corpus to within a few percentage points, indicating that the length-based reweighting does not introduce substantial dataset bias.
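The length-weighted sampler can be sketched as follows (the exponent α = 0.3 and the example postings are illustrative, not the values used in our experiments):

```python
import random

def length_weights(docs, alpha=0.3):
    # w(x) proportional to |x|^alpha: mildly upweights longer documents,
    # normalized into a probability distribution over the corpus.
    raw = [max(len(d.split()), 1) ** alpha for d in docs]
    total = sum(raw)
    return [r / total for r in raw]

docs = [
    "barista wanted",
    "senior data engineer building cloud pipelines " * 8,
    "retail associate for weekend shifts",
]
weights = length_weights(docs)
# Sampling for DAPT minibatches then follows these weights:
sample = random.Random(0).choices(docs, weights=weights, k=5)
```

With a moderate exponent the weights grow sublinearly in document length, so very long postings are favored only mildly rather than dominating the batch mix.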
Stage II: Multi-Task Supervised Fine-Tuning. Upon completion of DAPT, the model undergoes supervised adaptation across occupation, skill, and salary tasks defined over the labeled corpus $\mathcal{D}_{\mathrm{sup}}$. Let $h_\tau$ denote the task-specific head for task $\tau \in \{\mathrm{occ}, \mathrm{skill}, \mathrm{sal}\}$; the joint optimization is given by

$$\mathcal{L}_{\mathrm{MT}}(\theta) = \sum_{\tau} \lambda_\tau\, \mathcal{L}_\tau\big(h_\tau(f_\theta(x)), y_\tau\big), \quad (3)$$

where $\lambda_\tau$ are task-balancing coefficients.
Each task-specific loss is defined as follows:

$$\mathcal{L}_{\mathrm{occ}} = -\sum_{k} \big[y_k \log \sigma(z_k) + (1 - y_k) \log\big(1 - \sigma(z_k)\big)\big], \quad (4)$$

$$\mathcal{L}_{\mathrm{skill}} = -\log p_{\mathrm{CRF}}(\mathbf{y} \mid \mathbf{x}), \quad (5)$$

$$\mathcal{L}_{\mathrm{sal}} = \mathrm{CE}(\hat{b}, b) + \rho\, |\hat{s} - s|, \quad (6)$$

where $\sigma$ denotes the sigmoid activation, $z_k$ is the logit for occupation code $k$, $p_{\mathrm{CRF}}$ is the CRF likelihood over skill-span label sequences, and the salary loss combines cross-entropy over salary bins $b$ with an absolute-error regression term on the point estimate $s$, weighted by $\rho$. To stabilize multi-task gradients and avoid dominance of high-variance tasks, we employ uncertainty-based adaptive weighting:

$$\mathcal{L}_{\mathrm{MT}} = \sum_{\tau} \frac{1}{2\hat{\sigma}_\tau^2}\, \mathcal{L}_\tau + \log \hat{\sigma}_\tau, \quad (7)$$

where $\hat{\sigma}_\tau$ is a learned per-task uncertainty. Taken together, the losses in (3)–(7) implement a multi-task learning scheme in which a shared representation is trained to support three related economic tasks (SOC classification, skill extraction, and salary prediction). This encourages the model to capture common structures across tasks (e.g., co-occurrence of occupations and skills) while still allowing each head to specialize through its own task-specific loss.
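The adaptive weighting can be illustrated with a standard uncertainty-based formulation, in which each task loss is scaled by the inverse of a learned variance and a log-variance penalty discourages trivially large uncertainties; the exact parameterization in our implementation may differ:

```python
import math

def uncertainty_weighted_loss(task_losses, log_vars):
    # total = sum_t exp(-s_t) * L_t + s_t, where s_t = log(sigma_t^2) is a
    # learned per-task log-variance. High-variance (noisy) tasks receive a
    # smaller effective weight exp(-s_t), while the additive s_t term stops
    # the optimizer from inflating every variance to zero out the loss.
    return sum(math.exp(-s) * L + s for L, s in zip(task_losses, log_vars))

# With equal losses and equal (zero) log-variances, all tasks weigh alike:
balanced = uncertainty_weighted_loss([1.0, 1.0, 1.0], [0.0, 0.0, 0.0])
```

In training, the log-variances are parameters updated by backpropagation alongside the network weights, so the task balance adapts as the relative noise levels of the three heads change.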
Overall Objective
The full fine-tuning objective combining domain adaptation, multi-task learning, and information-theoretic constraints is thus

$$\mathcal{L}_{\mathrm{total}}(\theta) = \mathcal{L}_{\mathrm{DAPT}}(\theta) + \mu_1\, \mathcal{L}_{\mathrm{MT}}(\theta) + \mu_2\, \mathcal{L}_{\mathrm{IB}}(\theta) + \mu_3\, \|\theta - \theta_g\|_2^2, \quad (8)$$

where $\mu_1, \mu_2, \mu_3$ are balancing hyperparameters tuned via Bayesian optimization under privacy-aware validation protocols. In this combined objective, $\mathcal{L}_{\mathrm{DAPT}}$ encourages the backbone LLM to remain well-adapted to the unlabeled labor market corpus; $\mathcal{L}_{\mathrm{MT}}$ aggregates the supervised multi-task losses for SOC, skills, and salary; $\mathcal{L}_{\mathrm{IB}}$ acts as an information-bottleneck regularizer that favors compact, task-relevant representations; and the proximal term $\|\theta - \theta_g\|_2^2$, with $\theta_g$ the current global model, stabilizes local updates in the presence of client heterogeneity. The scalar weights thus trade off domain adaptation, supervised accuracy, representation compression, and cross-client stability.
4.3. Federated Learning with Differential Privacy
We formalize cross-institution training as a privacy-constrained distributed optimization problem. At communication round $t$, a central coordinator broadcasts the current global model $\theta^t$ to a random subset $\mathcal{S}_t$ of clients, each participating independently with probability $q$. Each participating client $m$ computes a clipped and privatized update on its local dataset and contributes only an encrypted summary to the server via secure aggregation.
4.3.1. Local Objective and Clipped Updates
Client $m$ minimizes a local objective

$$F_m(\theta) = \mathcal{L}_m(\theta) + \frac{\mu}{2}\, \|\theta - \theta^t\|_2^2, \quad (9)$$

where the proximal term constrains drift under heterogeneity. After $E$ local SGD epochs with stepsize $\eta$, client $m$ forms the tentative model delta

$$\Delta_m^t = \theta_m^{t, E} - \theta^t, \quad (10)$$

computed from per-minibatch gradients of $F_m$. For privacy and robustness we apply $\ell_2$-clipping at radius $C$:

$$\tilde{\Delta}_m^t = \Delta_m^t \cdot \min\!\big(1,\; C / \|\Delta_m^t\|_2\big). \quad (11)$$
In all experiments we set the default clipping radius $C$ to the value reported in Table 1, and we evaluate alternative radii in the sensitivity analysis. The choice is guided by the standard DP-SGD heuristic of keeping the empirical $\ell_2$-norms of client updates within a narrow range: we selected $C$ via a small grid search on the validation split so that (i) fewer than roughly half of the updates are clipped, and (ii) the resulting configuration satisfies a target privacy budget under the RDP ledger. We do not employ a more sophisticated optimization procedure beyond this principled empirical tuning.
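The clipping step itself is straightforward; a minimal sketch:

```python
import math

def l2_clip(update, C):
    # Scale the client delta so its l2 norm is at most the radius C;
    # updates already inside the ball are left untouched.
    norm = math.sqrt(sum(u * u for u in update))
    if norm == 0.0:
        return list(update)
    scale = min(1.0, C / norm)
    return [u * scale for u in update]

clipped = l2_clip([3.0, 4.0], C=1.0)  # norm 5 is rescaled to norm 1
```

Clipping bounds the sensitivity of each client's contribution to $C$, which is what the Gaussian noise calibration in the next subsection relies on.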
4.3.2. Client-Side Gaussian Mechanism and Secure Aggregation
Each participating client adds calibrated Gaussian noise to its clipped update,

$$\hat{\Delta}_m^t = \tilde{\Delta}_m^t + \mathcal{N}\big(0,\; \sigma^2 C^2 \mathbf{I}\big), \quad (12)$$

and sends $\hat{\Delta}_m^t$ under an additively homomorphic (or mask-based) secure aggregation protocol. The server only learns the aggregate

$$\hat{\Delta}^t = \sum_{m \in \mathcal{S}_t} \hat{\Delta}_m^t. \quad (13)$$

The global model update is then

$$\theta^{t+1} = \theta^t + \frac{\eta_g}{|\mathcal{S}_t|}\, \hat{\Delta}^t, \quad (14)$$

where $\eta_g$ denotes the server stepsize.
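A toy sketch of the client-side noising and server-side aggregation (the deltas, noise scale, and server stepsize are illustrative, and real secure aggregation would hide the individual noisy updates cryptographically):

```python
import random

def privatize(clipped_update, C, sigma, rng):
    # Client-side Gaussian mechanism: per-coordinate noise with standard
    # deviation sigma * C, matching the sensitivity C of a clipped update.
    return [u + rng.gauss(0.0, sigma * C) for u in clipped_update]

def secure_sum(updates):
    # What the server learns after secure aggregation: only the sum of the
    # noisy client deltas, never any individual contribution.
    return [sum(coord) for coord in zip(*updates)]

rng = random.Random(0)
client_deltas = [[0.5, -0.2], [0.4, -0.1], [0.6, -0.3]]
noisy = [privatize(d, C=1.0, sigma=0.8, rng=rng) for d in client_deltas]
aggregate = secure_sum(noisy)
# Global update: theta <- theta + eta_g * aggregate / |S_t|
theta_next = [t + 1.0 * a / len(client_deltas)
              for t, a in zip([0.0, 0.0], aggregate)]
```

Because the noise is added before aggregation, the guarantee is client-level: the server's view changes only slightly whether or not any single institution participates.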
4.3.3. Subsampled Rényi DP Accounting
Let
be the per-round randomized mechanism comprising Poisson subsampling with rate
q, clipping (
11), and client-side Gaussian noise (
12). For Rényi order
, the subsampled Gaussian mechanism admits the RDP parameter
We note that the bound in (15) follows from standard subsampled Gaussian RDP results: we apply privacy amplification by Poisson subsampling to the client-level Gaussian mechanism and then use the analytical moments accountant for Rényi DP.
By composition, the $T$-round RDP satisfies $\varepsilon_{\mathrm{RDP}}(\alpha) = \sum_{t=1}^{T} \varepsilon_t(\alpha)$. Converting to $(\varepsilon, \delta)$-DP yields
$$\varepsilon = \min_{\alpha > 1}\left[\varepsilon_{\mathrm{RDP}}(\alpha) + \frac{\log(1/\delta)}{\alpha - 1}\right]. \quad (16)$$
Equations (15) and (16) define a tight privacy ledger over $T$ rounds as a function of $(q, \sigma, T, \delta)$.
Let $\sigma_g^2$ denote the bounded variance of clipped local gradients and $\zeta^2$ the heterogeneity measure $\frac{1}{M}\sum_{m}\|\nabla F_m(\theta^\star) - \nabla F(\theta^\star)\|^2$ at a stationary point $\theta^\star$. With stepsizes $(\eta_\ell, \eta_g)$ chosen to satisfy standard stability conditions, the expected squared gradient norm after $T$ rounds obeys
$$\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\|\nabla F(\theta^t)\|^2 \le O\!\left(\frac{F(\theta^0) - F^\star}{\eta_g T}\right) + O\!\left(\eta_\ell E (\sigma_g^2 + \zeta^2)\right) + O\!\left(\frac{d\,\sigma^2 C^2}{qM}\right) + O\!\left(\omega\,\sigma_g^2\right), \quad (17)$$
where the third term captures the DP-induced variance and the last term arises from unbiased compression. (This heteroscedastic formulation treats task-specific uncertainties as independent; covariance terms between tasks are ignored to keep the objective simple and numerically stable. Explicitly modeling cross-task covariance would require estimating a full uncertainty matrix and is left for future work.) In general, Equations (15)–(17) define a simple privacy ledger that tracks, for each round, the contribution of client subsampling and Gaussian noise to the overall Rényi DP parameters, and then converts the accumulated RDP into an $(\varepsilon, \delta)$-DP guarantee for the entire training procedure. This makes the privacy budget explicit and comparable across different hyperparameter settings.
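A toy version of this ledger can be written directly from (15) and (16). We stress that the sketch below uses only the leading-order term of the subsampled-Gaussian RDP bound, which is an approximation valid for small $q$; a real deployment should use a full accountant (e.g., the analytical moments accountant in Opacus or TensorFlow Privacy). The function names are our own.

```python
import math

def rdp_per_round(q: float, sigma: float, alpha: int) -> float:
    """Leading-order subsampled-Gaussian RDP, cf. Eq. (15); valid for small q."""
    return 2.0 * q**2 * alpha / sigma**2

def eps_from_rdp(q: float, sigma: float, T: int, delta: float,
                 alphas=range(2, 128)) -> float:
    """Compose over T rounds, then convert RDP -> (eps, delta)-DP via Eq. (16)."""
    return min(T * rdp_per_round(q, sigma, a) + math.log(1.0 / delta) / (a - 1)
               for a in alphas)

# example: stricter noise or lower sampling rate tightens the budget
eps = eps_from_rdp(q=0.1, sigma=1.0, T=100, delta=1e-5)
```

The ledger makes the qualitative dependencies explicit: $\varepsilon$ grows with $T$ and $q$ and shrinks with $\sigma$, which is exactly the trade-off surface explored in the sensitivity analysis.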
4.3.4. Per-Layer Clipping and Adaptive Noise
To sharpen the privacy–utility trade-off, we employ per-layer clipping with radii $\{C_\ell\}$ and layerwise noise multipliers $\{\sigma_\ell\}$:
$$\tilde{\Delta}_m^{(\ell)} = \mathrm{clip}_{C_\ell}(\Delta_m^{(\ell)}) + \mathcal{N}(0, \sigma_\ell^2 C_\ell^2 I). \quad (18)$$
We allocate noise by sensitivity-aware budgeting, e.g., $\sigma_\ell \propto \hat{L}_\ell$, where $\hat{L}_\ell$ is an empirical Lipschitz proxy obtained from running gradient norms. The total RDP accumulates additively across layers. We note that although $\sigma_\ell$ may differ across layers, the privacy accounting in Section 5.1 is performed on the concatenated parameter vector, so the final $(\varepsilon, \delta)$ guarantee applies uniformly to the entire model. Per-layer allocation only redistributes the contribution of each layer to the total RDP; it does not induce different formal privacy levels for different components of $\theta$.
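Per-layer privatization is a small generalization of (11)–(12). The sketch below is illustrative: `privatize_per_layer` and `sigmas_from_lipschitz` are hypothetical names, and the choice to assign larger noise where the Lipschitz proxy is larger is our reading of the sensitivity-aware budgeting rule, not a prescribed formula.

```python
import numpy as np

def privatize_per_layer(deltas, radii, sigmas, rng):
    """Clip each layer to its own radius C_l, then add layerwise Gaussian noise."""
    out = []
    for d, C, s in zip(deltas, radii, sigmas):
        n = np.linalg.norm(d)
        clipped = d * min(1.0, C / max(n, 1e-12))
        out.append(clipped + rng.normal(0.0, s * C, size=d.shape))
    return out

def sigmas_from_lipschitz(proxies, base_sigma):
    """Sensitivity-aware budgeting (assumed form): noisier layers where the
    empirical Lipschitz proxy is larger, normalized to keep the mean multiplier."""
    p = np.asarray(proxies, dtype=float)
    return base_sigma * p / p.mean()

sig = sigmas_from_lipschitz([1.0, 3.0], base_sigma=1.0)  # -> [0.5, 1.5]
```

The normalization keeps the average multiplier at `base_sigma`, so per-layer allocation redistributes, rather than changes, the total RDP contribution, matching the accounting note above.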
4.3.5. Secure Aggregation Threat Model
We assume an honest-but-curious coordinator that observes only the aggregate $S^t$ in (13). Against a coalition of up to $k$ colluding clients (with $k < |\mathcal{M}_t| - 1$), the protocol reveals at most the noisy sum of the remaining $|\mathcal{M}_t| - k$ contributions. Since each client already privatizes its updates via (12), privacy is retained even if secure aggregation fails open; the latter primarily protects confidentiality against inference by small coalitions and provides robustness to dropouts. (This analysis is orthogonal to the differential privacy guarantees, which hold even if the secure aggregation layer is compromised.)
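The mask-based variant of secure aggregation mentioned above rests on a simple algebraic idea: each pair of clients agrees on a shared mask that one adds and the other subtracts, so all masks cancel in the sum. The sketch below illustrates only this cancellation property; a real protocol derives masks from pairwise key agreement and handles dropouts, which we omit.

```python
import numpy as np

def masked_updates(updates, rng):
    """Pairwise additive masking: each mask appears once with +, once with -,
    so individual vectors look random but the sum is preserved."""
    n = len(updates)
    masked = [u.astype(float).copy() for u in updates]
    for i in range(n):
        for j in range(i + 1, n):
            m = rng.normal(size=updates[0].shape)
            masked[i] += m   # client i adds the shared mask
            masked[j] -= m   # client j subtracts it
    return masked

rng = np.random.default_rng(0)
updates = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
masked = masked_updates(updates, rng)
# the coordinator can recover sum(updates) from sum(masked), nothing more
```

Against a coalition of colluding clients, only masks shared with coalition members can be stripped, which is why the leaked quantity degrades gracefully to the noisy sum of the remaining honest contributions.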
4.3.6. Participation Randomness and Amplification
Random participation further amplifies privacy. Under Poisson subsampling with rate $q$, the effective per-round sensitivity scales as $qC$ in expectation, and the RDP curve (15) tightens with smaller $q$. In practice we tune $(q, \sigma, T)$ to satisfy a target $(\varepsilon, \delta)$ in (16) while keeping the variance term in (17) below a user-specified utility threshold.
End-to-End Budget Management
Let $\varepsilon_{\mathrm{pre}}$ be the budget used for unsupervised adaptation with local unlabeled corpora and $\varepsilon_{\mathrm{ft}}$ the budget for supervised federated fine-tuning. The total ledger enforces $\varepsilon_{\mathrm{total}} = \varepsilon_{\mathrm{pre}} + \varepsilon_{\mathrm{ft}} \le \varepsilon_{\mathrm{target}}$, with each component computed via (16) using its own $(q, \sigma, T)$. (Here we conservatively model domain-adaptive pretraining and supervised fine-tuning as sequential mechanisms acting on potentially overlapping populations, and thus apply standard sequential composition, which yields an additive bound on $\varepsilon_{\mathrm{total}}$. If the underlying cohorts were disjoint, a more favorable parallel composition could be used instead; our choice covers the general case where some individuals contribute data to both stages.)
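Sequential composition makes the end-to-end ledger a one-line check. The sketch below is a minimal illustration under the additive-composition assumption stated above; `total_budget` is a hypothetical helper, and the numeric budgets are placeholders rather than values from our experiments.

```python
def total_budget(eps_pre: float, eps_ft: float,
                 delta_pre: float, delta_ft: float):
    """Sequential composition: (eps, delta) budgets add across stages."""
    return eps_pre + eps_ft, delta_pre + delta_ft

# placeholder budgets for the two stages
eps_total, delta_total = total_budget(2.0, 6.0, 1e-6, 1e-5)
feasible = eps_total <= 10.0  # check against a target eps_target
```

Because each component is itself computed from (16) with its own $(q, \sigma, T)$, the ledger lets practitioners trade pretraining budget against fine-tuning budget under a fixed institutional cap.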
5. Theoretical Analysis
This section provides a formal analysis of our training and inference pipeline, encompassing (i) privacy guarantees under client-side Gaussian mechanisms with subsampling and secure aggregation; (ii) optimization convergence under data heterogeneity, compression, and differential privacy (DP) noise; (iii) generalization via algorithmic stability with explicit dependence on the privacy parameters; and (iv) uncertainty propagation to downstream trend forecasting. Throughout, we assume losses are Lipschitz and bounded unless otherwise specified.
Let $\mathcal{A}$ denote the (randomized) federated training algorithm that maps a family of local datasets $\{D_m\}_{m=1}^{M}$ to a model $\hat{\theta}$ after $T$ communication rounds. The per-round participating set is $\mathcal{M}_t$, drawn with Poisson subsampling rate $q$, with client clipping radius $C$ and client-side Gaussian noise multiplier $\sigma$. We use $(\varepsilon, \delta)$ for $(\varepsilon, \delta)$-DP and $\varepsilon(\alpha)$ for Rényi DP (RDP) at order $\alpha$. Gradients are clipped in $\ell_2$-norm, and stochastic updates may be compressed by an unbiased operator $Q$ with variance parameter $\omega$.
5.1. Privacy Guarantees Under Subsampled Client-Side Gaussian DP
Lemma 1 (Per-round RDP of subsampled Gaussian mechanism).
Consider a single training round comprised of Poisson subsampling with rate $q$ over clients, $\ell_2$-clipping at radius $C$, and independent Gaussian noise $\mathcal{N}(0, \sigma^2 C^2 I)$ added to each participating client update before secure aggregation. Then for any order $\alpha > 1$, the mechanism admits the RDP parameter $\varepsilon(\alpha) \le \frac{2 q^2 \alpha}{\sigma^2} + O\!\left(\frac{q^3 \alpha^3}{\sigma^3}\right)$. Proof Sketch. Apply privacy amplification by subsampling to the client-level Gaussian mechanism and use the moments accountant/RDP composition of additive Gaussian noise. The $O(\cdot)$ term follows from higher-order terms in the subsampled RDP expansion. □
Theorem 1 (Composed privacy over
T rounds).
Let $\varepsilon_t(\alpha)$ denote the RDP parameter of round $t$. Then $\varepsilon_{\mathrm{RDP}}(\alpha) = \sum_{t=1}^{T} \varepsilon_t(\alpha)$. For any $\delta \in (0, 1)$, the composed mechanism is $(\varepsilon, \delta)$-DP with
$$\varepsilon = \min_{\alpha > 1}\left[\sum_{t=1}^{T} \varepsilon_t(\alpha) + \frac{\log(1/\delta)}{\alpha - 1}\right].$$
In our instantiation, the loss corresponds to (i) cross-entropy losses for occupation and skill prediction with logit clipping and label smoothing, and (ii) a squared loss for salary regression applied to normalized targets. Together with gradient clipping, these choices ensure that $\ell \in [0, 1]$ after rescaling and that $\ell$ is $L$-Lipschitz in $\theta$, so the conditions of Theorem 3 hold for all three tasks.
Corollary 1 (Layerwise budgeting). If each layer $\ell$ uses clipping radius $C_\ell$ and noise multiplier $\sigma_\ell$, then the per-round RDP decomposes additively across layers and across rounds. Consequently, for any partition of layers, one may allocate privacy budgets independently and sum them to obtain the global ledger.
Remark 1 (Security of aggregation vs. DP). Secure aggregation ensures the server observes only the sum of (already privatized) updates, protecting confidentiality against inference on individuals or small coalitions. However, DP holds even if secure aggregation fails open; hence, DP is the primary guarantee, while secure aggregation strengthens the threat model.
5.2. Optimization Under Heterogeneity, Compression, and DP Noise
We study the convergence of the proximal federated update with client heterogeneity, unbiased compression, and additive Gaussian noise. Let $F(\theta) = \frac{1}{M}\sum_{m=1}^{M} F_m(\theta)$ be the population objective and assume $L$-smoothness.
Assumption 1 (Bounded stochasticity and heterogeneity). There exist $\sigma_g^2, \zeta^2 \ge 0$ such that (i) the variance of clipped stochastic gradients is bounded by $\sigma_g^2$, and (ii) the heterogeneity measure at a stationary point $\theta^\star$ satisfies $\frac{1}{M}\sum_m \|\nabla F_m(\theta^\star) - \nabla F(\theta^\star)\|^2 \le \zeta^2$.
Assumption 2 (Unbiased compression). The compressor $Q$ is unbiased, $\mathbb{E}[Q(x)] = x$, with $\mathbb{E}\|Q(x) - x\|^2 \le \omega \|x\|^2$ for some $\omega \ge 0$, and employs error feedback.
Theorem 2 (Non-asymptotic convergence with DP and compression).
Let each round sample clients with rate $q$, perform $E$ local steps with stepsize $\eta_\ell$, and update globally with stepsize $\eta_g$. Under Assumptions 1 and 2 and standard stability conditions on $(\eta_\ell, \eta_g)$, after $T$ rounds the expected stationarity measure obeys the bound in (17). Proof Sketch. Adapt a variance-reduced analysis for federated proximal SGD, bounding client drift by the proximal term and controlling the additional variance from (i) DP noise ($d\,\sigma^2 C^2/(qM)$) and (ii) compression ($\omega\,\sigma_g^2$) using an error-feedback recursion. Subsampling contributes the $1/(qM)$ averaging factor. □
Proposition 1 (Trade-off surface). For a target stationarity tolerance $\epsilon_{\mathrm{opt}}$, feasibility requires the DP-variance term $d\,\sigma^2 C^2/(qM)$ to be at most $O(\epsilon_{\mathrm{opt}})$. Given a privacy target $(\varepsilon, \delta)$ via Theorem 1, one can invert the RDP ledger to obtain admissible triples $(q, \sigma, T)$ that lie on a Pareto surface balancing convergence and privacy.
5.3. Generalization via Stability and Differential Privacy
Let the loss $\ell(\theta; z) \in [0, 1]$ be bounded and $L$-Lipschitz in $\theta$ for each example $z$. We leverage the well-known connection that DP implies algorithmic stability.
Definition 1 (Uniform stability). An algorithm $\mathcal{A}$ is $\gamma$-uniformly stable if, for any neighboring datasets $S, S'$ differing in one example and for any $z$, $|\mathbb{E}[\ell(\mathcal{A}(S); z)] - \mathbb{E}[\ell(\mathcal{A}(S'); z)]| \le \gamma$.
Lemma 2 (DP ⇒ stability). If $\mathcal{A}$ is $(\varepsilon, \delta)$-DP, then it is $\gamma$-uniformly stable with $\gamma \le e^{\varepsilon} - 1 + \delta$. For $\varepsilon \le 1$, $\gamma = O(\varepsilon + \delta)$.
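The stability parameter implied by Lemma 2 is cheap to evaluate numerically, which is useful when scanning privacy budgets. The one-liner below is an illustrative sketch of the bound $\gamma \le e^{\varepsilon} - 1 + \delta$; the function name is our own.

```python
import math

def stability_from_dp(eps: float, delta: float) -> float:
    """Uniform-stability bound implied by (eps, delta)-DP: e^eps - 1 + delta.
    For small eps this is approximately eps + delta."""
    return math.exp(eps) - 1.0 + delta

gamma_tight = stability_from_dp(0.5, 1e-5)   # tighter privacy
gamma_loose = stability_from_dp(4.0, 1e-5)   # looser privacy
```

Evaluating the bound along a sweep of $\varepsilon$ values makes the privacy–generalization coupling in Corollary 2 concrete: halving $\varepsilon$ roughly halves $\gamma$ in the small-$\varepsilon$ regime.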
Theorem 3 (Generalization bound).
Let $\hat{\theta} = \mathcal{A}(S)$ be the output of the DP federated algorithm on a sample $S$ of size $n$ (the effective sample over clients and rounds). Then with probability at least $1 - \rho$,
$$\mathbb{E}_z[\ell(\hat{\theta}; z)] \le \frac{1}{n}\sum_{i=1}^{n} \ell(\hat{\theta}; z_i) + 2\gamma + (4n\gamma + 1)\sqrt{\frac{\log(1/\rho)}{2n}}.$$
The bound in Theorem 3 separates the contributions of optimization stochasticity, client heterogeneity, differential-privacy noise, and (optional) update compression. Under the smoothness and bounded-variance assumptions stated above, the rate matches standard nonconvex DP-SGD bounds up to constants when heterogeneity and compression are negligible, and degrades gracefully as the heterogeneity parameter $\zeta^2$ and compression variance $\omega$ increase. In highly non-IID regimes the bound is likely conservative, but it captures the empirical trends observed in our experiments: increasing the number of participating clients $M$ and rounds $T$ mitigates the effect of DP noise, whereas strong client drift and aggressive compression slow convergence.
Corollary 2 (Privacy–generalization coupling). For fixed $(C, T)$, tightening $\varepsilon$ via larger $\sigma$ or smaller $q$ increases optimization noise (Theorem 2) but improves stability $\gamma$, exposing an explicit privacy–utility–generalization triad. The optimal operating point depends on task curvature, heterogeneity $\zeta^2$, and downstream forecasting requirements.
5.4. Uncertainty Propagation to Forecasts Under DP Noise
Let $\tilde{\theta} = \theta^\star + \Delta\theta$ be the privatized model used to produce time-series intensities $\hat{y}_t = \varphi(\tilde{\theta}, x_t)$. Write the perturbation decomposition $\Delta\theta = \Delta\theta_{\mathrm{samp}} + \Delta\theta_{\mathrm{DP}}$, where $\Delta\theta_{\mathrm{samp}}$ is the sampling error and $\Delta\theta_{\mathrm{DP}}$ arises from injected noise.
Proposition 2 (Linearized predictive variance inflation).
Under a first-order delta approximation of the predictor map $\varphi$, the $h$-step predictive covariance obeys the discrete Lyapunov recursion
$$\Sigma_{h+1} = A\,\Sigma_h\,A^\top + J\,\Sigma_\theta\,J^\top,$$
where $A$ is the linearized forecast transition, $J$ is the Jacobian of $\varphi$ at $\theta^\star$, and $\Sigma_\theta$ is the parameter-space covariance induced by the DP mechanism. Thus, privacy induces an additive, horizon-dependent variance inflation in forecast space.
5.5. Complexity and Communication
Let $d$ be the number of model parameters. With unbiased compression ratio $r \in (0, 1]$ (i.e., transmitting $rd$ coordinates in expectation) and participation rate $q$, the expected per-round uplink cost is $O(qM \cdot rdb)$ bits for $b$-bit quantization (e.g., stochastic $b$-bit rounding). Error feedback ensures no first-order bias, contributing only the $\omega$ term in Theorem 2. Computation scales as $O(E\,|D_m|)$ gradient evaluations per participating client per round.
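The expected uplink cost above reduces to a one-line estimate, which is handy for sizing deployments. The sketch below is illustrative; `uplink_bits` is a hypothetical helper, and the example configuration (10 clients, T5-base-scale parameter count, 16-bit transmission) mirrors but does not reproduce our exact setup.

```python
def uplink_bits(M: int, q: float, d: int, r: float, b: int) -> float:
    """Expected per-round uplink: q*M participating clients each send
    r*d coordinates (in expectation) at b bits per coordinate."""
    return q * M * r * d * b

# e.g., 10 clients, sampling rate 0.5, ~220M params, no compression, 16-bit
cost_bits = uplink_bits(M=10, q=0.5, d=220_000_000, r=1.0, b=16)
cost_gb = cost_bits / 8 / 1e9  # convert bits to gigabytes
```

Halving either the compression ratio $r$ or the bit width $b$ halves the expected cost, which is why quantization-aware DP mechanisms are an attractive extension.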
In general, Lemmas 1 and 2 and Theorems 1–3 jointly establish that our federated protocol attains (i) an auditable client-level $(\varepsilon, \delta)$-DP guarantee; (ii) provable convergence rates that degrade gracefully with DP noise, heterogeneity, and compression; (iii) distribution-level generalization guarantees through DP-induced stability. Proposition 2 further shows how privacy noise propagates to downstream forecast uncertainty in a controllable, explicitly characterizable manner. Collectively, these results provide an interpretable blueprint for selecting $(q, \sigma, C, T)$ to meet prespecified privacy and accuracy targets under realistic cross-institution constraints.
In our implementation we transmit full-precision model deltas (with mixed-precision training on-device) and do not apply additional lossy quantization beyond standard numerical precision, so the analytical communication cost in (25) slightly overestimates the savings that could be obtained under aggressive gradient compression. Integrating quantization-aware DP mechanisms that jointly tune $(r, b, \sigma)$ to reduce bandwidth at fixed privacy and utility is a promising direction for future deployments, and is supported by the general form of Theorem 3 through the $\omega$ term.
6. Experiments
We conducted comprehensive experiments to evaluate the effectiveness, scalability, and privacy-preserving capability of our LLM-powered labor market analysis framework. We assessed performance across multiple tasks: occupation classification, skill extraction, and salary range estimation. Additionally, we evaluated our federated learning protocol under privacy constraints and analyzed system scalability across distributed nodes.
We organized the experiments around three main questions: (Q1) How does our Fed + DP configuration compare to centralized and non-DP baselines across the three tasks? (Q2) How do privacy and optimization hyperparameters affect the privacy–utility trade-off? (Q3) How does performance vary across head and tail segments of the labor market label space?
6.1. Settings
We designed the experimental setting to stress three axes of performance: (i) predictive utility on occupation classification, skill extraction, and salary estimation; (ii) end-to-end privacy under client-side DP with subsampling and secure aggregation; (iii) scalability under cross-institution heterogeneity. Unless stated otherwise, all experiments were repeated with three random seeds, with results reported as mean ± std.
6.1.1. Datasets
Sources and Coverage
We integrated four complementary corpora with distinct statistical profiles and governance constraints: (1) O*NET/SOC curated descriptions (structured, taxonomy-aligned); (2) Indeed postings (high-volume, noisy, rapidly drifting); (3) LinkedIn postings (moderate volume, richer metadata); (4) OpenSkills span annotations (fine-grained skill entities). Each source was deduplicated, anonymized, and normalized into a canonical schema with timestamps, regions, sectors, and ontology links (ISCO-08/SOC/ESCO). Please see Table 2 and Figure 2 for a dataset summary.
We used a time-based split to respect temporal causality: train up to June 2023, validation July 2023–September 2023, and test October 2023–June 2024. Skill-span evaluation uses stratified sampling by sector to reduce distributional mismatch. All reported metrics are for the test window.
Tokenization uses SentencePiece (32k vocab) for T5 and the native tokenizer for LLaMA-2-7B. Entity spans (skills, salaries) are pre-labeled via distant supervision and manually verified on 5k instances. All PII is replaced in-place with pseudo-tokens prior to any learning step.
6.1.2. Baselines
We compared our model against (i) TF-IDF + LR for classification; (ii) BERT-Base fine-tuned per task; (iii) JobBERT (labor-pretrained); and (iv) non-private Fed-BERT (FedAvg without DP). All baselines use identical splits and tokenization consistent with their architectures, as shown in Table 3.
TF-IDF uses 1–2-grams (100k features). BERT models use AdamW (batch 32) with a maximum sequence length of 256. Early stopping was based on validation F1 (occupation), span-F1 (skill), and RMSE (salary). See Figure 3 for training set details.
6.1.3. Configurations
Backbones and Multi-Task Heads
We evaluate T5-base (220 M) and LLaMA-2-7B. Occupation uses a softmax head over 200 SOC codes; skill extraction uses a span head with a CRF layer; salary uses a regression head predicting the midpoint and log-range. Unless otherwise noted, the following settings were used: AdamW with linear warmup over the first 5% of steps, batch 64, gradient clip 1.0, label smoothing 0.1 (occupation), and dropout 0.1. Mixed precision was enabled.
Federated Setup and Privacy
We simulate $M = 10$ clients with a heterogeneous size and sector mix (a Dirichlet prior over sector proportions). This stylized configuration mimics a cross-silo deployment with a small number of institutional partners (e.g., ministries, job boards, large firms) exhibiting sectoral imbalance, and yields client label marginals similar to the empirical skew observed in our data. We found that increasing $M$ (with the same Dirichlet prior) preserves the qualitative trends in the privacy–utility trade-off at the cost of higher communication overhead. Each round samples clients with rate $q$, with $E$ local epochs and $T$ total rounds. Client-side DP uses clipping radius $C$ and Gaussian noise multiplier $\sigma$ unless stated, with a moments accountant at fixed $\delta$. Figure 4 presents a visualization of heterogeneity across clients.
6.1.4. Evaluation Metrics
Occupation: micro/macro-F1 over 200 SOC codes.
Skills: span-level precision/recall/F1 with exact boundary match.
Salary: MAE and RMSE at the midpoint of posted/estimated ranges, reported in USD after inverting the log transform used during training. We trained the regression head on log-scaled midpoints (and log-range) to temper heavy-tailed effects, so that errors are approximately constant in relative terms across wage brackets while remaining interpretable as absolute differences once mapped back to the original scale.
We report $(\varepsilon, \delta)$ computed via the RDP/moments accountant under Poisson subsampling rate $q$, clipping $C$, noise $\sigma$, and rounds $T$. We measured training time per round and P95 inference latency per request. For fairness, latencies exclude network queuing and use the same GPU class. All metrics used are summarized in Table 4.
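The salary metric above requires inverting the log transform before computing absolute errors. The sketch below illustrates this step; the use of `log1p`/`expm1` as the specific log scaling is our assumption for illustration, and `mae_usd` is a hypothetical helper, not the evaluation code used in the paper.

```python
import numpy as np

def mae_usd(pred_log_mid: np.ndarray, true_mid_usd: np.ndarray) -> float:
    """Invert the (assumed) log1p scaling of salary midpoints, then compute
    MAE on the original USD scale, as reported in the tables."""
    pred_usd = np.expm1(pred_log_mid)
    return float(np.mean(np.abs(pred_usd - true_mid_usd)))

# predictions off by +/- 10k USD around a 50k midpoint
err = mae_usd(np.log1p(np.array([40_000.0, 60_000.0])),
              np.array([50_000.0, 50_000.0]))
```

Training in log space keeps errors roughly proportional across wage brackets, while reporting in USD keeps the headline MAE interpretable.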
6.2. Results
We report aggregate task performance, slice-based robustness, and privacy–utility behavior. Unless noted, results are averaged over three seeds with 95% CIs from bootstrap resampling on the test window (October 2023–June 2024). Centralized models use the same preprocessed inputs as federated models; federated results are computed on the global model after the final round (Table 5).
Table 6 extends the core comparison with micro/macro-F1 for occupation, span-F1 for skills, and MAE/RMSE for salary. Our centralized framework attains the best overall performance; the federated DP variant stays within 1.5–3.5% (relative) of the centralized model across tasks while satisfying a client-level privacy ledger (cf. Table 7). Improvements over JobBERT are larger on macro-F1, indicating better treatment of long-tail SOC classes.
We analyze robustness by (i) SOC head/tail buckets (top-50 vs. the remaining 150), (ii) sector groups (IT, healthcare, logistics), and (iii) region clusters (top-10 MSAs vs. others). The DP model preserves most of its advantages in head classes and exhibits the largest gap in the rare-class tail, consistent with privacy noise acting as additional regularization. We sweep the noise multiplier $\sigma$ to realize a range of target $\varepsilon$ values (moments accountant, fixed $\delta$) and observe a smooth Pareto front. Even at the strictest privacy budgets evaluated, occupation F1 remains above 0.86; at the most relaxed budgets, performance nearly matches that of the centralized model; see Figure 5 and Figure 6.
We further assess calibration (ECE) and residual structure (Durbin–Watson on salary errors). DP slightly increases variance but reduces overconfidence in head classes. For representative settings we list the realized $(\varepsilon, \delta)$ alongside task metrics. Tighter privacy (smaller $\varepsilon$) slightly reduces tail-class macro-F1 but keeps micro-F1 and salary MAE competitive.
6.3. Comparison with Prior Work
Table 8 and Table 9 position our framework against prior centralized models on all three tasks. As shown in Table 8, traditional TF-IDF + LR and BERT-Base baselines lag behind the labor-pretrained JobBERT model, whose occupation micro/macro-F1, skill F1, and salary MAE (USD 5010) make it the strongest prior baseline. Our centralized LLM framework improves all three metrics, reducing the salary MAE to USD 3890, while the Fed + DP variant remains competitive with an MAE of USD 4160.
Table 9 quantifies these gains relative to JobBERT, reporting the absolute increases in occupation micro/macro-F1 and skill F1 for the centralized model, together with the corresponding reduction in salary MAE; the Fed + DP model likewise delivers higher micro-F1 and lower MAE than JobBERT. Together, these comparisons indicate that our approach not only surpasses prior centralized state-of-the-art methods but also retains most of the accuracy when trained in a privacy-preserving federated setting.
6.4. Ablation Study
We dissect the contribution of architectural and training choices along three axes: (i) representation learning (DAPT, prompts, CRF), (ii) multi-task coupling (loss weights, gradient orthogonalization), and (iii) privacy/federation mechanisms (clipping $C$, noise $\sigma$, client rate $q$, secure aggregation). All ablations follow the settings in Section 6.1 and were evaluated on the time-held-out test window. Unless stated otherwise, the presented values are means over three seeds with bootstrap 95% CIs.
Table 10 quantifies the effect of removing one component at a time from the full
Fed + DP model. Domain-adaptive pretraining (DAPT) and multi-task coupling are the largest contributors across tasks; gradient orthogonalization primarily improves macro-F1 by stabilizing tail classes. The CRF decoder benefits span-level skill extraction but has negligible impact on occupation classification.
We swept $(\sigma, C, q)$ while holding the number of rounds $T$ fixed to study the DP–utility surface. Table 11 reports the realized $(\varepsilon, \delta)$ and task metrics, and thus provides an empirical approximation of the privacy–utility surface induced by our ledger over a discrete grid of $(\sigma, C, q)$ settings. Deriving closed-form optimality conditions for the choice of $(\sigma, C, q)$ would require additional structural assumptions on the loss landscape and client heterogeneity, and is left for future work. Larger noise $\sigma$ or smaller clipping $C$ strengthens privacy but increases error; higher participation $q$ improves utility at the cost of a larger $\varepsilon$.
Figure 7 shows a waterfall-style decomposition of occupation micro-F1 starting from a non-private centralized baseline to the fully private federated system, adding or removing one mechanism at a time. The most significant deltas correspond to DAPT and multi-task coupling; DP noise introduces a smaller, monotone drop.
Table 12 depicts an interpolated surface of occupation micro-F1 over $(q, \sigma)$ at fixed $C$ and $T$. Utility increases with $q$ and decreases with $\sigma$; contour lines indicate constant $\varepsilon$ levels from the privacy ledger, illustrating feasible operating regimes.
6.5. Discussion
From both the convergence analysis in Theorem 2 and the empirical results, we can infer how the proposed approach scales to a larger number of clients M and more diverse datasets. Increasing M at a fixed client sampling rate q reduces variance in the aggregated updates and improves robustness to DP noise, but it may also amplify heterogeneity effects when new clients differ strongly from the existing population. Similarly, incorporating additional sectors or countries increases the coverage and policy relevance of the resulting indicators, but can make optimization more challenging if label distributions become highly unbalanced. These trade-offs suggest that future large-scale deployments should combine the present framework with client clustering or personalized heads to better accommodate strong cross-client diversity.
6.6. Limitations
While our results suggest that an LLM-based, federated, and DP-protected framework can deliver accurate and scalable labor market intelligence, several limitations remain. First, our empirical evaluation focuses on English-language data from a limited set of platforms and taxonomies (O*NET, SOC, ESCO), so generalization to low-resource languages, informal sectors, or alternative occupational systems is not guaranteed. Second, the federated experiments are conducted in a controlled simulation with ten cross-silo clients and an honest-but-curious coordinator, which does not capture all operational challenges of real deployments, such as highly unbalanced institutions, client churn, adversarial behavior, or stricter legal constraints. Third, our theoretical analysis relies on standard assumptions (bounded variance, Lipschitz losses, unbiased compression) and moderate model sizes; more extreme non-IID regimes, ultra-large models, or aggressive compression schemes may violate these conditions and lead to different privacy–utility trade-offs. Fourth, although we quantify DP-induced utility loss at the task level, we do not systematically study fairness or distributional impacts across demographic groups, nor do we fully characterize how PII removal and anonymization affect downstream outcomes. Finally, the framework currently targets three supervised tasks (occupation, skills, salary) and does not incorporate human-in-the-loop feedback, causal inference, or richer behavioral data; extending the system in these directions, while preserving rigorous privacy guarantees and governance, is an important avenue for future work.
7. Conclusions and Future Work
This work introduced a comprehensive, privacy-preserving, and scalable framework for labor market analysis powered by large language models (LLMs). By integrating domain-adaptive pretraining, federated learning, and differential privacy into a unified analytical pipeline, our system enables the extraction of structured, interpretable, and timely labor insights from heterogeneous textual data sources such as job postings, resumes, and workforce reports. The proposed framework addresses the dual challenge of high analytical precision and stringent privacy protection, offering a practical solution for institutions constrained by data-sharing restrictions or regulatory compliance requirements.
Through extensive experiments across multiple real-world datasets—including O*NET, SOC, and millions of contemporary job postings—we demonstrated that our framework achieves superior performance in occupation classification, skill extraction, and salary estimation tasks, outperforming both traditional and transformer-based baselines. The federated variant, equipped with differential privacy mechanisms, achieves nearly comparable performance to centralized models, establishing that strong privacy guarantees need not come at the cost of analytical utility. Moreover, our scalability analysis confirmed the system’s suitability for distributed deployments across organizations or geographic regions, with minimal latency and near-linear efficiency under realistic federated configurations.
Beyond quantitative improvements, the qualitative evaluations revealed that the LLM-generated labor trend summaries provide meaningful, policy-relevant insights. Expert reviewers—comprising labor economists and HR professionals—rated these summaries highly in factual accuracy, interpretability, and relevance, attesting to the framework’s potential for supporting evidence-based decision making in workforce development, talent management, and macroeconomic forecasting. The system’s interpretability and modular design also enable transparent policy communication and reproducibility in data-driven labor analytics.
From a broader perspective, this research bridges advances in privacy-preserving machine learning and applied labor economics. It highlights how decentralized and secure computation paradigms can empower multi-institutional collaborations without compromising individual or organizational data sovereignty. Future directions include incorporating secure multiparty computation (SMPC) for cryptographic aggregation, developing dynamic ontology alignment between evolving occupational taxonomies, and integrating reinforcement learning from human feedback (RLHF) to refine model interpretability and ethical alignment. We envision this line of work as a foundation for the next generation of socially responsible, AI-driven labor intelligence systems that balance analytic depth, fairness, and privacy preservation at scale.