Anticipating Job Market Demands—A Deep Learning Approach to Determining the Future Readiness of Professional Skills

Abstract: Anticipating the demand for professional job market skills needs to consider trends such as automation, offshoring, and the emerging Gig economy, as they significantly impact the future readiness of skills. This article draws on the scientific literature, expert assessments, and deep learning to estimate two indicators of high relevance for a skill’s future readiness: its automatability and offshorability. Based on gold standard data, we evaluate the performance of Support Vector Machines (SVMs), Transformers, Large Language Models (LLMs), and a deep learning ensemble classifier for propagating expert and literature assessments on these indicators to yet unseen skills. The presented approach uses short bipartite skill labels that contain a skill topic (e.g., “Java”) and a corresponding verb (e.g., “programming”) to describe the skill. Classifiers thus need to base their judgments solely on these two input terms. Comprehensive experiments on skewed and balanced datasets show that, in this low-token setting, classifiers benefit from pre-training and fine-tuning and that increased classifier complexity does not yield further improvements.


Introduction
Automation, offshoring, and the emerging Gig economy instigate and intensify labor market disruptions. The ongoing trends of automating repetitive tasks and offshoring continue to reshape traditional job roles and workforce dynamics, changing skill requirements in the labor market. Research by Bick et al. [1] indicates that 60% of occupations have at least 30% of work activities that could be automated.
Automation has triggered a growing demand for technical skills such as data analysis, artificial intelligence, and machine learning, while offshoring amplifies the relevance of intercultural communication and global collaboration skills. The rise of the Gig economy accentuates the significance of adaptability, self-management, and entrepreneurship, particularly as individuals navigate short-term projects and roles. Gig economy platforms like Fiverr, Upwork, and Freelancer compel both employers and employees to adapt to more flexible working structures. However, as certain tasks become automated or outsourced, routine skills for repetitive operations witness declining demand. This underscores the need for continuous upskilling and reskilling in increasingly competitive job markets.
This article aims to provide decision-makers with insights into the future readiness of professional skills. The presented approach resonates with the objectives outlined in the United Nations Sustainable Development Goals (SDGs) (https://sdgs.un.org/goals, accessed on 1 March 2024) by (i) providing information on a skill's future readiness to guide educational activities and increasing the number of people with relevant skills (SDG 4: Equitable Quality Education) and (ii) helping to better align skill supply and demand to promote sustainable economic growth and productive employment (SDG 8: Decent Work and Economic Growth; see Section 7).
Evaluating a skill's future readiness in terms of its resilience to automation and outsourcing requires the development of classifiers capable of automatically assessing the skill across these two dimensions. The presented work draws upon a skill ontology that characterizes skills with bipartite skill labels consisting of skill topic and skill verb (e.g., Java [topic] Programming [verb]). These labels, although more detailed than topic-only labels, still lack additional context and challenge classifiers, which must rely on just two terms for predictions. This impacts the design and effectiveness of classifiers, as ensemble models might fail to improve the outcome with such limited input.
Building on previous model comparisons for assessing the future readiness of skills [2], the presented research has been conducted within the Future of Work project (https://semanticlab.net/future-of-work, accessed on 1 March 2024), which investigates the performance of multiple classification approaches (Support Vector Machines (SVMs), Transformers, Large Language Models (LLMs), and a deep learning ensemble classifier) toward reliably propagating expert and literature assessments on automatability and offshorability to yet unseen skills.
The remainder of this paper is structured as follows: Section 2 provides an overview of related work, Section 3 states the problem, and Section 4 describes the methods applied in this study. Section 5 presents the gold standard datasets, evaluation settings, and evaluation results, followed by the conclusion and outlook in Section 7.

State of the Art
The literature discussed in this section is focused on (i) anticipating job market demands; (ii) skill classification systems with a focus on skill bases such as taxonomies and ontologies, custom skill bases for automation and custom skill bases for offshorability; and (iii) a brief overview of recent deep learning models for classification, such as Transformer models and LLMs.

Anticipating Job Market Demands
Anticipating job market demands is a difficult proposition. The lockdown measures imposed during the COVID-19 pandemic, for instance, triggered declines in labor demand of up to 30% [3]. A 2023 study estimates that only 30.7% of current jobs are likely to remain unaffected by disruptions caused by generative artificial intelligence (AI) such as ChatGPT. The study's authors expect that a majority of jobs will either be fully or partially impacted by generative AI [4]. Another 2023 study states that Industry 5.0 requires highly skilled individuals with various soft skills (e.g., communication, teamwork, emotional intelligence) capable of collaborating with both humans and machines [5]. Such economic megatrends tend to disrupt and reshape job markets, often acting together (e.g., the twin impact of COVID-19 and AI [6]) and requiring both employers and workers to adapt.
A recent survey on the use of data mining techniques to predict student employability [7] identifies the following challenges: (i) a focus on gender instead of psychometric attributes; (ii) unbalanced and incomplete datasets; (iii) studies based on heuristics; (iv) limited scalability, as participants usually come from the same environment (e.g., universities); and (v) limited reproducibility, since data and source code are not publicly available. Most of the studies investigated in the survey center on students and lack integration of online data sources, such as job advertisements or CVs. This tendency could potentially be attributed to the nature of the study cohort. For instance, there might be fewer online job opportunities directly targeting students. Notably, studies focusing on workers, as exemplified by [8], predominantly draw from online sources.
Khaouja et al. [8] survey skill identification technology, one of the pillars of predicting job market demands. They define the following objects: (i) online data sources (e.g., job ads, academic curricula, CVs); (ii) skill bases developed by experts (e.g., public ontologies and taxonomies) and customized skill bases (e.g., manually built or based on embeddings); (iii) skill identification methodologies (e.g., skill count, topic modeling, embeddings, ML-based); (iv) evaluation metrics (e.g., precision, recall, F1 for binary classification tasks or MRR for multi-label classification tasks); (v) skill identification granularity (e.g., sentences, n-grams, sentences and n-grams); and (vi) industry sectors (e.g., IT, engineering, healthcare, multiple sectors). The second part of the survey focuses on the various applications of this technology, such as market analysis, curricula development, job recommendation engines, talent search, skill demand prediction, and gender bias identification. The article ends with an overview of future work, which includes deep learning, graph embeddings, and the generation of skill bases.
Most of the articles present global trends, but job demand forecasting is typically local. Some recent studies, for example, focused on analyzing the Norwegian IT market to improve the computing curriculum [9] or assessing skill demand in the Lithuanian market by analyzing job ads with clustering and NLP techniques [10].

Skill Classification
Skill bases appeared due to the necessity of classifying jobs listed in statistical reports. The first classification schemas focused on occupations, whereas the most recent ones are developed around skills.
Custom-built skill bases allow assessing skills regarding specific use cases, such as assessing their susceptibility to automation. They can be manually built (e.g., lists of predefined terms), extracted from word embeddings, or even generated from existing knowledge bases. The remainder of the section discusses popular skill bases as well as two types of customized skill bases that focus on automatability and offshorability.

Skill Bases
Public skill classifications and skill bases developed by experts are good starting points for the discussion of skill classification [8]. Well-known classification schemes include O*NET, ESCO, and ROME. These skill bases are built around ontologies and supported by organizations which are publicly funded by the US (e.g., O*NET), the European Union (e.g., ESCO), or national governments (e.g., ROME is supported by the French government).
Skill identification technology [8] has been a foundational technology for building these widely used skill bases. Nevertheless, special use cases such as forecasting and predicting the impact of trends such as automation and offshoring may require custom classification systems, such as the ones discussed in the remainder of this section.

Customized Skill Bases for Automation
Autor and Dorn [11] argue that jobs containing repetitive tasks are more likely to be automated. Josten and Lordan [12] identify a set of indicators that might impact automation, namely people (e.g., if the job requires daily interaction with people), brains (e.g., if abstract reasoning is required), and brawn (e.g., if physical interaction with certain objects is needed). Josten and Lordan [13] classified O*NET professions based on their degree of automatability (e.g., fully, partially, or not automatable). The main goal of their work was better alignment with the European labor force survey (https://ec.europa.eu/eurostat/web/microdata/european-union-labour-force-survey, accessed on 1 March 2024), which covers the period between 2013 and 2016. Based on a regression analysis, they established that occupations that require brains (e.g., abstract reasoning) are better protected from automation, especially when compared to occupations that require daily interaction (e.g., people skills) and physical interaction (e.g., brawn). Combining these factors can also decrease the likelihood of automation.
Eloundou et al. [14] investigated the impact of LLMs such as GPT and BLOOM [15] on the labor market. They note that routine and repetitive tasks have a high risk of technology-driven displacement. Brynjolfsson et al. [16] distinguish between the labor-augmenting and labor-displacing effects of automation. Eloundou et al. [14] use exposure as the main factor in their automation risk classification. They consider three cases: no exposure (e.g., minimal or no time reduction for completing a task using an LLM or software agent), direct exposure (e.g., LLMs reduce task completion times by 50 percent), and indirect exposure (e.g., if productivity rises and can be doubled with the help of a software agent). Routine tasks in automatable domains such as data processing, data science, or information processing exhibit a high chance of being displaced, in contrast to agriculture, manufacturing, or mining tasks. ChatGPT shows promising results in a wide variety of programming tasks [17], supporting the hypothesis that programming and technology-related jobs will be disrupted by automation. However, the idea that people need domain knowledge to author software is also gaining traction, suggesting professional reconversion may be a solution to averting job losses [18].
Nevertheless, considering previous technological breakthroughs in areas such as agriculture and the industrial revolution leads to the conclusion that the overall impact of AI is difficult to assess at this point since it might take several decades to unfold, as it requires the development of new processes and business models [19,20].

Customized Skill Bases for Offshorability
Wagner's work [21] on digital talent platforms (e.g., Freelancer, Upwork, and Fiverr) provided important insights towards preparing the offshorable task gold standard. Digital talent platforms help employers meet unplanned needs for knowledge work services [22]; lower the need for permanent positions by hiring specialized workers for specific contracts [23]; and fill hiring gaps that are not addressed by traditional hiring strategies [24].
Dunn [25] classifies Gig economy platforms into: (i) low-skill (e.g., Uber, Bold, TaskRabbit) or high-skill location-dependent (e.g., Outschool or Tutoroo for private lessons) and (ii) low-skill (e.g., Amazon Mechanical Turk) or high-skill (e.g., Fiverr and Upwork) location-independent services. These services can be used as starting points for assessing offshorability, as all the tasks listed in their catalogs have already been offshored successfully.
While this article focuses on the effects of automation on offshoring (relocating manufacturing to other countries), the opposite strategy, reshoring (relocating manufacturing back home), is equally relevant. Pinheiro et al. [26] conducted a meta-regression analysis of the published research. They analyze four major automation trends (cost advantage, increased productivity, robots, and Industry 4.0) and their influence on offshoring and reshoring. After the pandemic, companies seem to favor reshoring, but they do use offshoring in cases where automation plays a central role (e.g., information technology). They argue that the offshoring vs. reshoring decision depends on internal decision-making chains rather than global trends.

Deep Learning for Text Classification
The Transformer architecture [27] has changed the landscape of NLP, as almost all the classic NLP tasks (e.g., classification, dependency parsing, sentiment analysis, named entity recognition, or question answering) can be implemented with it. Transformer-based language models (e.g., BERT [28], DistilBERT [29], or RoBERTa [30]) employ a self-attention mechanism to capture the context and relationships among words. This self-attention mechanism enables them to assess the significance of individual words within a sentence, prioritizing semantically meaningful tokens while filtering out irrelevant noise [31]. Bommasani [32] even calls pre-trained Transformers foundation models because most NLP tasks can be designed around them. Still, they also require adaptation to domain-specific tasks (e.g., text classification, sentiment analysis, etc.).
A large-scale survey by Minaee et al. [33] presents most of the deep learning architectures widely used for text classification, from LSTMs to Transformers. A taxonomy of Transformers can be found in [27] and includes most models used for classification until late 2021. Some specialized surveys are also available. One survey [34] examines the role of embeddings in text classification. The last few years have also seen the rise of hybrid architectures that combine sequence-to-sequence or graph neural networks with Transformers, as described in Pham et al. [35]. Another recent survey [36] examines text classification models in the context of designing spam filters.
Large Language Models (LLMs) exceed Transformers in size (i.e., over 10 billion parameters [37]) and apply training strategies such as instruction tuning and adaptation tuning to enable instruction following and zero-shot capabilities. The GPT 3/4 models (https://chat.openai.com, accessed on 1 March 2024) [38] exhibit so-called emergent capabilities which further improve their capability to correctly interpret human language and, therefore, pave the way for even more advanced text classification systems [37].
Using AI tools for classification or related generative processes requires considering issues such as transparency and accountability. Since generative tools can easily reuse or remix text, code, or images on demand, it is important to know more details about the training process so that the generated artifacts can be traced back to the training corpora. An early review of LLM transparency and accountability can be found in [39].

Problem Statement
The recent literature highlights the absence of a unified skill framework that includes technical and soft skills, as well as of a clear structure that covers skill gaps, shortages, and mismatches [40]. The presented work aims to address this gap with a method capable of assessing the future readiness of professional skills, considering the following two dimensions: (i) automatability (i.e., the extent to which tasks can be performed by automated systems) and (ii) offshorability (i.e., the feasibility of performing tasks off-site, often for cost-saving purposes). We therefore develop automatic classifiers that assess a skill's automatability and offshorability based on its label. The classifier then complements and significantly extends manual classifications provided in a gold standard dataset that has been assembled by domain experts based on their assessments and the scientific literature (Section 5.1.1). Automatic classification techniques provide significant cost and time savings compared to manual annotation processes.
Khaouja et al. [8] distinguish three different levels of skill label granularity: words (e.g., "Java"), multi-word phrases (e.g., "Java Programming"), and sentences (e.g., "Experience in software development, particularly the development of Java Web applications."). The presented approach draws upon German bipartite skill labels that are part of a proprietary occupation knowledge base which formalizes domain knowledge [41] in the human resources domain. These labels are multi-word but limited to skill topic and verb (e.g., Java [topic] Programming [verb]) and do not convey any additional context information. Nevertheless, bipartite skill labels are still considerably more expressive than standard skill specifications which operate on word granularity, only comprise a topic, and therefore cannot distinguish nuances such as the difference between Java [topic] Programming [verb] and Java [topic] Teaching [verb]. Table 1 lists example skills classified by their automatability and offshorability, taken from the customized skill bases. The absence of supplemental context information presents a significant challenge since classifiers solely rely on the two terms used in the bipartite skill labels for predictions. This limitation directly impacts classifier design and performance (Section 5). Standard approaches for improving classification performance, such as ensemble models, cannot yield better results since their enhanced generalization capabilities do not translate into better outcomes if the model input is restricted to two terms.

Method
The presented research involves developing, fine-tuning, and evaluating four methods for classifying skills represented by bipartite skill labels in regard to their offshorability and automatability. The evaluation aims to find the optimal trade-off between method complexity and performance and considers the following methods:

• A Support Vector Machine (SVM; Section 4.1) serves as a competitive baseline approach;
• The more complex Transformer-based classifier (Section 4.2), which is expected to considerably benefit from pre-training;
• An approach that builds upon a Large Language Model (Section 4.3) by leveraging ChatGPT. This method draws upon few-shot learning and has the advantage of requiring a considerably lower number of training examples;
• An ensemble model (Section 4.4) which combines a Transformer with multiple fully connected neural networks. The ensemble then employs majority voting for overall skill assessment. This is the most complex model considered in the evaluation.

Baseline Classifier
A Support Vector Machine in conjunction with FastText word embeddings (the fasttext-wiki-news-subwords-300 model from Gensim; https://pypi.org/project/gensim, accessed on 1 March 2024) acts as a baseline classifier. The input is tokenized with the Natural Language Toolkit (NLTK) library and converted to FastText embeddings to obtain the SVM input representation. The resulting NumPy arrays are used to train the SVM (https://scikit-learn.org/stable/modules/svm.html, accessed on 1 March 2024). A four-fold cross-validation strategy is then used for training and evaluation.
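The conversion of a bipartite skill label into a fixed-length classifier input can be illustrated with a minimal sketch. It assumes that token vectors are combined by mean pooling, which the text does not specify. The toy three-dimensional embedding table, the whitespace tokenizer (a stand-in for NLTK), and the name `embed_label` are likewise hypothetical.

```python
from statistics import fmean

# Toy 3-dimensional embedding table standing in for the 300-dimensional
# FastText vectors; the numbers are made up for illustration only.
TOY_EMBEDDINGS = {
    "java": [0.9, 0.1, 0.0],
    "programming": [0.7, 0.3, 0.2],
    "teaching": [0.1, 0.8, 0.4],
}
DIM = 3


def embed_label(label: str) -> list[float]:
    """Mean-pool the vectors of all known tokens in a bipartite skill label.

    Mean pooling is an assumption of this sketch, not the paper's stated method.
    """
    tokens = label.lower().split()  # simple stand-in for NLTK tokenization
    vectors = [TOY_EMBEDDINGS[t] for t in tokens if t in TOY_EMBEDDINGS]
    if not vectors:
        return [0.0] * DIM  # no known token: fall back to a zero vector
    # zip(*vectors) groups the values of each dimension across tokens
    return [fmean(component) for component in zip(*vectors)]


# One feature vector per skill label; such rows would form the SVM input.
features = [embed_label(s) for s in ("Java Programming", "Java Teaching")]
```

Because both labels share the topic "Java", their feature vectors differ only through the verb, which mirrors the low-token setting discussed in the problem statement.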

Transformers
We used three Transformer models (BERT, DistilBERT, and XLM-RoBERTa) from the Hugging Face library (https://huggingface.co, accessed on 1 March 2024) in the experimental section. The following section describes the pre-tests that have been conducted on a randomly selected subset of 434 bipartite labels from the random selection gold standard dataset (Section 5.1.1) to obtain the optimal hyperparameter settings for the Transformer classifier.
The models were trained on a large-scale multilingual corpus to improve multilingual performance and are capable of handling complex German vocabulary, idioms, and syntactic structures. The final models are implemented in PyTorch, which seamlessly integrates with the Hugging Face library, and are optimized through (i) domain adaptation, (ii) model fine-tuning, and (iii) automated hyperparameter optimization.
Adapting Transformers to a target domain can lead to increased robustness to noise or better feature alignment [42]. Exposure to domain-specific documents before fine-tuning enables the model to closely align itself with the target text corpus. This alignment usually improves the model's understanding of vocabulary, phrasing, and linguistic nuances and reduces the likelihood of semantic misinterpretations or mismatches. A dataset of 150,366 Swiss job postings was used for domain adaptation. The postings, which cover diverse industries and job roles, were converted to text with the Inscriptis library [43].
Assessing the performance of the fine-tuned models with and without domain adaptation allows evaluating the effectiveness of the adaptation process. Table 2 summarizes the effectiveness of domain adaptation for the offshorable classification task. Table 3 presents the corresponding results for classifying a skill's automatability. Model fine-tuning followed the common approach of freezing certain layers [44] while the remaining layers were updated with task data. Tables 2 and 3 summarize the impact of layer freezing on the model performance.
For hyperparameter optimization, we used the Optuna framework [45], one of the earliest tools that offered streamlined sampling and pruning algorithms. Adding Optuna improves the efficiency of our pipeline. Table 4 provides a summary of the Transformer hyperparameters used in the experimental section. The final experiments drew upon the DistilBERT classifier (without domain adaptation and layer freeze), which provided the best results for the offshorable indicator and a decent performance for the automatable label while requiring the fewest resources for training, making it the most efficient choice for our specific task.

Large Language Model with Heuristic Classifier
The LLM-based approach builds upon the GPT API to classify the automatability and offshorability of skills (Figure 1). LLMs based on GPT models are considerably more complex than Transformer models and harder to adapt to specific domains [38]. They are useful for in-context learning from prompts, especially in few-shot settings, but they are sometimes saddled with fairness issues (e.g., stereotypes, biases, errors) [46]. The example prompt requests an assessment of whether the task experiment planning needs to be performed on-site and provides examples for skills with a positive (monitor production order, maintain buildings, clean object, and measure tunnel) and a negative (update customer file, produce door, ensure analysis quality, and coordinate program area) assessment.
Each prompt is built around a single indicator from Josten and Lordan's list [12], as combining the indicators was found to lead to worse results. The indicators were extracted for each skill using the prompts and then used to calculate with a heuristic whether the skill is automatable or offshorable.
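The structure of such a single-indicator few-shot prompt can be sketched as follows. The few-shot examples are those quoted above for the on-site indicator; the prompt wording, the yes/no answer format, and the helper names `build_prompt` and `parse_indicator` are assumptions made for illustration. The sketch only assembles the prompt text and parses a reply, and deliberately omits the API call.

```python
# Few-shot examples quoted in the text for the on-site (location) indicator.
ON_SITE_EXAMPLES = [
    "monitor production order",
    "maintain buildings",
    "clean object",
    "measure tunnel",
]
REMOTE_EXAMPLES = [
    "update customer file",
    "produce door",
    "ensure analysis quality",
    "coordinate program area",
]


def build_prompt(skill: str, question: str,
                 positive: list[str], negative: list[str]) -> str:
    """Assemble a few-shot prompt that asks about exactly one indicator."""
    shots = [f"Task: {example}\nAnswer: yes" for example in positive]
    shots += [f"Task: {example}\nAnswer: no" for example in negative]
    # The skill to assess comes last and leaves the answer open for the model.
    return "\n\n".join([question, *shots, f"Task: {skill}\nAnswer:"])


def parse_indicator(reply: str) -> int:
    """Map a free-text model reply to a binary indicator value (1 = yes)."""
    return 1 if reply.strip().lower().startswith("yes") else 0


prompt = build_prompt(
    "experiment planning",
    "Does the following task need to be performed on-site? Answer yes or no.",
    ON_SITE_EXAMPLES,
    REMOTE_EXAMPLES,
)
```

Keeping one indicator per prompt means that each reply maps to a single binary variable, which the heuristic can then combine.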
We categorize a skill as automatable when the heuristic score defined in Equation (1) surpasses the threshold value of 0.5. In this equation, all parameters (B, P, T, S, and D) are binary variables, taking values of either 0 or 1. A value of 0 indicates the absence of the characteristic, while a value of 1 signifies its presence. B denotes the degree of physical interaction (brawn), P reflects the degree of interaction with people, T represents the level of abstract thinking required (brain), D indicates the extent to which the task can be performed digitally (digitalization), and L signifies the necessity of on-site presence (location). The weights (0.4, 0.3, 0.2, and −0.4) have been determined empirically.
As outlined in Equation (2), a task is considered offshorable if it does not have to be performed on-site (location L = 0) and can either be digitalized (digitalization D = 1) or standardized (routine R = 1). Tasks requiring physical presence (location L = 1) are automatically categorized as not offshorable.
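The offshorability rule, as described in words above, can be written as a short Boolean function. This is a minimal sketch of that verbal description. The function and parameter names are chosen here for readability and are not taken from the original implementation.

```python
def is_offshorable(location: int, digitalization: int, routine: int) -> bool:
    """Offshorability rule as described in the text.

    All arguments are binary indicators (0 or 1):
    location       -- L, 1 if the task must be performed on-site
    digitalization -- D, 1 if the task can be performed digitally
    routine        -- R, 1 if the task can be standardized
    """
    for value in (location, digitalization, routine):
        if value not in (0, 1):
            raise ValueError("indicators must be 0 or 1")
    # Not tied to a location AND (digitalizable OR standardizable)
    return location == 0 and (digitalization == 1 or routine == 1)


# A task that must be done on-site is never offshorable,
# regardless of the other two indicators.
examples = {
    (0, 1, 0): is_offshorable(0, 1, 0),
    (0, 0, 1): is_offshorable(0, 0, 1),
    (0, 0, 0): is_offshorable(0, 0, 0),
    (1, 1, 1): is_offshorable(1, 1, 1),
}
```

Of the eight possible indicator combinations, only the three with L = 0 and at least one of D or R set are classified as offshorable.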

Deep Learning Ensemble
The deep learning ensemble aims at further improving classification performance. The presented approach has been inspired by the human brain, which uses multiple interconnected neural networks that differ widely in anatomy and physiology [47] to increase accuracy and robustness.
Figure 2 offers an overview of the selected approach. In this deep learning ensemble, DistilBERT plays a central role in encoding the bipartite skill tuple into a contextual embedding. Subsequently, the ensemble leverages four different fully connected neural networks to assess different parts of the DistilBERT outputs. Table 5 provides an overview of the hyperparameters used within the ensemble classifier.
Each classifier used in the ensemble draws upon different portions of the DistilBERT embeddings. Figure 3a,b illustrate how a tensor of three sequential hidden layers (marked in blue) is combined with a common ground layer (marked in orange) to form the ensemble classifiers' input. The network-specific hidden layers and the common ground layer are concatenated and afterward transformed into a 1D vector for the fully connected neural network, which is then trained on these data. A majority voting mechanism is used to generate two crucial outputs: (i) an overarching assessment and (ii) a confidence score. The confidence score serves as a metric to gauge the consensus among the various neural networks regarding their evaluations.
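The voting step can be sketched in a few lines. The text does not give a formula for the confidence score, so the sketch assumes it is the fraction of networks that agree with the majority label. The function name `majority_vote` is likewise hypothetical.

```python
def majority_vote(votes: list[int]) -> tuple[int, float]:
    """Combine binary network votes into an overall label and a confidence score.

    The confidence is assumed to be the share of networks that agree with
    the majority label; the text does not define it precisely.
    """
    if not votes or len(votes) % 2 == 0:
        # An odd number of voters guarantees that no tie can occur.
        raise ValueError("majority voting needs an odd, non-zero number of votes")
    if any(v not in (0, 1) for v in votes):
        raise ValueError("votes must be 0 or 1")
    positives = sum(votes)
    label = 1 if positives > len(votes) / 2 else 0
    agreeing = positives if label == 1 else len(votes) - positives
    return label, agreeing / len(votes)


# Three networks, two of which consider the skill offshorable.
label, confidence = majority_vote([1, 1, 0])
```

With three voters, the confidence under this assumption can only take the values 2/3 (split decision) and 1.0 (unanimous decision).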

Selection of Network Configurations for the Ensemble
To prevent potential ties in majority voting, we conducted a preliminary study that evaluated the Area Under the Curve (AUC) scores for various ensemble configurations (Table 6).
From this analysis, we identified the top three performing networks for the final classifier. Plotting the averaged embeddings of the gold standard as a heatmap (Figure 4a) illustrates that the values in the last hidden layer lean towards zero. This observation may indicate why the fourth network is the least-performing one (Table 6), as it includes this particular layer. As explained earlier, the input for each network involves slicing DistilBERT embeddings, enabling different perspectives for the ensemble (Figure 4b). The input data for networks one and two appear more balanced than those for networks three and four. This difference may explain why these networks outperform the others in the ensemble.
Figure 5 shows the Shapley values [48] for the longest bipartite skill label extracted from the gold standard. The Shapley values suggest that substantives tend to have negative values, whereas verbs and denominatives tend to have positive values, as illustrated in blue. This observation implies that although the classifier operates on bipartite skill labels, subtokens still provide additional context due to their impact on each other's meaning. We therefore consider all DistilBERT tokens in the ensemble networks.

Determining the Optimal Voting Threshold
Subsequently, we used ROC curves to determine each network's optimal voting threshold (see Table 7). Given that our objective is to achieve a balance between true positives (TPs) and false positives (FPs), we selected thresholds that give equal importance to the classes "is_offshorability" and "is_not_offshorability", and likewise to "is_automatable" and "is_not_automatable". This approach ensures a robust classification strategy that aligns with the nuanced requirements of the target domain.
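One common way to pick such a threshold from a ROC curve is to maximize Youden's J statistic (true positive rate minus false positive rate), which weighs both classes equally. Whether this exact criterion was used is not stated in the text, so the following sketch is an assumption. The scores and labels are toy values, and the function names are hypothetical.

```python
def roc_points(scores: list[float], labels: list[int]) -> list[tuple[float, float, float]]:
    """Return (threshold, TPR, FPR) for every distinct score used as a threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present to compute a ROC curve")
    points = []
    for threshold in sorted(set(scores)):
        # A sample counts as predicted positive if its score reaches the threshold.
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((threshold, tp / positives, fp / negatives))
    return points


def best_threshold(scores: list[float], labels: list[int]) -> float:
    """Pick the threshold that maximizes Youden's J = TPR - FPR (assumed criterion)."""
    threshold, _, _ = max(roc_points(scores, labels), key=lambda p: p[1] - p[2])
    return threshold


# Toy network outputs and gold labels, for illustration only.
toy_scores = [0.2, 0.3, 0.6, 0.9]
toy_labels = [0, 0, 1, 1]
threshold = best_threshold(toy_scores, toy_labels)
```

In this toy example, a threshold of 0.6 separates both classes perfectly, so it is the one selected.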

Evaluation
This section introduces the evaluation datasets (Section 5.1), assesses the classifier performance on the random selection dataset (Section 5.2) and the skewed rule-based dataset (Section 5.3), and discusses the obtained results.

Datasets
The evaluation builds upon two different gold standard datasets: (i) a dataset that contains a random selection of skills annotated by human experts (Section 5.1.1) and (ii) skills annotated based on expert-drafted heuristics (Section 5.1.2).

Random Selection Gold Standard Dataset
The data were collected by jobchannel AG (Thalwil, Switzerland) (https://www.jobchannel.ch, accessed on 1 March 2024), a company focused on aggregating job data from online sources (e.g., boards, websites, etc.). The jobchannel ontology describes the various skills necessary for the Swiss job market. Each skill is split into a (predicate, topic) pair representing the action and the context in which the skill is performed.
The skill "Marketingkonzept planen" (plan marketing concepts), for example, comprises the topic "marketing concepts" and the predicate "plan".
Human annotators used the following guidelines based on Swiss practices for manually assessing skills with a binary value for offshorability and automatability:
1. Offshorability determines if a task can be performed regardless of the physical presence of the person performing it. Outsourcing issues may arise due to location dependencies, human interaction (e.g., especially if cultural preferences are considered), or even the need to move around large objects.
2. Automatability assesses whether technology (e.g., robots, drones, software, etc.) can currently fully perform the task. Activities are non-automatable if they are not clearly specified or need too much human interaction.
Table 8 summarizes the dataset statistics. The experts labeled 67.7% of the examined skills as offshorable and 58.5% as automatable. Table 9 provides examples of expert assessments.
The creation of the rule-based gold standard dataset was guided by a qualitative analysis of misclassifications produced by the skill classifier. This analysis revealed that, in many cases, humans could apply simple rules to assign these skills to the correct class. We therefore asked domain experts to identify rules that help classify these skills. These rules were used to create a separate dataset that aims to identify mistakes that are particularly obvious to humans. Tables 10 and 11 provide examples of these expert-created classification rules. Based on these rules, we created a corpus that covers 13,937 skills. In contrast to the random selection gold standard, the rule-based gold standard only considers obvious classifications that are covered by these heuristics. It therefore does not contain any challenging cases in which expert assessments differed or required discussions and further clarifications to decide on the skill's correct class.
Table 8 provides dataset statistics for both gold standard datasets. Please note that the manually annotated random selection corpus assigns offshorability and automatability classifications to each skill. The rule-based corpus, in contrast, only covers the class outlined in the rule for that particular skill (i.e., either offshorability, automatability, or both). It is also important to note that the class distribution differs significantly between the two datasets. Offshorable skills are even more frequent in the rule-based corpus, while the distribution for automatability is reversed between the datasets.
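The way expert rules with a wildcard topic translate into labels for individual bipartite skills can be illustrated with a minimal sketch. The `Skill` and `Rule` types, the `apply_rules` helper, and both example rules are hypothetical and were invented for illustration; they are not taken from Tables 10 and 11. The sketch also assumes exact, case-insensitive matching on topic and predicate, which may be simpler than the matching used to build the actual corpus.

```python
from dataclasses import dataclass
from typing import Dict, List

WILDCARD = "*"  # a rule with this topic applies across all skill topics


@dataclass(frozen=True)
class Skill:
    topic: str      # e.g., "marketing concepts"
    predicate: str  # e.g., "plan"


@dataclass(frozen=True)
class Rule:
    topic: str      # lower-case topic or WILDCARD
    predicate: str  # lower-case verb describing the activity
    indicator: str  # "offshorable" or "automatable"
    label: bool


def rule_matches(rule: Rule, skill: Skill) -> bool:
    """Return True if the rule covers the given bipartite skill."""
    topic_ok = rule.topic == WILDCARD or rule.topic == skill.topic.lower()
    return topic_ok and rule.predicate == skill.predicate.lower()


def apply_rules(skill: Skill, rules: List[Rule]) -> Dict[str, bool]:
    """Collect labels only for indicators covered by a matching rule.

    The first matching rule per indicator wins (an assumption of this sketch).
    """
    labels: Dict[str, bool] = {}
    for rule in rules:
        if rule_matches(rule, skill):
            labels.setdefault(rule.indicator, rule.label)
    return labels


# Hypothetical rules, invented for illustration only.
EXAMPLE_RULES = [
    Rule(WILDCARD, "transport", "offshorable", False),
    Rule("invoices", "archive", "automatable", True),
]
```

A skill that matches no rule receives an empty dictionary and would not enter the rule-based corpus, which mirrors the observation that this corpus only labels the indicator named in the applicable rule.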

Performance on the Random Selection Gold Standard
This evaluation assesses the classifiers' performance on the random selection gold standard introduced in Section 5.1.1 using k-fold cross-validation with four folds. The data were divided into 80% training data and 20% evaluation data in each run. Once all runs were executed, the results were summarized.
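A minimal sketch of the fold construction, using only the Python standard library, is given below. The function name `k_fold_indices` and the fixed seed are assumptions made for illustration. Plain k-fold partitioning with four folds holds out one quarter of the data per run, so the sketch does not attempt to reproduce the 80%/20% ratio reported above.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int = 4,
                   seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for k-fold cross-validation.

    Indices are shuffled once; every sample is held out in exactly one fold.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Distribute the shuffled indices round-robin over the k folds.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

In each run, a classifier would be trained on the `train` indices and scored on the `test` indices, and the per-run scores would then be aggregated across all four runs.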
Cross-validation was not applied to the GPT/heuristic classifier because it only uses few-shot learning with examples that were not part of the evaluation dataset. As Tables 12 and 13 show, the DistilBERT classifier performs best. The Transformer model's effective use of self-attention and pre-training was key to extracting meaningful semantics for classification, even from input data that are limited to bipartite skill labels.
The SVM results were as expected: the model did not perform well given the limited training data. The low performance of the GPT-based heuristic is not entirely surprising, as it lacked domain knowledge and was not optimized for classifying human resources tasks.
Although the ensemble model is considerably more complex than DistilBERT, it performs similarly to this Transformer model. This result indicates that the additional generalization capabilities provided by the more sophisticated model did not translate into better performance, since the model was constrained by the scarcity of input data, which was limited to bipartite skill descriptions.
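The ensemble combines the outputs of several fully connected networks through a majority voting component (Figure 2). The following sketch illustrates such a vote over per-network probabilities. The function name `majority_vote`, the default threshold of 0.5, and the handling of ties (resolved to the negative class) are assumptions of this sketch rather than details of the evaluated model.

```python
from typing import Optional, Sequence


def majority_vote(probabilities: Sequence[float],
                  thresholds: Optional[Sequence[float]] = None) -> bool:
    """Combine per-network probabilities into one binary decision.

    Each network votes positive if its probability reaches its threshold.
    The ensemble is positive only if a strict majority votes positive, so
    ties are resolved to the negative class (an assumption of this sketch).
    """
    if thresholds is None:
        thresholds = [0.5] * len(probabilities)
    if len(thresholds) != len(probabilities):
        raise ValueError("one threshold per network is required")
    votes = [p >= t for p, t in zip(probabilities, thresholds)]
    return 2 * sum(votes) > len(votes)
```

Passing one threshold per network allows each voter to use its own operating point instead of the shared default.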

Performance on the Rule-Based Dataset
We draw upon the skewed rule-based dataset for assessing model robustness. This dataset only contains annotations that (i) are obvious to human experts and (ii) can be derived from well-established heuristics (Section 5.1.2).
These experiments, therefore, aim at assessing the classifiers' robustness against grave mistakes, i.e., cases where automatic classifications clearly violate human intuition. They have been designed to indicate the impact of mitigation strategies such as increasing model complexity (i.e., large language models) and enhancing model diversity (ensemble method) on the model's robustness against these types of mistakes.
Table 14 summarizes the performance of all four methods on the offshorable indicator. Both the Transformer and the ensemble model provide very good results for offshorability, while the GPT heuristic scored even below the SVM baseline. Table 15 compares the systems' performance on the automatable indicator. The ensemble model provides the best results for the offshorable indicator, and the DistilBERT classifier excels for the automatable category. Both models are very similar in terms of performance and clearly outperform the SVM baseline and the GPT heuristic.
These results provide the following insights:
• All models that have been fine-tuned on the random selection dataset have benefited tremendously from training on task-specific data. Consequently, even the fine-tuned SVM classifier outperforms the GPT heuristics.
• DistilBERT, the ensemble, and the GPT heuristics all leverage background knowledge from their base models. This knowledge improves model performance but cannot substitute for fine-tuning on domain data (compare GPT's performance). GPT heuristics, therefore, struggle with providing accurate classifications based on the limited context available through prompt engineering.
• Additional ensemble complexity does not translate into better performance.

Discussion
The evaluation demonstrates that the fine-tuned models provide good results, even for the skewed rule-based dataset, which differs significantly from the random selection dataset in terms of class distribution (Table 8). This is even true for the automatable classification task, which is inherently more intricate and complex than assessing the offshorable indicator. Unlike the offshorable label, which is primarily determined by whether a task needs to be performed on-site (particularly relevant for Gig economy platforms like Uber), the automatable label encompasses a multitude of underlying factors. These factors are considerably more nuanced and context-dependent and are especially relevant for high-skill, location-independent services such as Fiverr and Upwork.
The relevance of multidimensional aspects makes automatability more difficult to discern, which is also reflected in classification performance. The less complex models (e.g., the SVM baseline and GPT) struggle with the automatable indicator. The difference in performance is less pronounced in the random selection gold standard dataset, where both classifiers yield similar results. In contrast, the SVM and GPT-based heuristics provide significantly worse classification performance for the automatable indicator on the skewed rule-based dataset. The SVM model clearly lacks the complexity required to successfully generalize its assessments, while the GPT classifier fails to correctly classify the label of unseen bipartite skills based on the provided few-shot examples.
DistilBERT and the ensemble classifier, in contrast, provide F1 scores and accuracy measures of 90% and above for both the automatable and offshorable labels, indicating that these models are sufficiently complex to approximate the classification model based on the available training data. The scarcity of the input data available in the job channel occupation ontology, which is limited to bipartite skill labels (i.e., a topic and a verb describing the skill), poses a considerable challenge to the classification process. The evaluations in the previous section suggest that techniques that increase model complexity (e.g., ensembling) do not necessarily improve skill classification performance, indicating that the ensemble's loss function might be non-convex [49].
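For reference, the accuracy and F1 measures referred to in this discussion can be computed from binary predictions as sketched below. The helper `binary_metrics` is hypothetical and scores the positive class only. Whether the reported figures use this or an averaged variant is not stated in this section, so the sketch should be read as an illustration of the metric definitions rather than of the exact evaluation code.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[bool],
                   y_pred: Sequence[bool]) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 for the positive class."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # true positives
    tn = sum(1 for t, p in pairs if not t and not p)  # true negatives
    fp = sum(1 for t, p in pairs if not t and p)      # false positives
    fn = sum(1 for t, p in pairs if t and not p)      # false negatives
    accuracy = (tp + tn) / len(pairs)
    # Undefined ratios (empty denominators) are reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

On a skewed dataset, a classifier that always predicts the majority class can reach high accuracy while its F1 score for the minority class is zero, which is why both measures are worth reading together when class distributions differ as in Table 8.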

Conclusions and Outlook
This paper discussed machine learning models capable of classifying bipartite skill labels in terms of their offshorability and automatability. Both indicators help assess a skill's future readiness, i.e., how likely it is that there will be future demand for it. Bipartite skill labels describe a skill based on two terms: the skill topic (e.g., Python) and a verb that provides additional context regarding the task or activity (e.g., programming). This bipartite skill specification allows distinguishing the skill of Programming Python from skills such as Teaching Python, Debugging Python, or Profiling Python but does not provide any context beyond this information.
Our primary contributions lie in the customization and fine-tuning of deep learning methods tailored for the automatic classification of future readiness (in terms of automatability and offshorability) within the framework of bipartite skill labels. We conducted comprehensive evaluations of these methods to assess their performance and to investigate whether increasing model complexity translates into better classification performance under the constraints imposed by the use of bipartite skill labels.
The paper compares the performance of four skill classifiers (SVM, Transformers, GPT-based, and DistilBERT ensemble), which were trained using a gold standard dataset of 2254 annotated bipartite skill specifications obtained from the scientific literature and domain experts. A fourfold cross-validation assessed classification performance on the gold standard dataset and was followed by a second set of experiments that evaluated the robustness of the trained classifiers by applying them to a considerably larger dataset of 13,937 annotations created based on expert-defined annotation rules.
The evaluation results indicate that domain-specific fine-tuning is essential to improve the accuracy of the classification algorithm. Once fine-tuned, even the SVM baseline outperformed the GPT classifier, which only benefited from few-shot learning through examples integrated via prompt engineering. The Transformer classifiers also leveraged the knowledge obtained during pre-training, easily outperforming models that did not benefit from pre-training. Given the scarcity of input data (the two terms obtained from the bipartite skill labels), increasing model complexity does not necessarily lead to improvements in classification performance. This result is noteworthy since the ensemble models did not outperform the single DistilBERT model.
In conclusion, assessing a skill's automatability and offshorability offers insights into the future readiness of professional skills, thereby aiding in forecasting the likelihood of future demand for specific skill sets. Through customization and fine-tuning of deep learning methods, the research evaluates the performance of skill classifiers. It provides valuable findings for improving the accuracy of classification algorithms that are limited to bipartite skill labels in predicting job market demands.
Future work will integrate insights into a skill's future readiness with systems such as CareerCoach [41], which provides reskilling and upskilling suggestions. We also plan to leverage additional background knowledge in the classification process. The O*NET and ESCO ontologies, for instance, encode knowledge on skills, competencies, and occupations that might be helpful in better contextualizing the classification input. Contextualization is also key to addressing the problem of scarce input data and, therefore, to paving the way towards successfully using more complex classifier setups. For this purpose, the authors also plan to incorporate domain knowledge, e.g., a sustainability knowledge graph initially developed in earlier work for UN Environment [50] and further extended in the SDG-HUB project (https://www.weblyzard.com/sdg-hub, accessed on 1 March 2024) and Climateurope2 (https://www.climateurope2.eu, accessed on 1 March 2024), a Support Action funded by the European Union with an interest in job creation in the climate services sector and in the impact of the green transition on the European labor market [51].

Figure 1. Obtaining details on skill automatability and offshorability via GPT and heuristics.
The LLM classifier described here queries a GPT model for assessments on a skill's basic characteristics as outlined by Josten and Lordan [12]: brawn (B): physical interaction with objects is needed; people (P): interactions with humans are required; brain (T): abstract reasoning is needed; location (L): the activity's location matters (e.g., customer's or own location); digitalization (D): activity can be digitalized; routine (R): activity can be standardized into subprocesses that are performed similarly around the world. We use OpenAI's gpt-3.5-turbo model and a prompt that provides few-shot examples to contextualize the request, e.g., the next prompt helps us obtain a binary skill classification:

Figure 2. Deep learning ensemble classifier. The fully connected layers (FC Layers) independently assess the contextualized embeddings and provide their output to the majority voting component.

Figure 3. Composition of the inputs for the ensemble model. (a) Tensor from DistilBERT embeddings. (b) Diverse input for each network.

Figure 4. Heatmaps of embeddings from DistilBERT. (a) Heatmap of the average gold standard embeddings from DistilBERT. (b) Heatmap of gold standard embeddings for each network from DistilBERT.

Figure 5. Shapley values of the longest bipartite skill label in the gold standard (English translation: coordinate Sports Event Management Unit).

Table 1. Bipartite skill labels, translations, and classification of their automatability and offshorability.

Table 2. Classification performance for the "offshorable" indicator with/without domain adaptation (DA) and layer freeze (LF). The best results are indicated in bold.

Table 3. Classification performance for the "automatable" indicator with/without domain adaptation (DA) and layer freeze (LF). The best results are indicated in bold.

Table 4. Model specification and hyperparameters.

Table 5. Ensemble model configuration and parameters.

Table 7. Assessment of the optimal threshold for the ROC curves of the networks.

Table 8. Class distribution within the gold standard datasets. Please note that the rule-based dataset frequently provides labels for only one of the classes (i.e., only offshorable or only automatable).

Table 9. Example expert annotations for both offshorability and automatability.

Table 10. Example rules for identifying offshorable and non-offshorable skills. The asterisk in the topic column indicates that the rule works across all skill topics.

Table 11. Example rules for identifying automatable and non-automatable skills. The asterisk in the topic column indicates that the rule works across all skill topics.

Table 12. The "offshorable" indicator's classification performance on the randomly selected dataset. The best results are indicated in bold.

Table 13. The "automatable" indicator's classification performance on the randomly selected dataset. The best results are indicated in bold.

Table 14. Performance for the "offshorable" indicator on the skewed rule-based dataset. The best results are indicated in bold.

Table 15. Performance for the "automatable" indicator on the skewed rule-based dataset. The best results are indicated in bold.