Education-to-Skill Mapping Using Hierarchical Classification and Transformer Neural Network

Skills gained from vocational or higher education form an essential component of a country's economy, determining the structure of the national labor force. Therefore, knowledge of how people's education converts to jobs enables data-driven choices concerning human resources within an ever-changing job market. Moreover, the relationship between education and occupation is also relevant in times of global crises, such as the COVID-19 pandemic. Healthcare system overload and skill shortage on one hand, and job losses related to lock-downs on the other, have exposed the necessity to identify target groups with relevant education backgrounds in order to facilitate their occupational transitions. However, the relationship between education and employment is complex and difficult to model. This study aims to propose a methodology that would allow us to model education-to-skill mapping. Multiple challenges arising from administrative datasets, namely imbalanced data, complex labeling, hierarchical structure and textual data, were addressed using six neural network-based algorithms of incremental complexity. The final proposed mathematical model incorporates the textual data from descriptions of education programs, which are transformed into embeddings utilizing transformer neural networks. The output of the final model is constructed as a hierarchical classification task. The effectiveness of the proposed model is demonstrated in experiments on national-level data covering the whole population of Lithuania. Finally, we provide recommendations for the usage of the proposed model. The model can be used for practical applications and scenario forecasting; some possible applications are demonstrated and described in this article. The code for this research has been made available on GitHub.


Introduction
The analysis of country-wide employment and education data is of great practical importance. An accurate evaluation of the relationships between education and employment allows for a better forecast of human skill supply. Predicting the quantity and structure of human skills allows for better planning and preparation, as well as countering negative socio-economic tendencies [1]. Moreover, economic changes transform employment and occupations; therefore, a significant share of people face the necessity to upskill and redeploy [2]. In these circumstances, knowledge of how education is related to employment, and how obtained education transforms into jobs, is very important. It is therefore no surprise that national (e.g., the UK organization NESTA [3]) and international organizations (e.g., the EU agency The European Centre for the Development of Vocational Training [4]) are investing in perfecting human skill forecasting.
At first sight, education-to-skill mapping seems to be a simple task; it might be possible to approximate the most likely occupation by estimating the frequencies of graduates' employment. On closer inspection, however, this task is rather complex.
The first challenge is the complexity of administrative data. Education programs are short-lived and constantly changing: the number of new programs tends to grow, while some old programs are terminated or rewritten. With a simplified approach, it is impossible to map new education programs to occupations, as there is no digitized historical data in machine-readable format. Furthermore, in some instances, there are too few actual graduates throughout certain programs' lifetimes. Applying econometric modeling or simple machine learning classification algorithms to education program data is challenging, as the huge number of programs, being categorical in nature, transforms into highly sparse and unbalanced input data, which affects the accuracy of the predictions. Another challenge related to education data is the fact that residents are diverse and might complete several education programs, which in combination make up a unique set of qualifications and skills.
Data about residents' employment is not simple either. For instance, residents might have more than one job at a time, have several different jobs during their career, and residents are likely to have more skills and higher-level jobs as their career progresses. Additional challenges arise from rare occupations, which create imbalances in the data. For example, such professions as divers, radio and TV show hosts, and meteorologists, are normally occupied by a very small number of residents on the national scale. Thus, a simplified estimation of the relationship between education and occupation would generalize poorly because of the complexity of the problem.
Finally, globally devastating events, such as the COVID-19 pandemic [5], financial crises [6], and climate change [7] have overwhelming effects on human resources around the world. However, econometric time-series forecasting models perform rather poorly in this context: being based on historical data, they cannot reliably account for unexpected events [8,9]. Thus, in these circumstances, it is important to have a complete set of various scenario generation tools, including machine learning methods [6].
The aim of this research is to propose a mathematical model that not only maps education to occupations accurately enough, but also has other practical applications. To achieve this aim, the mathematical model is proposed based on state-of-the-art models in the field. Empirical experiments based on national-level data were performed, and the results were analyzed following well-established scientific practice. Lastly, practical application experiments with the resulting models were performed to provide additional insights into possible applications.
The rest of the paper is organized as follows: in Materials and Methods, Section 2, the mathematical methods are described as well as proposed models, experimentation data, architecture of models, text pre-processing and computational infrastructure. Results and Discussion, Section 3, presents and discusses the results of experiments and examples of applications. Finally, Conclusions, Section 4, closes the article.

Mathematical Models
This section provides the main notation and building blocks for the proposed model.

Deep Neural Networks
Education-to-skills mapping is considered as a supervised learning classification problem. The task of supervised learning in the machine learning context is to learn a mapping $f_\theta : X \subset \mathbb{R}^{d_x} \to Y \subset \mathbb{R}^{d_y}$ from the given dataset $D = \{(x^{(i)}, y^{(i)}) \mid 1 \le i \le N\}$, where $X$ denotes the input space and $Y$ denotes the label space. Here, $x^{(i)} \in X$ is the $i$th input sample and $y^{(i)} \in Y$ is the $i$th corresponding label [10].

Deep Neural Network
In this research, a fully connected neural network (DNN) is considered [11]. The network realizes an operator function $f_\theta$ constructed as the composition of $M$ non-linear transformations (layers), where the first $M-1$ are hidden layers and the last one is the output layer:
$$f_\theta = f_M \circ f_{M-1} \circ \cdots \circ f_1, \qquad X_0 = x,$$
where $x$ is the vector fed to the model. The $m$-th layer has $n_m$ neurons with unknown parameters $\theta_m \in \mathbb{R}^{n_m \times (n_{m-1}+1)}$; it receives an input vector $X_{m-1} \in \mathbb{R}^{n_{m-1}}$ and transforms it into the vector $X_m \in \mathbb{R}^{n_m}$ by applying a linear transformation followed by a component-wise (non-linear) activation function $\sigma$:
$$X_m = \sigma\big(\theta_m [X_{m-1};\, 1]\big), \qquad 1 \le m \le M,$$
where $X_m$ is the input for the $(m+1)$-th layer, and the hidden layers use the ReLU activation function $\sigma(x) = \mathrm{relu}(x) = \max(0, x)$ [11].
The output function $S$ depends on the classification task: in the case of single-label classification it would be the softmax function, but in our case of multi-label classification it is the sigmoid function, applied component-wise to the output layer:
$$S(x) = \mathrm{sigmoid}(x) = \frac{\exp(x)}{1 + \exp(x)}.$$
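The layer composition above can be sketched in NumPy; the dimensions, weights and number of layers here are illustrative, not those of the experiments:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, layers):
    """Forward pass: hidden layers apply X_m = relu(theta_m X_{m-1} + b_m);
    the output layer applies a sigmoid so that every class gets a separate
    probability (multi-label setting)."""
    a = x
    for theta, b in layers[:-1]:
        a = relu(theta @ a + b)
    theta, b = layers[-1]
    return sigmoid(theta @ a + b)

rng = np.random.default_rng(0)
dims = [8, 16, 16, 5]  # input dim, two hidden layers, 5 output classes
layers = [(0.1 * rng.normal(size=(dims[m + 1], dims[m])), np.zeros(dims[m + 1]))
          for m in range(len(dims) - 1)]
probs = forward(rng.normal(size=8), layers)  # one probability per class
```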

Parameter Estimation
The quality of the proposed model $f_\theta$ is measured with a loss function $L_\theta(y, f(x|\theta))$. In this study, $f_\theta$ is a deep neural network with unknown parameters $\theta \in \Theta \subset \mathbb{R}^{d_\theta}$. Since the model $f_\theta$ is parametric, the unknown parameters are updated using the gradient of the loss function, $\nabla_\theta L_\theta$ [11]. In the simplest form, the parameters $\theta$ are updated iteratively:
$$\theta_{t+1} = \theta_t - \eta \nabla_\theta L_\theta,$$
where $\eta$ is the learning rate (the size of the steps towards the loss minimum) and $t$ is the iteration number [11].
For the process to converge successfully, a proper learning rate has to be applied to avoid being trapped in local minima [11]. Multiple optimization techniques can be employed to address this. In this study, Adaptive Moment Estimation (hereafter abbr. Adam) was used. Adam computes individual learning rates for each parameter, using information on exponentially decaying averages of past gradients [11].
Let the gradient of the loss function with respect to the $i$th parameter $\theta_i$ at iteration $t$ be $g_{t,i} = \nabla_\theta L_{\theta_{t,i}}$. Then
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2,$$
where $m_t$ and $v_t$ are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients $g_t$. $\beta_1$ and $\beta_2$ are hyper-parameters, usually set to 0.9 and 0.999, respectively [11]. As $m_t$ and $v_t$ are initialized as vectors of zeros, they are biased towards zero, especially during the initial iterations. These biases are corrected by computing the first and second moment estimates [11]:
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}.$$
The Adam update rule for $t \ge 1$ is then
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t,$$
where $\epsilon$ is a smoothing term (avoiding zero in the denominator), usually set to $10^{-8}$.
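A minimal NumPy sketch of one Adam step, mirroring the equations above on a toy quadratic loss (the moment hyperparameters follow the usual values; the learning rate and loss are illustrative):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, eta=0.01,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: moment estimates, bias correction, parameter step."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (mean)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment (uncentered var.)
    m_hat = m / (1 - beta1 ** t)                # bias corrections
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimise L(theta) = theta^2, whose gradient is 2 * theta.
theta = np.array(5.0)
m = v = np.zeros_like(theta)
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2.0 * theta, m, v, t)
```

After enough iterations, the parameter settles near the minimum at zero.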
The accuracy metric is defined as
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$
where $TP$, $TN$, $FP$ and $FN$ denote the numbers of true positives, true negatives, false positives and false negatives, respectively. In some instances, the accuracy metric is sufficient, but it may often be misleading, especially in classification tasks with many classes or imbalanced data. When classes are sparse and imbalanced, the resulting accuracy value is dominated by the majority group, such as the 0 labels [12]. Therefore, additional metrics should be adopted.
Precision measures the percentage of positively predicted samples that are actually labeled as 1:
$$\mathrm{Precision} = \frac{TP}{TP + FP}.$$
Precision accounts for falsely predicted positives and answers the question of how many real positives were among the predictions. Recall measures the share of the positive group that was correctly predicted to be 1:
$$\mathrm{Recall} = \frac{TP}{TP + FN},$$
answering the question of how many of the positive labels the model detected [12].
The F1 score is the harmonic mean of precision and recall, defined as:
$$F1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$
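The three metrics can be computed directly from the confusion counts; a small illustrative example:

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 from binary label vectors (micro counts)."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

y_true = np.array([1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 1, 0])  # 2 TP, 1 FP, 1 FN
p, r, f1 = precision_recall_f1(y_true, y_pred)
```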

Multi-Label Learning
Multi-class classifiers can handle many classes, but exactly one class appears in each sample. A multi-label classifier is an extension of the multi-class classifier that allows multiple classes in each sample (the number of classes per sample ranges from zero to the total number of different classes). Thus, multi-label classification can be defined as a machine learning task that classifies data into many classes, where classes can co-occur simultaneously. In other words, a multi-label task is a problem where each example is associated with a set of labels [10].
Suppose $X \subset \mathbb{R}^{d_x}$ denotes the $d_x$-dimensional instance space, and $Y = (y_1, y_2, \ldots, y_{d_y})$ denotes the label space with $d_y$ possible class labels. The task of multi-label learning is to learn an operator $h : X \to 2^Y$ from the multi-label examples $(X^{(i)}, Y^{(i)})$, where $X^{(i)} \in X$ and $Y^{(i)} \subseteq Y$ is the set of labels associated with $X^{(i)}$. For any unseen instance $X^{(i)} \in X$, the multi-label classifier $h(\cdot)$ predicts $h(X^{(i)}) \subseteq Y$ as the set of proper labels for $X^{(i)}$. The key idea of multi-label learning is the decomposition of the problem into a number of independent binary classification problems, where each classifier corresponds to a possible label [10].
The classification into the set $\{1, \ldots, d_y\}$ is performed after the application of a threshold to the values predicted by the model [10]:
$$\hat{y}_j^{(i)} = \begin{cases} 1, & f_j(x^{(i)}) \ge T_j, \\ 0, & \text{otherwise}, \end{cases}$$
where $j = 1, \ldots, d_y$, $i = 1, \ldots, N$, and $T_j$ acts as a threshold which dichotomizes the label space into relevant and irrelevant label sets.
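The thresholding step can be sketched in one line; a uniform threshold of 0.4 (the value used later in the experiments) is shown for illustration:

```python
import numpy as np

def dichotomize(probs, thresholds):
    """Predict label j whenever the model's probability reaches T_j."""
    return (probs >= thresholds).astype(int)

probs = np.array([0.70, 0.20, 0.45])          # per-class sigmoid outputs
labels = dichotomize(probs, np.full(3, 0.4))  # uniform threshold T_j = 0.4
```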
In multi-label classification tasks, the performance of each class should be evaluated independently of the others. A common practice is to use a sigmoid activation for the last layer, which transforms the input into a probability in the range (0, 1) for every class (separate probabilities).
The loss function for the multi-label problem is the binary cross-entropy loss:
$$L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{d_y} \Big[ y_j^{(i)} \log \hat{y}_j^{(i)} + \big(1 - y_j^{(i)}\big) \log\big(1 - \hat{y}_j^{(i)}\big) \Big],$$
where $\hat{y}$ is an estimate of the probability of the positive class [11].

Class Imbalance Problem
Although class imbalance often results from a lack of available data, imbalanced data also appears in classification tasks with many classes or sparse class data, where the majority of labels, having the nature of one-hot encoding, are 0. High class imbalance is natural in many real-world datasets for classification tasks, such as image recognition, medical diagnosis, and fraud detection (summarised by, e.g., [13]).
The general measure of class imbalance is defined as
$$\rho = \frac{\max_c \{|S_c|\}}{\min_c \{|S_c|\}},$$
where $\rho$ is the maximum imbalance between classes, $S_c$ is the set of examples in class $c$, and $\max_c\{|S_c|\}$ and $\min_c\{|S_c|\}$ are the maximum and minimum class sizes over all classes. For example, if the maximum class size is 1000 and the minimum class size is 10, then $\rho = 100$. Even though class imbalance is widespread, the general majority of classifiers operate on the assumption that the classes in the training dataset are balanced. Therefore, in cases of class imbalance, many classifiers exhibit bias towards the majority classes due to their increased prior probability. In this case, the classifier might ignore the minority classes, which tend to be misclassified [12].
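Computing $\rho$ for the worked example above:

```python
from collections import Counter

def imbalance_ratio(class_labels):
    """rho = maximum class size / minimum class size over observed classes."""
    counts = Counter(class_labels)
    return max(counts.values()) / min(counts.values())

# 1000 examples of one class, 10 of another, as in the example above.
rho = imbalance_ratio(["a"] * 1000 + ["b"] * 10)
```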
Anand and colleagues showed that class imbalance interferes with backpropagation, resulting in a much smaller length of the gradient component for minority classes [14]. Thus, the majority class dominates the net gradient, which is responsible for weight adjustment. This results in reduced loss only for the large class groups [14].
In this study, we tackle the data imbalance issue using specific methods, namely: focal loss, structuring output as hierarchy and transforming input data as text embeddings.

Focal Loss
With a large number of classes, true negatives can easily overwhelm training, and the loss might become unrealistically small despite false negatives. Likewise, if the output dimension is large (there are many label classes) and each training sample has only a few cases of some classes, the loss would encourage ignoring these few cases. Nonetheless, those cases might be extremely important for the model's predictive performance.
One way to solve this is by using focal loss. Focal loss was originally proposed as an effective loss function for the image processing-object detection problem. The problem with object detection might occur in cases when rare objects are missed because the background occupies a large part of the image. Thus, detection fails because extreme foreground/background class imbalance prevents learning. Yet research shows that focal loss is useful not only for visual analysis but also in general multiclass classification tasks of imbalanced class conditions [15].
Focal loss improves the classifier by configuring the loss function with a modulating factor. Focal loss can be defined as
$$FL = -\sum_{c} \Big[ \alpha \big(1 - \hat{y}_c^{(i)}\big)^{\gamma} y_c^{(i)} \log \hat{y}_c^{(i)} + (1 - \alpha) \big(\hat{y}_c^{(i)}\big)^{\gamma} \big(1 - y_c^{(i)}\big) \log\big(1 - \hat{y}_c^{(i)}\big) \Big],$$
where $y_c^{(i)} \in \{0, 1\}$ is a binary class label, $\hat{y}_c^{(i)} \in (0, 1)$ is an estimate of the probability of the positive class, $\gamma$ is the focusing parameter, and $\alpha \in [0, 1]$ is a hyperparameter that weights the positive class up or down [11,15,16].
The hyperparameter γ prevents the loss from being overwhelmed by frequent cases, focusing the model on instances with rarer classes. The higher the γ, the higher the rate at which easy-to-classify examples are down-weighted. The weighted binary cross-entropy loss is a special case of focal loss with γ = 0. The criterion for choosing the γ value is the degree of data imbalance [15,16].
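A NumPy sketch of this loss, with the down-weighting of easy examples visible in the comparison at the end (the default γ = 3 follows the value found best in the experiments below; α = 0.25 is an illustrative choice):

```python
import numpy as np

def focal_loss(y_true, y_prob, gamma=3.0, alpha=0.25, eps=1e-7):
    """Binary focal loss summed over classes; setting gamma = 0 recovers
    the alpha-weighted binary cross-entropy."""
    p = np.clip(y_prob, eps, 1.0 - eps)   # guard log(0)
    pos = -alpha * (1.0 - p) ** gamma * y_true * np.log(p)
    neg = -(1.0 - alpha) * p ** gamma * (1.0 - y_true) * np.log(1.0 - p)
    return float(np.sum(pos + neg))

y = np.array([1.0, 0.0, 0.0, 0.0])
easy = focal_loss(y, np.array([0.9, 0.1, 0.1, 0.1]))  # confident predictions
hard = focal_loss(y, np.array([0.3, 0.1, 0.1, 0.1]))  # positive class missed
```

The well-classified sample contributes almost nothing to the loss, while the misclassified positive dominates it.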

Hierarchical Classification
The vast majority of machine learning applications for classification tasks disregard hierarchical information, although many classification problems, especially those involving many classes, inherit a natural hierarchical structure of classes [17]. It might be useful to employ the hierarchical data structure to improve prediction, as larger classes at the top level of the hierarchy are more easily predicted. The hierarchy is also useful in the deployment of the model: it can be adapted to different circumstances where there is a need to predict only a shallow level of the hierarchy [17,18].
Silla and Freitas (in 2011 [18]) have grouped hierarchical classification into three main categories: flat, local and global. The flat classifier ignores the hierarchy and classifies only leaf nodes assuming that ancestor classes are also implicitly assigned. The local classifier system employs set of local classifiers (classifier per level, one binary classifier per node or classifier per parent node). The global classifier uses a single classifier to deal with the entire class hierarchy.
The choice of classifier depends on the specific problem. Generally, local classifiers provide more options to model the relationship between parent and leaf nodes. The weakness of such an approach is the complexity of the model, as the number of classifiers tends to grow with additional hierarchy layers. Moreover, local classifiers might suffer from inconsistency in class predictions across different levels, which needs to be addressed with additional correction methods (e.g., hierarchy pruning, thresholds, etc.) [18]. Global classifiers do not entail such challenges, as the hierarchy structure is learned during the learning process [18].

Text Embedding
Text embedding might be defined as a vector representation of words that contains information about both semantic and syntactic meaning [19]. Text embeddings allow us to efficiently represent textual data in vector space. Unlike one-hot vectors, the embedded text is not sparse. Moreover, it contains information about relatedness between categories. Embeddings not only represent words, phrases or sentences in numeric form but also preserve additional information: the meaning of words and its relatedness to each other [19].
More formally, let $W$ be the size of the dictionary and $F$ the size of the selected feature dimension; then the one-hot word representation $x^{(i)} \in \mathbb{R}^W$ is mapped to a new vector $e^{(i)} \in \mathbb{R}^F$. The relatedness between words $x^{(i)}$ and $x^{(j)}$ can also be interpreted in several ways.
The taxonomy relatedness appears between words because they share many common features (for example, words "teach" and "educate" are related because both are members of the same category). The thematic relatedness emerges when even dissimilar words frequently co-appear in the same context (for example, words "student" and "diploma" do not belong to the same category, yet they are closely related to each other). The thematic relatedness includes some of the features of taxonomy relatedness. The thematic relatedness assumes context overlap as indication of feature overlap [20].
The larger the corpus available for embedding training, the better the embedding performs in the NLP task. An embedding can be used as an integrated layer in the deep learning model or via transfer learning. Transfer learning in this case consists of using an embedding model pre-trained on a large text corpus. The practical reason for this is that a very large corpus is usually not available for a specific NLP task. Using a pre-trained embedding, one only needs to fine-tune it to adapt it to the specific task [21].

Transformer Neural Networks
Text embedding can be performed using various techniques. One example of such a technique is the use of transformer architecture. Transformers might be used in many practical settings, such as language modelling, speech recognition, image and music analysis [22].
The transformer is created by modifying parts of an auto-encoder [23]. The auto-encoder maps input to output, and the dimensions of input and output might differ. It consists of two important blocks: an encoder and a decoder [24]. The transformer uses the same logic as the auto-encoder, replacing fully-connected or convolutional layers with an attention function that assigns specific weights over the input sequence. The advantage of the attention function is that it allows the encoder to process longer sequences, which is very important for NLP tasks.
More formally, the attention transformation $A(Q, K, V) : \mathbb{R}^{d_x} \to \mathbb{R}^{d_v}$ is used in the transformer neural network. First, the input is decomposed by calculating the three matrices $Q = \theta_Q x$, $K = \theta_K x$, and $V = \theta_V x$; then the attention is calculated as
$$A(Q, K, V) = \mathrm{softmax}\big(c\, Q K^\top\big) V,$$
where $c = 1/\sqrt{d_k}$ is the normalization constant and $d_k$ is the dimension of $K$. The matrix $Q$ is called the query, $K$ the keys, and $V$ the values matrix.
Analogously to a fully connected neural network, several attentions can be combined, which is known as multi-head attention $MHA(Q, K, V)$; this enables the model to jointly combine information from different representations of the input at different positions:
$$MHA(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, \theta^O, \qquad \mathrm{head}_i = A\big(Q \theta_i^Q, K \theta_i^K, V \theta_i^V\big),$$
where the projections $\theta_i^Q$, $\theta_i^K$, $\theta_i^V$ and $\theta^O$ are parameter matrices. Each multi-head attention layer is followed by a fully connected transformation with the ReLU activation function $\sigma(x) = \mathrm{relu}(x) = \max(0, x)$ [11].

In this study, we use the bi-directional transformer BERT (Bidirectional Encoder Representations from Transformers, hereafter abbr. BERT) [25]. BERT is trained using a large plain text corpus [25]. Being a transformer-type architecture, BERT also uses the attention function. Base BERT has 12 encoders that take a text sequence and forward it from one encoder to the next; the output for each sequence is a vector of length 768 [26]. The most innovative part of BERT is its ability to learn sequential information in both directions. Naively conditioning each word on its previous and next words would allow the word to "see itself" [25]. Therefore, BERT masks some of the words in the input and then bidirectionally conditions on the entire sequence to predict the masked words. In order to force the model to learn despite the presence or absence of the mask, usually only a small portion of words is randomly masked [25]. The model is trained for masked-token prediction and next-sentence prediction together; next-sentence prediction is used to predict whether the following sentence is random [25]. BERT embeddings have been shown to be highly accurate in NLP tasks [26].
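A single attention head can be sketched directly from the formula above; the sequence length and dimensions here are illustrative:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(c * Q K^T) V, c = 1/sqrt(d_k)."""
    c = 1.0 / np.sqrt(K.shape[-1])
    weights = softmax(c * (Q @ K.T))  # one weight per (query, key) pair
    return weights @ V

rng = np.random.default_rng(0)
seq_len, d_k, d_v = 4, 8, 6
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))
V = rng.normal(size=(seq_len, d_v))
out = attention(Q, K, V)  # shape (seq_len, d_v)
```

Multi-head attention simply runs several such heads on projected inputs and concatenates the results.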

Proposed Models
In this study, six models based on the theoretical reasoning described in the sections above were constructed. The sequence of proposed models is presented in Figure 1. The mathematical models are described in Table 1.

Learning Outcomes
Know cell structure and behavioural patterns at the molecular level, and the functions of human organs and systems and the mechanisms of physiological regulation. Be able to analyse, manage and model data from the field of systems biology with the aim to develop new technologies. Be able to select an appropriate modelling strategy for a given biological domain or problem, and be able to set up as well as lead projects for fundamental or applied research with the aim to develop new technologies in order to solve arising issues. Be able to gather and analyse information on research related to systems biology with a critical approach, and to carry out a technological watch.

Models consist of input data that is either categorical (Cat. in the table), i.e., a sparse one-hot representation of the given education program, or textual (Text in the table), i.e., the description of the education program. The architectures of the models were built as simple linear, deep feed-forward neural networks (DNN), or included a transformer neural network (BERT in the table). The data output consisted of either all possible classes (All in the table) or the output of hierarchical classification (Hier. in the table). The loss functions were either binary cross-entropy or focal loss. Each model was prepared and tested in the following manner: first, the data was prepared and pre-processed, then the architecture of the model was constructed and training was conducted. Finally, each model's performance was evaluated and compared against the other models. Testing was also performed by applying the models' predictions in real-world situations. Each stage of experimentation with the models is described in detail in the following sections.

Data
The experiments were performed on national-level Lithuanian administrative data consisting of information about residents' acquired education and occupations. The models were created in the Government Strategic Analysis Center of Lithuania (STRATA) within the human resources forecasting project. Only the results of the experiments are published in this article.
The data for the training was constructed using several national registers. Data are grouped into 3 main types.
Educational information was retrieved from the register of the Education Management Information System (lit. Švietimo valdymo informacinė sistema). Education information consisted of anonymized personal-level data on vocational education, college and university studies (Bachelor's, Master's degree, integrated studies), and professional education (doctor's medical practice and pedagogy). There is no equivalent administrative data about doctoral degrees, as at this level of education, unique programs are composed for individual students. Observations consist of education information of graduates or students that achieved graduation (if the education program was not completed, the case was dropped from the research sample). Education information was additionally translated into English for the textual analysis using the Google Cloud Translation API [27].
Educational program abstracts were retrieved from the official and publicly available register, AIKOS, managed by the Ministry of Education, Science, and Sport (approx. 2500 valid programs) [28]. The data was retrieved using scraping tools: the Chrome extension Web Scraper [29] and Scrapy [30]. The retrieved information was further processed: it was translated into English with the Google Cloud Translation API [27].
Occupational information was retrieved from The State Social Insurance Fund Board under the Ministry of Social Security and Labour (hereafter abbr. SODRA). Anonymized personal-level occupational data was retrieved for the identified graduates. Occupations were mapped to 4-digit codes according to the Lithuanian version of the Standard Classification of Occupations (hereafter abbr. LPK classification), created in 2012 in accordance with the International Standard Classification of Occupations ISCO-08 [31]. The 4-digit occupational ID has a hierarchical structure: the first digit branches out into sub-groups, which branch out further. The first digit represents the status level of the occupation; for example, "1" represents the class of managers and "2" professionals. The second digit narrows down the speciality of the occupation; for example, "21" represents science and engineering professionals. The third digit shows even more detailed occupational granularity, e.g., "241" signals "Finance professionals". The fourth digit provides the most detailed information about the occupation; for example, "2411" represents "Accountants". The number of classes differs per level: the first-digit group consists of 10 classes, the second of 43, the third of 130, and the fourth level has 436 classes. Not all classes were observed in the data.
Additional EDA is presented in Figure 2. It is important to note that one person generally completes 1-2 education programs and works one job at any point in time during their career. However, a significant number of cases diverge from the general trend and should be accounted for during the learning process.
Data for the experimentation was constructed as follows: because a single person may have completed multiple education programs, with a different number of occupations after and in between programs, their career timeline was split at program completion moments into multiple data entries. The sequence of completed education programs per person is combined incrementally from the first to the last education. The sequence of completed programs in each data entry is linked to the last occupation in the same time-frame. For example, if a person completed two education programs, we would have two data entries: the first links the first education program to the last occupation held before the second program was completed; the second links both the first and second education programs to the last occupation held after the second program was completed. Training, testing and validation sets were split as follows: 64 percent of the data for training, 16 percent for validation and 20 percent for testing. Thus, for the dataset of 267,201 cases, 171,008 cases were used for training, 42,752 for validation and 53,441 for testing.
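The timeline-splitting logic described above can be sketched as follows; the program IDs, occupation codes and years are hypothetical examples, not taken from the data:

```python
def build_entries(programs, jobs):
    """Split one resident's career timeline at program-completion moments.
    programs: [(completion_year, program_id)] sorted by year;
    jobs: [(year, occupation_code)] sorted by year.
    Entry i links the programs completed so far to the last occupation
    held before the next program completion (or the overall last one)."""
    entries = []
    next_completion = [year for year, _ in programs[1:]] + [float("inf")]
    for i, bound in enumerate(next_completion):
        completed = [pid for _, pid in programs[: i + 1]]
        held = [occ for year, occ in jobs if year < bound]
        entries.append((completed, held[-1] if held else None))
    return entries

# Hypothetical resident: two programs, two jobs (LPK-style 4-digit codes).
programs = [(2010, "nursing_BA"), (2015, "public_health_MA")]
jobs = [(2011, "2221"), (2016, "1342")]
entries = build_entries(programs, jobs)
```

This yields two entries for the resident: the first program paired with the occupation held before the second graduation, and both programs paired with the occupation held afterwards.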

Architecture of Experimentation Models
In this study, 6 models were proposed. A summary of the layers and parameters of the proposed models is presented in Table 2. The models are described in numerical detail in the following subsections.

Baseline, Simple Classification and Simple Classification with Focal Loss Models
For the baseline model, input and output layers and 1 hidden layer with 1 neuron were used. Input and output were constructed as one-hot encoded representations.

Hierarchical Classification
For the hierarchical model, the data was preprocessed in the same way as for simple classification, except for the output data. For the output, the natural hierarchical structure of the occupational code was used, as presented in Figure 3. The data was clustered according to the hierarchy of digits in the code (each digit group as a separate variable).

Text Embeddings
The descriptions of education programs were used as input text data. Pretrained BERT by Google [32] was used to create the text embeddings.

Final Full Model
For the final model, BERT embeddings were used as input data, and output data in hierarchical form was used.

Text Pre-Processing
The necessity to use techniques such as lemmatization, stemming, removal of stopwords and punctuation, etc., before transforming text into embeddings depends on the specific task, the data at hand and the embedding method. For instance, if the context of the text is taken into account in the embedding process, preprocessing methods such as lemmatization might remove the parts of the text that are responsible for the context and thus worsen the transformation. Moreover, the BERT programming library already contains standard forms of text preprocessing [26]. Thus, the text was not additionally preprocessed, assuming that the procedures within the embedding-supporting libraries would suffice.
Tokenization is the process whereby the text is split into a number of tokens indexing unique words. For tokenization, BERT uses a vocabulary of 30,522 words. BERT tokenizes not only whole words but also sub-words, which allows articles, expressions or words unknown to the dictionary to be tokenized (split and used nonetheless). The maximum length of an accepted sequence is 512 tokens. For longer sequences (the case of program descriptions using full texts), truncation to the maximum length was applied: overly long sequences were truncated token by token, removing a token from the longest sequence in the pair until the proper length was reached. The vector size for each input instance is 768.
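BERT's tokenizer implements this truncation internally; a minimal sketch of the longest-first strategy, with an illustrative length limit and integer stand-ins for tokens:

```python
def truncate_pair(tokens_a, tokens_b, max_len=512):
    """Longest-first truncation: drop one trailing token at a time from the
    currently longer sequence until the pair fits within max_len tokens."""
    while len(tokens_a) + len(tokens_b) > max_len:
        longer = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
        longer.pop()  # remove the last token of the longer sequence
    return tokens_a, tokens_b

# Illustrative pair: 10 + 4 tokens truncated to fit a limit of 8.
a, b = truncate_pair(list(range(10)), list(range(4)), max_len=8)
```

Removing tokens from the longer sequence first balances the pair instead of discarding one side entirely.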

Computational Infrastructure and Complexity
Tensorflow [33] was used as the DNN framework. Each model was trained for 30 epochs. The batch size was set to 256, except for the models with text embeddings, for which a batch size of 128 was used, as a batch size of 256 exceeded the available computing memory.
All models, except for the baseline, were constructed using two fully connected hidden layers with the ReLU activation function. The output layers used the sigmoid activation function. Selecting the most appropriate number of neurons and other hyperparameters for each task requires experimentation [34]. For hyperparameter tuning, several settings were tested per model: the number of neurons (100 or 1000) and the dropout value (no dropout, 0.1 or 0.5). This way, each model was trained 6 times with different hyperparameters.
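As a minimal sketch of the forward pass of this architecture (the study used TensorFlow; here the layer arithmetic is shown in NumPy with placeholder weights, omitting dropout, which is active only during training):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, w1, b1, w2, b2, w_out, b_out):
    """Two fully connected ReLU layers followed by a sigmoid output,
    producing one independent probability per occupation class."""
    h1 = relu(x @ w1 + b1)
    h2 = relu(h1 @ w2 + b2)
    return sigmoid(h2 @ w_out + b_out)
```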
Training was performed on a GPU (Nvidia 1070 Ti, 8 GB). To use memory efficiently, the data was stored in sparse matrix form, where only the nonzero elements are stored, indexed by their matrix coordinates.
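This coordinate-based sparse storage corresponds to SciPy's COO format; a minimal example (hypothetical toy matrix, not the study's data):

```python
import numpy as np
from scipy.sparse import coo_matrix

# One-hot label matrices are mostly zeros, so storing only the coordinates
# of nonzero entries keeps them in memory comfortably.
dense = np.zeros((4, 6), dtype=np.int8)
dense[0, 2] = dense[3, 5] = 1
sparse = coo_matrix(dense)
# sparse.row, sparse.col and sparse.data hold just the two nonzero entries
```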
To evaluate the models, loss, binary accuracy, Recall, Precision and F1 metrics were used (a probability threshold of 0.4 was set because initial experimentation showed that the maximum predicted probabilities did not reach 0.5).
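The thresholded metrics can be computed as follows (a NumPy sketch with a hypothetical helper; the study used the framework's built-in metrics):

```python
import numpy as np

def prf_at_threshold(y_true, y_prob, threshold=0.4):
    """Precision, Recall and F1 with predictions binarised at `threshold`
    (0.4 here, since maximum predicted probabilities stayed below 0.5)."""
    y_pred = (y_prob >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```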

Baseline Model
After 30 epochs, the estimated loss was 0.014 and binary accuracy was 0.998, but Precision, Recall and F1 were each 0 (see Table 3). Thus, it is important to note that the binary accuracy metric is not suited for classification tasks with a large number of classes or an imbalanced training dataset: if no label is predicted, accuracy is still highly rewarded. The Precision, Recall and F1 metrics are more informative.
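A small numerical illustration of this pitfall, assuming roughly one positive label per example across the 433 most detailed classes:

```python
import numpy as np

# A model that predicts nothing still scores near-perfect binary accuracy,
# because all but one of the 433 per-example label slots are correctly zero.
n_examples, n_classes = 1000, 433
y_true = np.zeros((n_examples, n_classes), dtype=int)
y_true[np.arange(n_examples), np.random.randint(0, n_classes, n_examples)] = 1
y_pred = np.zeros_like(y_true)               # degenerate "predict nothing" model
binary_accuracy = (y_pred == y_true).mean()  # 432/433, approx. 0.998
```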

Simple Multilabel Classification with and without Focal Loss
Initial exploratory analysis showed a large class imbalance: ρ = 18,041. Moreover, approximately 35 percent of the classes do not exceed 100 cases. Therefore, in addition to the simple multilabel classification model with binary cross-entropy loss, a model with Focal loss, which addresses class imbalance, was tested.
The experiments showed that setting the γ parameter to 3 gives the best results. Overall, further analysis showed that Focal loss achieved better results than binary cross-entropy loss under all hyperparameter settings. The results in Table 3 show that Focal loss improves the Recall and F1 values by 3 percent in the simple multilabel classification task. Despite the drop in Precision, F1 increased, which can be interpreted as an improvement from this loss regularization. Moreover, the hyperparameter experiments show that the best result is achieved by two combinations: a small number of neurons (100) with medium dropout (0.1), and a large number of neurons (1000) with large dropout (0.5). This suggests that for simple multilabel classification tasks, larger and more complex networks tend to overfit and therefore need a larger dropout value.
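The standard binary focal loss used here can be sketched in NumPy (the study applied it within TensorFlow; this is a simplified per-batch version without class weighting):

```python
import numpy as np

def focal_loss(y_true, p, gamma=3.0, eps=1e-7):
    """Binary focal loss: down-weights easy examples via the (1 - p_t)^gamma
    factor; with gamma = 0 it reduces to binary cross-entropy."""
    p = np.clip(p, eps, 1 - eps)
    p_t = np.where(y_true == 1, p, 1 - p)   # probability of the true label
    return np.mean(-((1 - p_t) ** gamma) * np.log(p_t))
```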
Focal loss was applied as the default loss function for the following models.

Hierarchical Classification
The exploratory analysis revealed that the classes used for the classification task are unbalanced. For example, at the first digit (the first hierarchical level), the largest LPK occupation set, "2", has 128,713 cases (approx. 49 percent of total cases), while the smallest, LPK group "6", has 523 cases. The class imbalance per level was as follows: for the first digit, ρ = 254; for the second, ρ = 3924; for the third, ρ = 22,919; and for the last level, as stated previously, ρ = 18,041. Thus, even at the shallowest level, the data is extremely unbalanced. Therefore, the standard binary cross-entropy loss was customized by applying the γ parameter of Focal loss.
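Assuming ρ denotes the ratio of the largest to the smallest class count at a given level, it can be computed per level as follows (a small illustrative helper, not the study's code):

```python
from collections import Counter

def imbalance_ratio(labels):
    """rho = (largest class count) / (smallest class count) for one
    hierarchy level, given the per-case class labels at that level."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())
```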
As can be seen from Table 3, hierarchical classification does not improve the classification metrics at the most detailed level compared to the simple classification model: the Recall, Precision and F1 metrics are similar. At shallower levels, F1 scores improve by 6-28 percent. A larger network (1000 neurons per layer) with medium dropout achieved the best performance.
Although hierarchical classification does not improve the metrics at the most detailed level, the model predicts smoothly across the different levels of the output hierarchy. This feature is important for practical applications of the model.

Text Embeddings
The model with text embeddings has fewer parameters than the simple and hierarchical classification models. Moreover, the best performing model requires less regularization (it uses no dropout), as this network architecture tends not to overfit (see Table 3).

Final Full Model
The final model consisted of BERT text embeddings as input and the hierarchical structure of occupational classes as output (with default loss weighting). Full data on the experimentation is presented in Table 3.
All in all, the different models performed better than the baseline: the Recall, Precision and F1 metrics improved; nevertheless, F1 at the most detailed hierarchy level (433 classes) did not exceed 0.30 for any model.

Applications
In order to better understand how the model predictions differ, several example education programs were chosen. For the predictions, the best performing models were used.
Ideally, the models should reflect the natural frequencies in the data: the occupational distribution of the actual data. The final model should predict occupations based on the text description of the program only; good performance would signal the generalization capabilities of the model. The worst-case scenario would be predictions that reflect only the largest occupational classes for all education programs.
The largest occupational groups in the research dataset are: advertising and marketing professionals (occupational code "2431"), 6 percent of all cases; shop sales assistants (code "5223"), 4 percent; accountants (code "2411"), 4 percent; and policy administration professionals (code "2422"), 4 percent.

Investigation of Prediction Capabilities of Different Models
Four example education programs (Musical Performance BSc., Modeling and Data Analysis MSc., Informatics MSc. and a Welder vocational training program) were chosen and fed to the different models to predict the three occupations with the highest probability. The results were compared with the frequencies of occupations given the completed education.
From Table A2, it can be observed that the baseline model performs the worst, as it predicts only the largest occupational groups in the data: advertising and marketing professionals, shop sales assistants and policy administration professionals.
The simple multilabel classification models predict more accurately. The models with binary cross-entropy and focal loss perform similarly, except that the probability estimates are larger in the model with focal loss.
The hierarchical models, as expected, show a tendency to predict in hierarchical order: the highest probabilities are estimated for the classes of the parental hierarchy level and then branch out in accordance with the hierarchical structure (in Table A2, only the single highest probability for the parental head is shown, yet there could be more classes with similar probabilities).
The models with text embeddings are able to generalize and accurately predict occupations from the text data only.

Prediction of Occupations for Several Education Programs
For practical reasons, it is important that the final model be able to handle multiple education programs, because many graduates complete more than one program. Table 4 shows predictions for different combinations of Informatics studies. It can be seen that for bachelor and master studies, the model predicts similar outcomes. Examples of mixed education program combinations were also tested. When the two education programs are Informatics and Music Performance, the predictions still lead to ICT occupations, with slightly smaller estimated probability values. It is important to note that ICT graduates are one of the most successfully employed groups in Lithuania, yet also have one of the largest drop-out rates prior to graduation (almost 50 percent do not complete their study program) [35]. Thus, it is possible that the part of the program description related to the Informatics education outweighs the part describing the Music Performance.

Prediction of Occupations for New Education Programs
We also tested how the model predicts for new programs that some students had already started but from which none had yet graduated (no historical data about occupations). For this exercise, we chose Kaunas University's newly created (2020) bachelor's program "Artificial Intelligence". The prediction is as follows: at the shallowest level, occupational group "2", "Professionals", is predicted with 0.7 probability. Then, with probability 0.5, the following levels are predicted: occupational group "25", "ICT Professionals", and group "251", "Software and applications developers and analysts". The most detailed level is predicted as follows: "2512", "Software Developers", with 0.4 probability; "2514", "Applications Programmers", with 0.4; and "2519", "Software and Applications Developers and Analysts Not Elsewhere Classified", with 0.3 probability. Thus, based on the program description, the model predicts outcomes similar to those of Informatics at Vilnius University.

Similarity Search
Using high-level features, it is possible to estimate how closely the programs are related to each other. High-level features are encoded in the last layers of the deep neural network [36]. In order to investigate the potential use of high-level features for similarity search [37], the last-layer features of the final model were saved. A similarity search was performed over 2056 education programs, each represented by a 1000-dimensional feature vector. The t-Distributed Stochastic Neighbor Embedding (abbr. t-SNE) minimizes the divergence between two distributions by measuring pairwise similarities of the corresponding vectors in the high-level feature space. Given the set of high-level features $\{f_1, f_2, \ldots, f_N\}$ and the Euclidean distance $d(f_i, f_j)$ between a pair of objects, t-SNE learns an $s$-dimensional embedding by estimating joint probabilities $p_{ij}$ that measure the pairwise similarity between objects $f_i$ and $f_j$:

$$p_{j|i} = \frac{\exp\left(-d(f_i, f_j)^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-d(f_i, f_k)^2 / 2\sigma_i^2\right)}, \qquad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N},$$

where $\sigma_i$ is the bandwidth of the Gaussian kernel centered at $f_i$. The visualization of t-SNE for the high-level features [38] is demonstrated in Figure 4. The circled groups on the edges of the set visualized in Figure 4 were analyzed in greater detail. These groups cluster into different types of education programs. For example, programs such as mechanical engineering and physiotherapy are displayed on opposite sides of the set. Similar geometric regularities can be observed for other education types. Related education programs can be identified by their relative proximity.
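The pairwise similarity probabilities at the heart of t-SNE can be computed directly in NumPy (a sketch with a fixed bandwidth σ for all points; t-SNE itself tunes σ_i per point via a perplexity target):

```python
import numpy as np

def conditional_probabilities(F, sigma=1.0):
    """p_{j|i}: Gaussian-kernel similarities over squared Euclidean
    distances between rows of feature matrix F, normalised per row."""
    d2 = np.sum((F[:, None, :] - F[None, :, :]) ** 2, axis=-1)
    logits = -d2 / (2 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)   # exclude k = i from the sum
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)
    return P

def joint_probabilities(F, sigma=1.0):
    """Symmetrised joint probabilities p_ij = (p_{j|i} + p_{i|j}) / 2N."""
    P = conditional_probabilities(F, sigma)
    return (P + P.T) / (2 * len(F))
```

In practice a library implementation (e.g., scikit-learn's TSNE) would be used on the saved last-layer features rather than this hand-rolled version.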

Prediction of Occupations after Program Description Modification
Further analysis of the model's capability to generalize was performed by altering an education program description. For this task, the vocational Welder training program was chosen. We tested whether adding skills to a specific education program's description would lead to different predictions. The Welder program description was supplemented with a set of digital skills. Table A1 shows the results of the experiment. The predicted occupations for the altered program remain related to the Welder occupation; notably, in some instances, the occupational qualification level is also higher.

Conclusions
In this article, a novel application of a transformer neural network for modeling education-to-skill mapping was presented. The final and best performing model is based entirely on the education program description (text input). The model predicts the most probable occupations for the program's graduates. It solves the problem of very sparse input data consisting of a large number of study programs, and sparse output data with a large number of occupations. The proposed model can produce significantly better results for education-to-skill mapping than alternatives using sparse input data. The experiments with focal loss, a loss function addressing class imbalance, showed that the γ parameter can improve model performance. Hierarchical classification allows the shallower levels of the output data hierarchy to be predicted more accurately. Moreover, the shallow levels aggregate classes, and this feature can therefore be meaningfully employed in practice. The effectiveness of the proposed model was demonstrated with experiments on national-level data covering the whole population of Lithuania.
Additional analysis was performed on potential applications of the model and its performance in different settings: applying the model to predict occupations from one or multiple completed education programs, to predict occupations from newly created education programs, to use similarity search among the education programs, and to evaluate the impact of altering the program description.
We have made the code for the proposed network architectures and the interface to predict occupations available to the public. The repository can be found online at https://github.com/Stratosfera/education2skill (accessed on 23 June 2021).

Acknowledgments: The authors would like to thank the Government strategic analysis center of Lithuania (STRATA) for granting permission to publish the results of the created models. Furthermore, the authors would like to thank Google LLC for supplying computational resources for this study via a Google Cloud Platform (abbr. GCP) Research Grant.

Conflicts of Interest:
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Table A2. Illustrative example of frequencies and Top 3 predictions of all models for four selected education programs. The table outlines the LPK code of the occupation, the name of the occupation and the probability that the graduate will be employed in that particular occupation.

Model: The Art of Performance, Bachelor's degree (column headers of Table A2; the table body is not reproduced here).