People Analytics of Semantic Web Human Resource Résumés for Sustainable Talent Acquisition

Abstract: The purpose of this study was to define a data science architecture for talent acquisition. The approach was to propose analytics that derive data. The originality of this paper consists in proposing an architecture to work within the process of obtaining semantically enriched data by using data science and Semantic Web technologies. We applied the proposed architecture and developed a case study-based prototype that uses analytics techniques for résumé data integrated with Linked Data technologies. We conducted a case study to identify skills by applying classification via regression, k-nearest neighbors (k-NN), random forest, naïve Bayes, support vector machine, and decision tree algorithms to résumé data that we previously described with terms from publicly available ontologies. We labeled data from résumés using terms from existing human resource ontologies. The main contribution is the extraction of skills from résumés and the mining of data that was previously described with the Semantic Web.


Introduction
People analytics is fast becoming a key instrument in talent management. Human resource analytics, also called talent analytics, is the application of considerable data mining and business analytics techniques to human resources data [1,2]. A key aspect of people analytics is represented by data about people or human resources. In recent years, there has been an increasing interest in data science and analytics on data. Evidence [3–6] suggests that data science supports organizations by providing descriptive, predictive, and prescriptive analytics. Talent management could benefit from all these techniques, especially in the phase of talent acquisition. Talent acquisition is an integral part of talent management. Unfortunately, there is limited literature in human resource analytics to guide the use of machine learning algorithms [7]. Even if the skills and ability to conduct these analyses are present, it is still a challenge to gather the data necessary to turn information into results [8].
To date, there has been little agreement on the necessary data for talent analytics, but there is a common agreement that skills, work experience, and education form the basis of building a résumé. Websites like LinkedIn, Indeed, Jobup, and others try to achieve better matching between job positions and résumés. LinkedIn, for example, applies machine learning to individual profiles and extracts features like skills, seniority, and industry. Similar features are extracted from the content of the job listing. Furthermore, logistic regression models are used to rank relevant jobs for a given member using these features [9].
First, we analyzed the existing literature on human resource analytics. Second, we studied the subject from the perspective of the Semantic Web. In the last decade, nonparametric methods (machine learning algorithms) have gained great attention in the field of human resource management practice. Examples of the use of analytics in talent management are data mining (extracting and examining data from large databases), sentiment analysis, and controlled tests such as A/B testing [10]. However, despite the benefits of using and implementing these technologies, little is known about how to benefit from the Semantic Web and analytics on data, specifically about how to link and derive data from people's résumés. In this study, we proposed a Semantic Web data science architecture and validated it on résumés described with the Semantic Web.
Srivastava et al. [11] provided several predictive analytics to address talent acquisition needs such as predicting joining delay, selection likelihood, and offer acceptance likelihood. Dutta et al. [12] used data mining for getting insights and text mining for talent acquisition efficiency improvement. Faliagka et al. [13,14] proposed a system that implements candidate ranking, using objective criteria that are made available from the applicant's LinkedIn profile. The candidate's personality features are also extracted from their social activity using linguistic analysis. Faliagka et al. [14] used text mining of LinkedIn for creating profiles and linguistic analysis for inferring personality characteristics. Palshikar et al. [15] extracted attributes from candidate résumés while planning to combine information from multiple online and social platforms for the technical and domain skills using extraction tools. Mooney and Bunescu [16] applied knowledge extraction from unstructured text using text mining. With increased use of machine learning and natural language processing techniques, Téllez-Valero et al. [17] and other researchers tried to solve this problem of automatic extraction. With résumés, different extraction techniques are used to make the candidate selection process [18] easier and more automatic.
Previous studies [19] reported a machine learning application for the human resource data mining problem. Xie and Tang [20] used fuzzy neural networks for human resource. With respect to recruitment data mining, there are studies that use clustering and classification algorithms [20] to prove that fuzzy C-means and K-means clustering techniques are not suitable for this type of data distribution. It has been observed that trees constructed with the C4.5 algorithm (decision tree algorithm) have better accuracies. Another type of application is that of profile development [21].
Aldarra and Munoz [22] applied the J48 algorithm to construct a Linked Data-based decision tree classifier to review movies. They used SPARQL Protocol and RDF Query Language (SPARQL) queries to derive features. Mehenni and Moussaoui [23] built a regression model for predicting the most useful links that will be connected to build a multi-relational decision tree for heterogeneous databases. Sanchez-Marono et al. [24] discussed the use of decision trees learned from questionnaire data as behavioral models for the agents, comparing various pre-processing methods and exploring their differences.
There is very little scientific understanding of skills from résumés. In addition, to the best of our knowledge, only a limited number of research papers comparing and evaluating the performance of different analytics algorithms with different training sample strategies using résumé data have been published.
Current implementations of Linked Data mining are promising [18,25,26]. However, the full potential of the Semantic Web and Linked Open Data for data mining and knowledge database discovery is still to be unlocked [27]. Now, there are some developments of human resource ontologies, such as the Human Resources Management Ontology [28]. The literature notes that the current concerns are [29]: (1) publishing job postings and applicant profiles enriched through domain ontologies/controlled vocabularies, (2) pre-selection of the candidates based on semantic matching techniques implemented in addition to these ontologies and the associated automated reasoning, and (3) delivering interview recommendations to employers or suitable open positions to job seekers based on the semantic matching of the annotated applicant profiles with the job postings [30]. Our work addresses applicant skills profiles enriched through domain ontologies and analytics in the context of data analytics Semantic Web architecture.
The major objective of this study was to investigate the possibilities that data analytics offers for skills identification. The research goal of this paper is to increase the efficiency of the analytical data processing through semantically described data and query processing. The term efficiency, in this paper, is understood in a broader sense: by semantically describing data, the analytical processing is improved.
Methodologically, the research presented in this article follows the design science research paradigm [31]. It uses data from a case study to define its solution objectives. The artifact, the information system architecture, is constructed based on the analogy to the human resource process, the literature, and data from the case study. To evaluate the information system architecture, a prototype is developed and its limitations are analyzed through accuracy measures.
The case study takes into consideration résumé data described with ontologies. We identified the candidates' skills by discovering relationships between the employee's skills, work experience, and education on one side, and the current position held on the other side. The structure of this paper is as follows: Section 2, Materials and Methods, presents the research methodology, Resource Description Framework (RDF) knowledge base construction, and feature engineering; Section 3 discusses the results; and Section 4 presents the discussions.

Materials and Methods
The "validation in context" is a key feature. Therefore, we first proposed a Semantic Web data science architecture and, after deciding that the proper context is résumé websites, we validated the artifact.
Our article proposes analytics in an architecture that also includes Semantic Web technologies with the specific objective of identifying and quantifying the contribution of these technologies, starting from the idea that potential relationships established on the semantic basis can contribute to the analytical model of data.

Research Hypothesis
Our work is guided by the following research questions: (1) Is it possible to obtain semantically improved data by using data science on Semantic Web-described data? and (2) What will the necessary architecture to support data science on Semantic Web data look like?
We hypothesized that a Semantic Web data science architecture can be established for the process of linking résumé data. This design is based on a set of architectural decisions made to discover links between Linked Data, specifically between résumé data described with ontology terms (H1). We tested this architecture on a case study-based prototype. We validated the results by analyzing the accuracies, receiver operating characteristic (ROC), and precision recall curve (PRC) values of different classification algorithms applied on the dataset and on the dataset that we enriched with features obtained by aggregating data.
We further hypothesized that using this architecture discovers links between data (H2). We derived links between data by discovering the best predictors for every skill. We validated the results by analyzing the accuracies and ROC and PRC values of the decision tree algorithm applied on the dataset that we enriched with features obtained by aggregating data.
Figure 1 presents the structure of a résumé. Each résumé contains information about work experience (responsibilities and position held), education, and additional information such as technical skills. The main idea was to structure résumés on a semantic basis. Figure 2 provides an overview of the architecture as it is currently implemented. The components of the architecture arise from the corresponding architectural decisions made for pragmatic, technical, and scientific reasons.
(1) The web scraper component seeks résumé data across the Indeed résumé website [30]. It extracts data from résumés written in HTML and saves the data in the comma-separated-value (CSV) format.
The source code is available at https://github.com/catalinstrimbei/rdf-mining-hr [32].

Data Acquisition
Currently, there are no public linked datasets that contain human resource data about competences or résumé data. Therefore, we obtained the dataset to build our classifiers by creating a web scraper for the Indeed résumé website (https://www.indeed.com/resumes) and transformed the data into Turtle/Resource Description Framework by using OpenRefine [33]. We scraped data from Indeed, a website that contains people's résumés, publicly accessible online in HTML format. Acquiring résumés requires a search keyword. We intended to obtain résumés for people belonging to the same field of work. Therefore, we limited our searches to résumés related to the Java industry keyword. We used a word cloud to identify the main keywords encountered in the skills section of every résumé. This way we extracted 677 web addresses that link to résumés from the Java industry. We parsed these HTML pages and extracted information from 213 résumés. Specifically, only 213 of the 677 résumés presented information structured according to Figure 2. We processed the data and initially stored it in comma-separated-value (CSV) format.
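The scrape-and-store step can be illustrated with a minimal Python sketch that parses one résumé-like HTML fragment and writes the extracted fields to CSV. The HTML layout, the class names `position` and `skill`, and the function name `resume_to_csv` are hypothetical; the actual Indeed markup and the authors' scraper [32] may differ.

```python
import csv
import io
from html.parser import HTMLParser


class ResumeParser(HTMLParser):
    """Collects text from elements whose class attribute names a field.

    The class names below are hypothetical placeholders, not Indeed's markup.
    """

    FIELDS = {"position", "skill"}

    def __init__(self):
        super().__init__()
        self.current = None                       # field currently being read
        self.values = {field: [] for field in self.FIELDS}

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class in self.FIELDS:
            self.current = css_class

    def handle_data(self, data):
        if self.current and data.strip():
            self.values[self.current].append(data.strip())

    def handle_endtag(self, tag):
        # Any closing tag ends the current field (no nested markup handled).
        self.current = None


def resume_to_csv(html_text):
    """Parse one HTML fragment and return a CSV string with a header row."""
    parser = ResumeParser()
    parser.feed(html_text)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["positions", "skills"])
    writer.writerow(["; ".join(parser.values["position"]),
                     "; ".join(parser.values["skill"])])
    return out.getvalue()


sample = (
    '<div><p class="position">Java Developer</p>'
    '<span class="skill">Java</span><span class="skill">SQL</span></div>'
)
csv_text = resume_to_csv(sample)
print(csv_text)
```

A résumé page that lacks the expected elements yields empty fields, which is one plausible way pages that do not follow the expected structure could be filtered out.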
The résumé data had to be transformed into RDF according to public ontologies found in the human resource field. Therefore, we mapped data from résumés to ontology concepts and properties according to our own human resource ontology, which extends the human resource ontology published by the Ontology Engineering Group (OEG). The aim of this ontology is to represent knowledge related to the human resource hiring process. The human resource ontology developed by the Ontology Engineering Group is suitable for our purpose and is available at http://mayor2.dia.fi.upm.es/oeg-upm/index.php/en/ontologies/99-hrmontology/ [34]. We used the following ontologies: JobSeeker, Occupation, Education, Competence, and Skill. The public URIs do not work, and therefore we adapted the namespaces of every ontology file. We also combined the JobSeeker, Occupation, and Education ontologies into a single ontology (JobSeeker) because the fine granularity of Education and Occupation did not present findings of interest with respect to the research scope. The focus is on work experience and skills. The knowledge base consists of 213 résumés with data described by the JobSeeker, Competence, and Skill ontologies, which are accessed using SPARQL queries to generate our set of features to train the classifiers.
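To give a sense of what the mapping from a tabular row to RDF looks like, the following Python sketch emits Turtle for one job seeker. It is only an illustration: the authors performed this step with OpenRefine, and the namespace `http://example.org/hr#` and the property names `hasWorkExperience`, `jobTitle`, and `hasSkill` are assumptions, not the actual terms of the OEG ontology.

```python
BASE = "http://example.org/hr#"   # hypothetical namespace, not the OEG one


def literal(text):
    """Escape a Python string as a Turtle string literal."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def row_to_turtle(seeker_id, position, skills):
    """Return Turtle triples describing one job seeker.

    Class and property names (JobSeeker, WorkExperience, Skill,
    hasWorkExperience, jobTitle, hasSkill) are illustrative assumptions.
    """
    seeker = f"hr:seeker{seeker_id}"
    work = f"hr:work{seeker_id}_1"
    lines = [
        f"@prefix hr: <{BASE}> .",
        f"{seeker} a hr:JobSeeker ;",
        f"    hr:hasWorkExperience {work} .",
        f"{work} a hr:WorkExperience ;",
        f"    hr:jobTitle {literal(position)} .",
    ]
    for skill in skills:
        # Build a simple local name from the alphanumeric characters only.
        skill_node = "hr:skill_" + "".join(c for c in skill if c.isalnum())
        lines.append(f"{skill_node} a hr:Skill .")
        lines.append(f"{seeker} hr:hasSkill {skill_node} .")
    return "\n".join(lines)


turtle = row_to_turtle(1, "Java Developer", ["Java", "SQL"])
print(turtle)
```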
Figure 3 presents the key classes and properties of the ontologies used. A job seeker has work experience, candidacy, education, competence, and skill as a subclass of competence. Central to the ontology is the WorkExperience concept and its object properties that relate the concept of Work Experience to Candidacy and JobSeeker. The Candidacy concept requires Competence, and Skill is a subclass of Competence.
The data properties (depicted with dashed arrows in Figure 2) allow the values of Work Experience, Education, and JobSeeker to be specified.
Every resource in our dataset is identified using a uniform resource identifier (URI). URIs have been designed with simplicity and manageability principles in mind.
The matching between a résumé and the terms from ontologies is presented in Figure 4.
The major purpose of data processing was to access data about work experience, education, and skills. Besides storing raw data and transforming it to RDF, we also derived some new features, like the total years of experience, the years of experience in the current job position, and the average experience (measured in years) in every position held. These features were derived by using SPARQL. The operationalization of the variables is described in Table 1.
For processing Semantic Web data, we identified two different types of features: (1) features derived with SPARQL and (2) features derived with SPARQL by aggregating data.
The information concerning skills is split between different technical skills. We restricted the skills of interest to SOA, NoSQL, SQL, Java, Java Web, and Java Persistence. We derived these categories by using a word cloud. Therefore, we queried the RDF data through SPARQL queries to find out which candidates have these kinds of skills.
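The resulting dichotomous skill features can be pictured with a short Python sketch. The authors obtained these values with SPARQL queries over the RDF data; the plain-Python version below only illustrates the encoding, and the keyword sets are hypothetical because the paper does not list the terms matched for each category.

```python
import re

# Hypothetical keyword sets per skill category; the paper derived its
# categories from a word cloud but does not list the matching terms.
SKILL_KEYWORDS = {
    "SOA": {"soa", "soap"},
    "NoSQL": {"nosql", "mongodb"},
    "SQL": {"sql", "mysql"},
    "Java": {"java"},
    "Java Web": {"jsp", "servlet", "servlets"},
    "Java Persistence": {"jpa", "hibernate"},
}


def skill_flags(skills_text):
    """Return a 0/1 flag per category based on whole-token matches."""
    # Whole tokens avoid false hits such as "java" inside "javascript".
    tokens = set(re.findall(r"[a-z0-9]+", skills_text.lower()))
    return {category: int(bool(tokens & keywords))
            for category, keywords in SKILL_KEYWORDS.items()}


flags = skill_flags("Java, JavaScript, Hibernate, MongoDB, JSP")
print(flags)
```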

Table 1. Operationalization of variables.

| Feature | Ideas | Details |
|---|---|---|
| Total years of experience [35] | Extensive experience of activities in a domain is necessary to reach very high levels of performance. | "Expert performance is acquired gradually and the effective improvement of performance requires the opportunity to find suitable training tasks that the performer can master sequentially" |
| The years of experience at the current job position [36] | Values that are too big or too small are subject to further analysis | "Job satisfaction is positively correlated with mission valence, commitment, person-job fit, flexible work, pay, innovation, and a variety of other individual and organizational factors" |
| The average of the years of experience in every position held [37] | Variety in work experiences might influence forming high performers | "According to 50 senior executive search professionals the study surveyed, the average executive today will work in five companies; in another 10 years, it might be seven". "Ineffective people often stay in position for years". |
| The total number of positions held (Position_count) [38] | The total number of positions might influence forming high or low performers | "Job satisfaction more strongly determines organizational performance than organizational performance determines job satisfaction" |

Source: Our own projection.
The information related to work experience is split into the different positions held and their corresponding beginning and ending dates. We queried the data through SPARQL queries to obtain the last position held, the total years of experience, and the average time spent at one position. In addition, we queried the data to find out the total number of positions held.
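The aggregate work-experience features can be sketched in Python as follows. This is an equivalent computation for illustration only, since the authors derived the values with SPARQL aggregation. The names `Years_experience_position` and `Position_count` appear in the paper; the other key names, the `(title, start, end)` input layout, and the assumption that positions do not overlap in time are ours.

```python
from datetime import date


def experience_features(positions, today):
    """Aggregate features from (title, start_date, end_date) tuples.

    end_date is None for the position still held. Overlapping positions
    are simply summed, which is an assumption of this sketch.
    """
    spans = []
    for title, start, end in positions:
        end = end or today
        spans.append((start, title, (end - start).days / 365.25))
    spans.sort()                                  # chronological by start date
    total = sum(years for _, _, years in spans)
    return {
        "Current_position": spans[-1][1],
        "Years_experience_total": round(total, 2),
        "Years_experience_position": round(spans[-1][2], 2),
        "Years_experience_average": round(total / len(spans), 2),
        "Position_count": len(spans),
    }


features = experience_features(
    [
        ("Java Developer", date(2015, 1, 1), None),
        ("Junior Developer", date(2012, 1, 1), date(2015, 1, 1)),
    ],
    today=date(2019, 1, 1),
)
print(features)
```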
The information concerning education is split between the education level and the corresponding specialization. We identified computer science, information technology, electronics, computer engineering, software engineering, computer applications, and other specializations that we grouped as OTHER. We queried the data to obtain the education level (master or bachelor) and the corresponding specialization field. We derived 15 features from the data using SPARQL queries. The next section presents the obtained results.
Feature definitions and SPARQL specifications are presented in Table 2.
In order to obtain the current position held, we created a SPARQL query that extracts the positions held by each job seeker. After querying the data, the provided sample consists of 120 résumés along with the anonymized URIs, objective (position held), work experience, education, and skills. Our resulting RDF knowledge base comprises 14,206 RDF triples. Our features contain mixed continuous (numerical) and dichotomous (categorical) types that can be handled by the data mining techniques. Figure 5 summarizes the features used in this work.
Concerning the size of the sample, we studied the learning curve (Figure 6). Learning curves are a tool to do a quick check on the models at every point in the machine learning workflow. Bias and variance are inherent properties of estimators, and we usually have to select learning algorithms and hyperparameters so that both bias and variance are as low as possible. In order to study the learning curve, we applied regression on our data. For regression, the perfect scenario is when both curves converge toward an MSE of 0.
The training data is fitted very well by the estimated model. If the model fits the training data very well, it means it has low bias with respect to that set of data. Therefore, we decided that the sample size is proper. We consider that the sample size is sufficient for generalization, as replication with other populations or conditions helps to define parameters related to education, work experience, and skills.
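A learning curve of the kind described here can be sketched in a few lines of Python. The example fits a one-variable least-squares line on growing training subsets and records training and validation MSE; the synthetic data and the function names are assumptions for illustration, not the authors' data or regression model.

```python
def fit_line(points):
    """Ordinary least squares for y = a + b*x on (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mse(points, intercept, slope):
    """Mean squared error of the fitted line on the given points."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in points) / len(points)


def learning_curve(train, valid, sizes):
    """Return (n, training MSE, validation MSE) for each training size n."""
    curve = []
    for n in sizes:
        intercept, slope = fit_line(train[:n])
        curve.append((n, mse(train[:n], intercept, slope),
                      mse(valid, intercept, slope)))
    return curve


def noisy(x):
    """Synthetic target: a line plus a deterministic +/-0.5 disturbance."""
    return 2 * x + 1 + (0.5 if x % 2 == 0 else -0.5)


train = [(x, noisy(x)) for x in range(20)]
valid = [(x, noisy(x)) for x in range(20, 30)]
curve = learning_curve(train, valid, sizes=[4, 8, 12, 16, 20])
for n, train_mse, valid_mse in curve:
    print(n, round(train_mse, 3), round(valid_mse, 3))
```

A low training MSE indicates low bias on the training set, while the gap between the two curves is the usual indicator of variance, so both columns are worth reading together.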
To date, various methods have been developed and introduced to mine data. We used the k-nearest neighbors algorithm (k-NN), naive Bayes classifiers, support-vector machine, random forest, regression, and the decision trees technique. The C4.5 classifier, a widely used tree-based classifier, generates a decision tree from a set of training examples. C4.5 is implemented as the J48 classifier in WEKA, an open source data mining tool. The heuristic function used in this classifier is based on the concept of information entropy [39]. We used WEKA to build our classifiers. To test the algorithms, we chose the 10-fold cross-validation method.
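The entropy-based heuristic can be made concrete with a small Python sketch of the gain ratio that C4.5 uses to rank nominal attributes. This is a simplified illustration, not WEKA's J48 code; the attribute names `sql` and `java` and the toy rows are made up.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * log2(count / total)
                for count in Counter(labels).values())


def gain_ratio(rows, attribute, target):
    """Information gain of `attribute` for `target`, divided by split info."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(group) / len(rows) * entropy(group)
                    for group in groups.values())
    gain = entropy([row[target] for row in rows]) - remainder
    split_info = entropy([row[attribute] for row in rows])
    return gain / split_info if split_info > 0 else 0.0


# Toy rows (made up): here the sql flag separates the java classes perfectly.
rows = [
    {"sql": "yes", "java": "yes"},
    {"sql": "yes", "java": "yes"},
    {"sql": "no", "java": "no"},
    {"sql": "no", "java": "no"},
]
print(gain_ratio(rows, "sql", "java"))
```

The tree builder picks, at each node, the attribute with the best score, which is why a strongly associated skill ends up as the splitting node.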
In our study, we chose the J48 algorithm to construct the decision tree. The J48 implementation is widely used in research papers [40].

Results
The proposed task was to predict the skills that a job seeker has. We applied different algorithms to the data. The accuracies, ROC, and PRC values for every algorithm are presented in Table 3.
In order to study the performance of the algorithms, we also present the receiver operating characteristic (ROC) (Figure 7) and the precision recall curve (PRC) values. Davis and Goadrich [41] studied ROC and PRC. They explained that for skewed datasets the PRC values are more informative than ROC values. An optimal classifier will have ROC and PRC area values approaching 1, with 0.5 being comparable to random guessing. The Java developer class has the highest number of instances. We noticed that, for the Java developer class, J48, k-NN, and random forest have good ROC curves.
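The meaning of the ROC area can be illustrated with a minimal Python sketch that computes it as the fraction of positive-negative pairs ranked correctly, with ties counted as one half. The labels and scores below are toy values, not results from the paper.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for the positive class, 0 for the negative class.
    scores: classifier scores, higher meaning more likely positive.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # three of four pairs ordered correctly
```

A classifier that assigns the same score to every instance gets 0.5 under this definition, which matches the random-guessing reference value mentioned above.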
For the J48 algorithm, we used a pruned tree, meaning that we tried to avoid overfitting. We applied the J48 classification algorithm with binary splits on nominal attributes (binarySplits), a 0.25 confidence factor, and a 10-fold cross-validation method for testing the model. Table 4 presents the values for J48. It is important to notice that the accuracy of the decision tree applied on the features that do not include aggregations is greater by 0.02 than the accuracy of the decision tree applied on the features that include aggregations. This is the only algorithm that presented this behavior of accuracies. We also mention that the ROC value is only 0.474 and the PRC is 0.818; therefore, we concluded that the classifier models performed better on data enriched with features obtained through aggregations.
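The 10-fold cross-validation procedure can be sketched as an index split in Python. The function name is hypothetical, and the sketch uses plain contiguous folds; tools such as WEKA typically also shuffle and stratify the folds, which is omitted here. The sample size of 120 matches the number of résumés reported after querying.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) so each sample is tested once."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(k_fold_indices(120, k=10))
print(len(folds), len(folds[0][0]), len(folds[0][1]))
```

Each model is trained on the `train` indices and evaluated on the held-out `test` indices, and the reported accuracy is the aggregate over the ten test folds.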
The confusion matrix of the J48 classifier is presented in Table 5. It can be noticed that the model predicts well, so we proceeded to analyze it further. We observed that the attribute splitting the data is Years_experience_position. Moreover, the Java developer class, which has the highest number of correctly classified instances, is determined by being skilled in Java programming and SOA. Therefore, in order to determine the related skills, there is a need to analyze the data for every skill.
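How a confusion matrix relates to accuracy and to the count of correctly classified instances per class can be shown with a short Python sketch. The class labels and predictions are toy values chosen for illustration and do not reproduce Table 5.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Return (classes, matrix) with rows = actual and columns = predicted."""
    classes = sorted(set(actual) | set(predicted))
    counts = Counter(zip(actual, predicted))
    matrix = [[counts[(a, p)] for p in classes] for a in classes]
    return classes, matrix


def accuracy(matrix):
    """Share of instances on the diagonal, i.e., correctly classified."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Toy labels (made up), not the paper's data.
actual = ["java", "java", "java", "web", "web", "dba"]
predicted = ["java", "java", "web", "web", "java", "dba"]
classes, matrix = confusion_matrix(actual, predicted)
print(classes)
print(matrix)
print(accuracy(matrix))
```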
We started to build the decision trees for every skill. Table 6 presents the accuracy details for every skill classifier.
J48 predicted better than the baseline models for the rest of the skills. The J48 classifier built for Java programming skills identified that SQL programming skills is the node that splits instances. In addition, by running different tests, we found that a good predictor for SQL programming skills and for database programming skills is Java programming skills.
The Java web programming skills J48 classifier (Figure 8), identified that Java programming skills, Java persistence skills and NOSQL programming skills are the features that split the data.The NOSQL programming skills J48 classifier (Figure 9) identified that Java persistence skills, UML skills and Java web developer skills are features related to skills that split the data.The SOA programming skills J48 classifier (Figure 10) identified that Java programming skills are related to SOA.In addition, it seems that experience at job positions that are not designed for JEE developers is important in classifying instances as having SOA programming skills.The SOA programming skills J48 classifier (Figure 10) identified that Java programming skills are related to SOA.In addition, it seems that experience at job positions that are not designed for JEE developers is important in classifying instances as having SOA programming skills.The SOA programming skills J48 classifier (Figure 10) identified that Java programming skills are related to SOA.In addition, it seems that experience at job positions that are not designed for JEE developers is important in classifying instances as having SOA programming skills.These findings further support the idea of using ontologies to better describe data from people résumé data.These findings further support the idea of using ontologies to better describe data from people résumé data.

Discussion
As we mentioned in the literature review, employees' skills are of great concern to talent management. A rich body of literature has focused on the importance of competences for business, but little is known about how to identify the skills that employees have, starting from the data presented in job seekers' profiles. The exploration and development of new skills, career paths, and education levels require scientists and human resource experts to extract knowledge from multiple sources of information.
This study contributes to the existing literature by specifically analyzing how to discover links between data by using Semantic Web technologies and analytics.
Mining data from people's résumés brings to the surface relations between résumé data and employability. Our approach holds that, starting from the relation between work experience, education, and skills on one side and position held on the other, it is possible to derive links between skills. This approach uses data about skills represented in a Linked Data format.
We compared the results with the findings of previous work. The main contributions of our work, presented in relation to the results of other studies, are surveyed in Table 7.
Our current study found that the J48 classifier built for Position_held identified the following features as those that split the data: the specialization chosen for studying, UML skills, Java programming skills, the average number of years spent at every position held, and the number of years spent at the last position held. We built the classifier by using binary splits for nominal features.
The most interesting finding was that when we built the same classifier without binary splitting of nominal features, the features that split the data were: the average number of years spent at every position held, Java programming skills, SOA programming skills, the number of positions held, and the number of years spent at the last position held.
Other studies: Yes, but not for résumé data. There are many studies that describe data with RDF [42], propose tools to automatically describe data [43], or publish RDF data [44], but not résumé data.
Features derived with SPARQL by using aggregate functions. Our work: Yes. In the case study, we defined the average number of years spent at every position held, the number of positions held, and the number of years spent at the last position held. We found that they are important for Position_held and for better predicting some skills. Other studies: Partial; aggregate functions were applied, but to movie reviews [22].
Features derived from Linked Data mining. Our work: Yes. In our case study, the Java web programming skills J48 classifier (Figure 8) identified that Java programming skills, Java persistence skills, and NOSQL programming skills are the features that split the data. The NOSQL programming skills J48 classifier (Figure 9) identified that Java persistence skills, UML skills, and Java web developer skills are the skill-related features that split the data. Other studies: Partial; [11] use linear regression to predict human resource employability, but no classification machine learning algorithm to derive skills. In addition, they use linguistic analytics.
We used the Semantic Web in order to offer a better representation. This representation is useful in future dataset querying, so that enhanced information is provided when searching the dataset for people with certain skills. Mochol et al. [45] use a dictionary of synonyms, but not the Semantic Web.
Kessler et al. [46] classify the applicants with the support vector machine algorithm. We used the decision tree algorithm to find the best predictors. To predict position held, we applied random forest, classification via regression, naïve Bayes, k-NN, support vector machine, and decision tree.
Source: Our own projection.
Table 8 shows examples of the three types of features that our study analyzed, with the aim of better describing the results of the paper. In this study, all classification algorithms performed better when they were applied to the dataset previously enriched with features resulting from aggregating data, such as the years of experience at the last position, the total number of positions held, and the average years of experience in each position. The accuracies, ROC, and PRC values showed that predicting the position held is improved when using features derived by aggregations.
Furthermore, describing data in terms belonging to ontologies made it possible to derive links between skills. Starting from a compact description of skills, we queried each skill with SPARQL and labeled it with terms from ontologies for later analysis. The results showed that different skills are determined by the existence of other skills. In addition, the features derived by aggregations are predictors for some skills, so it is reasonable to infer that some skills come with experience or with involvement in diverse activities when changing jobs.
Taken together, these results suggest that linking terms and properties from diverse vocabularies helps inference on data belonging to eRecruitment websites.
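The three aggregate features named above can be illustrated with a short sketch. In the case study these features were derived with SPARQL aggregate functions; the Python below is only an equivalent computation on hypothetical records, and the record layout, the feature names, and the function `aggregate_features` are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical work-experience records: (job_seeker, start_year, end_year).
records = [
    ("seeker_1", 2010, 2013),
    ("seeker_1", 2013, 2018),
    ("seeker_2", 2015, 2017),
]


def aggregate_features(records):
    """Derive per-job-seeker aggregate features from position records."""
    by_seeker = defaultdict(list)
    for seeker, start, end in records:
        by_seeker[seeker].append((start, end))
    features = {}
    for seeker, positions in by_seeker.items():
        positions.sort()  # chronological order by start year
        durations = [end - start for start, end in positions]
        features[seeker] = {
            "number_of_positions": len(positions),
            "avg_years_per_position": sum(durations) / len(durations),
            "years_at_last_position": durations[-1],
        }
    return features


features = aggregate_features(records)
```

Each resulting dictionary can be appended to the feature vector of the corresponding job seeker before training a classifier.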
In this study, we described an approach for automatically detecting the important attributes for the process of hiring job seekers. Our approach is able to detect three types of features. The method used in our approach consists of: (1) using the Indeed résumé database, (2) describing data with concepts belonging to ontologies, (3) querying data with SPARQL, (4) defining features, and (5) mining data with analytics.
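Step (2), describing tabular résumé data as RDF, can be sketched as follows. This is a minimal illustration under stated assumptions and not the mapping engine of the case study: the namespace `http://example.org/hr#`, the property names `positionHeld` and `yearsExperience`, and the CSV columns are hypothetical, and literals are not escaped.

```python
import csv
import io

BASE = "http://example.org/hr#"  # hypothetical namespace, not a real ontology
XSD_INT = "http://www.w3.org/2001/XMLSchema#integer"

# Hypothetical CSV extract of résumé data.
CSV_TEXT = """job_seeker,position_held,years_experience
seeker_1,Java developer,5
seeker_2,Web developer,2
"""


def csv_to_ntriples(text):
    """Map each CSV row to N-Triples lines (subject, predicate, object)."""
    lines = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = f"<{BASE}{row['job_seeker']}>"
        # Plain string literal; assumes the value needs no escaping.
        lines.append(f'{subject} <{BASE}positionHeld> "{row["position_held"]}" .')
        # Typed literal so that numeric aggregates can be computed later.
        lines.append(
            f"{subject} <{BASE}yearsExperience> "
            f'"{row["years_experience"]}"^^<{XSD_INT}> .'
        )
    return lines


triples = csv_to_ntriples(CSV_TEXT)
```

Typing the years of experience as `xsd:integer` is what would allow later queries to aggregate over the values.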
The main contributions of our work are as follows. First, we proposed analytics for discovering the important features for hiring job seekers, starting from résumé data. The method operates by selecting attributes with a high information gain ratio. The attributes were previously defined with SPARQL queries. Second, an experimental analysis was conducted on Indeed's résumé data with the aim of applying the method.
This study has shown that analytics on features derived with Semantic Web technologies help identify better links between data. Furthermore, identifying which skills determine other skills at the level of an entire population also has a large impact on the data analysis.
Organizations tend to select project teams based on experience, availability, and past individual performance. One future application would be to predict the success rate of a team based on team composition and context variables.
Finally, we mention the potential impact of applying the results of our work to the graph database of the LinkedIn website, together with other personal job seeker web pages or other job portals. We believe that the granularity of skill descriptions will affect the results of the analytics.
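The gain ratio criterion mentioned above can be illustrated with a small sketch. It assumes nominal attributes and the usual C4.5-style definition (information gain divided by split information); the toy attribute values and the function names are hypothetical and are not taken from the case study.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in Counter(values).values())


def gain_ratio(attribute_values, class_labels):
    """Information gain of a nominal attribute divided by its split information."""
    total = len(class_labels)
    partitions = {}
    for value, label in zip(attribute_values, class_labels):
        partitions.setdefault(value, []).append(label)
    # Expected class entropy after splitting on the attribute.
    conditional = sum(len(p) / total * entropy(p) for p in partitions.values())
    info_gain = entropy(class_labels) - conditional
    split_info = entropy(attribute_values)
    return info_gain / split_info if split_info else 0.0


# Hypothetical toy data: one attribute separates the classes, the other does not.
position = ["Java developer", "Java developer", "Other", "Other"]
java_skill = ["yes", "yes", "no", "no"]
uml_skill = ["yes", "no", "yes", "no"]
ranking = sorted(
    {"java_skill": java_skill, "uml_skill": uml_skill}.items(),
    key=lambda item: gain_ratio(item[1], position),
    reverse=True,
)
```

Ranking attributes this way puts the attribute that best separates the target classes first, which is the sense in which an attribute is selected as a splitting node.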

Figure 1. The structure of a résumé from the Indeed résumé website.

(2) The mapping engine integrates data published using different vocabularies from the human resource ontology published by the Ontology Engineering Group. It transforms data from CSV to RDF by using terms defined as classes, sub-classes, and properties from other RDF files that represent ontologies.
(3) The résumé RDF processor labels different features of the data mining classifier model. It uses SPARQL to query data from RDF and derives the features.
(4) The classifier models use data and derive the prediction rules.
Sustainability 2019, 11, x FOR PEER REVIEW 4 of 19

Figure 3. Key classes and properties of the ontology. Source: Our own projection.

Figure 6. The learning curve. Source: Our own projection realized in Python on our dataset.

Figure 7. ROC and PRC of different algorithms for the Java-developer class (with aggregate features). Source: Our own projection realized with Weka.

Figure 8. Java web developer skills. Source: Our own projection realized with Weka.

Figure 9. NOSQL programming skills decision tree. Source: Our own projection realized with Weka.

Figure 10. SOA decision tree. Source: Our own projection realized with Weka.

Table 1. The operationalization of variables.

Table 2. Features and their corresponding SELECT clauses from SPARQL queries. Source: Our own projection.
Source: Our own analysis realized with Weka.

Table 4. Precision, recall, and F-measure for J48. Note: The features of the vector are represented by all the attributes, except for Job_Seeker. The class attribute/target is Position_held. Source: Our own projection realized with Weka.

Table 5. The confusion matrix for the Position_held J48 classifier.

Table 7. The main contributions of our work. (RDF: Resource Description Framework, k-NN: k-nearest neighbors, SPARQL: SPARQL Protocol and RDF Query Language, NOSQL: not only SQL)

Table 8. Types of feature examples.
Features derived with SPARQL queries: JobSeeker has_average_work_experience xsd:integer
Features derived from other features (skills related to data from people résumé): Java_Web_Developer_Skills is_related_to SOA_programming_skills
Attribute selection by using data mining: SOA_programming_skills isImportant in hiring Java developers
Source: Our own projection.