1. Introduction
The alignment between university education and labour-market requirements has become a pressing concern in knowledge-intensive industries. In data-domain occupations such as Data Scientist, Data Analyst, and Data Engineer, employers increasingly require specific technical and analytical competencies rather than broad academic attainment alone [1]. This challenge is becoming more acute as artificial intelligence reshapes technically intensive occupations and may contribute to more selective early-career hiring patterns [2]. Consequently, student-to-job matching requires methods that can determine whether transcript-level learning evidence corresponds to the specific competency requirements of a target role rather than relying only on broad indicators such as degree title, programme label, or cumulative GPA.
Researchers have sought to move beyond the grade point average (GPA) metric by framing graduate selection as a person–job fit problem in which semantic representations of job requirements are aligned with representations of candidate qualifications [
1]. Early methods relied on keyword overlap and bag-of-words similarity, which are computationally inexpensive. However, these methods operate purely at the surface of text and cannot resolve semantic equivalences when academic course names and job-posting skill terms belong to different terminological conventions. The development of dense sentence-embedding models—notably, Sentence-BERT [
3]—substantially improved matching quality by encoding meaning rather than vocabulary. Nonetheless, high cosine similarity between a course and a knowledge node does not guarantee structural coherence within a knowledge hierarchy, nor does it reflect how strongly that node is demanded by the specific occupation under consideration. Ontological knowledge representation offers a principled way to address this limitation: by organising domain knowledge into explicitly defined hierarchical structures, ontologies allow a course to be evaluated not only against the node it most resembles but also against the broader semantic context defined by the skill group and job category in which that node is embedded [
4]. Constructing such ontologies manually, however, is labour-intensive and difficult to sustain as labour-market requirements evolve. Although Large Language Models (LLMs) can support semi-automated ontology construction, fully automated generation remains risky because Natural Language Generation (NLG) models may produce fluent but unfaithful or unverifiable outputs; such risks can be further affected by decoding and inference choices that may increase the likelihood of hallucinated content [
5]. Therefore, semi-automated frameworks that combine LLM scalability with the semantic rigour of human-expert oversight represent a promising and increasingly active research direction.
Against this background, the present study identifies three inter-related technical problems that are frequently addressed in isolation but are rarely integrated within a single end-to-end competency-matching pipeline. The first is semantic surface ambiguity: A course may achieve high similarity relative to a knowledge node due to incidental vocabulary overlap while remaining misaligned with the underlying competency that the node represents in practice. For instance, courses such as Human Relations, Calculus 1, and Lanna Studies can exhibit strong leaf-level similarity relative to computing-related knowledge nodes, even though they do not develop computing competency, which can propagate false positives when hierarchical context is not verified. The second problem is role indiscrimination: Conventional matching approaches often treat data-domain jobs as a single category and therefore struggle to differentiate candidates across closely related roles such as Data Scientist, Data Analyst, and Data Engineer, despite meaningful differences in the competencies emphasised by these positions. The third problem is a lack of transparency: Numerical ranking scores provide limited insight into which courses and competency areas drive the final decision. Because black-box decision-support systems can be difficult for users and stakeholders to understand, insufficiently explained rankings may raise concerns about accountability and fairness in high-stakes settings such as academic advising and graduate recruitment [
6,
7].
These issues motivate a framework that combines ontology-guided structure, multi-level verification to suppress surface-level drift, job-type-conditioned scoring, and traceable evidence-based explanations.
Prior work has made important advances but leaves these problems collectively unaddressed. Studies on skill extraction from job postings [
8] and on résumé-to-job matching [
1] have demonstrated the feasibility of large-scale semantic alignment using dense representations, but these approaches operate with flat, non-hierarchical skill structures and remain vulnerable to surface ambiguity. Ontology-based competency modelling frameworks [
4] have shown that structured knowledge representations improve matching reliability, yet they typically depend on manually curated ontologies that are costly to maintain. The reliability of LLM-induced ontology structures under systematic ensemble validation and HITL oversight has not been empirically characterised, and the contributions of individual framework components to overall performance have rarely been assessed through rigorous ablation analysis. Evaluation methodologies in the literature are also inconsistent: many studies report aggregate similarity or ranking correlation metrics without determining whether a framework correctly discriminates among semantically adjacent job roles or produces outputs that practitioners can act upon [
9]. Furthermore, the use of transcript-based academic evidence as structured input to ontology-grounded competency matching has not been studied in Southeast Asian higher education contexts, where curriculum diversity across engineering, business, science, and arts faculties provides an initial cross-domain test bed for discrimination between computing-oriented and non-computing curricula, while more difficult near-boundary curricula remain to be evaluated in future work.
To address these gaps, this study proposes a semi-automated, ontology-grounded framework for student-to-job competency matching. Practically, the proposed framework provides a principled, data-driven mechanism for aligning student records with job-market requirements. Methodologically, the Path Consistency Index (PCI) and Total Accumulated Competency Score (TACS) mechanisms extend existing embedding-based approaches by incorporating multi-level ontological verification and job-side structural relevance signals, yielding more discriminative rankings than the flat-SBERT baseline evaluated in this study. With respect to explainability, black-box ranking systems can provide limited insight into how particular recommendations are produced, which is a central concern in explainable AI [
7].
The framework is also designed to be extensible across domains: the Occupational Information Network (O*NET) knowledge taxonomy [10], the LLM ensemble protocol, and the TACS and NTACS scoring formulations may be adapted to other academic systems and occupational categories, provided that an appropriate seed taxonomy, domain-specific job-posting corpus, and expert validation process are available.
In summary, we consider the following to be our main contributions:
1. We propose the Ontology Framework for Multi-level Competency Mapping (O4CM), a semi-automated, ontology-grounded framework that integrates LLM-assisted ontology induction; expert validation; multi-level structural verification via the PCI; and job-conditioned competency scoring operationalised by the raw Total Accumulated Competency Score (TACS) and its normalised ranking score, the Normalised Total Accumulated Competency Score (NTACS).
2. We evaluate the framework’s structural consistency, semantic representation design, and discriminative performance through ontology validation; semantic representation diagnostics; PCI-based group separation; systematic ablation analysis; and comparison with four reference methods spanning GPA ranking, keyword matching, flat Sentence Bidirectional Encoder Representations from Transformers (flat SBERT) matching, and knowledge-node-only ontology matching.
3. We demonstrate the framework’s capacity to produce interpretable ranking outputs through semantic traceability maps, which decompose each candidate’s competency score into course-level and skill-group-level evidence contributions.
A positioning of O4CM relative to related approaches, together with a detailed discussion of the three gaps that motivate the present work, is provided in Section 2.4.
Several limitations of the present study should be noted. First, the empirical evaluation is based on a single Thai university and a static job-posting corpus, limiting generalisation claims. Second, the current framework addresses three data-domain roles and one contrasting arts domain; broader occupational coverage remains to be explored in future work. Third, despite HITL verification, LLM-generated augmentation can still introduce semantic drift [11], and the quality of the induced ontology depends on the availability of suitable seed taxonomies and qualified domain experts. Fourth, the framework currently uses a structural relevance definition for the job importance factor rather than raw demand frequency; hybrid formulations have not yet been explored.
The remainder of this article is organised as follows.
Section 2 reviews related work on skill extraction from job postings and identifies the specific gaps that motivate the present framework.
Section 4 presents the proposed five-phase semi-automated ontology framework for competency mapping.
Section 5 reports the empirical evaluation.
Section 6 interprets the findings in relation to prior literature and discusses practical implications for educational counselling and talent acquisition.
Section 7 summarises the main contributions, acknowledges limitations, and outlines directions for future research.
4. Methodology
This section describes the methodology of the proposed semi-automated framework for multi-level semantic knowledge extraction and competency mapping. The framework integrates LLM-assisted induction, ontology engineering, HITL verification, semantic representation, and structural scoring to support traceable student-to-job matching.
4.1. Key Concepts and Notation
Four core concepts underpin this framework.
Ontology: A formal knowledge representation that defines concepts, properties, and hierarchical relationships, supporting logical inference and cross-class relations [32]. Unlike a taxonomy (simple parent–child hierarchy), an ontology enables auditable path-based reasoning. Here, knowledge domains, skill groups, and job categories form a three-level rdfs:subClassOf hierarchy.
Large Language Model (LLM): A neural model trained on large text corpora to generate structured natural language outputs [5]. An ensemble of five LLMs proposes candidate ontology mappings; all outputs are subject to HITL verification before materialisation.
Sentence-BERT (SBERT): A Siamese bi-encoder that produces fixed-length sentence embeddings for efficient cosine-similarity computation [3]. The all-MiniLM-L6-v2 variant embeds both ontology node descriptions and transcript course records.
Human in the Loop (HITL): A design pattern in which human experts review and correct machine-generated outputs at critical decision points [5]. HITL verification is the final acceptance gate for every LLM-proposed mapping before ontology materialisation, including unanimous-agreement cases.
4.2. Framework Overview
The proposed framework, denoted as the Ontology Framework for Multi-level Competency Mapping (O4CM), is organised as a five-phase pipeline. Figure 1 presents the workflow and illustrates how the output of each phase is validated and propagated to the subsequent stage.
The semi-automated design pairs LLM-driven concept induction, which is scalable but hallucination-prone [5], with HITL verification, which is intended to improve semantic appropriateness and auditability.
As illustrated in Figure 1, the five sequential phases are: atomic unit extraction (Phase 1), LLM-assisted ontology construction (Phase 2), semantic augmentation and SBERT encoding (Phase 3), PCI-based structural verification (Phase 4), and job-aware competency scoring (Phase 5).
4.3. Phase 1: Atomic Unit Extraction and Preprocessing
Phase 1 converts the heterogeneous inputs described in Section 3, namely job postings and student transcript records, into standardised atom-level records for ontology-based mapping. The preprocessing procedure consists of three main operations: duplicate records are removed to avoid inflated frequency counts, non-informative special characters and formatting artefacts are removed to reduce textual noise, and credit values in the transcript records are converted into numeric form to support weighted competency scoring in later phases.
Table 2 lists the job-posting attributes. Skill fields stored as string representations of arrays are parsed and expanded into skill atoms as displayed in Figure 2, with each atom occupying a single row for consistent counting and mapping.
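The three preprocessing operations can be illustrated with a minimal Python sketch. It is not the implementation used in the study: the field names (job_id, job_type, skills), the set of characters kept during cleaning, and the choice to deduplicate atoms within a posting are all assumptions made for illustration.

```python
import ast
import re


def clean_text(text: str) -> str:
    """Replace non-informative characters with spaces and collapse whitespace."""
    # Assumption: keep characters common in skill names such as C++, C#, Node.js.
    text = re.sub(r"[^\w\s+#./-]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def explode_skill_atoms(postings):
    """Expand stringified skill arrays into one row per skill atom."""
    seen = set()
    atoms = []
    for posting in postings:
        try:
            skills = ast.literal_eval(posting["skills"])
        except (ValueError, SyntaxError):
            continue  # field is not a parseable array literal
        if not isinstance(skills, (list, tuple)):
            continue
        for skill in skills:
            atom = clean_text(str(skill))
            key = (posting["job_id"], atom.lower())
            if atom and key not in seen:  # drop duplicates to avoid inflated counts
                seen.add(key)
                atoms.append({
                    "job_id": posting["job_id"],
                    "job_type": posting["job_type"],
                    "skill_atom": atom,
                })
    return atoms


def parse_credits(raw) -> float:
    """Convert a credit value such as '3 credits' to a number (0.0 if absent)."""
    match = re.search(r"\d+(?:\.\d+)?", str(raw))
    return float(match.group()) if match else 0.0
```

Using ast.literal_eval rather than eval keeps parsing restricted to Python literals, which matters when the skill field comes from scraped postings.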
4.4. Phase 2: Semi-Automated Bottom-Up Ontology Construction
Phase 2 constructs the ontology backbone that enables multi-level semantic mapping between job requirements and student learning evidence. Building on the atomic and standardised units produced in Phase 1, this phase follows a bottom-up strategy that begins with well-defined knowledge anchors and progressively induces higher-level groupings and relations. The objective is to reduce manual ontology engineering effort while preserving semantic clarity, structural consistency, and traceability.
In this study, ontology construction is anchored to the O*NET knowledge taxonomy [10], which provides standardised knowledge definitions suitable for occupational competency analysis. The scope is restricted to 22 O*NET knowledge domains that correspond to the selected occupational scope and support the computer–art contrast used in the evaluation. These domains form the seed layer and serve as fixed semantic anchors for the induction of skill-group structures and the assignment of these groups to job-level categories. The output of this phase is a validated three-level Web Ontology Language (OWL) class hierarchy linking knowledge domains, skill groups, and job categories through explicit subclass assertions.
4.4.1. Ontology Construction with LLM Assistance
Ontology construction is implemented as a semi-automated induction process. LLMs are used to propose candidate skill-group structures and mapping assertions under strict output constraints. These outputs are not treated as final ontology assertions. Instead, ensemble agreement is used as a preliminary consensus signal, while HITL verification serves as the final acceptance gate before any mapping is materialised into the ontology.
The workflow consists of three primary tasks:
Skill-Group Induction: The 22 O*NET knowledge domains and their definitions are provided to the LLMs to induce candidate skill groups and concise descriptions. These groups provide the intermediate layer between knowledge domains and job categories but are not treated as final ontology commitments until their associated mappings pass HITL review.
Knowledge-to-Skill Mapping: Each O*NET knowledge domain is assigned to one relevant induced skill group through definition-grounded classification. This task is the central semantic decision point because it links the fixed O*NET knowledge anchors to the induced skill-group structure and therefore receives the main ensemble-agreement and expert-validation analysis.
Skill-to-Job Category Assignment: Each validated skill group is assigned to exactly one job category (:Computer_Job or :Art_Job) using a forced-choice semantic dominance criterion. The decision uses the skill-group definition, its assigned knowledge-domain members, and the definitions of the two target job categories.
For ontology materialisation, entities across all three layers are modelled as classes rather than individual instances. Hierarchical subsumption is therefore represented using rdfs:subClassOf rather than rdf:type. This modelling choice treats the hierarchy as an analytical competency taxonomy rather than a realist claim that knowledge domains are literally skills. It also supports clear semantic separation and enables PCI-based auditing of the locked path from knowledge domains to skill groups and job categories.
Therefore, the resulting mappings should be interpreted as controlled analytical assignments designed for traceable competency mapping, not as complete representations of all possible relationships among knowledge domains, skills, and job categories.
4.4.2. Prompt Design and Constraints
Because prompt design directly influences the induced ontology structure, prompts are engineered to reduce ambiguity and enforce structured outputs suitable for ontology materialisation. To control output variability, the prompts enforce single-label classification and require decisions to be grounded in the provided definitions rather than keyword matching. For the skill-to-job category assignment, extension grounding is applied by providing the LLMs with the knowledge domains already assigned to each skill group as concrete semantic evidence for classification. Outputs must conform to a machine-readable template. For example, the knowledge-to-skill mapping is constrained to the following RDF-style triple format:
The relation expressed as :belongs_to_skill is used only as an intermediate parsing label in the LLM output. After HITL verification, accepted mappings are converted into rdfs:subClassOf axioms for ontology materialisation and downstream path auditing.
This enables deterministic parsing and conversion into ontology axioms with minimal manual reformatting. The objectives, inputs, constraints, and output schemas for the three induction tasks are described above.
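The parsing and conversion step can be sketched in Python as follows. The exact output template is not reproduced in this text, so the line shape assumed here (subject, :belongs_to_skill, object, terminating full stop) and the class names in the usage example are hypothetical; only the :belongs_to_skill label and the rdfs:subClassOf target come from the description above.

```python
import re

# Assumed line shape: ":Child :belongs_to_skill :Parent ."
TRIPLE = re.compile(r"^\s*(:\w+)\s+:belongs_to_skill\s+(:\w+)\s*\.\s*$")


def parse_llm_triple(line: str):
    """Return (child, parent) for a well-formed line, or None if malformed."""
    match = TRIPLE.match(line)
    return (match.group(1), match.group(2)) if match else None


def materialise(approved_pairs):
    """Rewrite HITL-approved pairs as rdfs:subClassOf statements."""
    return [f"{child} rdfs:subClassOf {parent} ." for child, parent in approved_pairs]
```

A line that fails to parse returns None, so malformed model outputs can be counted separately instead of being silently coerced into a mapping.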
4.4.3. LLM Ensemble and Majority Voting
To minimise model-specific bias and stochastic variation, Phase 2 employs an ensemble of five LLMs: GPT-5.4, GPT-5.3, Gemini 3 Pro, Claude Opus 4.6, and Claude Sonnet 4.6. Each model receives the identical prompt templates described in Section 4.4.2 and is executed independently for every mapping decision.
Majority voting is used to estimate preliminary cross-model consensus and to prioritise cases for expert review. Final acceptance, however, requires HITL verification for all mappings. This voting step serves as a consensus signal rather than an automatic acceptance mechanism, helping to identify low-consensus or malformed outputs that require closer inspection. This design is motivated by the known risk of model-specific errors and hallucinated outputs in LLM-generated content [5].
Each preliminary majority outcome is classified into one of three agreement tiers reflecting the degree of cross-model consensus:
Unanimous (5/5): All five models agree, indicating high observed cross-model agreement.
High Majority (4/5): Four models agree, with one dissenting label recorded for inspection.
Simple Majority (3/5): Three models agree, indicating a weaker consensus signal; all such cases are automatically escalated for mandatory HITL review.
HITL review is also triggered when two or more models produce malformed outputs and no majority can be determined. Under HITL review, a domain expert examines all five candidate labels against the relevant O*NET definitions and assigns a final expert-confirmed label with a documented justification. Only mappings that have passed HITL verification, informed by the ensemble agreement results, are converted into OWL/RDF subclass assertions. The quantitative outcomes of this protocol are reported in Section 5.3.
For clarity and reproducibility, Table 8 summarises the full ensemble configuration, the majority voting rule, and the HITL escalation criteria, together with the exception-handling logic for malformed outputs.
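The agreement tiers can be expressed as a short Python sketch. This is an illustrative reading of the voting rule, not the study's code: the tier names, the use of None for malformed outputs, and the priority_review flag are assumptions. Because every mapping still passes through HITL verification, the flag only marks cases for closer inspection and never auto-accepts a label.

```python
from collections import Counter

TIERS = {5: "unanimous", 4: "high_majority", 3: "simple_majority"}


def ensemble_vote(labels):
    """Summarise five model outputs; None marks a malformed output."""
    if len(labels) != 5:
        raise ValueError("expected exactly five model outputs")
    valid = [label for label in labels if label is not None]
    top_label, top_count = (Counter(valid).most_common(1) or [(None, 0)])[0]
    tier = TIERS.get(top_count)
    if tier is None:  # fewer than three matching labels: no majority
        return {"label": None, "tier": "no_majority", "priority_review": True}
    return {
        "label": top_label,
        "tier": tier,
        "priority_review": tier == "simple_majority",
    }
```

For example, four matching labels and one dissenting label yield the high-majority tier with the majority label as the preliminary outcome.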
4.4.4. Human-in-the-Loop Verification Protocol
Because LLM-induced structures are treated as candidate drafts rather than final ontology assertions, HITL verification is applied at the end of Phase 2. The verification confirms that the ensemble-generated mappings (Section 4.4.3) are semantically appropriate for the target domain and structurally consistent for downstream ontology-based reasoning.
Critically, unanimous consensus (5/5 agreement) among the five LLMs is not treated as a sufficient condition for accepting a label without expert inspection. Therefore, HITL verification is applied at every agreement tier, and unanimous decisions are reviewed with the same procedural rigour as ambiguous majority cases. Verification is conducted by two expert review panels comprising ten members in total, selected through purposive sampling on the basis of domain expertise, as shown in Figure 3.
The first panel consists of five HR professionals from the computer and technology sector, each with over ten years of experience in recruitment and talent management. Their primary responsibility is to validate the industry relevance of the skill-to-job assignments, ensuring the ontology reflects contemporary hiring practices and professional competency standards. During review, each member assessed whether a proposed mapping (i) correctly matched the functional meaning of the knowledge domain to the corresponding skill group, (ii) was consistent with the professional skill vocabulary used in industry job postings, and (iii) did not inflate structural similarity for courses unrelated to computing work.
The second panel comprised five senior university lecturers with over 15 years of experience in computer science education and curriculum design. They assessed the pedagogical integrity of the knowledge-to-skill mappings in terms of semantic fit, domain relevance, and hierarchical consistency, ensuring that transcript-derived evidence was interpreted appropriately within an educational context. Their review criteria included whether a mapping (i) aligned with standard curriculum taxonomy in computer science and information technology programmes, (ii) correctly reflected the learning outcomes of the assigned knowledge domain, and (iii) preserved the intended separation between computing-oriented and arts-oriented competency paths.
The verification process is summarised in Table 9. Both panels reviewed the initial LLM assertions independently. Discrepancies and possible instances of semantic drift were recorded in a central revision log. Where inter-panel disagreement remained, a joint discussion was conducted to reach a documented final decision, supporting an ontology structure that is both academically grounded and professionally relevant.
This design intentionally favours traceability and auditable hierarchy over full semantic coverage. Consequently, the resulting mappings should be interpreted as controlled analytical assignments rather than complete representations of all possible relationships among knowledge domains, skills, and job categories.
At the conclusion of Phase 2, the framework produces a HITL-validated ontology backbone linking O*NET-grounded knowledge nodes, LLM-induced skill groups with explicit definitions, and skill-to-job category assignments. This backbone supports semantic augmentation in Phase 3 and PCI-based multi-level path auditing in Phase 4.
The complete Phase 2 procedure, including ensemble-based induction, majority voting, HITL verification, and ontology materialisation, is summarised in Algorithm 2.
Algorithm 2: LLM-Ensemble Ontology Induction and HITL Validation (Phase 2)
4.5. Phase 3: Semantic Augmentation and Node Representation
Because LLM-generated labels may vary in wording across runs, Phase 3 decouples the human-readable label from the computational representation. Each ontology node is represented by a node description, while each textual evidence unit is represented by an augmented textual description. Together, these descriptions provide a controlled and contextually enriched basis for embedding-based similarity computation. The validated hierarchy from Phase 2 is then used alongside these representations for path auditing and competency ranking in later phases.
4.5.1. Semantic Augmentation Process
Semantic augmentation is applied to three types of textual units: ontology nodes, job-skill atoms, and student course records. For ontology nodes, O*NET knowledge domains are grounded in their standardised definitions, while LLM-induced skill-group nodes are represented using their HITL-validated descriptions. This separates stable external reference definitions from induced intermediate concepts.
For job-side units, each extracted skill atom is augmented in the context of the full job description to clarify its practical application and implied competency expectations. For course-side units, course titles and descriptions are expanded into competency-oriented statements that describe the knowledge and skills evidenced by each course. The augmentation prompts instruct the LLMs to consider labour-market, educational, and information-technology perspectives when expanding skills and course descriptions. This process is intended to shift similarity computation from surface-level keyword overlap toward functional meaning.
4.5.2. Embedding-Based Node Representation
After augmentation, each ontology node $n$ is represented by a node description ($d_n$), and each textual evidence unit $u$ is represented by an augmented textual description ($a_u$). These texts are encoded using SBERT [3], specifically the all-MiniLM-L6-v2 model, to produce fixed vector representations for subsequent similarity computation. Equations (1) and (2) define the embedding functions for ontology nodes and evidence units, respectively:

$$\mathbf{e}_n = f_{\mathrm{SBERT}}(d_n) \tag{1}$$

$$\mathbf{e}_u = f_{\mathrm{SBERT}}(a_u), \quad u \in U \tag{2}$$

In Equations (1) and (2), $f_{\mathrm{SBERT}}$ denotes the all-MiniLM-L6-v2 Sentence-BERT encoder, $d_n$ denotes the textual description used to represent ontology node $n$, and $a_u$ denotes the augmented textual description of an evidence unit ($u$). Set $U$ includes both job-skill atoms and student course records. For knowledge nodes, $d_n$ is derived from the corresponding O*NET knowledge definition; for skill-group nodes, it is derived from the HITL-validated LLM description; and for job-category nodes, it is derived from the job-category definition used in Phase 2. Job-skill atoms are augmented using the skill item and its job-posting context, whereas student courses are represented using course descriptions expanded into competency-oriented statements.
At the conclusion of Phase 3, the framework produces two SBERT-based representation spaces: ontology-node embeddings ($\mathbf{e}_n$) derived from node descriptions and evidence-unit embeddings ($\mathbf{e}_u$) derived from augmented textual descriptions. These representations provide the computational basis for course-to-knowledge matching and PCI-based hierarchical scoring in Phase 4 by enabling consistent cosine-similarity comparisons across ontology levels and evidence types.
Algorithm 3 summarises the complete Phase 3 procedure, including semantic augmentation using the LLM ensemble and SBERT-based encoding of ontology nodes and evidence units.
Algorithm 3: Semantic Augmentation and SBERT Encoding of Ontology Nodes and Evidence Units (Phase 3)
4.6. Phase 4: Multi-Level Structural Scoring
Matching a course only to a knowledge node is insufficient because high leaf-level similarity does not necessarily imply alignment with the corresponding skill group or job category. Therefore, Phase 4 evaluates each course along the locked ontology path from its matched knowledge node to the parent skill group and job category. The resulting score is later used in Phase 5 to down-weight matches that appear relevant at the knowledge level but lack support from higher levels of the hierarchy.
4.6.1. Hierarchical Mapping Process
Hierarchical course positioning is performed in three steps: entry-point selection, branch locking, and multi-level re-scoring.
4.6.2. Path Consistency Index
The three similarity scores obtained from the hierarchical mapping process are consolidated into a structural confidence measure called the PCI. As defined in Equation (6), the PCI is the arithmetic mean of the level-wise similarity scores:

$$\mathrm{PCI}_i = \frac{1}{3}\left(s_i^{\mathrm{K}} + s_i^{\mathrm{S}} + s_i^{\mathrm{J}}\right) \tag{6}$$

where $s_i^{\mathrm{K}}$, $s_i^{\mathrm{S}}$, and $s_i^{\mathrm{J}}$ denote the cosine similarities between course $i$ and its locked knowledge node, skill group, and job category, respectively.
A high $\mathrm{PCI}_i$ indicates that the course is not only similar to its matched knowledge node but is also supported by the broader semantic context at the skill-group and job-category levels. When $s_i^{\mathrm{K}}$ is high but $s_i^{\mathrm{S}}$ and $s_i^{\mathrm{J}}$ are low, the course appears relevant in isolation but lacks hierarchical support, which may indicate a semantic mismatch caused by ambiguous course descriptions, broad elective content, or weak alignment with the intended job path. By down-weighting such cases, the PCI provides a more structurally informed basis for competency assessment than leaf-level similarity alone.
The $\mathrm{PCI}_i$ value is propagated to Phase 5 as a continuous weighting signal in the job-conditioned scoring model so that courses with stronger hierarchical support contribute more to the final competency score. Rather than hard filtering, the framework retains courses with weak contextual support but assigns them a lower contribution than courses with consistent alignment across all three ontology levels.
At the conclusion of Phase 4, each course is assigned a locked ontology path, level-wise similarity scores ($s_i^{\mathrm{K}}$, $s_i^{\mathrm{S}}$, $s_i^{\mathrm{J}}$), and a continuous $\mathrm{PCI}_i$ value. These outputs provide the structural weighting signals used in Phase 5 to compute the final competency score.
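A minimal Python sketch of the Phase 4 steps follows, using toy two-dimensional vectors in place of SBERT embeddings. Two points are assumptions rather than details confirmed by the text: entry-point selection is taken to be the knowledge node with the highest cosine similarity, and the parent_of dictionary stands in for the validated subclass hierarchy. All node names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def lock_and_score(course_vec, knowledge_nodes, parent_of, node_vecs):
    """Select an entry node, lock its branch, and re-score at all three levels."""
    # Assumed entry-point rule: best-matching knowledge node.
    entry = max(knowledge_nodes, key=lambda k: cosine(course_vec, node_vecs[k]))
    skill_group = parent_of[entry]          # branch locking: follow subclass links
    job_category = parent_of[skill_group]
    path = (entry, skill_group, job_category)
    sims = [cosine(course_vec, node_vecs[n]) for n in path]
    return {"path": path, "sims": sims, "pci": sum(sims) / 3}


# Toy data: names and vectors are illustrative only.
node_vecs = {
    "K_Programming": [1.0, 0.0], "K_FineArts": [0.0, 1.0],
    "S_Software": [0.6, 0.8], "S_Creative": [0.0, 1.0],
    "J_Computer": [0.8, 0.6], "J_Art": [0.0, 1.0],
}
parent_of = {
    "K_Programming": "S_Software", "K_FineArts": "S_Creative",
    "S_Software": "J_Computer", "S_Creative": "J_Art",
}
result = lock_and_score([1.0, 0.0], ["K_Programming", "K_FineArts"], parent_of, node_vecs)
```

In this toy case the leaf-level similarity is perfect while the two higher levels are weaker, so the averaged score falls below the leaf score, which is the down-weighting behaviour described above.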
Algorithm 4 summarises the complete Phase 4 procedure, including ontology-path locking, level-wise cosine-similarity computation, and PCI calculation.
Algorithm 4: Ontology-Path Locking and Path Consistency Index Computation (Phase 4)
4.7. Phase 5: Data-Driven Scaling and Final Evaluation
Phase 5 integrates academic performance, structural alignment from Phase 4, and job-side structural relevance signals into a job-conditioned competency score. Because grades and credits come directly from student transcripts and the PCI is computed in Phase 4, this phase first defines a job-side importance factor, then computes the Total Accumulated Competency Score (TACS) and its normalised ranking variant.
4.7.1. Job Importance Factor
The job importance factor ($IF_{c,k}$) quantifies the job-side relevance of knowledge node k for job type c. It is derived from atomic job-skill records through a three-step procedure that combines requirement share, job-type specificity, and within-job normalisation. This design ensures that the factor reflects not only how frequently a node appears in a given job type but also how distinctively that node characterises that job type relative to others.
Step 1: Requirement mass. For each job type (c) and knowledge node (k), the requirement mass ($M_{c,k}$) accumulates the job-side path consistency scores of all atomic job-skill records (r) whose locked path passes through node k:
$$M_{c,k} = \sum_{r \in R_{c,k}} PCI_r$$
where $R_{c,k}$ is the set of job-skill records of type c mapped to knowledge node k and $PCI_r$ is the job-side path consistency score of record r. Nodes not observed in a given job type are assigned $M_{c,k} = 0$ rather than being excluded so that the full grid of job types and knowledge nodes is preserved.
Step 2: Requirement share and job-type specificity. The requirement share ($P_{c,k}$) normalises $M_{c,k}$ within job type c:
$$P_{c,k} = \frac{M_{c,k}}{\sum_{k' \in K_c} M_{c,k'}}$$
where $K_c$ is the set of knowledge nodes under the job category associated with job type c.
The job-type-specific emphasis ($E_{c,k}$) measures how much node k is over-represented in job type c relative to its mean requirement share ($\bar{P}_k$) across all job types (C), using a small smoothing constant ($\epsilon$) that prevents division by zero. A node with $P_{c,k} > \bar{P}_k$ receives a positive specificity score; a node that appears equally across all job types receives $E_{c,k} = 0$.
Step 3: Normalised importance factor. The importance factor is the product of requirement share and specificity, normalised within each job type so that the highest-weighted node receives a value of one:
$$IF_{c,k} = \frac{P_{c,k} \cdot E_{c,k}}{\max_{k' \in K_c} \left( P_{c,k'} \cdot E_{c,k'} \right)} \tag{11}$$
Therefore, a high $IF_{c,k}$ indicates that knowledge node k is not only frequently observed in job type c (high $P_{c,k}$) but also specifically emphasised by that job type relative to others (high $E_{c,k}$). The factor is a dataset-conditioned structural relevance signal derived from observed job-posting evidence and should not be interpreted as a claim about general labour-market demand.
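The three-step procedure can be sketched in Python as follows. This is a minimal illustration rather than the study's implementation: the function name, the input layout, the placeholder node names, and the example values are hypothetical. In particular, the specificity term is written here as a smoothed log-ratio clipped at zero, which is only one form consistent with the description above (positive when a node's share exceeds its mean share, zero when the shares are equal) and should be read as an assumption.

```python
import math
from collections import defaultdict


def importance_factors(records, nodes_by_job, eps=1e-9):
    """Sketch of the three-step job importance factor.

    records: iterable of (job_type, node, job_side_pci) tuples.
    nodes_by_job: dict mapping each job type to its candidate knowledge nodes.
    Returns a dict mapping (job_type, node) to a factor in [0, 1].
    """
    job_types = list(nodes_by_job)

    # Step 1: requirement mass; unobserved (job type, node) pairs stay at 0.
    mass = defaultdict(float)
    for job, node, pci in records:
        mass[(job, node)] += pci

    # Step 2a: requirement share, normalised within each job type.
    share = {}
    for job in job_types:
        total = sum(mass[(job, k)] for k in nodes_by_job[job])
        for k in nodes_by_job[job]:
            share[(job, k)] = mass[(job, k)] / total if total > 0 else 0.0

    # Step 2b: mean share of each node across all job types.
    all_nodes = {k for nodes in nodes_by_job.values() for k in nodes}
    mean_share = {
        k: sum(share.get((job, k), 0.0) for job in job_types) / len(job_types)
        for k in all_nodes
    }

    # ASSUMPTION: specificity is a smoothed log-ratio clipped at zero.
    specificity = {
        (job, k): max(0.0, math.log((s + eps) / (mean_share[k] + eps)))
        for (job, k), s in share.items()
    }

    # Step 3: share x specificity, scaled so the top node per job type is 1.
    factors = {}
    for job in job_types:
        raw = {k: share[(job, k)] * specificity[(job, k)] for k in nodes_by_job[job]}
        top = max(raw.values(), default=0.0)
        for k, value in raw.items():
            factors[(job, k)] = value / top if top > 0 else 0.0
    return factors


# Hypothetical job-skill records: (job type, knowledge node, job-side PCI).
records = [
    ("Data Scientist", "node_a", 0.9),
    ("Data Scientist", "node_a", 0.8),
    ("Data Scientist", "node_b", 0.7),
    ("Data Analyst", "node_b", 0.9),
    ("Data Analyst", "node_c", 0.6),
]
nodes = ["node_a", "node_b", "node_c"]
example = importance_factors(
    records, {"Data Scientist": nodes, "Data Analyst": nodes}
)
```

In this toy input, `node_b` appears in both job types, so it receives a lower factor for Data Analyst than `node_c`, which is observed only there. This mirrors the intent that the factor rewards distinctiveness as well as frequency.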
4.7.2. TACS Computation
For a target job type (c), each transcript record (i) is mapped to a knowledge node ($k_i$) and assigned a structural score ($PCI_i$) from Phase 4. Equation (12) defines $TACS_c$ as the weighted accumulation of job-conditioned competency evidence:
$$TACS_c = \sum_{i=1}^{n} G_i \cdot Cr_i \cdot PCI_i \cdot IF_{c,k_i} \tag{12}$$
where $G_i$ is the numeric grade of transcript record i, $Cr_i$ is its credit weight, $PCI_i$ is the student-side structural alignment score from Phase 4, $IF_{c,k_i}$ is the job-side structural relevance factor defined in Section 4.7.1, and n is the number of transcript records included in the computation. This formulation captures accumulated job-conditioned competency evidence prior to normalising for differences in transcript length and credit volume.
4.7.3. Normalised TACS and Candidate Ranking
Because students differ in transcript length and credit volume, the raw $TACS_c$ is normalised by the total relevance weight associated with the transcript records. Equation (13) defines the normalised score ($NTACS_c$) and its record-level relevance weight ($w_{i,c}$):
$$NTACS_c = \frac{TACS_c}{\sum_{i=1}^{n} w_{i,c}}, \qquad w_{i,c} = Cr_i \cdot PCI_i \cdot IF_{c,k_i} \tag{13}$$
where $w_{i,c}$ is the job-conditioned relevance weight of transcript record i that combines learning intensity ($Cr_i$), student-side structural validity ($PCI_i$), and job-side structural relevance ($IF_{c,k_i}$). Unlike $TACS_c$, which represents raw accumulated evidence, $NTACS_c$ supports fairer comparison across students by normalising against the total relevance weight. Candidate ranking is based on $NTACS_c$, while $TACS_c$ is retained as the un-normalised accumulated score.
To complement this grade-sensitive ranking score, the framework also computes a relevant volume measure that captures the breadth of structurally supported coursework independently of grades. Equation (14) defines this volume:
$$V_{rel} = \sum_{i=1}^{n} Cr_i \cdot PCI_i \tag{14}$$
where $V_{rel}$ measures the total credit volume supported by the locked ontology path, regardless of grade performance or job-side relevance. It is used only as a supplementary structural volume indicator.
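The scoring in Equations (12) to (14) can be illustrated with a short Python sketch. It assumes a multiplicative reading of the definitions above (grade × credit × PCI × IF for the accumulated score, and credit × PCI × IF for the record-level relevance weight). The `Record` class, its field names, and the example values are hypothetical and are not taken from the study's implementation.

```python
from dataclasses import dataclass


@dataclass
class Record:
    grade: float    # numeric grade of the transcript record
    credits: float  # credit weight
    pci: float      # student-side structural alignment score (Phase 4)
    node: str       # mapped knowledge node


def tacs_scores(records, importance):
    """Return (TACS, NTACS, relevant volume) for one student and one job type.

    importance: dict mapping knowledge node -> job importance factor for the
    target job type; nodes missing from the dict are treated as 0 (assumption).
    """
    weights = [r.credits * r.pci * importance.get(r.node, 0.0) for r in records]
    tacs = sum(r.grade * w for r, w in zip(records, weights))
    total_weight = sum(weights)
    # Normalising by the total relevance weight turns the accumulated score
    # into a relevance-weighted mean grade.
    ntacs = tacs / total_weight if total_weight > 0 else 0.0
    # Relevant volume ignores both grades and job-side relevance.
    volume = sum(r.credits * r.pci for r in records)
    return tacs, ntacs, volume


# Hypothetical transcript and importance factors.
transcript = [
    Record(grade=4.0, credits=3, pci=0.8, node="node_a"),
    Record(grade=2.0, credits=3, pci=0.5, node="node_b"),
    Record(grade=3.0, credits=2, pci=0.9, node="node_c"),
]
tacs, ntacs, volume = tacs_scores(transcript, {"node_a": 1.0, "node_b": 0.5})
```

Under these assumptions, the third record contributes to the relevant volume but not to TACS or NTACS, because its node carries no job-side weight for the target job type.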
4.7.4. Semantic Traceability and Explainability
To make ranking outcomes interpretable, the framework records a course-level evidence-relevance signal that combines student-side structural alignment with job-side structural relevance. Equation (15) defines this signal ($ER_{i,c}$):
$$ER_{i,c} = PCI_i \cdot IF_{c,k_i} \tag{15}$$
where $ER_{i,c}$ represents the relevance of transcript record i to the target job type before grade and credit weighting. Aggregating $ER_{i,c}$, optionally weighted by course credits, by knowledge node or skill group produces a semantic traceability map that explains which parts of a student’s transcript provide the strongest semantic evidence for a target job type.
At the conclusion of Phase 5, the framework produces job-conditioned rankings, normalised competency scores, relevant volume measures, and course-level traceability signals for each target job type.
Algorithm 5 summarises the complete Phase 5 procedure, including job-type importance factor computation, TACS and NTACS calculation, candidate ranking, and semantic traceability signal generation.
A complete worked example tracing Student A through all five phases of the O4CM framework is provided in Appendix A.
| Algorithm 5: Job-Conditioned Competency Scoring and Candidate Ranking (Phase 5) |
5. Results
The five-phase O4CM framework described in Section 4 is evaluated along three dimensions: the reliability of the induced ontology backbone, the discriminative value of semantic representation and PCI-based structural scoring, and the ranking behaviour and interpretability of the final student-to-job matching output.
The results are organised around the three contributions stated in the Introduction. Section 5.3 supports the first contribution by examining whether the LLM-assisted and HITL-verified process produces a coherent ontology backbone for multi-level competency mapping. Section 5.4 and Section 5.5 support the second contribution by evaluating the design of the semantic representation and the discriminative value of PCI-based structural scoring. Section 5.6 and Section 5.7 support the second and third contributions by analysing component-level effects, job-conditioned ranking behaviour, and the traceability of ranking outputs to course-level evidence.
5.1. Dataset and Experimental Setup
The evaluation draws on the datasets described in Section 3. The job-posting corpora provide job-side evidence, while anonymised student transcript records provide academic learning evidence. All experiments use the same HITL-validated ontology backbone from Phase 2, the same semantic augmentation and SBERT representation process from Phase 3, and the same PCI-based structural scoring procedure from Phase 4.
The full transcript dataset contains 430 students and is used to compute course-level ontology mappings and PCI values. For group-level structural validation, a subset of 244 students is used, comprising 185 computing students and 59 Visual Arts students. This subset provides the clearest contrast between computing-oriented and arts-oriented curricula for evaluation of ontology-path separation and is not used for parameter tuning. Ablation and ranking analyses are conducted on a stratified 94-student subsample, comprising 73 computing students and 21 Visual Arts students, to support controlled comparison across methods and framework variants. All subsets are drawn from the same population described in Table 7.
The evaluation reports four complementary types of evidence. The first concerns ontology induction and expert validation. The second examines whether semantic representations provide a richer basis for matching than surface labels. The third evaluates whether PCI-based scoring separates contrasting academic groups under different ontology paths. The fourth uses comparison methods, ablation variants, ranking measures, and traceability evidence to analyse the behaviour of the final framework.
5.2. Comparison Methods and Ablation Variants
This subsection defines two types of experimental references. The comparison methods represent progressively stronger alternatives for student-to-job ranking, from grade-only ranking to the full proposed framework. The ablation variants remove individual components from the framework to test whether each component contributes to the final ranking behaviour.
Comparison Methods
Five comparison methods are evaluated. These methods are not presented as full competing systems from prior work. Rather, they serve as controlled reference points that reflect increasing levels of ranking and matching complexity, moving from grade-only ranking and lexical matching to embedding-based semantic similarity and the full proposed framework.
Table 10 summarises the five methods and highlights which components are enabled in each variant, including semantic encoding, usage of an ontology structure, multi-level PCI, job-side IF weighting, and the final ranking basis.
M1 (GPA ranking). Candidates are ranked by cumulative GPA without considering the semantic content or job relevance of individual courses. This method represents a conventional grade-based reference strategy where academic performance is used as an aggregate indicator of student achievement. However, GPA alone may obscure differences in learning patterns and competency relevance that are important for job-specific evaluation [33].
M2 (keyword matching). Competency alignment is estimated using normalised string overlap between course names and job-skill terms. This method represents a surface-level lexical matching strategy for job-requirement analysis. Although keyword-based search can identify explicit skill terms in job texts, it relies on predefined keyword lists and must be manually updated when new requirements appear [17]. Thus, M2 remains limited when courses and job postings express the same competency using different wording.
M3 (flat SBERT). Course and job-skill representations are encoded using the same SBERT model as the proposed framework [3]. This method uses cosine similarity between sentence embeddings as a flat semantic matching strategy, without ontology-based branch locking or multi-level PCI scoring. Prior resume-screening research shows that SBERT can rank candidate profiles against job descriptions more effectively than keyword-based matching by capturing contextual semantic similarity [34]. However, this type of matching remains structurally flat: it compares textual representations directly but does not verify whether the matched competency is coherent across knowledge, skill-group, and job-category levels. Therefore, M3 isolates the contribution of semantic encoding before adding the ontology-based verification mechanism proposed in O4CM.
M4 (leaf-level only + IF). This method uses the top-matched knowledge node at the leaf level and applies the job-side structural relevance factor (IF) defined in Equation (11), but it does not use the full three-level PCI score. Each course is scored using the leaf-level similarity score defined in Equation (3) rather than the full locked-path score. This framework-derived comparator tests whether leaf-level matching is sufficient when the job-side relevance signal is retained.
M5 (proposed framework). The proposed framework combines augmented semantic representations, ontology-based branch locking, PCI-based multi-level structural scoring, and job-side IF weighting. Candidate ranking is based on NTACS as defined in Equation (13).
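As a rough illustration of the two text-only reference points (M2 and M3), the sketch below shows a token-overlap score and a cosine similarity. Jaccard overlap on lower-cased tokens is an assumed stand-in for the normalised string overlap described above, and the short numeric vectors stand in for SBERT sentence embeddings; neither the function names nor the example strings come from the study.

```python
import math


def keyword_overlap(course_name, skill_term):
    """Jaccard overlap of lower-cased word tokens (assumed form of M2)."""
    a = set(course_name.lower().split())
    b = set(skill_term.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (flat matching, M3)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm > 0 else 0.0


# Shared wording is rewarded; different wording for a related competency is not.
same_words = keyword_overlap("Machine Learning", "machine learning engineer")
different_words = keyword_overlap("Statistical Inference", "data analysis")
```

The second call returns zero even though the two phrases describe related competencies, which is the limitation of surface matching that motivates the embedding-based and ontology-based methods.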
5.3. Ontology Induction and Expert Validation
This subsection reports the Phase 2 ontology induction results. The analysis focuses on the degree of cross-model agreement during LLM-assisted induction and the role of HITL validation in converting candidate mappings into an expert-confirmed ontology backbone.
The induction process covers three tasks: skill-group induction, Knowledge Domain-to-Skill Group mapping, and Skill Group-to-Job Category assignment.
Figure 4 presents the validated ontology backbone used in the subsequent semantic augmentation and PCI-based structural scoring phases. The validated skill-group layer consists of seven skill groups: Cognitive Skills, Technical Skills, Creative Skills, Communication Skills, Social and Interpersonal Skills, Psychomotor Skills, and Affective Skills.
To summarise the materialised ontology backbone, Table 11 reports the number of nodes and subclass links retained after HITL verification. The table reports the analytical backbone used for scoring, not all auxiliary OWL entities contained in the ontology file.
5.3.1. LLM Ensemble Agreement
The five-model LLM ensemble produced candidate knowledge-to-skill mappings for the 22 O*NET knowledge domains. Agreement was measured as the fraction of models that assigned the same skill-group label to each knowledge node. Table 12 summarises the observed agreement distribution.
The ensemble produced high observed agreement, with 21 of 22 knowledge nodes reaching the high-consensus threshold. This corresponds to approximately 95.5% (21/22) of knowledge domains achieving high-consensus agreement, which is the figure reported in the Abstract. This result indicates that the prompt constraints produced consistent outputs for most mappings. However, agreement is treated only as a preliminary signal. As shown by the expert review cases below, even unanimous agreement may still require correction when the assignment is not functionally aligned with the target domain.
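The agreement measure can be sketched as a majority-vote fraction over the labels returned by the models. The function name and the example vote splits below are illustrative assumptions; the per-node votes themselves are reported in Table 12, not here.

```python
from collections import Counter


def agreement(labels):
    """Return (majority label, fraction of models that chose it)."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


# Hypothetical votes from a five-model ensemble for two knowledge nodes.
unanimous = agreement(["Cognitive_Skills"] * 5)
split = agreement(
    ["Social_Interpersonal_Skills"] * 3 + ["Cognitive_Skills"] * 2
)

# Share of nodes at high consensus when 21 of 22 nodes qualify.
high_consensus_pct = round(21 / 22 * 100, 1)
```

A unanimous vote yields a fraction of 1.0, yet the expert review can still overturn such a label, which is why the fraction is used only to prioritise review effort.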
5.3.2. HITL Validation Outcomes
Table 13 summarises how the LLM-generated candidate mappings were handled during HITL validation. The purpose of expert validation was not to report model accuracy against an external gold standard but to determine whether each candidate mapping was acceptable for ontology materialisation.
The validation results show that LLM agreement is useful for prioritising review effort but is not sufficient for final ontology acceptance. Therefore, all materialised mappings were confirmed through expert review before being used in Phase 3 and Phase 4.
5.3.3. Representative Expert Correction Cases
Three representative cases illustrate why HITL review is necessary, even when LLM outputs appear plausible.
- Case 1: Sociology and Anthropology. This node produced the lowest agreement among the 22 knowledge domains. Most models associated it with Social_Interpersonal_Skills due to the surface association with the term “social”, while other models favoured Cognitive_Skills. The expert panels assigned it to Cognitive_Skills because, in computer-job contexts such as data science and UX research, the domain functions primarily as an analytical framework for behavioural analysis and user modelling.
- Case 2: Administration and Management. All five models assigned this domain to Cognitive_Skills, interpreting management as abstract decision-making. The expert panels reassigned it to Technical_Skills because, in technology-sector practice, this domain is often expressed through procedural and tool-mediated competencies such as project management methods, Agile practice, Scrum, and IT operations frameworks. This case demonstrates that unanimous LLM agreement does not guarantee domain-correct placement.
- Case 3: Philosophy and Theology. This domain was judged to occupy a boundary between reasoning, ethics, and value-oriented judgement. The expert panels placed it under Affective_Skills and assigned it to the Art Job category. This placement prevents ethics-oriented or value-oriented transcript evidence from inflating similarity to computer-job paths during structural scoring.
At the conclusion of this validation stage, the framework produces an expert-confirmed ontology backbone that supports semantic augmentation in Phase 3 and PCI-based structural scoring in Phase 4.
5.4. Semantic Representation Diagnostics
This subsection provides a diagnostic comparison of the text sources used for embedding-based matching. The purpose is not to report a separate classification benchmark but to clarify why the proposed framework uses augmented evidence descriptions and ontology-node definitions rather than surface labels alone.
As shown in Table 14, surface-label matching is vulnerable to lexical sparsity because course titles and skill names often omit relevant competencies. Representing evidence units with augmented descriptions and ontology nodes with definitions provides richer semantic context on both sides of the comparison. Table 15 illustrates this effect using representative examples from the evidence and ontology layers.
These examples are used as diagnostic illustrations rather than independent performance evidence. Together, the configuration comparison and illustrative examples support the use of description-to-definition matching in the proposed framework.
5.5. PCI-Based Group Separation
This subsection evaluates whether the PCI-based structural score separates computing-oriented and arts-oriented curricula under the two ontology root paths. The analysis uses student-level credit-weighted scores and compares the computing group with the Visual Arts control group. This evaluation assesses group-level structural separation rather than course-level classification accuracy.
For each student (s), the credit-weighted PCI score aggregates course-level $PCI_i$ values using course credits as weights. Equation (16) defines the student-level score used in the group comparison:
$$\overline{PCI}_s = \frac{\sum_{i \in T_s} Cr_i \cdot PCI_i}{\sum_{i \in T_s} Cr_i} \tag{16}$$
where $T_s$ is the set of courses taken by student s and $Cr_i$ is the credit value of course i. The Mann–Whitney U test is used to compare student-level PCI distributions between the computing and Visual Arts groups because it is a rank-based test for evaluating whether one independent sample tends to produce larger values than another [35].
The positive group consists of 185 students from the four computing programmes, while the control group consists of 59 students from the B.F.A. Visual Arts programme.
Table 16 reports the group means, separation ratio, effect size, and significance level for each ontology path. To facilitate interpretation, we report the Separation Ratio (SR), defined as the ratio of the mean score of the computing group to the mean score of the Visual Arts group. Under the Computer Job path, SR > 1 indicates the expected direction, whereas under the Art Job path, SR < 1 indicates the expected direction.
Cohen’s d is reported as a descriptive effect-size measure to support interpretation of group-separation magnitude.
Both comparisons are significant, with large effects. Computing students achieve higher scores on the Computer Job path, while Visual Arts students achieve higher scores on the Art Job path. This bidirectional pattern supports the structural validity of the ontology backbone and indicates that PCI-based scoring captures meaningful curriculum-level differences between the two groups.
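The group-level quantities used in this comparison can be sketched with the Python standard library. The snippet assumes a credit-weighted mean for the student-level score and the pooled-standard-deviation form of Cohen’s d, which the text does not specify; it computes only the Mann–Whitney U statistic by pairwise counting and leaves the p-value to a statistics package. All function names and example scores are hypothetical.

```python
import math
from statistics import mean, variance


def credit_weighted_pci(courses):
    """courses: list of (credits, pci) pairs for one student."""
    total = sum(cr for cr, _ in courses)
    return sum(cr * p for cr, p in courses) / total if total > 0 else 0.0


def separation_ratio(computing, arts):
    """Mean computing-group score divided by mean Visual Arts score."""
    return mean(computing) / mean(arts)


def cohens_d(a, b):
    """Cohen's d with a pooled standard deviation (assumed variant)."""
    na, nb = len(a), len(b)
    pooled = math.sqrt(
        ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    )
    return (mean(a) - mean(b)) / pooled


def mann_whitney_u(a, b):
    """U statistic for sample a: pairs where a exceeds b, ties counted as 0.5."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)


# Hypothetical student-level scores under the Computer Job path.
computing_scores = [0.8, 0.7, 0.9]
arts_scores = [0.4, 0.5, 0.3]
sr = separation_ratio(computing_scores, arts_scores)
d = cohens_d(computing_scores, arts_scores)
u = mann_whitney_u(computing_scores, arts_scores)
```

In this toy case every computing score exceeds every Visual Arts score, so U equals the number of cross-group pairs and SR is above one, which is the expected direction for the Computer Job path.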
5.6. Ablation Study
The ablation study examines whether the major components of O4CM contribute to discriminative behaviour. Three components are tested: semantic augmentation, PCI-based multi-level structural scoring, and job-side weighting. Four variants are evaluated by removing one component at a time while keeping the remaining components fixed:
- V1 (Full Framework): Augmented representations, the three-level PCI score, and the job-side IF are all retained.
- V2 (Without Augmentation): Raw course names are matched against raw node labels, while PCI and IF are retained.
- V3 (Without PCI): Augmented representations are retained, but the three-level PCI score is replaced by the leaf-level similarity score.
- V4 (Without IF): Augmented representations and PCI are retained, but IF is set to 1 for all knowledge nodes.
Two complementary metrics are used to quantify the impact of each ablation on group separation. The first is the IF-weighted relevant credit volume, which captures grade-free structural alignment weighted by job-side relevance. Equation (17) defines $RCV_c$ as a credit-normalised accumulation of path-consistent, job-relevant coursework:
$$RCV_c = \frac{\sum_{i=1}^{n} Cr_i \cdot PCI_i \cdot IF_{c,k_i}}{\sum_{i=1}^{n} Cr_i} \tag{17}$$
where $Cr_i$ is the credit value of course record i, $PCI_i$ is the course-level path consistency score from Phase 4, $IF_{c,k_i}$ is the job-side importance factor for the mapped knowledge node ($k_i$), and n is the number of transcript records.
The second metric is the grade-weighted competency score, which extends the previous measure by incorporating academic performance. Equation (18) defines $GCS_c$ by additionally weighting each record by its numeric grade ($G_i$):
$$GCS_c = \frac{\sum_{i=1}^{n} G_i \cdot Cr_i \cdot PCI_i \cdot IF_{c,k_i}}{\sum_{i=1}^{n} Cr_i} \tag{18}$$
where $G_i$ denotes the numeric grade for course record i. Together, $RCV_c$ and $GCS_c$ separate the contributions of structure and market relevance (RCV) from the additional influence of academic performance (GCS), enabling a more interpretable ablation analysis.
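The two ablation metrics, together with the V3 and V4 switches, can be sketched as follows. The snippet assumes that both metrics divide by the student's total credits (one reading of "credit-normalised"), that V3 substitutes a leaf-level similarity for the PCI score, and that V4 sets the importance factor to 1. The record keys, function name, and example values are hypothetical.

```python
def rcv_gcs(records, importance, use_pci=True, use_if=True):
    """Return (RCV, GCS) for one student and one target job type.

    records: list of dicts with keys credits, grade, pci, leaf_sim, node.
    importance: dict mapping knowledge node -> job importance factor.
    use_pci=False mimics V3 (leaf-level similarity replaces PCI).
    use_if=False mimics V4 (importance factor fixed at 1).
    """
    total_credits = sum(r["credits"] for r in records)
    if total_credits == 0:
        return 0.0, 0.0
    rcv = gcs = 0.0
    for r in records:
        structural = r["pci"] if use_pci else r["leaf_sim"]
        relevance = importance.get(r["node"], 0.0) if use_if else 1.0
        weight = r["credits"] * structural * relevance
        rcv += weight               # grade-free accumulation
        gcs += r["grade"] * weight  # grade-weighted accumulation
    return rcv / total_credits, gcs / total_credits


# Hypothetical two-course transcript.
courses = [
    {"credits": 3, "grade": 4.0, "pci": 0.8, "leaf_sim": 0.9, "node": "node_a"},
    {"credits": 1, "grade": 2.0, "pci": 0.4, "leaf_sim": 0.6, "node": "node_b"},
]
weights = {"node_a": 1.0, "node_b": 0.0}
full = rcv_gcs(courses, weights)                       # V1-style scoring
without_pci = rcv_gcs(courses, weights, use_pci=False) # V3-style scoring
without_if = rcv_gcs(courses, weights, use_if=False)   # V4-style scoring
```

With the importance factor disabled, the second course starts contributing even though its node carries no job-side weight, which illustrates how removing job-side relevance can blur the separation between groups.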
For each variant and metric, SR is computed as in Section 5.5, using the ratio of mean scores between the computing group and the Visual Arts group. The expected direction is SR > 1 for the Computer Job path and SR < 1 for the Art Job path.
Table 17 shows that no single component maximises all outcomes. V2 produces higher Computer Job separation than the full framework, but it weakens the Art Job path, suggesting that raw labels may increase single-domain separation while reducing negative-control cross-domain discrimination. Therefore, semantic augmentation is not used solely to maximise separation on one path; rather, it helps reduce over-specialised surface matching and supports more stable semantic ranking behaviour across the two contrasting ontology paths evaluated in this study.
The remaining ablations show complementary component effects. Removing IF in V4 substantially weakens Computer Job separation, indicating the importance of job-side relevance weighting for data-domain ranking. Removing PCI in V3 mainly affects the Art Job path, where the SR moves closer to the neutral value, indicating that multi-level structural scoring is important in limiting cross-domain contamination. Overall, the full framework provides the most balanced behaviour across both ontology paths, even though it does not achieve the largest value on every individual metric.
5.7. Ranking Performance and Semantic Traceability
This subsection evaluates the final ranking behaviour of the proposed framework against the comparison methods defined in Section 5.2. The analysis considers group-level separation, retrieval performance, sub-type sensitivity, and semantic traceability.
5.7.1. Group Separation Across Methods
SR quantifies how strongly each method separates the target group from the contrast group. For data-domain job types, SR is computed as the mean score of computing students divided by the mean score of Visual Arts students. For the Visual Art job type, the ratio is reversed so that an SR above one still indicates the expected direction. Cohen’s d is reported as a descriptive effect-size measure comparing the target and contrast groups under each method and job type.
Table 18 reports the SR and Cohen’s d for all five methods across the four job types.
Table 18 shows that GPA ranking fails data-domain directionality because Visual Arts students have the highest mean GPA in the sample. Keyword matching and flat SBERT achieve correct broad-domain separation in several cases, but they do not distinguish among Data Scientist, Data Analyst, and Data Engineer requirements. The proposed framework is not always the highest-scoring method on every individual metric, but it is the only method that jointly satisfies the four operational criteria used in this evaluation: domain discrimination, sub-type sensitivity, negative-control cross-domain discrimination, and ontology-based interpretability.
5.7.2. Retrieval Performance and Ranking Consistency
Recall@K and Hit Rate@K evaluate whether relevant candidates appear near the top of the ranked list. Recall@K measures the proportion of target-group students included in the top K, while Hit Rate@K measures the percentage of top-K positions occupied by the target group. Table 19 reports the results for each evaluated value of K.
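The two retrieval measures, as defined above, can be sketched in a few lines of Python. The function name, student identifiers, and ranking are hypothetical.

```python
def recall_and_hit_rate(ranked_ids, target_ids, k):
    """Return (Recall@K, Hit Rate@K) for a ranked list of student IDs.

    Recall@K: share of all target-group students that appear in the top K.
    Hit Rate@K: share of the top-K positions held by target-group students.
    """
    top = ranked_ids[:k]
    hits = sum(1 for student in top if student in target_ids)
    recall = hits / len(target_ids) if target_ids else 0.0
    hit_rate = hits / len(top) if top else 0.0
    return recall, hit_rate


# Hypothetical ranking of five students, three of whom are in the target group.
ranking = ["s1", "s2", "s3", "s4", "s5"]
targets = {"s1", "s2", "s4"}
at_2 = recall_and_hit_rate(ranking, targets, k=2)
at_4 = recall_and_hit_rate(ranking, targets, k=4)
```

Because Recall@K divides by the size of the target group, it cannot exceed K divided by that size when the group is larger than K, even if every top-K position is a hit.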
Several semantic methods reach the maximum possible Recall@K for data-domain roles because all top-K positions are occupied by computing students. In this setting, Recall@K is bounded by the number of computing students in the 94-student ranking subsample rather than by the top-K list alone. Thus, the main contrast is not data-domain retrieval but the Visual Art path, where leaf-level-only matching (M4) performs poorly while the proposed framework remains robust. This supports the role of multi-level structural scoring in reducing cross-domain contamination.
5.7.3. Sub-Type Sensitivity and Overall Criteria
Sub-type sensitivity is evaluated in terms of whether a method produces distinct ranking behaviour across Data Scientist, Data Analyst, and Data Engineer roles. Methods without job-side weighting tend to produce identical or near-identical patterns across the three data-domain roles, whereas methods using IF preserve role-specific differences derived from the job-side structural relevance analysis.
Spearman’s $\rho$ is used as a supplementary summary measure of pairwise ranking agreement between methods. Across the data-domain job types, the rankings produced by M4 (leaf-level only + IF) and M5 (proposed framework) show consistently high agreement. This indicates that M5 preserves much of the semantic ranking structure produced by M4 while adding PCI-based structural safeguards. In contrast, M1 (GPA ranking) shows weak or negative correlation with the semantic methods, suggesting that grade-only ranking produces substantially different candidate orderings.
Table 20 summarises method performance across four criteria. Domain discrimination is met when the expected SR direction is observed across job types. Sub-type sensitivity requires distinct scores across the three data-domain roles. Cross-domain robustness is assessed on the Visual Art path, and ontology-based interpretability is assessed in terms of whether scores can be traced to ontology nodes and paths.
The proposed framework is the only method satisfying all four criteria in this evaluation. This does not mean that it maximises every individual metric. Rather, its advantage is that it combines acceptable domain discrimination, role-specific sensitivity, robustness on the cross-domain Visual Art path, and traceable ontology-based interpretation within a single scoring framework.
5.7.4. Semantic Traceability Analysis
Using the course-level evidence-relevance signal ($ER_{i,c}$) defined in Equation (15), the framework traces each ranking outcome back to course-level semantic evidence. Because $ER_{i,c}$ is computed before grade and credit weighting, it highlights the structural and job-side relevance of each course independently of the student’s achieved grade.
For each student (s) and target job type (c), course-level evidence relevance is aggregated by skill group to form a competency profile. Equation (19) defines the skill-group profile value ($\Pi_{s,c}(g)$) as the credit-weighted mean evidence relevance within skill group g:
$$\Pi_{s,c}(g) = \frac{\sum_{i \in T_s:\, g(i) = g} Cr_i \cdot ER_{i,c}}{\sum_{i \in T_s:\, g(i) = g} Cr_i} \tag{19}$$
where $T_s$ is the set of courses taken by student s, $Cr_i$ is the credit value of course i, $g(i)$ denotes the skill group assigned to course i through the branch-locking step, and $ER_{i,c}$ is the course-level evidence-relevance signal defined in Equation (15). Therefore, the profile explains which ontology areas provide the strongest semantic evidence for the target job type before grade weighting.
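A skill-group profile of this kind can be sketched as a grouped, credit-weighted mean. The snippet assumes that the course-level evidence relevance is the product of the student-side PCI score and the job-side importance factor, as described for Equation (15). The function name, tuple layout, and example values are hypothetical, while the skill-group labels follow the ontology described in Section 5.3.

```python
from collections import defaultdict


def skill_group_profile(courses):
    """Credit-weighted mean evidence relevance per skill group.

    courses: list of (skill_group, credits, pci, importance) tuples, where
    pci * importance is taken as the course-level evidence relevance.
    """
    weighted = defaultdict(float)
    credits = defaultdict(float)
    for group, cr, pci, importance in courses:
        weighted[group] += cr * pci * importance
        credits[group] += cr
    return {g: weighted[g] / credits[g] for g in credits if credits[g] > 0}


# Hypothetical transcript evaluated against one target job type.
profile = skill_group_profile([
    ("Technical_Skills", 3, 0.8, 1.0),
    ("Technical_Skills", 1, 0.4, 0.5),
    ("Creative_Skills", 2, 0.5, 0.0),
])
```

Grades do not enter the computation, so the resulting profile reflects where the transcript provides relevant evidence rather than how well the student performed.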
Figure 5 presents Semantic Traceability Maps for four representative students from B.Sc. Computer Science, B.B.A. Business Information Systems, B.Eng. Computer Engineering, and B.F.A. Visual Arts, evaluated against the Data Scientist job type.
The computing students’ profiles concentrate more strongly in Technical Skills and Cognitive Skills, which is consistent with their computing-oriented curricula. The Visual Arts profile places greater weight on art-oriented skill groups such as Creative Skills and Affective Skills. This contrast illustrates how the framework connects ranking behaviour to interpretable ontology-level evidence rather than only producing a numerical score.
Table 21 reports the five highest-ranked courses by credit-weighted evidence relevance for two contrasting students evaluated against the Data Scientist job type. The table should be interpreted as an explanation of semantic evidence relevance, not as a full decomposition of the grade-sensitive ranking score, because the evidence-relevance signal is defined before grade weighting.
Student A’s top courses map to knowledge nodes that are structurally relevant to the Data Scientist path in the job-side analysis. Student B’s top courses map mainly to Fine Arts and Communications and Media, yielding lower evidence-relevance values for the Data Scientist path. This comparison illustrates how the framework provides traceable course-level evidence alongside the ranking output.
Taken together, the results provide empirical support for the three contributions stated in the Introduction. First, the LLM ensemble and HITL validation results show that the semi-automated process can produce an expert-confirmed ontology backbone suitable for downstream scoring. Second, the representation diagnostics and PCI-based group separation results indicate that augmented descriptions and multi-level structural scoring provide meaningful discrimination between contrasting academic profiles. Third, the ablation, ranking, and traceability analyses show that the framework components contribute complementary effects rather than uniformly improving every separation metric, and the final ranking outputs can be traced back to course-level semantic evidence.
5.8. Sensitivity and Ranking Stability Analysis
To assess whether the rankings produced by the proposed framework depend on specific design choices, we conducted a one-factor-at-a-time (OFAT) sensitivity analysis. The original configuration serves as the baseline (B0): Top-1 path locking, equal hierarchical weighting in the PCI, grade-weighted NTACS scoring, and credit-weighted transcript evidence. Starting from B0, we varied a single assumption at a time and measured how far the resulting student ranking departed from the baseline. Agreement with the baseline ranking was quantified using Spearman’s rank correlation ($\rho$), which captures changes in the overall ordering, and Top-10 overlap, which captures whether the highest-ranked candidates remain in the same group. The analysis was performed independently for all four evaluated target paths/domains (Data Scientist, Data Analyst, Data Engineer, and Visual Art), consistent with the ranking results reported in Table 18.
This analysis evaluates ranking stability under the five one-factor-at-a-time assumption changes (SA1–SA5) tested in this study. Therefore, it is distinct from and complementary to the ablation study (Section 5.6), the cross-method comparison (Section 5.7), and the expert-labelled validation, which assess the framework’s comparative effectiveness and discriminative value rather than its stability. To avoid confusion with the reference methods (M1–M5), sensitivity-analysis configurations are denoted SA1–SA5. SA1 (grade-free ranking) quantifies the influence of grade evidence by replacing NTACS with a grade-free relevant credit volume (RCV) score. SA2 (Top-3 path averaging) tests sensitivity to the entry-point selection rule and to candidate-path ambiguity. SA3 (leaf-weighted PCI) and SA4 (job-weighted PCI) assess dependence on the equal hierarchical-level weighting assumption in the PCI by emphasising the leaf level and the job-category level, respectively. SA5 (equal course weighting) tests whether course-credit magnitudes drive the results by replacing actual credits with an unweighted-course assumption.
Table 22 lists each scenario with the single assumption modified relative to B0.
Across all scenarios, only one factor is altered at a time, and all remaining components are held fixed to the baseline, ensuring that any observed change in ranking can be attributed to the tested assumption.
5.8.1. Sensitivity Analysis Results
Agreement with the baseline ranking is summarised in Table 23 and Table 24. Table 23 reports Spearman’s rank correlation ($\rho$) for each scenario and target path, which captures changes in the overall ordering of candidates. Table 24 reports the corresponding Top-10 overlap, which captures whether the highest-ranked candidates remain largely unchanged. For the parameter-perturbation scenarios (SA2, SA3, and SA4) and the credit scenario (SA5), higher values in both tables indicate that the ranking is more stable under the tested change. Scenario SA1 is interpreted differently: because RCV is a grade-free construct rather than a perturbed version of NTACS, low or even negative agreement is expected by design and should be interpreted as evidence of grade influence rather than as ranking instability.
While $\rho$ summarises global stability across the full ranking list, Top-10 overlap focuses on stability at the decision-critical top end of the list. Together, these two views distinguish scenarios that preserve the overall ordering from those that mainly reshuffle the top-ranked candidates, which is particularly relevant for shortlisting-based recruitment decisions.
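The two agreement measures can be sketched with the Python standard library. Spearman’s correlation is computed here as the Pearson correlation of average ranks, and Top-K overlap as the shared fraction of the two top-K sets; the function names and example scores are hypothetical, and the sketch does not handle constant score lists, for which the correlation is undefined.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rank correlation between two equal-length score lists."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def top_k_overlap(baseline, scenario, k=10):
    """Fraction of the baseline top-k students that stay in the scenario top-k.

    baseline, scenario: dicts mapping student ID -> score.
    """
    def top(scores):
        return set(sorted(scores, key=scores.get, reverse=True)[:k])
    return len(top(baseline) & top(scenario)) / k


# Hypothetical baseline and scenario scores for three students.
b0 = {"a": 3.0, "b": 2.0, "c": 1.0}
sa = {"a": 1.0, "b": 3.0, "c": 2.0}
rho = spearman_rho([b0[s] for s in b0], [sa[s] for s in b0])
overlap = top_k_overlap(b0, sa, k=2)
```

In this toy case the scenario moves the top baseline student to last place, which lowers both measures; a scenario can also leave the correlation high while changing the top-K set, which is why the two measures are reported side by side.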
5.8.2. Interpretation of Sensitivity and Ranking Stability
The ranking is highly stable with respect to hierarchical-level weighting. Re-weighting the three PCI levels towards either the leaf level (SA3) or the job-category level (SA4) leaves the ordering almost unchanged across all target paths/domains, with Top-10 overlap of 0.90–1.00. This indicates that, within the tested weighting alternatives, the rankings do not depend strongly on the specific choice of equal level weighting in the PCI. Stability is also high for the credit assumption (SA5): replacing actual course credits with an unweighted-course assumption preserves the ordering closely ($\rho$ up to 0.991), showing that the rankings are not driven primarily by course-credit magnitudes. For the entry-point rule (SA2), the overall ordering remains stable ($\rho$ up to 0.990), although the Top-10 overlap is more variable (0.50–0.90). This indicates that averaging over the Top-3 candidate paths preserves the global ranking while modestly reshuffling the top of the list, suggesting that single-best (Top-1) path locking mainly affects the composition of the highest-ranked candidates without substantially altering the broader ranking structure.
Scenario SA1 compares the grade-weighted ranking (NTACS) with the grade-free relevant-credit-volume ranking (RCV). For the three data-domain roles, the two rankings are essentially uncorrelated or weakly negatively correlated (see the Spearman’s $\rho$ values in Table 23), and the Top-10 overlap is small (0.10–0.30). This is expected and informative rather than a sign of instability: NTACS and RCV measure different constructs. NTACS captures the quality of learning, represented by grades and weighted by relevance, whereas RCV captures the volume of relevant coursework, irrespective of grades. A student who has taken many job-relevant courses (high RCV) need not be the same student who achieved the highest grades in relevant courses (high NTACS), so the two orderings can diverge substantially. The result confirms that grade evidence is a decisive factor in the NTACS ranking by design, which is why the framework also reports RCV (Equation (17)) and the relevant-volume measure (Equation (14)) as complementary, grade-independent indicators. For the Visual Art path, the divergence is weaker (Table 23), indicating closer agreement between grade-weighted and grade-free evidence than in the data-domain roles. This may suggest that, in this path, the volume of relevant coursework and grade-weighted achievement are more closely aligned.
Across the parameter-perturbation scenarios, the proposed ranking is generally stable, with Spearman correlations ranging from 0.939 to 0.999. This indicates that the results are not an artefact of a single arbitrary assumption regarding hierarchical weighting, the entry-point rule, or course credits. The only factor that materially changes the ordering is the inclusion of grade evidence, which reflects the intended grade-sensitive design of NTACS and motivates the complementary grade-free measures retained in the framework.
5.9. Evaluation of Ranking Validity and Expert Agreement
This subsection evaluates the ranking validity of the proposed framework and the reliability of the expert-based reference judgement. The evaluation is organised into four parts: construction of the expert-based reference set, grade-free relevance-ranking performance, grade-effect diagnostics, and expert-rating reliability and system–expert agreement.
Because employer hiring decisions, placement outcomes, and recruiter shortlisting records were not available for the datasets used in this study, an expert-labelled reference was constructed. This reference is not treated as an error-free ground truth. Rather, it serves as an independent expert judgement for assessing whether the rankings produced by the framework are reasonably aligned with human assessment of student–job fit.
It is important to note that this evaluation involves an expert activity separate from the HITL ontology verification described in Section 4.4. The HITL stage used two panels totalling ten members to validate LLM-proposed ontology mappings, a knowledge-engineering task. The ranking-validation stage described here used a distinct panel of five evaluators to rate student–job suitability from transcript evidence, a competency-assessment task. The two panels serve different functions and were selected independently on the basis of the expertise required for each task.
5.9.1. Construction of the Expert-Based Reference Set
The original ranking experiment included 94 students: 73 computing students (Computer Engineering: 20; Business Information Systems: 21; Computer Science: 21; Information Technology: 11) and 21 Visual Arts students. Because programme sizes were unequal, proportional sampling would have resulted in very few Information Technology students being included. To ensure sufficient representation across all strata and to avoid the survey fatigue that rating all 94 candidates would cause, disproportionate stratified sampling was applied: exactly nine students were drawn from each of the five programmes (Computer Engineering, Business Information Systems, Computer Science, Information Technology, and Visual Arts), producing a balanced expert-evaluation subset of 45 students. Each sampled student was matched to the full ranking pool by student ID so that framework-generated scores for the 45 candidates could be directly compared with expert ratings.
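The sampling step can be sketched as follows. This is an illustrative Python sketch only: the programme codes, synthetic student IDs, and random seed are hypothetical, and the study’s actual sampling procedure may differ in detail. Only the programme sizes and the nine-per-programme quota come from the text.

```python
import random

# Programme sizes as reported in the text (94 students in total).
PROGRAMME_SIZES = {"CE": 20, "BIS": 21, "CS": 21, "IT": 11, "VA": 21}


def sample_per_stratum(students, per_stratum=9, seed=0):
    """Draw the same number of students from every programme (stratum).

    students: list of (student_id, programme) pairs.
    Returns a dict mapping programme -> sorted list of sampled student IDs.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    by_programme = {}
    for student_id, programme in students:
        by_programme.setdefault(programme, []).append(student_id)
    return {
        programme: sorted(rng.sample(ids, per_stratum))
        for programme, ids in sorted(by_programme.items())
    }


# Synthetic IDs such as "IT-03"; real student IDs would be used in practice.
students = [
    (f"{programme}-{i:02d}", programme)
    for programme, size in PROGRAMME_SIZES.items()
    for i in range(size)
]
subset = sample_per_stratum(students, per_stratum=9)
subset_size = sum(len(ids) for ids in subset.values())
```

Because the sampled IDs are retained, the subset can be joined back to the full ranking pool by student ID, as described above.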
Four job postings were selected for expert evaluation—one per target job type—based on the spread of scores across the 45 sampled students so that postings producing meaningful variation in student–job fit were preferred: task A1 used posting P1 (Senior Data Scientist), task A2 used posting P2 (Analyst, Imagery Analytics), task A3 used posting P3 (Principal Data Engineer), and task A4 used posting P4 (Art & Graphic Design Team Leader). Each task generated 225 ratings, yielding 900 rating records in total.
The expert panel comprised five evaluators: three HR professionals from private-sector companies with more than ten years of experience in recruitment and competency assessment and two university career-guidance lecturers with relevant experience in student employability. Each expert rated the suitability of each of the 45 candidates for each selected job posting on a 1–5 scale.
5.9.2. Relevance-Ranking Performance
Table 25 reports the grade-free relevance-ranking performance of four methods: keyword matching (M2), flat SBERT matching (M3), leaf-level-only ontology matching with the importance factor (M4), and the grade-free variant of the proposed framework (M5: NonGrade). The GPA baseline (M1) and the full NTACS score (M5: Full Framework) are excluded from this table because they incorporate academic grade information and therefore do not represent pure relevance scores.
Five statistics are reported. AUC measures the probability that a randomly selected target-group student receives a higher score than a randomly selected reference-group student. Cliff’s $\delta$ provides a non-parametric effect size for group separation. Recall@20 measures the proportion of target-group students retrieved within the Top-20 ranked students; HitRate@20 measures the proportion of Top-20 positions occupied by target-group students; and Mean Rank is the mean rank position of target-group students. For data-oriented job types, computing students are the target group, and Visual Arts students are the reference group; for the Visual Art task, the roles are reversed.
The results show strong grade-free separation for several job types. For Data Engineer, all four grade-free methods achieved perfect separation. For Data Analyst, M5 (NonGrade) produced an AUC of 0.9667 and Cliff’s $\delta$ of 0.9335, slightly higher than M4, which suggests that the ontology-based relevance structure is useful for this job type. For the Visual Art task, M3, M4, and M5 (NonGrade) all achieved AUC = 1.0000, Cliff’s $\delta$ = 1.0000, and HitRate@20 = 1.00.
The Data Scientist case warrants more cautious interpretation. Although M2 and M3 achieved perfect separation, M4 and M5 (NonGrade) produced weaker results, with M5 (NonGrade) yielding AUC = 0.4442 and a correspondingly negative Cliff’s $\delta$ (Table 25). Diagnostic inspection indicated that some data-science skill assignments were unexpectedly associated with the Mechanical node, suggesting a semantic nearest-node assignment issue rather than a failure of the weighting mechanism itself. This case illustrates that O4CM is not presented as error-free; rather, its ontology-based traceability makes failure cases diagnosable: the semantic traceability map surfaces the specific node responsible for the misalignment, enabling targeted correction of the augmentation or path-locking step.
It should also be noted that the maximum achievable Recall@20 depends on the target group size. For the 73 computing students, the maximum is $20/73 \approx 0.274$; for the 21 Visual Arts students, it is $20/21 \approx 0.952$. Therefore, Recall@20 should be interpreted alongside HitRate@20 and Mean Rank.
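To make the five statistics concrete, the sketch below computes them for a synthetic cohort with the same group sizes as the study (73 target and 21 reference students). The function names, the treatment of ties in AUC (counted as one half), and the synthetic scores are assumptions for illustration and do not reproduce the study’s code or data.

```python
def auc(target_scores, reference_scores):
    """Probability that a random target student outscores a random reference student."""
    wins = 0.0
    for t in target_scores:
        for r in reference_scores:
            if t > r:
                wins += 1.0
            elif t == r:
                wins += 0.5  # assumption: ties count as half a win
    return wins / (len(target_scores) * len(reference_scores))


def cliffs_delta(target_scores, reference_scores):
    """Cliff's delta: P(target > reference) minus P(target < reference)."""
    greater = sum(1 for t in target_scores for r in reference_scores if t > r)
    less = sum(1 for t in target_scores for r in reference_scores if t < r)
    return (greater - less) / (len(target_scores) * len(reference_scores))


def top_k_metrics(scores, is_target, k=20):
    """Return (Recall@k, HitRate@k, Mean Rank) for the target group."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits = sum(1 for i in order[:k] if is_target[i])
    n_target = sum(1 for flag in is_target if flag)
    recall = hits / n_target   # share of the target group found in the top k
    hit_rate = hits / k        # share of top-k slots held by the target group
    mean_rank = sum(pos for pos, i in enumerate(order, start=1) if is_target[i]) / n_target
    return recall, hit_rate, mean_rank


# Synthetic, perfectly separated cohort: every target student outscores every reference student.
target = [2.0 + 0.01 * i for i in range(73)]
reference = [1.0 + 0.01 * i for i in range(21)]
scores = target + reference
is_target = [True] * 73 + [False] * 21

recall_20, hit_rate_20, mean_rank = top_k_metrics(scores, is_target, k=20)
```

Even under perfect separation, Recall@20 for the 73-student group cannot exceed 20/73, whereas HitRate@20 reaches 1.0; this is the ceiling effect noted above.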
5.9.3. Grade-Effect Diagnostics
Table 26 examines how including grades changes the student ranking. M5 (NonGrade) is the grade-free ontology-based relevance score, while M5 (Full Framework) is the full NTACS score that incorporates academic grades. The Spearman correlation between M5 (Full Framework) and M5 (NonGrade) indicates whether grades preserve the relevance-based ordering; the correlation between GPA and M5 (Full Framework) indicates how strongly the full score is driven by general academic performance.
M5 (Full Framework) is strongly associated with GPA across all job types (Spearman’s $\rho$ values in Table 26), which is expected because grades appear directly in the NTACS numerator. In contrast, the correlation between M5 (Full Framework) and M5 (NonGrade) is weak or negative for the three data-domain roles (Table 26), indicating that incorporating grades substantially changes the relevance-based ranking. Mean absolute rank shifts ranged from 22.6 to 34.3 positions, confirming that many students change positions after grades are added. Therefore, M5 (Full Framework) should be interpreted as a grade-sensitive competency-quality score rather than a pure relevance score; M5 (NonGrade) is more appropriate for evaluating grade-free ontology-based relevance.
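The mean absolute rank shift used in this diagnostic can be illustrated with a short Python sketch. The tie-breaking rule (ties broken by list position) and the toy scores are assumptions; the study does not state how ties were handled.

```python
def ordinal_ranks(scores):
    """1-based ranks, 1 = highest score; ties are broken by list position (an assumption)."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    ranks = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def mean_abs_rank_shift(scores_before, scores_after):
    """Average number of rank positions a student moves between two score lists."""
    before = ordinal_ranks(scores_before)
    after = ordinal_ranks(scores_after)
    return sum(abs(b - a) for b, a in zip(before, after)) / len(before)


# Toy example: four students whose order is fully reversed once grades are added.
non_grade = [0.9, 0.8, 0.7, 0.6]
full_framework = [0.6, 0.7, 0.8, 0.9]
shift = mean_abs_rank_shift(non_grade, full_framework)  # (3 + 1 + 1 + 3) / 4
```

A shift of zero would mean that grades leave the relevance-based order untouched; larger values indicate that more students change position when grades are included.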
5.9.4. Expert-Rating Reliability
Before comparing system-generated scores with expert ratings, the reliability of the expert judgements was examined. Each of the four tasks involved 45 candidates rated by 5 experts, giving 225 ratings per task and 900 ratings in total.
Table 27 reports ICC(2,k) under a two-way random-effects absolute-agreement model, Krippendorff’s $\alpha$, and Kendall’s W for each posting and overall.
ICC(2,k) values ranged from 0.77 to 0.79 across all four postings, indicating that the averaged expert rating is sufficiently reliable for use as an aggregated reference. However, Krippendorff’s $\alpha$ (0.39–0.42) reflects only low-to-moderate agreement in exact score assignment among individual raters, and Kendall’s W (0.43–0.55) similarly indicates moderate ranking concordance. These values indicate that while the aggregated mean rating is stable, individual expert scores varied, which is consistent with the subjective nature of competency assessment across different professional backgrounds. Therefore, the mean expert rating is used as an aggregated independent reference, not as an error-free ground-truth label.
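For readers who wish to reproduce two of these reliability statistics, the sketch below computes ICC(2,k) from the two-way ANOVA mean squares (following the Shrout and Fleiss formulation) and a tie-corrected Kendall’s W. It is a pure-Python sketch under stated assumptions: the ratings matrix has one row per candidate and one column per rater with no missing values, and the example matrix is the classic six-subject, four-judge textbook data set, not the study’s ratings.

```python
from collections import Counter


def icc_2k(ratings):
    """ICC(2,k): two-way random effects, absolute agreement, average of k raters."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between candidates
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (ms_cols - ms_err) / n)


def average_ranks(values):
    """1-based ranks with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def kendalls_w(ratings):
    """Kendall's W with the standard correction for tied ratings within a rater."""
    n, m = len(ratings), len(ratings[0])
    rank_sums = [0.0] * n
    tie_term = 0.0
    for j in range(m):
        column = [row[j] for row in ratings]
        for i, rank in enumerate(average_ranks(column)):
            rank_sums[i] += rank
        tie_term += sum(t ** 3 - t for t in Counter(column).values())
    mean_sum = m * (n + 1) / 2
    s = sum((r - mean_sum) ** 2 for r in rank_sums)
    # Undefined (division by zero) if every rater gives all candidates the same rating.
    return 12 * s / (m ** 2 * (n ** 3 - n) - m * tie_term)


# Textbook example (6 subjects x 4 judges), used here only to exercise the functions.
example = [[9, 2, 5, 8], [6, 1, 3, 2], [8, 4, 6, 8],
           [7, 1, 2, 6], [10, 5, 6, 9], [6, 2, 4, 7]]
icc_value = icc_2k(example)
w_value = kendalls_w(example)
```

Because ICC(2,k) concerns the reliability of the averaged rating while Kendall’s W concerns rank concordance among individual raters, the two can differ noticeably, as observed in Table 27.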
5.9.5. External Validation Against Expert Ratings
After confirming the reliability of the aggregated expert ratings, each ranking method was compared with the mean expert ratings for the 45 students in the expert-evaluation subset. The merged validation dataset contained 180 candidate–posting pairs (45 candidates × 4 postings) with no missing system scores.
Rank-based metrics were used: Spearman’s $\rho$ and Kendall’s $\tau$ measure overall rank agreement between system scores and expert mean ratings; NDCG@10 and NDCG@20 measure whether candidates with high expert ratings are placed near the top of the system-generated ranking.
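A minimal NDCG@k sketch is given below for illustration. It assumes a linear gain (the mean expert rating itself) and a base-2 logarithmic discount; the study does not state which gain function was used, so this choice, the function names, and the toy values are assumptions.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k positions (linear gain)."""
    return sum(rel / math.log2(pos + 1) for pos, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(system_scores, expert_ratings, k=10):
    """NDCG@k of the system ranking, using mean expert ratings as graded relevance."""
    order = sorted(range(len(system_scores)), key=lambda i: -system_scores[i])
    ranked_ratings = [expert_ratings[i] for i in order]
    ideal_ratings = sorted(expert_ratings, reverse=True)
    ideal = dcg_at_k(ideal_ratings, k)
    return dcg_at_k(ranked_ratings, k) / ideal if ideal > 0 else 0.0


# Toy example: three candidates with mean expert ratings on the 1-5 scale.
expert = [5.0, 3.0, 1.0]
agreeing_system = [0.9, 0.8, 0.1]   # same order as the experts
reversed_system = [0.1, 0.8, 0.9]   # opposite order
ndcg_agree = ndcg_at_k(agreeing_system, expert, k=3)
ndcg_reversed = ndcg_at_k(reversed_system, expert, k=3)
```

Unlike the rank correlations, NDCG@k is dominated by the first few positions, so a method can score well on NDCG@10 while showing only moderate overall rank agreement.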
The external validation results for all ranking methods across the four job types are summarised in Table 28.
The results show that expert ratings were more strongly aligned with grade-sensitive scores than with grade-free relevance scores in several job types (values in Table 28). For Data Scientist, GPA ranking produced the highest Spearman correlation, followed by M5 (Full Framework) and M4. For Data Analyst, M1 and M5 (Full Framework) showed very close rank agreement with expert ratings, and M5 (Full Framework) produced the highest NDCG@10. For Data Engineer, M1 again achieved the highest Spearman correlation, closely followed by M5 (Full Framework). For Visual Art, M5 (Full Framework) achieved the highest overall rank agreement, while M3 produced the highest NDCG@10 values.
Overall, these results indicate that expert judgements were not based solely on transcript–job relevance. Experts also appear to have considered academic achievement, which explains the alignment between GPA-sensitive scores and expert ratings. M5 (NonGrade) remains the most appropriate measure for evaluating grade-free ontology-based relevance, while M5 (Full Framework) provides a grade-sensitive competency-quality score that is more consistent with expert judgement in several scenarios. At the same time, these analyses show that PCI, TACS, and NTACS provide complementary and traceable competency-based evidence beyond GPA, keyword, and flat-SBERT baselines rather than universally outperforming all simpler methods.
5.10. Expert-Aligned Ranking Within Computing Programmes
The full-cohort evaluation in Section 5.9 includes both computing-programme students and Visual Arts students, which creates a broad programme-level contrast that may inflate separation scores. To examine whether the framework retains meaningful agreement with expert judgements under more demanding conditions, we conducted a computing-only expert-aligned ranking analysis in which Visual Arts candidates and Visual Arts postings were excluded.
5.10.1. Experimental Setting
The analysis retained only expert-rated candidates from Computer Engineering (CE), Business Information Systems (BIS), Computer Science (CS), and Information Technology (IT), spanning four closely related computing programmes. Only three data-domain job postings were included: Data Scientist, Data Analyst, and Data Engineer. This resulted in a restricted evaluation setting involving 36 computing candidates and 108 candidate–job pairs, each evaluated against mean expert ratings. Rankings were compared against expert judgements using Spearman’s $\rho$, Kendall’s $\tau$, NDCG@10, and NDCG@20. M1 (GPA Ranking) was excluded from this analysis because the primary question is whether ontology-based relevance signals can distinguish among students from closely related disciplines.
5.10.2. Expert-Aligned Ranking Results
Table 29 reports the system–expert agreement for each method across the three data-domain job types.
5.10.3. Interpretation of Computing-Only Expert Alignment
Across all three data-domain job types (Data Scientist, Data Analyst, and Data Engineer), M5 (Full Framework) achieved the strongest agreement with expert judgements (Spearman’s $\rho$ values and significance levels in Table 29). This pattern is consistent with the full-cohort results and indicates that incorporating academic performance into ontology-weighted scoring produces rankings that more closely reflect expert assessments, even when the evaluation is restricted to computing-oriented programmes with partial curricular overlap.
Grade-free methods (M3, M4, and M5 (NonGrade)) produced weakly negative or near-zero Spearman correlations for Data Analyst and Data Engineer, suggesting that ontology-based relevance signals alone are insufficient for fine-grained discrimination among students from closely related disciplines when grades are excluded. The negative correlations for M3 across all three job types further indicate that flat semantic matching without hierarchical structural weighting is not robust under this more demanding evaluation condition.
These findings should be interpreted cautiously. The expert subset comprises 36 candidates drawn from a single institution, which limits generalisation to broader settings. Accordingly, the computing-only analysis is offered as supporting evidence rather than definitive proof of generalisability. Nevertheless, the results suggest that the framework’s performance in the full-cohort evaluation cannot be attributed solely to obvious programme-level contrasts: M5 (Full Framework) retains substantial agreement with expert judgements when evaluated exclusively within closely related computing programmes and data-domain occupations.
6. Discussion
6.1. Interpretation of Key Findings
The findings provide empirical support for O4CM as a prototype framework for ontology-grounded student-to-job competency mapping within the dataset used in this study. Rather than demonstrating universal superiority over all possible matching systems, the results show that the proposed framework offers a coherent and traceable way to integrate academic evidence, ontology-based structural alignment, and job-side relevance signals into a single ranking pipeline.
First, the ontology induction results support the contribution of a semi-automated ontology construction process. The results show that LLM-assisted construction can produce a usable ontology backbone when its outputs are treated as candidate structures rather than final assertions. The five-model ensemble produced high observed agreement for the knowledge-to-skill mapping task, with 21 of 22 knowledge domains reaching at least consensus. However, the expert correction cases also show that model agreement is not equivalent to domain correctness. The unanimous case of Administration and Management, which was reassigned from Cognitive_Skills to Technical_Skills after expert review, is particularly important. It indicates that even highly consistent LLM outputs may still be functionally misaligned with the target professional context. This supports the methodological decision to apply HITL verification across all agreement tiers, including unanimous outputs.
Second, the semantic representation diagnostics and PCI-based group separation results support the contribution of multi-level semantic and structural discrimination. The diagnostic examples show why surface labels alone are insufficient: short course titles, skill names, and ontology labels often omit the functional context needed for semantic comparison. The PCI-based group separation analysis further shows that the locked-path structural score captures meaningful curriculum-level differences between the two clearly contrasting academic profiles that were evaluated. Computing students obtain higher credit-weighted PCI scores on the Computer Job path, whereas Visual Arts students obtain higher scores on the Art Job path, with statistically significant and large effects in both directions. This bidirectional pattern supports the internal structural validity of the ontology backbone and indicates that the PCI functions as more than a leaf-level similarity score.
Third, the ablation and ranking results support the contribution of a job-conditioned and traceable scoring framework. The proposed framework does not maximise every individual metric. This is expected because the goal is not to optimise separation on a single path but to maintain balanced behaviour across job-conditioned ranking, negative-control cross-domain discrimination, and traceability. Removing the job importance factor weakens Computer Job separation, indicating that job-side structural relevance is important in differentiating data-domain competency evidence. Removing the PCI affects the Art Job path more strongly, suggesting that multi-level structural scoring helps reduce cross-domain contamination in the evaluated computing-versus-arts negative-control setting. Removing augmentation improves some Computer Job separation values but weakens Art Job separation, implying that short labels may over-specialise one domain while reducing robustness across contrasting paths.
6.2. Role of the Framework Components
The results clarify the distinct function of each major component in O4CM. Semantic augmentation provides richer contextual evidence for embedding and supports interpretability at the course and ontology-node levels. It is especially useful when course titles or skill labels are too short to express the underlying competency. However, the ablation results also show that augmentation should not be interpreted as a mechanism that always increases numerical separation. Its main value is to provide richer and more balanced semantic evidence, not to maximise one separation metric.
The PCI mechanism provides structural verification across the ontology path. A course may be close to a knowledge node at the leaf level, but this does not guarantee that it is also consistent with the corresponding skill group at the skill-group level or the job category at the job-category level. By averaging the level-wise similarity scores across the locked path, the PCI provides a continuous structural confidence weight. This design avoids hard exclusion of ambiguous courses while reducing their contribution when higher-level contextual support is weak.
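The averaging idea can be sketched in a few lines of Python. The function name, argument names, and the alternative weights below are hypothetical; only the equal-weight average over the three levels of the locked path is taken from the description above.

```python
def path_consistency(sim_knowledge, sim_skill_group, sim_job_category,
                     weights=(1.0, 1.0, 1.0)):
    """Weighted mean of the three level-wise similarities along a locked path.

    Equal weights reproduce the plain average described in the text; other
    weights mimic the re-weighting alternatives probed in the sensitivity
    analysis (the specific weights used there are not given here).
    """
    sims = (sim_knowledge, sim_skill_group, sim_job_category)
    return sum(w * s for w, s in zip(weights, sims)) / sum(weights)


# A course that resembles a knowledge node but has weak higher-level support:
equal_weighted = path_consistency(0.9, 0.3, 0.3)
leaf_weighted = path_consistency(0.9, 0.3, 0.3, weights=(2.0, 1.0, 1.0))
```

In this toy case the strong leaf-level match (0.9) is tempered by weak skill-group and job-category support, so the course is down-weighted rather than excluded outright.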
The job importance factor plays a different role. In the current formulation, it represents the mean job-side structural confidence of job-skill records mapped to knowledge node k for job type c. It should therefore be interpreted as job-side structural relevance, not the raw frequency of occurrence. This distinction is important because the framework does not simply reward commonly appearing skill terms. Instead, it rewards knowledge nodes whose job-side skill evidence is structurally supported by the ontology path.
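The distinction between mean structural confidence and raw frequency can be illustrated as follows. This is a sketch under assumptions: the record layout, the function name, and the confidence values are hypothetical, and the node names are used only as placeholders.

```python
from collections import defaultdict


def importance_factors(job_skill_records):
    """Mean job-side structural confidence per (job type, knowledge node).

    job_skill_records: iterable of (job_type, knowledge_node, confidence) tuples,
    one per job-skill record that has been mapped to a knowledge node.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for job_type, node, confidence in job_skill_records:
        sums[(job_type, node)] += confidence
        counts[(job_type, node)] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Hypothetical records: one node appears often with weak structural support,
# another appears once with strong support.
records = [
    ("Data Engineer", "Mathematics", 0.3),
    ("Data Engineer", "Mathematics", 0.3),
    ("Data Engineer", "Mathematics", 0.3),
    ("Data Engineer", "Computers and Electronics", 0.9),
]
factors = importance_factors(records)
```

Here the frequently occurring node receives the lower factor, because the mean reflects how well each mapped record is supported by the ontology path rather than how often the skill term appears.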
Together, the PCI and the job importance factor create a two-sided relevance mechanism. The former evaluates how strongly a student’s course aligns with the ontology path, while the latter evaluates how strongly that knowledge node is supported by job-side evidence for a target job type. This interaction is the main reason why the resulting score is more informative than GPA, keyword overlap, or flat similarity alone.
6.3. Comparison with Simpler Matching Strategies
The comparison methods help position O4CM relative to common ranking and matching strategies. GPA ranking provides a useful lower-bound reference because it reflects academic performance without considering job relevance. In this dataset, the Visual Arts control group has the highest mean GPA. As a result, GPA-based ranking fails the expected direction for data-domain roles. This finding supports the argument that GPA alone is not sufficient for job-specific competency assessment because it cannot distinguish whether strong academic performance was achieved in job-relevant or job-irrelevant coursework.
Keyword matching performs better than GPA for broad domain separation, but its identical separation values across Data Scientist, Data Analyst, and Data Engineer indicate limited sub-type sensitivity. This limitation is consistent with the nature of lexical matching: roles within the same broad domain often share surface vocabulary, even when they differ in competency emphasis. Therefore, keyword matching can identify general data-domain relevance but is less suitable for distinguishing closely related occupational sub-types.
Flat SBERT similarity improves over purely lexical matching by using dense semantic representations [3]. However, the results show that semantic encoding alone is not sufficient to satisfy all evaluation criteria for ontology-grounded competency mapping. Without branch locking and multi-level PCI scoring, the method lacks an explicit mechanism for checking whether a leaf-level match is also consistent with the broader skill and job-category context. O4CM retains the benefit of SBERT-based semantic similarity while adding ontology-based structural verification and job-side relevance weighting.
The leaf-level-only comparator further clarifies the contribution of multi-level scoring. Because it uses the top-matched knowledge node and retains the job importance factor, it is stronger than flat SBERT. However, its weaker Visual Art performance suggests that leaf-level matching can still allow for cross-domain contamination when higher-level ontology support is not considered. The proposed framework addresses this by using the full locked path from the knowledge node to the skill group and job category.
Taken together, the comparison results show that no single simpler alternative satisfies all four evaluation criteria simultaneously. GPA ranking fails in domain discrimination for data roles. Keyword matching and flat SBERT achieve broad separation but cannot distinguish sub-types (Data Scientist vs. Data Analyst vs. Data Engineer). The leaf-level-only method with the job importance factor achieves sub-type sensitivity but degrades on the cross-domain path. Only the proposed framework satisfies domain discrimination, sub-type sensitivity, cross-domain robustness, and ontology-based interpretability in this evaluation. The external validation results (Section 5.9.5) further show that M5 (Full Framework) aligns more closely with expert judgement than keyword or flat SBERT methods in most job types, while M5 (NonGrade) provides a grade-free relevance view that is not available from simpler baselines.
One limitation of this evaluation is that it was conducted primarily on a computing–arts contrast, which is a relatively easy discrimination task. The framework has not yet been tested on near-boundary curricula such as Data Science, Software Engineering, Business Analytics, and Human–Computer Interaction, where job-relevant knowledge domains overlap more substantially and the expected benefit of multi-level structural scoring may be less pronounced. Near-boundary testing is an important direction for future work and would provide a more stringent practical comparison with simpler alternatives.
6.4. Application-Oriented Significance
From an application perspective, the main value of O4CM is not only the final rank order but also the structured evidence attached to each ranking decision. The framework operates at the course-record level, which allows it to capture variation within the same academic programme. This is important for programmes with flexible elective structures, where two students may share the same major but accumulate different job-relevant competency profiles.
The normalised score (NTACS) is useful for candidate ranking because it reduces sensitivity to transcript length and credit volume while preserving the effects of grades, credits, student-side structural alignment, and job-side relevance. The supplementary grade-free score provides an additional view of the breadth of structurally supported coursework, independent of grade performance. These two outputs can support different decision needs: the normalised score for ranking and the supplementary score for understanding the amount of relevant learning evidence behind it.
The Semantic Traceability Map extends the framework beyond black-box ranking. By aggregating the course-level relevance signal across knowledge nodes or skill groups, the framework can show which parts of a student’s transcript contribute most strongly to a target job type. This supports practical use cases in educational advising, curriculum review, career guidance, and recruitment screening. For example, the framework can identify whether a student is strong because of technical computing courses, cognitively oriented analytical courses, or broader interdisciplinary coursework. Such evidence is difficult to obtain from a single GPA score or an unstructured semantic similarity score.
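The aggregation behind such a map can be sketched as a simple group-and-sum. The sketch assumes that a non-negative relevance signal is already available for each course; the function name, course titles, node assignments, and signal values are all hypothetical.

```python
from collections import defaultdict


def traceability_shares(course_signals, group_of=None):
    """Aggregate course-level relevance signals and return each group's share.

    course_signals: iterable of (course_title, knowledge_node, signal) tuples.
    group_of: optional dict mapping knowledge node -> skill group; when given,
              signals are aggregated by skill group instead of knowledge node.
    """
    totals = defaultdict(float)
    for _course, node, signal in course_signals:
        key = group_of[node] if group_of else node
        totals[key] += signal
    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda item: -item[1])
    return {key: (value / grand_total if grand_total else 0.0) for key, value in ranked}


# Hypothetical transcript evidence for one student and one target job type.
signals = [
    ("Database Systems", "Computers and Electronics", 0.6),
    ("Statistics", "Mathematics", 0.3),
    ("Data Mining", "Computers and Electronics", 0.1),
]
shares_by_node = traceability_shares(signals)
shares_by_group = traceability_shares(
    signals,
    group_of={"Computers and Electronics": "Technical_Skills",
              "Mathematics": "Cognitive_Skills"},
)
```

Reporting shares rather than raw totals makes the maps of different students comparable, although the raw totals remain useful when the overall volume of evidence matters.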
6.5. Implications for Ontology-Grounded Competency Mapping
The study suggests that ontology-grounded matching is most useful when it is treated as an auditable analytical framework rather than a fully automatic truth-generating system. The ontology backbone in O4CM is constructed from O*NET knowledge domains, LLM-induced skill groups, and expert-confirmed job-category assignments. This design provides a controlled semantic structure for the mapping of student and job evidence while still acknowledging that the resulting hierarchy is an analytical taxonomy rather than a complete representation of all possible competency relationships.
The HITL results also show that expert review is not merely a quality assurance step applied after automation. It is a necessary part of the ontology construction process. The LLM ensemble helps reduce manual workload by producing candidate structures and highlighting agreement patterns, but expert judgement is required to determine whether those structures are appropriate for the target domain. This balance is important for framework-oriented applications, where scalability and auditability must both be maintained.
6.6. Limitations
Several limitations qualify the scope of the present findings, and each is described here in terms of how it may shape the observed results rather than merely listed.
First, the empirical evaluation is internal to a single Thai university. Because the transcript records originate from one institution, the observed score distributions are partly a product of local curriculum design, grading practices, course-naming conventions, and programme structure. These institution-specific factors enter the pipeline directly: course titles and descriptions drive the SBERT augmentation in Phase 3, while grades and credits enter the NTACS numerator in Phase 5. Consequently, the absolute score ranges reported here and, to some extent, the magnitude of the group separations may shift if the framework is applied to an institution with different credit systems, grading scales, or curricular vocabulary. Therefore, the results should be read as evidence of internal feasibility rather than as a generalisation claim across institutions or national education systems.
Second, the composition of the evaluated cohort influences the observed separation. The four computing programmes were selected as the positive group, while Visual Arts was selected specifically because it is academically distant from data-domain requirements. This positive-versus-control design supports a clear negative-control test, but it does not, on its own, demonstrate performance in programmes with intermediate or partial overlap. The effect of cohort composition is compounded by a grading confound: the B.F.A. Visual Arts group has the highest mean GPA in the sample (Table 7). Because GPA enters every grade-sensitive score, this distribution works against the expected computing-over-arts separation under NTACS and GPA ranking in the data-domain direction while favouring separation on the Art Job path. This is precisely why the framework also reports grade-free measures (RCV and the relevant-volume measure) as a GPA-independent view of competency evidence, and the grade-sensitive and grade-free results should be weighed together rather than in isolation.
Third, and most consequential for interpreting the validation, the computing-versus-arts comparison is a relatively easy contrast. It establishes that O4CM can separate clearly different academic profiles, but this is not equivalent to demonstrating that the scoring mechanism is robust for difficult, near-boundary competency-mapping cases. The computing-only analysis in Section 5.10 begins to probe this harder setting and is informative about the framework’s discriminative ceiling. There, Information Technology was separated consistently from the core computing programmes with medium-to-large effects, whereas Computer Engineering, Business Information Systems, and Computer Science remained closely clustered (CE/BIS/CS pairs). This pattern is consistent with established curricular analysis: the ACM/IEEE-CS Computing Curricula 2020 report documents substantial shared knowledge across the core computing disciplines, as evidenced by its cross-disciplinary mapping of computing knowledge areas, while characterising information technology as the discipline concerned most directly with concrete technology components in organisational settings [36]. The clustering of the computing-heavy Business Information Systems programme with Computer Science and Computer Engineering rather than with Information Technology is consistent with the substantial data-domain course content in its curriculum (Table 6). As noted in Section 5.10.3, however, the present data cannot fully distinguish whether the low within-core separation reflects genuine curricular overlap or a limit of the ontology’s discriminative granularity, and near-boundary programmes such as Information Systems, Business Analytics, Software Engineering, Digital Media, Human–Computer Interaction, Computational Design, and Applied Statistics would be required to resolve this question. Their absence here bounds the strength of the validation. A further contributor at the dataset level is the degree of enrolment overlap among the core computing programmes: students from CE, BIS, and CS in this cohort share substantial course content, with programmes differing mainly in course naming conventions rather than in the underlying competency coverage reflected in transcripts. This structural similarity makes it difficult to demonstrate clear programme-level discrimination experimentally within the current dataset and may understate the framework’s potential precision when applied to genuinely distinct curricula in future evaluations.
Fourth, the job-side evidence is derived from a static and heterogeneous job-posting corpus, and this affects the importance factor that conditions every ranking. The data-domain postings originate from an Indeed-sourced Kaggle dataset reflecting the United States labour market circa 2018, whereas the Visual Arts negative-control postings were collected separately by the research team from Asian and global listings during the preparation of this study. Two consequences follow. First, the 2018 vintage means that the corpus predates the recent expansion of AI- and LLM-related roles. Its most frequent skill terms reflect the data-platform and statistical-computing emphasis of that period, namely Python, SQL, Machine Learning, R, Hadoop, and Spark (
Table 4), whereas newer competency vocabulary that has since become prominent in data-domain hiring, such as large language models, generative AI, prompt engineering, vector databases, and MLOps, does not appear in the corpus. Consequently, the importance factor reflects the skill emphasis of that period rather than current demand. Second, the temporal, platform, and market gap between the two corpora is a potential confound that we acknowledge directly: because the data-domain and Visual Arts postings differ not only in occupational content but also in collection year, source platform, and geographic market, part of the observed cross-domain separation could, in principle, be attributed to corpus differences rather than to competency differences alone. In addition, the original Indeed dataset was released without a data card, so its search strategy and deduplication procedure cannot be independently verified, platform- or source-selection bias cannot be ruled out, and seniority was not annotated as a separate field. For these reasons, the derived importance-factor values should be read as corpus-conditioned structural relevance signals, not as estimates of general or current labour-market demand.
Fifth, the importance factor is currently defined as job-side structural relevance based on the mean path consistency of mapped job-skill records rather than on raw demand frequency. This keeps the score structurally consistent with the ontology, but it also means that the framework deliberately does not reward a knowledge node simply because the associated skill terms appear often. The practical implication is that the importance factor measures how well a node is structurally supported by the ontology path, not how frequently the market demands it, and the two need not coincide.
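The distinction between structural relevance and raw demand frequency can be illustrated with a small sketch. It assumes, hypothetically, that each mapped job-skill record carries a path-consistency score in [0, 1] for the knowledge node it maps to; the record layout, the node names, and the function `summarise` are illustrative and are not taken from the O4CM implementation.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical mapped job-skill records:
# (knowledge node, path-consistency score in [0, 1]). Toy values only.
records = [
    ("Programming", 0.55), ("Programming", 0.60), ("Programming", 0.50),
    ("Programming", 0.58), ("Statistics", 0.90), ("Statistics", 0.86),
]


def summarise(records):
    """Return, per node, the mean path consistency and the share of records."""
    by_node = defaultdict(list)
    for node, consistency in records:
        by_node[node].append(consistency)
    total = len(records)
    return {
        node: {
            "structural_relevance": mean(values),  # mean path consistency
            "demand_share": len(values) / total,   # raw frequency signal
        }
        for node, values in by_node.items()
    }


summary = summarise(records)
```

In this toy corpus the more frequently mentioned node has the lower mean path consistency, so a ranking by structural relevance and a ranking by demand share would order the two nodes differently.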
Finally, the semantic representation diagnostics in
Section 5.4 are illustrative rather than a separate classification benchmark. They justify the choice of description-to-definition matching, but they do not independently quantify representation quality against an external gold standard, so the contribution of the augmentation step is supported by ablation and qualitative evidence rather than by a stand-alone accuracy measure.
6.7. Future Work
The limitations above translate into a set of prioritised directions for future research.
First, cross-institutional validation is the most immediate need. Evaluating the framework on transcript data from multiple universities with different credit systems, grading scales, and curriculum structures would clarify which aspects of O4CM performance are specific to the present Thai-university context and which transfer more broadly, directly addressing the single-institution constraint noted above.
Second, occupational and programme expansion towards near-boundary cases is the strongest remaining test of the scoring mechanism. Programmes such as information systems, business analytics, software engineering, digital media, human–computer interaction, computational design, and applied statistics share partial competency overlap with data-domain roles, which makes group separation harder to achieve and therefore more informative than the computing-versus-arts contrast used here. Testing these cases, ideally with individual-level expert-verified ground truth, would reveal whether the PCI mechanism and the importance-factor formulation retain discriminative value once occupational boundaries become less clearly defined, and would resolve whether the low within-core separation observed in
Section 5.10 reflects real curricular overlap or a limit of the ontology’s discriminative granularity.
Third, the job-posting corpus should be refreshed, and the temporal-market confound should be controlled. Re-running the pipeline on recent postings would capture AI- and LLM-era competency vocabulary that the 2018 corpus omits, while assembling the contrasting domains from matched-vintage, matched-platform, and matched-market sources would remove the collection-related differences that currently coexist with the competency differences between corpora. This would strengthen the causal interpretation of cross-domain separation and improve the currency of the importance factor.
Fourth, LLM augmentation quality control deserves dedicated investigation. Prior work and our own results indicate that LLM-based augmentation can introduce semantic drift and does not always improve ranking accuracy [
11]; systematic prompt engineering, post-augmentation validation, and fine-tuned domain-specific encoders are candidate remedies.
Fifth, real hiring-outcome data should be incorporated where available. The current evaluation relies on aggregated expert ratings, which are a useful but imperfect proxy; linking framework-generated rankings to actual placement outcomes, employer feedback, or longitudinal salary data would enable a more ecologically valid assessment of predictive utility.
Sixth, the importance-factor formulation could be extended into a hybrid that combines its present structural-quality definition with raw demand-frequency signals from the job-posting corpora. Such a formulation may better track shifting market demand in dynamic occupational fields, although its behaviour under corpora of differing sizes and skill-frequency distributions would need to be tested explicitly.
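One possible shape for such a hybrid is sketched below, purely as an assumption for discussion: a convex combination of the structural score and a node's relative demand frequency. The function name `hybrid_importance`, the weight `alpha`, and the normalisation by corpus size are hypothetical choices, not part of the present framework.

```python
def hybrid_importance(structural, count, corpus_size, alpha=0.7):
    """Blend structural relevance with relative demand frequency.

    structural  : structural relevance of a node, assumed to lie in [0, 1]
    count       : number of postings mentioning skills mapped to the node
    corpus_size : total number of postings in the corpus
    alpha       : hypothetical weight on the structural component
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    demand_share = count / corpus_size  # relative, not absolute, frequency
    return alpha * structural + (1.0 - alpha) * demand_share


# Same node and same absolute count, evaluated against corpora of two sizes.
small_corpus_score = hybrid_importance(0.8, count=50, corpus_size=100)
large_corpus_score = hybrid_importance(0.8, count=50, corpus_size=1000)
```

Setting `alpha=1.0` recovers the purely structural definition. The two example calls show that an identical absolute count yields different hybrid values under different corpus sizes, which is the kind of sensitivity that would need explicit testing.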
Finally, the semantic traceability map should be developed into an interactive decision-support interface, allowing academic advisers, students, and HR practitioners to inspect the course-level evidence behind each ranking, identify curriculum gaps relative to target job types, and compare competency profiles across cohorts. This would strengthen the practical, explainable value of O4CM as an application-oriented competency-mapping system.
Beyond these specific directions, three structural factors may systematically affect O4CM’s functioning and should be treated as experimental controls or design variables in future evaluations. First, the diversity of the programme mix directly shapes the range of available student profiles: a cohort drawn from closely related programmes with similar enrolment patterns will produce more similar competency representations, reducing the observable range of NTACS values and making group discrimination harder to demonstrate, regardless of framework design. Second, the GPA distribution, and the grading-quality standards that underlie it, differs across institutions; universities that grade more stringently or more generously will shift the grade-sensitive scoring component in ways that are independent of actual competency differences, so cross-institutional comparisons should account for institutional grading norms when interpreting score-level differences. Third, the design of the job-posting source (including the platform, geographic market, collection period, seniority mix, and deduplication strategy) introduces variation in the importance factor that propagates into every ranking; assembling job-posting corpora under controlled collection conditions is therefore a prerequisite for attributing framework performance to competency structure rather than to data artefacts. These three factors serve as a structural checklist for future replication and comparative studies of O4CM and related competency-mapping frameworks.
7. Conclusions
This paper introduced O4CM, a semi-automated ontology-grounded framework for multi-level competency mapping. Three findings emerge from this evaluation: (1) LLM ensemble induction with HITL verification can produce a scalable and auditable ontology backbone within the evaluated dataset—21 of 22 O*NET knowledge domains reached ≥4/5 consensus, yet HITL correction remained necessary, even for unanimous outputs, demonstrating that model agreement is not a sufficient proxy for domain correctness; (2) PCI-based multi-level structural scoring achieved bidirectional group separation between computing and arts curricula in the evaluated negative-control setting, outperforming leaf-level cosine similarity in cross-domain discrimination; and (3) the NTACS scoring framework aligns with aggregated expert judgement (ICC) while offering grade-free complementary measures, providing a possible path for institutions where academic grades are absent or unsuitable.
These findings help to address a gap that existing embedding-based methods [
11] and manually curated ontologies [
4] leave unresolved: individually traceable, job-type-conditioned candidate rankings derived from structured academic transcripts. The semantic traceability map can support explainable evidence inspection for academic advising and exploratory early-career recruitment screening, areas where transparency and accountability are increasingly mandated [
6]. Practically, O4CM is designed to be extensible across domains: replacing the O*NET seed taxonomy with a healthcare, engineering, or business knowledge taxonomy may allow for adaptation without retraining the SBERT encoder itself, but this still requires domain-specific corpus collection, LLM ensemble re-execution, HITL re-validation, and empirical revalidation of discriminative performance before deployment.
The study is limited to a single institution and three data-domain roles; the expert reference relies on aggregated ratings rather than verified hiring outcomes, and LLM augmentation can still introduce semantic drift [
11] despite HITL safeguards. Cross-institutional validation across diverse curricula and occupational families remains the most immediate priority for future research. Beyond academic validation, O4CM has potential implications for educational and recruitment platforms: universities and accreditation bodies could use the NTACS scores and semantic traceability maps as supplementary evidence for curriculum-gap analysis against the job-posting evidence captured in the target corpus, potentially supporting programme design and graduate employability planning discussions; recruitment platforms could similarly consider the job-type-conditioned ranking as an auditable, explainable screening layer that complements—rather than replaces—existing keyword-based filters, providing course-level evidence to support rather than determine candidate-screening discussions in validated use settings.