Article

Multi-Examiner: A Knowledge Graph-Driven System for Generating Comprehensive IT Questions with Higher-Order Thinking

1 College of Education, Zhejiang University of Technology, Hangzhou 310023, China
2 Faculty of Applied Science and Engineering, University of Toronto, 35 St. George Street, Toronto, ON M5S 1A4, Canada
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(10), 5719; https://doi.org/10.3390/app15105719
Submission received: 10 March 2025 / Revised: 28 April 2025 / Accepted: 17 May 2025 / Published: 20 May 2025

Abstract

Question generation systems (QGSs) for information technology (IT) education, designed to create, evaluate, and improve Multiple-Choice Questions (MCQs) using knowledge graphs (KGs) and large language models (LLMs), face three major challenges: ensuring the generation of contextually relevant and accurate distractors, enhancing the diversity of generated questions, and balancing the higher-order thinking of questions to match various learning levels. To address these challenges, we proposed a multi-agent system named Multi-Examiner, which integrates KGs, domain-specific search tools, and local knowledge bases, categorized according to Bloom’s taxonomy, to enhance the contextual relevance, diversity, and higher-order thinking of automatically generated information technology MCQs. Our methodology employed a mixed-methods approach combining system development with experimental evaluation. We first constructed a specialized architecture combining knowledge graphs with LLMs, then implemented a comparative study generating questions across six knowledge points from the K-12 Computer Science Standards. We designed a multidimensional evaluation rubric to assess semantic coherence, answer correctness, question validity, distractor relevance, question diversity, and higher-order thinking, and conducted a statistical analysis of ratings provided by 30 high school IT teachers. Results showed statistically significant improvements (p < 0.01), with Multi-Examiner outperforming GPT-4 by an average of 0.87 points (on a 5-point scale) for evaluation-level questions and 1.12 points for creation-level questions. The results demonstrated that: (i) overall, questions generated by the Multi-Examiner system outperformed those generated by GPT-4 across all dimensions and closely matched the quality of human-crafted questions in several dimensions; (ii) domain-specific search tools significantly enhanced the diversity of questions generated by Multi-Examiner; and (iii) GPT-4 generated better questions for knowledge points at the “remembering” and “understanding” levels, while Multi-Examiner significantly improved the higher-order thinking of questions for the “evaluating” and “creating” levels. This study contributes to the growing body of research on AI-supported educational assessment by demonstrating how specialized knowledge structures can enhance the automated generation of higher-order thinking questions beyond what general-purpose language models can achieve.

1. Introduction

The proliferation of information technology (IT) in educational curricula has intensified the demand for efficient and effective assessment methodologies. Multiple-Choice Questions (MCQs) have long been a cornerstone of educational evaluation, particularly crucial in IT education where systematic evaluation of both technical knowledge and problem-solving abilities is essential [1]. While MCQs offer significant advantages in terms of standardization and automated scoring, the manual creation of high-quality questions remains a resource-intensive process, requiring substantial expertise in both subject matter and assessment design [2]. This challenge is particularly acute in IT education, where the rapid evolution of technologies necessitates continuous updates to assessment materials. Traditional question banks quickly become outdated, and the creation of new, relevant questions places an unsustainable burden on educators. Consequently, there is a pressing need for automated systems that can generate pedagogically sound questions aligned with current IT curricula and capable of assessing various cognitive levels. This challenge has prompted the development of automated question generation systems (QGSs) [3].
Recent advances in artificial intelligence have led to significant progress in automated question generation. Knowledge graphs (KGs), knowledge bases (KBs), and large language models (LLMs) have emerged as powerful tools for this purpose [4]. KG-based approaches leverage structured relationships between concepts to generate contextually relevant questions, while LLMs utilize their extensive training on diverse texts to produce linguistically sophisticated questions [5]. Some systems have successfully combined these technologies, demonstrating improved performance in generating basic factual questions across various domains, including limited applications in IT education [6].
Despite these advancements, existing QGSs face significant limitations when applied to IT education. Our comprehensive analysis of current systems reveals that they predominantly generate lower-order questions focused on recall and basic understanding, failing to adequately assess the higher-order thinking skills essential for IT professionals. Recent research emphasizes that effective IT education assessment must progress through various cognitive levels to develop students’ critical thinking and problem-solving abilities [7]. This theoretical foundation underscores the importance of generating questions that not only test factual knowledge but also promote higher-order cognitive skills [8].
Current QGSs face three critical challenges that impede their widespread adoption in IT education. Foremost among these is the generation of contextually relevant and accurate distractors. These incorrect options in MCQs play a pivotal role in discriminating between varying levels of student understanding. The intricacy of IT concepts magnifies this challenge, necessitating distractors that are plausible yet distinctly incorrect within specific technological contexts. Poorly designed distractors can lead to misinterpretation of a student’s knowledge and skills, ultimately undermining the validity of the assessment [9]. Equally significant is the need to enhance the diversity of generated questions. A comprehensive assessment demands a varied question set that spans different cognitive levels and content areas. Current systems often produce homogeneous questions, thereby limiting their ability to evaluate a broad spectrum of knowledge and skills [10]. This lack of diversity not only results in incomplete assessments but also fails to engage students effectively across different learning styles and abilities.
While some researchers have attempted to incorporate educational frameworks like Bloom’s taxonomy into question generation [11], these efforts have been limited in scope and effectiveness. Most notably, they have failed to systematically integrate the full spectrum of Bloom’s cognitive levels into the generation process, particularly struggling with the higher-order categories of analysis, evaluation, and creation. Additionally, previous work has not adequately addressed the specific challenges of IT education, where technical accuracy and currency of content are as important as cognitive alignment. Our preliminary survey of 30 IT educators confirms these gaps, revealing widespread dissatisfaction with existing tools and highlighting the critical challenges mentioned above that impede their adoption.
To address the critical challenges of quality and effectiveness in automated assessment item generation within IT education, this research poses the core question:
How does the Multi-Examiner system, by integrating knowledge graphs, domain-specific search tools, and large language models, improve the quality and effectiveness of automatically generated assessment items in IT education?
This core question is examined through the following key evaluation dimensions or research sub-objectives:
  • Evaluating the ability of Multi-Examiner to generate contextually relevant and accurate distractors, thereby supporting meaningful and precise assessment in IT education.
  • Assessing the improvement in question diversity produced by Multi-Examiner across different cognitive levels of Bloom’s taxonomy to ensure comprehensive evaluation of learners’ knowledge and skills.
  • Measuring the effectiveness of Multi-Examiner’s generation of higher-order thinking questions aimed at fostering critical thinking and deeper understanding in IT education.
These dimensions guide the investigation to address the shortcomings of automated question generation systems in the context of IT education.
To answer the core question, we propose Multi-Examiner, an innovative multi-agent system that integrates KGs, domain-specific search tools, and LLMs. This integration aims to enhance the contextual relevance, diversity, and cognitive depth of automatically generated IT MCQs. Our approach builds upon recent advancements in artificial intelligence and cognitive science, systematically integrating Bloom’s taxonomy principles into each component of the system design, while maintaining alignment with established educational principles and assessment frameworks. The system design was further informed by extensive consultation with IT educators, ensuring its practical value in real educational settings.
The primary contributions of this study are fourfold. First, we develop a comprehensive KG and KB for IT education, meticulously categorized according to Bloom’s taxonomy, enhancing the system’s capacity to generate contextually rich questions while maintaining pedagogical alignment. Second, we design and implement Multi-Examiner, a novel multi-agent system that synergistically combines KGs, domain-specific search tools, and LLMs to improve the quality, diversity, and cognitive depth of generated questions. Third, we introduce a multidimensional evaluation rubric that assesses both technical and pedagogical aspects of generated questions, facilitating continuous improvement of the QGS. Fourth, we conduct extensive empirical evaluation through rigorous testing with 30 experienced IT teachers, benchmarking the system’s effectiveness against both GPT-4 and human-crafted questions.
The subsequent sections of this paper are organized as follows: Section 2 provides a comprehensive review of related work in QGSs, with a focus on approaches utilizing KGs, LLMs, and educational taxonomies. Section 3 details the methodology, including the architecture of Multi-Examiner and the evaluation process. Section 4 presents the results of our comparative study, while Section 5 discusses the implications of our findings for both educational technology and assessment practices. Finally, Section 6 concludes the paper with a summary of contributions and directions for future research.

2. Related Work

Building upon the challenges identified in automated question generation for IT education, this section examines how recent technological advances and methodological developments address these challenges. We organize our review around the three major components of our Multi-Examiner system: knowledge representation through KGs and KBs, natural language generation using LLMs and intelligent agents, and educational alignment through taxonomies. For each approach, we analyze its contributions and limitations in addressing our three key challenges: ensuring the contextual relevance and accuracy of distractors, enhancing question diversity, and incorporating higher-order thinking skills.

2.1. QGS Based on KGs and KBs

KGs and KBs provide the structured representation foundation for our Multi-Examiner system, enabling contextually relevant question generation, inspired by the work of Byun [12]. One advantage of KGs and KBs is their ability to provide strong relevance among distractors, which enhances the quality of questions generated [13]. However, a notable limitation is their insufficient capability to generate higher-order thinking questions. This section examines how these technologies have been applied to educational question generation and identifies these limitations that our approach addresses.
Structured knowledge representation in education has progressed from basic taxonomies to advanced semantic networks [14,15], giving rise to KGs and KBs, each with unique benefits for educational QGSs. KGs create flexible, dynamic networks of interconnected entities, enabling contextually relevant assessments [16]. In contrast, KBs use structured schemas for precise content retrieval, ensuring consistency in assessment accuracy [17,18].
Three development phases have marked QGS evolution: template-based systems with static KBs, graph-based representations enabling advanced question generation, and the current convergence of KGs with LLMs, significantly enhancing linguistic naturalness and factual accuracy [16,19]. KG-based QGS applications have expanded across narrative learning, achieving notable accuracy improvements [20] and enhancing MOOC content evaluations [21]. Advances in graph convolutional networks have further improved temporal reasoning tasks in QGS, achieving substantial accuracy in generating questions about sequential and causal relationships [22]. Meta-analyses show these architectures boost question generation significantly over traditional methods [23]. KB technology has similarly evolved, showing promise in educational customization. Recent KB systems achieve considerable accuracy in creating MCQs aligned with learning objectives, thanks to advancements in semantic parsing [24]. Integrating KBs with LLMs has enhanced domain-specific processing compared to conventional approaches, particularly for interdisciplinary questions [25].
However, several limitations in existing KG/KB approaches directly inform our Multi-Examiner design. Current systems struggle with synonym recognition and generating questions that engage higher-order thinking, with limited effectiveness for complex cognitive skills [2]. This limitation is particularly relevant to our third research question on supporting higher-order thinking. Additionally, redundancy in question format and cognitive demand limits question diversity, especially for advanced learning objectives. Our Multi-Examiner system addresses these limitations by integrating domain-specific KGs with specialized agents that enhance both cognitive alignment and question diversity.

2.2. QGS Based on LLMs and Intelligent Agents

Inspired by the successful application of LLMs in test question generation [26,27], the generative core of our Multi-Examiner system consists of LLMs and intelligent agents, enabling sophisticated natural language question formulation. However, they often lack the necessary domain-specific knowledge and may produce homogeneous questions, resulting in insufficient diversity and weak relevance among distractors. This section examines how these technologies have evolved and identifies these challenges as opportunities for improvement that our system addresses.
Building on knowledge representation technologies, integrating LLMs and intelligent agents has transformed question generation capabilities [28,29]. This progress has evolved through three phases: initial transformer-based architectures, specialized agent integration, and hybrid systems. LLM-based systems now achieve significant accuracy in generating contextually relevant questions, a notable improvement over traditional methods [30].
Advances in LLM architecture have enhanced contextual awareness and semantic understanding, achieving improved educational relevance compared to prior systems [31]. Improved attention mechanisms enable these models to better align questions with learning objectives [32]. Multi-agent architectures bring further improvements by decomposing question generation into specialized tasks, enhancing question diversity and alignment with learning outcomes [33,34]. Domain-specific knowledge integration has also improved precision in specialized subjects [35]. Research continues to optimize the synergy between LLMs and reasoning components, improving the generation of higher-order cognitive questions [33]. Neural reasoning integration enables enhanced causal reasoning and conceptual understanding [36].
Despite these advancements, significant challenges persist that our Multi-Examiner system aims to address. Performance variability remains across question types and difficulty levels [37]. This inconsistency directly relates to our second research question on improving question diversity. Complex reasoning tasks show decreased performance, and factual accuracy remains a challenge in specialized educational domains [38]. These limitations impact the generation of higher-order thinking questions and contextually relevant distractors—central concerns in our first and third research questions. Our Multi-Examiner system addresses these challenges through specialized agents that focus on distractor generation and cognitive alignment, working in concert with domain-specific knowledge structures.

2.3. Application of Educational Objective Taxonomies in QGS

Educational taxonomies provide the pedagogical framework for our Multi-Examiner system, ensuring that generated questions align with appropriate cognitive levels, thereby enhancing the precision of matching questions to cognitive levels and promoting teaching consistency. This section examines how taxonomies have been incorporated into QGSs and identifies opportunities for improvement that our approach addresses.
Educational taxonomies, evolving from simple classifications to advanced frameworks, play a crucial role in shaping automated question generation [11,39]. These frameworks help categorize cognitive levels, from basic recall to complex analysis, providing a foundation for creating diverse assessment tools [40].
Applying Bloom’s taxonomy in automated question generation has enhanced the cognitive calibration of assessments, achieving considerable success in matching questions to cognitive levels [41,42]. Incorporating knowledge dimensions with cognitive processes further advances question generation, improving the assessment of both factual knowledge and cognitive processing [43]. Technological advancements have enabled taxonomic alignment in automated QGSs, progressing from initial rule-based models to machine learning systems with improved alignment accuracy [44]. Neural architectures integrating taxonomies achieve balanced performance across cognitive levels [45].
While taxonomic integration has improved, significant challenges remain that directly inform our Multi-Examiner design. Using multiple taxonomic dimensions enhances question quality for complex learning goals [46], but systems struggle with generating questions for higher-order skills, with decreased accuracy for complex tasks [2]. This limitation directly relates to our third research question on supporting higher-order thinking. Additionally, maintaining consistent taxonomic alignment across subjects remains difficult [47]. Our Multi-Examiner system addresses these challenges through a specialized Cognitive Alignment Agent that systematically applies Bloom’s taxonomy across all generated questions, ensuring appropriate cognitive depth and consistency.

2.4. Theoretical Foundations for Question Generation in IT Education

Our Multi-Examiner system draws upon several theoretical frameworks that inform its design and implementation. This section introduces these theories and explains how they guide our approach to addressing the challenges in IT question generation.
Modern educational assessment theory emphasizes the alignment between learning objectives, instruction, and assessment, establishing that effective questions must reflect specific cognitive processes and knowledge dimensions [48]. This theory informs our system’s approach to generating questions that accurately target intended learning outcomes at appropriate cognitive levels.
Cognitive diagnostic theory provides frameworks for identifying specific knowledge components and skills that questions should assess, enabling more precise measurement of student understanding [49]. Our system incorporates these principles through specialized agents that analyze domain knowledge and generate questions that diagnose specific conceptual understandings.
Item generation theory addresses the systematic creation of assessment items with controlled characteristics, including difficulty, discrimination, and cognitive demand [50]. This theory guides our approach to generating diverse questions with appropriate distractors that effectively differentiate between levels of student understanding.
Formative assessment theory emphasizes the role of assessment in providing feedback that guides further learning, requiring questions that not only evaluate but also promote deeper understanding [39]. Our system applies these principles by generating questions that encourage critical thinking and provide opportunities for cognitive growth.

2.5. Exam Question Evaluation Scale

Effective evaluation frameworks are essential for assessing and improving the quality of automatically generated questions. This section examines existing evaluation approaches and explains how they inform the development of our Multi-Examiner evaluation rubric.
The development of evaluation frameworks for automated question generation has evolved significantly, reflecting advancements in educational assessment methodologies [10,39]. Early frameworks focused on linguistic accuracy, but modern frameworks integrate both technical and pedagogical metrics to better assess question quality [51].
Modern evaluation frameworks are founded on three principles: content validity, cognitive alignment, and pedagogical effectiveness, with studies showing significant improvement in assessing question quality when incorporating these dimensions [48]. Machine learning systems have enhanced the evaluation of question difficulty and discrimination power [52]. Advanced NLP techniques further improve the evaluation of linguistic clarity and semantic coherence [53].
Integrated frameworks now simultaneously assess content accuracy, cognitive engagement, and pedagogical alignment, showing considerable improvement in identifying relevant quality issues [50]. Research on cognitive complexity reveals improvements in evaluating higher-order thinking skills [49]. However, current systems struggle with accuracy consistency across domains and predicting student performance [54]. Future research should focus on refining metrics for complex cognitive assessments, particularly higher-order skills, and on hybrid frameworks combining educational metrics with AI to enhance adaptability across diverse learning contexts [55]. The integration of real-time feedback and adaptive criteria based on specific student and contextual needs offers potential for responsive and effective evaluation systems.

2.6. Synthesis and Research Gaps

This review of existing approaches reveals several critical gaps that our Multi-Examiner system addresses. By identifying these limitations, we establish the foundation for our research questions and system design.
The review of QGSs reveals an evolution from rule-based methods to sophisticated integrations of KGs, machine learning, and LLMs. While initial systems achieved moderate success, contemporary models that integrate LLMs and intelligent agents show promise for complex question generation but still face technical integration challenges. Cognitive complexity remains a barrier, with current systems maintaining good accuracy for knowledge-level questions but struggling with higher-order questions [56]. Domain-specific adaptability, particularly in technical education, also shows performance declines as complexity and specialization increase [57].
Three significant research gaps emerge from our review, directly aligning with our research questions. First, existing systems struggle to generate contextually relevant and accurate distractors, particularly for technical subjects like IT. This limitation undermines assessment validity and connects directly to our first research question. Second, current approaches produce limited question diversity across cognitive levels, with most systems focusing on lower-order questions. This gap relates to our second research question on improving question diversity. Third, there is a notable deficiency in generating higher-order thinking questions, with accuracy rates declining sharply for complex cognitive tasks. This limitation addresses our third research question on supporting critical thinking in IT education.
Our Multi-Examiner system addresses these gaps through a novel integration of KGs, specialized agents, and educational taxonomies. By combining structured domain knowledge with targeted generation capabilities and systematic cognitive alignment, our approach offers potential improvements in distractor quality, question diversity, and higher-order thinking assessment. The following sections detail our methodology for implementing and evaluating this system.

3. Methodology

3.1. Theoretical Foundations for IT Education Question Generation

This study draws upon several theoretical frameworks that inform the design and implementation of our Multi-Examiner system. Modern educational assessment theory emphasizes the alignment between learning objectives, instruction, and assessment, establishing that effective questions must reflect specific cognitive processes and knowledge dimensions. This theory informs our system’s approach to generating questions that accurately target intended learning outcomes at appropriate cognitive levels. Cognitive diagnostic theory provides frameworks for identifying specific knowledge components and skills that questions should assess, enabling more precise measurement of student understanding. Our system incorporates these principles through specialized agents that analyze domain knowledge and generate questions that diagnose specific conceptual understandings. Item generation theory addresses the systematic creation of assessment items with controlled characteristics, including difficulty, discrimination, and cognitive demand. This theory guides our approach to generating diverse questions with appropriate distractors that effectively differentiate between levels of student understanding. Formative assessment theory emphasizes the role of assessment in providing feedback that guides further learning, requiring questions that not only evaluate but also promote deeper understanding. Our system applies these principles by generating questions that encourage critical thinking and provide opportunities for cognitive growth.

3.2. Research Design

This study utilized a mixed-methods approach combining system development and experimental validation to improve the quality of automatically generated MCQs in IT education. The research framework, shown in Figure 1, consisted of three phases: system development, experimental evaluation, and data analysis, progressively addressing research questions from foundational elements to advanced cognitive objectives. The study’s research questions covered distractor quality, question diversity, and higher-order thinking skills, creating a structured pathway from micro-level elements to comprehensive cognitive assessment.
The Multi-Examiner system was designed based on Bloom’s taxonomy with a modular architecture of KGs, KBs, and a multi-agent system to ensure question quality. In the experimental phase, 30 experienced IT teachers evaluated questions generated by the Multi-Examiner system, GPT-4, and human experts. These questions covered core IT curriculum topics, aligned with Bloom’s cognitive levels. Data analysis utilized multivariate statistical methods to assess distractor relevance, question diversity, and higher-order thinking, using effect sizes and confidence intervals to establish significance.

3.3. System Development

The Multi-Examiner system employs a modular design with three core components: (1) a domain-specific knowledge graph, (2) a structured knowledge base, and (3) an agent-based question generation framework. This design is based on two principles: (1) improving question quality by integrating knowledge representation and intelligent agents, and (2) aligning questions with educational objectives through hierarchical cognitive design. The system innovatively combines knowledge engineering and AI, incorporating Bloom’s taxonomy.
These modules operate collaboratively through structured data and control flows (Figure 2). The knowledge graph provides semantic structures for the knowledge base, supporting organized retrieval. The knowledge base enables efficient access, while the Multi-Examiner system utilizes agents to interact with both modules for knowledge reasoning and expansion. This bidirectional interaction allows for rich, objective-aligned question generation and continual knowledge optimization.

3.3.1. KG Construction

As shown in Table 1, the KG construction in the Multi-Examiner system takes Bloom’s taxonomy as its theoretical foundation, achieving deep integration of educational theory and technical architecture through systematic knowledge representation structures. At the knowledge representation level, the research develops based on two dimensions of Bloom’s taxonomy: the knowledge dimension and the cognitive process dimension. The knowledge dimension is reflected through core attributes of entities, with each knowledge point entity containing detailed knowledge descriptions and cognitive type annotations. The knowledge description attribute not only provides basic definitions and application scopes of entities but also, more importantly, structures knowledge content based on Bloom’s taxonomy framework. The cognitive type attribute strictly follows Bloom’s four-level knowledge classification: factual knowledge (e.g., professional terminology, technical details), conceptual knowledge (e.g., principles, method classifications), procedural knowledge (e.g., operational procedures, problem solving), and metacognitive knowledge (e.g., learning strategies, cognitive monitoring). Specifically, the design is based on the K-12 Computer Science Standards [58].
In the cognitive process dimension, the knowledge graph implements support for different cognitive levels through relationship type design. For example, the Contains relationship primarily serves knowledge expression at the remembering and understanding levels, supporting the cultivation of basic cognitive abilities through explicit concept hierarchies. The Belongs to relationship focuses on supporting cognitive processes at the application and analysis levels, helping learners construct knowledge classification systems. The Prerequisite relationship plays an important role at the evaluation level, promoting critical thinking development by revealing knowledge dependencies. The Related relationship mainly serves the creation level, supporting innovative thinking through knowledge associations. This relationship design based on cognitive theory ensures that the knowledge graph can provide theoretical guidance and knowledge support for generating questions at different cognitive levels.
Through this systematic theoretical integration, the knowledge graph not only achieves structured knowledge representation but also, more importantly, constructs a knowledge framework supporting cognitive development. When the system needs to generate questions at specific cognitive levels, it can conduct knowledge retrieval and reasoning based on corresponding entity attributes and relationship types, thereby ensuring that generated questions both meet knowledge content requirements and accurately match target cognitive levels.
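To make this design concrete, the following minimal sketch shows how a Bloom-annotated knowledge point and one of the relation types described above could be written to a Neo4j graph through the official Python driver. The node labels and relation types mirror the ontology used in this paper (CONTAINS, BELONGS_TO, PREREQUISITE, RELATED), while the example entities, property keys, and connection settings are illustrative assumptions rather than the system’s exact schema.

```python
from neo4j import GraphDatabase

# Ontology classes and relation types mirror those described above; the concrete
# example entities, property keys, and credentials are illustrative assumptions.
NODE_LABELS = {"Knowledge", "Concept", "Procedure", "Metacognition"}
RELATION_TYPES = {"CONTAINS", "BELONGS_TO", "PREREQUISITE", "RELATED"}

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def add_knowledge_point(session, label, name, description, cognitive_level):
    """Create or update one knowledge point entity with its Bloom annotations."""
    assert label in NODE_LABELS
    session.run(
        f"MERGE (k:{label} {{name: $name}}) "
        "SET k.description = $description, k.cognitive_level = $cognitive_level",
        name=name, description=description, cognitive_level=cognitive_level,
    )

def add_relation(session, src, rel_type, dst):
    """Link two knowledge points with one of the four cognitive relation types."""
    assert rel_type in RELATION_TYPES
    session.run(
        f"MATCH (a {{name: $src}}), (b {{name: $dst}}) MERGE (a)-[:{rel_type}]->(b)",
        src=src, dst=dst,
    )

with driver.session() as session:
    add_knowledge_point(session, "Concept", "Operating System",
                        "System software that manages hardware and software resources.",
                        "understanding")
    add_knowledge_point(session, "Procedure", "Process Scheduling",
                        "How an operating system allocates CPU time among processes.",
                        "applying")
    add_relation(session, "Operating System", "CONTAINS", "Process Scheduling")
driver.close()
```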

3.3.2. KB Construction

The KB module adopts a layered processing design philosophy, establishing a systematic knowledge processing and organization architecture to provide foundational support for diverse question generation. This study innovatively designed a three-layer processing architecture encompassing knowledge formatting, knowledge vectorization, and retrieval segmentation, achieving systematic transformation from raw educational resources to structured knowledge. This layered architectural design not only enhances the efficiency and accuracy of knowledge retrieval but also provides a solid data foundation for generating differentiated questions.
At the knowledge formatting level, the system employs Optical Character Recognition (OCR) technology to convert various educational materials into standardized digital text. The knowledge formatting process is formally defined as:
$$\mathrm{Format}(K) = \mathrm{OCR}(K_{\mathrm{raw}}),$$
where $K_{\mathrm{raw}}$ represents raw educational materials, including textbooks, teaching syllabi, and the professional literature. This standardization process ensures content consistency and lays the foundation for subsequent vectorization processing. The system optimizes recognition results through multiple preprocessing mechanisms, including text segmentation, key information extraction, and format standardization, to enhance knowledge representation quality.
At the knowledge vectorization stage, a hybrid vectorization strategy is employed to transform standardized text into high-dimensional vector representations. This transformation process is formally defined as:
$$V(k) = \mathrm{vec}(k),$$
where $k$ is a snippet of formatted knowledge from $\mathrm{Format}(K)$, and $\mathrm{vec}(k)$ represents the vectorized form of $k$, utilizing models such as TF-IDF or Word2Vec, or advanced techniques like BERT embeddings. The system innovatively designs a dynamic weight adjustment mechanism, adaptively adjusting the weights of the various vectorization techniques based on different knowledge types and application scenarios, enhancing knowledge representation accuracy. This vectorization method ensures that semantically similar knowledge points maintain proximity in vector space, providing a reliable foundation for subsequent similarity calculations and association analyses.
At the retrieval segmentation level, the system systematically organizes vectorized knowledge based on predefined domain tags. The formal expression for retrieval segmentation is:
$$S(K) = \bigcup_{i=1}^{n} \mathrm{segment}(V(k_i), d_i),$$
where $S(K)$ represents the segmented KB, $k_i$ are the individual pieces of vectorized knowledge, $d_i$ denotes the domain or knowledge point tags associated with each $k_i$, and $\mathrm{segment}$ is the function that assigns each vectorized knowledge piece to its corresponding segment in the KB. The system designs a multi-level index structure, supporting rapid location and extraction of knowledge content from different dimensions while enabling knowledge recombination mechanisms to support diverse question generation requirements. This layered organization structure not only enhances knowledge retrieval efficiency but also provides flexible knowledge support for generating questions at different cognitive levels.
This study constructed a comprehensive knowledge base system around core IT curriculum knowledge. Through a systematic layered processing architecture, this knowledge base achieves efficient transformation from raw educational resources to structured knowledge. The knowledge base not only supports precise retrieval based on semantics but can also provide differentiated knowledge support according to different cognitive levels and knowledge types, laying the foundation for generating diverse and high-quality questions. This layered knowledge processing architecture significantly enhances the system’s flexibility and adaptability in question generation, better meeting assessment needs across different teaching scenarios.
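As a minimal illustration of the three-layer pipeline formalized above, the sketch below chains OCR-based formatting, vectorization, and tag-based segmentation. It assumes pytesseract for OCR and a scikit-learn TF-IDF vectorizer as one component of the hybrid strategy; the dynamic weight adjustment and multi-level index are simplified to a tag-keyed dictionary, and the file paths are illustrative.

```python
# Minimal sketch of the three-layer KB pipeline: Format(K) = OCR(K_raw),
# V(k) = vec(k), and S(K) = union of segment(V(k_i), d_i).
from collections import defaultdict
from PIL import Image
import pytesseract
from sklearn.feature_extraction.text import TfidfVectorizer

def format_knowledge(raw_pages):
    """Knowledge formatting layer: OCR scanned pages into normalized text snippets."""
    texts = [pytesseract.image_to_string(Image.open(p)) for p in raw_pages]
    return [" ".join(t.split()) for t in texts if t.strip()]  # basic cleanup

def vectorize(snippets):
    """Knowledge vectorization layer: map snippets to vectors (TF-IDF stands in here)."""
    vectorizer = TfidfVectorizer(max_features=4096)
    return vectorizer.fit_transform(snippets), vectorizer

def segment(vectors, domain_tags):
    """Retrieval segmentation layer: group vectors under their domain tags."""
    kb = defaultdict(list)
    for i, tag in enumerate(domain_tags):
        kb[tag].append(vectors[i])
    return kb

pages = ["textbook_p1.png", "textbook_p2.png"]       # K_raw (illustrative paths)
snippets = format_knowledge(pages)                    # Format(K)
vectors, _ = vectorize(snippets)                      # V(k)
tags = ["information_systems"] * len(snippets)        # d_i domain tags
knowledge_base = segment(vectors, tags)               # S(K)
```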

3.3.3. Multi-Examiner System Design

The Multi-Examiner module is grounded in modern educational assessment theory, integrating cognitive diagnostic theory, item generation theory, and formative assessment theory to construct a theory-driven intelligent agent collaborative assessment framework. This framework innovatively designs four types of intelligent agents: Identificador, Explorador, Fabricator, and Revisor, forming a complete question generation pipeline. Through four specialized intelligent agents, the system achieves precise control over the question generation process, with each agent designed based on specific educational theories, collectively constituting a comprehensive intelligent educational test item generation system (Figure 3). This theory-based multi-agent system framework design not only ensures the educational value of generated questions but also provides a new technical paradigm for educational measurement and evaluation.
The Identificador is designed based on schema theory, with its role, like that of each intelligent agent, instilled through prompts; it is responsible for deep semantic understanding and cognitive feature analysis of knowledge points. This agent implements knowledge retrieval through the function:
$$f(S_{id}) = \mathrm{Identificador}_{\mathrm{retrieve}}(k),$$
where $S_{id}$ represents the set of synonyms and related terms, $k$ is the original knowledge point input by the user, and $\mathrm{Identificador}_{\mathrm{retrieve}}$ denotes the large language model’s operation to fetch and generate synonymous and related terms using its trained capabilities on vast corpora and search engine integration. The Identificador not only identifies surface features of knowledge points but also, more importantly, analyzes cognitive structures and semantic networks based on schema theory. For example, when processing the knowledge point “Operating System”, the Identificador first constructs its cognitive schema, including core attributes (such as system software characteristics), process features (such as resource management mechanisms), and relational features (such as interactions with hardware and application software), thereby providing a complete cognitive framework for subsequent question generation.
The Explorador adopts constructivist learning theory to guide knowledge association exploration, implementing multidimensional semantic connections in the knowledge graph through the function:
$$f(S_{exp}) = \mathrm{Explorador}_{\mathrm{retrieve}}(S_{id}, KG),$$
where $S_{exp}$ represents the detailed knowledge entries, $S_{id}$ is the set of input terms from the Identificador, and $KG$ is the knowledge graph. This agent innovatively implements directed retrieval strategies based on cognitive levels, capable of selecting corresponding knowledge nodes according to the different cognitive levels of Bloom’s taxonomy. For example, when generating higher-order thinking questions, the Explorador prioritizes knowledge nodes related to advanced cognitive processes such as analysis, evaluation, and creation, establishing logical connections between these nodes to provide knowledge support for generating complex assessment tasks.
The Fabricator integrates cognitive load theory and question type design theory, implementing dynamic question generation through the function:
$$Q(k, t) = \mathrm{Fabricator}_t(\mathrm{LLM}, P_t, k),$$
where $Q(k, t)$ represents the generated question, $t$ is the type of knowledge (factual, conceptual, procedural, or metacognitive), $k$ is the knowledge point, and $P_t$ is the prompt tailored to type $t$. The Fabricator’s innovation is reflected in its ability to dynamically adjust question complexity according to learning objectives: it adopts specific generation strategies ($P_t$) for different cognitive objectives ($t$), ensuring assessment validity while controlling the cognitive load imposed by each question. For example, when generating conceptual understanding questions, the system controls the amount of information and the complexity of the problem context to maintain an optimal cognitive load level.
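The following sketch illustrates one way the type-specific strategies $P_t$ could be realized as prompt templates keyed by knowledge type and passed to an LLM; the template wording and the generic `llm` callable are assumptions for illustration, not the system’s actual prompts.

```python
# Sketch of type-specific generation strategies P_t: one prompt template per
# knowledge type t, passed to an LLM callable. Template wording and the `llm`
# interface are illustrative assumptions.
PROMPT_TEMPLATES = {
    "factual": (
        "Write a multiple-choice question that checks recall of the term or fact "
        "'{k}'. Provide one correct answer and three plausible distractors."
    ),
    "conceptual": (
        "Write a multiple-choice question that asks the learner to explain or "
        "classify the concept '{k}', keeping the scenario short to limit cognitive load."
    ),
    "procedural": (
        "Write a multiple-choice question that presents a realistic task and asks "
        "which step or procedure involving '{k}' is appropriate."
    ),
    "metacognitive": (
        "Write a multiple-choice question that asks the learner to evaluate or plan "
        "a learning or problem-solving strategy related to '{k}'."
    ),
}

def fabricator(llm, knowledge_point: str, knowledge_type: str) -> str:
    """Q(k, t) = Fabricator_t(LLM, P_t, k): fill the template for type t and query the LLM."""
    prompt = PROMPT_TEMPLATES[knowledge_type].format(k=knowledge_point)
    return llm(prompt)  # `llm` is any callable wrapping a chat/completion API
```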
The Revisor constructs a systematic quality control mechanism based on educational measurement theory, ensuring question quality through multidimensional evaluation criteria. The function $\mathrm{CheckValidity}(q, O, KG)$, where $q$ is the generated question, $O$ the target educational objectives, and $KG$ the knowledge graph, not only verifies technical correctness but also, more importantly, evaluates the consistency between questions and educational objectives. When quality issues are detected, the system generates specific modification suggestions through the function $\mathrm{GenerateFeedback}(q, O)$ and triggers an optimization process. This closed-loop quality control mechanism ensures that the system can continuously produce high-quality assessment questions.
At the agent collaboration level, the system employs an event-driven mechanism grounded in educational assessment theory, forming a complete question generation and evaluation chain. This design is inspired by Cognitive Development and Adaptive Assessment theories, ensuring continuity and adaptability throughout the process. The Identificador assesses cognitive features, triggering the Explorador to construct knowledge networks, which the Fabricator uses to dynamically adjust question strategies.
Finally, the Revisor provides the final evaluation of the questions, ensuring that the assessment is not only technically accurate but also aligned with educational goals. This integrated approach to question development and validation promotes the creation of effective assessment tools that meet diverse learning needs.
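A compact, self-contained sketch of this generation-and-revision chain is given below. The agent bodies are reduced to single LLM calls and a dictionary stands in for the knowledge graph; only the agent names and the CheckValidity/GenerateFeedback roles come from the descriptions above, and everything else is assumed for illustration.

```python
# Sketch of the Identificador -> Explorador -> Fabricator -> Revisor chain with a
# bounded revision loop. All agent bodies are stubs built on a generic `llm` callable.
from typing import Callable, Dict, List

def identificador(llm: Callable[[str], str], k: str) -> List[str]:
    """S_id: synonyms and related terms for the knowledge point k."""
    return [t.strip() for t in llm(f"List synonyms and related terms for '{k}'.").split(",")]

def explorador(s_id: List[str], kg: Dict[str, List[str]]) -> List[str]:
    """S_exp: knowledge entries connected to the retrieved terms (dict stands in for the KG)."""
    return [entry for term in s_id for entry in kg.get(term, [])]

def fabricator(llm: Callable[[str], str], k: str, context: List[str]) -> str:
    """Q(k, t): draft an MCQ for k, conditioned on the retrieved context."""
    return llm(f"Write a multiple-choice question on '{k}'. Context: {context}")

def revisor(llm: Callable[[str], str], question: str, objectives: str) -> str:
    """CheckValidity and GenerateFeedback collapsed into one LLM critique call."""
    return llm(f"Check this question against the objectives '{objectives}' and reply "
               f"'OK' or give one concrete fix:\n{question}")

def generate_question(llm, kg, k, objectives, max_revisions: int = 3) -> str:
    s_id = identificador(llm, k)
    s_exp = explorador(s_id, kg)
    question = fabricator(llm, k, s_exp)
    for _ in range(max_revisions):
        verdict = revisor(llm, question, objectives)
        if verdict.strip().upper().startswith("OK"):
            return question
        question = fabricator(llm, f"{k} (revise: {verdict})", s_exp)
    return question  # best effort once the revision budget is spent
```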
To enhance scalability, the Multi-Examiner system employs a microservice modular design, enabling each agent to function independently through standardized APIs. The system’s innovation spans three theoretical levels: (1) systematic application of educational theories, (2) precise cognitive mapping in question design, and (3) formative assessment implementation. This framework integrates educational integrity with AI-driven automation, advancing adaptability in question generation.
The design prioritizes educational purpose alongside technical innovation, ensuring generated questions serve meaningful educational needs. The modular architecture allows for continuous adaptation, supporting the integration of new theories and technologies to maintain relevance in educational technology.

3.4. Experimental Design

To ensure a rigorous and systematic evaluation of the effectiveness of the Multi-Examiner system in automatically generating high-quality MCQs for IT education, we designed a comprehensive experimental protocol that includes three key components: participant selection, experimental materials and procedures, and evaluation metrics. The study used an expert evaluation methodology for a comparative analysis of the quality of the questions generated by Multi-Examiner relative to alternative generation methods. If the questions produced by the Multi-Examiner system receive scores that differ significantly from those generated by GPT-4 while closely aligning with scores from human-generated questions, this would demonstrate the effectiveness of the system. Prior to the main experiment, we conducted ten semi-structured interviews with experienced IT educators to validate and refine the initial assessment framework, significantly informing the final experimental design.
As shown in Figure 4, the Multi-Examiner system features a four-layer architectural design that integrates the knowledge graph and question generation modules. The User Interface Layer serves as the entry point for processing question generation requests through the Application Service Layer. The Core Service Layer contains the system’s essential intelligent components: the Knowledge Graph Service, which provides Ontology Model Management, the Knowledge Inference Engine, and Cypher Query Optimization; the Knowledge Base Service, which manages Knowledge Formatting, Knowledge Vectorization, and Hybrid Search Index construction; and the Agent Framework, based on the Langchain framework, which comprises Identificador, Explorador, Fabricator, and Revisor, collaborating in knowledge identification, exploration, construction, and validation. The Infrastructure Layer includes the Neo4j Database and Vector Storage, forming a complete knowledge-driven question generation system through standardized component communication and configuration management. Furthermore, the system creates query tools for both the Identificador and Explorador, employs prompt engineering strategies for agents to perform their respective roles, and implements an event bus based on the publish–subscribe pattern for asynchronous communication.
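The publish–subscribe event bus mentioned above can be pictured with the minimal sketch below, in which each agent service subscribes to the topic published by its upstream agent; the topic names and payload shapes are assumptions, and a production implementation would dispatch asynchronously (e.g., via a message queue) rather than synchronously as here.

```python
# Minimal publish-subscribe event bus illustrating the asynchronous agent
# communication described above. Topic names and payload shapes are assumptions;
# dispatch here is synchronous for brevity.
from collections import defaultdict
from typing import Any, Callable, Dict, List

class EventBus:
    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[Any], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[Any], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, payload: Any) -> None:
        for handler in self._subscribers[topic]:
            handler(payload)

bus = EventBus()
# Each agent service subscribes to the event that its upstream agent publishes.
bus.subscribe("knowledge.identified", lambda p: print("Explorador receives:", p))
bus.subscribe("questions.drafted", lambda p: print("Revisor receives:", p))
bus.publish("knowledge.identified", {"knowledge_point": "Operating System"})
```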
The system architecture employs the Neo4j graph database (version 4.3.11) as the core platform for storing and managing the knowledge graph, facilitating efficient knowledge retrieval, and reasoning via the Cypher query language. The database is deployed within Docker containers to ensure system stability and portability.
To enhance retrieval efficiency, we designed a set of optimized Cypher query templates, categorized into two main types: (1) Knowledge Point Queries, which extract detailed conceptual information and knowledge types of specific knowledge points, serving as foundational data for question generation; and (2) Knowledge Path Queries, which explore relationships among related knowledge points to generate contextually relevant distractors. Integration with the system is facilitated through Python’s Neo4j driver, providing efficient and stable access to knowledge.
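Illustrative versions of these two template categories are sketched below as Cypher strings executed through the Python Neo4j driver; the labels, property names, and limits are assumptions consistent with the schema sketch in Section 3.3.1 rather than the system’s exact templates.

```python
# Illustrative versions of the two query template categories, run through the
# Python Neo4j driver. Labels and property names are assumptions.
from neo4j import GraphDatabase

# (1) Knowledge Point Query: description and knowledge type of one knowledge point.
KNOWLEDGE_POINT_QUERY = """
MATCH (k {name: $name})
RETURN k.name AS name, k.description AS description, labels(k) AS knowledge_type
"""

# (2) Knowledge Path Query: neighbours within two hops, usable as distractor material.
KNOWLEDGE_PATH_QUERY = """
MATCH (k {name: $name})-[r*1..2]-(related)
RETURN DISTINCT related.name AS candidate, [rel IN r | type(rel)] AS relation_path
LIMIT $limit
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    point = session.run(KNOWLEDGE_POINT_QUERY, name="Operating System").single()
    candidates = session.run(KNOWLEDGE_PATH_QUERY, name="Operating System", limit=10).data()
    if point:
        print(point["name"], point["knowledge_type"])
    print([c["candidate"] for c in candidates])
driver.close()
```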
The knowledge graph consists of three hierarchical layers:
  • Ontology Layer: This layer, designed based on the characteristics of IT education and Bloom’s taxonomy, defines four core ontology classes: Knowledge (basic knowledge points), Concept (conceptual knowledge), Procedure (procedural knowledge), and Metacognition (metacognitive knowledge). It includes semantic relation types that support different cognitive levels, such as CONTAINS (memory and understanding), BELONGS_TO (application and analysis), PREREQUISITE (evaluation), and RELATED (creation). These relations empower the knowledge graph to facilitate multi-level cognitive question generation aligned with Bloom’s taxonomy.
  • Entity-Relation Layer: This core content layer includes specific IT knowledge entities and their interrelations, primarily extracted from high school IT textbooks and curriculum guidelines through a semi-automated process. Using large language models (LLMs), specifically GPT-4, we automatically extracted knowledge points and relations from textbook texts via carefully designed prompt templates. The initial knowledge network was refined by IT education experts to ensure conceptual accuracy, rational relation types, and structural coherence.
  • Attribute Annotation Layer: This layer annotates educational attributes and metadata related to knowledge points and relations, bridging content with educational objectives. Annotated properties include knowledge type, cognitive level, and content descriptions, enabling targeted question generation according to specific learning goals.
This layered knowledge graph design, integrated with domain-specific retrieval and LLM-powered knowledge extraction, forms the technical backbone of the Multi-Examiner system, supporting its ability to generate diverse, high-quality, and cognitively appropriate MCQs for IT education.

3.4.1. Participants

Purposive sampling was used to form an expert evaluation team for assessing the quality of automatically generated MCQs. Statistical power analysis (α = 0.05, power = 0.80, partial η² = 0.06) determined a minimum sample size of 28, leading to the recruitment of 30 experts to account for potential attrition. Selection focused on professional background, teaching experience, and technological literacy, with all experts having at least five years of high school IT teaching experience, training in Bloom’s taxonomy, and educational technology experience. The panel consisted of 18 females (60%) and 12 males (40%), averaging 8.3 years of teaching experience (SD = 2.7); 73% had experience with AI-assisted tools, providing diverse perspectives.
A standardized two-day training ensured evaluation reliability, combining theory with practical application of evaluation criteria. Pre-assessment on 10 test questions showed high inter-rater consistency (Krippendorff’s α = 0.83). For significant scoring discrepancies, consensus was achieved through discussion. Systematic validity and reliability testing confirmed scoring stability, with test–retest reliability after two weeks achieving a correlation coefficient of 0.85, indicating strong scoring consistency among experts.

3.4.2. Experimental Materials and Procedures

This study used a systematic experimental design to ensure rigor, involving MCQs from three sources: Multi-Examiner, GPT-4, and human-created questions. These questions were generated using identical knowledge points and assessment criteria for comparability. Six core knowledge points from the high school IT curriculum unit “Information Systems and Society” were selected and reviewed by three senior IT education experts; they covered the four knowledge types defined by Bloom’s taxonomy (factual, conceptual, procedural, and metacognitive), resulting in 72 questions. Chi-square testing confirmed a balanced distribution across sources (χ² = 1.86, p > 0.05).
Prior to the main experiment, we conducted 10 semi-structured interviews with IT teachers between September and October 2023. These interviews served to validate our initial assessment framework and ensure its alignment with actual classroom practices. Participants were selected based on their minimum of five years of IT teaching experience and familiarity with automated assessment tools. The interviews explored teachers’ experiences with MCQs, their criteria for evaluating question quality, and their perspectives on automated question generation. Thematic analysis of the interview data revealed three key areas of concern: distractor quality, cognitive level alignment, and practical relevance. These findings directly informed our final evaluation framework and the selection of assessment dimensions.
A triple-blind review design kept evaluators unaware of question sources, with uniform formatting and a Latin square arrangement to minimize sequence and fatigue effects (ANOVA: F = 1.24, p > 0.05). Standardized evaluation processes were used, with questions independently scored by two experts and a third reviewer in cases of large scoring discrepancies (≥2 points). No significant differences were found among groups in text length (F = 0.78, p > 0.05) or language complexity (F = 0.92, p > 0.05). Semi-structured interviews (n = 10) indicated high alignment with real teaching practices (mean = 4.2/5), and qualitative data coding achieved high inter-coder reliability (Cohen’s κ = 0.85).

3.4.3. Measures and Instruments

This study constructed an evaluation framework based on Bloom’s taxonomy, focusing on three dimensions: distractor relevance, question diversity, and higher-order thinking. We employed a questionnaire to gather insights regarding the quality of the automatically generated MCQs, allowing experts to evaluate all questions from Multi-Examiner, GPT-4, and humans.
Distractor relevance evaluated conceptual relevance, logical rationality, and clarity, each rated on a five-point scale. Question diversity assessed cognitive level coverage, domain distribution, and form variation to ensure a balanced assessment across Bloom’s levels. Higher-order thinking measured cognitive depth, challenge level, and application authenticity, with criteria verified through expert consultation and pilot testing.
To ensure rigor, the framework’s content validity achieved a Content Validity Ratio of 0.78, while construct validity was confirmed through factor analysis (χ²/df = 2.34, CFI = 0.92, RMSEA = 0.076). Reliability testing included inter-rater reliability (Krippendorff’s α > 0.83), test–retest reliability (r = 0.85), and internal consistency (Cronbach’s α > 0.83) across dimensions, all demonstrating high reliability.
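For reference, the internal-consistency check can be reproduced with a few lines of NumPy, as sketched below on a synthetic raters-by-criteria matrix; the data shown are illustrative only.

```python
# Small sketch of the internal-consistency check (Cronbach's alpha) on a
# raters-by-criteria score matrix; the example scores are synthetic.
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """scores: shape (n_respondents, n_items), e.g. expert ratings per sub-criterion."""
    n_items = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)
    total_variance = scores.sum(axis=1).var(ddof=1)
    return (n_items / (n_items - 1)) * (1 - item_variances.sum() / total_variance)

ratings = np.array([[4, 5, 4], [3, 4, 4], [5, 5, 5], [4, 4, 3]])  # 4 experts x 3 sub-criteria
print(round(cronbach_alpha(ratings), 3))
```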

3.4.4. Data Analysis

This study applied a systematic data analysis framework for three research questions. Descriptive statistics provided an overview of data, followed by inferential analyses tailored to each question, with effect sizes calculated for reliability. Data preprocessing included cleaning, normality checks (Shapiro–Wilk test), and variance homogeneity tests (Levene’s test). Missing values were addressed through multiple imputation.
For distractor relevance, two-way ANOVA assessed the effects of generation methods and knowledge types, with Tukey HSD post hoc tests for significant interactions. For question diversity, MANOVA was employed, with follow-up ANOVAs and Pearson correlations between dimensions. For higher-order thinking, mixed-design ANOVA examined cognitive level differences, using Games–Howell post hoc tests for robustness. Effect sizes (partial η², Cohen’s d, and r) were reported with confidence intervals, focusing on practical significance.
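The core of this pipeline for the distractor-relevance question can be sketched with statsmodels as follows; the long-format column names and input file are assumptions, and the partial η² values are derived from the Type II sums of squares.

```python
# Sketch of the distractor-relevance analysis: two-way ANOVA with interaction,
# partial eta squared per effect, and Tukey HSD on the generation method.
# The dataframe columns and file name are assumptions.
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.multicomp import pairwise_tukeyhsd

df = pd.read_csv("distractor_ratings.csv")  # columns: relevance, method, knowledge_type

model = ols("relevance ~ C(method) * C(knowledge_type)", data=df).fit()
anova_table = anova_lm(model, typ=2)

# Partial eta squared for each effect: SS_effect / (SS_effect + SS_residual).
ss_resid = anova_table.loc["Residual", "sum_sq"]
effects = anova_table.drop(index="Residual").copy()
effects["partial_eta_sq"] = effects["sum_sq"] / (effects["sum_sq"] + ss_resid)
print(effects)

# Post hoc comparison of the three generation methods.
tukey = pairwise_tukeyhsd(endog=df["relevance"], groups=df["method"], alpha=0.05)
print(tukey.summary())
```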

3.5. Ethical Considerations

This study obtained approval from the Institutional Review Board (IRB-2024-ED-0127). All participating teachers were informed of the research purpose, procedures, and data usage, and provided written informed consent. Research data were anonymized, with all personally identifiable information removed from research reports. Data collection and storage followed strict confidentiality protocols, with access restricted to core research personnel. Participants retained the right to withdraw from the study at any time without any negative consequences.

4. Results

In this section, we present our research findings aimed at enhancing the generation of MCQs in IT education using the Multi-Examiner system. Our analysis includes descriptive statistics, variance analyses (ANOVA and MANOVA), and post-hoc tests to compare the performance of the Multi-Examiner system against GPT-4 and human-generated questions. The results highlight the system’s effectiveness in generating contextually relevant distractors, enhancing question diversity, and producing high-quality questions that assess higher-order thinking skills. The detailed findings are organized into subsections corresponding to each research question.

4.1. Analysis of the Contextual Relevance of Distractors

To evaluate the contextual relevance of distractors, we conducted an in-depth analysis of the distractors generated by the three methods (Multi-Examiner, GPT-4, and human experts) across four types of knowledge: factual, conceptual, procedural, and metacognitive.

4.1.1. Descriptive Statistical Analysis

Table 2 presents the descriptive statistics of distractor relevance scores for each group. The distractor relevance score represents a composite average of three dimensions: conceptual relevance, logical rationality, and clarity, each rated on a five-point scale. The “Mean” column represents the average scores across these three dimensions. From these data, we observe the following trends: (1) Multi-Examiner achieved higher average scores than both GPT-4 and human-generated methods across most knowledge types. (2) Multi-Examiner performed exceptionally well in the relevance of distractors for factual and metacognitive knowledge. (3) The scores for distractors generated by Multi-Examiner and human methods were relatively close across all knowledge types, while GPT-4’s scores were comparatively lower.
Table 3 provides a more detailed breakdown of the three sub-criteria that comprise the distractor relevance score across generation methods, allowing for a more granular understanding of each method’s strengths and weaknesses.

4.1.2. Two-Way Analysis of Variance (ANOVA)

To further analyze the contextual relevance of distractors, we conducted a two-way Analysis of Variance (ANOVA). Before performing the analysis, we checked the assumptions of the ANOVA, including normality (using the Shapiro–Wilk test) and homogeneity of variances (using Levene’s test). The results indicated that the data generally met these assumptions (p > 0.05).
Table 4 presents the results of the ANOVA, where the dependent variable is the distractor relevance score, and the independent variables are the generation method and knowledge type. The analysis revealed: (1) A significant main effect of the generation method: F(2, 348) = 19.85, p < 0.001, partial η² = 0.08. According to Cohen (1988), this is considered a medium effect size, indicating substantial differences in distractor relevance across generation methods. (2) A significant main effect of knowledge type: F(3, 348) = 4.34, p = 0.005, partial η² = 0.03. This is a small effect size, suggesting that the type of knowledge significantly affects the relevance scores of the distractors. (3) A significant interaction between generation method and knowledge type: F(6, 348) = 5.79, p < 0.001, partial η² = 0.07. This medium effect size indicates that the combination of generation method and knowledge type significantly influences distractor relevance scores, showing distinct performance patterns.
To further explore the differences between groups, we conducted Tukey’s HSD post-hoc test. Considering the potential inflation of Type I error due to multiple comparisons, we applied a Bonferroni correction to adjust the p-values. Table 5 presents the adjusted results. The post-hoc test results indicate that: (1) the distractor relevance of questions generated by Multi-Examiner is significantly higher than that of GPT-4 (p < 0.001), but there is no significant difference compared to human-generated questions (p = 1.000); (2) among the knowledge types, the score for factual knowledge is significantly lower than that for conceptual knowledge but does not differ significantly from procedural or metacognitive knowledge.
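The following sketch illustrates how the two-way ANOVA and the Tukey HSD comparisons could be run with statsmodels under the same assumed data layout; it is a sketch of the analysis design, not the authors' code.

```python
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Two-way ANOVA: relevance ~ generation method x knowledge type
model = smf.ols("relevance ~ C(method) * C(knowledge_type)", data=ratings).fit()
print(anova_lm(model, typ=2).round(3))

# Tukey HSD pairwise comparisons of generation methods ...
print(pairwise_tukeyhsd(ratings["relevance"], ratings["method"]))
# ... and of knowledge types
print(pairwise_tukeyhsd(ratings["relevance"], ratings["knowledge_type"]))
```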

4.1.3. Performance Differences by Generation Method Across Different Knowledge Types

Figure 5 visually displays the performance differences among the generation methods across different knowledge types. From Figure 5, we can observe the following: (1) The Multi-Examiner has a more concentrated score distribution across all knowledge types, with median scores generally higher than the other two methods, especially pronounced in factual and conceptual knowledge. (2) GPT-4 shows a more dispersed score distribution, particularly in factual and conceptual knowledge, where its performance is relatively poorer. (3) The score distribution for human-generated distractors is close to that of the Multi-Examiner, performing well especially in procedural and metacognitive knowledge.
Figure 6 further illustrates the relationship between generation methods and knowledge types. From Figure 6, we can make the following observations: (1) The Multi-Examiner outperforms both GPT-4 and human-generated methods across most knowledge types, with a particularly strong advantage in factual knowledge. (2) GPT-4 generally performs poorly across all knowledge types, especially in procedural knowledge. (3) The performance differences among the three generation methods are relatively small in metacognitive knowledge, suggesting that the impact of the generation method might be less significant for these higher-order cognitive tasks.

4.2. Analysis of Enhancing Question Diversity

To increase the diversity of questions, we evaluated the diversity of question sets generated by three methods: Multi-Examiner, GPT-4, and human-generated. Thirty high school IT teachers, serving as expert evaluators, rated the question sets across three dimensions: diversity, challenge, and higher-order thinking.

4.2.1. Descriptive Statistical Analysis

Table 6 presents the descriptive statistics for the three methods across the diversity dimension. From Table 6, we observe the following trends: (1) The Multi-Examiner achieved a significantly higher average score in the diversity dimension compared to GPT-4, and its score is very close to that of the human-generated method. (2) GPT-4 scored lower in diversity than the other two methods. (3) Human-generated question sets scored slightly higher in diversity than the Multi-Examiner, although the difference is minimal.

4.2.2. Multivariate Analysis of Variance (MANOVA)

To analyze in depth the impact of generation methods on the diversity, challenge, and higher-order thinking of the question sets, we conducted a Multivariate Analysis of Variance (MANOVA). Before performing the analysis, we checked the assumptions of MANOVA, including multivariate normality (using Mardia’s test) and homogeneity of covariance matrices (using Box’s M test). The results indicated that the data generally met these assumptions (p > 0.05).
Table 7 presents the results of the MANOVA, with the independent variable being the generation method and the dependent variables being the scores for diversity, challenge, and higher-order thinking. The MANOVA results showed that the generation method has a significant overall effect on the diversity, challenge, and higher-order thinking of the question sets (Wilks’ λ = 0.523, F(6, 170) = 11.258, p < 0.001, partial η 2 = 0.284). According to Cohen (1988), this represents a large effect size, indicating that the generation method has a substantial impact on the overall quality of the question sets.
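A minimal statsmodels sketch of such a one-way MANOVA is given below, assuming a hypothetical DataFrame of per-question-set ratings with illustrative column names.

```python
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

sets = pd.read_csv("question_set_ratings.csv")  # hypothetical file
# assumed columns: method, diversity, challenge, higher_order
# (one row per rated question set)

maov = MANOVA.from_formula("diversity + challenge + higher_order ~ method", data=sets)
print(maov.mv_test())  # reports Wilks' lambda and related multivariate statistics
```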

4.2.3. Univariate Analysis of Variance (ANOVA) Follow-Up Tests

To further explore the specific differences in generation methods across various dimensions, we conducted separate univariate ANOVAs for each dependent variable and applied Bonferroni corrections to control the Type I error rate. Table 8 presents the results of these ANOVAs. The results indicate that the generation method significantly affects all three dimensions: diversity (F(2, 87) = 13.002, p < 0.001, partial η 2 = 0.309), challenge (F(2, 87) = 12.530, p < 0.001, partial η 2 = 0.301), and higher-order thinking (F(2, 87) = 17.724, p < 0.001, partial η 2 = 0.379).
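The sketch below shows one way the follow-up univariate ANOVAs and the Bonferroni adjustment could be implemented, again using the hypothetical question-set DataFrame from the previous sketch; all names are illustrative.

```python
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.multitest import multipletests

dims = ["diversity", "challenge", "higher_order"]
raw_p = []
for dim in dims:
    # One-way ANOVA of generation method for each dependent variable
    aov = anova_lm(smf.ols(f"{dim} ~ C(method)", data=sets).fit(), typ=2)
    print(dim, "F =", round(aov.loc["C(method)", "F"], 3))
    raw_p.append(aov.loc["C(method)", "PR(>F)"])

# Bonferroni correction across the three follow-up tests
reject, p_adj, _, _ = multipletests(raw_p, alpha=0.05, method="bonferroni")
for dim, p, sig in zip(dims, p_adj, reject):
    print(f"{dim}: adjusted p = {p:.4f}, significant = {sig}")
```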
We further conducted Tukey’s HSD post-hoc tests, the results of which are shown in Table 9. The post-hoc test results indicate: (1) Multi-Examiner significantly outperforms GPT-4 across all dimensions (p < 0.001). (2) There are no significant differences between Multi-Examiner and human-generated question sets across all dimensions (p > 0.05). (3) Human-generated question sets significantly outperform GPT-4 across all dimensions (p < 0.001).

4.2.4. Performance Differences by Generation Method Across Evaluation Dimensions

To provide a more visual representation of the performance differences across the evaluation dimensions, we used parallel coordinate plots. Figure 7 displays the performance differences among the three generation methods (Multi-Examiner, GPT-4, and human-generated) across three dimensions: diversity, challenge, and higher-order thinking (labeled 0, 1, and 2 on the horizontal axis). This visualization helps us comprehensively compare the strengths and weaknesses of each generation method; a minimal plotting sketch follows.
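The sketch below shows how such a parallel coordinate plot can be drawn with pandas and matplotlib; it assumes the hypothetical question-set DataFrame used in the earlier sketches, not the authors' plotting code.

```python
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

# One line per question set, coloured by generation method
ax = parallel_coordinates(sets[["method", "diversity", "challenge", "higher_order"]],
                          class_column="method", alpha=0.4)
ax.set_ylabel("Expert rating (1-5)")
plt.tight_layout()
plt.savefig("figure7_parallel_coordinates.png", dpi=300)
```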
From Figure 7, we can observe the following: (1) The performance of the Multi-Examiner and human-generated methods are very close across the dimensions of diversity, challenge, and higher-order thinking, with slight differences, but both generally maintain a high level. (2) GPT-4 shows a clear disadvantage in all dimensions, especially in terms of challenge and higher-order thinking, where its performance is relatively weaker.

4.3. Analysis of the Effectiveness of the Assessment System in Generating Higher-Order Thinking Questions

We conducted a thorough analysis of higher-order thinking questions generated by the Multi-Examiner system, GPT-4, and human methods.

4.3.1. Descriptive Statistical Analysis

Table 10 presents the descriptive statistics for questions generated by the three methods across the six cognitive levels of Bloom’s taxonomy. Scoring was done using a 1–5 Likert scale, where 1 represents “very poor” and 5 represents “excellent.” Each generation method had 30 samples (N = 30).
From Table 10, we can observe the following preliminary trends: (1) All generation methods generally perform better on lower-order thinking skills (Memory, Understanding) than on higher-order thinking skills (Analysis, Evaluation, Creation). (2) Multi-Examiner scores higher on average in higher-order thinking skills (especially in evaluation and creation levels) compared to GPT-4, but slightly lower than human-generated questions. (3) GPT-4’s performance on higher-order thinking skills is notably lower than the other two methods, particularly in the evaluation and creation levels. (4) Human-generated questions show the most stable performance across all cognitive levels, with relatively small standard deviations. (5) At the creation level, Multi-Examiner (M = 3.73, SD = 0.91) significantly outperforms GPT-4 (M = 3.00, SD = 1.08), and is close to the human-generated level (M = 3.80, SD = 0.89). (6) As the cognitive levels increase, the differences in scores among the three generation methods grow, especially in the evaluation and creation levels.
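The descriptive summary in Table 10 (and the compact mean ± SD layout used later in Table 13) can be produced with a simple groupby, as in the sketch below; the file and column names are assumed for illustration.

```python
import pandas as pd

quality = pd.read_csv("question_quality_ratings.csv")  # hypothetical file
# assumed columns: method, cognitive_level, score (one 1-5 rating per row)

# Table 10-style summary: mean, SD, and N per method x cognitive level
table10 = (quality.groupby(["method", "cognitive_level"])["score"]
                  .agg(mean="mean", sd="std", n="count")
                  .round(2))
print(table10)

# Compact mean ± SD view in the style of Table 13
cell = quality.groupby(["method", "cognitive_level"])["score"].agg(["mean", "std"]).round(2)
print((cell["mean"].astype(str) + " ± " + cell["std"].astype(str)).unstack())
```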

4.3.2. Two-Way Analysis of Variance (ANOVA)

To deeply analyze the impact of generation methods and cognitive levels on question quality, we conducted a two-way Analysis of Variance (ANOVA). Before performing the analysis, we verified the prerequisites for ANOVA, including normality (using the Shapiro-Wilk test) and homogeneity of variances (using Levene’s test). The results showed that the data generally met these assumptions (p > 0.05).
Table 11 presents the results of the ANOVA, where the dependent variable is the question quality score, and the independent variables are the generation method and cognitive level. The analysis revealed: (1) Significant main effect of generation method: F(2, 530) = 13.76, p < 0.001, partial η 2 = 0.05. According to Cohen (1988), this is considered a medium effect size, indicating significant differences in question quality across different generation methods. (2) Significant main effect of cognitive level: F(5, 530) = 26.37, p < 0.001, partial η 2 = 0.20. This represents a large effect size, suggesting that cognitive level has a significant impact on question quality. (3) Significant interaction effect between generation method and cognitive level: F(10, 530) = 2.27, p = 0.013, partial η 2 = 0.04. Although the effect size is small, it indicates that the combination of generation method and cognitive level has a noticeable impact on question quality.
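The partial eta squared values interpreted above are not returned directly by a standard ANOVA table but can be derived from it as SS_effect / (SS_effect + SS_error). The sketch below illustrates this under the same assumed data layout as the previous sketch.

```python
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Two-way ANOVA on question quality: generation method x cognitive level
model = smf.ols("score ~ C(method) * C(cognitive_level)", data=quality).fit()
aov = anova_lm(model, typ=2)

# Partial eta squared per effect: SS_effect / (SS_effect + SS_error)
ss_error = aov.loc["Residual", "sum_sq"]
aov["partial_eta_sq"] = aov["sum_sq"] / (aov["sum_sq"] + ss_error)
print(aov.round(3))
```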

4.3.3. Post-Hoc Test Analysis

To further explore differences between groups, we conducted Tukey’s HSD post-hoc tests. Considering the potential for Type I errors due to multiple comparisons, we applied Bonferroni corrections to adjust the p-values.
As shown in Table 12, the post-hoc test results reveal: (1) The quality of questions generated by Multi-Examiner is significantly better than those generated by GPT-4 (p < 0.001), but there is no significant difference compared to human-generated questions (p = 0.267). (2) Human-generated questions are significantly better in quality than those generated by GPT-4 (p < 0.001). (3) There are no significant differences among the higher-order thinking levels (analysis, evaluation, creation), indicating that the difficulty level of questions across these cognitive levels is similar.

4.3.4. Differences in Performance Across Cognitive Levels by Generation Method

To visually demonstrate the performance differences across cognitive levels for different generation methods, we created an interaction effect plot (Figure 8).
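A plot of this kind can be produced with statsmodels' interaction_plot helper, as in the hedged sketch below; it continues the hypothetical quality DataFrame and assumed column names from the earlier sketches.

```python
from statsmodels.graphics.factorplots import interaction_plot

# Mean quality score per cognitive level, one line per generation method
# (note: string levels are ordered alphabetically unless re-coded beforehand)
fig = interaction_plot(x=quality["cognitive_level"],
                       trace=quality["method"],
                       response=quality["score"],
                       xlabel="Cognitive level",
                       ylabel="Mean quality score (1-5)")
fig.savefig("figure8_interaction_effect.png", dpi=300)
```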
From Figure 8, we can observe the following trends: (1) Multi-Examiner significantly outperforms GPT-4 in higher-order thinking skills (particularly in the evaluation and creation levels) and is close to the level of human-generated questions. (2) GPT-4 performs well in lower-order thinking skills (memory, understanding) but shows a marked decline in higher-order thinking skills. (3) Human-generated questions exhibit the most stable performance across all cognitive levels, especially maintaining a high level in higher-order thinking skills. (4) The performance gap between Multi-Examiner and human-generated questions is small in the evaluation and creation levels, indicating that Multi-Examiner has good potential in generating questions that require higher-order thinking.

4.3.5. Quality Analysis of Higher-Order Thinking Questions

We specifically focused on the three higher-order thinking levels: analysis, evaluation, and creation. Table 13 presents the mean scores and standard deviations for these three levels across the different generation methods.
We conducted univariate ANOVAs for the analysis, evaluation, and creation levels separately, with results shown in Table 14. The findings are as follows: (1) At the analysis level, there is no significant difference between the three generation methods (F(2, 87) = 2.14, p = 0.123, partial η 2 = 0.04). This indicates that Multi-Examiner performs similarly to GPT-4 and human-generated methods when generating questions at the analysis level. (2) At the evaluation level, the generation method has a significant impact on question quality (F(2, 87) = 4.27, p = 0.017, partial η 2 = 0.07). The effect size is medium, suggesting that there are substantial differences in the quality of questions generated by different methods at this level. (3) At the creation level, the impact of the generation method is most significant (F(2, 87) = 6.89, p = 0.002, partial η 2 = 0.10). This is a medium-to-large effect size, indicating that at the highest cognitive level, the generation method has the greatest influence on question quality.
To further explore the differences between groups, we conducted Tukey’s HSD post-hoc tests for the evaluation and creation levels, with the results shown in Table 15. From these analyses, we can draw the following conclusions: (1) Multi-Examiner performs exceptionally well in generating higher-order thinking questions, particularly at the evaluation and creation levels. Its performance is significantly better than GPT-4 and close to the level of human-generated questions. (2) At the analysis level, there are no significant differences between the three methods, which may indicate that automated methods have reached a level comparable to human performance at this stage. (3) As the cognitive levels increase (from analysis to evaluation and then to creation), the differences between generation methods grow, reflecting the challenge of generating questions that require higher-order thinking skills. (4) GPT-4 shows clear limitations in generating higher-order thinking questions, particularly at the creation level, highlighting the importance of incorporating additional structured information, such as KGs. (5) The performance of Multi-Examiner at the evaluation and creation levels shows no significant difference from that of human-generated questions, indicating that the system has strong potential in generating high-quality, higher-order thinking questions.
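A sketch of how the per-level ANOVAs (Table 14) and the follow-up Tukey comparisons (Table 15) could be computed is given below, under the same illustrative data layout as the previous sketches.

```python
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# One-way ANOVA of generation method at each higher-order level (Table 14)
for level in ["Analysis", "Evaluation", "Creation"]:
    subset = quality[quality["cognitive_level"] == level]
    print(level)
    print(anova_lm(smf.ols("score ~ C(method)", data=subset).fit(), typ=2).round(3))

# Tukey HSD comparisons at the evaluation and creation levels (Table 15)
for level in ["Evaluation", "Creation"]:
    subset = quality[quality["cognitive_level"] == level]
    print(level)
    print(pairwise_tukeyhsd(subset["score"], subset["method"]))
```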

5. Discussion

5.1. Discussion of Distractor Contextual Relevance and Generation Method Effectiveness

Based on the above analysis, we conclude that Multi-Examiner demonstrates a significant advantage in generating contextually relevant distractors, especially for factual and procedural knowledge types. The effectiveness of Multi-Examiner suggests that leveraging structured domain-specific information enhances the generation of distractors, achieving a quality level comparable to that of human-generated questions across multiple knowledge types. The higher scores observed for conceptual and metacognitive knowledge types indicate that these areas may be less challenging for automated systems, likely due to the more flexible nature of the knowledge involved, which allows for more variance in distractor generation.
The limitations of GPT-4 in generating high-quality distractors, particularly for factual knowledge, suggest a need for further development of LLMs tailored to educational assessment tasks. This performance gap may stem from GPT-4’s lack of structured, domain-specific knowledge, affecting its ability to generate distractors that are closely related to specific knowledge points. Additionally, the lower performance in factual knowledge highlights the inherent difficulty in generating distractors for this knowledge type, as it often has clear right and wrong distinctions, making it challenging to create distractors that are both relevant and misleading. These findings have practical implications, indicating that while automated systems like Multi-Examiner show great potential, further optimization, particularly in factual knowledge areas, is needed to enhance the system’s application in educational settings.
For educators and assessment designers not involved in this study, our findings suggest several practical implementation paths. First, the Multi-Examiner system could be integrated into existing learning management systems as a question generation plugin, where teachers simply input topic areas and receive diverse distractor sets. Second, the contextual relevance metrics developed in this study could serve as evaluation criteria for teachers to quickly assess the quality of automatically generated assessment items before deployment. Additionally, our system architecture provides a template for institutions developing their own assessment tools—the combination of domain-specific KGs with LLMs represents a reproducible approach that can be adapted across various subject areas.
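As a purely hypothetical illustration of the plugin-style integration described above, the sketch below outlines one possible teacher-facing interface; every class, field, and endpoint name is invented for illustration and is not part of the Multi-Examiner implementation.

```python
from dataclasses import dataclass

@dataclass
class GeneratedMCQ:
    stem: str
    options: list[str]    # one key plus three distractors
    answer_index: int
    cognitive_level: str  # e.g., "evaluating" or "creating"

class QuestionGeneratorPlugin:
    """Thin teacher-facing wrapper an LMS could expose for topic-based generation."""

    def __init__(self, generator_endpoint: str):
        # URL of a Multi-Examiner-style generation service (hypothetical)
        self.endpoint = generator_endpoint

    def generate(self, topic: str, cognitive_level: str, n: int = 5) -> list[GeneratedMCQ]:
        # A real integration would call the generation service here; this stub
        # only documents the request/response shape a plugin might use.
        raise NotImplementedError("Connect this stub to a question-generation backend.")
```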

5.2. Discussion on Enhancing Question Diversity and Cognitive Challenge Through Automated Generation Methods

The analysis indicates that Multi-Examiner has a clear advantage in enhancing the diversity, challenge, and higher-order thinking of generated question sets, particularly when compared to GPT-4. This finding highlights the potential of integrating KGs and domain-specific search tools with large language models to improve the quality and diversity of automatically generated questions. The performance of Multi-Examiner is not significantly different from human-generated question sets, suggesting that it has the potential to be an effective support tool for educators, capable of producing question sets that closely match human standards in diversity and cognitive complexity. On the other hand, GPT-4’s limitations are apparent, as it consistently underperforms across all dimensions, particularly in higher-order thinking. This may be due to its lack of specialized knowledge structures and educational evaluation frameworks, which are crucial for generating varied and challenging questions suitable for educational assessments.
These results have important implications for educational practice, demonstrating that tools like Multi-Examiner can significantly enhance the efficiency and quality of question generation, particularly in scenarios where a diverse and comprehensive set of questions is needed. The strong alignment of Multi-Examiner’s performance with that of human-generated methods also suggests its applicability in real-world educational settings, potentially easing the burden on educators. However, the limitations of GPT-4 highlight the need for further development and customization of LLMs tailored to the educational domain to better support diverse question generation. Future research could explore optimizing these AI models to achieve even greater diversity and quality in question generation across various subject areas, ensuring these tools effectively contribute to educational assessment practices.

5.3. Evaluating the Effectiveness of Automated Systems in Generating Higher-Order Thinking Questions in K-12 IT Education

The analysis of the effectiveness of Multi-Examiner in generating higher-order thinking questions shows that it outperforms GPT-4, particularly at the evaluation and creation levels. These results align with prior findings, emphasizing the advantage of integrating KGs and LLMs in improving the quality of complex question generation. The performance gap between Multi-Examiner and human-generated questions is minimal, suggesting the system’s potential as an effective tool for educators, providing support close to human capabilities. In contrast, GPT-4’s limitations are evident, especially in generating questions requiring deep thinking and creativity, highlighting the need for specialized educational frameworks and domain-specific knowledge to enhance its application in educational assessments.
The interaction effect between generation methods and cognitive levels indicates that different methods exhibit varying performance patterns across cognitive levels, reinforcing the importance of tailoring AI models for specific educational contexts. As cognitive levels increase from analysis to evaluation and creation, performance differences become more pronounced, reflecting the challenge of generating higher-order thinking questions. These findings suggest the need for further refinement of AI-based systems, such as Multi-Examiner, to better support K-12 education by effectively generating questions that assess critical and creative thinking skills.

6. Conclusions

This research addressed three critical gaps in automated assessment for K-12 IT education: the shortage of contextually relevant distractors, the lack of question diversity and cognitive challenge, and the limited generation of higher-order thinking questions. Our comprehensive evaluation demonstrates that the Multi-Examiner system successfully addresses these gaps through its innovative architecture combining KGs, domain-specific search tools, and LLMs. Our contributions are threefold: First, we demonstrated that Multi-Examiner significantly outperforms GPT-4 in generating contextually relevant distractors across multiple knowledge types, achieving quality comparable to human experts. Second, we established that Multi-Examiner enhances question diversity and cognitive challenge, producing assessment sets that closely align with human-generated standards while surpassing conventional LLM approaches. Third, we provided empirical evidence that Multi-Examiner effectively generates higher-order thinking questions, particularly excelling at the evaluation and creation levels of Bloom’s taxonomy. In conclusion, the Multi-Examiner system effectively generates high-quality, higher-order thinking questions for K-12 IT education, outperforming GPT-4 across cognitive levels, especially in evaluation and creation. This highlights the value of integrating KGs and domain-specific tools to enhance question diversity and complexity, aligning closely with human-crafted quality.
These findings have significant implications for educational assessment practices, offering educators a reliable tool that reduces workload while maintaining high standards. The Multi-Examiner architecture provides a reproducible framework that can potentially transform assessment generation across educational domains. Despite promising results, this study has several limitations. The sample size of 30 questions per cognitive level may limit generalizability, and the subjective nature of expert scoring introduces potential bias. Additionally, our focus on IT education may not fully represent the challenges in other subject areas with different knowledge structures. Future research should address these limitations by expanding to diverse subjects and incorporating more objective evaluation metrics. We also recommend exploring the integration of Multi-Examiner with adaptive learning systems to personalize assessment based on individual student needs. Further investigation into combining Multi-Examiner with other AI technologies like automated feedback generation could create more comprehensive educational tools. Finally, longitudinal studies examining the impact of Multi-Examiner-generated assessments on student learning outcomes would provide valuable insights into its practical effectiveness.

Author Contributions

Conceptualization, Y.W. and Z.Y. (Zeyu Yu); Methodology, Y.W. and Z.Y. (Zeyu Yu); Software, Y.W. and Z.Y. (Zeyu Yu); Validation, Z.Y. (Zeyu Yu) and Z.W.; Formal analysis, Y.W. and Z.Y. (Zeyu Yu); Investigation, Y.W., Z.Y. (Zeyu Yu) and Z.W.; Resources, Z.Y. (Zeyu Yu) and Z.W.; Data curation, Z.Y. (Zeyu Yu) and Z.W.; Writing—original draft, Y.W., Z.Y. (Zeyu Yu) and Z.Y. (Zengyi Yu); Writing—review & editing, Y.W., Z.Y. (Zengyi Yu) and J.W.; Supervision, Y.W.; Project administration, Y.W.; Funding acquisition, Y.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (Grant No. 6217021982).

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki, and approved by the Ethics Committee of the Institute of Applied Psychology at Zhejiang University of Technology (No. 2024D024) (15 September 2024).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to privacy.

Acknowledgments

During the preparation of this study, the author used GPT-4 (released in March 2023) for the purposes of programming. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Research framework of Multi-Examiner study.
Figure 2. System architecture processes.
Figure 3. Architecture of Multi-Examiner.
Figure 4. System architecture of Multi-Examiner, illustrating the three-tier design of the knowledge graph and its integration with question generation modules.
Figure 5. Distractor relevance scores by generation method and knowledge type.
Figure 6. Interaction effect plot of generation method and knowledge type.
Figure 7. Parallel coordinate plot of differences in generation methods across evaluation dimensions.
Figure 8. Interaction effect graph of generation methods and cognitive levels.
Table 1. Knowledge point attributes.
Attribute | Description
Knowledge type | Four types of knowledge based on Bloom’s taxonomy, including factual, conceptual, procedural, and metacognitive
Cognitive level | The cognitive levels supported by each knowledge point, including remembering, understanding, applying, analyzing, evaluating, and creating
Knowledge content | Specific content of the knowledge point, including basic principles, fundamental concepts, basic methods, and important facts
Table 2. Descriptive statistics of distractor relevance scores by generation method and knowledge type.
Generation Method | Knowledge Type | Mean | Standard Deviation | N
Multi-Examiner | Factual | 4.03 | 1.00 | 30
Multi-Examiner | Conceptual | 3.00 | 0.91 | 30
Multi-Examiner | Procedural | 3.57 | 0.94 | 30
Multi-Examiner | Metacognitive | 3.73 | 0.94 | 30
GPT-4 | Factual | 3.33 | 0.99 | 30
GPT-4 | Conceptual | 2.53 | 0.97 | 30
GPT-4 | Procedural | 2.40 | 1.10 | 30
GPT-4 | Metacognitive | 3.20 | 0.99 | 30
Human | Factual | 3.07 | 1.14 | 30
Human | Conceptual | 3.63 | 0.81 | 30
Human | Procedural | 3.77 | 0.82 | 30
Human | Metacognitive | 3.60 | 1.00 | 30
Table 3. Breakdown of distractor relevance sub-criteria by generation method (mean scores).
Generation Method | Conceptual Relevance | Logical Rationality | Clarity
Multi-Examiner | 3.83 | 3.65 | 3.38
GPT-4 | 3.12 | 2.73 | 2.68
Human | 3.82 | 3.45 | 3.33
Table 4. Results of two-way ANOVA on distractor relevance scores.
Source of Variation | Sum of Squares | Degrees of Freedom (DF) | Mean Square | F-Value | p-Value | Partial η²
Generation method | 37.62 | 2 | 18.81 | 19.85 | <0.001 | 0.08
Knowledge type | 12.33 | 3 | 4.11 | 4.34 | 0.005 | 0.03
Interaction | 32.93 | 6 | 5.49 | 5.79 | <0.001 | 0.07
Error | 329.73 | 348 | 0.95
Table 5. Tukey’s HSD post-hoc test results for distractor relevance (after Bonferroni correction).
Comparison | Mean Difference | Standard Error | p-Value | 95% Confidence Interval
Multi-Examiner vs. GPT-4 | 0.71 | 0.16 | <0.001 | [0.41, 1.02]
Multi-Examiner vs. human | −0.07 | 0.16 | 0.870 | [−0.38, 0.25]
GPT-4 vs. human | 0.65 | 0.16 | <0.001 | [0.34, 0.96]
Factual vs. conceptual | −0.42 | 0.21 | 0.039 | [−0.83, −0.01]
Factual vs. procedural | −0.23 | 0.21 | 0.453 | [−0.64, 0.18]
Factual vs. metacognitive | 0.03 | 0.21 | 0.997 | [−0.38, 0.44]
Table 6. Descriptive statistics of question set scores by generation method.
Generation Method | Mean | Standard Deviation | N
Multi-Examiner | 4.23 | 0.57 | 30
GPT-4 | 3.40 | 1.13 | 30
Human | 4.43 | 0.57 | 30
Table 7. Results of the multivariate analysis of variance for question set scores.
Effect | F-Value | Hypothesis DF | Error DF | p-Value | Partial η²
Generation method | 14.016 | 2 | 87 | <0.001 | 0.244
Table 8. Univariate analysis of variance results for each evaluation dimension.
Dependent Variable | Sum of Squares | DF | Mean Square | F-Value | p-Value | Partial η²
Diversity | 8.022 | 2 | 9.011 | 14.016 | <0.001 | 0.244
Table 9. Tukey’s HSD post-hoc test results in the diversity dimension.
Comparison | Mean Difference | Standard Error | p-Value | 95% CI
Multi-Examiner vs. GPT-4 | 1.03 | 0.91 | <0.001 | [0.34, 1.33]
Multi-Examiner vs. human | −0.20 | 0.91 | 0.600 | [−0.69, 0.29]
GPT-4 vs. human | 0.83 | 0.91 | <0.001 | [0.54, 1.53]
GPT-4 vs. human | 1.03 | 0.91 | <0.001 | [0.34, 1.33]
Table 10. Descriptive statistics of scores by generation method across six cognitive levels of Bloom’s taxonomy.
Generation Method | Cognitive Level | Mean | Standard Deviation | N
Multi-Examiner | Memory | 2.13 | 1.01 | 30
Multi-Examiner | Understanding | 2.27 | 1.11 | 30
Multi-Examiner | Application | 2.07 | 0.95 | 30
Multi-Examiner | Analysis | 1.97 | 0.95 | 30
Multi-Examiner | Evaluation | 1.77 | 0.94 | 30
Multi-Examiner | Creation | 2.12 | 1.08 | 30
GPT-4 | Memory | 2.97 | 0.99 | 30
GPT-4 | Understanding | 2.80 | 0.96 | 30
GPT-4 | Application | 3.05 | 1.00 | 30
GPT-4 | Analysis | 2.77 | 0.99 | 30
GPT-4 | Evaluation | 2.20 | 0.71 | 30
GPT-4 | Creation | 3.09 | 0.94 | 30
Human | Memory | 3.35 | 0.91 | 30
Human | Understanding | 3.67 | 1.03 | 30
Human | Application | 3.18 | 0.81 | 30
Human | Analysis | 2.88 | 1.01 | 30
Human | Evaluation | 2.27 | 1.11 | 30
Human | Creation | 3.43 | 0.96 | 30
Table 11. Results of two-way ANOVA for question quality scores.
Source of Variation | Sum of Squares | Degrees of Freedom | Mean Square | F-Value | p-Value | Partial η²
Generation method | 232.67 | 2 | 116.34 | 122.48 | <0.001 | 0.09
Cognitive level | 56.04 | 5 | 11.21 | 11.80 | <0.001 | 0.02
Interaction | 15.74 | 10 | 1.57 | 1.66 | 0.086 | 0.01
Error | 1008.71 | 1062 | 0.95
Table 12. Tukey’s HSD post-hoc test results for question quality scores (after Bonferroni correction).
Comparison | Mean Difference | Standard Error | p-Value | 95% CI
Multi-Examiner vs. GPT-4 | 0.81 | 0.09 | <0.001 | [0.64, 0.99]
Multi-Examiner vs. human | 0.28 | 0.09 | 0.0015 | [0.11, 0.46]
GPT-4 vs. human | 1.09 | 0.09 | <0.001 | [0.92, 1.27]
Evaluation vs. creation | −0.74 | 0.19 | <0.001 | [−1.11, −0.36]
Analysis vs. evaluation | 0.83 | 0.23 | <0.001 | [−1.29, −0.37]
Application vs. analysis | 0.69 | 0.20 | <0.001 | [−1.09, −0.29]
Table 13. Quality scores for analysis, evaluation, and creation levels.
Generation Method | Analysis (M ± SD) | Evaluation (M ± SD) | Creation (M ± SD)
Multi-Examiner | 2.97 ± 0.99 | 3.08 ± 0.94 | 2.80 ± 0.96
GPT-4 | 2.13 ± 1.01 | 2.12 ± 1.08 | 2.27 ± 1.11
Human | 3.34 ± 0.91 | 3.43 ± 0.96 | 3.67 ± 1.03
Table 14. Univariate ANOVA results for higher-order thinking levels.
Cognitive Level | F-Value | p-Value | Partial η²
Analysis | 36.67 | <0.001 | 0.14
Evaluation | 28.14 | <0.001 | 0.17
Creation | 13.96 | <0.001 | 0.12
Table 15. Tukey’s HSD post-hoc test results for evaluation and creation levels.
Cognitive Level | Comparison | Mean Difference | p-Value | 95% CI
Evaluation | Multi-Examiner vs. GPT-4 | 0.97 | <0.001 | [0.54, 1.40]
Evaluation | Multi-Examiner vs. human | 1.32 | <0.001 | [0.89, 1.75]
Evaluation | GPT-4 vs. human | 0.35 | 0.135 | [−0.08, 0.78]
Creation | Multi-Examiner vs. GPT-4 | 0.53 | 0.120 | [−0.10, 1.17]
Creation | Multi-Examiner vs. human | 1.40 | <0.001 | [0.76, 2.03]
Creation | GPT-4 vs. human | 0.87 | 0.005 | [0.23, 1.50]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
