3.3. System Development
The Multi-Examiner system employs a modular design with three core components: (1) a domain-specific knowledge graph, (2) a structured knowledge base, and (3) an agent-based question generation framework. This design is based on two principles: (1) improving question quality by integrating knowledge representation and intelligent agents, and (2) aligning questions with educational objectives through hierarchical cognitive design. The system innovatively combines knowledge engineering and AI, incorporating Bloom’s taxonomy.
These modules operate collaboratively through structured data and control flows (Figure 2). The knowledge graph provides semantic structures for the knowledge base, supporting organized retrieval. The knowledge base enables efficient access, while the Multi-Examiner system utilizes agents to interact with both modules for knowledge reasoning and expansion. This bidirectional interaction allows for rich, objective-aligned question generation and continual knowledge optimization.
3.3.1. KG Construction
As shown in Table 1, the KG construction in the Multi-Examiner system takes Bloom’s taxonomy as its theoretical foundation, achieving deep integration of educational theory and technical architecture through systematic knowledge representation structures. At the knowledge representation level, the research develops along the two dimensions of Bloom’s taxonomy: the knowledge dimension and the cognitive process dimension. The knowledge dimension is reflected through core attributes of entities, with each knowledge point entity containing a detailed knowledge description and a cognitive type annotation. The knowledge description attribute not only provides the basic definition and application scope of an entity but also, more importantly, structures knowledge content based on Bloom’s taxonomy framework. The cognitive type attribute strictly follows Bloom’s four-level knowledge classification: factual knowledge (e.g., professional terminology, technical details), conceptual knowledge (e.g., principles, method classifications), procedural knowledge (e.g., operational procedures, problem solving), and metacognitive knowledge (e.g., learning strategies, cognitive monitoring). Specifically, the design is based on the K-12 Computer Science Standards [58].
In the cognitive process dimension, the knowledge graph implements support for different cognitive levels through relationship type design. For example, the Contains relationship primarily serves knowledge expression at the remembering and understanding levels, supporting the cultivation of basic cognitive abilities through explicit concept hierarchies. The Belongs to relationship focuses on supporting cognitive processes at the application and analysis levels, helping learners construct knowledge classification systems. The Prerequisite relationship plays an important role at the evaluation level, promoting critical thinking development by revealing knowledge dependencies. The Related relationship mainly serves the creation level, supporting innovative thinking through knowledge associations. This relationship design based on cognitive theory ensures that the knowledge graph can provide theoretical guidance and knowledge support for generating questions at different cognitive levels.
Through this systematic theoretical integration, the knowledge graph not only achieves structured knowledge representation but also, more importantly, constructs a knowledge framework supporting cognitive development. When the system needs to generate questions at specific cognitive levels, it can conduct knowledge retrieval and reasoning based on corresponding entity attributes and relationship types, thereby ensuring that generated questions both meet knowledge content requirements and accurately match target cognitive levels.
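To make this mapping operational, a minimal sketch is given below (illustrative only; the dictionary and helper names are hypothetical and simply pair each Bloom cognitive-process level with the relationship types described above):

```python
# Hypothetical mapping from Bloom's cognitive-process levels to the knowledge-graph
# relationship types described above; names are illustrative.
BLOOM_RELATION_MAP = {
    "remembering":   ["CONTAINS"],
    "understanding": ["CONTAINS"],
    "applying":      ["BELONGS_TO"],
    "analyzing":     ["BELONGS_TO"],
    "evaluating":    ["PREREQUISITE"],
    "creating":      ["RELATED"],
}

def relations_for_level(level: str) -> list[str]:
    """Return the relationship types to traverse for a target cognitive level."""
    try:
        return BLOOM_RELATION_MAP[level.lower()]
    except KeyError as exc:
        raise ValueError(f"Unknown Bloom level: {level}") from exc

# Example: higher-order "creating" questions traverse RELATED edges.
print(relations_for_level("Creating"))  # ['RELATED']
```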
3.3.2. KB Construction
The KB module adopts a layered processing design philosophy, establishing a systematic knowledge processing and organization architecture to provide foundational support for diverse question generation. This study innovatively designed a three-layer processing architecture encompassing knowledge formatting, knowledge vectorization, and retrieval segmentation, achieving systematic transformation from raw educational resources to structured knowledge. This layered architectural design not only enhances the efficiency and accuracy of knowledge retrieval but also provides a solid data foundation for generating differentiated questions.
At the knowledge formatting level, the system employs Optical Character Recognition (OCR) technology to convert various educational materials into standardized digital text. The knowledge formatting process is formally defined as
K = Format(D)
where D represents the raw educational materials, including textbooks, teaching syllabi, and the professional literature, and K is the resulting set of standardized text segments. This standardization process ensures content consistency and lays the foundation for subsequent vectorization processing. The system optimizes recognition results through multiple preprocessing mechanisms, including text segmentation, key information extraction, and format standardization, to enhance knowledge representation quality.
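A minimal sketch of the formatting step K = Format(D) follows; the paper does not name its OCR engine, so the use of pytesseract here, like the preprocessing helpers, is an illustrative assumption:

```python
# Illustrative sketch of K = Format(D): OCR followed by light preprocessing.
# pytesseract is an assumed OCR backend; the system's actual tooling is unspecified.
import pytesseract
from PIL import Image

def format_document(path: str) -> list[str]:
    """Convert one raw educational material (a scanned page) into cleaned text segments."""
    raw_text = pytesseract.image_to_string(Image.open(path))
    # Format standardization: collapse whitespace and drop empty lines.
    lines = [" ".join(line.split()) for line in raw_text.splitlines()]
    return [line for line in lines if line]

def format_corpus(paths: list[str]) -> list[str]:
    """K = Format(D): apply formatting to every raw material in D."""
    segments: list[str] = []
    for p in paths:
        segments.extend(format_document(p))
    return segments
```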
At the knowledge vectorization stage, a hybrid vectorization strategy is employed to transform standardized text into high-dimensional vector representations. This transformation process is formally defined as
v_k = Vec(k)
where k is a snippet of formatted knowledge from K, and v_k represents the vectorized form of k, obtained using models such as TF-IDF or Word2Vec, or advanced techniques like BERT embeddings. The system innovatively designs a dynamic weight adjustment mechanism, adaptively adjusting the weights of the various vectorization techniques based on knowledge type and application scenario to enhance knowledge representation accuracy. This vectorization method ensures that semantically similar knowledge points remain close in vector space, providing a reliable foundation for subsequent similarity calculations and association analyses.
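One possible realization of the hybrid strategy is sketched below; the specific models (scikit-learn TF-IDF and a sentence-transformers encoder) and the fixed weights are assumptions standing in for the dynamic weight adjustment mechanism:

```python
# Illustrative hybrid vectorization: weighted concatenation of a sparse TF-IDF
# vector and a dense sentence embedding. Model choices and weights are assumptions.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sentence_transformers import SentenceTransformer

class HybridVectorizer:
    def __init__(self, tfidf_weight: float = 0.4, dense_weight: float = 0.6):
        self.tfidf = TfidfVectorizer()
        self.encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
        self.tfidf_weight = tfidf_weight
        self.dense_weight = dense_weight

    def fit(self, snippets: list[str]) -> "HybridVectorizer":
        self.tfidf.fit(snippets)
        return self

    def transform(self, snippets: list[str]) -> np.ndarray:
        """v_k = Vec(k): concatenate the two representations with adjustable weights."""
        sparse = self.tfidf.transform(snippets).toarray()
        dense = self.encoder.encode(snippets)
        return np.hstack([self.tfidf_weight * sparse, self.dense_weight * dense])

# Usage: semantically similar snippets end up close in the combined space.
# vectors = HybridVectorizer().fit(snippets).transform(snippets)
```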
At the retrieval segmentation level, the system systematically organizes vectorized knowledge based on predefined domain tags. The formal expression for retrieval segmentation is
KB = Seg({(v_i, d_i)})
where KB represents the segmented knowledge base, v_i are the individual pieces of vectorized knowledge, d_i denotes the domain or knowledge point tag associated with each v_i, and Seg is the function that assigns each vectorized knowledge piece to its corresponding segment in the knowledge base. The system designs a multi-level index structure, supporting rapid location and extraction of knowledge content from different dimensions while enabling knowledge recombination mechanisms to support diverse question generation requirements. This layered organization structure not only enhances knowledge retrieval efficiency but also provides flexible knowledge support for generating questions at different cognitive levels.
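A minimal sketch of the segmentation function and tag-scoped retrieval is shown below; the in-memory dictionary is a hypothetical stand-in for the system’s multi-level index and vector storage:

```python
# Illustrative Seg(...): group vectorized knowledge pieces by their domain tags.
# A production system would back this with a vector store and multi-level index.
from collections import defaultdict
import numpy as np

def segment_knowledge(vectors: list[np.ndarray],
                      tags: list[str]) -> dict[str, list[np.ndarray]]:
    """KB = Seg({(v_i, d_i)}): assign each vectorized piece v_i to the segment d_i."""
    kb: dict[str, list[np.ndarray]] = defaultdict(list)
    for v, d in zip(vectors, tags):
        kb[d].append(v)
    return kb

def retrieve(kb: dict[str, list[np.ndarray]], tag: str,
             query: np.ndarray, top_k: int = 5) -> list[int]:
    """Locate the top-k most similar pieces within one domain segment (cosine similarity)."""
    segment = kb.get(tag, [])
    scores = [float(np.dot(query, v) / (np.linalg.norm(query) * np.linalg.norm(v) + 1e-9))
              for v in segment]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
```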
This study constructed a comprehensive knowledge base system around core IT curriculum knowledge. Through a systematic layered processing architecture, this knowledge base achieves efficient transformation from raw educational resources to structured knowledge. The knowledge base not only supports precise retrieval based on semantics but can also provide differentiated knowledge support according to different cognitive levels and knowledge types, laying the foundation for generating diverse and high-quality questions. This layered knowledge processing architecture significantly enhances the system’s flexibility and adaptability in question generation, better meeting assessment needs across different teaching scenarios.
3.3.3. Multi-Examiner System Design
The Multi-Examiner module is grounded in modern educational assessment theory, integrating cognitive diagnostic theory, item generation theory, and formative assessment theory to construct a theory-driven intelligent agent collaborative assessment framework. This framework innovatively designs four types of intelligent agents: Identificador, Explorador, Fabricator, and Revisor, forming a complete question generation pipeline. Through these four specialized intelligent agents, the system achieves precise control over the question generation process, with each agent designed based on specific educational theories, collectively constituting a comprehensive intelligent educational test item generation system (Figure 3). This theory-based multi-agent system framework design not only ensures the educational value of generated questions but also provides a new technical paradigm for educational measurement and evaluation.
The Identificador, designed based on schema theory and instantiated through role-specific prompts, is responsible for deep semantic understanding and cognitive feature analysis of knowledge points. This agent implements knowledge retrieval through the function
S = LLM(k)
where S represents the set of synonyms and related terms, k is the original knowledge point input by the user, and LLM(·) denotes the large language model’s operation to fetch and generate synonymous and related terms, drawing on its training over vast corpora and on search engine integration. The Identificador not only identifies surface features of knowledge points but also, more importantly, analyzes cognitive structures and semantic networks based on schema theory. For example, when processing the knowledge point “Operating System”, the Identificador first constructs its cognitive schema, including core attributes (such as system software characteristics), process features (such as resource management mechanisms), and relational features (such as interactions with hardware and application software), thereby providing a complete cognitive framework for subsequent question generation.
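A hedged sketch of S = LLM(k) follows; call_llm is a hypothetical placeholder for the project’s actual LLM client, and the prompt wording is illustrative:

```python
# Illustrative Identificador: S = LLM(k). call_llm is a hypothetical placeholder
# for the project's chat-completion client; it is assumed to return plain text.
IDENTIFICADOR_PROMPT = (
    "You are an IT-education examiner. For the knowledge point '{k}', list "
    "synonyms and closely related terms, one per line, and note its core "
    "attributes, process features, and relational features."
)

def call_llm(prompt: str) -> str:  # placeholder, not a real API
    raise NotImplementedError("plug in the project's LLM client here")

def identificador(k: str) -> set[str]:
    """Return S, the set of synonyms and related terms for knowledge point k."""
    response = call_llm(IDENTIFICADOR_PROMPT.format(k=k))
    terms = {line.strip(" -\t") for line in response.splitlines() if line.strip()}
    terms.add(k)  # always keep the original knowledge point
    return terms

# Conceptually, identificador("Operating System") might yield terms such as
# "OS", "system software", or "resource management".
```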
The Explorador adopts constructivist learning theory to guide knowledge association exploration, implementing multidimensional semantic connections in the knowledge graph through the function
E = Retrieve(S, G)
where E represents the detailed knowledge entries retrieved, S is the set of input terms from the Identificador, and G is the knowledge graph. This agent innovatively implements directed retrieval strategies, selecting corresponding knowledge nodes according to the different cognitive levels of Bloom’s taxonomy. For example, when generating higher-order thinking questions, the Explorador prioritizes knowledge nodes related to advanced cognitive processes such as analysis, evaluation, and creation, establishing logical connections between these nodes to provide knowledge support for generating complex assessment tasks.
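The retrieval E = Retrieve(S, G) could be sketched as below against a Neo4j graph; the Cypher pattern, node label, and property names are assumptions about the schema rather than the system’s published queries:

```python
# Illustrative Explorador: E = Retrieve(S, G) over Neo4j, filtered by the
# relationship types tied to a target Bloom level. The node label "Knowledge"
# and the "name"/"description" properties are schema assumptions.
from neo4j import GraphDatabase

def explorador(uri: str, auth: tuple, terms: set[str], rel_types: list[str]) -> list[dict]:
    """Return knowledge entries connected to the input terms via the given relation types."""
    rels = "|".join(rel_types)  # e.g. ["PREREQUISITE"] for evaluation-level questions
    cypher = (
        f"MATCH (k:Knowledge)-[r:{rels}]-(n) "
        "WHERE k.name IN $terms "
        "RETURN k.name AS source, type(r) AS relation, "
        "       n.name AS target, n.description AS description"
    )
    driver = GraphDatabase.driver(uri, auth=auth)
    with driver.session() as session:
        records = [dict(rec) for rec in session.run(cypher, terms=list(terms))]
    driver.close()
    return records

# Usage (hypothetical connection details):
# entries = explorador("bolt://localhost:7687", ("neo4j", "password"),
#                      {"Operating System", "OS"}, ["PREREQUISITE"])
```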
The Fabricator integrates cognitive load theory and question type design theory, implementing dynamic question generation through the function
q = Gen(k, t, p_t)
where q represents the generated question, t is the type of knowledge (factual, conceptual, procedural, metacognitive), k is the knowledge point, and p_t is the prompt tailored to type t. The Fabricator’s innovation is reflected in its ability to dynamically adjust question complexity according to learning objectives: the agent adopts a specific generation strategy (p_t) for each cognitive objective (t), ensuring assessment validity while controlling the cognitive load of the question. For example, when generating conceptual understanding questions, the system controls the amount of information and the complexity of the problem context to maintain an optimal cognitive load level.
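A minimal sketch of q = Gen(k, t, p_t) is given below; the prompt templates and the call_llm placeholder are illustrative assumptions, not the system’s actual prompts:

```python
# Illustrative Fabricator: q = Gen(k, t, p_t), with one tailored prompt p_t per
# knowledge type t. Template wording is an assumption, not the system's prompts.
def call_llm(prompt: str) -> str:  # hypothetical placeholder for the LLM client
    raise NotImplementedError("plug in the project's LLM client here")

PROMPTS_BY_TYPE = {
    "factual": (
        "Write one multiple-choice question checking recall of '{k}'. Give four "
        "options, exactly one correct, with plausible distractors. Context: {context}"
    ),
    "conceptual": (
        "Write one multiple-choice question that requires explaining the principle "
        "behind '{k}'. Keep the stem short to limit cognitive load. Context: {context}"
    ),
    "procedural": (
        "Write one multiple-choice question about the correct order of steps when "
        "applying '{k}' in a realistic scenario. Context: {context}"
    ),
    "metacognitive": (
        "Write one multiple-choice question asking the learner to choose the best "
        "strategy for learning or monitoring their understanding of '{k}'. Context: {context}"
    ),
}

def fabricator(k: str, t: str, context: str) -> str:
    """Generate question q for knowledge point k and knowledge type t."""
    p_t = PROMPTS_BY_TYPE[t]  # select the tailored prompt p_t for type t
    return call_llm(p_t.format(k=k, context=context))
```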
The Revisor constructs a systematic quality control mechanism based on educational measurement theory, ensuring question quality through multidimensional evaluation criteria. Its review function not only verifies technical correctness but also, more importantly, evaluates the consistency between questions and educational objectives. When quality issues are detected, the system generates specific modification suggestions and triggers an optimization process. This closed-loop quality control mechanism ensures that the system can continuously produce high-quality assessment questions.
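The closed-loop review can be pictured with the following sketch; the rubric wording, score parsing, and acceptance threshold are assumptions for illustration:

```python
# Illustrative Revisor loop: score a draft question against multidimensional
# criteria and, if needed, request a revision. Rubric wording, score parsing,
# and the threshold are assumptions for illustration.
import re

def call_llm(prompt: str) -> str:  # hypothetical placeholder for the LLM client
    raise NotImplementedError("plug in the project's LLM client here")

REVIEW_PROMPT = (
    "Rate the question below from 1-5 on technical correctness, alignment with the "
    "objective '{objective}', and distractor plausibility. Start your answer with "
    "'OVERALL: <score>' and then list concrete revision suggestions.\n\n{q}"
)

def revisor(q: str, objective: str) -> tuple[float, str]:
    """Return an overall score and textual revision suggestions for question q."""
    feedback = call_llm(REVIEW_PROMPT.format(objective=objective, q=q))
    match = re.search(r"OVERALL:\s*([0-9.]+)", feedback)
    score = float(match.group(1)) if match else 0.0
    return score, feedback

def review_until_acceptable(q: str, objective: str,
                            threshold: float = 4.0, max_rounds: int = 3) -> str:
    """Closed-loop quality control: revise the draft until it meets the threshold."""
    for _ in range(max_rounds):
        score, feedback = revisor(q, objective)
        if score >= threshold:
            break
        q = call_llm(f"Revise the question using this feedback:\n{feedback}\n\n{q}")
    return q
```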
At the agent collaboration level, the system employs an event-driven mechanism grounded in educational assessment theory, forming a complete question generation and evaluation chain. This design is inspired by Cognitive Development and Adaptive Assessment theories, ensuring continuity and adaptability throughout the process. The Identificador assesses cognitive features, triggering the Explorador to construct knowledge networks, which the Fabricator uses to dynamically adjust question strategies.
Finally, the Revisor provides the final evaluation of the questions, ensuring that the assessment is not only technically accurate but also aligned with educational goals. This integrated approach to question development and validation promotes the creation of effective assessment tools that meet diverse learning needs.
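As a rough illustration of this event-driven chain (the actual system uses an asynchronous event bus, described only at the architectural level later in this section), a minimal in-process publish–subscribe sketch might look like this:

```python
# Minimal publish-subscribe sketch of the agent chain. Event names and handler
# signatures are illustrative; the real system uses an asynchronous event bus.
from collections import defaultdict
from typing import Callable

class EventBus:
    def __init__(self):
        self._subscribers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, event: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[event].append(handler)

    def publish(self, event: str, payload: dict) -> None:
        for handler in self._subscribers[event]:
            handler(payload)

bus = EventBus()
# Chain: Identificador -> Explorador -> Fabricator -> Revisor
bus.subscribe("knowledge_identified",
              lambda p: bus.publish("knowledge_explored", {**p, "entries": "..."}))
bus.subscribe("knowledge_explored",
              lambda p: bus.publish("question_drafted", {**p, "question": "..."}))
bus.subscribe("question_drafted",
              lambda p: print("Revisor receives draft for review:", p))

bus.publish("knowledge_identified", {"knowledge_point": "Operating System"})
```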
To enhance scalability, the Multi-Examiner system employs a microservice modular design, enabling each agent to function independently through standardized APIs. The system’s innovation spans three theoretical levels: (1) systematic application of educational theories, (2) precise cognitive mapping in question design, and (3) formative assessment implementation. This framework integrates educational integrity with AI-driven automation, advancing adaptability in question generation.
The design prioritizes educational purpose alongside technical innovation, ensuring generated questions serve meaningful educational needs. The modular architecture allows for continuous adaptation, supporting the integration of new theories and technologies to maintain relevance in educational technology.
3.4. Experimental Design
To ensure a rigorous and systematic evaluation of the effectiveness of the Multi-Examiner system in automatically generating high-quality MCQs for IT education, we designed a comprehensive experimental protocol that includes three key components: participant selection, experimental materials and procedures, and evaluation metrics. The study used an expert evaluation methodology for a comparative analysis of the quality of the questions generated by Multi-Examiner relative to alternative generation methods. If the questions produced by the Multi-Examiner system receive scores that differ significantly from those generated by GPT-4 while closely aligning with the scores for human-generated questions, this would demonstrate the effectiveness of the system. Prior to the main experiment, we conducted ten semi-structured interviews with experienced IT educators to validate and refine the initial assessment framework, significantly informing the final experimental design.
As shown in Figure 4, the Multi-Examiner system features a four-layer architectural design that integrates the knowledge graph and question generation modules. The User Interface Layer serves as the entry point for processing question generation requests through the Application Service Layer. The Core Service Layer contains the system’s essential intelligent components: the Knowledge Graph Service, which provides Ontology Model Management, the Knowledge Inference Engine, and Cypher Query Optimization; the Knowledge Base Service, which manages Knowledge Formatting, Knowledge Vectorization, and Hybrid Search Index construction; and the Agent Framework, based on the Langchain framework, which comprises Identificador, Explorador, Fabricator, and Revisor, collaborating in knowledge identification, exploration, construction, and validation. The Infrastructure Layer includes the Neo4j Database and Vector Storage, forming a complete knowledge-driven question generation system through standardized component communication and configuration management. Furthermore, the system creates query tools for both the Identificador and Explorador, employs prompt engineering strategies for agents to perform their respective roles, and implements an event bus based on the publish–subscribe pattern for asynchronous communication.
The system architecture employs the Neo4j graph database (version 4.3.11) as the core platform for storing and managing the knowledge graph, facilitating efficient knowledge retrieval and reasoning via the Cypher query language. The database is deployed within Docker containers to ensure system stability and portability.
To enhance retrieval efficiency, we designed a set of optimized Cypher query templates, categorized into two main types: (1) Knowledge Point Queries, which extract detailed conceptual information and knowledge types of specific knowledge points, serving as foundational data for question generation; and (2) Knowledge Path Queries, which explore relationships among related knowledge points to generate contextually relevant distractors. Integration with the system is facilitated through Python’s Neo4j driver, providing efficient and stable access to knowledge.
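The two template categories could be sketched as follows; the Cypher text, labels, and property names are assumptions consistent with the schema described in this section, not the optimized templates themselves:

```python
# Illustrative versions of the two Cypher template categories, executed through
# the Python Neo4j driver. Labels and property names are assumed from this section.
from neo4j import GraphDatabase

KNOWLEDGE_POINT_QUERY = (
    "MATCH (k:Knowledge {name: $name}) "
    "RETURN k.name AS name, k.cognitive_type AS cognitive_type, "
    "       k.description AS description"
)

KNOWLEDGE_PATH_QUERY = (
    "MATCH path = (k:Knowledge {name: $name})"
    "-[:CONTAINS|BELONGS_TO|PREREQUISITE|RELATED*1..2]-(n) "
    "RETURN n.name AS related_point, [r IN relationships(path) | type(r)] AS relations"
)

def run_template(uri: str, auth: tuple, cypher: str, **params) -> list[dict]:
    driver = GraphDatabase.driver(uri, auth=auth)
    with driver.session() as session:
        rows = [dict(rec) for rec in session.run(cypher, **params)]
    driver.close()
    return rows

# point = run_template(uri, auth, KNOWLEDGE_POINT_QUERY, name="Operating System")
# paths = run_template(uri, auth, KNOWLEDGE_PATH_QUERY, name="Operating System")
```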
The knowledge graph consists of three hierarchical layers:
Ontology Layer: This layer, designed based on the characteristics of IT education and Bloom’s taxonomy, defines four core ontology classes: Knowledge (basic knowledge points), Concept (conceptual knowledge), Procedure (procedural knowledge), and Metacognition (metacognitive knowledge). It includes semantic relation types that support different cognitive levels, such as CONTAINS (memory and understanding), BELONGS_TO (application and analysis), PREREQUISITE (evaluation), and RELATED (creation). These relations empower the knowledge graph to facilitate multi-level cognitive question generation aligned with Bloom’s taxonomy.
Entity-Relation Layer: This core content layer includes specific IT knowledge entities and their interrelations, primarily extracted from high school IT textbooks and curriculum guidelines through a semi-automated process. Using large language models (LLMs), specifically GPT-4, we automatically extracted knowledge points and relations from textbook texts via carefully designed prompt templates. The initial knowledge network was refined by IT education experts to ensure conceptual accuracy, rational relation types, and structural coherence.
Attribute Annotation Layer: This layer annotates educational attributes and metadata related to knowledge points and relations, bridging content with educational objectives. Annotated properties include knowledge type, cognitive level, and content descriptions, enabling targeted question generation according to specific learning goals.
This layered knowledge graph design, integrated with domain-specific retrieval and LLM-powered knowledge extraction, forms the technical backbone of the Multi-Examiner system, supporting its ability to generate diverse, high-quality, and cognitively appropriate MCQs for IT education.
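To illustrate how the three layers combine, the snippet below (schema names assumed, following the ontology classes, relation types, and attributes listed above) creates one annotated knowledge entity and one relation via the Python Neo4j driver:

```python
# Illustrative construction of annotated entities and one relation, reflecting the
# ontology, entity-relation, and attribute annotation layers. Labels, relation
# types, and property names follow the description above and are assumptions.
from neo4j import GraphDatabase

CREATE_ENTITY = (
    "MERGE (k:Knowledge {name: $name}) "
    "SET k.cognitive_level = $level, k.knowledge_type = $ktype, "
    "    k.description = $description"
)

CREATE_RELATION = (
    "MATCH (a:Knowledge {name: $parent}), (b:Knowledge {name: $child}) "
    "MERGE (a)-[:CONTAINS]->(b)"
)

def build_example(uri: str, auth: tuple) -> None:
    driver = GraphDatabase.driver(uri, auth=auth)
    with driver.session() as session:
        session.run(CREATE_ENTITY, name="Information System",
                    level="understanding", ktype="conceptual",
                    description="A system that collects, processes, and stores information.")
        session.run(CREATE_ENTITY, name="Operating System",
                    level="applying", ktype="conceptual",
                    description="System software that manages hardware and software resources.")
        session.run(CREATE_RELATION, parent="Information System", child="Operating System")
    driver.close()
```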
3.4.1. Participants
Purposive sampling was used to form an expert evaluation team for assessing the quality of automatically generated MCQs. Statistical power analysis (α = 0.05, power = 0.80, partial η² = 0.06) determined a minimum sample size of 28, leading to the recruitment of 30 experts to account for potential attrition. Selection focused on professional background, teaching experience, and technological literacy, with all experts having at least five years of high school IT teaching experience, training in Bloom’s taxonomy, and educational technology experience. The panel consisted of 18 females (60%) and 12 males (40%), averaging 8.3 years of teaching experience (SD = 2.7); 73% had experience with AI-assisted tools, providing diverse perspectives.
A standardized two-day training ensured evaluation reliability, combining theory with practical application of the evaluation criteria. Pre-assessment on 10 test questions showed high inter-rater consistency (Krippendorff’s α = 0.83). For significant scoring discrepancies, consensus was achieved through discussion. Systematic validity and reliability testing confirmed scoring stability, with test–retest reliability after two weeks achieving a correlation coefficient of 0.85, indicating strong scoring consistency among experts.
3.4.2. Experimental Materials and Procedures
This study used a systematic experimental design to ensure rigor, involving MCQs from three sources: Multi-Examiner, GPT-4, and human-created questions. These questions were generated using identical knowledge points and assessment criteria for comparability. Six core knowledge points from the high school IT curriculum unit “Information Systems and Society” were selected, reviewed by three senior IT education experts, and covered the four knowledge types defined by Bloom’s taxonomy (factual, conceptual, procedural, and metacognitive), resulting in 72 questions. Chi-square testing confirmed a balanced distribution across sources (χ² = 1.86, p > 0.05).
Prior to the main experiment, we conducted 10 semi-structured interviews with IT teachers between September and October 2023. These interviews served to validate our initial assessment framework and ensure its alignment with actual classroom practices. Participants were selected based on their minimum of five years of IT teaching experience and familiarity with automated assessment tools. The interviews explored teachers’ experiences with MCQs, their criteria for evaluating question quality, and their perspectives on automated question generation. Thematic analysis of the interview data revealed three key areas of concern: distractor quality, cognitive level alignment, and practical relevance. These findings directly informed our final evaluation framework and the selection of assessment dimensions.
A triple-blind review design kept evaluators unaware of question sources, with uniform formatting and a Latin square arrangement to minimize sequence and fatigue effects (ANOVA: F = 1.24, p > 0.05). Standardized evaluation processes were used, with questions independently scored by two experts and a third reviewer in cases of large scoring discrepancies (≥2 points). No significant differences were found among groups in text length (F = 0.78, p > 0.05) or language complexity (F = 0.92, p > 0.05). Semi-structured interviews (n = 10) indicated high alignment with real teaching practices (mean = 4.2/5), and qualitative data coding achieved high inter-coder reliability (Cohen’s κ = 0.85).
3.4.3. Measures and Instruments
This study constructed an evaluation framework based on Bloom’s taxonomy, focusing on three dimensions: distractor relevance, question diversity, and higher-order thinking. We employed a questionnaire to gather insights regarding the quality of the automatically generated MCQs, allowing experts to evaluate all questions from Multi-Examiner, GPT-4, and humans.
Distractor relevance evaluated conceptual relevance, logical rationality, and clarity, each rated on a five-point scale. Question diversity assessed cognitive level coverage, domain distribution, and form variation to ensure a balanced assessment across Bloom’s levels. Higher-order thinking measured cognitive depth, challenge level, and application authenticity, with criteria verified through expert consultation and pilot testing.
To ensure rigor, the framework’s content validity achieved a Content Validity Ratio of 0.78, while construct validity was confirmed through factor analysis (χ²/df = 2.34, CFI = 0.92, RMSEA = 0.076). Reliability testing included inter-rater reliability (Krippendorff’s α > 0.83), test–retest reliability (r = 0.85), and internal consistency (Cronbach’s α > 0.83) across dimensions, all demonstrating high reliability.
3.4.4. Data Analysis
This study applied a systematic data analysis framework for three research questions. Descriptive statistics provided an overview of data, followed by inferential analyses tailored to each question, with effect sizes calculated for reliability. Data preprocessing included cleaning, normality checks (Shapiro–Wilk test), and variance homogeneity tests (Levene’s test). Missing values were addressed through multiple imputation.
For distractor relevance, two-way ANOVA assessed the effects of generation methods and knowledge types, with Tukey HSD post hoc tests for significant interactions. For question diversity, MANOVA was employed, with follow-up ANOVAs and Pearson correlations between dimensions. For higher-order thinking, mixed-design ANOVA examined cognitive level differences, using Games–Howell post hoc tests for robustness. Effect sizes (partial η², Cohen’s d, and r) were reported with confidence intervals, focusing on practical significance.
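As a reproducibility aid, the two-way ANOVA with Tukey HSD follow-up could be run as in the following sketch; the DataFrame columns and file name are hypothetical placeholders for the expert ratings:

```python
# Illustrative two-way ANOVA (generation method x knowledge type) on distractor-
# relevance ratings, with Tukey HSD follow-up. Column and file names are hypothetical.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# ratings: one row per expert rating of one question, with columns
#   method (Multi-Examiner / GPT-4 / Human),
#   ktype  (factual / conceptual / procedural / metacognitive),
#   score  (1-5 distractor-relevance rating)
ratings = pd.read_csv("distractor_ratings.csv")  # hypothetical file

model = ols("score ~ C(method) * C(ktype)", data=ratings).fit()
anova_table = sm.stats.anova_lm(model, typ=2)    # two-way ANOVA with interaction
print(anova_table)

# Post hoc comparison of generation methods if the main effect is significant.
tukey = pairwise_tukeyhsd(endog=ratings["score"], groups=ratings["method"])
print(tukey.summary())
```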