We evaluated the effectiveness of the AI virtual museum system along three dimensions: system operation data, interaction event indicators, and overall learning outcomes. The results highlight the technical mechanisms that influence information acquisition, interaction, and historical knowledge construction. The evaluation criteria were standardized against the Cognitive Domain of Bloom’s Taxonomy, targeting analysis (identifying causal links) and evaluation (judging source credibility). Historical literacy was assessed along the dimensions of chronological thinking, source analysis, historical comprehension, and causal reasoning [4].
4.1. Data Collection and Analysis
We surveyed students in two classes of comparable size and similar academic backgrounds during the same teaching period. The students learned the thematic unit “The Formation of the Self-Aware Chinese Nation” from the course Outline of Modern Chinese History. A total of 83 students were recruited from undergraduate history courses. Following a randomized controlled trial design, participants were assigned to the experimental group (n = 41; 51% female; mean age 20.2 years) and the control group (n = 42; 49% female; mean age 20.5 years). To ensure consistency, the same instructor delivered the pre-lecture and post-session debriefing to both groups. This study was approved by the Institutional Review Board (IRB) of Wuhan University of Science and Technology, and written informed consent was obtained from all participants before data collection.
The AI virtual museum system was deployed for four weeks. In this period, the system automatically recorded and stored all user interaction data in the backend for subsequent analysis. Three types of data were collected as follows.
Learning outcomes as external validation indicators of instructional effectiveness;
Interaction events automatically captured by the system during student engagement with artifacts;
Qualitative feedback obtained through semi-structured interviews with a subset of the participants.
Interaction event data were treated as independent variables, while learning outcomes served as dependent variables. To evaluate system effectiveness, independent-samples t-tests were conducted to compare learning outcomes between the intervention and non-intervention groups. Statistical significance was used to determine the impact of system integration on student performance. Pearson correlation coefficients were also calculated to examine relationships between different types of interaction events and learning performance, thereby assessing the influence of event-driven mechanisms on system effectiveness.
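The independent-samples comparison described above can be sketched in a few lines of standard-library Python. The score lists below are hypothetical placeholders, not the study’s data; a pooled-variance (equal-variances) t statistic is assumed.

```python
from statistics import mean, variance

def independent_t(a, b):
    """Independent-samples t statistic with pooled variance (equal variances assumed)."""
    na, nb = len(a), len(b)
    # Pooled variance: each group's sample variance weighted by its degrees of freedom.
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / (sp2 * (1 / na + 1 / nb)) ** 0.5

# Hypothetical post-test scores (illustrative only, not the study's records)
experimental = [86, 84, 90, 79, 88, 83]
control = [72, 70, 75, 68, 74, 71]
t = independent_t(experimental, control)
```

The resulting statistic would then be compared against a t distribution with n1 + n2 − 2 degrees of freedom to obtain the p-value.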
Qualitative data were gathered through semi-structured interviews (15–20 min) with 15 purposively selected students from the experimental group, examining how the system’s interaction mechanisms affected user behavior. The interview protocol focused on perceived agency and historical empathy. Transcripts were analyzed using thematic analysis, with two independent researchers coding the data. The coding yielded an inter-coder reliability (Cohen’s Kappa) of 0.84, indicating strong consistency in identifying themes such as emotional connection to artifacts and AI narrative reliability [24].
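Cohen’s Kappa for two coders can be computed directly from paired code assignments, as sketched below. The segment labels are illustrative assumptions, not the study’s actual coding scheme.

```python
from collections import Counter

def cohens_kappa(coder1, coder2):
    """Cohen's Kappa: chance-corrected agreement between two coders."""
    n = len(coder1)
    # Observed agreement: proportion of items coded identically.
    po = sum(a == b for a, b in zip(coder1, coder2)) / n
    # Expected chance agreement, from each coder's marginal label frequencies.
    c1, c2 = Counter(coder1), Counter(coder2)
    pe = sum((c1[k] / n) * (c2[k] / n) for k in c1.keys() & c2.keys())
    return (po - pe) / (1 - pe)

# Illustrative codes for eight interview segments (hypothetical labels)
coder_a = ["emotion", "emotion", "reliability", "emotion", "other", "reliability", "emotion", "other"]
coder_b = ["emotion", "emotion", "reliability", "other", "other", "reliability", "emotion", "other"]
kappa = cohens_kappa(coder_a, coder_b)
```

Values above roughly 0.8 are conventionally read as strong agreement, consistent with the reported 0.84.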
4.2. System Effectiveness
The learning outcomes of the two groups were compared before and after system deployment. To assess differences in learning outcomes, pre- and post-tests were administered to both groups. Test results were reported on a 100-point scale and categorized into high (≥90), upper-middle (80–89), middle (60–79), and low (<60) score levels. To ensure the reliability of the survey instrument used to assess system acceptance and perceived learning gains, we calculated Cronbach’s α. The overall scale demonstrated high internal consistency (α = 0.88), indicating that the items reliably measured the intended constructs. The post-test results showed that the score of the experimental group (84.5 ± 6.8) was significantly higher than that of the control group (71.6 ± 7.9) (t = 6.42, p < 0.001). The calculated effect size (Cohen’s d) was 1.75, with a 95% confidence interval of [1.25, 2.25]. This represents a large effect according to Cohen’s conventions, indicating that the AI-integrated virtual museum substantially improved student performance compared with traditional instructional methods.
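The reported effect size can be reproduced from the summary statistics above (group means, SDs, and sizes). The sketch below uses the pooled-SD definition of Cohen’s d and the common large-sample approximation for its standard error; the exact CI formula used in the study is an assumption.

```python
import math

def cohens_d_ci(m1, s1, n1, m2, s2, n2, z=1.96):
    """Cohen's d with pooled SD and an approximate 95% confidence interval."""
    # Pooled standard deviation, weighting each group's variance by its df.
    sp = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    d = (m1 - m2) / sp
    # Large-sample standard error of d.
    se = math.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
    return d, (d - z * se, d + z * se)

# Summary statistics reported in Section 4.2
d, (lo, hi) = cohens_d_ci(84.5, 6.8, 41, 71.6, 7.9, 42)
# d ≈ 1.75, CI ≈ [1.24, 2.25], in line with the reported values
```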
Before system deployment, there was no statistically significant difference between the two groups (t = 0.394, p = 0.691 > 0.05) (Table 2). This suggests that the two groups were comparable in baseline cognitive level, providing a valid prerequisite for the subsequent analysis of system effectiveness.
The post-test results showed that the experimental group scored significantly higher than the control group (p < 0.01) (Table 3). This result indicates that the AI virtual museum system had a significant positive effect on students’ learning outcomes.
The distribution of scores underscores the system’s effect on learning outcomes (Table 4). In the post-test, 73.1% of students in the system-intervention group scored in the high or upper-middle levels, compared with only 41.6% in the non-intervention group. The control group remained primarily at the middle level (54.3%), whereas the experimental group demonstrated improved scores. These results indicate that the AI virtual museum system is effective in fostering integrative analytical abilities and higher-level understanding, rather than merely enhancing basic knowledge acquisition.
Based on the interaction logs automatically recorded by the system, correlation analysis was conducted between interaction event indicators and learning outcomes of the experimental group. Artifact viewing time was measured in minutes to capture the total time a user spent actively examining 3D artifacts. It reflects the system’s capacity to support deeper information acquisition. Participation in in-depth inquiry tasks was measured by the number of times students engaged with complex exploratory tasks. Scenario simulation completion was counted by the number of scenarios completed. Extended reading clicks were measured by the number of times a user accessed supplementary historical data. Multimedia interaction usage was measured by the frequency of engagement with audio, video, or interactive media elements. Discussion and feedback triggers were measured by the number of times students initiated communication or feedback within the system. The basic task completion rate was calculated as the percentage of predefined instructional objectives successfully achieved (89%).
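A minimal sketch of how per-student indicators of the kind listed above might be aggregated from raw event logs. The event names and the (student, event, value) log schema are assumptions for illustration, not the system’s actual backend format.

```python
from collections import defaultdict

# Hypothetical raw event log entries: (student_id, event_type, value)
events = [
    ("s01", "artifact_view_minutes", 12.5),
    ("s01", "scenario_completed", 1),
    ("s01", "deep_inquiry", 1),
    ("s02", "artifact_view_minutes", 8.0),
    ("s02", "extended_reading_click", 1),
    ("s01", "scenario_completed", 1),
]

def aggregate(events):
    """Sum event values per (student, indicator): counts for discrete events,
    total minutes for duration events."""
    totals = defaultdict(float)
    for student, event_type, value in events:
        totals[(student, event_type)] += value
    return dict(totals)

indicators = aggregate(events)
# e.g. indicators[("s01", "scenario_completed")] == 2.0
```

Each (student, indicator) total then becomes one observation in the correlation analysis against that student’s post-test score.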
All interaction event indicators were positively correlated with learning outcomes. The number of completed situational simulations (r = 0.59, p < 0.001) and the frequency of deep inquiry task engagement (r = 0.54, p < 0.001) exhibited the strongest correlations, suggesting that event-driven interaction was the primary driver of improved performance. Artifact viewing time (r = 0.48) and basic task completion rate (r = 0.50) supported deeper information acquisition and structured learning guidance. In contrast, extended reading behaviors and multimedia interaction usage showed weaker, though still stable and positive, correlations and functioned as auxiliary facilitators (Table 5).
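The coefficients in Table 5 are plain Pearson correlations, which can be computed without any external libraries. The indicator and score lists below are illustrative, not the study’s data.

```python
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Illustrative pairing: completed simulations vs. post-test score per student
simulations = [2, 5, 3, 6, 4]
scores = [70, 88, 75, 90, 82]
r = pearson_r(simulations, scores)
```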
To examine the factors contributing to the observed improvement in learning outcomes and to evaluate the effectiveness of the AI virtual museum, we conducted a thematic coding analysis of the transcripts from the 15 interviewed students in the experimental group. Through open coding and axial coding, three thematic categories were identified: acceptance of the virtual museum, perceived learning gains, and identified issues with suggested improvements (Table 6).
The thematic coding results were consistent with the statistical findings. Students emphasized immersion, improved comprehension, and motivation as key benefits, while pointing out technical and ergonomic challenges that require further refinement. The learning outcome improvements stemmed from the combined effects of the event-driven interactive feedback mechanism, the knowledge graph–guided learning mechanism, and the AIGC-based adaptive content generation mechanism, rather than from a single instructional factor or incidental intervention.
The analysis results of system operation data, interview feedback, and classroom observations demonstrated the system’s effectiveness in supporting learning. However, under high interaction intensity and complex virtual environments, engineering-level optimization is required for further system improvement.
4.3. Comparison of LLM and LLM + Knowledge Graph
To assess the necessity of the Knowledge Graph integration, we conducted a comparative analysis of the standalone ChatGLM-6B model and the proposed KG-augmented architecture, based on established empirical benchmarks for knowledge-intensive tasks [25].
While standalone LLMs demonstrate high linguistic fluency, they exhibit significant reliability problems (hallucinations) when applied to specific historical domains. Comparative studies using benchmarks such as the Benchmark for Fine-grained Automatic Evaluation of Hallucination and MoviE Text Audio Question and Answering show that vanilla LLMs typically achieve an accuracy of 62–68% on domain-specific fact-checking [25]. In contrast, architectures that integrate structured knowledge graphs through retrieval-augmented generation raise accuracy to 88–92% by grounding the generative process in verified factual triples [26]. The primary failure modes of standalone LLMs include temporal conflation and entity misattribution; for example, standalone models frequently conflate the diplomatic contexts of the First and Second Opium Wars or attribute the construction of the Humen Cannon to incorrect historical figures [26]. These errors stem from the probabilistic nature of LLMs, which prioritize plausible-sounding sentences over factual precision [27].
The system developed in this study mitigates these errors by injecting verified knowledge triples (e.g., ⟨Humen Cannon, Location, Humen Town⟩) directly into the prompt context. This ensures that the AI’s pedagogical responses remain factually grounded in the museum’s data. This approach reduces hallucination rates and enhances the model’s ability to handle zero-shot inquiries about local history that were not present in its original training corpus [27].
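The triple-injection step described above can be sketched as a simple retrieval-augmented prompt builder: verified triples are serialized into the prompt so that generation is constrained to museum-verified facts. The prompt template and triple list below are illustrative assumptions, not the system’s actual prompt.

```python
# Verified knowledge triples (subject, relation, object) from the museum's KG
TRIPLES = [
    ("Humen Cannon", "Location", "Humen Town"),
    ("Humen Cannon", "Era", "First Opium War"),
]

def build_prompt(question, triples):
    """Serialize verified triples into the prompt context so the LLM's answer
    stays grounded in museum-verified facts rather than parametric memory."""
    facts = "\n".join(f"- ({s}, {r}, {o})" for s, r, o in triples)
    return (
        "Answer using ONLY the verified facts below. "
        "If the facts are insufficient, say so.\n"
        f"Verified facts:\n{facts}\n"
        f"Question: {question}\n"
    )

prompt = build_prompt("Where is the Humen Cannon located?", TRIPLES)
```

In a full pipeline, a retrieval step would first select the triples relevant to the user’s question before the prompt is assembled and sent to the model.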