Article

Scaffolding Probabilistic Reasoning in Civil Engineering Education: Integrating AI Tutoring with Simulation-Based Learning

Department of Civil and Environmental Engineering, The Hong Kong University of Science and Technology, Hong Kong SAR, China
Educ. Sci. 2026, 16(1), 103; https://doi.org/10.3390/educsci16010103
Submission received: 9 December 2025 / Revised: 4 January 2026 / Accepted: 8 January 2026 / Published: 9 January 2026
(This article belongs to the Section Technology Enhanced Education)

Abstract

Undergraduate civil engineering students frequently struggle to transition from deterministic to probabilistic reasoning, a conceptual shift essential for modern structural design practice governed by reliability-based codes. This paper presents a design-based research (DBR) contribution and a theoretically grounded pedagogical framework that integrates AI-powered conversational tutoring with interactive simulations to scaffold this transition. The framework synthesizes cognitive load theory, scaffolding principles, self-regulated learning research, and threshold concepts theory. The design incorporates three novel elements: (1) a structured misconception inventory specific to structural reliability, derived from literature and expert elicitation, with each misconception linked to targeted intervention strategies; (2) an integration architecture connecting large language model tutoring with domain-specific simulations, where simulation states inform tutoring and misconception detection triggers targeted activities; and (3) a scaffolded module sequence building systematically from deterministic foundations through probability concepts to reliability analysis methods. Sequential modules progress from uncertainty recognition through Monte Carlo simulation and design applications. We provide technical specifications for the implementation of AI tutoring, including prompt engineering strategies, accuracy safeguards that address known limitations of large language models (LLMs), and protocols for escalation to human instructors. An assessment framework specifies concept inventory items, process measures, and practical competence tasks. Ultimately, this paper provides testable conjectures and identifies conditions under which the framework might fail, structuring subsequent empirical validation with student participants following institutional ethics approval.

1. Introduction

1.1. The Education Challenge

Structural reliability and uncertainty quantification represent essential competencies for contemporary civil engineering practice. Modern design codes increasingly incorporate probabilistic concepts through load and resistance factor design approaches (e.g., AASHTO LRFD (Zokaie, 2000) and Eurocode (Low & Phoon, 2015)), requiring engineers to reason about variability, probability distributions, and acceptable risk levels rather than relying solely on deterministic safety factors (Faber, 2005; Tu et al., 1999; J. Zhang & Taflanidis, 2019, 2020). Yet undergraduate civil engineering students frequently struggle to develop the probabilistic reasoning skills these approaches demand.
This difficulty reflects a fundamental conceptual threshold in engineering education. Students enter structural engineering courses with well-developed intuitions about deterministic analysis (Fernández-Sánchez & Millán, 2013; Romero & Museros, 2002), where inputs produce single outputs and safety margins appear straightforward. The transition to probabilistic reasoning requires re-conceptualizing these foundational assumptions: material strengths become random variables rather than fixed values, loads exhibit variability that must be modeled statistically, and safety transforms from a binary condition into a probability that demands explicit quantification. This conceptual shift proves challenging because it requires students to fundamentally restructure their understanding of structural behavior and engineering judgment (Kaplar et al., 2021). The transition to probabilistic thinking requires students to accept a form of shared epistemic limitation, recognizing that uncertainty about future loads and material properties is not a personal knowledge gap to be filled but a fundamental condition shared by all practitioners. Research in developmental psychology suggests that understanding such common ignorance (knowing that everyone lacks certain knowledge) represents a sophisticated cognitive achievement (Liu et al., 2024).
Research on threshold concepts in higher education illuminates why such transitions prove so difficult (Barradell, 2013; Meyer & Land, 2006). Once students genuinely grasp that all structures exist on a continuum of reliability rather than in binary safe or unsafe states, they cannot return to purely deterministic thinking, and this understanding integrates across multiple aspects of structural engineering practice. However, the troublesome nature of this threshold means many students remain in liminal states of partial understanding, able to execute probabilistic calculations without genuine conceptual transformation.
This educational gap is particularly evident at the second-year level of civil engineering programs (Moss, 2011). By the second year, students typically possess foundational knowledge in mechanics and basic statistics, but they have had little exposure to the concepts of uncertainty, probability distributions, or risk assessment in an engineering context. They are accustomed to problems with a single correct answer, for example, using a single set of loads to compute a single bending stress. Thus, they may struggle when confronted with the idea of thinking in terms of uncertainty or reliability rather than absolute certainty (Perlman et al., 2014).
To address this educational gap, this paper proposes a pedagogical framework that integrates AI-driven chatbot tutoring with interactive simulation tools for teaching structural reliability concepts. The framework leverages an AI chatbot as a personal tutor that can guide students through complex probabilistic ideas in a conversational manner, while simulation modules provide visual, hands-on experience with randomness and safety margin calculations. This dual approach is designed to make abstract concepts more concrete: students can observe the effects of uncertainty through simulations and discuss the underlying theory with an interactive tutor. Note that this paper provides a theoretically grounded design framework; empirical validation is identified as essential future work.

1.2. Design-Based Research Positioning, Contributions, and Paper Organization

This paper is positioned as a design-based research (DBR) contribution, specifically representing the design rationale and specification phase of an iterative research program (Bakker, 2018) rather than empirical evaluation. Understanding this positioning is essential for evaluating the paper’s claims and contributions appropriately. We present a theoretically grounded pedagogical framework and associated design specifications intended as a testable blueprint for subsequent implementation and evaluation studies.
Design-based research (DBR) provides the methodological foundation for this work. DBR is an established paradigm in the learning sciences and educational technology in which theory and design are developed in tandem, typically through iterative cycles, to produce both usable interventions and testable theoretical conjectures (Bakker, 2018). Importantly for publication, the DBR literature recognizes multiple legitimate publication models (or contribution types) across the life-cycle of an intervention. Commonly accepted models include (Design-Based Research Collective, 2003; Plomp & Nieveen, 2013; Wang & Hannafin, 2005): (i) design rationale/principles papers that synthesize theory into conjectures and actionable design specifications; (ii) design-and-development reports that document the artifact, implementation constraints, and anticipated failure modes; (iii) iterative refinement papers reporting successive design cycles and revisions; and (iv) empirical evaluation studies that test outcomes and mechanisms in authentic settings. The present manuscript is intentionally positioned as a contribution of the first two types. It specifies a coherent instructional design and an implementable AI+simulation integration, while reserving claims about learning effectiveness for a subsequent empirical phase.
The specific contributions are threefold. First, we provide a domain-specific analysis of threshold concept challenges in structural reliability education. While threshold concepts have been studied across disciplines (Barradell, 2013; Meyer & Land, 2006), we offer the first systematic mapping of how probabilistic reasoning constitutes a threshold concept specifically within structural engineering, identifying characteristic misconceptions unique to this domain.
Second, we present a theoretically grounded integration architecture combining large language model tutoring with domain-specific simulations. The novelty lies not in either component individually, as intelligent tutoring systems and simulation-based learning are well-established. Instead, the novelty is in the specific integration mechanisms: how simulation states inform tutoring prompts, how misconception detection triggers targeted simulation activities, and how the two modalities complement each other’s limitations. This integration architecture, detailed in Section 4.4, addresses documented shortcomings of simulations without guidance (Lane & Peres, 2006) and LLMs without domain grounding (Kasneci et al., 2023).
Third, we develop a structured misconception inventory for structural reliability, derived from literature synthesis and expert elicitation. This inventory provides both a pedagogical resource for instructors and a specification for automated misconception detection, with each entry linked to targeted intervention strategies.
While existing conversation-based intelligent tutoring systems (ITS) have long provided adaptive support, many established approaches such as cognitive tutors (Anderson et al., 1995; Koedinger & Corbett, 2001), constraint-based tutors (Mitrovic et al., 2013), and AutoTutor (Graesser et al., 2004; Nye et al., 2014) typically depend on tightly specified domain models and/or structured student inputs and are often optimized for step-level problem solving (correctness-based feedback, mastery estimation, and hint sequences). In contrast, this manuscript does not claim a new ITS implementation; its contribution is primarily pedagogical: a domain-specific set of design principles and dialogue patterns for helping students cross the deterministic-to-probabilistic threshold in structural reliability. Pedagogically, we advance beyond prior ITS-oriented approaches in three respects: (i) bidirectional simulation–tutoring integration, where evolving simulation states drive tutor prompts and tutor dialogue directs subsequent simulation actions; (ii) an explicit grounding in threshold concepts (supporting liminal reasoning, productive failure (Kapur, 2016), and integrative understanding rather than only incremental skill acquisition toward a predefined mastery target); and (iii) metacognitive and epistemic scaffolding (Chi & Wylie, 2014) that goes beyond self-monitoring during problem solving to include confidence calibration for probabilistic claims, articulation of uncertainty, and reflection on how evidence and judgment interact in engineering reliability decisions.
The remainder of this paper is organized as follows. Section 2 presents the theoretical framework grounding our pedagogical design. Section 3 reviews relevant literature on structural reliability education and the use of simulations and AI in teaching. Section 4 details the pedagogical framework design, including learning objectives and scaffolding approach. Section 5 presents the teaching modules with example interactions. Section 6 provides illustrative learning scenarios. Section 7 discusses anticipated benefits, limitations, and implementation considerations. Section 8 concludes with a summary and directions for future empirical validation.

2. Theoretical Framework

The pedagogical framework developed in this paper integrates multiple complementary theoretical perspectives. Together, these perspectives guide both the instructional design and the integration of AI-enhanced tools for simulation and tutoring.

2.1. Cognitive Load Theory

Cognitive load theory provides a perspective for understanding the challenges students face when learning probabilistic structural analysis. Sweller and colleagues distinguish between intrinsic cognitive load, which arises from the inherent complexity of the material and the learner’s prior knowledge, and extraneous cognitive load, which results from suboptimal instructional design (Sweller, 2010; Sweller et al., 2019). Probabilistic reliability analysis presents substantial intrinsic load because students must simultaneously manage statistical concepts, structural behavior, code provisions, and the interactions among these elements.
The scaffolded module sequence adopted in this framework addresses intrinsic load by progressively decomposing complex probabilistic reasoning into manageable components. Students first consolidate their understanding of deterministic design before uncertainty quantification adds further layers of complexity. This sequencing aligns with the isolated elements principle, which focuses attention on individual concepts before requiring their integration (Pollock et al., 2002).

2.2. Multimedia Learning Principles

The interactive simulations reduce extraneous load by externalizing abstract probabilistic relationships into visual representations. Rather than requiring students to mentally construct probability distributions and their implications for structural capacity, the simulations render these concepts visible and manipulable. This approach reflects the multimedia learning principles (Mayer, 2005; Plass & Kalyuga, 2019), suggesting that people learn better from words and pictures together than from words alone, and that related visual and verbal information should be presented in close proximity. The simulation interface integrates explanatory text with dynamic graphical displays, enabling students to observe immediate consequences of parameter changes.

2.3. Zone of Proximal Development and Scaffolding

The AI chatbot component draws upon the zone of proximal development (Vygotsky, 1978), which describes the conceptual space between what learners can accomplish independently and what they can achieve with appropriate guidance. The chatbot functions as a more knowledgeable other, providing adaptive support calibrated to individual student needs as revealed through dialogue.
When students express misconceptions or demonstrate incomplete understanding, the chatbot offers targeted hints, poses Socratic questions, or provides worked examples (Renkl, 2014) depending on the nature and severity of the difficulty. This responsiveness reflects principles of contingent scaffolding, in which support adapts dynamically rather than following predetermined sequences (Belland, 2014; Wood et al., 1976).

2.4. Self-Regulated Learning

Self-regulated learning theory (Panadero, 2017; Zimmerman, 2002) emphasizes the importance of metacognition, strategic planning, and monitoring for academic success. The chatbot incorporates metacognitive prompts that encourage students to reflect on their reasoning processes, articulate their current understanding, and identify areas of uncertainty.
Rather than simply providing answers, the chatbot poses questions such as “What do you think would happen if the coefficient of variation increased?” or “Can you explain why the reliability index changed in that direction?” These prompts develop transferable self-regulation skills alongside domain-specific knowledge. Research on intelligent tutoring systems suggests that systems incorporating metacognitive support produce greater learning gains than those focused solely on content delivery (Azevedo & Hadwin, 2005; VanLehn, 2011).

2.5. Threshold Concepts

Finally, threshold concepts theory informs understanding of why probabilistic structural reliability proves particularly challenging for many students. Meyer and Land (2006) describe threshold concepts as transformative ideas that are troublesome, irreversible, and integrative. Probabilistic thinking in structural engineering may constitute such a threshold concept because it requires students to abandon the reassuring certainty of deterministic calculations and embrace inherent uncertainty as fundamental rather than as error to be eliminated (Kaplar et al., 2021). Students who have not crossed this threshold may resist probabilistic approaches or treat safety factors as adequate substitutes for explicit reliability analysis. The pedagogical framework addresses this threshold by making probabilistic consequences visible through simulation, by providing a safe space for exploration through the patient chatbot tutor, and by connecting probabilistic concepts to engineering decisions.

2.6. Integration of Different Theories

The theoretical perspectives presented above are complementary rather than competing, each addressing distinct aspects of the learning challenge. Cognitive load theory and multimedia learning principles operate at the micro-level, informing moment-to-moment instructional decisions about content presentation and sequencing. Scaffolding, the zone of proximal development (ZPD), and self-regulated learning (SRL) are closely related, but serve distinct design roles. ZPD identifies the appropriate challenge level for productive learning. It motivates keeping tasks and prompts in the range where students cannot yet succeed independently but can succeed with appropriate assistance. Scaffolding describes the mechanism through which support is dynamically adjusted within this zone, gradually fading as competence increases. SRL addresses the development of learner autonomy over longer timescales, shifting responsibility from external scaffolding to internal self-monitoring. In short, ZPD informs where support should be aimed, scaffolding informs how support is delivered and faded, and SRL informs how learners eventually take over these regulatory functions themselves. Finally, threshold concepts theory operates at the macro-level, characterizing the nature of the conceptual transformation students must achieve and explaining why this transformation proves troublesome.

2.7. Educational Quality as Design Target

In this paper, we use the term educational quality as a design target rather than an empirically established outcome. Here, educational quality refers to the degree to which an instructional design aligns learning objectives, activities, and assessments; plausibly supports conceptual understanding and transfer; manages cognitive demands by minimizing avoidable extraneous load; and maintains instructional integrity when AI support is used. Given the DBR nature of the present manuscript, these dimensions remain subject to empirical validation.

3. Background and Literature Review

Having established the theoretical foundations that inform our pedagogical approach, we now review the relevant literature to situate our framework within existing research and identify gaps that motivate our design decisions.

3.1. Structural Reliability

Structural reliability theory provides the mathematical foundation for modern limit state design approaches (Ang & Tang, 1975; Haldar & Mahadevan, 2000; Melchers & Beck, 2018). The fundamental premise is that both structural resistance and applied loads are random variables characterized by probability distributions rather than deterministic quantities. Failure occurs when the load effect exceeds the structural resistance, and the probability of this event is the probability of failure. Estimating failure probabilities in complex structural systems has become an active research area (Tu et al., 1999; J. Zhang & Taflanidis, 2019, 2020). Contemporary design codes incorporate these reliability concepts through calibrated load and resistance factors. Engineers applying these codes may not perform explicit reliability calculations, yet the underlying philosophy assumes probabilistic behavior of structural parameters. This disconnect between codified practice and fundamental understanding creates educational challenges, as students may learn to apply factors without appreciating their probabilistic basis.

3.2. Simulation-Based Learning in Engineering Education

Simulation-based learning has emerged as an effective strategy for addressing the challenges of teaching abstract concepts (Davidovitch et al., 2006). Interactive simulations allow students to manipulate parameters, observe outcomes, and develop intuition through repeated experimentation.
Research has demonstrated that simulation-based approaches can improve student understanding of probabilistic concepts across various engineering disciplines (Batanero & Álvarez-Arroyo, 2024; Koparan, 2019; Sheikh, 2024). Students who engage with interactive simulations tend to develop stronger conceptual understanding and demonstrate improved ability to transfer knowledge to new problems. The visual feedback provided by simulations helps students connect mathematical formulations to physical phenomena.
The effectiveness of simulation-based learning depends on appropriate pedagogical design. Simulations must be carefully scaffolded to guide students through increasingly complex concepts without overwhelming them with unnecessary features (De Jong, 2010). Learning objectives should drive simulation design, and activities should prompt students to make predictions, test hypotheses, and reflect on observations.

3.3. Artificial Intelligence in Engineering Education

The application of artificial intelligence (AI) to education has a substantial research history. Early intelligent tutoring systems used rule-based approaches to model student knowledge and provide targeted feedback. Contemporary AI systems, particularly those based on large language models (LLMs), offer unprecedented capabilities for natural language interaction and personalized instruction (Naveed et al., 2025). AI chatbot tutors can engage students in Socratic dialogue, responding to questions with guiding questions that promote deeper thinking. These systems can identify common misconceptions based on student responses and provide targeted explanations. Unlike human instructors who must divide attention among many students, AI tutors can provide individualized attention to each learner simultaneously (Baidoo-Anu & Ansah, 2023; Kasneci et al., 2023). Recent research provides promising evidence for AI tutoring effectiveness: Kestin et al. (2025) found that students using an AI tutor learned significantly more efficiently than those in standard active-learning classrooms.
However, the deployment of AI tutors in engineering education requires careful consideration of several limitations (Elsayed, 2024). Large language models can produce confident but incorrect responses, a phenomenon known as “hallucination”, which poses particular risks in technical domains where erroneous information about structural behavior or safety factors could reinforce dangerous misconceptions (Cheng et al., 2025; J. Zhang et al., 2020; Z. Zhang et al., 2025). Unlike errors in a history essay, miscalculations in engineering carry potential real-world consequences, making accuracy verification essential (Sriramanan et al., 2024). Additionally, researchers have raised concerns about student over-reliance on AI assistance, which may undermine the development of independent problem-solving skills and professional judgment that engineers must exercise when AI tools are unavailable or inappropriate (Bastani et al., 2025; Kazemitabaar et al., 2024). These considerations suggest that AI tutors are most effective when positioned as supplements to, rather than replacements for, human instruction and when designed with appropriate safeguards for technical accuracy (J. Zhang et al., 2021).

3.4. Existing Approaches to Structural Reliability Education

The dominant approach to structural reliability education relies on textbooks that present mathematical foundations through traditional expository instruction (Ang & Tang, 1975; Haldar & Mahadevan, 2000; Melchers & Beck, 2018). These texts provide rigorous mathematical treatment but share common pedagogical limitations: concepts are presented in finished form rather than developed through exploration, students encounter formulas before developing intuition for underlying phenomena, and practice problems emphasize calculation over conceptual reasoning.
Software tools for structural reliability analysis exist but are designed for academic or engineering practice rather than education, such as UQlab (Marelli & Sudret, 2014) and UQpy (Olivier et al., 2020). These professional tools share a common limitation for educational purposes: they are designed for users who already understand reliability concepts and need computational power, not for undergraduate students who are developing an initial understanding. Some educational adaptations have been developed (Vrouwenvelder, 1997) but without pedagogical scaffolding.

3.5. Evidence from Comparable Domains

Given the limited empirical research on structural reliability pedagogy, we draw on evidence from domains with analogous conceptual challenges.
Statistics education research documents persistent difficulties with probabilistic reasoning. Garfield and Ahlgren (1988) identified misconceptions, including beliefs that random samples should mirror population characteristics and confusion between correlation and causation. Lane and Peres (2006) found that simulation-based learning without guidance often fails: students focus on surface features rather than underlying concepts. This finding directly motivates our integration of chatbot tutoring with simulation, providing the structured guidance that transforms exploration into learning. Research has examined how constructivist strategies (Reeves et al., 2010) and frequentist representations (Sedlmeier & Gigerenzer, 2001) can improve probabilistic reasoning.
Physics education research offers methodological models for assessing conceptual change. The PhET simulation project demonstrated that effective educational simulations share key features: bounded exploration environments, immediate visual feedback, and visualization of otherwise invisible phenomena (Wieman & Perkins, 2005). Our simulation designs apply these principles. The Force Concept Inventory (Hestenes et al., 1992) demonstrated that traditional instruction often leaves intuitive misconceptions intact, reinforcing our attention to misconception detection and targeted intervention.
These domains differ from structural reliability in important ways; in particular, structural reliability concerns statistical properties of structural systems that students cannot observe directly. While existing findings therefore cannot be transferred wholesale, they nevertheless inform the design of our approach.

4. Pedagogical Framework Design

The literature review reveals both the need for improved approaches to reliability education and the potential of simulation and AI tutoring. Building on these foundations, the following section presents the detailed pedagogical framework design, translating theoretical principles into instructional specifications.

4.1. Learning Objectives and Competency Outcomes

The proposed pedagogical framework is organized around specific learning objectives that align with outcomes-based education (OBE) principles. These objectives define what students should know and be able to do upon completing the teaching modules, providing a foundation for instructional design and assessment.
The first category addresses conceptual understanding of uncertainty in structural engineering. Students should be able to explain why structural loads and material properties exhibit variability rather than taking fixed values, and recognize different sources of uncertainty (Der Kiureghian & Ditlevsen, 2009), including inherent randomness (aleatory uncertainty) (Peng & Zhang, 2025), measurement limitations (Gu et al., 2026), and model approximations (epistemic uncertainty) (Jakeman et al., 2010). They should understand the limitations of deterministic safety factors and the motivation for probabilistic design approaches.
The second category concerns the mathematical foundations of reliability analysis (Ang & Tang, 1975). Students should demonstrate the ability to characterize random variables using probability distributions, calculate basic statistics including mean and standard deviation, and interpret probability density functions. They should understand the concept of a limit state function separating safe and failure regions. Students should be able to compute simple reliability indices and interpret their meaning in terms of failure probability.
The third category involves computational skills for uncertainty analysis (Marelli & Sudret, 2014). Students should demonstrate proficiency in implementing basic Monte Carlo simulations to estimate failure probability, interpret simulation results, assess convergence, and use simulation tools to explore sensitivity to parameter changes.
The fourth category addresses professional judgment and decision-making. Students should evaluate acceptable levels of risk in different structural contexts and recognize the societal and ethical dimensions of reliability-based design. They should appreciate how design codes incorporate reliability concepts and apply probabilistic thinking to realistic engineering scenarios.
These objectives align with attributes emphasized by engineering accreditation bodies, including technical competence, problem-solving ability, and responsibility.

4.2. Conceptual Progression and Scaffolding

The framework employs a carefully scaffolded progression that guides students from familiar deterministic concepts toward probabilistic thinking, as prefaced in Section 2. This progression recognizes that students cannot immediately abandon deterministic mental models but must gradually construct new frameworks through supported exploration. Table 1 summarizes the five stages of conceptual progression.
The initial stage connects probabilistic concepts to student experiences. Students already understand that measurement involves uncertainty and that manufactured products exhibit variability. The teaching modules begin by helping students recognize these familiar phenomena as manifestations of randomness that can be characterized mathematically. Simple examples involving measurement of beam dimensions or concrete strength testing establish relevance before introducing formal probability concepts.
The second stage introduces random variables and probability distributions as mathematical tools for describing variability. Students learn to characterize observations using histograms and progress to continuous distributions. The normal distribution receives particular attention given its importance in structural reliability, though students also encounter other distributions relevant to loads and resistance. Interactive simulations enable students to generate samples, construct histograms, and observe how sample statistics approach their population counterparts.
The third stage presents the reliability problem as a natural extension of structural mechanics. Students who have analyzed beams under deterministic loads now consider what happens when load and resistance are both random. The limit state function provides a bridge between mechanics knowledge and reliability concepts:
$$g(R, S) = R - S$$
where R represents the structural resistance (capacity) and S represents the load effect (demand). The limit state function defines the boundary between safe and unsafe performance: when g > 0, the structure’s capacity exceeds the demand, and the structure survives; when g ≤ 0, the demand meets or exceeds the capacity, indicating failure. This simple formulation extends directly from the deterministic design inequality R > S that students have already mastered, reframing it as a continuous function whose sign indicates structural adequacy. By treating R and S as random variables rather than fixed values, the familiar capacity-demand comparison becomes a probabilistic problem: what is the probability that the demand will exceed the capacity?
The fourth stage introduces computational methods for reliability estimation. Monte Carlo simulation is presented as a straightforward approach that students can understand and implement. The failure probability is estimated as:
$$\hat{P}_f = \frac{1}{N} \sum_{i=1}^{N} I\big[g(\mathbf{X}_i) \le 0\big]$$
where N is the number of samples, X_i is the i-th sample vector of the random variables, and I[·] is the indicator function. Students observe how repeated sampling yields estimates of failure probability and develop intuition for convergence and sample size requirements.
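To make the estimator concrete, the following minimal Python sketch implements it for independent, normally distributed resistance and load; the parameter values (mu_R, sigma_R, mu_S, sigma_S) are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical illustrative parameters (kN): resistance R and load effect S,
# modeled as independent normal random variables.
mu_R, sigma_R = 500.0, 50.0
mu_S, sigma_S = 300.0, 60.0

N = 100_000                  # number of Monte Carlo samples
R = rng.normal(mu_R, sigma_R, N)
S = rng.normal(mu_S, sigma_S, N)

g = R - S                    # limit state function g(R, S) = R - S
P_f_hat = np.mean(g <= 0)    # sample mean of the indicator I[g <= 0]

print(f"Estimated failure probability: {P_f_hat:.5f}")
```

Rerunning the sketch with different seeds or sample sizes makes the convergence behavior described above directly observable.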
In addition, the reliability index, commonly denoted β, provides a practical measure of safety: the distance between the mean safety margin and the failure threshold, normalized by the standard deviation of the safety margin. For the simple case where resistance R and load S are both normally distributed and independent, the reliability index can be expressed as follows:
$$\beta = \frac{\mu_R - \mu_S}{\sqrt{\sigma_R^2 + \sigma_S^2}}$$
where μ_R and μ_S are the mean values of resistance and load, respectively, and σ_R and σ_S are their standard deviations.
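Because R and S in this special case are independent and normal, the exact failure probability is Φ(−β), giving students a closed-form benchmark against which to validate the Monte Carlo estimate. A short sketch, reusing the hypothetical parameters from the previous example:

```python
import numpy as np
from scipy.stats import norm

mu_R, sigma_R = 500.0, 50.0   # same hypothetical parameters as above
mu_S, sigma_S = 300.0, 60.0

beta = (mu_R - mu_S) / np.sqrt(sigma_R**2 + sigma_S**2)
P_f_exact = norm.cdf(-beta)   # exact only for independent normal R and S

print(f"beta = {beta:.3f}, P_f = {P_f_exact:.5f}")   # beta ~ 2.56, P_f ~ 0.0052
```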
The fifth stage connects classroom concepts to professional practice. Students examine how design codes incorporate reliability concepts through calibrated factors. Case studies of structural failures illustrate the consequences of inadequate reliability consideration. Discussion of acceptable risk levels prompts reflection on the societal responsibilities of structural engineers.

4.3. Role of the AI Chatbot Tutor

The AI chatbot tutor serves as an always-available learning companion that supplements simulation-based activities with responsive dialogue. The chatbot is designed to fulfill several pedagogical functions that together create a comprehensive support system for student learning.
The Explanatory function provides on-demand clarification of concepts. When students encounter unfamiliar terms or confusing ideas, they can query the chatbot for an explanation. Unlike static help documentation, the chatbot can tailor explanations to the specific context of student questions and follow up with additional details based on queries.
The Socratic function guides student thinking through strategic questioning. Rather than simply providing answers, the chatbot can respond to student questions with prompts that encourage deeper analysis. When students make predictions about simulation outcomes, the chatbot can ask them to explain their reasoning. When observations contradict predictions, the chatbot can guide students to resolve the discrepancy through further exploration.
The Diagnostic function identifies and addresses misconceptions. The chatbot is designed to recognize common student errors and provide targeted feedback. If a student demonstrates confusion between probability and frequency, or misinterprets the meaning of the reliability index, the chatbot can offer specific clarification and suggest activities that address the misconception.
The Motivational function maintains student engagement throughout learning activities. The chatbot can offer encouragement, suggest breaks when appropriate, and help students recognize progress.
The Metacognitive function promotes students’ reflection on their own learning. The chatbot can prompt students to summarize what they have learned, identify remaining questions, and consider how new knowledge connects to prior understanding.
Figure 1 illustrates how these five pedagogical functions map to the underlying learning theories discussed in Section 2.

4.4. AI Chatbot Tutor: Architecture, Implementation, and Safeguards

The AI chatbot tutor serves as the pedagogical bridge between simulation experiences and conceptual understanding, providing individualized guidance that responds to each student’s developing comprehension. This section details the technical architecture, prompt engineering strategies, misconception handling, accuracy safeguards, and escalation protocols that enable effective and reliable tutoring.

4.4.1. System Architecture and Model Selection

The chatbot tutor is implemented using a large language model (LLM) accessed through API integration with the learning management system and simulation environment. Our reference implementation uses GPT-4 (OpenAI, 2024), selected based on demonstrated performance in educational contexts (Kestin et al., 2025) and strong performance on engineering reasoning tasks. The architecture is designed to be model-agnostic, allowing substitution of alternative models (e.g., Claude, Gemini, or open-source alternatives such as Llama) as capabilities evolve and institutional requirements dictate.
The system architecture comprises four integrated components:
  • Context management layer: maintains conversational history, current module position, student performance data, and active simulation state. This layer ensures the LLM receives relevant context for generating appropriate responses while managing token limitations through selective context compression.
  • Prompt engineering layer: constructs prompts that combine system instructions, domain knowledge, pedagogical directives, the current context, and student input, implementing the instructional strategies detailed in Section 4.4.2.
  • Response processing layer: filters and validates LLM outputs before presentation to students, implementing the accuracy safeguards detailed in Section 4.4.4, including calculation verification, claim validation, and confidence assessment.
  • Escalation management layer: monitors interaction patterns and response quality to identify situations requiring human instructor intervention, with protocols detailed in Section 4.4.5.
Figure 2 presents the system architecture schematically, showing data flow between components.
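As an illustration of how the four layers might compose at run time, the following self-contained Python sketch wires them into a single request-handling function. All names, the validation rule, and the stubbed LLM call are hypothetical simplifications, not the reference implementation.

```python
from dataclasses import dataclass, field

@dataclass
class TutorContext:
    """State held by the context management layer (all names hypothetical)."""
    history: list = field(default_factory=list)    # compressed conversation turns
    module_id: str = "module-3"                    # current module position
    sim_state: dict = field(default_factory=dict)  # active simulation parameters

def build_prompt(student_input: str, ctx: TutorContext) -> str:
    """Prompt engineering layer: combine system, module, and dynamic context."""
    return (f"[system: Socratic structural-reliability tutor]\n"
            f"[module: {ctx.module_id}]\n"
            f"[simulation state: {ctx.sim_state}]\n"
            f"[student]: {student_input}")

def call_llm(prompt: str) -> str:
    """Placeholder for a model-agnostic LLM API call (stubbed for the sketch)."""
    return "What do you expect to happen to the overlap region if sigma_S grows?"

def validate_response(raw: str) -> tuple[str, bool]:
    """Response processing layer: flag outputs that fail simple checks."""
    needs_review = "definitely safe" in raw.lower()  # prohibited definitive claim
    return raw, needs_review

def handle_student_turn(student_input: str, ctx: TutorContext) -> str:
    prompt = build_prompt(student_input, ctx)
    response, needs_review = validate_response(call_llm(prompt))
    if needs_review:  # escalation management layer: defer to the instructor
        response = "Let me check that with your instructor before answering."
    ctx.history.append((student_input, response))
    return response

ctx = TutorContext(sim_state={"mu_R": 500.0, "sigma_S": 60.0})
print(handle_student_turn("Is my beam safe with SF = 2.0?", ctx))
```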

4.4.2. Prompt Engineering for Domain Accuracy and Pedagogical Effectiveness

Effective prompt engineering, including chain-of-thought strategies (Wei et al., 2022), is essential for ensuring the chatbot provides accurate domain content while implementing sound pedagogical strategies. Our approach employs a hierarchical prompt structure with four layers.
The first layer establishes system identity and constraints. The base system prompt defines the tutor’s identity, domain boundaries, and fundamental constraints. The tutor is instructed to guide students toward understanding through Socratic questioning rather than direct answers, use terminology consistent with standard textbooks (Ang & Tang, 1975) and relevant design codes, acknowledge uncertainty when questions exceed reliable knowledge, and refrain from providing solutions to graded assessments.
The second layer provides module-specific knowledge and objectives. Each module includes specific prompts that focus the tutor on relevant concepts, common difficulties, and learning objectives. For example, the Module 2 prompt specifies key concepts including random variables, probability distributions for load and resistance, and statistical parameters. It also identifies common student difficulties, such as confusing population parameters with sample statistics and believing larger samples eliminate all uncertainty.
The third layer addresses misconception recognition and response templates. The prompt layer includes specific guidance for recognizing and addressing documented misconceptions. When a student expresses a statement matching a catalogued misconception pattern, the system activates the corresponding response strategy. Importantly, the chatbot does not explicitly inform students that they hold a misconception; rather, it guides them through experiences and questions designed to create cognitive conflict and motivate conceptual revision. For example, if a student asks “So if the safety factor is 2.0, the structure is definitely safe?” the system recognizes the deterministic interpretation of a probabilistic concept and responds with a Socratic question: “That’s a really important question. What do you think would happen if we tested many structures all designed with SF = 2.0?”
The fourth layer incorporates dynamic context from student interaction, including completed modules, current simulation state, recent performance, session duration, and compressed conversation history. This contextual information enables the tutor to provide responses tailored to each student’s specific situation and learning trajectory.
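The sketch below illustrates how the four prompt layers might be concatenated in practice; the layer contents are abbreviated paraphrases of the directives above, and the structure is illustrative rather than the complete templates.

```python
# Hypothetical sketch of the four-layer prompt hierarchy; the strings are
# abbreviated paraphrases, not the production templates.
SYSTEM_LAYER = (
    "You are a structural reliability tutor. Guide students with Socratic "
    "questions; never solve graded assessments; acknowledge uncertainty when "
    "a question exceeds reliable knowledge."
)

MODULE_LAYERS = {
    "module-2": (
        "Focus: random variables, load/resistance distributions, statistics. "
        "Common difficulties: confusing sample statistics with population "
        "parameters; believing larger samples eliminate all uncertainty."
    ),
}

MISCONCEPTION_LAYER = (
    "If the student treats a safety factor as a guarantee, do not correct them "
    "directly; ask what would happen across many structures with SF = 2.0."
)

def assemble_prompt(module_id: str, dynamic_context: str, student_input: str) -> str:
    """Concatenate the four layers in fixed order (structure illustrative)."""
    return "\n\n".join([
        SYSTEM_LAYER,              # layer 1: identity and constraints
        MODULE_LAYERS[module_id],  # layer 2: module-specific knowledge
        MISCONCEPTION_LAYER,       # layer 3: misconception response guidance
        dynamic_context,           # layer 4: dynamic session context
        f"Student: {student_input}",
    ])

print(assemble_prompt("module-2",
                      "Completed: Module 1. Simulation: N = 1000 samples drawn.",
                      "Doesn't a bigger sample remove the uncertainty?"))
```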
Complete prompt templates and implementation specifications are provided in Appendix A.

4.4.3. Misconception Inventory: Systematic Identification and Encoding

The framework incorporates a systematic inventory of structural reliability misconceptions, developed through three complementary approaches. Literature-derived misconceptions identified foundational probability misconceptions that transfer to reliability contexts. A review of engineering education literature identified domain-specific misconceptions related to safety factors, design codes, and structural behavior. Expert elicitation through structured interviews with multiple instructors teaching structural reliability courses at different institutions identified recurring student difficulties, including common errors in student work, persistent questions, and conceptual barriers observed over multiple course offerings. Additional misconception patterns emerged during pilot interaction analysis with graduate teaching assistants role-playing as students during early chatbot prototype testing.
Table 2 presents the structured misconception inventory organized by module and conceptual category.
The misconception inventory is encoded in the prompt engineering layer through pattern-matching heuristics and explicit response templates. The inventory is designed for ongoing refinement: interaction logs are periodically reviewed to identify misconceptions not captured in the initial inventory, and new entries are added following the same structured format.
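As a minimal sketch of this encoding, inventory entries might be represented as pattern–strategy pairs as follows; the identifiers, regular expressions, and strategies shown are hypothetical examples, and a production system would combine such heuristics with LLM-based classification.

```python
import re

# Hypothetical encoding of two inventory entries as pattern + strategy pairs.
MISCONCEPTION_PATTERNS = [
    {
        "id": "SF-as-guarantee",
        "pattern": re.compile(
            r"safety factor .*(definitely|guarantee[sd]?|always) safe",
            re.IGNORECASE),
        "strategy": "socratic",
        "prompt_hint": "Ask about outcomes across many nominally identical structures.",
    },
    {
        "id": "large-N-removes-uncertainty",
        "pattern": re.compile(
            r"(enough|more|bigger) samples?.*\b(no|removes?|eliminates?)\b.*uncertainty",
            re.IGNORECASE),
        "strategy": "simulation-activity",
        "prompt_hint": "Direct the student to rerun the sampler and compare spreads.",
    },
]

def detect_misconceptions(utterance: str) -> list[str]:
    """Return ids of inventory entries whose pattern matches the utterance."""
    return [m["id"] for m in MISCONCEPTION_PATTERNS if m["pattern"].search(utterance)]

print(detect_misconceptions("So a safety factor of 2.0 means it is definitely safe?"))
# -> ['SF-as-guarantee']
```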

4.4.4. Accuracy Safeguards and Error Prevention

Given that LLMs can generate plausible but incorrect content (“hallucination”), safeguards are essential for educational applications. Our framework implements multiple layers of protection.
The calculation verification module ensures that numerical calculations related to structural reliability are not performed by the LLM directly. Instead, a dedicated calculation module handles all quantitative operations. When the chatbot recognizes a calculation request, it extracts parameters and passes them to a verified calculation engine that implements validated algorithms for reliability index computation, probability calculations, distribution operations, and related functions. Results are returned to the chatbot for integration into natural language responses, with all calculations including automatic unit checking and range validation. For example, reliability indices are checked against the range typical of structural applications (approximately 2.0 ≤ β ≤ 5.0), and values outside this range trigger warnings.
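The following sketch illustrates this division of labor: a small, verified Python function computes β with input and range validation, so the LLM only narrates results it did not itself compute. The function name and warning thresholds are illustrative.

```python
import math

def reliability_index(mu_R: float, sigma_R: float,
                      mu_S: float, sigma_S: float) -> tuple[float, list[str]]:
    """Verified beta computation with input and range validation (illustrative)."""
    if sigma_R <= 0 or sigma_S <= 0:
        raise ValueError("Standard deviations must be positive.")
    beta = (mu_R - mu_S) / math.sqrt(sigma_R**2 + sigma_S**2)
    warnings = []
    # Range validation: values outside the range typical of structural
    # applications trigger a warning instead of silent acceptance.
    if not 2.0 <= beta <= 5.0:
        warnings.append(
            f"beta = {beta:.2f} is outside the typical 2.0-5.0 range; "
            "prompt the student to re-check inputs and units."
        )
    return beta, warnings

beta, warnings = reliability_index(500.0, 50.0, 300.0, 60.0)
print(beta, warnings)   # beta ~ 2.56, no warnings
```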
Domain knowledge grounding is achieved through retrieval-augmented generation (RAG) incorporating verified content from authoritative sources including textbooks, code commentaries, and peer-reviewed papers (Lewis et al., 2020). When the chatbot makes factual claims about reliability concepts, the system retrieves supporting passages from the knowledge base. Responses include confidence indicators based on retrieval match quality, and claims without knowledge base support are flagged for human review before delivery.
Response validation filters check for common error patterns before responses reach students. These include unit consistency checks, ensuring physical quantities have plausible units and magnitudes; terminology accuracy validation against a controlled vocabulary to prevent definitional errors; numerical plausibility checks, verifying that reliability indices, probabilities, and material properties fall within expected ranges; and code consistency validation, ensuring references to design codes are accurate.
Confidence-based response modification adjusts the chatbot’s behavior based on estimated response confidence (J. Zhang et al., 2020). High-confidence responses are delivered normally; medium-confidence responses include hedging language (“Based on standard practice…” or “Typically…”) and may suggest verification with the instructor; low-confidence responses acknowledge uncertainty explicitly and recommend consulting the instructor or textbook.
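A minimal sketch of such a policy, with hypothetical confidence thresholds:

```python
def apply_confidence_policy(response: str, confidence: float) -> str:
    """Adjust delivery by estimated confidence (thresholds hypothetical)."""
    if confidence >= 0.8:   # high: deliver normally
        return response
    if confidence >= 0.5:   # medium: hedge and suggest verification
        return ("Based on standard practice: " + response +
                " You may want to verify this with your instructor.")
    # low: acknowledge uncertainty explicitly
    return ("I'm not certain about this. Please consult your instructor "
            "or the course textbook for a reliable answer.")
```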
Certain response types are categorically prohibited, including specific numerical answers to problems identified as graded assessments, design recommendations for real structures, definitive safety judgments about structural adequacy, and content outside the structural reliability domain.

4.4.5. Human Escalation Protocols

Effective AI tutoring requires clear protocols for escalating to human instructors when automated support is insufficient. Our framework implements escalation at three levels.
Automatic escalation triggers flag interactions for instructor review under specific conditions. These include repeated misconception indicators (same misconception appearing three or more times across sessions), frustration indicators detected through sentiment analysis or explicit statements, extended confusion with circular conversation patterns, off-topic questions classified as non-reliability content, potential errors detected by response validation, and any concerning content related to safety or well-being.
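A simplified sketch of how these triggers might be evaluated against per-session data (field names and thresholds are hypothetical):

```python
from collections import Counter

def escalation_reasons(session: dict) -> list[str]:
    """Evaluate the automatic triggers above (field names hypothetical)."""
    reasons = []
    if any(n >= 3 for n in Counter(session["misconception_ids"]).values()):
        reasons.append("repeated misconception (3+ occurrences)")
    if session["sentiment"] < -0.5 or session["explicit_frustration"]:
        reasons.append("frustration indicators")
    if session["validator_flags"]:
        reasons.append("potential response error")
    if session["off_topic_turns"] >= 2:
        reasons.append("off-topic questions")
    return reasons

session = {"misconception_ids": ["SF-as-guarantee"] * 3, "sentiment": 0.1,
           "explicit_frustration": False, "validator_flags": [],
           "off_topic_turns": 0}
print(escalation_reasons(session))   # -> ['repeated misconception (3+ occurrences)']
```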
Student-initiated escalation allows students to request human help at any time through an explicit interface element. When activated, the system compiles a session summary including the question and relevant context, sends this to the instructor with a suggested response based on conversation history, and provides the student with acknowledgment and expected response time.
The instructor dashboard provides monitoring and intervention tools, including flagged interactions requiring attention prioritized by urgency, aggregate patterns across the student cohort, individual student trajectories, tools to inject instructor responses directly into student chat sessions, and override controls to correct or supplement chatbot responses. Weekly summary reports highlight patterns that might inform lecture adjustments and identify students who may benefit from additional support.

4.4.6. Pedagogical Design Alignment

To make the instructional design traceable, we mapped each learning theory to a design principle, an implementation feature, and the associated operational/technical constraints (Table 3).

4.5. Assessment Framework and Learning Measures

Effective evaluation requires assessment instruments aligned with the learning objectives and educational qualities discussed in Section 2.7. We distinguish three assessment levels, each targeting different aspects of probabilistic reasoning competence.

4.5.1. Conceptual Understanding

Assessing genuine conceptual transformation requires instruments that probe students’ mental models of uncertainty and reliability. We propose adapting established approaches from statistics education research.
Concept inventory items present scenarios requiring qualitative reasoning about reliability concepts without calculation. Such items distinguish students who understand reliability as a probabilistic property of a design population from those who interpret it deterministically.
Prediction-observation tasks embedded within simulations assess whether students can correctly anticipate the consequences of parameter changes before observing them. Systematic prediction errors reveal persistent misconceptions even when students can interpret observed results correctly post-hoc.
Transfer scenarios present novel structural contexts requiring application of reliability reasoning. Students who have genuinely crossed the threshold should spontaneously apply probabilistic thinking to new situations, whereas those with a surface understanding may revert to deterministic reasoning when familiar cues are absent.

4.5.2. Process Measures from Interaction Logs

The integrated learning environment generates rich process data suitable for learning analytics (Blikstein & Worsley, 2016). Key indicators include:
Exploration patterns in simulations reveal whether students engage in systematic hypothesis testing or random parameter manipulation. Purposeful exploration—varying one parameter while holding others constant, testing boundary conditions, verifying predictions—indicates developing scientific reasoning about reliability.
Explanation quality in chatbot interactions can be assessed through natural language analysis. Students demonstrating conceptual understanding should produce explanations featuring appropriate causal language, acknowledgment of uncertainty, and connections across concepts. Automated analysis can flag explanations containing misconception indicators for instructor review.
Misconception trajectories track whether identified misconceptions resolve, persist, or transform across sessions. Persistent misconceptions despite repeated intervention suggest the framework’s limitations for particular students or concepts.

4.5.3. Practical Competencies

Skills in reliability analysis are assessed through conventional problem-solving tasks requiring students to formulate limit state functions for given structural scenarios, select appropriate probability distributions based on physical reasoning, implement Monte Carlo simulations and interpret results, and calculate reliability indices and translate them to failure probabilities. These tasks complement conceptual assessments, as students may demonstrate procedural competence without genuine understanding, or vice versa.

4.5.4. Research Questions and Variables for Empirical Validation

The assessment framework is designed to address the following research questions in future empirical studies. Primary research questions include:
RQ1. Does the integrated AI tutoring and simulation framework improve conceptual understanding of structural reliability compared to traditional lecture-based instruction?
RQ2. Does the framework reduce documented misconceptions about probabilistic structural behavior?
RQ3. How do students’ interaction patterns with simulations and the AI chatbot relate to learning outcomes?
Secondary research questions include:
RQ4. What is the relative contribution of simulation-based exploration versus AI tutoring to learning gains?
RQ5. How do individual differences in prior statistics knowledge and learning preferences moderate framework effectiveness?
Furthermore, future empirical studies should operationalize the variable structure presented in Table 4. Such a framework supports experimental designs comparing instructional conditions and correlational analyses examining relationships between variables and outcomes.

4.6. Integration with Existing Curriculum

The framework is designed to be integrated into compulsory/core second-year undergraduate courses where students already work with deterministic structural analysis, such as structural mechanics, engineering mathematics for civil/structural engineering, or introductory structural design. In addition, the same sequence can be offered as a standalone elective in the third or fourth year (or early graduate level) when programs prefer a dedicated treatment of uncertainty and reliability.
Because prerequisite coverage varies across programs, readiness is defined in terms of observable competencies that can be checked with a short diagnostic quiz and/or evidence from prior coursework. Students are considered prepared if they can demonstrate:
  • Structural mechanics: (i) describe a plausible failure/limit state for a simple member in words; (ii) identify governing response quantities for a given loading scenario.
  • Mathematics: (i) manipulate algebraic expressions and solve for an unknown; (ii) interpret functions/graphs in context; (iii) compute and interpret a basic derivative as a rate of change (sufficient for sensitivity/linearization concepts).
  • Probability/statistics: (i) distinguish deterministic vs variable quantities; (ii) interpret mean and variance/standard deviation; (iii) interpret a probability statement using a distribution/CDF at a conceptual level.
Students who do not meet the probability/statistics threshold complete a brief on-ramp (worked examples + guided simulation warm-up + targeted practice) before the main modules. Instructors may implement the diagnostic as a low-stakes pretest (e.g., 10–15 min) and route students to the on-ramp based on performance.
The framework is most straightforward to implement in face-to-face or hybrid formats with a recitation/lab component, where students can run simulations with instructor/TA oversight and where AI-mediated support can be monitored and calibrated. Fully online delivery is feasible provided students have reliable access to the simulation environment, and the course includes structured checkpoints (graded submissions or quizzes) to ensure engagement with the simulation + tutoring workflow.
Finally, we distinguish three integration models depending on institutional constraints. Embedded module model (face-to-face or hybrid): Reliability content is incorporated as a bounded unit within an existing course. For example, a structural mechanics course might include a two-week module on uncertainty after covering deterministic analysis of beams and columns. Parallel workshop model (hybrid or online): Reliability instruction is delivered through supplementary workshops/tutorials accompanying core courses when the core syllabus cannot be changed. This model requires defined participation expectations and structured checkpoints. Standalone course model (face-to-face, hybrid, or online): A dedicated course treats uncertainty and reliability comprehensively, including extended design tasks. This approach provides the most depth but may require curriculum revision or offering as an elective.

5. Teaching Modules

The pedagogical framework described above provides the architectural blueprint for instruction. The following section operationalizes this framework through five sequential teaching modules, illustrating how theoretical principles translate into specific learning activities and AI tutoring interactions. We first provide an overview of the module sequence, then describe the consistent instructional structure, and finally present a detailed illustration of one representative module.

5.1. Module Overview

Table 5 summarizes the five modules, their learning objectives, and primary pedagogical strategies. The modules progress from foundational concepts (uncertainty recognition) through mathematical formalization (probability distributions) to the core reliability problem, computational methods, and, finally, design applications.
The modules require approximately 18 contact hours total, with Modules 2–4 receiving the greatest time allocation given their conceptual density. Each module builds explicitly on prior modules, with the AI chatbot reinforcing connections to previously learned concepts.

5.2. Instructional Structure

Each module follows a four-phase instructional sequence designed to promote deep conceptual understanding:
  • Activation and motivation (15–20 min): The module opens with a concrete engineering scenario that motivates the concepts to be learned. The AI chatbot poses initial questions to activate prior knowledge and surface existing conceptions.
  • Concept introduction (20–30 min): Key concepts are introduced through brief exposition, immediately followed by interactive simulation activities. The chatbot provides just-in-time explanations responsive to student queries.
  • Guided exploration (30–45 min): Students engage in structured simulation-based activities with increasing autonomy. The chatbot employs Socratic questioning to deepen understanding and diagnostic prompts to identify misconceptions.
  • Consolidation and reflection (15–20 min): Students complete application problems and engage in chatbot-facilitated reflection on key insights and connections to engineering practice.

5.3. Illustration: Solving the Structural Reliability Problem

To illustrate the integration of simulation-based learning and AI chatbot support, we describe Module 3 in detail. This module represents a critical juncture where students transition from probability concepts to their application in structural reliability.

5.3.1. Conceptual Foundation

The module introduces the limit state function, defined in Section 4.2, as the mathematical formalization of structural adequacy: $g(R, S) = R - S$, where $R$ represents structural resistance (capacity) and $S$ represents the load effect (demand). The limit state function defines the boundary between safe performance ($g > 0$) and failure ($g \leq 0$). This formulation extends naturally from the deterministic design inequality $R > S$ that students have previously mastered, reframing it as a continuous function whose sign indicates structural adequacy. For uncorrelated $R$ and $S$, the reliability index $\beta = (\mu_R - \mu_S)/\sqrt{\sigma_R^2 + \sigma_S^2}$ quantifies the margin of safety in probabilistic terms. The failure probability relates to the reliability index through $P_f = \Phi(-\beta)$, where $\Phi$ is the standard normal cumulative distribution function; the relation is exact when $R$ and $S$ are normally distributed.
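To make the calculation concrete, the following is a minimal sketch in Python using NumPy and SciPy; the parameter values are illustrative and not part of the module specification.
import numpy as np
from scipy.stats import norm

def reliability_index(mu_R, sigma_R, mu_S, sigma_S):
    # Reliability index for uncorrelated resistance R and load effect S.
    if sigma_R <= 0 or sigma_S <= 0:
        raise ValueError("Standard deviations must be positive.")
    return (mu_R - mu_S) / np.sqrt(sigma_R**2 + sigma_S**2)

beta = reliability_index(mu_R=300.0, sigma_R=30.0, mu_S=200.0, sigma_S=40.0)
p_f = norm.cdf(-beta)  # P_f = Phi(-beta), exact when R and S are normal
print(f"beta = {beta:.2f}, P_f = {p_f:.2e}")  # beta = 2.00, P_f = 2.28e-02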

5.3.2. Simulation Component

The simulation interface (Figure 3) presents load and resistance distributions on shared axes, with the overlap region visually representing failure probability. It includes real-time parameter adjustment via sliders, enabling students to observe how changes in means and standard deviations affect reliability; dynamic calculation and display of the reliability index and failure probability; and visualization of the failure region.
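The per-update computation behind such an interface can be sketched as follows (a hypothetical implementation, not the system's actual code). It evaluates $P_f = P(R \leq S) = \int F_R(s) f_S(s)\,ds$ numerically, a formulation that also extends to non-normal distributions.
import numpy as np
from scipy.stats import norm

def failure_probability(mu_R, sigma_R, mu_S, sigma_S, n=4001):
    # P(R <= S) = integral of F_R(s) * f_S(s) ds over the support of S.
    lo = min(mu_R - 8 * sigma_R, mu_S - 8 * sigma_S)
    hi = max(mu_R + 8 * sigma_R, mu_S + 8 * sigma_S)
    s = np.linspace(lo, hi, n)
    return np.trapz(norm.cdf(s, mu_R, sigma_R) * norm.pdf(s, mu_S, sigma_S), s)

# Called on every slider change with the current parameter values:
print(f"{failure_probability(300.0, 30.0, 200.0, 40.0):.2e}")  # ~2.28e-02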

5.3.3. AI Chatbot Integration

The AI chatbot supports learning through context-sensitive interventions. Table 6 presents representative interaction patterns illustrating the five pedagogical functions.

5.3.4. Assessment

Formative assessment is embedded throughout the module, following best practices for feedback design (Shute, 2008):
  • Prediction tasks: Students predict reliability changes before manipulating parameters, then verify predictions through simulation.
  • Explanation prompts: The chatbot asks students to explain observed phenomena in their own words, assessing conceptual understanding.
  • Application problems: Students calculate reliability indices for given scenarios and interpret results in engineering terms.

5.4. Module Progression

The five modules are designed for coherent integration rather than standalone use. Module 2 probability concepts directly enable Module 3 reliability calculations; Module 3 analytical results provide validation benchmarks for Module 4 Monte Carlo simulation; Module 5 sensitivity analysis builds on all prior modules to address design optimization. The AI chatbot reinforces these connections by explicitly referencing prior module content when relevant. For example, when a student in Module 4 questions why Monte Carlo estimates vary between runs, the chatbot might respond: “Remember from Module 2 how sample statistics approached population parameters as sample size increased? The same principle applies here.”
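The sampling variability the chatbot refers to, and the use of Module 3 analytical results as a Module 4 benchmark, can be demonstrated in a few lines (a minimal sketch; the sample size and seeds are arbitrary choices, not specified by the modules):
import numpy as np
from scipy.stats import norm

mu_R, sigma_R, mu_S, sigma_S, n = 300.0, 30.0, 200.0, 40.0, 100_000
beta = (mu_R - mu_S) / np.sqrt(sigma_R**2 + sigma_S**2)
print(f"Module 3 benchmark: P_f = {norm.cdf(-beta):.3e}")
for seed in range(3):  # three independent runs yield three different estimates
    rng = np.random.default_rng(seed)
    g = rng.normal(mu_R, sigma_R, n) - rng.normal(mu_S, sigma_S, n)
    print(f"run {seed}: P_f_hat = {np.mean(g <= 0):.3e}")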

6. Illustrative Learning Scenarios

The teaching modules presented above describe intended instructional sequences and system behaviors. To illustrate how these designs might unfold in practice, the following section presents scenarios depicting anticipated student interactions, including both successful learning trajectories and situations where the framework encounters limitations.

6.1. Scenario 1: Successful Conceptual Transformation

This scenario illustrates how the framework supports the threshold concept transition from deterministic to probabilistic thinking.
A student with strong performance in deterministic structural analysis opens Module 3 and encounters overlapping load and resistance distributions. The student adjusts the parameters so that the mean resistance equals twice the mean load, expecting this to eliminate the possibility of failure. The simulation displays $\beta = 2.83$ and $P_f \approx 2.3 \times 10^{-3}$. The student expresses surprise to the chatbot, asking why failure remains possible when average strength greatly exceeds average load.
The chatbot responds with a Socratic prompt: “You’ve made an important observation. Look at the overlap region between the two distributions. What does that shaded area represent physically?”
Through continued dialogue, the student discovers that safety cannot be guaranteed by comparing averages alone, as variability of both load and resistance determines the probability of unfavorable combinations. The chatbot reinforces this by connecting to the earlier deterministic work: “In your previous analysis, you used single values for load and strength. Those values were actually chosen to represent specific percentiles of their distributions.”
This scenario illustrates the framework’s intended operation: making probabilistic consequences visible through simulation, providing Socratic guidance that prompts students to construct understanding, and connecting new probabilistic understanding to prior deterministic knowledge.
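The quoted values are reproducible. For instance, with $\mu_S = 100$, $\mu_R = 200$, and $\sigma_R = \sigma_S = 25$ (one parameter choice consistent with the scenario; the text does not fix the standard deviations):
from scipy.stats import norm

beta = (200.0 - 100.0) / (25.0**2 + 25.0**2) ** 0.5  # = 2.83
print(round(beta, 2), f"P_f = {norm.cdf(-beta):.1e}")  # 2.83 P_f = 2.3e-03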

6.2. Scenario 2: Failure Modes

This scenario illustrates situations where the AI tutoring encounters difficulties, demonstrating the need for human instructor oversight.
A student asks the chatbot a question that combines reliability concepts with earthquake engineering in ways that exceed the module scope: “How would I calculate the reliability of a building during an earthquake when the ground motion affects both the load and the resistance?”
The chatbot attempts a response but provides an oversimplified answer that fails to address the dynamic nature of seismic loading and the complexity of capacity degradation under cyclic loading. The student accepts the response and develops a partial misconception about reliability analysis.
This interaction is flagged in the instructor dashboard because the query exceeded domain boundaries. The instructor reviews the log, identifies the inadequate response, and intervenes during the next class session to clarify the distinction between static reliability analysis covered in the module and the more complex dynamic reliability methods required for earthquake engineering. This scenario illustrates why human oversight remains essential. The chatbot’s domain constraints and confidence thresholds cannot anticipate all boundary-crossing queries.
A second limitation emerges when a student with weaker mathematical preparation than peers struggles with Module 2’s probability distribution concepts. Despite multiple chatbot explanations using different approaches, the student remains confused about the relationship between probability density functions and probabilities, and the student’s responses become shorter and more frustrated.
The chatbot detects frustration indicators and offers: “This concept takes time to develop. Would you like to take a short break, or would it help to discuss this with your instructor or a classmate?” The interaction is flagged for instructor attention. The AI tutor cannot replace human judgment in identifying when a student needs fundamentally different support, whether peer tutoring, prerequisite review, or accommodation for learning differences. The framework is designed to recognize its limits and facilitate rather than replace human intervention.

6.3. Learner Variability Considerations

Real implementations will encounter greater learner diversity than the scenarios above depict, which affects framework effectiveness in several ways.
Students entering with weaker statistics backgrounds may require more time in Modules 1–2 before the reliability concepts of Module 3 become accessible. The framework’s modular structure allows instructors to assign supplementary probability exercises, but the AI tutor cannot independently assess whether prerequisite gaps are causing difficulties versus conceptual challenges with new material.
For students whose first language is not English, technical terminology poses additional cognitive load. The chatbot can be prompted to define technical terms when first used and to use simpler sentence structures, but it cannot fully adapt to individual language proficiency levels. Instructors should consider providing glossaries and allowing students to discuss concepts with peers in their preferred language before engaging with the English-language tutor.
Students may interact with simulations and chatbots differently from the scenarios illustrated. The framework’s self-paced nature provides some accommodation, but specific adaptations (extended time, alternative input modalities, reduced visual complexity) require instructor configuration rather than automatic AI adaptation.
Some students may engage superficially with simulations, manipulating parameters without genuine reflection. While the chatbot’s prediction prompts and Socratic questions are designed to promote meaningful engagement, students can circumvent these by providing minimal responses. Instructor review of interaction logs can identify such patterns, but addressing disengagement requires human intervention.
These considerations underscore that the framework is designed to support, not replace, attentive human instruction. The AI and simulation components extend teaching capacity but cannot substitute for instructor awareness of individual student needs.

7. Discussion

These illustrative scenarios, while hypothetical, ground the framework design in concrete educational situations and highlight both its potential benefits and inherent limitations. The following section synthesizes design principles emerging from this work, discusses anticipated contributions, acknowledges limitations explicitly, and outlines directions for future empirical research.
Recent systematic reviews document the rapid uptake of generative AI (GenAI) tools in higher education, with reported benefits for learning efficiency and personalization alongside persistent concerns about output reliability, academic integrity, and student over-reliance (Albadarin et al., 2024; Dos, 2025; Naznin et al., 2025; Qian, 2025). At the same time, these reviews note that much of the early empirical evidence is short-horizon and frequently centered on perceptions and self-reported use rather than discipline-specific learning mechanisms and validated conceptual change (Dos, 2025; Qian, 2025). In parallel, recent work has begun to report measurable learning impacts from AI-supported tutoring and human–AI tutoring augmentation in authentic settings (Kestin et al., 2025; Wang et al., 2025). Within this evolving landscape, the present manuscript positions its contribution as a design-based, theory-aligned blueprint for a safety-critical STEM domain: structural reliability. Rather than proposing an unconstrained chatbot, the framework specifies bidirectional simulation–tutoring integration, misconception-triggered interventions, and implementation safeguards (e.g., retrieval grounding and calculation verification) intended to reduce hallucination risk and better support conceptual threshold crossing (Németh et al., 2025; Swacha & Gracel, 2025).

7.1. Design-Based Research Limitations

This paper contributes to engineering education research through theoretical synthesis and principled instructional design rather than empirical evaluation, following established DBR methodology (Bakker, 2018). In DBR, iterations are recurrent cycles of building, testing, and reconjecturing used to gain theoretical and practical insight into unclear or ill-structured aspects of a research project. The present paper represents the first phase of this iterative process. We acknowledge that design specifications, however carefully grounded in theory, remain conjectures until empirically tested. Questions concerning design principles are, in fact, quite difficult to pursue in practice, and actual descriptions of design principles and their transformations are often omitted from DBR publications (Gundersen, 2021; Hanghøj et al., 2022). Formulating principles that practitioners find useful, putting them into practice once understood theoretically, and identifying how to revise them based on experience from real-life interventions are all complicated tasks, yet they lie at the heart of what DBR strives to achieve.

7.2. Synthesis of Design Principles

The pedagogical framework developed in this paper integrates multiple learning theories into a coherent design for teaching structural reliability concepts, as defined in Section 4: the five-stage conceptual progression (Table 1), the misconception inventory (Table 2), the tutor function mapping (Figure 1), the system architecture (Figure 2), and the safeguard/escalation protocols (Section 4.4.4 and Section 4.4.5). These principles may transfer to other STEM disciplines where students must navigate conceptual thresholds involving uncertainty and probabilistic reasoning:
Scaffolded progression from deterministic foundations: The framework builds upon students’ existing competence in deterministic structural analysis rather than treating probabilistic concepts as entirely new content. By showing how random variables generalize familiar fixed parameters and how the limit state function extends structural mechanics principles, the framework leverages prior knowledge to reduce intrinsic cognitive load. This principle suggests that instruction in probabilistic methods should explicitly connect to students’ deterministic mental models, using these as foundations for construction rather than obstacles to overcome.
Externalization of abstract concepts through dynamic visualization: Probability distributions, failure regions, and reliability indices exist in abstract mathematical spaces that students cannot directly observe. The simulation components make these abstractions visible and manipulable, allowing students to develop intuition through direct interaction. Seeing probability mass accumulate in failure regions across Monte Carlo trials, or observing how distribution overlap changes with parameter adjustments, provides experiential grounding for concepts that would otherwise remain purely symbolic.
Responsive tutoring calibrated to individual understanding: The AI chatbot provides differentiated support based on student responses, offering Socratic questioning for students ready to construct understanding independently while providing more direct explanation for those requiring additional scaffolding. This responsiveness addresses the limitation of one-size-fits-all instruction and approximates the benefits of individual tutoring at scale.
Metacognitive prompting throughout the learning experience: Rather than focusing exclusively on domain content, the framework systematically prompts students to reflect on their reasoning processes, articulate their current understanding, and identify remaining uncertainties. These metacognitive skills transfer across domains and support lifelong learning beyond the immediate instructional context.
Connecting probabilistic concepts to professional practice: Each module includes explicit discussion of how reliability concepts are embodied in engineering codes, design decisions, and professional responsibilities. This connection motivates engagement by demonstrating relevance while also preparing students for their professional careers.
Providing safe space for productive failure: The combination of patient AI tutoring and consequence-free simulation allows students to make mistakes, test incorrect hypotheses, and experience misconceptions without penalty. This safety supports the risk-taking necessary for conceptual exploration, particularly important when students must abandon comfortable deterministic assumptions.

7.3. Anticipated Benefits and Contributions

The proposed framework offers several anticipated benefits for structural reliability education, though these claims require empirical validation before they can be stated with confidence. This expectation is consistent with emerging experimental evidence that AI-enabled tutoring or AI support for tutoring can improve learning outcomes in certain contexts, such as fostering curiosity (Abdelghani et al., 2024), while also showing that instructional quality depends on how the AI is constrained and integrated into pedagogy (Kestin et al., 2025; Wang et al., 2025).
For student learning outcomes, the framework is designed to support both conceptual understanding and procedural competence. By making probabilistic consequences visible through simulation and providing responsive tutoring that addresses individual misconceptions, the framework may help students cross the threshold from deterministic to probabilistic thinking more successfully than traditional lecture-based instruction. The scaffolded progression manages cognitive load while the metacognitive prompting develops self-regulation skills that support continued learning.
For instructional efficiency, the AI chatbot extends teaching capacity by providing individualized feedback that would otherwise require substantial instructor time. Students can engage with the tutor outside class hours, receiving immediate responses to questions that might otherwise wait for office hours or tutorial sessions. This extended access may be particularly valuable for students who hesitate to ask questions in public settings.
For curriculum development, the modular design allows flexible integration with existing courses. Institutions can adopt individual modules that address specific gaps in their programs or implement the complete sequence as a dedicated unit. The explicit learning objectives and competency outcomes support alignment with accreditation requirements and program assessment.
For the broader engineering education community, the design principles articulated in this paper may inform the development of AI-enhanced learning environments in other domains. The approach of combining simulation-based exploration with conversational AI tutoring addresses challenges that extend beyond structural reliability to any topic requiring conceptual transformation.

7.4. Limitations and Challenges

This work has several important limitations that must be acknowledged explicitly. Most fundamentally, this paper presents a theoretically grounded design framework that has not yet been validated through empirical research with student participants. While the design draws upon established learning science principles and incorporates features that research suggests should be effective, the actual impact on student learning remains to be demonstrated. The illustrative scenarios presented in Section 6 represent anticipated interactions based on pedagogical theory and instructor experience, not observed student behavior.
The effectiveness of AI tutoring depends critically on the quality of the underlying language model and the care with which the tutoring system is designed. Current large language models, despite their impressive capabilities, can generate incorrect information, provide inconsistent responses, or fail to recognize student misconceptions. The chatbot may occasionally reinforce rather than correct student errors, potentially causing harm that outweighs benefits. Recent higher-education syntheses repeatedly identify output accuracy, ethical ambiguity, and overreliance as central risks shaping student and instructor trust, and they recommend explicit guidance and assessment redesign rather than detection-only responses (Babai Shishavan, 2024; Dos, 2025; Qian, 2025). Ongoing monitoring, evaluation, and refinement of chatbot behavior is essential for maintaining instructional quality. Even when retrieval-augmented generation (RAG) is used to ground answers in course materials, recent pilot evidence suggests that error rates are reduced but not eliminated, and that a nontrivial share of responses may still fall outside the provided knowledge base (Németh et al., 2025; Swacha & Gracel, 2025). Human oversight remains necessary, and instructors should not assume that the AI tutor will always provide appropriate guidance. Further, if multimodal inputs are supported, images must be treated as untrusted input because visually embedded/typographic prompts can subvert model behavior (Cheng et al., 2024a, 2024b, 2025).
The simulation components require substantial development resources and ongoing technical maintenance. Creating effective interactive simulations demands expertise in both software development and instructional design. Institutions adopting the framework must provide adequate computing infrastructure and technical support. Simulation tools may become outdated as technology evolves, requiring periodic updates.
Student engagement with simulations cannot be guaranteed through design alone. Some students may interact superficially, manipulating parameters without genuine reflection on underlying concepts. The framework includes features intended to promote meaningful engagement, such as prediction prompts and chatbot questioning, but these may not succeed with all students. Individual differences in learning preferences, prior preparation, and motivation will influence outcomes.
Assessment of learning in simulation-based environments presents methodological challenges. Traditional examinations may not capture the conceptual understanding developed through exploratory learning. Developing valid and reliable assessments that measure probabilistic reasoning competence requires careful attention to alignment between learning activities and evaluation methods.
Faculty development is necessary for successful implementation. Instructors accustomed to traditional lecture-based teaching may require support in facilitating simulation-based learning and interpreting student interactions with AI tutors. The instructor’s role shifts from primary content delivery to learning facilitation, a transition that requires new skills and may encounter resistance.

7.5. Future Research Directions

Empirical validation represents the essential next step for this research program. Validation studies should employ rigorous experimental or quasi-experimental designs comparing learning outcomes between students using the framework and those receiving traditional instruction. Outcome measures should assess both conceptual understanding through transfer tasks that require applying probabilistic reasoning to novel situations and procedural competence through reliability analysis problems. Studies should also examine affective outcomes, including student confidence, engagement, and attitudes toward probabilistic reasoning.
Beyond comparative effectiveness, research should investigate mechanisms through which the framework influences learning. Analysis of student interaction logs with both simulations and the AI chatbot can reveal patterns of exploration, common misconceptions, and learning trajectories. This process data can inform refinements to both simulation design and chatbot tutoring strategies.
Longitudinal studies should examine retention and transfer of probabilistic reasoning skills. Do students who learn reliability through this approach apply probabilistic thinking more readily in subsequent courses and professional practice? The threshold concept framing suggests that genuine conceptual transformation should be irreversible and integrative, predictions that longitudinal research could test.
Investigation of individual differences deserves attention. How do students with different levels of mathematical preparation, prior exposure to probability, and learning preferences respond to the framework? Understanding moderating factors can guide differentiated implementation strategies.
The design principles articulated in this paper should be tested for transferability to other domains. Probabilistic reasoning is fundamental to many engineering disciplines, including geotechnical engineering with soil property variability, water resources engineering with stochastic hydrology, and electrical engineering with system reliability. Research should examine whether adapted versions of the framework prove effective in these domains.
Finally, research should investigate the optimal integration of human instructors with AI tutoring. What role should faculty play in simulation-based learning environments? How can instructors leverage AI tutoring logs to identify students requiring additional support? What forms of instructor intervention complement rather than duplicate AI tutoring capabilities?
This future agenda aligns with recent reviews calling for (i) discipline-specific evaluations, (ii) longitudinal evidence beyond short-term satisfaction metrics, and (iii) research on human–AI collaboration models that preserve learner agency and metacognitive skill development (Dos, 2025; Qian, 2025).

8. Conclusions

This paper presents a pedagogical framework that integrates AI-powered conversational tutoring with interactive Monte Carlo simulations to support undergraduate civil engineering students in developing competence in structural reliability. The framework addresses an educational challenge: the difficulty students experience when transitioning from deterministic to probabilistic reasoning about structural safety.
The framework’s design is grounded in complementary learning theories that together inform its key features. Cognitive load theory motivates the scaffolded progression that decomposes complex probabilistic reasoning into manageable components introduced progressively. Multimedia learning principles guide the simulation designs that externalize abstract probability concepts into visible, manipulable representations. Vygotsky’s zone of proximal development and scaffolding research inform the AI chatbot’s adaptive tutoring that calibrates support to individual student needs. Self-regulated learning theory shapes the metacognitive prompting integrated throughout the learning experience. Threshold concepts research illuminates why the deterministic-to-probabilistic transition proves so challenging and suggests instructional approaches for supporting students through this conceptual transformation.
The framework operationalizes these theoretical foundations through five sequential teaching modules progressing from recognition of uncertainty in structural engineering through probability distribution characterization, reliability problem formulation, Monte Carlo simulation methods, and sensitivity analysis with design implications. Each module integrates simulation-based exploration with AI tutoring that provides explanatory, Socratic, diagnostic, motivational, and metacognitive support functions.
Several contributions emerge from this work. The paper provides a systematic analysis of challenges facing early undergraduate learners in structural reliability, drawing on learning sciences research to explain why probabilistic concepts prove troublesome. It presents an integrated pedagogical approach that leverages emerging AI capabilities in combination with established simulation-based learning methods. The detailed teaching module specifications can be adapted by other institutions seeking to strengthen reliability education. The design principles articulated through the development process offer transferable insights for AI-enhanced learning environments across STEM disciplines where students must navigate conceptual thresholds.
This paper has important limitations that warrant emphasis. The framework presented here is a theoretically grounded design that has not yet been validated empirically. While the design incorporates features that learning sciences research suggests should be effective, claims about actual impact on student learning must await rigorous empirical investigation. Such validation will require institutional ethics approval and will be reported in subsequent publications.
Despite these limitations, the framework offers a principled approach to a significant educational challenge. The integration of AI tutoring with domain-specific simulations represents a promising pathway for engineering education, providing scalable personalized instruction while making abstract concepts concrete through interactive visualization. For structural reliability specifically, this approach may help bridge the gap between the probabilistic foundations of modern design codes and undergraduate preparation, producing graduates better equipped to understand and apply reliability-based engineering methods.
As AI capabilities continue to advance and simulation technologies become more sophisticated, opportunities will expand for learning environments that adapt responsively to individual student needs while maintaining grounding in authentic disciplinary practice. The framework presented here offers one model for such integration, informed by learning theory and focused on a conceptual threshold that matters for engineering practice. Its ultimate value will be determined through the empirical validation that must follow.

Funding

This study is jointly supported by the Hong Kong Research Grants Council under Grant 26200124 and the Hong Kong University of Science and Technology Center for Education Innovation under Grant EDGE04E-23S.

Institutional Review Board Statement

Ethical review and approval were waived for this study, since it did not involve human subjects. The pedagogical framework and teaching modules described herein are presented as instructional design proposals. Empirical validation involving student participants will require institutional review board approval and will be reported in subsequent publications.

Informed Consent Statement

Informed consent was waived for the same reason stated above.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The author declares no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Appendix A. Technical Implementation Specifications

This appendix provides detailed specifications for implementing the AI chatbot tutor, system prompts, misconception detection patterns, and error-handling protocols.

Appendix A.1. Complete System Prompt Template

The following template represents the complete prompt structure sent to the LLM for each interaction. Variables in brackets are replaced with current context values.
# SYSTEM IDENTITY
You are RELAY (Reliability Learning Assistant for Young engineers),
an AI tutor designed to help undergraduate civil engineering students
learn structural reliability concepts.
# CORE PRINCIPLES
1. GUIDE, DON’T TELL: Use Socratic questioning. Ask "What do you
   think would happen if...?" before explaining what happens.
2. BUILD ON INTUITION: Connect abstract concepts to physical
   phenomena students can visualize.
3. EMBRACE PRODUCTIVE STRUGGLE: Don’t rush to correct errors.
   Let students discover inconsistencies through exploration.
4. ACKNOWLEDGE LIMITS: When uncertain, say so. Recommend instructor
   consultation for questions beyond your reliable knowledge.
# DOMAIN BOUNDARIES
You teach: Probability concepts, random variables, reliability index,
FORM/SORM, Monte Carlo simulation, system reliability, code calibration.
You do NOT teach: Advanced topics (time-dependent reliability,
earthquake engineering, wind engineering), unrelated topics.
You NEVER: Provide solutions to graded work, make safety judgments
about real structures, give professional engineering advice.
# CURRENT MODULE CONTEXT
Module: [MODULE_NUMBER] - [MODULE_TITLE]
Key concepts for this module:
[MODULE_CONCEPTS]
Learning objectives:
[MODULE_OBJECTIVES]
Common difficulties in this module:
[MODULE_DIFFICULTIES]
# STUDENT CONTEXT
Student ID: [ANONYMIZED_ID]
Completed modules: [COMPLETED_LIST]
Current session: [SESSION_NUMBER] in this module
Recent performance summary: [PERFORMANCE_NOTES]
# CURRENT SIMULATION STATE
Active simulation: [SIMULATION_NAME]
Current parameters: [PARAMETER_VALUES]
Recent student actions: [ACTION_LOG]
# CONVERSATION HISTORY
[COMPRESSED_HISTORY]
# RESPONSE GUIDELINES
- Keep responses concise (typically 2-4 sentences) unless
  detailed explanation requested
- Use one question per response
- Reference current simulation state when relevant
- Use notation consistent with course textbook
- Format equations using LaTeX when needed
# STUDENT’S MESSAGE
[STUDENT_INPUT]
# YOUR RESPONSE
Respond helpfully while following all guidelines above.
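To illustrate how such a template might be assembled at runtime, the following minimal Python sketch fills the bracketed variables and dispatches the prompt; the compressed template, field names, and the commented-out llm_client.complete call are hypothetical placeholders rather than a specific vendor API.
# A compressed stand-in for the full template above; in practice the complete
# Appendix A.1 text would be used.
PROMPT_TEMPLATE = (
    "# CURRENT MODULE CONTEXT\n"
    "Module: {module_number} - {module_title}\n"
    "# CURRENT SIMULATION STATE\n"
    "Current parameters: {parameter_values}\n"
    "# STUDENT'S MESSAGE\n"
    "{student_input}\n"
)

def build_prompt(context: dict) -> str:
    # Each bracketed template variable maps to one context key.
    return PROMPT_TEMPLATE.format(**context)

prompt = build_prompt({
    "module_number": 3,
    "module_title": "Solving the Structural Reliability Problem",
    "parameter_values": "mu_R=300, sigma_R=30, mu_S=200, sigma_S=40",
    "student_input": "Why is the failure probability not zero?",
})
# response = llm_client.complete(prompt)  # hypothetical client call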

Appendix A.2. Misconception Detection Patterns

The following patterns trigger misconception-specific response strategies. Patterns are expressed as regular expressions with associated response templates.
Table A1. Misconception-detection patterns and Socratic-response openers.
ID | Detection Pattern | Socratic Opener
M1.1 | Keywords: “definitely safe,” “guaranteed,” “won’t ever fail”; patterns matching safety factor claims with certainty language | “That’s an interesting way to think about it. Let me ask: if we built 10,000 structures all with SF = 2.0, what do you think would happen to them over their lifetimes?”
M1.3 | Patterns matching small probabilities with “never,” “impossible,” “basically zero” | “That probability does seem incredibly small. Here’s something to consider: how many structures do you think exist in a country like China?”
M2.1 | Patterns matching sample measurement claims with “know,” “found,” “exact” | “If a different engineer ran the same test with a different set of 30 specimens, what mean value do you think they would get?”
M3.2 | Patterns matching $\beta$ with “percent,” “probability,” or direct equality to $P_f$ | “I see you’re connecting $\beta$ to probability; that’s on the right track! But let me ask: what are the units of $\beta$, and how does that relate to probability, which must be between 0 and 1?”
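A sketch of how such patterns might be operationalized in Python follows; the regular expressions below are simplified stand-ins for the fuller pattern set.
import re

# Simplified detection patterns keyed by misconception ID (cf. Table A1).
MISCONCEPTION_PATTERNS = {
    "M1.1": re.compile(r"definitely safe|guaranteed|won'?t ever fail", re.IGNORECASE),
    "M1.3": re.compile(r"\b(never|impossible|basically zero)\b", re.IGNORECASE),
}

SOCRATIC_OPENERS = {
    "M1.1": "If we built 10,000 structures all with SF = 2.0, what do you "
            "think would happen to them over their lifetimes?",
    "M1.3": "That probability does seem incredibly small. How many structures "
            "do you think exist in a country like China?",
}

def detect_misconceptions(message: str) -> list[str]:
    # Return the IDs of all misconceptions whose pattern matches the message.
    return [mid for mid, pattern in MISCONCEPTION_PATTERNS.items()
            if pattern.search(message)]

for mid in detect_misconceptions("With SF = 2.0 the beam is definitely safe."):
    print(mid, "->", SOCRATIC_OPENERS[mid])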

Appendix A.3. Calculation Verification Protocols

All numerical calculations are performed by verified implementations of validated algorithms with input validation and range checking. For reliability index calculations, the engine validates that the standard deviations are positive and that the mean resistance is positive. For Monte Carlo reliability estimation, the engine returns not only the point estimate $\hat{P}_f$ but also the coefficient of variation of the estimate and a 95% confidence interval, ensuring students understand the uncertainty inherent in simulation results.
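A minimal sketch of the Monte Carlo verification just described, assuming the standard estimator formulas (coefficient of variation $\sqrt{(1-\hat{P}_f)/(N\hat{P}_f)}$ and a normal-approximation 95% interval):
import numpy as np

def mc_failure_estimate(g_samples):
    # g_samples: Monte Carlo evaluations of the limit state g = R - S.
    g = np.asarray(g_samples)
    n = g.size
    p_hat = np.mean(g <= 0.0)
    if p_hat == 0.0:
        raise ValueError("No failures observed; increase the sample size.")
    cov = np.sqrt((1.0 - p_hat) / (n * p_hat))         # CoV of the estimate
    half = 1.96 * np.sqrt(p_hat * (1.0 - p_hat) / n)   # 95% CI half-width
    return p_hat, cov, (max(p_hat - half, 0.0), p_hat + half)

rng = np.random.default_rng(7)
g = rng.normal(300.0, 30.0, 200_000) - rng.normal(200.0, 40.0, 200_000)
print(mc_failure_estimate(g))  # point estimate, CoV, 95% confidence interval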

Appendix A.4. Response Validation Protocol

Before delivering responses to students, the system applies validation checks including screening for prohibited phrases (e.g., “the answer is,” “this structure is safe/unsafe,” “you should design”), verification of numerical claims against the calculation engine, detection of responses that may provide graded assignment solutions, and addition of hedging language to strong claims lacking knowledge base support. Responses failing validation are either automatically modified, withheld pending instructor review, or replaced with an acknowledgment that the question requires instructor input.
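The prohibited-phrase screen, for example, can be sketched in a few lines (the phrase list echoes the examples above; matching rules in a production system would be more elaborate):
PROHIBITED_PHRASES = (
    "the answer is",
    "this structure is safe",
    "this structure is unsafe",
    "you should design",
)

def screen_response(draft: str) -> tuple[bool, list[str]]:
    # Return (passes, violations) for a draft chatbot response.
    lowered = draft.lower()
    violations = [p for p in PROHIBITED_PHRASES if p in lowered]
    return len(violations) == 0, violations

ok, hits = screen_response("The answer is beta = 2.0, so this structure is safe.")
print(ok, hits)  # False ['the answer is', 'this structure is safe']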

Appendix A.5. Escalation Decision Tree

The escalation process follows a hierarchical decision structure, enumerated below and sketched in code after the list:
  • If student explicitly requests human help → immediate escalation with context summary
  • If safety or well-being concern detected → immediate escalation plus provide appropriate resources
  • If response validation failed → withhold response, inform student, escalate to instructor
  • If same misconception appears >3 times in session → continue interaction but flag for instructor review and suggest office hours
  • If high frustration indicators detected → empathetic response, suggest break, escalate
  • If question outside domain boundaries → polite redirect with alternative resources
  • If confidence score below threshold → include uncertainty language and suggest verification
  • Otherwise → deliver response normally
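The same hierarchy expressed as an ordered sequence of checks (a minimal Python sketch; the Interaction fields and numeric thresholds are hypothetical, since the paper does not specify them):
from dataclasses import dataclass

@dataclass
class Interaction:
    requests_human: bool = False
    safety_concern: bool = False
    validation_failed: bool = False
    misconception_repeats: int = 0
    frustration_score: float = 0.0  # 0-1, inferred from interaction signals
    in_domain: bool = True
    confidence: float = 1.0         # model confidence in the drafted response

def escalation_action(ix: Interaction) -> str:
    # Ordered checks mirror the decision tree above; the first match wins.
    if ix.requests_human:
        return "escalate_with_context_summary"
    if ix.safety_concern:
        return "escalate_and_provide_resources"
    if ix.validation_failed:
        return "withhold_inform_student_escalate"
    if ix.misconception_repeats > 3:
        return "flag_for_instructor_suggest_office_hours"
    if ix.frustration_score > 0.7:  # illustrative threshold
        return "empathetic_response_suggest_break_escalate"
    if not ix.in_domain:
        return "polite_redirect_with_resources"
    if ix.confidence < 0.5:         # illustrative threshold
        return "add_uncertainty_language_suggest_verification"
    return "deliver_normally"

print(escalation_action(Interaction(frustration_score=0.9)))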

Appendix A.6. Instructor Dashboard Specifications

The instructor dashboard provides real-time monitoring via an alerts panel that shows escalated interactions sorted by urgency, students with frustration indicators, sessions with repeated misconceptions, and response validation failures. Cohort analytics display frequency distributions of misconceptions, time-on-task by module, common confusion points, and patterns of simulation parameter exploration. Individual student views show session history with searchable transcripts, learning trajectory visualization, and misconception resolution tracking. Intervention tools enable direct message injection, response override capabilities, custom hint creation, and assignment modifications.

Note

1. Throughout this paper, “failure” in the reliability context refers to the mathematical event where the load effect exceeds the resistance ($S > R$), which may or may not correspond to physical collapse depending on the limit state under consideration. We distinguish this from “failure” of the pedagogical framework to achieve learning objectives, and from “failure” of AI systems to provide appropriate responses.

References

  1. Abdelghani, R., Wang, Y.-H., Yuan, X., Wang, T., Lucas, P., Sauzéon, H., & Oudeyer, P.-Y. (2024). GPT-3-driven pedagogical agents to train children’s curious question-asking skills. International Journal of Artificial Intelligence in Education, 34(2), 483–518. [Google Scholar] [CrossRef]
  2. Albadarin, Y., Saqr, M., Pope, N., & Tukiainen, M. (2024). A systematic literature review of empirical research on ChatGPT in education. Discover Education, 3, 60. [Google Scholar] [CrossRef]
  3. Anderson, J. R., Corbett, A. T., Koedinger, K. R., & Pelletier, R. (1995). Cognitive tutors: Lessons learned. The Journal of the Learning Sciences, 4(2), 167–207. [Google Scholar] [CrossRef]
  4. Ang, H.-S., Alfredo, & Tang, W. H. (1975). Probability concepts in engineering planning and design. John Wiley and Sons. [Google Scholar]
  5. Azevedo, R., & Hadwin, A. F. (2005). Scaffolding self-regulated learning and metacognition—Implications for the design of computer-based scaffolds. Instructional Science, 33(5/6), 367–379. [Google Scholar] [CrossRef]
  6. Babai Shishavan, H. (2024, December 1–4). AI in higher education: Guidelines on assessment design from Australian universities. ASCILITE Conference Proceedings, Melbourne, Australia. [Google Scholar]
  7. Baidoo-Anu, D., & Ansah, L. O. (2023). Education in the era of generative artificial intelligence (AI): Understanding the potential benefits of ChatGPT in promoting teaching and learning. Journal of AI, 7(1), 52–62. [Google Scholar] [CrossRef]
  8. Bakker, A. (2018). Design research in education: A practical guide for early career researchers. Routledge. [Google Scholar]
  9. Barradell, S. (2013). The identification of threshold concepts: A review of theoretical complexities and methodological challenges. Higher Education, 65(2), 265–276. [Google Scholar] [CrossRef]
  10. Bastani, H., Bastani, O., Sungu, A., Ge, H., Kabakcı, Ö., & Mariman, R. (2025). Generative AI without guardrails can harm learning: Evidence from high school mathematics. Proceedings of the National Academy of Sciences, 122(26), e2422633122. [Google Scholar] [CrossRef]
  11. Batanero, C., & Álvarez-Arroyo, R. (2024). Teaching and learning of probability. ZDM–Mathematics Education, 56(1), 5–17. [Google Scholar] [CrossRef]
  12. Belland, B. R. (2014). Scaffolding: Definition, current debates, and future directions. In J. M. Spector, M. D. Merrill, J. Elen, & M. J. Bishop (Eds.), Handbook of research on educational communications and technology (pp. 505–518). Springer. [Google Scholar]
  13. Blikstein, P., & Worsley, M. (2016). Multimodal learning analytics and education data mining: Using computational technologies to measure complex learning tasks. Journal of Learning Analytics, 3(2), 220–238. [Google Scholar] [CrossRef]
  14. Cheng, H., Xiao, E., Gu, J., Yang, L., Duan, J., Zhang, J., Cao, J., Xu, K., & Xu, R. (2024a). Unveiling typographic deceptions: Insights of the typographic vulnerability in large vision-language models. In Proceedings of the European conference on computer vision (pp. 179–196). Springer. [Google Scholar]
  15. Cheng, H., Xiao, E., Yang, J., Cao, J., Zhang, Q., Yang, L., Zhang, J., Xu, K., Gu, J., & Xu, R. (2024b). Typography leads semantic diversifying: Amplifying adversarial transferability across multimodal large language models. arXiv. [Google Scholar] [CrossRef]
  16. Cheng, H., Xiao, E., Yang, J., Cao, J., Zhang, Q., Zhang, J., Xu, K., Gu, J., & Xu, R. (2025, June 11–15). Not just text: Uncovering vision modality typographic threats in image generation models. Computer Vision and Pattern Recognition Conference (pp. 2997–3007), Nashville, TN, USA. [Google Scholar]
  17. Chi, M. T. H., & Wylie, R. (2014). The ICAP framework: Linking cognitive engagement to active learning outcomes. Educational Psychologist, 49(4), 219–243. [Google Scholar] [CrossRef]
  18. Davidovitch, L., Parush, A., & Shtub, A. (2006). Simulation-based learning in engineering education: Performance and transfer in learning project management. Journal of Engineering Education, 95(4), 289–299. [Google Scholar] [CrossRef]
  19. De Jong, T. (2010). Cognitive load theory, educational research, and instructional design: Some food for thought. Instructional Science, 38(2), 105–134. [Google Scholar] [CrossRef]
  20. Der Kiureghian, A., & Ditlevsen, O. (2009). Aleatory or epistemic? Does it matter? Structural Safety, 31(2), 105–112. [Google Scholar] [CrossRef]
  21. Design-Based Research Collective. (2003). Design-based research: An emerging paradigm for educational inquiry. Educational Researcher, 32(1), 5–8. [Google Scholar] [CrossRef]
  22. Dos, I. (2025). A systematic review of research on ChatGPT in higher education. The European Educational Researcher, 8(2), 59–76. [Google Scholar] [CrossRef]
  23. Elsayed, H. (2024). The impact of hallucinated information in large language models on student learning outcomes: A critical examination of misinformation risks in AI-assisted education. Northern Reviews on Algorithmic Research, Theoretical Computation, and Complexity, 9(8), 11–23. [Google Scholar]
  24. Faber, M. H. (2005). On the treatment of uncertainties and probabilities in engineering decision analysis. Journal of Offshore Mechanics and Arctic Engineering, 127(3), 243–248. [Google Scholar] [CrossRef]
  25. Fernández-Sánchez, G., & Millán, M. Á. (2013). Structural analysis education: Learning by hands-on projects and calculating structures. Journal of Professional Issues in Engineering Education and Practice, 139(3), 244–247. [Google Scholar] [CrossRef]
  26. Garfield, J., & Ahlgren, A. (1988). Difficulties in learning basic concepts in probability and statistics: Implications for research. Journal for Research in Mathematics Education, 19(1), 44–63. [Google Scholar] [CrossRef]
  27. Graesser, A. C., Lu, S., Jackson, G. T., Mitchell, H. H., Ventura, M., Olney, A., & Louwerse, M. M. (2004). AutoTutor: A tutor with dialogue in natural language. Behavior Research Methods, Instruments, & Computers, 36(2), 180–192. [Google Scholar] [CrossRef] [PubMed]
  28. Gu, T., Liang, Y., Yan, Y., Jiang, W., Yue, H., Hu, G., & Zhang, J. (2026). Towards high-fidelity urban wind profiles for the built environment: A neural field to fuse multi-source observational data in Guangzhou, China. Building and Environment, 288, 114009. [Google Scholar] [CrossRef]
  29. Gundersen, P. B. (2021). Exploring the challenges and potentials of working design-based in educational research [Doctoral dissertation, Aalborg University]. [Google Scholar]
  30. Haldar, A., & Mahadevan, S. (2000). Probability, reliability and statistical methods in engineering design. John Wiley and Sons. [Google Scholar]
  31. Hanghøj, T., Händel, V. D., Duedahl, T. V., & Gundersen, P. B. (2022). Exploring the messiness of design principles in design-based research. Nordic Journal of Digital Literacy, 17(4), 222–233. [Google Scholar] [CrossRef]
  32. Hestenes, D., Wells, M., & Swackhamer, G. (1992). Force concept inventory. The Physics Teacher, 30(3), 141–158. [Google Scholar] [CrossRef]
  33. Jakeman, J., Eldred, M., & Xiu, D. (2010). Numerical approach for quantification of epistemic uncertainty. Journal of Computational Physics, 229(12), 4648–4663. [Google Scholar] [CrossRef]
  34. Kaplar, M., Lužanin, Z., & Verbić, S. (2021). Evidence of probability misconception in engineering students—Why even an inaccurate explanation is better than no explanation. International Journal of STEM Education, 8(1), 18. [Google Scholar] [CrossRef]
  35. Kapur, M. (2016). Examining productive failure, productive success, and restudying: Definitions and conceptual clarity. Educational Psychologist, 51(2), 289–299. [Google Scholar] [CrossRef]
  36. Kasneci, E., Seßler, K., Küchemann, S., Bannert, M., Dementieva, D., Fischer, F., Gasser, U., Groh, G., Günnemann, S., Hüllermeier, E., & Krusche, S. (2023). ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences, 103, 102274. [Google Scholar] [CrossRef]
  37. Kazemitabaar, M., Chow, J., Ma, C. K. T., Ericson, B. J., Weintrop, D., & Grossman, T. (2023, November 13–18). How novices use LLM-based code generators to solve CS1 coding tasks in a self-paced learning environment. 23rd Koli Calling International Conference on Computing Education Research (pp. 1–12), Koli, Finland. [Google Scholar]
  38. Kestin, G., Miller, K., Klales, A., Milbourne, T., & Ponti, G. (2025). AI tutoring outperforms in-class active learning: An RCT introducing a novel research-based design in an authentic educational setting. Scientific Reports, 15(1), 17458. [Google Scholar] [CrossRef]
  39. Koedinger, K. R., & Corbett, A. (2001). Cognitive tutors. In Smart machines in education (pp. 145–167). MIT Press. [Google Scholar]
  40. Koparan, T. (2019). Teaching game and simulation based probability. International Journal of Assessment Tools in Education, 6(2), 235–258. [Google Scholar] [CrossRef]
  41. Lane, D. M., & Peres, S. C. (2006, July 2–7). Interactive simulations in the teaching of statistics: Promise and pitfalls. Seventh International Conference on Teaching Statistics (Vol. 7, pp. 1–6), Salvador (Bahia), Brazil. [Google Scholar]
  42. Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-T., Rocktäschel, T., & Riedel, S. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in neural information processing systems (Vol. 33, pp. 9459–9474). Curran Associates, Inc. [Google Scholar]
  43. Liu, H. L., Carpenter, M., & Gómez, J.-C. (2024). We know that we don’t know: Children’s understanding of common ignorance in a coordination game. Journal of Experimental Child Psychology, 243, 105930. [Google Scholar] [CrossRef]
  44. Low, B. K., & Phoon, K.-K. (2015). Reliability-based design and its complementary role to Eurocode 7 design approach. Computers and Geotechnics, 65, 30–44. [Google Scholar] [CrossRef]
  45. Marelli, S., & Sudret, B. (2014). UQLab: A framework for uncertainty quantification in Matlab. In Vulnerability, uncertainty, and risk: Quantification, mitigation, and management (pp. 2554–2563). American Society of Civil Engineers. [Google Scholar]
  46. Mayer, R. E. (2005). Introduction to multimedia learning. The Cambridge Handbook of Multimedia Learning, 2(1), 24. [Google Scholar]
  47. Melchers, R. E., & Beck, A. T. (2018). Structural reliability analysis and prediction. John Wiley and Sons. [Google Scholar]
  48. Meyer, J. H., & Land, R. (2006). Threshold concepts and troublesome knowledge: An introduction. In Overcoming barriers to student understanding (pp. 3–18). Routledge. [Google Scholar]
  49. Mitrovic, A., Ohlsson, S., & Barrow, D. K. (2013). The effect of positive feedback in a constraint-based intelligent tutoring system. Computers & Education, 60(1), 264–272. [Google Scholar] [CrossRef]
  50. Moss, R. E. S. (2011). Teaching reliability at the undergraduate level. In GeoRisk 2011 (pp. 1165–1171). American Society of Civil Engineers. [Google Scholar]
  51. Naveed, H., Khan, A. U., Qiu, S., Saqib, M., Anwar, S., Usman, M., Akhtar, N., Barnes, N., & Mian, A. (2025). A comprehensive overview of large language models. ACM Transactions on Intelligent Systems and Technology, 16(5), 1–72. [Google Scholar] [CrossRef]
  52. Naznin, K., Al Mahmud, A., Nguyen, M. T., & Chua, C. (2025). ChatGPT integration in higher education for personalized learning, academic writing, and coding tasks: A systematic review. Computers, 14(2), 53. [Google Scholar] [CrossRef]
  53. Németh, R., Tátrai, A., Szabó, M., Zaletnyik, P. T., & Tamási, Á. (2025). Exploring the use of retrieval-augmented generation models in higher education: A pilot study on artificial intelligence-based tutoring. Social Sciences & Humanities Open, 12, 101751. [Google Scholar]
  54. Nye, B. D., Graesser, A. C., & Hu, X. (2014). AutoTutor and family: A review of 17 years of natural language tutoring. International Journal of Artificial Intelligence in Education, 24(4), 427–469. [Google Scholar] [CrossRef]
  55. Olivier, A., Giovanis, D. G., Aakash, B., Chauhan, M., Vandanapu, L., & Shields, M. D. (2020). UQpy: A general purpose Python package and development environment for uncertainty quantification. Journal of Computational Science, 47, 101204. [Google Scholar] [CrossRef]
  56. OpenAI. (2024). GPT-4 technical report. arXiv. [Google Scholar] [CrossRef]
  57. Panadero, E. (2017). A review of self-regulated learning: Six models and four directions for research. Frontiers in Psychology, 8, 422. [Google Scholar] [CrossRef] [PubMed]
  58. Peng, H., & Zhang, J. (2025). Efficient, scalable emulation of stochastic simulators: A mixture density network based surrogate modeling framework. Reliability Engineering & System Safety, 257, 110806. [Google Scholar] [CrossRef]
  59. Perlman, A., Sacks, R., & Barak, R. (2014). Hazard recognition and risk perception in construction. Safety Science, 64, 22–31. [Google Scholar] [CrossRef]
  60. Plass, J. L., & Kalyuga, S. (2019). Four ways of considering emotion in cognitive load theory. Educational Psychology Review, 31(2), 339–359. [Google Scholar] [CrossRef]
  61. Plomp, T., & Nieveen, N. (Eds.). (2013). Educational design research. SLO, Enschede. [Google Scholar]
  62. Pollock, E., Chandler, P., & Sweller, J. (2002). Assimilating complex information. Learning and Instruction, 12(1), 61–86. [Google Scholar] [CrossRef]
  63. Qian, Y. (2025). Pedagogical applications of generative AI in higher education: A systematic review of the field. TechTrends, 69, 1105–1120. [Google Scholar] [CrossRef]
  64. Reeves, K., Blank, B., Hernandez-Gantes, V., & Dickerson, M. (2010, May 30–June 4). Using constructivist teaching strategies in probability and statistics. 2010 Annual Conference & Exposition (pp. 15–1322), Kansas City, MO, USA. [Google Scholar]
  65. Renkl, A. (2014). Learning from worked examples: How to prepare students for meaningful problem solving. In V. A. Benassi, C. E. Overson, & C. M. Hakala (Eds.), Applying science of learning in education (pp. 118–130). Society for the Teaching of Psychology. [Google Scholar]
  66. Romero, M. L., & Museros, P. (2002). Structural analysis education through model experiments and computer simulation. Journal of Professional Issues in Engineering Education and Practice, 128(4), 170–175. [Google Scholar] [CrossRef]
  67. Sedlmeier, P., & Gigerenzer, G. (2001). Teaching Bayesian reasoning in less than two hours. Journal of Experimental Psychology: General, 130(3), 380–400. [Google Scholar] [CrossRef]
  68. Sheikh, W. (2024). An intuitive, application-based, simulation-driven approach to teaching probability and random processes. International Journal of Electrical Engineering & Education, 61(1), 17–57. [Google Scholar]
  69. Shute, V. J. (2008). Focus on formative feedback. Review of Educational Research, 78(1), 153–189. [Google Scholar] [CrossRef]
  70. Sriramanan, G., Bharti, S., Sadasivan, V. S., Saha, S., Kattakinda, P., & Feizi, S. (2024). LLM-check: Investigating detection of hallucinations in large language models. In Advances in neural information processing systems (Vol. 37, pp. 34188–34216). Curran Associates, Inc. [Google Scholar]
  71. Swacha, J., & Gracel, M. (2025). Retrieval-augmented generation (RAG) chatbots for education: A survey of applications. Applied Sciences, 15(8), 4234. [Google Scholar] [CrossRef]
  72. Sweller, J. (2010). Cognitive load theory: Recent theoretical advances. In Cognitive load theory (pp. 29–47). Cambridge University Press. [Google Scholar]
  73. Sweller, J., van Merriënboer, J. J. G., & Paas, F. (2019). Cognitive architecture and instructional design: 20 years later. Educational Psychology Review, 31(2), 261–292. [Google Scholar] [CrossRef]
  74. Tu, J., Choi, K. K., & Park, Y. H. (1999). A new study on reliability-based design optimization. Journal of Mechanical Design, 121(4), 557–564. [Google Scholar] [CrossRef]
  75. VanLehn, K. (2011). The relative effectiveness of human tutoring, intelligent tutoring systems, and other tutoring systems. Educational Psychologist, 46(4), 197–221. [Google Scholar] [CrossRef]
  76. Vrouwenvelder, T. (1997). The JCSS probabilistic model code. Structural Safety, 19(3), 245–251. [Google Scholar] [CrossRef]
  77. Vygotsky, L. S. (1978). Mind in society: The development of higher psychological processes (Vol. 86). Harvard University Press. [Google Scholar]
  78. Wang, F., & Hannafin, M. J. (2005). Design-based research and technology-enhanced learning environments. Educational Technology Research and Development, 53(4), 5–23. [Google Scholar] [CrossRef]
  79. Wang, R. E., Ribeiro, A. T., Robinson, C. D., Loeb, S., & Demszky, D. (2025). Tutor CoPilot: A human—AI approach for scaling real-time expertise. arXiv. [Google Scholar] [CrossRef]
  80. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., & Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. In Advances in neural information processing systems (Vol. 35, pp. 24824–24837). Curran Associates, Inc. [Google Scholar]
  81. Wieman, C., & Perkins, K. (2005). Transforming physics education. Physics Today, 58(11), 36–41. [Google Scholar] [CrossRef]
  82. Wood, D., Bruner, J. S., & Ross, G. (1976). The role of tutoring in problem solving. Journal of Child Psychology and Psychiatry, 17(2), 89–100. [Google Scholar] [CrossRef]
  83. Zhang, J., Kailkhura, B., & Han, T. Y.-J. (2020, July 12–18). Mix-n-match: Ensemble and compositional methods for uncertainty calibration in deep learning. International Conference on Machine Learning (pp. 11117–11128), Online. [Google Scholar]
  84. Zhang, J., Kailkhura, B., & Han, T. Y.-J. (2021). Leveraging uncertainty from deep learning for trustworthy material discovery workflows. ACS Omega, 6(19), 12711–12721. [Google Scholar] [CrossRef]
  85. Zhang, J., & Taflanidis, A. A. (2019). Bayesian model averaging for Kriging regression structure selection. Probabilistic Engineering Mechanics, 56, 58–70. [Google Scholar] [CrossRef]
  86. Zhang, J., & Taflanidis, A. A. (2020). Evolutionary multi-objective optimization under uncertainty through adaptive Kriging in augmented input space. Journal of Mechanical Design, 142(1), 011404. [Google Scholar] [CrossRef]
  87. Zhang, Z., Wang, C., Wang, Y., Shi, E., Ma, Y., Zhong, W., Chen, J., Mao, M., & Zheng, Z. (2025). LLM hallucinations in practical code generation: Phenomena, mechanism, and mitigation. Proceedings of the ACM on Software Engineering, 2(ISSTA), 481–503. [Google Scholar] [CrossRef]
  88. Zimmerman, B. J. (2002). Becoming a self-regulated learner: An overview. Theory into Practice, 41(2), 64–70. [Google Scholar] [CrossRef]
  89. Zokaie, T. (2000). AASHTO-LRFD live load distribution specifications. Journal of Bridge Engineering, 5(2), 131–138. [Google Scholar] [CrossRef]
Figure 1. Mapping of AI chatbot pedagogical functions to underlying learning theories. Solid arrows indicate primary theoretical alignment; dashed arrows represent secondary connections where functions draw upon multiple frameworks.
Figure 2. System architecture of the AI chatbot tutor showing the four processing layers and their connections to external components.
Figure 3. Module 3 simulation interface showing overlapping load and resistance distributions with computed reliability metrics.
Table 1. Stages of conceptual progression in the pedagogical framework.

Stage | Focus | Description
1 | Connection | Linking probabilistic concepts to student experiences with measurement and manufacturing variability
2 | Characterization | Introducing random variables and probability distributions as mathematical tools
3 | Formulation | Presenting the reliability problem as an extension of structural mechanics
4 | Computation | Introducing Monte Carlo for reliability estimation
5 | Application | Connecting classroom concepts to professional practice and design codes
Table 2. Structural reliability misconception inventory with detection indicators and response strategies.

ID | Category | Misconception | Response Strategy
M1.1 | Determinism | Safety factors guarantee safety (“If SF > 1, it won’t fail”) | Simulation showing failures despite SF > 1; introduce probability language
M1.2 | Determinism | Nominal values equal true values (“The yield strength is 250 MPa”) | Show material test data variability; discuss what specifications represent
M1.3 | Probability | Low probability means impossible (“10⁻⁶ basically means never”) | Connect to portfolio of structures; expected failures in building stock
M2.1 | Statistics | Sample statistics equal population parameters | Sampling simulation; repeated samples give different means
M2.2 | Statistics | More data eliminates uncertainty | Show parameter uncertainty decreasing but inherent variability remaining
M2.3 | Distributions | All distributions are normal | Show clearly non-normal engineering data; introduce lognormal
M2.4 | Distributions | PDF height equals probability | Interactive PDF exploration; probability as area, not height
M3.1 | Limit state | Limit state is a physical boundary | Multiple failure modes; same beam, different limit states
M3.2 | Reliability | β is a probability | Explicit conversion; interpretation of β as “number of standard deviations”
M4.1 | Simulation | More simulations always better without limit | Convergence demonstration; diminishing returns visualization
M4.2 | Simulation | Simulation results are exact | Repeated simulations give different estimates; confidence intervals
M5.1 | Codes | Code values are scientifically optimal | Calibration history; different codes give different values
M5.2 | Risk | Lower P_f always better | Economic optimization; diminishing returns on safety investment
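To make the link between detection and response concrete, the sketch below shows one way an inventory such as Table 2 could be encoded in software. This is a minimal Python illustration only: the `Misconception` class, its field names, and the `respond` helper are hypothetical, not the framework’s actual implementation, and the detection step (mapping chat and simulation signals to an inventory ID) is assumed to happen upstream.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Misconception:
    """One entry of the misconception inventory in Table 2 (names hypothetical)."""
    mid: str        # inventory ID, e.g., "M1.1"
    category: str   # conceptual category, e.g., "Determinism"
    statement: str  # the misconception as a student might voice it
    response: str   # the targeted intervention strategy

# Illustrative subset of Table 2; the full system would map chat and
# simulation signals to these IDs before selecting a response.
INVENTORY = {
    m.mid: m
    for m in (
        Misconception("M1.1", "Determinism",
                      "Safety factors guarantee safety",
                      "Simulation showing failures despite SF > 1"),
        Misconception("M4.2", "Simulation",
                      "Simulation results are exact",
                      "Repeated simulations give different estimates"),
    )
}

def respond(detected_id: str) -> str:
    """Return the intervention for a detected misconception ID, or escalate."""
    entry = INVENTORY.get(detected_id)
    return entry.response if entry else "Escalate to human instructor"

print(respond("M4.2"))  # -> Repeated simulations give different estimates
```

A keyed lookup of this kind also makes the escalation rule in Table 3 straightforward to implement: an unknown or repeatedly re-detected ID falls through to the human instructor rather than to another automated response.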
Table 3. Alignment of theoretical foundations, pedagogical strategies, and technical implementation.

Learning Theory | Design Principle | Implementation Feature | Operational/Technical Constraint
Cognitive Load Theory | Scaffold progression from deterministic foundations | Five-module sequence with prerequisite structure | Sequential access enforced (modules unlocked only after completion)
Multimedia Learning | Externalize abstractions through visualization | Interactive and real-time distribution displays | Visual–verbal co-location and synchronization
Zone of Proximal Development | Calibrate challenge to current understanding | Adaptive hint and difficulty levels | Robust learner-state inference from interaction signals (errors, time-on-task, attempt patterns); conservative defaults when confidence is low
Scaffolding | Provide fading support | Socratic prompts before explanations | Explicit prompting policy and consistent selection rules across sessions
Self-Regulated Learning | Integrate metacognitive prompting | Prediction tasks; reflection prompts; self-explanation requests | Capture/score explanations with quality checks; flag low-quality inputs
Threshold Concepts | Support liminal state navigation | Misconception detection; cognitive conflict activities; integration prompts | Misconception detection reliability sufficient for targeted remediation; escalation to instructor after repeated flags or persistent confusion
Table 4. Variable framework for empirical validation studies.

Variable Type | Variable | Operationalization
Independent | Instructional condition | Framework intervention vs. traditional implementations (simulation-only, tutoring-only)
Independent | Implementation fidelity | Degree to which implementation follows design specifications
Independent | Dosage | Time spent engaging with framework components
Dependent | Conceptual understanding | Scores on reliability concept inventory (pre/post)
Dependent | Procedural competence | Accuracy on reliability analysis problems
Dependent | Engagement | Time-on-task; simulation exploration breadth; chatbot interaction depth
Moderating | Prior statistics knowledge | Diagnostic assessment score (prerequisite test)
Moderating | Prior programming experience | Self-report verified by diagnostic coding task
Mediating | Simulation exploration patterns | Systematic vs. random parameter manipulation
Mediating | Misconception trajectories | Time and interactions required to resolve detected misconceptions
Table 5. Overview of teaching modules in the structural reliability curriculum.

Module | Title | Learning Objectives | Primary Activities
1 | Recognizing Uncertainty | Identify uncertainty sources; distinguish aleatory and epistemic uncertainty | Virtual material testing; variability observation
2 | Probability Distributions | Interpret PDFs and CDFs; calculate probabilities; select appropriate distributions | Interactive distribution explorer; parameter manipulation
3 | Structural Reliability Problem | Formulate limit state functions; calculate reliability index; interpret failure probability | Load–resistance distribution visualization; sensitivity exploration
4 | Monte Carlo Simulation | Implement Monte Carlo algorithms; assess convergence; interpret confidence intervals | Step-by-step simulation; convergence observation
5 | Sensitivity Analysis and Design | Conduct sensitivity analysis; connect to LRFD; optimize designs for target reliability | Design optimization; factor calibration
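Module 4’s core computation can be illustrated with a short, self-contained sketch. The code below is a minimal Python example, assuming a lognormal resistance against a normal load; the `estimate_pf` helper and all parameter values are illustrative rather than taken from the curriculum. Its purpose is to show the convergence behavior targeted by misconceptions M4.1 and M4.2: the estimate stabilizes, but only its confidence interval improves with more samples, and only at a 1/√n rate.

```python
import numpy as np

rng = np.random.default_rng(42)

def estimate_pf(n: int) -> tuple[float, float]:
    """Crude Monte Carlo estimate of Pf = P(R - S < 0) and a 95% CI half-width.

    Illustrative distributions only: lognormal resistance R (median 300,
    ~10% coefficient of variation) vs. normal load S (mean 200, std 40),
    in consistent units.
    """
    R = rng.lognormal(mean=np.log(300.0), sigma=0.10, size=n)
    S = rng.normal(loc=200.0, scale=40.0, size=n)
    pf = np.mean(R - S < 0.0)
    half_width = 1.96 * np.sqrt(pf * (1.0 - pf) / n)  # normal-approximation CI
    return float(pf), float(half_width)

# Convergence study: the estimate stabilizes while its confidence interval
# shrinks like 1/sqrt(n) -- diminishing returns (M4.1), and the estimate is
# never exact (M4.2).
for n in (10_000, 100_000, 1_000_000):
    pf, hw = estimate_pf(n)
    print(f"n = {n:>9,}  Pf ≈ {pf:.3e} ± {hw:.1e}")
```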
Table 6. Representative AI chatbot interactions in Module 3.

Trigger | Chatbot Response | Function
Student asks: “If mean resistance is double the mean load, is failure impossible?” | “Let’s explore this. Even with μ_R = 2μ_S, what happens in the distribution tails? Try setting high variability for both and observe the overlap region.” | Socratic
Student increases σ_R; β decreases | “You’ve discovered something important: increasing resistance variability decreases reliability. Why might quality control matter for structural safety?” | Diagnostic
Student asks: “What β value is good enough?” | “This involves engineering judgment. Bridges typically target β ≈ 3.5. What factors should influence acceptable risk levels?” | Explanatory
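The first exchange in Table 6 has a closed-form counterpart worth stating: for the linear limit state g = R − S with independent normal R and S, β = (μ_R − μ_S)/√(σ_R² + σ_S²) and P_f = Φ(−β). The sketch below, using only Python’s standard library and hypothetical parameter values, confirms the two observations the chatbot steers students toward: μ_R = 2μ_S still leaves a nonzero failure probability, and widening σ_R lowers β.

```python
from math import sqrt
from statistics import NormalDist

def beta_and_pf(mu_R, sigma_R, mu_S, sigma_S):
    """Reliability index and failure probability for the limit state
    g = R - S with independent normal R and S (standard closed form)."""
    beta = (mu_R - mu_S) / sqrt(sigma_R**2 + sigma_S**2)
    return beta, NormalDist().cdf(-beta)

# Even with mu_R = 2 * mu_S, the distribution tails overlap, so failure is
# improbable rather than impossible; widening sigma_R lowers beta, matching
# the "Diagnostic" exchange in Table 6 (parameter values are hypothetical).
for sigma_R in (10.0, 30.0, 60.0):
    b, pf = beta_and_pf(mu_R=200.0, sigma_R=sigma_R, mu_S=100.0, sigma_S=30.0)
    print(f"sigma_R = {sigma_R:5.1f}  beta = {b:4.2f}  Pf = {pf:.2e}")
```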